CN101968798A - Community recommendation method based on on-line soft constraint LDA algorithm

Info

Publication number: CN101968798A
Application number: CN 201010284218
Authority: CN
Inventors: 俞能海; 康雨洁
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2010-09-10
Filing date: 2010-09-10
Publication date: 2011-02-09

Abstract

The invention relates to a community recommendation method based on an online soft-constraint latent Dirichlet allocation (LDA) algorithm and belongs to the field of social networks. It aims to solve the over-fitting and heavy computation, caused by the limits of observed data, that are common in current community recommendation methods. The invention uses the number of posts a user makes in each community as a soft constraint, computes the latent topic distribution of the communities with the LDA algorithm, handles newly added users with an incremental method, and updates the model in real time, thereby achieving online operation. The method automatically computes the topic distribution of the communities without requiring user features or community features, and finally estimates the communities in which the user is most interested; the online method greatly improves computational efficiency. Compared with currently popular methods, the invention clearly improves both accuracy and speed.

Description

Community recommendation method based on an online soft-constraint LDA algorithm

Technical field

The present invention relates to personalized recommendation methods, and in particular to a community recommendation method based on an online soft-constraint LDA algorithm.

Background technology

In recent years, social networking sites have emerged in large numbers. These sites give users tools for creating communities, so that users with common interests can gather and share information with one another. As such social network services grow rapidly, users are overwhelmed by the huge number of communities on all kinds of subjects: how can a user efficiently choose the communities of interest? Community recommendation has therefore gradually become an important technique.

At present there are two common kinds of personalized recommendation method: content-based recommendation and collaborative recommendation. A content-based method first trains a preference model for each user from that user's comments or votes on objects, and then uses the model to recommend the new objects the user is most likely to be interested in. Collaborative recommendation rests on the assumption that similar users have the same preferences. To recommend objects to a user, a collaborative method only needs to consult the preferences of users similar to that user. It therefore does not need the content of the objects and can make recommendations even when no description of the objects is available.

Among existing community recommendation algorithms, two well-known methods based on collaborative recommendation are the ARM method and the binary LDA method. The ARM method computes the relationship between communities from the number of users that different communities have in common. The binary LDA method computes the latent topics shared among communities from the community-user co-occurrence matrix. Both methods are prone to over-fitting, caused by the limits of observed data, and to a very large amount of computation. Both also ignore the strength of the relationship between a user and a community, and neither can handle newly added users.

Summary of the invention

The object of the invention is to solve the over-fitting, caused by the limits of observed data, and the very large amount of computation to which existing community recommendation methods are prone.

To achieve this object, the invention provides a community recommendation method based on an online soft-constraint LDA algorithm, comprising three main steps: calculating the topic distribution, calculating the best candidate communities, and online updating.

The topic distribution calculation step is:

Step a: for each individual user, crawl the user's posting information in each community and count the number of posts in each, as the measure of the relationship between user and community. The number of posts by the i-th user U_i in the j-th community C_{i,j} in which that user participates is taken as the strength of the relationship between user U_i and community C_{i,j}, denoted R_{i,j};

Step b: treat each user as a document and each community in which the user participates as a word in that document, so that R_{i,j} is the number of occurrences of the community word C_{i,j} in the user document U_i. Build the user-topic and topic-community distribution model with the LDA algorithm, and solve for the model parameters by Gibbs sampling. The solving process is as follows:

First, randomly assign a topic set to every community word that occurs in the user documents; for example, the community word C_{i,j} is assigned the topic set {t_{i,j,1}, ..., t_{i,j,R_{i,j}}}.

Then update all topics with the iterative formula until the model parameters converge:

P(t_{i,j,k} = t \mid T_{-(i,j,k)}, c) \propto \frac{n_{-(i,j,k),t}^{(c_{i,j})} + β}{n_{-(i,j,k),t}^{(\cdot)} + N_{C} β} \cdot \frac{n_{-(i,j,k),t}^{(u_{i})} + α}{n_{-(i,j,k),\cdot}^{(u_{i})} + N_{T} α}

where T_{-(i,j,k)} denotes all remaining topic assignments after the current topic t_{i,j,k} is removed, n_{-(i,j,k),t}^{(c_{i,j})} denotes the total number of times community C_{i,j} is assigned to topic t, and α and β are parameters set to empirical values;
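The iterative formula can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the patent: the function name `topic_conditional`, the layout of the count lists, and the reading of N_C as the number of communities and N_T as the number of topics are assumptions. It evaluates the right-hand side of the formula for every topic and normalizes the result, assuming the counts passed in already exclude the assignment t_{i,j,k} being updated.

```python
def topic_conditional(n_ct, n_t, n_ut, n_u, alpha, beta, num_communities):
    """Normalized P(t_{i,j,k} = t | ...) for every topic t.

    n_ct[t]: times the current community is assigned to topic t
    n_t[t]:  total assignments to topic t over all communities
    n_ut[t]: assignments to topic t inside the current user document
    n_u:     total assignments in the current user document
    All counts are assumed to exclude the assignment being updated.
    """
    num_topics = len(n_t)  # assumed meaning of N_T
    weights = []
    for t in range(num_topics):
        community_term = (n_ct[t] + beta) / (n_t[t] + num_communities * beta)
        user_term = (n_ut[t] + alpha) / (n_u + num_topics * alpha)
        weights.append(community_term * user_term)
    total = sum(weights)
    return [w / total for w in weights]


# Toy counts with 2 topics and 3 communities (hypothetical numbers).
probs = topic_conditional(
    n_ct=[4, 0], n_t=[10, 6], n_ut=[3, 1], n_u=4,
    alpha=0.5, beta=0.1, num_communities=3,
)
print(probs)
```

In this toy case the first topic dominates because both the community and the user already favor it; the smoothing parameters α and β keep the second topic from receiving zero probability.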

The best-candidate community calculation step is:

Step c: from the solved model, calculate the topic-community distribution φ and the user-topic distribution θ using the following formulas:

{\hat{φ}}_{t}^{(c)} = \frac{n_{t}^{(c)} + β}{n_{t}^{(\cdot)} + N_{C} β},

{\hat{θ}}_{t}^{(u)} = \frac{n_{t}^{(u)} +α}{n_{\cdot}^{(u)} + N_{T} α};

Step d: score each community to find the communities in which the user is most interested. The score used to rank the communities is:

ξ = \sum_{t=1}^{N_{T}} {\hat{φ}}_{t}^{(c)} {\hat{θ}}_{t}^{(u)};
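The estimates and the ranking score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the nested-list count tables `n_tc` (topic by community) and `n_ut` (user by topic), and the toy numbers are all hypothetical, and N_C and N_T are read as the number of communities and the number of topics.

```python
def estimate_phi(n_tc, beta):
    """phi[t][c] = (n_t^(c) + beta) / (n_t^(.) + N_C * beta)."""
    num_communities = len(n_tc[0])
    phi = []
    for row in n_tc:
        denom = sum(row) + num_communities * beta
        phi.append([(count + beta) / denom for count in row])
    return phi


def estimate_theta(n_ut, alpha):
    """theta[u][t] = (n_t^(u) + alpha) / (n_.^(u) + N_T * alpha)."""
    num_topics = len(n_ut[0])
    theta = []
    for row in n_ut:
        denom = sum(row) + num_topics * alpha
        theta.append([(count + alpha) / denom for count in row])
    return theta


def score_communities(phi, theta_u):
    """xi[c] = sum over topics t of phi[t][c] * theta_u[t]."""
    num_communities = len(phi[0])
    return [
        sum(phi[t][c] * theta_u[t] for t in range(len(theta_u)))
        for c in range(num_communities)
    ]


# Hypothetical counts: 2 topics x 3 communities, and 1 user x 2 topics.
n_tc = [[8, 2, 0], [0, 1, 9]]
n_ut = [[5, 1]]
phi = estimate_phi(n_tc, beta=0.1)
theta = estimate_theta(n_ut, alpha=0.5)
xi = score_communities(phi, theta[0])
ranking = sorted(range(len(xi)), key=lambda c: xi[c], reverse=True)
print(ranking)
```

Because each row of φ and each row of θ sums to 1, the scores of one user over all communities also sum to 1, so they can be compared directly across communities.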

The online updating step is:

Step e: keeping the already trained model unchanged, train the model of each newly added user separately, as follows:

With the trained model held fixed, randomly assign topic sets only to the community words that occur in the new user's document, then iteratively update the topic assignments in the new user document with the iterative formula. The iterative formula uses the already trained model and needs only a small number of iterations, which greatly improves efficiency;

Step f: merge the two parts of the model into a whole, use it as the new model, and score each community again.
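A minimal Python sketch of steps e and f follows. It is an assumption-laden illustration rather than the patent's code: `fold_in_user`, `merge_counts`, the count table `n_tc`, and the default of five sweeps are hypothetical, and the sketch draws each topic by sampling in proportion to the iterative formula. The trained topic-community counts are only read, never modified, while the new user's topics are updated.

```python
import random


def fold_in_user(doc, n_tc, alpha, beta, sweeps=5, seed=0):
    """Train topic assignments for one new user against a frozen model.

    doc:  list of community ids, each repeated once per post
    n_tc: trained topic-by-community counts (read only)
    Returns the assignments and the new user's per-topic counts.
    """
    rng = random.Random(seed)
    num_topics = len(n_tc)
    num_communities = len(n_tc[0])
    n_t = [sum(row) for row in n_tc]           # frozen per-topic totals
    assignments = [rng.randrange(num_topics) for _ in doc]
    n_ut_new = [0] * num_topics
    for t in assignments:
        n_ut_new[t] += 1

    for _ in range(sweeps):                    # only a few iterations
        for pos, c in enumerate(doc):
            n_ut_new[assignments[pos]] -= 1    # remove current assignment
            # The user-term denominator is the same for every topic,
            # so it is dropped before sampling.
            weights = [
                (n_tc[t][c] + beta) / (n_t[t] + num_communities * beta)
                * (n_ut_new[t] + alpha)
                for t in range(num_topics)
            ]
            new_t = rng.choices(range(num_topics), weights=weights)[0]
            assignments[pos] = new_t
            n_ut_new[new_t] += 1
    return assignments, n_ut_new


def merge_counts(n_tc, doc, assignments):
    """Step f: add the new user's assignments to a copy of the trained counts."""
    merged = [row[:] for row in n_tc]
    for c, t in zip(doc, assignments):
        merged[t][c] += 1
    return merged


trained = [[8, 2, 0], [0, 1, 9]]   # hypothetical trained counts
new_doc = [2, 2, 2, 1]             # 3 posts in community 2, 1 in community 1
assignments, n_ut_new = fold_in_user(new_doc, trained, alpha=0.5, beta=0.1)
merged = merge_counts(trained, new_doc, assignments)
```

The cost of one update depends only on the length of the new user's document and the number of topics, not on the number of existing users, which is where the efficiency gain of the incremental approach comes from.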

The beneficial effect of the invention is as follows. By using the number of posts a user makes in each community as a soft constraint and applying the LDA algorithm, the method can automatically compute the latent topic distribution of the communities without requiring user features or community features, and finally infer the communities in which the user is most interested. An incremental method handles newly added users and updates the model in real time, achieving online operation and greatly improving computational efficiency.

To test the validity of the method, we used MySpace as the data set and collected information on 409,093 users and the 10,814 communities in which they participate. For each user, one of the communities in which the user participates was chosen at random to form the test set, and all remaining data served as the training set. Both the binary LDA method and the method of the invention were then used to recommend communities to the users, and the rank of each test-set community in the recommendation results was examined: the higher the rank, the better the result. The experimental results, shown in Fig. 1, indicate that the invention significantly improves the accuracy and speed of community recommendation.
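The evaluation protocol described above can be sketched as follows. This is a hedged illustration, not the authors' experiment code: the helper names `split_leave_one_out` and `rank_of`, the dictionary-based data layout, and the toy users and scores are hypothetical.

```python
import random


def split_leave_one_out(user_communities, seed=0):
    """Hold out one randomly chosen community per user as the test set."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, communities in user_communities.items():
        ordered = sorted(communities)
        held_out = rng.choice(ordered)
        test[user] = held_out
        train[user] = [c for c in ordered if c != held_out]
    return train, test


def rank_of(held_out, scores):
    """1-based rank of the held-out community by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(held_out) + 1


# Hypothetical toy data.
user_communities = {"u1": {"music", "film", "games"}, "u2": {"music", "art"}}
train, test = split_leave_one_out(user_communities)
scores_u1 = {"music": 0.5, "film": 0.3, "games": 0.2}
print(rank_of("film", scores_u1))
```

A lower rank number means the held-out community appears nearer the top of the recommendation list, which corresponds to a better result in the sense used above.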

Description of drawings

Fig. 1 compares the results of the S-LDA method of the invention with those of the binary B-LDA method.

Fig. 2 is a schematic diagram of a system that uses the invention to recommend communities to users.

Fig. 3 is a flowchart of the topic distribution calculation.

Embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative work fall within the scope of protection of the invention.

Fig. 2 shows a schematic diagram of a system that uses the invention to recommend communities to users. The system comprises a front-end crawler and a back-end computation unit. At the front end, the crawler obtains from the network all the information the system needs to process. In the back-end computation unit, the method of the invention is used to analyze and compute on the data obtained by the front end.

Before the calculation process of the invention is described in detail, a few notes on this example are in order. All data in this example come from the well-known social networking website MySpace. The design of the crawler that collects these data and the way the data are managed are outside the scope of the invention.

Given the user-community relation matrix so obtained, the goal of the method is to compute the latent topic distribution of the communities, use this distribution to score and rank the communities, and finally recommend the highest-scoring communities to the user.
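As an illustration of the input this goal starts from, the sketch below builds the user-community relation counts from raw post records and expands them into a user document in which each community word repeats once per post. The record format, the names `build_relation_counts` and `to_user_document`, and the sample data are assumptions; the crawler output format is not specified in the source.

```python
from collections import Counter, defaultdict


def build_relation_counts(post_records):
    """Count posts per (user, community) pair: relation[user][community]."""
    relation = defaultdict(Counter)
    for user, community in post_records:
        relation[user][community] += 1
    return relation


def to_user_document(community_counts):
    """Expand counts into a document where each community repeats R times."""
    return list(community_counts.elements())


# Hypothetical crawl output: one (user, community) tuple per post.
posts = [("u1", "music"), ("u1", "music"), ("u1", "film"), ("u2", "art")]
R = build_relation_counts(posts)
doc_u1 = to_user_document(R["u1"])
print(R["u1"]["music"], sorted(doc_u1))
```

Keeping the post counts, rather than a 0/1 membership flag, is what lets the strength of each user-community relationship enter the model as a soft constraint.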

The invention requires the number of latent topics to be set in advance. The choice of the number of topics affects the accuracy of the final recommendations and should be adjusted within a certain range according to the actual application.

The calculation process of the invention is described in detail below.

As shown in Fig. 2, the invention comprises three main steps: calculating the topic distribution, calculating the best candidate communities, and online updating.

The topic distribution calculation is described in detail first.

As shown in Fig. 3, the flow of the topic distribution calculation is:

Step 101: initialization. Randomly assign a topic set to every community word that occurs in the user documents; for example, the community word c_{i,j} is assigned the topic set {t_{i,j,1}, ..., t_{i,j,R_{i,j}}}. The size of the topic set is R_{i,j}, which represents the strength of the relationship between user u_i and community c_{i,j}. Initialize the count arrays N1, N2, N3 and N4, which store the four counts that appear in the iterative formula;

Step 102: set the initial value i = 1, indicating that the first user document is to be processed;

Step 103: set the initial value j = 1, indicating that the first community word of user u_i is to be processed;

Step 104: set the initial value k = 1, indicating that the first topic in the topic set of community c_{i,j} is to be processed;

Step 105: for the current topic t_{i,j,k}, decrease the values of N1[j], N2[i, k] and N3[k] by 1 each;

Step 106: using the iterative formula, select the topic t with the highest probability as the new topic, replacing the original topic;

Step 107: for the new topic t_{i,j,k}, increase the values of N1[j], N2[i, k] and N3[k] by 1 each;

Step 108: if k < R_{i,j}, set k = k + 1 and go to step 105; otherwise continue;

Step 109: if c_{i,j} is not the last community word of user u_i, set j = j + 1 and go to step 104; otherwise continue;

Step 110: if u_i is not the last user, set i = i + 1 and go to step 103; otherwise end.
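The flow of steps 101 to 110 can be sketched as nested loops in Python. This is a minimal sketch under stated assumptions, not the patent's implementation: the function `train_topics`, the table names `n_tc`, `n_ut`, `n_t` and `n_u` (used here instead of guessing how N1 to N4 are indexed), the fixed number of sweeps standing in for a convergence test, and the toy data are all hypothetical. Following step 106, the most probable topic is selected at each update.

```python
import random


def train_topics(docs, num_topics, num_communities, alpha, beta,
                 sweeps=20, seed=0):
    """Run the step 101-110 loop for a fixed number of sweeps.

    docs[i] is a list of (community_id, r_ij) pairs for user i, where
    r_ij is the user's post count in that community.
    """
    rng = random.Random(seed)

    # Step 101: a random topic set of size r_ij per community word,
    # plus the count tables.
    topics = [[[rng.randrange(num_topics) for _ in range(r)] for _, r in doc]
              for doc in docs]
    n_tc = [[0] * num_communities for _ in range(num_topics)]  # topic x community
    n_ut = [[0] * num_topics for _ in docs]                    # user x topic
    n_t = [0] * num_topics                                     # total per topic
    n_u = [sum(r for _, r in doc) for doc in docs]             # total per user
    for i, doc in enumerate(docs):
        for j, (c, _) in enumerate(doc):
            for t in topics[i][j]:
                n_tc[t][c] += 1
                n_ut[i][t] += 1
                n_t[t] += 1

    for _ in range(sweeps):
        for i, doc in enumerate(docs):              # steps 102 and 110
            for j, (c, r) in enumerate(doc):        # steps 103 and 109
                for k in range(r):                  # steps 104 and 108
                    old = topics[i][j][k]
                    # Step 105: take the current assignment out of the counts.
                    n_tc[old][c] -= 1
                    n_ut[i][old] -= 1
                    n_t[old] -= 1
                    # Step 106: evaluate the formula for every topic and
                    # keep the most probable one.
                    weights = [
                        (n_tc[t][c] + beta) / (n_t[t] + num_communities * beta)
                        * (n_ut[i][t] + alpha) / (n_u[i] - 1 + num_topics * alpha)
                        for t in range(num_topics)
                    ]
                    new = max(range(num_topics), key=weights.__getitem__)
                    # Step 107: put the new assignment back into the counts.
                    topics[i][j][k] = new
                    n_tc[new][c] += 1
                    n_ut[i][new] += 1
                    n_t[new] += 1
    return topics, n_tc, n_ut


# Hypothetical data: 3 users, 3 communities, 2 topics.
docs = [[(0, 3), (1, 1)], [(2, 4)], [(0, 2), (1, 2)]]
topics, n_tc, n_ut = train_topics(docs, num_topics=2, num_communities=3,
                                  alpha=0.5, beta=0.1)
```

The total of each count table is preserved across updates because every decrement in step 105 is matched by an increment in step 107; only the distribution over topics changes.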

The best-candidate community calculation is described in detail next.

Step 201: from the solved model, calculate the topic-community distribution φ and the user-topic distribution θ using the following formulas:

{\hat{φ}}_{t}^{(c)} = \frac{n_{t}^{(c)} + β}{n_{t}^{(\cdot)} + N_{C} β},

{\hat{θ}}_{t}^{(u)} = \frac{n_{t}^{(u)} +α}{n_{\cdot}^{(u)} + N_{T} α};

Step 202: score each community to find the communities in which the user is most interested. The score used to rank the communities is:

ξ = \sum_{t=1}^{N_{T}} {\hat{φ}}_{t}^{(c)} {\hat{θ}}_{t}^{(u)};
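One way to read this score, offered as an interpretation rather than a statement from the source: if \hat{φ}_{t}^{(c)} is taken as an estimate of P(c | t) and \hat{θ}_{t}^{(u)} as an estimate of P(t | u), and if a community is assumed to depend on the user only through the topic, then ξ estimates the probability that user u participates in community c, obtained by summing over topics. The notation ξ(u, c) is introduced here only to make the dependence on the user and the community explicit.

```latex
ξ(u, c) = \sum_{t=1}^{N_{T}} \hat{φ}_{t}^{(c)} \hat{θ}_{t}^{(u)}
        \approx \sum_{t=1}^{N_{T}} P(c \mid t) \, P(t \mid u)
        = P(c \mid u)
```

Under this reading, ranking communities by ξ amounts to ranking them by their estimated probability for the given user.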

The online updating process is described in detail next.

Step 301: for a newly added user document, keep the trained model unchanged and assign topic sets only to the community words that occur in the new user document;

Step 302: using the iterative formula, update only the newly assigned topic sets; the iterative formula can use the already trained model, and only a small number of iterations is needed;

Step 303: merge the two parts of the model into a whole, use it as the new model, and then go to step 201.

The above description of the invention is illustrative and not restrictive. Those skilled in the art will understand that many modifications, variations, or equivalents may be made within the spirit and scope defined by the claims, and all of these fall within the scope of protection of the invention.

Claims

1. A community recommendation method based on an online soft-constraint LDA algorithm, characterized in that it comprises three main steps: calculating the topic distribution, calculating the best candidate communities, and online updating:

The topic distribution calculation step is:

Step a: for each individual user, crawl the user's posting information in each community and count the number of posts in each, as the measure of the strength of the relationship between user and community;

Step b: build the user-topic and topic-community distribution model with the LDA algorithm, and solve for the model parameters by Gibbs sampling;

The best-candidate community calculation step is:

Step d: score each community to find the communities in which the user is most interested;

The online updating step is:

Step e: keeping the already trained model unchanged, train the model of each newly added user separately.

2. The topic distribution calculation step according to claim 1, characterized in that, in step a, the number of posts by a user in a community is chosen as the measure of the relationship between user and community: the number of posts by the i-th user U_i in the j-th community C_{i,j} in which that user participates is taken as the strength of the relationship between user U_i and community C_{i,j}, denoted R_{i,j}.

3. The topic distribution calculation step according to claim 1, characterized in that, in step b, each user is treated as a document and each community in which the user participates as a word in that document, so that R_{i,j} is the number of occurrences of the community word C_{i,j} in the user document U_i.

4. The topic distribution calculation step according to claim 1, characterized in that, in step b, the process of solving for the model parameters is specifically:

P(t_{i,j,k} = t \mid T_{-(i,j,k)}, c) \propto \frac{n_{-(i,j,k),t}^{(c_{i,j})} + β}{n_{-(i,j,k),t}^{(\cdot)} + N_{C} β} \cdot \frac{n_{-(i,j,k),t}^{(u_{i})} + α}{n_{-(i,j,k),\cdot}^{(u_{i})} + N_{T} α}

where n_{-(i,j,k),t}^{(c_{i,j})} denotes the total number of times community C_{i,j} is assigned to topic t, and α and β are parameters set to empirical values.

5. The best-candidate community calculation step according to claim 1, characterized in that, in step c, the topic-community distribution φ and the user-topic distribution θ are calculated with the following formulas:

{\hat{φ}}_{t}^{(c)} = \frac{n_{t}^{(c)} + β}{n_{t}^{(\cdot)} + N_{C} β},

{\hat{θ}}_{t}^{(u)} = \frac{n_{t}^{(u)} +α}{n_{\cdot}^{(u)} + N_{T} α} .

6. The best-candidate community calculation step according to claim 1, characterized in that, in step d, the score used to rank the communities is:

ξ = \sum_{t=1}^{N_{T}} {\hat{φ}}_{t}^{(c)} {\hat{θ}}_{t}^{(u)} .

7. The online updating step according to claim 1, characterized in that, in step e, the method of separately training the model of a newly added user is specifically:

With the trained model held fixed, randomly assign topic sets only to the community words that occur in the new user's document, then iteratively update the topic assignments in the new user document with the iterative formula. The iterative formula uses the already trained model and needs only a small number of iterations, which greatly improves efficiency.