CN113326433B - Personalized recommendation method based on ensemble learning

Personalized recommendation method based on ensemble learning

Info

Publication number
CN113326433B
CN113326433B (application number CN202110629501.6A)
Authority
CN
China
Prior art keywords
user
data
personalized recommendation
score
test
Prior art date
Legal status
Active
Application number
CN202110629501.6A
Other languages
Chinese (zh)
Other versions
CN113326433A
Inventor
段勇 (Duan Yong)
杨堃 (Yang Kun)
Current Assignee
Shenyang University of Technology
Original Assignee
Shenyang University of Technology
Priority date
Filing date
Publication date
Application filed by Shenyang University of Technology
Publication of CN113326433A
Application granted
Publication of CN113326433B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155: Bayesian classification
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N20/20: Ensemble learning


Abstract

The invention relates to the fields of machine learning and recommendation systems, in particular to a personalized recommendation method based on ensemble learning. The data preprocessing module is mainly responsible for reintegrating the data features, addressing the difficulty of extracting complex features by constructing new features and applying manifold-learning dimensionality reduction; the model establishment and optimization module is mainly responsible for establishing a personalized ensemble-learning prediction model on the fused data and applying Bayesian optimization to the established prediction model, improving the accuracy of personalized recommendation; the personalized recommendation module is mainly responsible for obtaining the prediction model's output, producing the personalized recommendation result through a Top-N recommendation method, and verifying that result. The method improves the accuracy of personalized recommendation through ensemble learning; in addition, it fuses data features through manifold-learning dimensionality reduction, thereby addressing the difficulty of extracting complex features.

Description

Personalized recommendation method based on ensemble learning
Technical Field
The invention relates to the fields of machine learning and recommendation systems, in particular to a personalized recommendation method based on the manifold-learning algorithm LPP (Locality Preserving Projections) and the ensemble-learning algorithm GBDT (Gradient Boosting Decision Tree).
Background
In recent years, as internet and computer technology have kept evolving, the internet has brought an enormous volume of information data and, at the same time, aggravated the phenomenon of information overload. Although this expands the range of information resources available to users, quickly and effectively screening useful information out of huge data and improving the utilization efficiency of information has become a major difficulty in the development of the contemporary internet. Many existing web applications (e.g., web portals and search engines) are essentially ways of helping users filter information. However, these methods only meet users' mainstream demands, do not consider personalization, and have not solved the information overload problem well. Personalized recommendation, as an important means of information filtering, is an effective method for solving the information overload problem.
With the development of machine learning, applying machine learning methods in the field of recommendation algorithms has become a major trend. Personalized recommendation has likewise drawn on many machine learning methods, such as support vector machines, decision trees, neural networks, deep learning, clustering, dimensionality reduction, regression prediction, and ensemble learning. Machine-learning-based personalized recommendation can effectively address problems such as monotonous similarity calculation, high similarity-computation complexity, difficulty in mining users' latent interests, difficulty in exploiting user tag information and demographic information, and difficulty in extracting item features. User tag information, demographic information, and item feature information perform poorly in solving the cold-start problem, yet they are indispensable for capturing users' latent interests.
Disclosure of Invention
Object of the Invention
The invention provides a personalized recommendation method based on the Locality Preserving Projections algorithm and ensemble learning, which aims to solve the information overload problem in recommendation systems and improve the efficiency and precision of personalized recommendation.
Technical proposal
A personalized recommendation method based on ensemble learning, the method comprising:
step 1: analyzing the dimension attributes of the personalized recommendation data and dividing the data into "user-item-score" data; performing data association on the associated "user-item-score" dimensions;
step 2: after the processing is finished, analyzing the data type of each "user-item-score" dimension attribute and converting it into the data type required by the ensemble learning;
step 3: generating feature attributes from the score attribute among the "user-item-score" dimension attributes;
step 4: applying min-max normalization to all the obtained data, calculated as follows:
v' = (v - min) / (max - min) (1)
wherein v represents the original value of the data, v' the normalized value, min the minimum value of the column in which v is located, and max the maximum value of that column;
step 5: letting the "user-item-score" dataset A in the original space have m sample points x_1, x_2, ..., x_m, where each sample point x_i is an l-dimensional vector and i is an integer from 1 to m, and the matrix formed by the m samples as columns is X; performing dimensionality reduction on dataset A with the manifold-learning LPP method, the reduced dataset B being composed of sample points y_1, y_2, ..., y_m, where each sample point y_i is an n-dimensional vector and the matrix of the m samples as columns is Y, with l > n;
step 6: dividing the reduced dataset B into a training set Train and a test set Test at a ratio of 8:2, wherein the data matrix corresponding to the training set Train is Y';
step 7: establishing a personalized recommendation model with the ensemble-learning GBDT method;
step 8: optimizing the GBDT model parameters with a Bayesian method;
step 9: retraining the GBDT personalized recommendation model with the optimal hyperparameter combination obtained through the Bayesian optimization;
step 10: performing Top-N recommendation and effect verification according to the prediction results of the final personalized recommendation model on the test set.
In step 3, the number of times each user has scored items is counted, with the formula:
countRating(b) = |R(b)|, b = 1, 2, ..., d (2)
wherein b represents the b-th user in the "user-item-score" dataset A, d is the total number of users in dataset A, R(b) is the set of scores given by user b to items, and countRating(b) is the total number of items rated by user b.
Step 5 specifically includes the following steps:
step 5.1: constructing the graph: for each sample x_i in the "user-item-score" dataset A, calculating its Euclidean distance to every other sample x_j, with the formula:
d(x_i, x_j) = ||x_i - x_j|| (3)
wherein ε is a manually set threshold, generally taken as the mean of the pairwise sample distances, and m is the total number of samples in the dataset; if the Euclidean distance is smaller than ε, the two samples are considered very close and an edge is established between node i and node j of the graph;
step 5.2: determining the weights: if node i is connected to node j, the weight of the edge between them is calculated with the heat kernel function:
ω_ij = exp(-||x_i - x_j||² / t) (4)
wherein ω_ij represents the weight between node i and node j, x_i and x_j are samples in the "user-item-score" dataset A, and t is a manually set real number greater than 0;
step 5.3: calculating the projection matrix, with the formula:
X L X^T a = λ X D X^T a (5)
Let the solutions of the formula be a_0, a_1, ..., a_{l-1}, ordered so that their corresponding eigenvalues λ run from small to large; the projective transformation matrix is C = (a_0, a_1, ..., a_{l-1}), and the reduced sample point is y_i = C^T x_i.
Wherein X is the matrix X mentioned in step 5, and the adjacency matrix W is constructed from the weights ω_ij of step 5.2; the main diagonal of the diagonal matrix D holds the weighted degree of each vertex of the graph constructed in step 5.1, where the weighted degree of node i is the sum of the weights of all edges incident to that node, i.e. the sum of the corresponding row of elements of the adjacency matrix W; the Laplacian matrix L is defined as L = D - W.
Step 7 comprises the following steps:
step 7.1: the GBDT model is defined as follows:
f_K(Y') = Σ_{k=1}^{K} h_k(Y') (6)
wherein Y' is the matrix Y' mentioned in step 6, k represents the round of the score-prediction learner, and K the total number of rounds; f_k(Y') represents the score-prediction learner of the k-th round, and h_k(Y') the k-th CART (Classification and Regression Trees) decision regression tree;
step 7.2: constructing a CART decision regression tree, namely h(Y') in step 7.1;
step 7.3: the score-prediction learner adopts a forward stagewise algorithm; the model of step k is formed from the model of step k-1, i.e. the k-th round of the score-prediction learner is closely related to the learners of the previous k-1 rounds, with the formula:
f_k(Y') = f_{k-1}(Y') + β_k (7)
wherein f_k(Y') is the k-th-round score-prediction learner, f_{k-1}(Y') the (k-1)-th-round score-prediction learner, and β_k represents the residual produced in the k-th round;
step 7.4: continuing iterating in this way until the iterations are finished, completing the model establishment.
Step 7.2 includes the following steps:
step 7.21: dividing the preprocessed dataset B into regions H_1, H_2, ..., H_o, with output values p_1, p_2, ..., p_o respectively;
Step 7.22: recursively dividing each region into two sub-regions and determining an output value on each sub-region; selecting an optimal segmentation variable q and a segmentation point s according to the following formula;
p 1 for the region H divided in step 7.21 1 Output of p 2 For the region H divided in step 7.21 2 Output of u v And w v Respectively representing the characteristic attribute and the score of the data in the corresponding region, wherein the value of the vmax is the number of samples of the divided region; traversing the variable q, scanning the fixed segmentation variable q for a segmentation point s, and selecting a pair (q, s) enabling the upper expression to reach a minimum value; dividing the region with the selected pair (q, s) and determining a corresponding output value;
step 7.23: continuing to call the steps 7.21 and 7.22 for the two sub-areas until a stop condition is met;
step 7.24: repartitioning the input space into O regions H'_1, H'_2, ..., H'_O and generating the score-prediction CART decision regression tree, with the formula:
h(u) = Σ_{o=1}^{O} p_o · I(u ∈ H'_o) (9)
wherein h(u) is the score-prediction CART decision regression tree, H'_o are the divided regions, o is the subscript of a divided region and O the total number of divided regions; p_o is the fixed output value of the region divided in step 7.21, and q' and s' are the optimal solutions obtained by iterating steps 7.21 and 7.22.
Step 8 includes the following steps:
step 8.1: initializing the dataset D' = {(x'_1, y'_1), ..., (x'_n, y'_n)}, where y'_i = f'(x'_i); the objective function f'(x') is the mapping from the dimension attributes of the data to the score;
step 8.2: training the GBDT model with the selected hyperparameter combination x'_i and calculating f'(x'_i);
step 8.3: calculating the next hyperparameter combination x'_{i+1} with an acquisition function;
step 8.4: repeating steps 8.2 and 8.3 for T' iterations;
step 8.5: outputting the hyperparameter combination that optimizes the objective function f'(x').
Step 10 includes the following steps:
step 10.1: setting the value N, i.e. the number of items to recommend to each user, and defining the number of users as count;
step 10.2: for each user, denoting the true recommendation list generated on the test set Test as T(All); performing score prediction on the test set Test with the Bayesian-optimized GBDT recommendation model, and defining the result as the Test evaluation set;
step 10.3: sorting the Test evaluation set by score and recommending the top N items to each user, denoting the Top-N recommendation list obtained for each user as T(Test);
step 10.4: verifying the precision and recall results of the Test evaluation set;
step 10.5: calculating the length of T(Test);
step 10.6: calculating the length of T(All);
step 10.7: calculating the intersection T(U) between each user's Top-N recommendation list T(Test) and the true list T(All);
step 10.8: calculating precision: Precision = |T(U)| / |T(Test)|; accumulating the precision of each user and dividing the sum by count to obtain the average precision;
step 10.9: calculating recall: Recall = |T(U)| / |T(All)|; accumulating the recall of each user and dividing the sum by count to obtain the average recall.
Advantages and effects
1. The invention uses related techniques from the field of machine learning to address the information overload problem of contemporary society: through manifold learning it reduces the dimensionality of the data's feature attributes, which shortens model training time, improves the model's learning capability, and greatly improves recommendation efficiency.
2. Personalized recommendation is performed through ensemble learning, and the recommendation model is optimized with Bayesian optimization, improving recommendation accuracy so that useful information can be screened out of huge data more quickly and effectively, improving the utilization efficiency of information.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a flow chart of data feature preprocessing;
FIG. 3 is a personalized recommendation flow chart.
Detailed Description
The following description of embodiments of the invention, presented in conjunction with the accompanying drawings, is intended to give those skilled in the art a better understanding of the invention.
A personalized recommendation method based on manifold-learning LPP and ensemble-learning GBDT can improve the accuracy of personalized recommendation through ensemble learning; in addition, it fuses data features through manifold-learning dimensionality reduction, thereby addressing the difficulty of extracting complex features.
FIG. 1 is the general flow chart of the present invention, comprising the following 10 steps: steps 1-6 form the recommendation-data preprocessing section of FIG. 1; step 7 forms the personalized-recommendation-model construction section of FIG. 1; steps 8 and 9 form the model optimization section of FIG. 1; and step 10 forms the personalized recommendation section of FIG. 1.
The data preprocessing module is mainly responsible for reintegrating the data features, addressing the difficulty of extracting complex features by constructing new features and applying manifold-learning dimensionality reduction; the model establishment and optimization module is mainly responsible for establishing a personalized ensemble-learning prediction model on the fused data and applying Bayesian optimization to the established prediction model, improving the accuracy of personalized recommendation; the personalized recommendation module is mainly responsible for obtaining the prediction model's output, producing the personalized recommendation result through a Top-N recommendation method, and verifying that result.
The specific detailed steps are as follows:
recommended data preprocessing section:
FIG. 2 is a flow chart of the characteristic data preprocessing of the present invention, and the specific implementation steps are as follows:
step 1: analyzing the dimension attributes of the personalized recommendation data and dividing the data into "user-item-score" data; performing data association on the associated "user-item-score" dimensions.
Step 2: after the processing is completed, analyzing the data type of each dimension attribute and converting it into the data type required by the ensemble learning.
Step 3: generating feature attributes from the score attribute among the "user-item-score" dimension attributes, with the formula:
countRating(b) = |R(b)|, b = 1, 2, ..., d
wherein b represents the b-th user in the "user-item-score" dataset A, d is the total number of users, and R(b) is the set of scores given by user b to items.
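As an illustrative sketch only (not part of the patent text), the countRating statistic of step 3 can be computed directly from raw (user, item, score) triples; the record values below are hypothetical:

```python
from collections import Counter

def count_ratings(user_item_scores):
    """Count how many items each user has rated (the countRating of step 3).

    user_item_scores: iterable of (user, item, score) triples.
    Returns a dict mapping user -> number of rated items.
    """
    return dict(Counter(user for user, _, _ in user_item_scores))

# Hypothetical toy "user-item-score" records for illustration.
records = [("u1", "i1", 4.0), ("u1", "i2", 3.5), ("u2", "i1", 5.0)]
print(count_ratings(records))  # {'u1': 2, 'u2': 1}
```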
Step 4: applying min-max normalization to all the obtained data, calculated as follows:
v' = (v - min) / (max - min)
where v denotes the original value of the data, v' the normalized value, min the minimum value of the column in which v is located, and max the maximum value of that column.
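The min-max normalization of step 4 can be sketched as follows (assuming NumPy is available; the helper name is ours, not the patent's):

```python
import numpy as np

def min_max_normalize(col):
    """Column-wise min-max scaling: v' = (v - min) / (max - min)."""
    col = np.asarray(col, dtype=float)
    lo, hi = col.min(), col.max()
    if hi == lo:                      # constant column: map to 0 to avoid 0/0
        return np.zeros_like(col)
    return (col - lo) / (hi - lo)

scores = [1.0, 3.0, 5.0]
print(min_max_normalize(scores))      # 0, 0.5, 1
```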
Step 5: let the "user-item-score" dataset A in the original space have m sample points x_1, x_2, ..., x_m, where each sample point x_i is an l-dimensional vector and i is an integer from 1 to m, and arrange the m samples as the columns of a matrix X. Apply the manifold-learning LPP method to reduce the dimensionality of dataset A; the reduced dataset B is composed of sample points y_1, y_2, ..., y_m, each sample point y_i an n-dimensional vector, arranged as the columns of a matrix Y, where l > n. The specific steps are as follows:
Step 5.1: constructing the graph: for each sample x_i in the "user-item-score" dataset A, calculate its Euclidean distance to every other sample x_j, with the formula:
d(x_i, x_j) = ||x_i - x_j||
wherein ε is a manually set threshold, generally taken as the mean of the pairwise sample distances, and m is the total number of samples in the dataset; if the distance is smaller than ε, the two samples are considered very close, and an edge is established between node i and node j of the graph.
Step 5.2: determining the weights: if node i is connected to node j, the weight of the edge between them is calculated with the heat kernel function:
ω_ij = exp(-||x_i - x_j||² / t)
wherein ω_ij represents the weight between node i and node j, x_i and x_j are samples in the "user-item-score" dataset A, and t is a manually set real number greater than 0.
Step 5.3: calculating the projection matrix, with the formula:
X L X^T a = λ X D X^T a
Let the solutions of the formula be a_0, a_1, ..., a_{l-1}, ordered so that their corresponding eigenvalues λ run from small to large; the projective transformation matrix is C = (a_0, a_1, ..., a_{l-1}), and the reduced sample point is y_i = C^T x_i.
The adjacency matrix W is composed of the weights ω_ij of step 5.2. The main diagonal of the diagonal matrix D holds the weighted degree of each vertex of the graph constructed in step 5.1, where the weighted degree of node i is the sum of the weights of all edges incident to that node, i.e. the sum of the corresponding row of elements of the adjacency matrix W. The Laplacian matrix L is defined as L = D - W.
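Steps 5.1-5.3 can be sketched end to end as below, assuming NumPy and SciPy are available. Note that the sketch stores one sample per row and transposes internally, whereas the text arranges samples as columns of X; the ε-neighbourhood, heat-kernel weights, and generalized eigenproblem follow the formulas above:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, n_components, t=1.0):
    """Minimal LPP sketch following steps 5.1-5.3.

    X: (m, l) array, one sample per row. Returns the (m, n_components)
    reduced samples, i.e. the columns of the text's matrix Y, as rows.
    """
    m = X.shape[0]
    # Step 5.1: epsilon-neighbourhood graph, epsilon = mean pairwise distance.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    eps = dist[np.triu_indices(m, k=1)].mean()
    adj = (dist < eps) & ~np.eye(m, dtype=bool)
    # Step 5.2: heat-kernel weights on the connected edges.
    W = np.where(adj, np.exp(-dist**2 / t), 0.0)
    # Step 5.3: generalized eigenproblem X L X^T a = lambda X D X^T a.
    D = np.diag(W.sum(axis=1))
    L = D - W
    Xc = X.T                                 # columns are samples, as in the text
    A, B = Xc @ L @ Xc.T, Xc @ D @ Xc.T
    B += 1e-9 * np.eye(B.shape[0])           # small ridge for numerical stability
    vals, vecs = eigh(A, B)                  # eigenvalues returned in ascending order
    C = vecs[:, :n_components]               # projection matrix C = (a_0, ..., a_{n-1})
    return X @ C                             # y_i = C^T x_i, one per row

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                 # 20 synthetic 5-dimensional samples
Y = lpp(X, n_components=2)
print(Y.shape)  # (20, 2)
```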
Step 6: the reduced dataset B is divided into a training set Train and a test set Test at a ratio of 8:2, where the data matrix corresponding to the training set Train is Y'.
Constructing a personalized recommendation model part:
step 7: a personalized recommendation model is established with the ensemble-learning GBDT method; the process is shown schematically in FIG. 3, and the specific steps are as follows:
step 7.1: the GBDT model is defined as follows:
f_K(Y') = Σ_{k=1}^{K} h_k(Y')
where Y' is the matrix Y' mentioned in step 6, k represents the round of the score-prediction learner, and K the total number of iterations for constructing the score-prediction learner. f_k(Y') represents the score-prediction learner of the k-th round, and h_k(Y') the k-th CART decision regression tree.
Step 7.2: constructing a CART decision regression tree, namely h(Y') in step 7.1, with the following specific steps:
Step 7.21: dividing the preprocessed dataset B into regions H_1, H_2, ..., H_o, with output values p_1, p_2, ..., p_o respectively.
Step 7.22: each region is recursively divided into two sub-regions and the output value on each sub-region is determined. The optimal splitting variable q and split point s are selected according to the following formula:
min_{q,s} [ min_{p_1} Σ_{u_v ∈ H_1(q,s)} (w_v - p_1)² + min_{p_2} Σ_{u_v ∈ H_2(q,s)} (w_v - p_2)² ]
where p_1 is the output of the region H_1 divided in step 7.21, p_2 the output of the region H_2, u_v and w_v respectively represent the feature attributes and the score of the data in the corresponding region, and v runs up to the number of samples in the divided region. Traverse the splitting variable q and, with q fixed, scan for the split point s, selecting the pair (q, s) that minimizes the expression above. The regions are divided by the selected pair (q, s) and the corresponding output values are determined.
Step 7.23: steps 7.21 and 7.22 continue to be invoked on the two sub-regions until a stopping condition is met.
Step 7.24: repartitioning the input space into O regions H'_1, H'_2, ..., H'_O and generating the score-prediction CART decision regression tree, with the formula:
h(u) = Σ_{o=1}^{O} p_o · I(u ∈ H'_o)
where h(u) is the score-prediction CART decision regression tree, H'_o are the divided regions, o is the subscript of a divided region and O the total number of divided regions. p_o is the fixed output value of the region divided in step 7.21, and q' and s' are the optimal solutions obtained by iterating steps 7.21 and 7.22.
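The split search of step 7.22 can be illustrated for a single feature as below (an assumption: a full CART also traverses the splitting variable q across all features, while this sketch fixes one feature and scans only the split point s):

```python
import numpy as np

def best_split(u, w):
    """Exhaustive search for the split point s minimizing the squared-error
    criterion of step 7.22 for one fixed feature u with score targets w.
    Returns (s, p1, p2): the split point and the two region output values."""
    order = np.argsort(u)
    u, w = u[order], w[order]
    best = (None, None, None, np.inf)
    for v in range(1, len(u)):
        s = (u[v - 1] + u[v]) / 2           # candidate split between neighbours
        left, right = w[:v], w[v:]
        p1, p2 = left.mean(), right.mean()  # optimal constant outputs per region
        loss = ((left - p1) ** 2).sum() + ((right - p2) ** 2).sum()
        if loss < best[3]:
            best = (s, p1, p2, loss)
    return best[:3]

# Two clearly separated groups: the best split falls between 3.0 and 10.0.
u = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
w = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
s, p1, p2 = best_split(u, w)
print(s, p1, p2)  # 6.5 1.0 5.0
```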
Step 7.3: the score-prediction learner adopts a forward stagewise algorithm. The model of step k is formed from the model of step k-1, i.e. the k-th round of the score-prediction learner is closely related to the learners of the previous k-1 rounds, with the formula:
f_k(Y') = f_{k-1}(Y') + β_k
where f_k(Y') is the k-th-round score-prediction learner, f_{k-1}(Y') the (k-1)-th-round score-prediction learner, and β_k represents the residual produced in the k-th round.
Step 7.4: iteration continues in this way until the iterations are finished, completing the model establishment.
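Steps 7.1-7.4 (the forward stagewise construction) can be sketched as follows, assuming scikit-learn's DecisionTreeRegressor as the CART base learner; the synthetic target below stands in for the score data and is not from the patent:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_rounds=200, lr=0.1, max_depth=3):
    """Forward-stagewise GBDT sketch (steps 7.1-7.4): each round fits a CART
    regression tree h_k to the residual left by the previous rounds."""
    base = float(y.mean())                    # f_0: constant initial learner
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        resid = y - pred                      # residual beta_k of round k-1
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, resid)
        trees.append(tree)
        pred += lr * tree.predict(X)          # f_k = f_{k-1} + lr * h_k
    return base, trees

def predict_gbdt(base, trees, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 3 * X[:, 0] + np.sin(5 * X[:, 1])         # synthetic stand-in "score" target
base, trees = fit_gbdt(X, y)
mse = float(np.mean((predict_gbdt(base, trees, X) - y) ** 2))
print(mse < 0.1 * np.var(y))                  # training error shrinks with the rounds
```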
Optimization model part:
step 8: the GBDT model parameters are optimized with a Bayesian method. The specific steps are as follows:
step 8.1: initializing the dataset D' = {(x'_1, y'_1), ..., (x'_n, y'_n)}, where y'_i = f'(x'_i); the objective function f'(x') is the mapping from the dimension attributes of the data to the score.
Step 8.2: the GBDT model is trained with the selected hyperparameter combination x'_i, and f'(x'_i) is calculated.
Step 8.3: the next hyperparameter combination x'_{i+1} is calculated with an acquisition function.
Step 8.4: steps 8.2 and 8.3 are repeated for T' iterations.
Step 8.5: the hyperparameter combination that optimizes the objective function f'(x') is output.
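Steps 8.1-8.5 can be sketched as a minimal Bayesian-optimization loop, assuming scikit-learn and SciPy; the Gaussian-process surrogate and expected-improvement acquisition are common choices but are our assumption, and the toy quadratic objective stands in for the GBDT validation error:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt(f, bounds, n_init=5, n_iter=15, seed=0):
    """Minimal Bayesian-optimization loop (steps 8.1-8.5): a GP surrogate plus
    an expected-improvement acquisition over a random candidate pool.
    f: objective to minimize; bounds: (low, high) per hyperparameter."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))   # step 8.1: init D'
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):                               # steps 8.2-8.4
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = rng.uniform(lo, hi, size=(256, len(bounds)))
        mu, sd = gp.predict(cand, return_std=True)
        sd = np.maximum(sd, 1e-9)
        z = (y.min() - mu) / sd
        ei = (y.min() - mu) * norm.cdf(z) + sd * norm.pdf(z)  # acquisition (step 8.3)
        x_next = cand[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], float(y.min())                # step 8.5: best combination

# Toy objective standing in for GBDT validation error (an assumption).
best_x, best_y = bayes_opt(lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2,
                           bounds=[(0.0, 1.0), (0.0, 1.0)])
print(best_y < 0.2)
```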
Step 9: the optimal hyperparameter combination obtained through the Bayesian optimization is selected to retrain the GBDT personalized recommendation model.
Personalized recommendation part:
step 10: top N recommendation and effect verification are carried out according to the prediction result of the finally obtained personalized recommendation model on the Test set Test, and the specific steps are as follows:
step 10.1: setting the value N, i.e. the number of items recommended to each user, and defining the number of users as count.
Step 10.2: for each user, the true recommendation list generated on the test set Test is denoted T(All); score prediction is performed on the test set Test with the Bayesian-optimized GBDT recommendation model, and the result is defined as the Test evaluation set.
Step 10.3: the Test evaluation set is sorted by score and the top N items are recommended to each user; the Top-N recommendation list obtained for each user is denoted T(Test).
Step 10.4: the precision and recall results of the Test evaluation set are verified.
Step 10.5: the length of T(Test) is calculated.
Step 10.6: the length of T(All) is calculated.
Step 10.7: the intersection T(U) between each user's Top-N recommendation list T(Test) and the true list T(All) is calculated.
Step 10.8: precision is calculated as Precision = |T(U)| / |T(Test)|; the precision of each user is accumulated and the sum divided by count to obtain the average precision.
Step 10.9: recall is calculated as Recall = |T(U)| / |T(All)|; the recall of each user is accumulated and the sum divided by count to obtain the average recall.
The technical features above constitute an embodiment of the invention; it has strong adaptability and implementation effect, and non-essential technical features can be added or removed according to actual needs to meet the requirements of different situations.

Claims (7)

1. A personalized recommendation method based on ensemble learning, the method comprising:
step 1: analyzing the dimension attributes of the personalized recommendation data and dividing the data into "user-item-score" data; performing data association on the associated "user-item-score" dimensions;
step 2: after the processing is finished, analyzing the data type of each "user-item-score" dimension attribute and converting it into the data type required by the ensemble learning;
step 3: generating feature attributes from the score attribute among the "user-item-score" dimension attributes;
step 4: applying min-max normalization to all the obtained data, calculated as follows:
v' = (v - min) / (max - min) (1)
wherein v represents the original value of the data, v' the normalized value, min the minimum value of the column in which v is located, and max the maximum value of that column;
step 5: letting the "user-item-score" dataset A in the original space have m sample points x_1, x_2, ..., x_m, where each sample point x_i is an l-dimensional vector and i is an integer from 1 to m, and the matrix formed by the m samples as columns is X; performing dimensionality reduction on dataset A with the manifold-learning Locality Preserving Projections algorithm, the reduced dataset B being composed of sample points y_1, y_2, ..., y_m, where each sample point y_i is an n-dimensional vector and the matrix of the m samples as columns is Y, with l > n;
step 6: dividing the reduced dataset B into a training set Train and a test set Test at a ratio of 8:2, wherein the data matrix corresponding to the training set Train is Y';
step 7: establishing a personalized recommendation model with the ensemble-learning gradient boosting decision tree method;
step 8: optimizing the gradient boosting decision tree model parameters with a Bayesian method;
step 9: retraining the gradient boosting decision tree personalized recommendation model with the optimal hyperparameter combination obtained through the Bayesian optimization;
step 10: performing Top-N recommendation and effect verification according to the prediction results of the final personalized recommendation model on the test set.
2. The personalized recommendation method based on ensemble learning according to claim 1, wherein: in step 3, the number of times each user has scored items is counted as follows:
CountRating(b) = |R(b)|, b = 1, 2, ..., d (2)
wherein b denotes the b-th user in the "user-item-score" data set A, d is the total number of users in data set A, R(b) is the set of scores given by user b to items, and CountRating(b) is the number of items that user b has rated.
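The per-user rating count CountRating can be sketched with a pandas group-by (toy data; the column names are invented for illustration):

```python
import pandas as pd

# toy "user-item-score" triples from data set A
data = pd.DataFrame({
    "user":  [1, 1, 2, 2, 2, 3],
    "item":  [10, 11, 10, 12, 13, 11],
    "score": [4, 5, 3, 2, 5, 1],
})
# CountRating(b): number of items each user b has rated
count_rating = data.groupby("user")["item"].count()
print(count_rating.to_dict())
```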
3. The personalized recommendation method based on ensemble learning according to claim 1, wherein: the step 5 specifically includes the following steps:
step 5.1: constructing a graph: computing the Euclidean distance between each sample x_i in the "user-item-score" data set A and every other sample x_j, as follows:
dist(x_i, x_j) = ||x_i − x_j|| (3)
wherein ε is a manually set threshold, typically the mean of the pairwise sample distances, and m is the total number of samples in the data set; if the Euclidean distance is smaller than ε, the two samples are considered very close and an edge is established between node i and node j of the graph;
step 5.2: determining weights: if node i is connected to node j, the weight of the edge between them is computed by the heat kernel function:
ω_ij = exp(−||x_i − x_j||² / t) (4)
wherein ω_ij denotes the weight between node i and node j, x_i and x_j are samples in the "user-item-score" data set A, and t is a manually set real number greater than 0;
step 5.3: computing the projection matrix by solving the generalized eigenvalue problem:
XLX^T a = λXDX^T a (5)
Let the solutions of the formula be a_0, a_1, ..., a_{l−1}, ordered by their corresponding eigenvalues λ from small to large; the projection transformation matrix is C = (a_0, a_1, ..., a_{n−1}), formed from the eigenvectors of the n smallest eigenvalues, and the dimension-reduced sample point is y_i = C^T x_i;
wherein X is the matrix X mentioned in step 5; the adjacency matrix W is constructed from the weights ω_ij of step 5.2; the main diagonal of the diagonal matrix D holds the weighted degree of each vertex of the graph constructed in step 5.1, the weighted degree of node i being the sum of the weights of all edges incident to that node, i.e. the sum of the elements of the corresponding row of the adjacency matrix W; the Laplacian matrix L is defined as L = D − W.
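Steps 5.1–5.3 can be sketched as follows, storing samples one per column as in step 5 (a simplified LPP sketch with a small regularization term added for numerical stability; function and variable names are choices of this example, not the patent's code):

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, n_components=2, eps=None, t=1.0):
    """Locality Preserving Projections: X is l x m, one sample per column."""
    diff = X[:, :, None] - X[:, None, :]        # pairwise differences, l x m x m
    dist = np.sqrt((diff ** 2).sum(axis=0))     # Euclidean distances (step 5.1)
    if eps is None:
        eps = dist.mean()                       # threshold: mean pairwise distance
    W = np.where(dist < eps, np.exp(-dist ** 2 / t), 0.0)  # heat kernel (step 5.2)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))                  # weighted degree of each vertex
    L = D - W                                   # Laplacian matrix
    A = X @ L @ X.T
    B = X @ D @ X.T + 1e-9 * np.eye(X.shape[0])  # regularized for stability
    vals, vecs = eigh(A, B)                     # XLX^T a = lambda XDX^T a (step 5.3)
    C = vecs[:, :n_components]                  # eigenvectors of smallest eigenvalues
    return C.T @ X                              # y_i = C^T x_i

rng = np.random.default_rng(0)
X = rng.random((5, 30))                         # l = 5 dimensions, m = 30 samples
Y = lpp(X, n_components=2)
print(Y.shape)                                  # (2, 30)
```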
4. The personalized recommendation method based on ensemble learning according to claim 1, wherein: the step 7 comprises the following steps:
step 7.1: the gradient boosting decision tree model is defined as follows:
f_K(Y') = Σ_{k=1}^{K} h_k(Y') (6)
wherein Y' is the matrix Y' mentioned in step 6, k denotes the round of the score prediction learner, and K denotes the total number of rounds of the score prediction learner; f_k(Y') denotes the score prediction learner of the k-th round, and h_k(Y') denotes the k-th classification and regression decision tree;
step 7.2: constructing a classification and regression decision tree, i.e. h(Y') in step 7.1;
step 7.3: the score prediction learner adopts a forward stagewise algorithm; the model of step k is built from the model of step k−1, i.e. the k-th round of the score prediction learner depends on the learners of the previous k−1 rounds, as follows:
f_k(Y') = f_{k−1}(Y') + β_k (7)
wherein f_k(Y') is the k-th round score prediction learner, f_{k−1}(Y') is the (k−1)-th round score prediction learner, and β_k denotes the residual correction produced by the k-th round;
step 7.4: continuing the iteration until it is complete, at which point the model is established.
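The forward stagewise boosting of step 7.3 (f_k = f_{k−1} plus a residual correction) can be sketched with scikit-learn CART trees fitted to the residuals at each round. This is an illustrative sketch for squared loss with an added learning rate; the names and toy data are not from the patent:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_rounds=50, lr=0.1, max_depth=3):
    """Forward stagewise GBDT: each round fits a CART tree h_k to the
    residuals of f_{k-1}, then updates f_k = f_{k-1} + lr * h_k."""
    f = np.full_like(y, y.mean(), dtype=float)    # f_0: constant prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - f                          # negative gradient of squared loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f = f + lr * h.predict(X)
        trees.append(h)
    return trees, y.mean(), lr

def predict_gbdt(model, X):
    trees, f0, lr = model
    return f0 + lr * sum(t.predict(X) for t in trees)

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = 2 * X[:, 0] + X[:, 1] ** 2                    # toy score target
model = fit_gbdt(X, y)
mse = np.mean((predict_gbdt(model, X) - y) ** 2)
print(round(mse, 4))                              # training error shrinks each round
```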
5. The personalized recommendation method based on ensemble learning according to claim 4, wherein: the step 7.2 includes the following steps:
step 7.21: dividing the preprocessed data set B into regions H_1, H_2, ..., H_o, whose output values are p_1, p_2, ..., p_o respectively;
step 7.22: recursively dividing each region into two sub-regions and determining the output value on each sub-region; selecting the optimal splitting variable q and split point s according to the following formula:
min_{q,s} [ min_{p_1} Σ_{u_v ∈ H_1(q,s)} (w_v − p_1)² + min_{p_2} Σ_{u_v ∈ H_2(q,s)} (w_v − p_2)² ] (8)
wherein p_1 is the output of region H_1 divided in step 7.21, p_2 is the output of region H_2 divided in step 7.21, u_v and w_v respectively denote the feature attributes and the score of the data in the corresponding region, and v runs up to the number of samples in the divided region; traversing the variables q and, for each fixed splitting variable q, scanning the split points s, the pair (q, s) that minimizes the above expression is selected; the region is divided with the selected pair (q, s) and the corresponding output values are determined;
step 7.23: repeating steps 7.21 and 7.22 on the two sub-regions until a stopping condition is met;
step 7.24: the input space is finally partitioned into o regions H'_1, H'_2, ..., H'_o, generating the score prediction classification and regression decision tree:
h(u) = Σ_{O=1}^{o} p_O I(u ∈ H'_O) (9)
wherein h(u) is the score prediction classification and regression decision tree, H'_O are the divided regions, O denotes the region subscript and o the total number of divided regions; p_O is the fixed output value of the region divided in step 7.21, and q' and s' are the optimal solutions obtained by iterating steps 7.21 and 7.22.
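The exhaustive (q, s) search of step 7.22 — minimizing the summed squared error of the two candidate regions, each predicted by its mean output — can be sketched as follows (toy data and invented names, for illustration only):

```python
import numpy as np

def best_split(X, y):
    """For each feature q and split point s, minimize the sum of squared
    errors of the two regions; each region's prediction is its mean output."""
    best = (None, None, np.inf)
    for q in range(X.shape[1]):                   # traverse splitting variables q
        for s in np.unique(X[:, q]):              # scan split points s
            left, right = y[X[:, q] <= s], y[X[:, q] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (q, s, sse)
    return best

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
q, s, sse = best_split(X, y)
print(q, s, sse)                                  # clean split between 3 and 10
```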
6. The personalized recommendation method based on ensemble learning according to claim 1, wherein: the step 8 includes the following steps:
step 8.1: initializing a data set D' = {(x'_1, y'_1), ..., (x'_n, y'_n)}, wherein y'_i = f'(x'_i) and f'(x') is the mapping from the dimension attributes in the data to the score;
step 8.2: training the gradient boosting decision tree model with the selected hyperparameter combination x'_i and computing f'(x'_i);
step 8.3: computing the next hyperparameter combination x'_{i+1} with an acquisition function;
step 8.4: repeating steps 8.2 and 8.3 for T' iterations;
step 8.5: outputting the hyperparameter combination that optimizes the objective function f'(x').
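Steps 8.1–8.5 can be illustrated with a minimal one-dimensional Bayesian optimization loop: a Gaussian process surrogate plus an expected-improvement acquisition function. This is a generic sketch for minimization; the surrogate, kernel, and candidate grid are choices of this example and are not specified by the patent:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, bounds, n_init=5, n_iter=15, seed=0):
    """Minimize f on [lo, hi] with a GP surrogate and expected improvement."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, n_init).reshape(-1, 1)   # step 8.1: initial data set D'
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    cand = np.linspace(lo, hi, 200).reshape(-1, 1)   # candidate hyperparameter values
    for _ in range(n_iter):                          # steps 8.2-8.4: iterate T' times
        gp.fit(X, y)                                 # step 8.2: fit surrogate model
        mu, sigma = gp.predict(cand, return_std=True)
        best = y.min()
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # step 8.3: acquisition
        x_next = cand[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y), 0], y.min()               # step 8.5: best combination

x_best, y_best = bayes_opt(lambda x: (x - 2.0) ** 2, bounds=(0.0, 5.0))
print(round(float(x_best), 2), round(float(y_best), 4))
```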
7. The personalized recommendation method based on ensemble learning according to claim 1, wherein: the step 10 includes the following steps:
step 10.1: setting the value N, i.e. the number of items to recommend to each user, and defining the number of users as count;
step 10.2: for each user, recording the real recommendation list generated on the test set Test as T(All); performing score prediction on the test set Test with the Bayesian-optimized gradient boosting decision tree recommendation model, the obtained result being defined as the test evaluation set;
step 10.3: sorting the test evaluation set by score, recommending the first N items to each user, and recording the Top-N recommendation list obtained for each user as T(Test);
step 10.4: verifying the precision and recall results of the test evaluation set;
step 10.5: computing the length of T(Test);
step 10.6: computing the length of T(All);
step 10.7: computing the intersection T(U) between each user's Top-N recommendation list T(Test) and the real list T(All);
step 10.8: computing precision: Precision = |T(U)| / |T(Test)|; the precision of each user is accumulated and the sum divided by count to obtain the average precision;
step 10.9: computing recall: Recall = |T(U)| / |T(All)|; the recall of each user is accumulated and the sum divided by count to obtain the average recall.
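The per-user precision and recall averaging of steps 10.5–10.9 can be sketched as follows (toy lists; T(Test) is the Top-N list, T(All) the real list, T(U) their intersection — the dictionaries are invented example data):

```python
def topn_metrics(recommended, actual):
    """Average per-user Precision = |T(U)|/|T(Test)| and Recall = |T(U)|/|T(All)|."""
    precisions, recalls = [], []
    for user, rec in recommended.items():
        hit = set(rec) & set(actual.get(user, []))          # T(U)
        precisions.append(len(hit) / len(rec) if rec else 0.0)
        recalls.append(len(hit) / len(actual[user]) if actual.get(user) else 0.0)
    count = len(recommended)
    return sum(precisions) / count, sum(recalls) / count

recommended = {1: [10, 11], 2: [10, 12]}   # Top-N lists T(Test) per user
actual = {1: [10, 11, 13], 2: [14]}        # real lists T(All) per user
p, r = topn_metrics(recommended, actual)
print(p, r)                                # precision 0.5, recall ~0.333
```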
CN202110629501.6A 2021-03-26 2021-06-07 Personalized recommendation method based on ensemble learning Active CN113326433B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021103231807 2021-03-26
CN202110323180 2021-03-26

Publications (2)

Publication Number Publication Date
CN113326433A CN113326433A (en) 2021-08-31
CN113326433B true CN113326433B (en) 2023-10-10

Family

ID=77419834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110629501.6A Active CN113326433B (en) 2021-03-26 2021-06-07 Personalized recommendation method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN113326433B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843928A (en) * 2016-03-28 2016-08-10 西安电子科技大学 Recommendation method based on double-layer matrix decomposition
CN108763362A (en) * 2018-05-17 2018-11-06 浙江工业大学 Method is recommended to the partial model Weighted Fusion Top-N films of selection based on random anchor point
CN110109902A (en) * 2019-03-18 2019-08-09 广东工业大学 A kind of electric business platform recommender system based on integrated learning approach
CN110297978A (en) * 2019-06-28 2019-10-01 四川金蜜信息技术有限公司 Personalized recommendation algorithm based on integrated recurrence
CN110348580A (en) * 2019-06-18 2019-10-18 第四范式(北京)技术有限公司 Construct the method, apparatus and prediction technique, device of GBDT model
WO2020233245A1 (en) * 2019-05-20 2020-11-26 山东科技大学 Method for bias tensor factorization with context feature auto-encoding based on regression tree
CN112183946A (en) * 2020-09-07 2021-01-05 腾讯音乐娱乐科技(深圳)有限公司 Multimedia content evaluation method, device and training method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Nie Lisheng. "Personalized recommendation of learning resources based on behavior analysis." Computer Technology and Development, 2020, No. 7, full text. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant