CN117056595A - Interactive project recommendation method and device and computer readable storage medium - Google Patents

Interactive project recommendation method and device and computer readable storage medium

Info

Publication number
CN117056595A
Authority
CN
China
Prior art keywords
user
item
function
network
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310966515.6A
Other languages
Chinese (zh)
Inventor
魏文国
陈俊儒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202310966515.6A priority Critical patent/CN117056595A/en
Publication of CN117056595A publication Critical patent/CN117056595A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2219Large Object storage; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/103Workflow collaboration or project management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an interactive item recommendation method, an apparatus and a computer readable storage medium. The method comprises the following steps: acquiring user embedded features and item embedded features from randomly extracted sample batches; passing the user embedded features and the item embedded features through a trained GMF network to obtain a collaborative feature vector formed by concatenating the user embedded feature vector and the item embedded feature vector; determining a policy function in an actor network according to an action value function obtained through a critic network and an entropy term regulated by a self-updating temperature coefficient; obtaining a current state according to the collaborative feature vector; and obtaining a current action according to the current state and the policy function, and converting the current action into an item recommendation list. The application enhances the utilization of similarity information between users and items through the GMF network, and reduces the influence of user preference changes on recommended content through the continuously updated critic network and actor network.

Description

Interactive project recommendation method and device and computer readable storage medium
Technical Field
The application relates to the field of big-data-oriented deep learning, in particular to an interactive item recommendation method, apparatus and computer readable storage medium.
Background
An interactive recommendation system based on deep reinforcement learning can learn a user's preferences through interaction with the user and recommend relevant items accordingly. Compared with traditional recommendation systems, such systems are more sensitive to changes in user preferences perceived from the environment.
The main problems faced by deep reinforcement learning algorithms, and by the recommendation systems modeled on them, are similar: the complexity of action exploration and decision-making in a high-dimensional state space, and convergence in policy optimization. Although a traditional deep-learning-based recommendation system can improve its ability to predict a user's fixed preferences by analysing user- and item-related features, it cannot make accurate recommendations for users whose preferences change continuously over time.
Disclosure of Invention
The embodiments of the application provide an interactive item recommendation method, apparatus and computer readable storage medium, which enhance the utilization of similarity information between users and items through a GMF (Generalized Matrix Factorization) network, and reduce the influence of bias terms caused by user preference changes on recommended content through a continuously updated critic network and actor network.
To achieve the above object, a first aspect of an embodiment of the present application provides an interactive item recommendation method, including:
acquiring user embedded features and item embedded features from randomly extracted sample batches;
passing the user embedded features and the item embedded features through a trained GMF network to obtain a collaborative feature vector formed by concatenating the user embedded feature vector and the item embedded feature vector;
determining a policy function in an actor network according to an action value function obtained through a critic network and an entropy term regulated by a self-updating temperature coefficient;
obtaining a current state according to the collaborative feature vector;
and obtaining a current action according to the current state and the policy function, and converting the current action into an item recommendation list.
In a possible implementation manner of the first aspect, the training procedure of the GMF network is:
acquiring real user preference from the history record;
obtaining a plurality of user embedded features and a plurality of item embedded features through sampling;
in the GMF layer, performing feature crossing between each user embedded feature and the corresponding item embedded feature to obtain a plurality of cross features;
fitting the plurality of cross features by randomly selecting and deactivating a portion of neurons in a dropout layer;
and in the prediction layer, each time a user predicted preference is obtained from one cross feature, updating the prediction function according to the difference between the user predicted preference and the user's real preference.
In a possible implementation manner of the first aspect, the obtaining of the user predicted preference from one cross feature specifically includes:
constructing a prediction function in the form of a fully connected layer;
and obtaining the user predicted preference according to the prediction function.
In a possible implementation manner of the first aspect, the updating of the prediction function according to the difference between the user predicted preference and the user's real preference specifically includes:
updating the gradient of the loss function by comparing the difference between the user predicted preference and the user's real preference;
and updating the parameters of the prediction function and the dropout layer according to the loss function.
In a possible implementation manner of the first aspect, the obtaining a current action according to the current state and the policy function specifically includes:
determining the set of selectable items in the current state according to the current state;
determining the value of the mask vector according to the recommendation record of each item;
and obtaining the current action according to the set of selectable items in the current state, the mask vector and the policy function.
In a possible implementation manner of the first aspect, the converting the current action into an item recommendation list specifically includes:
acquiring recommendation probability corresponding to each item;
arranging all items in descending order according to the recommendation probability;
and forming the item recommendation list from a preset number of top-ranked items.
In a possible implementation manner of the first aspect, after the obtaining the current action and converting the current action into the item recommendation list, the method further includes:
obtaining a current reward through real-time interaction with a user environment;
obtaining a historical reward through historical interactions with the user's environment;
obtaining a target state according to the historical reward and the current reward, and storing the current state, the current action, the current reward and the target state as a quadruple in a prioritized experience replay pool;
sampling quadruples from the prioritized experience replay pool, inputting the sampling result into the critic network, and obtaining an action value term from the action value function in the critic network;
and updating the policy function according to the action value term and an entropy update term obtained through the alpha function.
In a possible implementation manner of the first aspect, after the updating of the policy function, the method further includes:
obtaining a policy function evaluation term according to the action value function and the policy function;
updating the action value function and the alpha function according to the policy function evaluation term and the real action value term; the real action value term is obtained from the current reward and a target state value; the target state value comprises a target action value term discounted by an attenuation factor and an entropy term obtained through the alpha function; and the target action value term is obtained by evaluating the target state with the policy function.
A second aspect of an embodiment of the present application provides an interactive item recommendation apparatus, including:
the random acquisition module is used for acquiring user embedded features and item embedded features from randomly extracted sample batches;
the vector acquisition module is used for passing the user embedded features and the item embedded features through a trained GMF network to obtain a collaborative feature vector formed by concatenating the user embedded feature vector and the item embedded feature vector;
the function determination module is used for determining a policy function in the actor network according to the action value function obtained through the critic network and the entropy term regulated by the self-updating temperature coefficient;
the state acquisition module is used for obtaining the current state according to the collaborative feature vector;
and the item recommendation module is used for obtaining the current action according to the current state and the policy function and converting the current action into an item recommendation list.
A third aspect of embodiments of the present application provides a computer readable storage medium storing a computer program which when executed by a processor implements an interactive item recommendation method as described above.
Compared with the prior art, in the interactive item recommendation method, apparatus and computer readable storage medium provided by the embodiments of the application, the recommendation process is completed by the GMF network, the critic network, the actor network, the simulated user environment and the prioritized experience replay pool. Cross features between users and items are trained through the GMF network, enhancing the recommendation agent's utilization of similarity information between users and items; the critic network and the actor network are continuously updated through the alpha function containing the self-updating temperature coefficient α, reducing the influence of bias terms on the recommendation result.
Because the GMF network learns users whose preferences change over time, updated user embedded features are obtained to adapt the GMF network to these changes; when the critic and actor networks are updated, the current reward obtained from real-time interaction with the user serves as the update basis, yielding new critic and actor networks that adapt to the user's changes.
Drawings
FIG. 1 is a flow chart of an interactive item recommendation method according to an embodiment of the application;
FIG. 2 is a schematic diagram of the updating of the GMF network, the actor network and the critic network according to an embodiment of the present application;
fig. 3 is a schematic diagram of a GMF network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a user environment according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, an embodiment of the present application provides an interactive item recommendation method, which includes:
S10, acquiring user embedded features and item embedded features from randomly extracted sample batches;
S11, passing the user embedded features and the item embedded features through a trained GMF network to obtain a collaborative feature vector formed by concatenating the user embedded feature vector and the item embedded feature vector;
S12, determining a policy function in the actor network according to an action value function obtained through the critic network and an entropy term regulated by the self-updating temperature coefficient;
S13, obtaining a current state according to the collaborative feature vector;
S14, obtaining a current action according to the current state and the policy function, and converting the current action into an item recommendation list.
This embodiment adopts the Generalized Matrix Factorization (GMF) algorithm in S10-S11 to train the cross features between users and items, thereby enhancing the recommendation agent's utilization of similarity information between users and items. Meanwhile, S13-S14 adopt a critic network and an actor network to realize Soft Actor-Critic (SAC) algorithm modeling. As shown in fig. 2, the combination of the GMF network, the simulated user environment, the prioritized experience replay pool, the critic network and the actor network can be regarded as a recommendation agent; the recommendation agent performs item recommendation so as to reduce the influence of bias on the system, and policy optimization finally improves the recommendation agent's ability to predict user preferences. The training process of the recommendation agent mainly comprises five parts: the GMF network, the actor network and the critic network based on the SAC algorithm, the simulated user environment, and the prioritized experience replay technique.
In S10, the user embedded features and item embedded features are obtained from randomly extracted batches. Each batch contains a certain number of user-item pairs (i.e., a given user's preference for a given item): one part consists of positive sample pairs formed by a user and items they like, and the other part of negative sample pairs formed by a user and items they dislike.
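By way of illustration only, the following is a minimal Python sketch of how such a batch might be assembled; the function name, the 1:1 negative-sampling ratio and the batch size are assumptions for illustration, not details taken from the patent.

```python
import random

def sample_batch(positive_pairs, num_items, batch_size=256, neg_ratio=1):
    """Hypothetical sketch: randomly extract a batch of user-item pairs.
    Positive pairs (user, liked item, label 1) come from the history;
    for each one, `neg_ratio` negative pairs (user, non-interacted item,
    label 0) are drawn at random. Ratio and size are assumed values."""
    pos = random.sample(positive_pairs, batch_size // (1 + neg_ratio))
    seen = set(positive_pairs)
    batch = [(u, i, 1.0) for u, i in pos]
    for u, _ in pos:
        for _ in range(neg_ratio):
            j = random.randrange(num_items)
            while (u, j) in seen:  # resample until a non-interacted item is found
                j = random.randrange(num_items)
            batch.append((u, j, 0.0))
    return batch
```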
It should be noted that a feature/feature vector in the present application refers to a vector that can represent certain characteristics of a thing. User features and item features are used to characterize users and items, respectively, where an item may be a movie on a video website, on-demand music on a music website, a specific commodity on a shopping website, a local service on a local-services platform, and so on. Since all such websites/platforms provide recommendation services to users, "item" is a general term for various content services; for convenience of illustrating the principles, the application to movie items is taken as the example.
The GMF network architecture is shown in fig. 3. The GMF network predicts user preferences by analysing the embedded features of users and items, and trains its network parameters by comparing the predicted preferences with the real preferences, obtaining embedded vectors that contain user- and item-related features. These embedded vectors are the user features and item features trained through the GMF network; they contain the crossing information of the user and the item, which is also called the cross feature.
The input of the GMF network is the sampled user and item embedded features together with the user's real preference for the item, and the output is the predicted preference of the user for the item. Training updates the gradient of the cross-entropy loss by comparing the predicted preference with the real preference, finally obtaining embedded vectors containing user- and item-related features.
Illustratively, the training procedure of the GMF network is:
acquiring real user preference from the history record;
obtaining a plurality of user embedded features and a plurality of item embedded features through sampling;
in the GMF layer, performing feature crossing between each user embedded feature and the corresponding item embedded feature to obtain a plurality of cross features;
fitting the plurality of cross features by randomly selecting and deactivating a portion of neurons in a dropout layer;
and in the prediction layer, each time a user predicted preference is obtained from one cross feature, updating the prediction function according to the difference between the user predicted preference and the user's real preference.
The GMF network obtains the most direct representation of the user's movie preferences from the history, i.e., it represents user preferences through an implicit feedback matrix:

$$y_{u,i}=\begin{cases}1, & \text{the interaction of user } u \text{ with item } i \text{ is consistent with the user's preference}\\ 0, & \text{otherwise}\end{cases}$$

where $U$ and $I$ denote the user set and the item set, and each element $y_{u,i}$ ($u\in U$, $i\in I$) of the matrix indicates the preference of user $u$ for item $i$: 1 is consistent with the user's preference, 0 is not.
The rows and columns of the matrix represent the initialized user features and item features, respectively.
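A minimal sketch of building this implicit feedback matrix from a rating history; the convention that ratings of at least 4 on a 1-5 scale count as consistent with the user's preference is an illustrative assumption, not stated above:

```python
import numpy as np

def build_implicit_feedback(ratings, num_users, num_items, threshold=4.0):
    """y[u, i] = 1 if user u's recorded rating of item i meets the
    threshold, else 0. The 4.0 threshold is an assumed convention."""
    y = np.zeros((num_users, num_items), dtype=np.float32)
    for user, item, rating in ratings:  # history records (u, i, rating)
        if rating >= threshold:
            y[user, item] = 1.0
    return y

# e.g. three history records (user_id, item_id, rating)
y = build_implicit_feedback([(0, 2, 5.0), (0, 3, 2.0), (1, 2, 4.0)], 2, 4)
```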
The feature-crossing operation of the GMF network is:

$$g_{u,i} = d\left(e_u \odot e_i\right)$$

where $d$ denotes the dropout layer, which reduces the influence of overfitting on the algorithm by randomly selecting and deactivating a portion of neurons; $e_u$ and $e_i$ denote the user embedding and the item embedding, respectively; and the element-wise product yields the user-item cross feature $g_{u,i}$.
Illustratively, the obtaining of the user predicted preference from one cross feature specifically includes:
constructing a prediction function in the form of a fully connected layer;
and obtaining the user predicted preference according to the prediction function.
The prediction layer of the GMF network usually takes the form of a fully connected layer. Let $g = [g_1, g_2, \dots, g_M]$ and $p = [p_1, p_2, \dots, p_N]$ denote its input and output, respectively; the fully connected layer is

$$p = \sigma\left(W_p\, g + b_p\right)$$

where $\sigma$ denotes the sigmoid activation function, $W_p$ the connection weight matrix and $b_p$ the bias term.
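Putting the GMF layer, dropout layer and prediction layer together, a minimal PyTorch sketch of the network described above might look as follows; the embedding size, dropout rate and the scalar-preference output are assumptions for illustration:

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    """Sketch of the GMF network: embeddings -> element-wise product
    (feature crossing) -> dropout -> fully connected layer -> sigmoid.
    dim=32 and p_drop=0.2 are assumed values."""
    def __init__(self, num_users, num_items, dim=32, p_drop=0.2):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)  # e_u
        self.item_emb = nn.Embedding(num_items, dim)  # e_i
        self.dropout = nn.Dropout(p_drop)             # dropout layer d
        self.predict = nn.Linear(dim, 1)              # W_p, b_p

    def forward(self, users, items):
        g = self.user_emb(users) * self.item_emb(items)  # g_{u,i} = e_u ⊙ e_i
        g = self.dropout(g)
        return torch.sigmoid(self.predict(g)).squeeze(-1)  # p = σ(W_p g + b_p)
```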
Illustratively, the updating of the prediction function according to the difference between the user predicted preference and the user's real preference specifically includes:
updating the gradient of the loss function by comparing the difference between the user predicted preference and the user's real preference;
and updating the parameters of the prediction function and the dropout layer according to the loss function.
The prediction of the user's item preference is obtained through the fully connected layer, the loss function is obtained by comparison with the real user preference, and the GMF model is trained by gradient updates of the loss function. The loss function is:

$$l_n = -w_n\left(t_n \log p_n + (1 - t_n)\log(1 - p_n)\right)$$

where $n \in N$ indexes the samples of the batch, $l_n$ denotes the cross-entropy loss, $w_n$ the loss-function weight, and $p_n$ and $t_n$ the predicted and real values, respectively.
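A hedged sketch of one training step against this loss, using PyTorch's built-in weighted binary cross-entropy as the counterpart of $l_n$; uniform weights are a reasonable default when no $w_n$ is specified:

```python
import torch
import torch.nn.functional as F

def gmf_train_step(model, optimizer, users, items, targets, weights):
    """One gradient update of the GMF network. `targets` are the real
    preferences t_n, `weights` play the role of w_n."""
    preds = model(users, items)  # p_n
    loss = F.binary_cross_entropy(preds, targets, weight=weights)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```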
The user embedded features trained through the GMF network are concatenated with the item embedded features to form user-item collaborative features, which are stored in a temporary file; this file is later used for the state embedding in the subsequent recommendation-agent training part (the actor network).
The SAC algorithm was originally a deep reinforcement learning algorithm for robot skill learning. The main elements of the algorithm include states, actions, rewards and policies.
The application uses a SAC algorithm with an alpha function (containing the self-updating temperature coefficient α) in the recommendation agent. The main flow by which the recommendation agent generates recommendations and learns user preferences is as follows:
(1) Concatenate the user embedded feature vector and the item embedded feature vector trained by the GMF network to obtain the current state, and let the policy function in the actor network select the current action;
(2) Convert the current action into a recommendation list, and obtain the current reward through interaction with the user environment;
(3) Obtain a target state from the historical rewards and the current reward, and store the current state, current action, current reward and target state as a quadruple in the prioritized experience replay pool;
(4) Sample quadruples from the prioritized experience replay pool, and obtain the Q value (action value) through the Q function (action value function) in the critic network;
(5) Update the policy function in the actor network and the Q function and alpha function in the critic network.
Illustratively, the obtaining a current action according to the current state and the policy function specifically includes:
determining the set of selectable items in the current state according to the current state;
determining the value of the mask vector according to the recommendation record of each item;
and obtaining the current action according to the set of selectable items in the current state, the mask vector and the policy function.
The current action is obtained as:

$$a_{u,t}=\begin{cases}\arg\max_{a\in A_{u,t}}\big(\pi(a\mid s_{u,t-1})\cdot m_{u,t}^{i}\big), & t>\epsilon\\ \text{random choice from } A_{u,t}, & t\le\epsilon\end{cases}$$

where $A_{u,t}$ denotes the set of selectable items in the current state, and $a_{u,t}$ denotes the action by which the current policy function $\pi(a\mid s_{u,t-1})$ and the mask vector $m_{u,t}^{i}$ select the current recommended content for user $u$ from the current state. When $t>\epsilon$, the action selects the highest-probability content among the selectable actions; when $t\le\epsilon$, content is selected at random. For the mask vector $m_{u,t}^{i}$, a value of 0 indicates that item $i$ has already been recommended to user $u$, and a value of 1 indicates it has not. The subscript $t-1$ denotes the value each variable carries into the current training round, e.g., the value from round 1 serves as the current value in round 2.
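A minimal sketch of this masked action selection; the tensor shapes and the reading of ε as an exploration-step threshold follow the description above, and the variable names are ours:

```python
import torch

def select_action(policy_probs, mask, t, eps):
    """policy_probs: pi(a|s) over all items (1-D tensor); mask: 1.0 for
    items not yet recommended to the user, 0.0 otherwise. For t <= eps
    a random selectable item is drawn (exploration); for t > eps the
    highest-probability selectable item is taken."""
    if t <= eps:
        candidates = torch.nonzero(mask, as_tuple=False).squeeze(-1)
        return candidates[torch.randint(len(candidates), (1,))].item()
    return torch.argmax(policy_probs * mask).item()  # mask zeroes recommended items
```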
Illustratively, the converting the current action into the item recommendation list specifically includes:
acquiring recommendation probability corresponding to each item;
arranging all items in descending order according to the recommendation probability;
and forming the item recommendation list from a preset number of top-ranked items.
In this embodiment, the conversion criterion is each item's recommendation probability. For example, in top10 recommendation, the current action obtained by the agent assigns a recommendation probability to each item in the movie item list, and the recommendation list is generated from the 10 items with the highest recommendation probabilities.
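Converting the action's per-item probabilities into a top-k list is a straightforward descending sort; a minimal sketch:

```python
import torch

def to_recommendation_list(probs, k=10):
    """Keep the k items with the highest recommendation probability,
    in descending order (k=10 matches the top10 example)."""
    topk = torch.topk(probs, k=min(k, probs.numel()))
    return topk.indices.tolist()  # item ids, highest probability first
```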
Illustratively, after the obtaining the current action and converting the current action into the item recommendation list, the method further comprises:
obtaining a current reward through real-time interaction with a user environment;
obtaining a historical reward through historical interactions with the user's environment;
obtaining a target state according to the historical reward and the current reward, and storing the current state, the current action, the current reward and the target state as a quadruple in a prioritized experience replay pool (a simplified sketch of such a pool follows this list);
sampling quadruples from the prioritized experience replay pool, inputting the sampling result into the critic network, and obtaining an action value term from the action value function in the critic network;
and updating the policy function according to the action value term and an entropy update term obtained through the alpha function.
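The following is a simplified sketch of such a replay pool; the patent's embodiment uses a binary-tree (sum-tree) structure for priorities, while this illustrative version samples proportionally to priorities from a flat buffer, which is simpler but slower:

```python
import random
from collections import deque

class ReplayPool:
    """Stores (state, action, reward, target_state) quadruples with a
    priority each; sampling is proportional to priority. A flat deque is
    an assumed simplification of the binary-tree structure."""
    def __init__(self, capacity=10000):
        self.data = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def add(self, quadruple, priority=1.0):
        self.data.append(quadruple)
        self.priorities.append(priority)

    def sample(self, batch_size=32):
        k = min(batch_size, len(self.data))
        return random.choices(self.data, weights=self.priorities, k=k)
```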
The policy function $\pi$ in the actor network is updated by minimizing

$$J_{\pi}=\mathbb{E}\big[\alpha\log\pi(a'_{u,t}\mid s_{u,t})-Q_{\omega}(s_{u,t},a'_{u,t})\big]$$

where $\alpha\log\pi(a'_{u,t}\mid s_{u,t})$ denotes the entropy term adjusted by the self-updating temperature coefficient $\alpha$ (the alpha function), $Q_{\omega}(s_{u,t},a'_{u,t})$ denotes the reward term evaluated with the action value function of the critic part, and $a'_{u,t}$ ranges over all possible actions predicted by the current policy function.
Illustratively, after the updating of the policy function, the method further includes:
obtaining a policy function evaluation term according to the action value function and the policy function;
updating the action value function and the alpha function according to the policy function evaluation term and the real action value term; the real action value term is obtained from the current reward and a target state value; the target state value comprises a target action value term discounted by an attenuation factor and an entropy term obtained through the alpha function; and the target action value term is obtained by evaluating the target state with the policy function.
The current Q function in the critic network is updated by minimizing

$$J_{Q}(\omega)=\mathbb{E}\big[\big(Q_{\omega}(s_{u,t},a_{u,t})-\hat{y}\big)^{2}\big]$$

This contains two Q-value terms. $Q_{\omega}(s_{u,t},a_{u,t})$ denotes the predicted action value term, whose main function is to participate directly in the evaluation of the actor's policy function, thereby guiding the optimization and updating of the policy network. $\hat{y}$ denotes the true action value term based on the reward $r(s_{u,t},a_{u,t})$ and the target state value:

$$\hat{y}=r(s_{u,t},a_{u,t})+\gamma\big(Q_{\bar{\omega}}(s_{u,t+1},a_{u,t+1})-\alpha\log\pi(a_{u,t+1}\mid s_{u,t+1})\big)$$

where the target state value is the target action value term $Q_{\bar{\omega}}$, discounted by the attenuation factor $\gamma$, minus the entropy term containing the self-updating temperature coefficient $\alpha$; the target action $a_{u,t+1}$ is obtained by evaluating the target state with the policy function. The parameters of the target action value function follow a flexible update based on the smoothing factor $\tau$:

$$\bar{\omega}\leftarrow\tau\omega+(1-\tau)\bar{\omega}$$
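A sketch of the corresponding critic update and the flexible (Polyak) target update; the discount γ = 0.99 and τ = 0.005 are assumed values, and `actor` is a hypothetical callable returning π(·|s):

```python
import torch
import torch.nn.functional as F

def critic_loss(q_net, target_q_net, actor, batch, log_alpha, gamma=0.99):
    """MSE between the predicted Q-value and the true action value term
    y = r + gamma * E_{a'~pi}[Q_bar(s', a') - alpha * log pi(a'|s')]."""
    states, actions, rewards, next_states = batch
    alpha = log_alpha.exp().detach()
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_probs = actor(next_states)               # pi(.|s')
        next_log_probs = torch.log(next_probs + 1e-8)
        target_q = target_q_net(next_states)          # Q_bar(s', .)
        v_next = (next_probs * (target_q - alpha * next_log_probs)).sum(dim=1)
        y = rewards + gamma * v_next                  # true action value term
    return F.mse_loss(q_pred, y)

def soft_update(target_net, net, tau=0.005):
    """Flexible update of the target critic with smoothing factor tau."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)
```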
the self-updating temperature control factor α in the reviewer network is denoted as α network:
in the formulaIs the minimum entropy constant, i.e. the opposite number of motion space dimensions.
The above embodiments also use a user-environment simulation technique and the prioritized experience replay technique. Referring to fig. 4, when simulating the user environment, the embodiments use an offline data set to simulate user behavior, judging whether the user likes an item according to the user's specific rating of that item in the history. For historical experience storage and extraction, the embodiments use a binary-tree-based data structure that can store experience priorities; this part is not the focus of the embodiments and is not developed in detail.
Compared with the prior art, in the interactive item recommendation method provided by the embodiments of the application, the recommendation process is completed by the GMF network, the critic network, the actor network, the simulated user environment and the prioritized experience replay pool. Cross features between users and items are trained through the GMF network, enhancing the recommendation agent's utilization of similarity information between users and items; the critic network and the actor network are continuously updated through the alpha function containing the self-updating temperature coefficient α, reducing the influence of bias terms on the recommendation result.
Because the GMF network learns users whose preferences change over time, updated user embedded features are obtained to adapt the GMF network to these changes; when the critic and actor networks are updated, the current reward obtained from real-time interaction with the user serves as the update basis, yielding new critic and actor networks that adapt to the user's changes.
A second aspect of an embodiment of the present application provides an interactive item recommendation apparatus, including: a random acquisition module, a vector acquisition module, a function determination module, a state acquisition module and an item recommendation module.
The random acquisition module is used for acquiring user embedded features and item embedded features from randomly extracted sample batches.
The vector acquisition module is used for passing the user embedded features and the item embedded features through a trained GMF network to obtain a collaborative feature vector formed by concatenating the user embedded feature vector and the item embedded feature vector.
The function determination module is used for determining a policy function in the actor network according to the action value function obtained through the critic network and the entropy term regulated by the self-updating temperature coefficient.
The state acquisition module is used for obtaining the current state according to the collaborative feature vector.
And the item recommendation module is used for obtaining the current action according to the current state and the policy function and converting the current action into an item recommendation list.
It will be clear to those skilled in the art that for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method embodiments for the specific working procedure of the above-described system, which is not further described herein.
Compared with the prior art, in the interactive item recommendation apparatus provided by the embodiments of the application, the recommendation process is completed by the GMF network, the critic network, the actor network, the simulated user environment and the prioritized experience replay pool. Cross features between users and items are trained through the GMF network, enhancing the recommendation agent's utilization of similarity information between users and items; the critic network and the actor network are continuously updated through the alpha function containing the self-updating temperature coefficient α, reducing the influence of bias terms on the recommendation result.
Because the GMF network learns users whose preferences change over time, updated user embedded features are obtained to adapt the GMF network to these changes; when the critic and actor networks are updated, the current reward obtained from real-time interaction with the user serves as the update basis, yielding new critic and actor networks that adapt to the user's changes.
An embodiment of the application provides a computer device. The computer device of this embodiment includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the method embodiments described above when the computer program is executed.
The computer device can be a smart phone, a tablet computer, a desktop computer, a cloud server and other computing devices. The computer device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the figures are merely examples of computer devices and are not limiting of computer devices, and may include more or fewer components than shown, or may combine certain components, or different components, such as may also include input and output devices, network access devices, etc.
The processor may be a central processing unit (Central Processing Unit, CPU), it may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may in some embodiments be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The memory may in other embodiments also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device. Further, the memory may also include both internal storage units and external storage devices of the computer device. The memory is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs, etc., such as program code for the computer program, etc. The memory may also be used to temporarily store data that has been output or is to be output.
In addition, the embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the steps in any of the above-mentioned method embodiments.
An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements an interactive item recommendation method as described above.
Embodiments of the present application provide a computer program product which, when run on a computer device, causes the computer device to perform the steps of the method embodiments described above.
In several embodiments provided by the present application, it will be understood that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the application, such changes and modifications are also intended to be within the scope of the application.

Claims (10)

1. An interactive item recommendation method, comprising:
acquiring user embedded features and item embedded features from randomly extracted sample batches;
passing the user embedded features and the item embedded features through a trained GMF network to obtain a collaborative feature vector formed by concatenating the user embedded feature vector and the item embedded feature vector;
determining a policy function in an actor network according to an action value function obtained through a critic network and an entropy term regulated by a self-updating temperature coefficient;
obtaining a current state according to the collaborative feature vector;
and obtaining a current action according to the current state and the policy function, and converting the current action into an item recommendation list.
2. The interactive item recommendation method of claim 1, wherein the training process of the GMF network is:
acquiring real user preference from the history record;
obtaining a plurality of user embedded features and a plurality of item embedded features through sampling;
in the GMF layer, performing feature crossing between each user embedded feature and the corresponding item embedded feature to obtain a plurality of cross features;
fitting the plurality of cross features by randomly selecting and deactivating a portion of neurons in a dropout layer;
and in the prediction layer, each time a user predicted preference is obtained from one cross feature, updating the prediction function according to the difference between the user predicted preference and the user's real preference.
3. The interactive item recommendation method according to claim 2, wherein said obtaining a user predicted preference from one cross feature specifically comprises:
constructing a prediction function by adopting a full connection layer mode;
and obtaining user prediction preference according to the prediction function.
4. The interactive item recommendation method according to claim 2, wherein said updating the prediction function according to the difference between said user predicted preference and said user's real preference specifically comprises:
updating the gradient of the loss function by comparing the difference between the user predicted preference and the user's real preference;
and updating the parameters of the prediction function and the dropout layer according to the loss function.
5. The interactive item recommendation method of claim 1, wherein said obtaining a current action based on said current state and said policy function comprises:
determining the set of selectable items in the current state according to the current state;
determining the value of the mask vector according to the recommendation record of each item;
and obtaining the current action according to the set of selectable items in the current state, the mask vector and the policy function.
6. The interactive item recommendation method according to claim 1, wherein said converting the current action into an item recommendation list specifically comprises:
acquiring the recommendation probability corresponding to each item;
arranging all items in descending order according to the recommendation probability;
and forming the item recommendation list from a preset number of top-ranked items.
7. The interactive item recommendation method of claim 1, wherein after obtaining a current action and converting the current action into an item recommendation list, further comprising:
obtaining a current reward through real-time interaction with a user environment;
obtaining a historical reward through historical interactions with the user's environment;
obtaining a target state according to the historical reward and the current reward, and storing the current state, the current action, the current reward and the target state as a quadruple in a prioritized experience replay pool;
sampling quadruples from the prioritized experience replay pool, inputting the sampling result into the critic network, and obtaining an action value term from the action value function in the critic network;
and updating the policy function according to the action value term and an entropy update term obtained through the alpha function.
8. The interactive item recommendation method of claim 7, wherein after said updating said policy function, further comprising:
obtaining a policy function evaluation term according to the action value function and the policy function;
updating the action value function and the alpha function according to the policy function evaluation term and the real action value term; the real action value term is obtained from the current reward and a target state value; the target state value comprises a target action value term discounted by an attenuation factor and an entropy term obtained through the alpha function; and the target action value term is obtained by evaluating the target state with the policy function.
9. An interactive item recommendation device, comprising:
the random acquisition module is used for acquiring user embedded features and item embedded features from randomly extracted sample batches;
the vector acquisition module is used for passing the user embedded features and the item embedded features through a trained GMF network to obtain a collaborative feature vector formed by concatenating the user embedded feature vector and the item embedded feature vector;
the function determination module is used for determining a policy function in the actor network according to the action value function obtained through the critic network and the entropy term regulated by the self-updating temperature coefficient;
the state acquisition module is used for obtaining the current state according to the collaborative feature vector;
and the item recommendation module is used for obtaining the current action according to the current state and the policy function and converting the current action into an item recommendation list.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the interactive item recommendation method according to any one of claims 1 to 8.
CN202310966515.6A 2023-08-02 2023-08-02 Interactive project recommendation method and device and computer readable storage medium Pending CN117056595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310966515.6A CN117056595A (en) 2023-08-02 2023-08-02 Interactive project recommendation method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310966515.6A CN117056595A (en) 2023-08-02 2023-08-02 Interactive project recommendation method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117056595A true CN117056595A (en) 2023-11-14

Family

ID=88668447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310966515.6A Pending CN117056595A (en) 2023-08-02 2023-08-02 Interactive project recommendation method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117056595A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779415A (en) * 2021-10-22 2021-12-10 平安科技(深圳)有限公司 Training method, device and equipment of news recommendation model and storage medium


Similar Documents

Publication Publication Date Title
US20210256403A1 (en) Recommendation method and apparatus
Chen et al. Deep reinforcement learning in recommender systems: A survey and new perspectives
CN111966914B (en) Content recommendation method and device based on artificial intelligence and computer equipment
US20230153857A1 (en) Recommendation model training method, recommendation method, apparatus, and computer-readable medium
CN111090756B (en) Artificial intelligence-based multi-target recommendation model training method and device
CN113254792B (en) Method for training recommendation probability prediction model, recommendation probability prediction method and device
CN110717099B (en) Method and terminal for recommending film
US20230162005A1 (en) Neural network distillation method and apparatus
WO2022166115A1 (en) Recommendation system with adaptive thresholds for neighborhood selection
CN109903103B (en) Method and device for recommending articles
CN116664719B (en) Image redrawing model training method, image redrawing method and device
CN116010684A (en) Article recommendation method, device and storage medium
CN111931054B (en) Sequence recommendation method and system based on improved residual error structure
CN117056595A (en) Interactive project recommendation method and device and computer readable storage medium
WO2022166125A1 (en) Recommendation system with adaptive weighted baysian personalized ranking loss
US20240037133A1 (en) Method and apparatus for recommending cold start object, computer device, and storage medium
WO2024051707A1 (en) Recommendation model training method and apparatus, and resource recommendation method and apparatus
CN110489435B (en) Data processing method and device based on artificial intelligence and electronic equipment
CN113449176A (en) Recommendation method and device based on knowledge graph
CN115905872A (en) Model training method, information recommendation method, device, equipment and medium
CN114493674A (en) Advertisement click rate prediction model and method
CN111897943A (en) Session record searching method and device, electronic equipment and storage medium
KR102612986B1 (en) Online recomending system, method and apparatus for updating recommender based on meta-leaining
CN117786234B (en) Multimode resource recommendation method based on two-stage comparison learning
CN111931058B (en) Sequence recommendation method and system based on self-adaptive network depth

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination