CN114386507A - Training method of content recommendation model, content recommendation method and device

Info

Publication number
CN114386507A
Authority
CN
China
Prior art keywords
content
sample
target
recommendation
information
Prior art date
Legal status
Pending
Application number
CN202210032538.5A
Other languages
Chinese (zh)
Inventor
高崇铭
雷文强
何向南
李师军
李彪
张元�
江鹏
Current Assignee
Beijing Zhongke Research Institute
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Zhongke Research Institute
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhongke Research Institute, Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Zhongke Research Institute
Priority to CN202210032538.5A
Publication of CN114386507A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks


Abstract

The disclosure relates to a training method of a content recommendation model, a content recommendation method, and a device. The method includes the following steps: acquiring first sample data, where the first sample data includes interaction information between sample objects and sample contents; acquiring, based on the first sample data, target content recommended to a target object by a content recommendation model to be trained; simulating, through a trained prediction model, feedback information of the target object on the target content, where the prediction model predicts initial feedback information of the target object on the target content and adjusts the predicted initial feedback information based on recommendation frequency information corresponding to the target content to obtain the feedback information of the target object on the target content; and adjusting the content recommendation model to be trained based on the feedback information to obtain a target content recommendation model. The method can mitigate the information cocoon (filter bubble) problem while ensuring the accuracy of the recommendation results.

Description

Training method of content recommendation model, content recommendation method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for training a content recommendation model, an electronic device, a storage medium, and a program product.
Background
The information cocoon refers to the phenomenon that the range of information people attend to is steered by their own preferences, so that, like silkworms, they gradually seal themselves inside a cocoon. In the field of recommendation systems, the information cocoon problem means that a recommendation model initially recommends based on a user's various preferences, but as the model's parameters are further updated and its strategy iterates, the recommendation results become gradually dominated by the user's mainstream preference while the user's other preferences are ignored.
For the information cocoon problem in recommendation systems, current solutions include actively helping users improve their awareness of diverse viewpoints, or improving the model side to increase the diversity and fairness of recommendation results. However, these are static recommendation strategies, and they all pursue the goal of avoiding the information cocoon at the expense of recommendation accuracy. Current methods for breaking the information cocoon therefore struggle to balance the accuracy and diversity of recommendation results.
Disclosure of Invention
The present disclosure provides a training method of a content recommendation model, a content recommendation method, an apparatus, an electronic device, a storage medium, and a program product, to at least solve the problem in the related art that methods for breaking the information cocoon struggle to balance the accuracy and diversity of recommendation results.
The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a method for training a content recommendation model, including:
acquiring first sample data; the first sample data comprises interaction information between a sample object and sample content;
acquiring target content recommended to a target object by a content recommendation model to be trained based on the first sample data; the target object is any one of the sample objects, and the target content belongs to the sample content in the first sample data;
simulating, through a trained prediction model, feedback information of the target object on the target content; the prediction model is used for predicting initial feedback information of the target object on the target content, and adjusting the predicted initial feedback information based on recommendation frequency information corresponding to the target content to obtain the feedback information of the target object on the target content; the recommendation frequency information represents the number of times associated content has been recommended to the target object, where the associated content is content matched with the target content in at least one attribute;
and adjusting the content recommendation model to be trained based on the feedback information to obtain a target content recommendation model.
In an exemplary embodiment, the prediction model is obtained by training as follows:
acquiring second sample data; the second sample data comprises interaction information between the sample object and the sample content;
obtaining sample recommendation frequency information corresponding to each sample content of the same sample object based on a plurality of sample contents corresponding to the same sample object in the second sample data; the sample recommendation frequency information characterizes the number of associated sample contents whose timestamps precede that of the respective sample content, the timestamp characterizing the time at which the respective sample content was sampled to train the prediction model;
and training the prediction model to be trained based on the sample recommendation frequency information and the interaction information between the sample object and the sample content in the second sample data to obtain the trained prediction model.
In an exemplary embodiment, the obtaining, based on a plurality of sample contents corresponding to the same sample object in the second sample data, sample recommendation frequency information corresponding to each sample content of the same sample object includes:
determining associated sample content with a timestamp before target sample content based on difference information between a plurality of sample contents corresponding to the same sample object; the target sample content belongs to the plurality of sample contents corresponding to the same sample object;
acquiring a time difference between the time stamps corresponding to the target sample content and the associated sample content;
and determining sample recommendation frequency information corresponding to the target sample content based on the difference information and the time difference.
In an exemplary embodiment, the training a prediction model to be trained based on the sample recommendation frequency information and the interaction information between the sample object and the sample content in the second sample data to obtain the trained prediction model includes:
sampling positive sample content and negative sample content corresponding to each sample object from the second sample data based on the interaction information in the second sample data; the positive sample content represents content to which the sample object gave positive feedback, and the negative sample content represents content to which the sample object gave negative feedback;
for each sample object, obtaining first feedback information of the sample object to the positive sample content and second feedback information of the sample object to the negative sample content according to the sample recommendation frequency information of the positive sample content and the sample recommendation frequency information of the negative sample content respectively;
and adjusting model parameters of the prediction model to be trained based on the loss value between the first feedback information and the second feedback information until preset training times are reached or the loss value is converged, so as to obtain the trained prediction model.
In an exemplary embodiment, the adjusting the predicted initial feedback information based on the recommendation frequency information corresponding to the target content to obtain the feedback information of the target object on the target content includes:
determining adjustment amplitude information for the initial feedback information based on the recommendation frequency information; the adjustment amplitude information is positively correlated with the recommendation frequency information;
and adjusting the initial feedback information according to the adjustment amplitude information to obtain feedback information that varies inversely with the recommendation frequency information, as the feedback information of the target object on the target content.
In an exemplary embodiment, the content recommendation model to be trained determines the target content recommended to the target object by:
acquiring state information associated with the target object; the state information is obtained based on object feature information of the target object, historical recommended content recommended for the target object, weight information of the historical recommended content, and feedback information of the target object for the historical recommended content;
and determining the target content recommended to the target object according to the content recommendation model to be trained and the state information.
In an exemplary embodiment, the adjusting the content recommendation model to be trained based on the feedback information to obtain a target content recommendation model includes:
based on the feedback information, adjusting the model parameters of the content recommendation model to be trained to obtain a new content recommendation model;
updating the state information associated with the target object according to the target content and the feedback information corresponding to the target content;
and determining new target content recommended to the target object again through the new content recommendation model and the updated state information associated with the target object, and acquiring feedback information aiming at the new target content through the trained prediction model until the content recommendation accuracy rate aiming at the target object reaches a threshold value to obtain a target content recommendation model.
In an exemplary embodiment, after obtaining the target content recommendation model, the method further includes:
acquiring performance parameters for recommending the content of the sample object by the target content recommending model according to the first sample data; the performance parameters are obtained under the condition that recommendation frequency information of the target content recommendation model in a preset time period reaches a threshold value;
and evaluating the performance of the target content recommendation model based on the performance parameters.
According to a second aspect of the embodiments of the present disclosure, there is provided a content recommendation method including:
acquiring a plurality of contents to be recommended and objects to be recommended;
selecting target recommended content from the plurality of contents to be recommended through a target content recommendation model, and pushing the target recommended content to the object to be recommended; the target content recommendation model is obtained by training through the content recommendation model training method.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a content recommendation model, including:
a first acquisition unit configured to perform acquisition of first sample data; the first sample data comprises interaction information between a sample object and sample content;
the recommending unit is configured to acquire target content recommended to a target object by a content recommending model to be trained on the basis of the first sample data; the target object is any one of the sample objects, and the target content belongs to the sample content in the first sample data;
a simulation unit configured to simulate, through a trained prediction model, feedback information of the target object on the target content; the prediction model is used for predicting initial feedback information of the target object on the target content, and adjusting the predicted initial feedback information based on recommendation frequency information corresponding to the target content to obtain the feedback information of the target object on the target content; the recommendation frequency information represents the number of times associated content has been recommended to the target object, where the associated content is content matched with the target content in at least one attribute;
and the adjusting unit is configured to adjust the content recommendation model to be trained based on the feedback information to obtain a target content recommendation model.
In an exemplary embodiment, the apparatus further comprises:
a second acquisition unit configured to perform acquisition of second sample data; the second sample data comprises interaction information between the sample object and the sample content;
a frequency determining unit configured to obtain sample recommendation frequency information corresponding to each sample content of the same sample object based on a plurality of sample contents corresponding to the same sample object in the second sample data; the sample recommendation frequency information characterizes the number of associated sample contents whose timestamps precede that of the respective sample content, the timestamp characterizing the time at which the respective sample content was sampled to train the prediction model;
and the training unit is configured to execute training of the prediction model to be trained based on the sample recommendation frequency information and the interaction information between the sample object and the sample content in the second sample data to obtain the trained prediction model.
In an exemplary embodiment, the frequency determining unit is further configured to perform determining, based on difference information between a plurality of sample contents corresponding to the same sample object, an associated sample content having a timestamp before a target sample content; the target sample content belongs to the plurality of sample contents corresponding to the same sample object; acquiring a time difference between the time stamps corresponding to the target sample content and the associated sample content; and determining sample recommendation frequency information corresponding to the target sample content based on the difference information and the time difference.
In an exemplary embodiment, the training unit is configured to sample, based on the interaction information in the second sample data, positive sample content and negative sample content corresponding to each sample object from the second sample data, where the positive sample content represents content to which the sample object gave positive feedback and the negative sample content represents content to which the sample object gave negative feedback; to obtain, for each sample object, first feedback information of the sample object on the positive sample content and second feedback information of the sample object on the negative sample content according to the sample recommendation frequency information of the positive sample content and of the negative sample content, respectively; and to adjust the model parameters of the prediction model to be trained based on the loss value between the first feedback information and the second feedback information until a preset number of training iterations is reached or the loss value converges, so as to obtain the trained prediction model.
In an exemplary embodiment, the training unit is further configured to determine adjustment amplitude information for the initial feedback information based on the recommendation frequency information, the adjustment amplitude information being positively correlated with the recommendation frequency information; and to adjust the initial feedback information according to the adjustment amplitude information to obtain feedback information that varies inversely with the recommendation frequency information, as the feedback information of the target object on the target content.
In an exemplary embodiment, the recommending unit is further configured to perform obtaining the state information associated with the target object; the state information is obtained based on object feature information of the target object, historical recommended content recommended for the target object, weight information of the historical recommended content, and feedback information of the target object for the historical recommended content; and determining the target content recommended to the target object according to the content recommendation model to be trained and the state information.
In an exemplary embodiment, the adjusting unit is further configured to perform, based on the feedback information, adjusting model parameters of the content recommendation model to be trained to obtain a new content recommendation model; updating the state information associated with the target object according to the target content and the feedback information corresponding to the target content; and determining new target content recommended to the target object again through the new content recommendation model and the updated state information associated with the target object, and acquiring feedback information aiming at the new target content through the trained prediction model until the content recommendation accuracy rate aiming at the target object reaches a threshold value to obtain a target content recommendation model.
In an exemplary embodiment, the apparatus further includes an evaluation unit configured to perform obtaining, according to the first sample data, a performance parameter of the target content recommendation model for content recommendation on the sample object; the performance parameters are obtained under the condition that recommendation frequency information of the target content recommendation model in a preset time period reaches a threshold value; and evaluating the performance of the target content recommendation model based on the performance parameters.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a content recommendation apparatus including:
an acquisition unit configured to perform acquisition of a plurality of contents to be recommended and objects to be recommended;
the recommending unit is configured to execute the steps of selecting target recommended content from the plurality of contents to be recommended through a target content recommending model and pushing the target recommended content to the object to be recommended; the target content recommendation model is obtained by training through the content recommendation model training method.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of the above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of the above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method as defined in any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
after the first sample data is acquired, the target content recommended to the target object is determined through the content recommendation model to be trained, the feedback information of the target object on the target content is then simulated through the trained prediction model, and the content recommendation model to be trained is finally adjusted based on the feedback information to obtain the target content recommendation model. Because the feedback information simulated by the prediction model takes the recommendation frequency information into account, and the recommendation frequency information can represent the degree to which the target content has fallen into the information cocoon, adjusting the content recommendation model to be trained based on this feedback information enables the resulting target content recommendation model to learn the ability to jump out of the information cocoon, thereby solving the information cocoon problem while ensuring the accuracy of the recommendation results.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1(a) is a schematic diagram illustrating a reinforcement learning based recommendation framework in accordance with an exemplary embodiment.
Fig. 1(b) is a schematic diagram illustrating the information cocoon problem according to an exemplary embodiment.
Fig. 1(c) is a schematic flow chart illustrating a training method of a content recommendation model according to an exemplary embodiment.
FIG. 2 is a flowchart illustrating a method of training a content recommendation model according to an example embodiment.
Fig. 3 is a diagram illustrating a relationship between first sample data and second sample data according to an example embodiment.
Fig. 4 is a flowchart illustrating a content recommendation method according to an exemplary embodiment.
FIG. 5 is a flowchart illustrating a method of training a content recommendation model according to another example embodiment.
FIG. 6 is a diagram illustrating an overall model framework of a causal inference model based on counterfactual reasoning combined with an offline reinforcement learning framework, according to an exemplary embodiment.
FIG. 7(a) is a causal diagram illustrating a common usage of a recommendation system model according to an exemplary embodiment.
FIG. 7(b) is a causal diagram illustrating combining repeated recommendation effects according to an exemplary embodiment.
FIG. 7(c) is a causal graph employed by an interactive recommendation strategy shown in accordance with an exemplary embodiment.
Fig. 8 is a block diagram illustrating a structure of a training apparatus for a content recommendation model according to an exemplary embodiment.
Fig. 9 is a block diagram illustrating a configuration of a content recommendation apparatus according to an exemplary embodiment.
FIG. 10 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are both information and data that are authorized by the user or sufficiently authorized by various parties.
Interactive recommendation refers to a paradigm in which the system obtains a user's real-time feedback information (such as likes, shares, and watch duration) during its interaction with the user, and fully exploits this feedback to learn an online algorithm policy (online policy) that dynamically adjusts the recommendation scheme, replacing static recommendation algorithms under manually specified rules; interactive recommendation algorithms are the mainstream in content recommendation application scenarios. For example, reinforcement learning is a branch of technology supporting such interactive recommendation; its workflow is shown in fig. 1(a), and its technical principle is that an agent learns how to make decisions automatically under different conditions through interaction with the user, so as to pursue the maximal long-term benefit. However, reinforcement learning is often difficult to deploy online. The root cause is that training requires a large amount of interaction data, and most application scenarios cannot accept a model training process that requires real users to participate in interactions. For example, the platform side cannot directly obtain a user's preference for all of its billions of contents.
In current content recommendation systems, the recommendation results are usually derived from the integrated outputs of multiple static recommendation models. The principle of each static recommendation model is to train on the content the user has already interacted with in order to estimate the user's preference for content not yet interacted with. This can be regarded as a special case of an interactive recommendation system in which the policy module is static and monotonous and performs a uniform online parameter update after a specified time period.
However, almost all currently deployed strategies face the problems of "the more you push, the narrower it gets" and information cocoons (filter bubbles). As shown in fig. 1(b), a user has many preferences, and the recommendation system can initially grasp more than one of them for recommendation, such as sports and gourmet videos. But as the parameters of the recommendation system are further updated and its strategy iterates, the recommendation results slowly become dominated by the user's main preference, while the user's other preferences are ignored. Such increasingly monotonous recommendation results can tire the user, breeding distrust of and boredom with the recommendation system. This phenomenon is widespread in current recommendation systems, so current recommendation strategies are a compromise rather than optimal.
To solve the above problems, a content recommendation model training method is provided for addressing the information cocoon problem in current recommendation strategies. Referring to fig. 1(c), the method introduces a reinforcement learning policy model into the production environment and uses a causal inference technique based on a counterfactual model to explicitly model the repeated recommendation effect within user preferences, so that the problem of "the more you push, the narrower it gets" is effectively avoided while the decision process iterates and updates automatically. The basic idea is to learn, from historical interaction data, a prediction model capable of estimating user preferences based on a counterfactual model in causal reasoning; to then let the prediction model generate feedback information for planning and training a content recommendation model whose recommendation policy (RL policy) is based on reinforcement learning; and to finally bring the learned content recommendation model online, improving the adaptive capability of the current content recommendation system. To effectively evaluate the influence of the information cocoon, the method is verified in an interactive recommendation system; it is also applicable to common static recommendation strategies.
Referring to fig. 2, a flowchart of a method for training a content recommendation model according to an exemplary embodiment is shown. This embodiment is illustrated by applying the method to a terminal, but it should be understood that the method may also be applied to a server, or to a system including a terminal and a server and implemented through interaction between the two. The terminal may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet of Things device, or portable wearable device; the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, or the like, and the portable wearable device may be a smart watch, smart bracelet, head-mounted device, or the like. The server may be implemented as a stand-alone server or as a server cluster consisting of multiple servers. In this embodiment, the method includes the following steps:
in step S210, first sample data is acquired; the first sample data includes interaction information between the sample object and the sample content.
The interaction information may be understood as information capable of representing the sample object's preference for the sample content; for example, the interaction information may be watch duration, click-through rate, whether a comment was viewed or posted, whether the content was liked, whether it was added to favorites, and the like.
Wherein the first sample data may include interaction information between each sample object and each sample content.
In specific implementation, a plurality of sample objects and a plurality of sample contents can be determined, and interaction information of all sample objects to all sample contents is obtained and used as first sample data, so that accuracy of subsequent testing and evaluation of the content recommendation model is improved.
In step S220, based on the first sample data, obtaining target content recommended to the target object by the content recommendation model to be trained; the target object is any one of the sample objects, and the target content belongs to the sample content in the first sample data.
The content recommendation model may be a model obtained based on a reinforcement learning interactive recommendation policy (RL Agent), for example, a model based on the PPO (Proximal Policy Optimization) algorithm, a model based on the DDPG (Deep Deterministic Policy Gradient) algorithm, and the like.
Reinforcement learning is a field of machine learning that emphasizes how to act based on the environment so as to maximize the expected benefit. Reinforcement learning requires neither labeled input-output pairs nor accurate correction of non-optimal solutions. Its focus is on finding a balance between exploration (of unknown territory) and exploitation (of current knowledge), i.e., the exploration-exploitation trade-off in learning.
In a specific implementation, the content recommendation model to be trained may learn the preference information of the target object step by step in the training process, and select the target content recommended to the target object from the sample contents included in the first sample data based on the learned preference information of the target object. More specifically, in the training process, historical recommended content recommended to the target object, feedback information of the target object to the historical recommended content, and the like may also be recorded, and the target content recommended to the target object may be determined by the content recommendation model on the basis of the historical recommended content and the feedback information of the target object to the historical recommended content.
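As a non-limiting illustration of how such state information might be assembled from the object features, the historical recommended contents, their weights, and the feedback they received, consider the following sketch; the function name, the pooling scheme, and the dimensions are assumptions for the example only and are not prescribed by the disclosure:

```python
import numpy as np

def build_state(object_features, history_contents, history_weights, history_feedback):
    """Assemble a state vector from the object's features and its
    recommendation history (content embeddings, weights, and feedback)."""
    pooled = np.zeros_like(history_contents[0])
    for emb, weight, feedback in zip(history_contents, history_weights, history_feedback):
        # Each historical content contributes its embedding, scaled by its
        # weight and by the feedback the target object gave it.
        pooled += weight * feedback * emb
    return np.concatenate([object_features, pooled])

# Toy usage: an 8-dim object profile and two 4-dim historical contents.
state = build_state(
    object_features=np.ones(8),
    history_contents=[np.array([1., 0., 0., 1.]), np.array([0., 1., 1., 0.])],
    history_weights=[1.0, 0.8],
    history_feedback=[1.0, -1.0],
)
print(state.shape)  # (12,)
```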
In step S230, simulating feedback information of the target object to the target content through the trained prediction model; the prediction model is used for predicting initial feedback information of the target object on the target content and adjusting the predicted initial feedback information based on the recommendation frequency information corresponding to the target content to obtain feedback information of the target object on the target content; the recommendation frequency information represents the number of times of recommending associated content for the target object, the associated content being content matched with the target content in at least one attribute.
The prediction model can be a causal inference model based on counterfactual reasoning.
The feedback information may take the form of a score of the target object's preference degree for the target content, or may be a "preferred"/"not preferred" feedback result; this may be determined according to actual needs, and the present disclosure does not specifically limit it.
The recommendation frequency information can be understood as the number of recommendations per unit time; it reflects the degree to which the recommended target content has sunk into the information cocoon.
The attribute may be the creator or publisher of the content, the category of the content, the tag of the content, etc.
In a specific implementation, the prediction model may include a prediction unit and an adjustment unit. The prediction unit predicts the initial feedback information of the target object on the target content, and the adjustment unit adjusts the initial feedback information obtained by the prediction unit based on the recommendation frequency information of the target content. Since the recommendation frequency information reflects the degree to which the target content has sunk into the information cocoon, after the initial feedback information is adjusted by the recommendation frequency information, the prediction model can counteract the information cocoon problem and simulate realistic feedback information of the target object on the target content.
In step S240, the content recommendation model to be trained is adjusted based on the feedback information, so as to obtain a target content recommendation model.
In this step, the training of the content recommendation model proceeds through interaction with the prediction model. In this interaction process, the prediction model serves as a user simulator and provides the content recommendation model with a timely and accurate reward signal as feedback information. This feedback information penalizes the repeated recommendation behavior of the content recommendation model when the same or similar content is recommended too many times within a short period, so that the content recommendation model can learn the ability to automatically avoid repeated recommendation and jump out of the information cocoon.
In a specific implementation, after the feedback information of the target object on the target content is obtained through simulation by the prediction model, the model parameters of the content recommendation model to be trained can be adjusted based on that feedback information to obtain a new content recommendation model. New target content is then recommended to the target object through the new content recommendation model, and feedback information for the new target content is acquired through the prediction model, until the content recommendation accuracy for the target object reaches a threshold, yielding a first content recommendation model interactively trained for that target object. For the next target object, the first content recommendation model is trained through interaction with the prediction model to obtain a second content recommendation model, and so on, until every sample object has participated in training the content recommendation model to be trained, yielding the target content recommendation model.
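A minimal sketch of this interactive training loop follows; the agent and simulator interfaces (init_state, recommend, update, update_state, accuracy, feedback) are assumed names for illustration and are not prescribed by the disclosure:

```python
def train_recommender(agent, simulator, sample_objects,
                      accuracy_threshold=0.9, max_rounds=1000):
    """Alternate recommendation and simulated feedback until the
    per-object recommendation accuracy reaches the threshold."""
    for obj in sample_objects:
        state = agent.init_state(obj)
        for _ in range(max_rounds):
            content = agent.recommend(state)            # pick target content
            reward = simulator.feedback(obj, content)   # frequency-adjusted feedback
            agent.update(state, content, reward)        # e.g., a PPO/DDPG update step
            state = agent.update_state(state, content, reward)
            if agent.accuracy(obj) >= accuracy_threshold:
                break
    return agent
```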
In the above method for training a content recommendation model, after the first sample data is acquired, the target content recommended to the target object is determined through the content recommendation model to be trained, the feedback information of the target object on the target content is then simulated through the trained prediction model, and the content recommendation model to be trained is finally adjusted based on the feedback information to obtain the target content recommendation model. Because the feedback information of the target object on the target content simulated by the prediction model takes the recommendation frequency information into account, and the recommendation frequency information can represent the degree to which the target content has fallen into the information cocoon, adjusting the content recommendation model to be trained based on this feedback information enables the resulting target content recommendation model to learn the ability to jump out of the information cocoon, thereby solving the information cocoon problem while ensuring the accuracy of the recommendation results.
In an exemplary embodiment, the predictive model is trained by:
step S310, acquiring second sample data; the second sample data comprises interaction information between the sample object and the sample content;
step S320, obtaining sample recommendation frequency information corresponding to each sample content of the same sample object based on a plurality of sample contents corresponding to the same sample object in the second sample data; the sample recommendation frequency information characterizes the number of associated sample contents whose timestamps precede that of the respective sample content, and the timestamp characterizes the time at which the respective sample content was sampled to train the prediction model;
and step S330, training the prediction model to be trained based on the sample recommendation frequency information and the interaction information between the sample object and the sample content in the second sample data to obtain the trained prediction model.
And the second sample data is data used for training the prediction model.
In the present disclosure, since the first sample data is used for subsequent testing and evaluation of the content recommendation model, every sample object in the first sample data has at least one interaction record with every sample content; that is, the first sample data is full-exposure data. The second sample data is used for training the prediction model and may therefore include interaction information of each sample object for only part of the sample contents; that is, the second sample data need not be full-exposure data. For example, fig. 3 is a schematic diagram of the relationship between the first sample data and the second sample data: the horizontal axis represents sample contents, the vertical axis represents sample objects, and each cell represents the interaction information of one sample object with one sample content, where a dark cell indicates that interaction data exists and a blank cell indicates that no interaction data was collected. Correspondingly, region 31 may represent the first sample data, and region 32 may represent the second sample data. If the interaction information density denotes the ratio of interaction information carrying interaction values to all sample data in the first or second sample data, then the interaction information density of the first sample data is greater than or equal to that of the second sample data.
In a specific implementation, to enable the prediction model to simulate realistic feedback information of a target object on target content, the influence of monotonous, repetitive recommendation results on user experience needs to be considered when training the prediction model. Therefore, in this embodiment, for each sample object, the sample contents having interaction information with that object are acquired from the second sample data as the plurality of sample contents corresponding to that object, and the recommendation frequency information of each of these sample contents is calculated to reflect the degree to which each sample content has sunk into the information cocoon. The initial feedback information of the sample object on each sample content is then adjusted based on the recommendation frequency information to obtain realistic feedback information of the sample object on each sample content, and the prediction model to be trained is iteratively trained based on this feedback information to obtain the trained prediction model.
In this embodiment, the influence of monotonous, repetitive recommendation results on user experience is considered when training the prediction model: recommendation frequency information is introduced to determine the feedback information of each sample object on the sample contents. A prediction model trained with this recommendation frequency information can correctly characterize the repeated recommendation effect in the recommendation results, and detect and correct the information cocoon at its germination stage.
In an exemplary embodiment, the step S320 may be specifically implemented by the following steps:
step S320a, determining associated sample content with a timestamp before the target sample content based on the difference information between the plurality of sample contents corresponding to the same sample object; the target sample content belongs to a plurality of sample contents corresponding to the same sample object;
step S320b, obtaining a time difference between the time stamps corresponding to the target sample content and the associated sample content;
in step S320c, based on the difference information and the time difference, sample recommendation frequency information corresponding to the target sample content is determined.
The difference information may be determined according to a preset attribute, for example, the difference information may be a characteristic distance between the characterization vectors of the two sample contents, may also be difference information between creators or publishers of the two sample contents, and may also be difference information between categories or tags of the two sample contents.
Wherein the associated sample content represents content that matches the target sample content in at least one attribute.
Wherein the time stamp characterizes a time at which the respective sample content was sampled to train the predictive model.
In a specific implementation, taking one target sample content among the plurality of sample contents corresponding to the same sample object as an example, its sample recommendation frequency information is determined as follows: determine, from the plurality of sample contents, the sample contents whose timestamps precede that of the target sample content, and mark them as candidate sample contents; acquire the difference information between the target sample content and each candidate sample content; and, based on the difference information, determine from the candidate sample contents the contents matched with the target sample content in at least one attribute as the associated sample contents of the target sample content. The time difference between the timestamp of the target sample content and the timestamp of each associated sample content is then acquired, and the sample recommendation frequency information corresponding to the target sample content is determined based on the time differences and the difference information between the target sample content and the associated sample contents.
For example, taking the feature distance as the difference information: for the plurality of sample contents corresponding to the same sample object, the sample recommendation frequency information corresponding to any target sample content may be determined by acquiring the feature distance between the target sample content and each of the other sample contents (i.e., the sample contents other than the target sample content), determining from the other sample contents, based on the feature distances, the sample contents whose timestamps precede that of the target sample content and that match the target sample content in at least one attribute as the associated sample contents of the target sample content, and then determining the sample recommendation frequency information corresponding to the target sample content based on the feature distances and time differences between the target sample content and the associated sample contents.
Further, to improve the accuracy of the obtained sample recommendation frequency information, in an exemplary embodiment, step S320c may further include: acquiring the sensitivity of the sample object to the repeated recommendation effect (which can be understood as the same or similar content being recommended too many times within a short period) and the classic degree information or tolerance degree information of the target sample content; and determining the sample recommendation frequency information corresponding to the target sample content based on this sensitivity, the classic degree information or tolerance degree information of the target sample content, the difference information between the target sample content and the associated sample contents, and the time difference between the target sample content and the associated sample contents.
More specifically, the recommendation frequency information (denoted as e_t) can be expressed by the following relation:

e_t(u, i) = α_u · β_i · ∑_{(u, i_l, t_l) ∈ H_t} exp( −dist(i, i_l) · (t − t_l) / τ )

where u denotes the sample object, i denotes the target sample content, α_u represents the sensitivity of the sample object u to the repeated recommendation effect, β_i represents the classic degree or tolerance degree of the target sample content i, i_l represents an associated sample content whose timestamp precedes that of the target sample content i, t_l represents the timestamp of the associated sample content i_l, and t denotes the timestamp of the target sample content i. H_t is a set recording all sample contents recommended so far to the sample object u by the prediction model during the current round of interactive recommendation; any triple (u, i_l, t_l) in the set represents that the l-th content i_l pushed to the sample object u carries the timestamp t_l. τ is a temperature coefficient, a hyper-parameter that requires manual adjustment. dist(i, i_l) represents the distance between the representation vectors of the target sample content i and the associated sample content i_l. The intuitive meaning of the definition of e_t(u, i) is: if, during the current round of interaction, the system has recently (the time interval t − t_l is small) recommended other content close to the target sample content (the distance dist(i, i_l) is small), the recommendation frequency information e_t(u, i) corresponding to the target sample content will be larger.
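The following sketch computes this recommendation frequency signal e_t(u, i) as given in the relation above; scalar "contents" and an absolute-difference distance stand in for representation vectors and their distance function, purely for illustration:

```python
import math

def exposure_effect(alpha_u, beta_i, i, history, t, dist, tau=1.0):
    """e_t(u, i): grows when content similar to i (small dist) was
    recommended to the object recently (small t - t_l)."""
    return alpha_u * beta_i * sum(
        math.exp(-dist(i, i_l) * (t - t_l) / tau) for i_l, t_l in history
    )

# Toy check with scalar "contents" and absolute-difference distance:
d = lambda a, b: abs(a - b)
recent_similar = [(0.1, 9.5)]   # similar item shown just before t = 10
old_different  = [(5.0, 1.0)]   # dissimilar item shown long ago
print(exposure_effect(1.0, 1.0, 0.0, recent_similar, 10.0, d))  # large (~0.95)
print(exposure_effect(1.0, 1.0, 0.0, old_different, 10.0, d))   # near zero
```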
In the embodiment, the sample recommendation frequency information corresponding to the target sample content is determined according to the difference information and the time difference between the target sample content and the associated sample content, so that the obtained recommendation frequency information can represent the repeated recommendation degree of the target sample content, and the accurate depiction of the repeated recommendation effect in the recommendation result is realized.
In an exemplary embodiment, the step S330 may be specifically implemented by the following steps:
step S330a, sampling, based on the interaction information in the second sample data, positive sample content and negative sample content corresponding to each sample object from the second sample data; the positive sample content represents content to which the sample object gave positive feedback, and the negative sample content represents content to which the sample object gave negative feedback;
step S330b, aiming at each sample object, respectively obtaining first feedback information of the sample object to the positive sample content and second feedback information of the sample object to the negative sample content according to the sample recommendation frequency information of the positive sample content and the sample recommendation frequency information of the negative sample content;
step S330c, based on the loss value between the first feedback information and the second feedback information, adjusting the model parameter of the prediction model to be trained until reaching the preset training times or the loss value converges, so as to obtain the trained prediction model.
In a specific implementation, the positive sample content may be understood as content preferred by the sample object: sample content that has interaction data with the sample object in the second sample data may be determined as positive sample content, while sample content that has no interaction data with the sample object may be determined as content the sample object does not prefer, i.e., negative sample content. For example, in the region representing the second sample data in fig. 3 (region 32), the sample content corresponding to a dark cell may serve as positive sample content for the corresponding sample object, and the sample content corresponding to a blank cell may serve as negative sample content for that object. Therefore, a plurality of positive sample contents and a plurality of negative sample contents corresponding to each sample object can be sampled from the second sample data based on the interaction information, where the negative sample contents are sampled in pairs with the positive sample contents.
For each sample object, after the plurality of positive sample contents and negative sample contents corresponding to that object are obtained, the sample recommendation frequency information corresponding to each pair of positive and negative sample contents may be determined based on the methods described in steps S320a to S320c. The initial feedback information of the positive sample content obtained through the prediction model is then adjusted according to the sample recommendation frequency information corresponding to the positive sample content to obtain the first feedback information, and the initial feedback information of the negative sample content is adjusted according to the sample recommendation frequency information corresponding to the negative sample content to obtain the second feedback information. A loss value is obtained from the difference between the first feedback information and the second feedback information, and the model parameters of the prediction model to be trained are adjusted based on this loss value until a preset number of training iterations is reached or the loss value converges, yielding the trained prediction model.
More specifically, a classical MSE (mean-square error) loss function or a BPR (Bayesian Personalized Ranking) loss function may be used as the loss function guiding the training of the prediction model. The BPR loss function is given by the following relation:

$$L_{BPR} = -\sum_{(u,i,j)} \ln \sigma\big(\hat{y}_{ui} - \hat{y}_{uj}\big)$$

where $\sigma(x) = 1/(1+e^{-x})$ is an activation function mapping data from the range (−∞, +∞) to the range (0, 1), i denotes a positive sample content and j a negative sample content, $\hat{y}_{ui}$ denotes the first feedback information of the sample object u for the positive sample content i, and $\hat{y}_{uj}$ denotes the second feedback information of the sample object u for the sampled negative sample content j. The process of optimizing the BPR loss function is thus a process of learning, for all 7176 sample objects in the second sample data, from the known interaction samples among the 10729 interacted sample contents.
In this embodiment, the prediction model is trained according to the loss value between the first feedback information of the positive sample content and the second feedback information of the negative sample content, so as to widen the gap between their prediction scores: the score of the feedback information for positive sample content is pushed as high as possible, and that for negative sample content as low as possible. The prediction model can thereby learn the sample object's relative preference ordering over different sample contents, improving its prediction effect. This avoids a problem of the traditional approach, which trains only on the loss between a single sample content's actual interaction data and its predicted value, so that the predicted feedback information differs little across sample contents and the sample object's preferences are hard to distinguish accurately.
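As a hedged illustration of this training objective, the sketch below computes the BPR loss over the first and second feedback information in PyTorch. The stand-in tensors and batch size are assumptions; in practice the scores would come from the prediction model after the frequency-based adjustment:

```python
import torch
import torch.nn.functional as F

def bpr_loss(first_feedback, second_feedback):
    """BPR loss: -log sigmoid(positive score - negative score), averaged."""
    return -F.logsigmoid(first_feedback - second_feedback).mean()

# Stand-in scores; in practice these are the prediction model's initial
# feedback information adjusted by the sample recommendation frequency
# information of the positive content i and negative content j.
first_feedback = torch.randn(256, requires_grad=True)
second_feedback = torch.randn(256, requires_grad=True)
loss = bpr_loss(first_feedback, second_feedback)
loss.backward()  # in training, gradients update the prediction model's parameters
```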
In an exemplary embodiment, in step S230, the prediction model adjusts the predicted initial feedback information based on the recommendation frequency information corresponding to the target content to obtain the feedback information of the target object for the target content. This may be implemented as follows: determining adjustment amplitude information for the initial feedback information based on the recommendation frequency information, the adjustment amplitude information being positively correlated with the recommendation frequency information; and adjusting the initial feedback information according to the adjustment amplitude information to obtain feedback information that varies inversely with the recommendation frequency information, which serves as the feedback information of the target object for the target content.
In a specific implementation, the purpose of adjusting the initial feedback information predicted by the prediction model through the recommendation frequency information is to let the target object give feedback smaller than the initial feedback information when the recommendation frequency information of the target content is excessively high; that is, the higher the value of the recommendation frequency information, the smaller the score reflected by the adjusted feedback information. The initial feedback information can therefore be adjusted based on this inverse relationship between recommendation frequency information and feedback information. When adjusting, adjustment amplitude information may be set in positive correlation with the recommendation frequency information: the higher the recommendation frequency information, the larger the downward adjustment applied to the initial feedback information.
More specifically, a relation of the following form may be used to define the influence of the recommendation frequency information on the feedback information of the target object (an exponential decay is shown here as one form consistent with the properties above):

$$\hat{r}_{ui} = y_{ui} \cdot \exp\big(-\alpha \cdot e_t(u,i)\big)$$

where $y_{ui}$ may represent the initial feedback information of the target object u for the target content i, $e_t(u, i)$ may represent the recommendation frequency information corresponding to the target content i, α is a sensitivity hyper-parameter, and $\hat{r}_{ui}$ may represent the feedback information adjusted by the recommendation frequency information.
In this embodiment, the initial feedback information is adjusted through adjustment amplitude information proportional to the recommendation frequency information, producing feedback information that varies inversely with the recommendation frequency information and serves as the feedback information of the target object for the target content. The target object thus gives feedback smaller than the initial feedback information when the recommendation frequency information of the target content is excessively high, enabling correction at the incipient stage of the information cocoon and alleviating the information cocoon problem.
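A minimal sketch of one such adjustment, assuming the exponential-decay form shown above; `alpha` is a hypothetical sensitivity hyper-parameter, and the exact functional form in the patent's figure may differ:

```python
import torch

def adjust_feedback(y_ui, e_t, alpha=1.0):
    """Adjust initial feedback y_ui by recommendation frequency e_t (tensors).

    The reduction factor exp(-alpha * e_t) shrinks as e_t grows, so the
    adjustment amplitude is positively correlated with the recommendation
    frequency and the adjusted feedback varies inversely with it.
    """
    return y_ui * torch.exp(-alpha * e_t)
```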
In an exemplary embodiment, in step S220, the content recommendation model to be trained may determine the target content recommended to the target object by:
step S220a, acquiring state information associated with the target object; the state information is obtained based on the object characteristic information of the target object, the historical recommendation content recommended for the target object, the weight information of the historical recommendation content, and the feedback information of the target object for the historical recommendation content;
step S220b, determining target content recommended to the target object according to the content recommendation model to be trained and the state information.
Here, the object feature information may be understood as a representation vector characterizing the preferences of the target object, and the weight information may be used to characterize the recommendation probability of the content.
In a specific implementation, when determining the target content recommended to the target object from the first sample data, the content recommendation model to be trained may refer not only to the model itself but also to state information associated with the target object that can affect the recommendation result, such as object feature information representing the target object's preferences, the historical recommended content recommended for the target object, the weight information of that historical recommended content, and the target object's feedback information on it. The target content recommended to the target object is then determined by combining the content recommendation model to be trained with this state information.
In this embodiment, the target content recommended to the target object by the content recommendation model to be trained is determined by combining the state information associated with the target object, so that the accuracy of the recommended target content can be improved.
In an exemplary embodiment, in the step S240, adjusting the content recommendation model to be trained based on the feedback information to obtain the target content recommendation model includes:
step S240a, based on the feedback information, adjusting the model parameters of the content recommendation model to be trained to obtain a new content recommendation model;
step S240b, updating the state information associated with the target object according to the target content and the feedback information corresponding to the target content;
step S240c, determining new target content recommended to the target object again through the new content recommendation model and the updated state information associated with the target object, and obtaining feedback information for the new target content through the trained prediction model until the content recommendation accuracy for the target object reaches a threshold value, so as to obtain a target content recommendation model.
In a specific implementation, during training of the content recommendation model to be trained, the model gradually learns the preferences of the target object as it interacts with that object, and in this process the state information associated with the target object changes. Therefore, after target content has been recommended to the target object once, on the one hand the model parameters of the content recommendation model to be trained can be adjusted based on the feedback information of the target object, simulated by the prediction model, for the target content, yielding a new content recommendation model whose content recommendation accuracy for the target object is improved over the model to be trained. On the other hand, the state information associated with the target object can be updated in time according to each recommended target content and the target object's feedback information on it. New target content to recommend to the target object is then determined again through the new content recommendation model and the updated state information, and feedback information for the new target content is obtained through the trained prediction model, until the content recommendation accuracy for the target object reaches a threshold, giving the target content recommendation model and improving how well the model's recommendation results fit the recommended object.
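The alternation just described can be sketched as follows; `policy`, `predictor`, and the method names are hypothetical stand-ins for the content recommendation model, the trained prediction model, and their interfaces:

```python
def train_recommendation_model(policy, predictor, state, max_rounds, accuracy_threshold):
    """Alternate: recommend -> simulated feedback -> update parameters & state."""
    for _ in range(max_rounds):
        item = policy.recommend(state)             # target content for this round
        reward = predictor.feedback(state, item)   # feedback simulated by prediction model
        policy.update(state, item, reward)         # adjust model parameters (step S240a)
        state = state.updated_with(item, reward)   # refresh state information (step S240b)
        if policy.accuracy(state) >= accuracy_threshold:
            break                                  # stopping condition of step S240c
    return policy
```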
More specifically, for the training of the content recommendation model, the training objective may be to maximize the expectation function of the PPO algorithm:

$$J(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\Big], \qquad \rho_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}$$

The optimization objective is to obtain a content recommendation model θ that maximizes the cumulative revenue, where $\pi_\theta(a_t\mid s_t)$ is the probability of recommending target content $a_t$ under the current state information $s_t$. Here ε is a hyper-parameter that controls the maximum single-step update amplitude of the content recommendation model θ, using the function clip(x, a, b) to restrict the value of the variable x to the interval [a, b]. $\theta_{old}$ in the formula denotes the content recommendation model that generated the interaction data, i.e., the version of the current content recommendation model before the update.

$\hat{A}_t$ is an advantage function representing the cumulative revenue, implemented with generalized advantage estimation:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}$$

where the temporal-difference residual $\delta_t$ of the value function V is defined as

$$\delta_t = r_t + \gamma\,V(s_{t+1}) - V(s_t)$$

γ is a hyper-parameter representing the discount factor, λ ∈ [0, 1] is a hyper-parameter that balances bias and variance, and the value function V is defined as the expected cumulative discounted reward:

$$V(s_t) = \mathbb{E}_{\pi_\theta}\Big[\sum_{l\ge 0} \gamma^l\, r_{t+l}\Big]$$
the key point of optimizing the obtained target content recommendation model is that the pre-trained prediction model is used for answering each question provided by the content recommendation model to be trained, namely feedback information of the target object in 1411 sample objects belonging to the first sample data to the target content in 3327 sample contents is provided, namely the satisfaction score rt. Due to rtNot by the real user, but by the predictive model, this score is also defined as the counterfactual reward score.
In this embodiment, feedback information is given by the prediction model during training of the content recommendation model, and after training ends the knowledge and information in the prediction model have been transferred to the target content recommendation model, so that the target content recommendation model performs automatic inference and decision-making in its interactions with real objects. The trained target content recommendation model can thus grasp the preferences of all objects while intelligently avoiding repeated recommendation behavior, fundamentally alleviating the information cocoon problem in the recommendation system.
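For concreteness, here is a generic sketch of the clipped PPO update and generalized advantage estimation under the definitions above; the policy and value networks are abstracted away, and the batch layout is an assumption:

```python
import torch

def ppo_policy_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped PPO surrogate objective (negated for gradient descent)."""
    ratio = torch.exp(log_prob_new - log_prob_old)      # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip(ratio, 1-eps, 1+eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one interaction trajectory."""
    advantages, gae = [], 0.0
    next_value = 0.0                                    # value after terminal state
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v              # TD residual of V
        gae = delta + gamma * lam * gae                 # accumulate (gamma*lambda)^l terms
        advantages.append(gae)
        next_value = v
    return list(reversed(advantages))
```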
In order to evaluate whether the trained target content recommendation model can characterize and eliminate the information cocoon problem in the recommendation system, in an exemplary embodiment, after the step S240 of obtaining the target content recommendation model, the method further includes:
step S250, acquiring performance parameters of a target content recommendation model for recommending the content of the sample object according to the first sample data; the performance parameters are obtained under the condition that recommendation frequency information of the target content recommendation model in a preset time period reaches a threshold value;
step S260, based on the performance parameters, the performance of the target content recommendation model is evaluated.
The performance parameter may be the duration of the interaction between the target content recommendation model and a sample object during content recommendation.
In a specific implementation, to evaluate whether the trained target content recommendation model can characterize and eliminate the information cocoon problem in the recommendation system, this embodiment further builds on the first sample data and introduces a mechanism by which a real object becomes fatigued under repeated recommendation and quits the interaction. More specifically, the 1411 sample objects in the collected first sample data are randomly sampled, and a sampled target object interacts with the trained target content recommendation model. After the model returns a recommendation result, i.e., recommends a target content from the 3327 sample contents in the first sample data, the real interaction data of the target object for that target content is looked up in the first sample data and returned to the model as a supervision signal. This process loops until an interaction termination condition is reached: if the number of sample contents associated with both the currently recommended target content and the historical recommended content of the most recent time period exceeds a threshold, the interaction quits. The duration of the current interaction process is then obtained as a performance parameter, and the performance of the target content recommendation model is evaluated accordingly. Under this evaluation method, the target content recommendation model must keep the recommendation accuracy of each round high while ensuring that continuous recommendation does not make the recommended contents repetitive, thereby examining its capacity for automatic balance; the final evaluation target is for the model to obtain the maximum cumulative benefit in interactive recommendation, i.e., the highest sum of the scores of the target object's feedback information for each recommended content.
In this embodiment, the performance parameters are obtained, and the target content recommendation model is evaluated, under the condition that the model's recommendation frequency information within a preset time period reaches a threshold. The evaluation method provided by this embodiment allows direct, side-by-side comparison with currently common recommendation strategies and models, efficiently revealing how different methods perform under the information cocoon problem.
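The evaluation loop with the fatigue-quit mechanism might look like the following sketch; `ground_truth`, `similar`, and the window and threshold values are illustrative assumptions, not the patent's API:

```python
def evaluate_with_quit(model, user, ground_truth, similar, window=10, max_similar=3):
    """Interact until the user 'tires': quit when too many recently
    recommended items are associated with the current recommendation.

    ground_truth[(user, item)] is the real feedback looked up from the first
    sample data; similar(a, b) decides whether two contents are associated.
    """
    history, cumulative_reward = [], 0.0
    while True:
        item = model.recommend(user, history)            # target content
        recent = history[-window:]
        if sum(similar(item, h) for h in recent) > max_similar:
            break                                        # fatigue: quit the interaction
        cumulative_reward += ground_truth[(user, item)]  # supervision signal
        history.append(item)
    return len(history), cumulative_reward               # interaction duration, total gain
```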
Referring to fig. 4, a flowchart of a content recommendation method according to an exemplary embodiment is shown, where the method includes the following steps:
step S410, obtaining a plurality of contents to be recommended and objects to be recommended;
step S420, selecting target recommended content from a plurality of contents to be recommended through a target content recommendation model, and pushing the target recommended content to an object to be recommended; the target content recommendation model is obtained by training through the method of the embodiment.
In a specific implementation, after the plurality of contents to be recommended and the object to be recommended are obtained, the target content recommendation model trained by the method of the above embodiments processes them to obtain interaction prediction information of the object to be recommended for each content to be recommended. Based on this interaction prediction information, the target recommended content with the highest value is selected from the contents to be recommended and pushed to the object to be recommended.
In this embodiment, the target content recommendation model trained by the above method can accurately characterize the effect of repeated recommendation on the recommendation results, alleviating the information cocoon problem; the target recommended content pushed to the object to be recommended therefore avoids fatiguing the object and better matches its preferences.
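At serving time the procedure reduces to scoring each candidate and taking the maximum, as in this sketch (`predict_interaction` is a hypothetical method name for obtaining the interaction prediction information):

```python
def recommend_top1(model, candidates, user_state):
    """Score every content to be recommended and push the highest-scoring one."""
    scores = {c: model.predict_interaction(user_state, c) for c in candidates}
    return max(scores, key=scores.get)  # target recommended content
```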
In an exemplary embodiment, to help those skilled in the art understand the embodiments of the present disclosure, the recommended content is taken to be a video below, and the method is described with reference to the specific examples of fig. 5 to 7. Fig. 5 is an overall flowchart of a training method of a content recommendation model, with the following specific steps:
(1) Step S510, collecting the first sample data and the second sample data.
To measure the performance of a content recommendation model while also measuring the influence of the information cocoon effect, i.e., of monotonous and repetitive recommendation results, on user experience, the real preferences of all users for all videos in a set can be obtained; that is, all values in the user-video matrix are known, and this matrix serves as the first sample data. In addition, to train the prediction model, sample data beyond the fully exposed matrix may be obtained as second sample data.
For example, referring to fig. 3, 1411 users and 3327 representative videos were collected as first sample data of full exposure, and 7176 users' interaction data with 10729 videos were further collected as second sample data, and since the second sample data is used to train the prediction model, full exposure may not be needed.
(2) Step S520, constructing a model framework combining a causal inference model based on counterfactual reasoning with an offline reinforcement learning framework.
Referring to fig. 6, which schematically illustrates the overall model framework combining the causal inference model based on counterfactual reasoning with the offline reinforcement learning framework, the framework comprises four modules: a prediction model (also called a Causal User Model), a state tracker (State Tracker) based on a Transformer model, an interactive recommendation strategy (RL Agent) based on reinforcement learning, and a real environment for evaluation (Real Environment). The functions of the modules are described as follows:
the prediction model further comprises a prediction unit and an adjusting unit, the prediction unit is used for predicting initial feedback information of the sample object to the sample content, and the adjusting unit is used for adjusting repeated recommendation results in dynamic interactive recommendation, namely giving a negative score reward signal.
A state tracker, based on a Transformer model, which can automatically extract the information most relevant to the current recommendation from a representation vector $e_u$ characterizing the user's preferences and a series of representation vectors $\{e_{a_1}, \ldots, e_{a_{t-1}}\}$ of historically recommended videos. The representation vectors undergo a transformation inside the state tracker: the user vector $e_u$ passes through a simple feed-forward neural network (FFN) to give the new vector expression $e'_u = \mathrm{FFN}(e_u)$, and the processing of a recommended video's representation vector is defined by the operator

$$\tilde{e}_{a_t} = g_t \odot e_{a_t}$$

where the ⊙ symbol represents an element-level (Hadamard) product and $g_t$ is a gating vector that automatically controls the weighting of the corresponding recommended video's embedding according to the feedback information given by the prediction model or by a real user. It is specifically defined by the formula

$$g_t = \sigma\big(W \cdot \mathrm{Concat}(e_{a_t}, r_t) + b\big)$$

where σ is the sigmoid function, W and b are parameters learned during training, and Concat(a, b) is the vector concatenation operator joining vectors a and b.
Here $a_t$ represents the action taken by the interactive recommendation strategy at time t; one action corresponds to recommending one video i, so the representation vector $e_a$ of an action a is equivalent to the representation vector $e_i$ of the recommended video, i.e., $e_a = e_i$.
$s_t$ denotes the interaction state at time t, which should contain as much interaction information as possible, including the representation vector $e_u$ of the user's preference information and the information of the videos recommended to the user over the whole interaction trajectory (i.e., from time 1 to time t−1); it may be expressed as

$$s_t = \big[\,e'_u;\ \tilde{e}_{a_1};\ \ldots;\ \tilde{e}_{a_{t-1}}\,\big]$$

The role of the state tracker is to extract the genuinely useful part of this information. $r_t$ denotes the feedback information given by the user after the system takes recommendation action $a_t$ at time t, i.e., a scalar score representing the degree of satisfaction; the goal of optimizing the whole interactive strategy is to maximize the cumulative reward of the entire trajectory sequence. The reward signal may be a satisfaction score given by a real online user, or a counterfactual reward given by the trained prediction model.
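A minimal sketch of the gated state construction described above, with assumed dimensions; a full state tracker would additionally feed the resulting sequence through a Transformer encoder:

```python
import torch
import torch.nn as nn

class GatedStateTracker(nn.Module):
    """Sketch of the gated state construction; dimensions are assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(dim + 1, dim)  # W, b applied to Concat(e_a, r)

    def forward(self, e_u, e_actions, rewards):
        """e_u: (dim,); e_actions: (t-1, dim); rewards: (t-1,)."""
        e_u_prime = self.ffn(e_u)                                 # e'_u = FFN(e_u)
        g = torch.sigmoid(
            self.gate(torch.cat([e_actions, rewards.unsqueeze(-1)], dim=-1))
        )                                                         # gating vectors g_t
        gated = g * e_actions                                     # g_t ⊙ e_{a_t}
        return torch.cat([e_u_prime.unsqueeze(0), gated], dim=0)  # state s_t
```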
(3) Training model
In step S530, the first stage of model training is to train the prediction model by using the second sample data, so that the trained prediction model can grasp the preferences of 7176 users for 10729 videos.
The prediction model is trained according to the causal structure diagram shown in fig. 7, in which the U node represents user preference, the I node represents the characteristics of a video, the R node represents the feedback information given by the user, the Y node represents the user's real preference, and $E_t$ represents the recommendation frequency information of the current recommendation, characterizing the degree of entrapment in the information cocoon; $e_t$ is a specific value of the random variable $E_t$. Intuitively, if a certain video or a certain type of video is repeatedly recommended, the recommendation frequency information $E_t$ will grow larger, the user will feel tired, and the user will then give a negative feedback signal R relative to the real taste Y. The shaded nodes in the diagram represent hidden variables that cannot be observed directly from historical data.
The first diagram in FIG. 7, the causal graph in fig. 7(a), reflects a common assumption of current recommendation system models: the user's feedback information is determined only by user preferences and product characteristics. Building on this conventional assumption, the present disclosure explicitly models the recommendation frequency information $E_t$, obtaining the causal graph shown in fig. 7(b), in which the final feedback signal of the user is assumed to be determined by two paths:

① (U, I) → Y → R: this path depicts the influence of the user's real interest on the final feedback information and may be realized by a traditional DeepFM (Deep Factorization Machine) recommendation model or another recommendation model.

② I → $E_t$ → R: this path characterizes the impact of the recommendation frequency information on the user's final feedback information.
In step S540, the second stage is training of the reinforcement learning-based interactive recommendation strategy.
In the second training stage, the reward signal of the reinforcement-learning-based interactive recommendation strategy is given by the prediction model obtained in the first training stage. At this point, the recommendation frequency information of a video can no longer be obtained from the originally collected historical data; it is obtained by means of counterfactual reasoning instead. As shown in fig. 7(c), the path through which the original recommendation frequency information $E_t$ influences the user feedback needs to be cut off, and a new counterfactual value $e_t^*$, inferred in real time, replaces the original $E_t$.
In this stage, the training process does not use the interaction information in the sample data; instead, the prediction model obtained in the first training stage conducts a large amount of efficient interaction with the interactive recommendation strategy. In this process the prediction model plays the role of a user simulator, and through its counterfactual reasoning module it adjusts the repeated recommendation behavior of the interactive recommendation strategy, so that the interactive recommendation strategy module learns the capacity to automatically avoid repeated recommendation and jump out of the information cocoon.
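A hedged sketch of how such a counterfactual exposure signal $e_t^*$ might be computed during interaction, summing similarity to recently recommended videos with a temporal decay; both the decay kernel and the similarity function are assumptions, not the patent's exact formula:

```python
import math

def exposure_effect(candidate, history, now, sim, tau=1.0):
    """Counterfactual recommendation-frequency signal e_t* for a candidate video.

    history: list of (item, timestamp) pairs recommended so far in this
    interaction; recent, similar items contribute more to the effect.
    """
    effect = 0.0
    for item, ts in history:
        time_gap = max(now - ts, 1e-6)                    # guard against zero gap
        effect += sim(candidate, item) * math.exp(-time_gap / tau)
    return effect
```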
The method provided by the present disclosure is directed at explicitly modeling and evaluating information cocoons in an interactive recommendation environment. Specifically, by adopting causal reasoning, user preference in interactive recommendation is modeled more carefully within an offline reinforcement learning framework, and the influences of real user preference and of the repeated-recommendation effect on the final user experience are clearly distinguished, so that the information cocoon problem is fundamentally avoided and alleviated. The method of the present disclosure can also be generalized to the more common traditional static models.
Experiments show that the method provided by the present disclosure works well on the information cocoon problem in recommendation systems, with the following specific conclusions: (1) the prediction model based on counterfactual causal inference can accurately characterize the repeated-recommendation effect in the recommendation results, so that detection and correction are carried out at the incipient stage of the information cocoon; (2) the interactive recommendation strategy based on reinforcement learning can make good use of the counterfactual reward signals given by the prediction model, adjusting its scheme adaptively in real-time interaction with real users. Experimental results show that the method provided by the present disclosure works well regardless of how the environment changes, with an effect far exceeding the comparison algorithms.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
It is understood that the same/similar parts between the embodiments of the method described above in this specification can be referred to each other, and each embodiment focuses on the differences from the other embodiments, and it is sufficient that the relevant points are referred to the descriptions of the other method embodiments.
Based on the same inventive concept, the embodiment of the present disclosure further provides a content recommendation model training device for implementing the content recommendation model training method, and a content recommendation device for implementing the content recommendation method.
Fig. 8 is a block diagram illustrating a structure of a training apparatus for a content recommendation model according to an exemplary embodiment. Referring to fig. 8, the apparatus includes: a first obtaining unit 810, a recommending unit 820, a simulating unit 830 and an adjusting unit 840, wherein,
a first acquisition unit 810 configured to perform acquisition of first sample data; the first sample data comprises interaction information between the sample object and the sample content;
a recommending unit 820 configured to perform acquiring target content recommended to a target object by a content recommendation model to be trained based on the first sample data; the target object is any one of the sample objects, and the target content belongs to the sample content in the first sample data;
a simulation unit 830 configured to execute the trained prediction model to simulate feedback information of the target object to the target content; the prediction model is used for predicting initial feedback information of the target object on the target content and adjusting the predicted initial feedback information based on the recommendation frequency information corresponding to the target content to obtain feedback information of the target object on the target content; the recommendation frequency information represents the frequency of recommending associated content aiming at the target object, and the associated content is the content matched with the target content on at least one attribute;
an adjusting unit 840 configured to perform adjustment on the content recommendation model to be trained based on the feedback information, resulting in a target content recommendation model.
In an exemplary embodiment, the apparatus further includes:
a second acquisition unit configured to perform acquisition of second sample data; the second sample data comprises interaction information between the sample object and the sample content;
the frequency determining unit, configured to obtain, based on a plurality of sample contents corresponding to the same sample object in the second sample data, sample recommendation frequency information corresponding to each of those sample contents; the sample recommendation frequency information represents the number of associated sample contents whose timestamps precede that of the respective sample content, the timestamp representing the time at which that sample content is sampled to train the prediction model;
and the training unit is configured to execute training of the prediction model to be trained based on the sample recommendation frequency information and the interaction information between the sample object and the sample content in the second sample data to obtain the trained prediction model.
In an exemplary embodiment, the frequency determining unit is further configured to perform determining associated sample content with a timestamp before the target sample content based on difference information between a plurality of sample contents corresponding to the same sample object; the target sample content belongs to a plurality of sample contents corresponding to the same sample object; acquiring a time difference between timestamps corresponding to the target sample content and the associated sample content; and determining sample recommendation frequency information corresponding to the target sample content based on the difference information and the time difference.
In an exemplary embodiment, the training unit is configured to perform sampling of positive sample content and negative sample content corresponding to each sample object from the second sample data based on the interaction information in the second sample data; the positive sample content represents the content of the sample object subjected to positive feedback, and the negative sample content represents the content of the sample object subjected to negative feedback; for each sample object, obtaining first feedback information of the sample object to the positive sample content and second feedback information of the sample object to the negative sample content according to the sample recommendation frequency information of the positive sample content and the sample recommendation frequency information of the negative sample content; and adjusting model parameters of the prediction model to be trained based on the loss value between the first feedback information and the second feedback information until the preset training times are reached or the loss value is converged, so as to obtain the trained prediction model.
In an exemplary embodiment, the training unit is further configured to determine adjustment amplitude information for the initial feedback information based on the recommendation frequency information, the adjustment amplitude information being positively correlated with the recommendation frequency information, and to adjust the initial feedback information according to the adjustment amplitude information to obtain feedback information that varies inversely with the recommendation frequency information, which serves as the feedback information of the target object for the target content.
In an exemplary embodiment, the recommending unit 820 is further configured to perform obtaining the state information associated with the target object; the state information is obtained based on the object characteristic information of the target object, the historical recommendation content recommended for the target object, the weight information of the historical recommendation content, and the feedback information of the target object for the historical recommendation content; and determining the target content recommended to the target object according to the content recommendation model to be trained and the state information.
In an exemplary embodiment, the adjusting unit 840 is further configured to perform adjusting model parameters of the content recommendation model to be trained based on the feedback information, so as to obtain a new content recommendation model; updating the state information associated with the target object according to the target content and the feedback information corresponding to the target content; and determining new target content recommended to the target object again through the new content recommendation model and the updated state information associated with the target object, and acquiring feedback information aiming at the new target content through the trained prediction model until the content recommendation accuracy aiming at the target object reaches a threshold value to obtain the target content recommendation model.
In an exemplary embodiment, the apparatus further includes an evaluation unit configured to perform obtaining, according to the first sample data, a performance parameter of the target content recommendation model for content recommendation on the sample object; the performance parameters are obtained under the condition that recommendation frequency information of the target content recommendation model in a preset time period reaches a threshold value; and evaluating the performance of the target content recommendation model based on the performance parameters.
Fig. 9 is a block diagram illustrating a configuration of a content recommendation apparatus according to an exemplary embodiment. Referring to fig. 9, the apparatus includes: an obtaining unit 910 and a recommending unit 920, wherein,
an obtaining unit 910 configured to perform obtaining a plurality of contents to be recommended and objects to be recommended;
a recommending unit 920, configured to select a target recommended content from a plurality of contents to be recommended through a target content recommendation model, and push the target recommended content to an object to be recommended; the target content recommendation model is obtained by training through the method of the above embodiments.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 10 is a block diagram of an electronic device 1000 for implementing a training method of a content recommendation model or a content recommendation method according to an example embodiment. For example, the electronic device 1000 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so forth.
Referring to fig. 10, the electronic device 1000 may include one or more of the following components: a processing component 1002, a memory 1004, a power component 1006, a multimedia component 1008, an audio component 1010, an input/output (I/O) interface 1012, a sensor component 1014, and a communication component 1016.
The processing component 1002 generally controls the overall operation of the electronic device 1000, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 1002 may include one or more processors 1020 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 1002 may include one or more modules that facilitate interaction between processing component 1002 and other components. For example, the processing component 1002 may include a multimedia module to facilitate interaction between the multimedia component 1008 and the processing component 1002.
The memory 1004 is configured to store various types of data to support operations at the electronic device 1000. Examples of such data include instructions for any application or method operating on the electronic device 1000, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1004 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, optical disk, or graphene memory.
The power supply component 1006 provides power to the various components of the electronic device 1000. The power components 1006 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 1000.
The multimedia component 1008 includes a screen that provides an output interface between the electronic device 1000 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1008 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 1000 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1010 is configured to output and/or input audio signals. For example, the audio component 1010 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1000 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 1004 or transmitted via the communication component 1016. In some embodiments, audio component 1010 also includes a speaker for outputting audio signals.
I/O interface 1012 provides an interface between processing component 1002 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1014 includes one or more sensors for providing various aspects of status assessment for the electronic device 1000. For example, the sensor assembly 1014 may detect an open/closed state of the electronic device 1000, the relative positioning of components, such as a display and keypad of the electronic device 1000, the sensor assembly 1014 may also detect a change in the position of the electronic device 1000 or components of the electronic device 1000, the presence or absence of user contact with the electronic device 1000, orientation or acceleration/deceleration of the device 1000, and a change in the temperature of the electronic device 1000. The sensor assembly 1014 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1016 is configured to facilitate wired or wireless communication between the electronic device 1000 and other devices. The electronic device 1000 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 1016 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 1016 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1000 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 1004 comprising instructions, executable by the processor 1020 of the electronic device 1000 to perform the above-described method is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which includes instructions executable by the processor 1020 of the electronic device 1000 to perform the above-described method.
It should be noted that the descriptions of the above-mentioned apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like according to the method embodiments may also include other embodiments, and specific implementations may refer to the descriptions of the related method embodiments, which are not described in detail herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for training a content recommendation model, comprising:
acquiring first sample data; the first sample data comprises interaction information between a sample object and sample content;
acquiring target content recommended to a target object by a content recommendation model to be trained based on the first sample data; the target object is any one of the sample objects, and the target content belongs to the sample content in the first sample data;
simulating feedback information of the target object to the target content through the trained prediction model; the prediction model is used for predicting initial feedback information of the target object on the target content, and adjusting the predicted initial feedback information based on recommendation frequency information corresponding to the target content to obtain feedback information of the target object on the target content; the recommendation frequency information represents the frequency of recommending associated content aiming at the target object, wherein the associated content is content matched with the target content on at least one attribute;
and adjusting the content recommendation model to be trained based on the feedback information to obtain a target content recommendation model.
2. The method of claim 1, wherein the predictive model is derived by training as follows:
acquiring second sample data; the second sample data comprises interaction information between the sample object and the sample content;
obtaining sample recommendation frequency information corresponding to each sample content of the same sample object based on a plurality of sample contents corresponding to the same sample object in the second sample data; the sample recommendation frequency information characterizes a number of associated sample contents with a timestamp before the respective sample contents, the timestamp characterizing a time at which the respective sample contents are sampled to train the prediction model;
and training the prediction model to be trained based on the sample recommendation frequency information and the interaction information between the sample object and the sample content in the second sample data to obtain the trained prediction model.
3. The method of claim 1, wherein the content recommendation model to be trained determines the target content recommended to the target object by:
acquiring state information associated with the target object; the state information is obtained based on object feature information of the target object, historical recommended content recommended for the target object, weight information of the historical recommended content, and feedback information of the target object for the historical recommended content;
and determining the target content recommended to the target object according to the content recommendation model to be trained and the state information.
4. The method of claim 3, wherein the adjusting the content recommendation model to be trained based on the feedback information to obtain a target content recommendation model comprises:
based on the feedback information, adjusting the model parameters of the content recommendation model to be trained to obtain a new content recommendation model;
updating the state information associated with the target object according to the target content and the feedback information corresponding to the target content;
and determining new target content recommended to the target object again through the new content recommendation model and the updated state information associated with the target object, and acquiring feedback information aiming at the new target content through the trained prediction model until the content recommendation accuracy rate aiming at the target object reaches a threshold value to obtain a target content recommendation model.
5. A content recommendation method, comprising:
acquiring a plurality of contents to be recommended and objects to be recommended;
selecting target recommended content from the plurality of contents to be recommended through a target content recommendation model, and pushing the target recommended content to the object to be recommended; the target content recommendation model is trained by the method of any one of claims 1 to 4.
6. An apparatus for training a content recommendation model, comprising:
a first acquisition unit configured to perform acquisition of first sample data; the first sample data comprises interaction information between a sample object and sample content;
the recommending unit is configured to acquire target content recommended to a target object by a content recommending model to be trained on the basis of the first sample data; the target object is any one of the sample objects, and the target content belongs to the sample content in the first sample data;
a simulation unit configured to execute a prediction model completed by training, and simulate feedback information of the target object to the target content; the prediction model is used for predicting initial feedback information of the target object to the target content, and adjusting the predicted initial feedback information based on recommended frequency information corresponding to the target content to obtain feedback information of the target object to the target content; the recommendation frequency information represents the number of times of recommending associated content for the target object, wherein the associated content is content matched with the target content on at least one attribute;
and the adjusting unit is configured to adjust the content recommendation model to be trained based on the feedback information to obtain a target content recommendation model.
7. A content recommendation apparatus characterized by comprising:
an acquisition unit configured to perform acquisition of a plurality of contents to be recommended and objects to be recommended;
the recommending unit is configured to execute the steps of selecting target recommended content from the plurality of contents to be recommended through a target content recommending model and pushing the target recommended content to the object to be recommended; the target content recommendation model is trained by the method of any one of claims 1 to 4.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-5.
10. A computer program product comprising instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 5.
CN202210032538.5A 2022-01-12 2022-01-12 Training method of content recommendation model, content recommendation method and device Pending CN114386507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210032538.5A CN114386507A (en) 2022-01-12 2022-01-12 Training method of content recommendation model, content recommendation method and device

Publications (1)

Publication Number Publication Date
CN114386507A true CN114386507A (en) 2022-04-22

Family

ID=81200915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210032538.5A Pending CN114386507A (en) 2022-01-12 2022-01-12 Training method of content recommendation model, content recommendation method and device

Country Status (1)

Country Link
CN (1) CN114386507A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116700736A (en) * 2022-10-11 2023-09-05 荣耀终端有限公司 Determination method and device for application recommendation algorithm
CN116700736B (en) * 2022-10-11 2024-05-31 荣耀终端有限公司 Determination method and device for application recommendation algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination