CN114036388A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN114036388A
Authority
CN
China
Prior art keywords
node
data
target
value
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111355773.8A
Other languages
Chinese (zh)
Inventor
师敏花
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111355773.8A priority Critical patent/CN114036388A/en
Publication of CN114036388A publication Critical patent/CN114036388A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/06 Buying, selling or leasing transactions
    • G06Q 30/0601 Electronic shopping [e-shopping]
    • G06Q 30/0631 Item recommendations


Abstract

The disclosure provides a data processing method and device, an electronic device, and a storage medium, relating to the technical field of data processing, in particular to artificial intelligence, reinforcement learning, and intelligent recommendation. The implementation scheme is as follows: acquire a recall data set; construct a search tree corresponding to the recall data set, the search tree comprising a root node and a plurality of data nodes located at different levels, where each data node represents one piece of recall data in the recall data set and stores the push value and the search count of the corresponding recall data, the push value characterizing the value of the feedback result received after the corresponding recall data is pushed to a target object; and determine target data in the recall data set based on the push value of each piece of recall data, the target data being the data pushed to the target object.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of data processing, in particular to artificial intelligence, reinforcement learning, and intelligent recommendation, and provides a data processing method and device, an electronic device, and a storage medium.
Background
In intelligent recommendation scenarios, recall data are typically processed by a ranking algorithm. However, when a commonly used ranking algorithm faces a new push requirement, that requirement may affect the exposure ranking of commodities, so that clicks and conversions of the recommended commodities fail to achieve the expected effect, and may even reduce recommendation fairness and click-through rate overall.
Disclosure of Invention
The present disclosure provides a method and apparatus for data processing, an electronic device, and a storage medium.
According to a first aspect of the present disclosure, there is provided a data processing method, including: acquiring a recall data set; constructing a search tree corresponding to the recall data set, the search tree comprising a root node and a plurality of data nodes located at different levels, where each data node represents one piece of recall data in the recall data set and stores the push value and the search count of the corresponding recall data, the push value characterizing the value of the feedback result received after the corresponding recall data is pushed to a target object; and determining target data in the recall data set based on the push value of each piece of recall data, the target data being the data pushed to the target object.
According to a second aspect of the present disclosure, there is provided a data processing apparatus comprising: an acquisition module configured to acquire a recall data set; a construction module configured to construct a search tree corresponding to the recall data set, the search tree comprising a root node and a plurality of data nodes located at different levels, where each data node represents one piece of recall data in the recall data set and stores the push value and the search count of the corresponding recall data, the push value characterizing the value of the feedback result received after the corresponding recall data is pushed to a target object; and a decision module configured to determine target data in the recall data set based on the push value of each piece of recall data, the target data being the data pushed to the target object.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the above-described method.
According to a fifth aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flowchart of a commodity recommendation process in the related art;
FIG. 2 is a flow chart of a merchandise recommendation process according to the present disclosure;
FIG. 3 is a flow chart of a data processing method according to the present disclosure;
FIG. 4 is a schematic diagram of an alternative MDP tree according to the present disclosure;
FIG. 5 is a schematic illustration of an optional user step size according to the present disclosure;
FIG. 6 is a diagram of a merchandise recommendation scenario in which embodiments of the present disclosure may be implemented;
FIG. 7 is a schematic diagram of a data processing apparatus according to the present disclosure;
FIG. 8 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in FIG. 1, the current commodity recommendation process mainly includes: applying different recall algorithms (including but not limited to collaborative filtering, vectorized recall, category recall, label recall, novelty recall, popularity recall, and the like) to the candidate commodities in a candidate commodity library to obtain a recalled commodity set; estimating the click-through rate and conversion rate of the recalled commodities through a click-through rate estimation model and a conversion rate estimation model, and ranking the recalled commodities accordingly; and further screening the ranked recalled commodities according to supplementary strategies such as push intervention rules, a diversity strategy, and a repeated-recommendation strategy to obtain the final recommended commodity list.
Currently, the related art provides various ranking algorithms for recalled commodities. The first category is traditional methods such as collaborative filtering, LR (Logistic Regression) + GBDT (Gradient Boosting Decision Tree), and FM (Factorization Machines); the second is deep-learning-based methods such as Wide & Deep and DeepFM; the third is deep reinforcement learning models such as Policy Gradient, DQN (Deep Q-Learning), and Actor-Critic.
However, when the above ranking algorithms face a new push requirement, the requirement affects the exposure ranking of commodities, so that clicks and conversions of the recommended commodities cannot achieve the expected effect, and recommendation fairness and click-through rate may even decline overall.
For recommendation scenarios that consider the click and conversion value of commodities, many recommendation methods differentiate value by statically grouping commodities and uniformly assigning each group a score, either once or periodically. However, this fails to account for the dynamic nature of commodity value, which may change in real time with the recommendation strategy and user feedback (such as exposure, clicks, and conversions).
To solve the above problems, the present disclosure improves the ranking layer of the related art. As shown in FIG. 2, MCTS (Monte Carlo Tree Search) is used to search over the recalled commodities; strategies such as commodity value (the value estimation model in FIG. 2), a repeated-recommendation penalty (the cost estimation model in FIG. 2), a hit penalty, conversion rate, and user step size are integrated into the search, and the ranked recalled commodities are further screened by a diversity strategy to obtain the final recommended commodity list.
According to an embodiment of the present disclosure, a data processing method is provided.
FIG. 3 is a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 3, the method includes the following steps:
step S302, a recall data set is acquired.
The recall data set in the above steps may be a set of recall data determined by a recall algorithm in the data recommendation process, where the recall data may be candidate data successfully matched with the user in a candidate data set, and the type of the recall data is different in different data recommendation scenarios, for example, for a product recommendation scenario as shown in fig. 2, the recall data set may be a set of recall products determined by a recall algorithm.
Step S304, constructing a search tree corresponding to the recall data set, the search tree comprising a root node and a plurality of data nodes located at different levels, where each data node represents one piece of recall data in the recall data set and stores the push value and the search count of the corresponding recall data, the push value characterizing the value of the feedback result received after the corresponding recall data is pushed to a target object.
The search tree in the above step may be a tree constructed by simulating the push process of the recall data a finite number of times. It may include a root node, representing a request to push data to a user, and a plurality of data nodes at different levels, each representing one piece of recall data. During simulation, nodes are selected based on their push value and search count, and the estimated click-through rate and conversion rate are used as the state transition probability when expanding nodes, so that pushing each piece of recall data to the target object (i.e., the user) and the feedback the target object gives for it are simulated. The feedback result may be the user clicking the pushed recall data, consulting about it, leaving, and the like, determined according to the specific data push scenario.
To ensure that the click-through rate and conversion rate of the target data pushed to a target object achieve the expected effect, a value estimation model may be constructed to estimate the reward corresponding to different feedback results. For the user behavior of leaving, the negative of the estimated reward may be used as a cost; for the behaviors of clicking and consulting, the click and conversion value can be estimated and used as the reward. Meanwhile, to guarantee recommendation fairness, recall data with more exposure may be given a certain penalty, and some potential recall data may be given a higher value. It should be noted that the higher value does not expose the recall data directly; it only pushes the recall data for a short time, and if the click-through rate or conversion rate of the recall data does not rise, its probability of being selected falls and it stops being pushed.
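The value-estimation behavior described above can be sketched as a small scoring function. Everything here (the feedback labels, the numeric rewards, the exposure thresholds, and the name `estimate_reward`) is an illustrative assumption, not part of the disclosure:

```python
def estimate_reward(feedback: str, exposure_count: int) -> float:
    """Toy value-estimation sketch: map simulated user feedback to a reward."""
    # Leaving yields a negative reward (treated as a cost); clicking and
    # consulting yield positive click/conversion value.
    base = {"leave": -1.0, "click": 3.0, "consult": 100.0}[feedback]
    # Fairness adjustment (illustrative thresholds): penalize heavily
    # exposed items, boost under-exposed "potential" items.
    if exposure_count > 100:
        base -= 0.5
    elif exposure_count < 10:
        base += 0.5
    return base
```

If a boosted item's click-through rate then fails to rise, its selection probability falls through the normal value updates, matching the behavior described above.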
In addition, considering the influence of the repeated-recommendation penalty strategy on the ranking, a cost estimation module can be constructed for recall data repeatedly pushed within a short time; the cost of each repeated push can be estimated from characteristics of the target object, background information, and the like. The background information here may refer to the time interval since the recall data was last shown, the historical show position and page, and the user's preference for or acceptance of repeatedly recommended data.
In some alternative embodiments, an MDP (Markov Decision Process) tree may be constructed using MCTS. With each simulation, evaluated state values are stored at the nodes of the tree and accumulated through cyclic iteration of four steps: selection, expansion, simulation, and back-propagation. Each data node of the tree stores two state values: the push value and the search count. The push value here is determined jointly from the reward and the cost.
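As a minimal sketch, a node of such a tree only needs to carry the two state values named above plus its children; the field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TreeNode:
    """One node of the MDP search tree (sketch)."""
    item_id: Optional[str] = None      # None for the root (the push request)
    push_value: float = 0.0            # value of pushing this recall data
    visits: int = 0                    # search count of this node
    children: List["TreeNode"] = field(default_factory=list)
```

Both state values start at zero and are filled in by the back-propagation step of each simulation.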
The node representation, state transition process, and parameters used in practice by the MDP tree are as follows:
State: s_t may be represented by user preferences, the user request, and system status;
Action: a, the system selects one piece of recall data from the candidate recommendation list to recommend to the user;
Transition: P_a(s'|s) is the state transition probability, where the successor state s' is obtained through user feedback. The transition probability is generally taken to equal the probability of the user behavior, estimated by a probability network (comprising a click-through rate estimation model and a conversion rate estimation model);
Reward: r(s_t, a, s_t+1) is a measurable index of recommendation accuracy and exposure/conversion after the user takes the action. The index is given by the value estimation model, which estimates the reward by combining the exposure, click, and conversion conditions of the recall data while considering recommendation fairness and the value brought by clicks and conversions;
Cost: c(s_t, a, s_t+1) is the cost of the user taking a certain action, such as short-term repeated recommendation; it can be estimated from the repetition interval, page type and position, and the user's acceptance of repeated recommendations;
Discount rate: γ measures the contribution of long-term reward to the current value; generally, the greater the speculation depth, the higher the uncertainty of the behavior and the lower its contribution to the current value.
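Assuming the components listed above are supplied by the probability network, value estimation model, and cost estimation model, they can be grouped into one structure; the class and field names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class RecommendationMDP:
    """Sketch grouping the MDP components listed above (names assumed)."""
    transition: Callable[[str, str], Dict[str, float]]  # P_a(s'|s): CTR/CVR probability network
    reward: Callable[[str, str, str], float]            # r(s_t, a, s_t+1): value estimation model
    cost: Callable[[str, str, str], float]              # c(s_t, a, s_t+1): cost estimation model
    gamma: float = 0.9                                  # discount rate for deeper simulation steps
```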
Step S306, target data in the recall data set is determined based on the push value of each recall data in the recall data set, wherein the target data is data pushed to a target object.
In this embodiment, the push value of each piece of recall data reflects its click and conversion value. To maximize overall value, all recall data may be ranked by push value, the top-ranked items selected as target data, and the target data then pushed to the target user according to a certain push rule; the push rule may be set according to the application scenario and push requirements, and the disclosure does not specifically limit this.
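A minimal sketch of this ranking step, assuming the push values have already been computed (the item names and scores below are illustrative):

```python
from typing import List, Tuple

def select_targets(recall_set: List[Tuple[str, float]], k: int) -> List[str]:
    """Rank recall data by push value (descending) and return the top-k as target data."""
    ranked = sorted(recall_set, key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:k]]
```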
With the above scheme, after the recall data set is obtained, a search tree corresponding to it can be constructed, and the target data pushed to the target object is then determined based on the push value of each piece of recall data in the tree, achieving the ranking of recall data. Notably, because the target data is determined from the push values, and the push value characterizes the value of the feedback result received after the corresponding recall data is pushed to the target object, the push value reflects the click and conversion effect of the recall data; this ensures that clicks and conversions of the target data pushed to the target object achieve the expected effect. In addition, the push values are determined during construction of the search tree, so the push values of different recall data can be adjusted according to actual push requirements, and the ranking results adjusted accordingly; newly introduced push intervention rules will not affect the push values of different recall data. This ensures that the target data pushed to the target object meets the push requirements, and solves the problem in the related art that ranking algorithms are easily disturbed by different push requirements, leaving clicks and conversions of recommended commodities short of the expected effect.
Optionally, constructing the search tree corresponding to the recall data set includes: step A, determining a target node to be searched, the target node representing the root node or a piece of recall data in the recall data set; step B, expanding new child nodes below the target node and determining the reward of each new child node using the value estimation model; step C, simulating a target child node among the new child nodes and determining the push value of the last child node when the simulation ends; step D, reversely iterating over the search tree based on the push value of the last child node, updating the push values of the target node and each layer of child nodes below it, and updating the search count of the target node; and repeating steps A to D until the exploration time of the search tree reaches a preset exploration time or the exploration depth reaches a preset exploration depth.
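Steps A through D above can be sketched as a loop. The dict-based node layout, the random rollout, and the fixed iteration budget (standing in for a wall-clock exploration time) are all assumptions replacing the UCT selection, probability-network expansion, and value-estimation model described elsewhere in the disclosure:

```python
import random

def build_search_tree(items, n_iterations=50, gamma=0.9, seed=0):
    """Sketch of the select/expand/simulate/back-propagate loop (steps A-D)."""
    rng = random.Random(seed)
    root = {"item": None, "value": 0.0, "visits": 0, "children": []}
    for _ in range(n_iterations):
        # Step A: select -- walk down to a node with no children yet.
        node, path = root, [root]
        while node["children"]:
            node = max(node["children"], key=lambda c: c["value"])
            path.append(node)
        # Step B: expand -- one child per candidate item.
        node["children"] = [
            {"item": it, "value": 0.0, "visits": 0, "children": []} for it in items
        ]
        # Step C: simulate -- pick a child at random and estimate a reward
        # (stand-in for the value estimation model).
        child = rng.choice(node["children"])
        reward = rng.uniform(0.0, 10.0)
        value = gamma ** len(path) * reward       # discount by simulation depth
        # Step D: back-propagate -- update values and search counts up the path.
        for n in reversed(path + [child]):
            n["value"] += value
            n["visits"] += 1
    return root
```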
The preset exploration time in the above steps may be the simulated construction time of the search tree, and the preset exploration depth may be the simulated exploration depth of the search tree; both can be set according to actual needs, which the disclosure does not specifically limit.
In the embodiment of the present disclosure, the search tree is explored starting from the root node, and other data nodes are searched only after exploration of the root node completes. At the start of each loop iteration, the selection step (step A) is executed first to determine the target node to explore this time. The expansion step (step B) then expands the target node through an action, creating a child node for each feedback result the action can produce, and the value estimation model evaluates each feedback result to obtain the child node's reward. The simulation step (step C) randomly selects a child node and simulates until the simulation ends (the end condition can be set as needed, e.g., reaching a set simulation time or simulation depth), at which point the push value of the terminal state, i.e., of the last child node, is given (combining reward, cost, and so on). Finally, the back-propagation step (step D) propagates the simulated push value upward, recursively updating the push value of each layer of child nodes and updating the search count of the target node.
In some alternative embodiments, for the MDP tree shown in FIG. 4, during the first loop iteration the root node may be selected as the target node and an action chosen (say, pushing Item 1), shown as the solid circle in FIG. 4. Child nodes corresponding to the feedback results after Item 1 is pushed (clicking, leaving, and converting) are then expanded according to the state transition probability, corresponding to the three ellipses in the upper left of FIG. 4, and the value estimation model evaluates the reward of each; suppose the rewards of the three child nodes are 3, 0, and 100 respectively. A child node is randomly selected for simulation, say the one corresponding to clicking, shown as the solid ellipse in FIG. 4 (unselected child nodes are the dotted ellipses). The simulation selects a follow-up action according to the state transition probability (say, pushing Item k), then expands the child nodes after Item k is pushed (clicking, leaving, and converting); at this point the simulation ends, and the push value of the last child node is obtained directly from the reward estimated by the value estimation module. Finally, the push value of the last child node is back-propagated according to the simulation depth, so the push value of each layer of child nodes under the parent node is updated recursively, and the search count of the target node is incremented by 1. Because the child nodes expanded after Item k have not yet been simulated, the rewards of all three are 0.
It should be noted that the push value of an upper-layer child node is taken as the maximum total value (i.e., the sum of the push value and the reward) among its next-layer child nodes, but the method is not limited thereto.
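That recursive rule can be sketched as follows; the field names are assumptions, and the numbers in the usage below mirror the illustrative rewards 3, 0, and 100 used earlier:

```python
from typing import Dict, List

def parent_push_value(children: List[Dict[str, float]]) -> float:
    """Parent push value = max over next-layer children of (push value + reward)."""
    return max(c["push_value"] + c["reward"] for c in children)
```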
Through the above scheme, by cyclically executing the selection, expansion, simulation, and back-propagation steps, the push value of each piece of recall data is continuously updated across iterations. This ensures that the push value meets the push requirements and truly reflects the click and conversion value of the recall data, improving the accuracy of the push value and hence of data pushing.
Optionally, determining the target node to be searched includes: traversing from the root node to determine whether an unexpanded node exists; in response to the existence of the unexpanded node, determining the unexpanded node as a target node; in response to the absence of unexpanded nodes, a target node is determined based on the push value and the number of searches of each data node.
An unexpanded node in the above embodiment may refer to a node that still has an unsimulated action or an unsimulated child node.
In the embodiment of the disclosure, the search tree is constructed by fully expanding one node before moving on to the next. Therefore, at the start of each iteration it is first determined whether an unexpanded node exists; if so, that node is taken directly as the target node and the subsequent expansion, simulation, and back-propagation steps are executed. If not, the target node may be selected by UCT (Upper Confidence Bound applied to Trees), computed as follows:

UCT(s, a) = Q(s, a) + c * sqrt(ln N(s) / N(s, a))

where Q(s, a) is the push value of taking action a in state s, N(s) is the search count of state s, N(s, a) is the number of times action a has been performed in state s, and c is an exploration constant. The Q(s, a) term is the incentive to "utilize" (exploit), while the sqrt(ln N(s) / N(s, a)) term is the incentive to "explore".
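A sketch of the standard UCT selection rule referenced above; the exploration constant c = 1.4 and the convention of treating untried actions as infinitely attractive are common MCTS assumptions, not values from the disclosure:

```python
import math

def uct_score(q: float, parent_visits: int, action_visits: int, c: float = 1.4) -> float:
    """UCT: exploitation term Q(s, a) plus exploration bonus."""
    if action_visits == 0:
        return float("inf")  # untried actions are explored first (assumption)
    return q + c * math.sqrt(math.log(parent_visits) / action_visits)
```

The target node is then the child maximizing this score; the bonus shrinks as an action is tried more often, balancing exploitation against exploration.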
In some optional embodiments, for the MDP tree shown in fig. 4, in the first loop iteration process, it may be directly determined that the root node is the target node, and at this time, both the push value and the search frequency of the root node are 0; in the second loop iteration process, because the root node is not completely expanded, the root node can still be determined as the target node, and at the moment, the pushing value and the searching times of the root node are not 0 any more and are updated in the last loop iteration; in the third iteration process, if the root node is completely expanded, the traversal can be continued, and the node corresponding to Item1 is determined to be the target node.
With this scheme, by determining whether unexpanded nodes exist and selecting the target node in a manner adapted to the result, the efficiency and accuracy of target node determination are improved.
Optionally, the expanding the new child node below the target node includes: acquiring state transition probability corresponding to a target node; determining a target execution operation corresponding to the target node based on the state transition probability; based on the target execution operation, a new child node is created below the target node.
The state transition probability in the above steps may be determined from the click-through rate estimated by the click-through rate estimation model and the conversion rate estimated by the conversion rate estimation model, and is used to judge the user's likely operations on the recall data; the target execution operation may be the execution operation with the maximum state transition probability.
In some optional embodiments, for the MDP tree shown in fig. 4, in the first loop iteration process, after determining a target node, an action that is not expanded, that is, push Item1, may be selected, then a target execution operation that may be executed by a user is determined according to a state transition probability, that is, clicking, leaving, and consulting, respectively, and three child nodes are created on the next level of the target node, where push values and search times of the three child nodes are both 0; in the second loop iteration process, after the target node is determined, an action which is not expanded, namely push Item2, is selected, then target execution operations which are possibly executed by a user are determined according to the state transition probability, namely clicking, leaving and consulting, and three child nodes are created on the next level of the target node, wherein the push value and the search times of the three child nodes are both 0.
With this scheme, child nodes are expanded according to the state transition probability, so the expanded child nodes truly reflect the click and conversion value of the recall data, improving the accuracy of the push value and of data pushing.
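A minimal sketch of this expansion, assuming the probability network has already produced a feedback distribution (the probabilities below are placeholders, and the dict-based node layout is an assumption):

```python
from typing import Dict, List

def expand(node: Dict, transition_probs: Dict[str, float]) -> List[Dict]:
    """Create one child per feedback outcome, weighted by its transition probability."""
    node["children"] = [
        {"feedback": fb, "prob": p, "push_value": 0.0, "visits": 0, "children": []}
        for fb, p in transition_probs.items()
    ]
    return node["children"]
```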
Optionally, simulating a target child node among the new child nodes and determining the push value of the last child node when the simulation ends includes: determining the target child node based on the probability corresponding to each new child node; simulating the target child node; determining that the simulation ends when the simulation time reaches a preset simulation time or the simulation depth reaches a preset simulation depth; and determining the push value of the last child node.
The preset simulation time in the above steps may be a preset duration for simulating the child node, and may be set according to actual needs; the present disclosure does not specifically limit it. The preset simulation depth in the above steps may be a preset depth for simulating the child node, and may be determined according to the user's habits or an average step length. In most data push scenarios the user step length tends to be small; as shown in fig. 5, it is mostly concentrated below 5 steps. The exploration depth may therefore be determined by slightly increasing the user's historical step length. Note that, for a new user, the user's historical step length may be estimated from the average step length of all users, or the average historical step length of users of the same type may be used directly as the user's historical step length.
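As a rough sketch of how the preset simulation depth might be derived from step lengths (the helper and its `margin` parameter are assumptions, not specified by the disclosure):

```python
def simulation_depth(user_steps, all_user_steps, margin=1):
    """Preset simulation depth: the user's average historical step length
    plus a small increase; new users fall back to the all-user average."""
    steps = user_steps if user_steps else all_user_steps
    avg = sum(steps) / len(steps)
    return int(round(avg)) + margin

# an existing user averaging 3 steps explores to depth 4
depth = simulation_depth([2, 3, 4], all_user_steps=[3, 4, 2, 5])
# a new user (no history) is estimated from the all-user average instead
depth_new = simulation_depth([], all_user_steps=[3, 4, 2, 5])
```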
In the embodiment of the disclosure, after the child nodes are expanded, one child node may be selected and simulated according to the MDP until the set simulation time or simulation depth is reached; the reward of the last child node may then be estimated with the value estimation model, and the push value of the child node updated based on the estimated reward. If the recall data is repeatedly pushed, the cost of the last node can additionally be estimated with a cost estimation model, and the push value of the child node updated based on both the estimated reward and the cost. For child nodes at different hierarchies, corresponding discount rates may be set, with the discount rate lower the deeper the hierarchy, so that the corresponding push value is updated from the product of the reward and the discount rate; for example, the discount rate may be 0.9, but is not limited thereto.
In some optional embodiments, for the MDP tree shown in fig. 4, in the first loop iteration, after the three child nodes are created, one child node (namely the child node corresponding to clicking) may be selected for simulation; Item k is then selected according to the state transition probability, three child nodes of the next hierarchy are created, and the child node corresponding to clicking is selected. At this point it is determined that the exploration depth is reached, so the push value of the child node corresponding to clicking may be determined: assuming the predicted reward of the child node is 10 and the simulation depth is 3, the push value of the node is updated as V(y) = γ³ × reward = 0.9³ × 10 ≈ 7, where γ denotes the discount rate, which is lower the deeper the exploration depth. In the second loop iteration, after the three child nodes are created, one child node (namely the child node corresponding to clicking) may be selected for simulation; Item k′ is selected according to the state transition probability, three child nodes of the next hierarchy are created, and the child node corresponding to clicking is selected. At this point it is determined that the exploration depth is reached, so the push value of the child node corresponding to clicking may be determined: assuming the predicted reward of the child node is 100 and the simulation depth is 3, the push value of the node is updated as V(n) = γ³ × reward = 0.9³ × 100 ≈ 70.
According to this scheme, the target child node is determined based on the probability corresponding to the new child node, the target child node is simulated, the end of simulation is determined by the preset simulation time or preset simulation depth, and the push value of the last child node is determined, thereby adjusting the push value in real time and improving push accuracy.
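The level-wise discount described above amounts to multiplying the estimated reward by the discount rate once per simulated level; a minimal sketch (the function name is assumed):

```python
def discounted_push_value(reward, depth, gamma=0.9):
    """Push value of the last simulated child node: the estimated reward is
    discounted once per simulation level, so deeper rollouts count less."""
    return gamma ** depth * reward

# the fig. 4 walk-through: reward 10 at simulation depth 3, gamma = 0.9
v = discounted_push_value(10, 3)      # 0.9**3 * 10 = 7.29 (rounded to 7 in the text)
v2 = discounted_push_value(100, 3)    # 0.9**3 * 100 = 72.9 (rounded to 70 in the text)
```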
Optionally, before determining the push value of the last child node, the method further includes: determining whether the associated data corresponding to the last child node is repeated data that has been repeatedly pushed within a target time period; in response to the associated data being repeated data, processing the last child node with a cost estimation model and a value estimation model to obtain the push value of the last child node; and in response to the associated data not being repeated data, processing the last child node with the value estimation model to obtain the push value of the last child node.
The target time period in the above steps may be a preset short time period; for example, it may be the entire construction process of the search tree, or a period of historical time before the search tree is constructed, but is not limited thereto.
In the embodiment of the disclosure, for recall data repeatedly pushed within a short time, in order to avoid the poor experience that repeated pushing brings to the user, it may first be determined whether the recall data corresponding to the last child node is repeated data. If so, the push value needs to be determined by combining the estimation results of the cost estimation model and the value estimation model; if not, only the estimation result of the value estimation model is needed to determine the push value.
By the scheme, different determination processes are provided for the repeated data and the non-repeated data, and the effects of improving the determination accuracy of the pushing value and further improving the accuracy of data pushing are achieved.
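A sketch of the branch above; the model stand-ins and the reward-minus-cost combination are assumptions, since the disclosure does not state how reward and cost are combined:

```python
def last_node_push_value(item, value_model, cost_model, recent_pushes):
    """Push value of the last child node: repeated data (already pushed
    within the target time period) is additionally penalized by the cost
    model; non-repeated data uses the value model alone."""
    reward = value_model(item)
    if item in recent_pushes:              # repeated within the target time period
        return reward - cost_model(item)   # assumed combination of reward and cost
    return reward

value_model = lambda item: 10.0            # toy stand-in for the value estimation model
cost_model = lambda item: 4.0              # toy stand-in for the cost estimation model
v_repeat = last_node_push_value("ItemA", value_model, cost_model, {"ItemA", "ItemB"})
v_fresh = last_node_push_value("ItemC", value_model, cost_model, {"ItemA", "ItemB"})
```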
Optionally, the reverse iteration is performed on the search tree based on the push value of the last child node, and updating the push values of the target node and each layer of child nodes below the target node includes: step a, obtaining the total value of an extended node based on the push value, the reward and the state transition probability of at least one child node positioned on the current layer; step b, updating the push value of the father node positioned at the upper layer based on the maximum value of the total value of the expansion nodes; and repeating the steps a to b until the push value of the root node is updated.
Here, the update may be to update the current push value of the parent node to the maximum value of the total value of the extension node, or to superimpose the current push value of the parent node and the maximum value of the total value of the extension node. In the embodiment of the present disclosure, the current push value of the parent node is updated to the maximum value of the total value of the extension node.
In the back propagation process, the push value of each child node is discounted: the product of the push value and the discount rate is obtained, the reward of the child node is added, and the result is multiplied by the state transition probability of the child node to obtain the final total value. In some optional embodiments, for the MDP tree shown in fig. 4, during the first loop iteration the push value of node t may be updated as follows: V(t) = max(P(click|t) × [r(y) + γV(y)] + P(leave|t) × [r(y′) + γV(y′)] + P(consult|t) × [r(y″) + γV(y″)]) = max(0.1 × (0 + 0.9 × 7) + 0.79 × (0 + 0) + 0.01 × (0 + 0)) = 0.63, and the push value of node s may then be updated according to V(s) = max_{a∈{1,2,…,k}}(P_a(t|s) × [r(t,a,s′) + γV(s′)]) = max(0.1 × (3 + 0.9 × 0.63) + 0.89 × (0 + 0) + 0.01 × (100 + 0)) = 1.356 for action Item1; at this point the first loop iteration ends, the search times N of the root node is updated to 1, and its push value is updated to 1.356. During the second loop iteration, the push value of node m may be updated as follows: V(m) = max(P(click|m) × [r(n) + γV(n)] + P(leave|m) × [r(n′) + γV(n′)] + P(consult|m) × [r(n″) + γV(n″)]) = max(0.1 × (0 + 0.9 × 70) + 0.78 × (0 + 0) + 0.02 × (0 + 0)) = 6.3, and the push value of node s may then be updated according to V(s) = max_{a∈{1,2,…,k}}(P_a(t|s) × [r(t,a,s′) + γV(s′)]) = max(0.1 × (3 + 0.9 × 0.63) + 0.89 × (0 + 0) + 0.01 × (100 + 0) for action Item1, 0.15 × (5 + 0.9 × 6.3) + 0.8 × (0 + 0) + 0.05 × (80 + 0) for action Item2) = max(1.356, 4.8505) = 4.8505; at this point the second loop iteration ends, the search times N of the root node is updated to 2, and its push value is updated to 4.8505.
Note that, in the above case, the ordering Item2 > Item1 may be selected as the final result.
By the scheme, the total value is determined through the push value, the reward and the state transition probability, and then the push value of each node is updated through back propagation, so that the push value of each node is accurately determined, the determination accuracy of target data is improved, and the effect of improving the push accuracy is achieved.
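The back-propagation arithmetic of the fig. 4 example can be reproduced with a short sketch (function names assumed): each expanded action's total value sums, over its child nodes, probability × (reward + discount × push value), and the parent takes the maximum over its actions:

```python
GAMMA = 0.9  # discount rate used in the fig. 4 example

def total_value(children):
    """Total value of one expanded action: sum over (probability, reward,
    push value) triples of probability * (reward + GAMMA * push value)."""
    return sum(p * (r + GAMMA * v) for p, r, v in children)

def parent_push_value(actions):
    """Updated parent push value: maximum total value over expanded actions."""
    return max(total_value(children) for children in actions)

# first loop iteration: node t's click/leave/consult children
v_t = parent_push_value([[(0.1, 0, 7), (0.79, 0, 0), (0.01, 0, 0)]])      # 0.63
# node s, action Item1, using the freshly updated V(t) = 0.63
v_s = parent_push_value([[(0.1, 3, 0.63), (0.89, 0, 0), (0.01, 100, 0)]])  # ~1.357
```

Note the example's own figures are truncated (it reports 1.356); the formula, not the rounding, is the point here.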
A preferred embodiment of the present disclosure is described in detail below with reference to fig. 4 and 6, taking a product recommendation scenario as an example. A user request is first received, and may differ across scenes: in a search scene, the user request may be the search text entered by the user; in a list page recommendation scenario, the user request may likewise be the search text of a user search; and in a detail page recommendation scenario, the user request may be the behavior information of the user clicking the current item detail page. In addition, search results and the user's historical browsing and clicking behavior may be used as background information.
The specific flow of the scheme is as follows. As shown in fig. 6, the user may enter the information to be queried in the search box. Assuming the user searches for "imposter alliance" in the search box, it may be determined that "imposter alliance" is the user request; after the request is received, the recommendation decision process may be started, the recall data may be sorted in conjunction with the MCTS tree, and recommendation results such as "XINGFUJIA spicy" and "on-line rice noodles" may be given below the retrieval result, as shown by the solid circles in fig. 4. A simulation can then be run: assuming Item1 (e.g., XINGFUJIA spicy) is recommended, the user may click the item to enter its detail page, go directly to the "consult" button and leave as a thread conversion, or leave after a glance. Suppose the user clicks the item in the recommendation and enters the detail page, where the recommendation is displayed again. Further, suppose the user clicks data such as "member diary", "content property", "authentication rating", and "recommendation", and a series of candidate lists (Item a-Item k) is searched; after observing these items under the list page recommendation Item1, it is judged whether sufficient follow-up value will be provided after the user clicks.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
According to an embodiment of the present disclosure, the present disclosure provides a data processing apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used herein, the term "module" may be any combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 7 is a schematic diagram of a data processing apparatus according to the present disclosure, as shown in fig. 7, the apparatus comprising: an obtaining module 72, configured to obtain a recall data set; a building module 74, configured to build a search tree corresponding to the recall data set, where the search tree includes: the data retrieval system comprises a root node and a plurality of data nodes positioned at different levels, wherein each data node is used for representing recall data in a recall data set, each data node is used for storing the push value and the search times of corresponding recall data, and the push value is used for representing the value of a feedback result received after the corresponding recall data are pushed to a target object; and the decision module 76 is configured to determine target data in the recall data set based on a push value of each recall data in the recall data set, where the target data is data pushed to a target object.
Optionally, the building block includes: the device comprises a first determining unit, a second determining unit and a searching unit, wherein the first determining unit is used for determining a target node to be searched, and the target node is used for representing a root node or recalling data in a recalling data set; the extension unit is used for extending a new child node below the target node and determining the reward of the new child node by using the value estimation model; the simulation unit is used for simulating a target child node in the new child node and determining the push value of the last child node when the simulation is finished; the second determining unit is used for carrying out reverse iteration on the search tree based on the push value of the last child node and determining the push values of the target node and each layer of child nodes below the target node; and the execution unit is used for repeatedly executing the functions of the determination unit, the expansion unit, the simulation unit and the second determination unit until the exploration time of the search tree reaches the preset exploration time or the exploration depth reaches the preset exploration depth.
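The four units above — determine, expand, simulate, back-propagate — can be sketched as one loop over injected callbacks (all names assumed), terminating when the preset exploration time or exploration depth is reached:

```python
import time

def build_search_tree(root, select, expand, simulate, backpropagate,
                      max_seconds=0.05, max_depth=5):
    """Repeat select -> expand -> simulate -> backpropagate until the
    preset exploration time or exploration depth is reached."""
    start = time.monotonic()
    while time.monotonic() - start < max_seconds:
        target = select(root)
        children = expand(target)
        reward, reached_depth = simulate(children)
        backpropagate(target, reward)
        if reached_depth >= max_depth:
            break
    return root

# toy callbacks: one pass that immediately reaches the depth limit
root = {"V": 0.0, "N": 0}
tree = build_search_tree(
    root,
    select=lambda r: r,
    expand=lambda t: [t],
    simulate=lambda cs: (1.0, 5),                      # (reward, reached depth)
    backpropagate=lambda t, r: t.update(V=t["V"] + r, N=t["N"] + 1),
)
```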
Optionally, the first determining unit includes: the traversal subunit is used for starting traversal from the root node and determining whether an unexpanded node exists or not; the first node determining subunit is used for responding to the existence of the unexpanded node and determining the unexpanded node as a target node; and the second node determining subunit is used for responding to the absence of the unexpanded node, and determining a target node based on the push value and the search times of each data node.
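When no unexpanded node remains, the second node determining subunit picks the target node from the push value and the search times, but the disclosure does not give the formula; a common choice in Monte Carlo tree search is the UCT (UCB1) score, sketched here purely as an assumption:

```python
import math

def uct_score(push_value, search_times, parent_search_times, c=1.41):
    """UCB1 score: exploit high push values, explore rarely searched nodes;
    nodes that were never searched get infinite priority."""
    if search_times == 0:
        return math.inf
    return (push_value / search_times
            + c * math.sqrt(math.log(parent_search_times) / search_times))

def select_target(children, parent_search_times):
    """Target node = child with the highest UCT score."""
    return max(children, key=lambda ch: uct_score(ch["V"], ch["N"],
                                                  parent_search_times))

# the two actions from the worked example, each searched once
best = select_target([{"V": 1.356, "N": 1}, {"V": 4.8505, "N": 1}],
                     parent_search_times=2)
```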
Optionally, the extension unit includes: the probability obtaining subunit is used for obtaining the state transition probability corresponding to the target node; the operation determining subunit is used for determining a target execution operation corresponding to the target node based on the state transition probability; and the creating subunit is used for creating a new child node below the target node based on the target execution operation.
Optionally, the simulation unit includes: a probability determination subunit, configured to determine a target child node based on a probability corresponding to the new child node; the simulation subunit is used for simulating the target child node; the simulation determining subunit is used for determining that the simulation is finished under the condition that the simulation time reaches the preset simulation time or the simulation depth reaches the preset simulation depth; and the value determining subunit is used for determining the push value corresponding to the last child node.
Optionally, the simulation unit further includes: the data determining subunit is used for determining whether the associated data corresponding to the last child node is repeated data repeatedly pushed in a target time period; the first processing subunit is used for responding to the fact that the associated data are repeated data, processing the last child node by using the cost estimation model and the value estimation model, and obtaining the pushing value of the last child node; and the second processing subunit is used for responding to the fact that the associated data is not the repeated data, and processing the last child node by using the value estimation model to obtain the push value of the last child node.
Optionally, the second determining unit includes: the value acquisition subunit is used for acquiring the total value of the expansion node based on the push value, the reward and the state transition probability of at least one child node positioned on the current layer; the value updating child unit is used for updating the push value of the parent node positioned at the upper layer based on the maximum value of the total value of the expansion nodes, wherein the target child node is a child node which has an incidence relation with at least one child node; and the execution subunit is used for repeatedly executing the functions of the value acquisition subunit and the value updating subunit until the pushed value of the root node is updated.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 executes the respective methods and processes described above, such as the data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When loaded into RAM 803 and executed by the computing unit 801, a computer program may perform one or more steps of the data processing method described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method of data processing, comprising:
acquiring a recall data set;
constructing a search tree corresponding to the recall data set, wherein the search tree comprises: the data nodes are used for representing recall data in the recall data set, each data node is used for storing the push value and the search times of corresponding recall data, and the push value is used for representing the value of a feedback result received after the corresponding recall data is pushed to a target object;
and determining target data in the recall data set based on the push value of each recall data in the recall data set, wherein the target data is data pushed to the target object.
2. The method of claim 1, wherein constructing the search tree to which the recall data set corresponds comprises:
step A, determining a target node to be searched;
b, expanding new child nodes below the target node, and determining the reward of the new child nodes by using a value estimation model;
step C, simulating a target child node in the new child node, and determining the push value of the last child node when the simulation is finished;
step D, performing reverse iteration on the search tree based on the push value of the last child node, updating the push values of the target node and each layer of child nodes below the target node, and updating the search times of the target node;
and repeating the step A to the step D until the exploration time of the search tree reaches the preset exploration time or the exploration depth reaches the preset exploration depth.
3. The method of claim 2, determining a target node to search comprises:
traversing from the root node to determine whether an unexpanded node exists;
in response to the presence of the unexpanded node, determining the unexpanded node to be the target node;
in response to the absence of the unexpanded node, determining the target node based on the push value and the number of searches of each data node.
4. The method of claim 2, extending new child nodes below the target node comprises:
acquiring state transition probability corresponding to the target node;
determining a target execution operation corresponding to the target node based on the state transition probability;
creating the new child node under the target node based on the target execution operation.
5. The method of claim 2, wherein simulating a target child node of the new child nodes, and determining the push value corresponding to the last child node when simulation ends comprises:
determining the target child node based on the probability corresponding to the new child node;
simulating the target child node;
determining that the simulation is finished under the condition that the simulation time reaches the preset simulation time or the simulation depth reaches the preset simulation depth;
determining the push value corresponding to the last child node.
6. The method of claim 5, further comprising, prior to determining the push value corresponding to the last child node:
determining whether the associated data corresponding to the last child node is repeated data repeatedly pushed in a target time period;
responding to the fact that the associated data are the repeated data, and processing the last child node by using a cost estimation model and a value estimation model to obtain the push value of the last child node;
and responding to the fact that the associated data is not the repeated data, and processing the last child node by using the value estimation model to obtain the push value of the last child node.
7. The method of claim 2, the reverse iteration of the search tree based on the push value of the last child node, updating the push values of the target node and each level of child nodes below the target node comprising:
step a, obtaining the total value of an extended node based on the push value, the reward and the state transition probability of at least one child node positioned on the current layer;
step b, obtaining the maximum value of the total value of the expansion nodes to obtain the push value of the father node positioned at the upper layer;
and repeating the steps a to b until the push value of the root node is updated.
8. A data processing apparatus comprising:
the acquisition module is used for acquiring a recall data set;
a building module, configured to build a search tree corresponding to the recall data set, where the search tree includes: the data nodes are used for representing recall data in the recall data set, each data node is used for storing the push value and the search times of corresponding recall data, and the push value is used for representing the value of a feedback result received after the corresponding recall data is pushed to a target object;
and the decision module is used for determining target data in the recall data set based on the push value of each recall data in the recall data set, wherein the target data is data pushed to the target object.
9. The apparatus of claim 8, wherein the building module comprises:
a first determining unit, configured to determine a target node to be searched, where the target node is used to represent the root node or recall data in the recall data set;
the extension unit is used for extending a new child node below the target node and determining the reward of the new child node by using a value estimation model;
the simulation unit is used for simulating a target child node in the new child node and determining the push value of the last child node when the simulation is finished;
a second determining unit, configured to perform reverse iteration on the search tree based on the push value of the last child node, and determine the push values of the target node and each layer of child nodes below the target node;
and the execution unit is used for repeatedly executing the functions of the determination unit, the expansion unit, the simulation unit and the second determination unit until the exploration time of the search tree reaches a preset exploration time or the exploration depth reaches a preset exploration depth.
10. The apparatus of claim 9, the first determination unit comprising:
the traversal subunit is used for starting traversal from the root node and determining whether an unexpanded node exists or not;
a first node determining subunit, configured to determine, in response to existence of the unexpanded node, that the unexpanded node is the target node;
a second node determining subunit, configured to determine the target node based on the push value and the search count of each data node in response to the absence of an unexpanded node.
11. The apparatus of claim 9, the extension unit comprising:
a probability obtaining subunit, configured to obtain a state transition probability corresponding to the target node;
an operation determination subunit, configured to determine, based on the state transition probability, a target execution operation corresponding to the target node;
and the creating subunit is used for creating the new child node below the target node based on the target execution operation.
12. The apparatus of claim 9, the emulation unit comprising:
a probability determination subunit, configured to determine the target child node based on a probability corresponding to the new child node;
the simulation subunit is used for simulating the target child node;
the simulation determining subunit is used for determining that the simulation ends when the simulation count reaches a preset simulation count or the simulation depth reaches a preset simulation depth;
and the value determining subunit is used for determining the push value corresponding to the last child node.
13. The apparatus of claim 12, the emulation unit further comprising:
the data determining subunit is configured to determine whether the associated data corresponding to the last child node is repeated data repeatedly pushed within a target time period;
the first processing subunit is configured to, in response to that the associated data is the duplicate data, process the last child node by using a cost estimation model and a value estimation model to obtain a push value of the last child node;
and the second processing subunit is configured to, in response to that the associated data is not the duplicate data, process the last child node by using the value estimation model, so as to obtain a push value of the last child node.
14. The apparatus of claim 9, the second determination unit comprising:
the value acquisition subunit is used for acquiring the total value of an extension node based on the push value, the reward and the state transition probability of at least one child node located at the current layer;
the value updating subunit is used for updating the push value of the parent node located at the upper layer based on the maximum of the total values of the extension nodes;
and the execution subunit is used for repeatedly executing the functions of the value acquisition subunit and the value updating subunit until the update of the push value of the root node is completed.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202111355773.8A 2021-11-16 2021-11-16 Data processing method and device, electronic equipment and storage medium Pending CN114036388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111355773.8A CN114036388A (en) 2021-11-16 2021-11-16 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111355773.8A CN114036388A (en) 2021-11-16 2021-11-16 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114036388A true CN114036388A (en) 2022-02-11

Family

ID=80144476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111355773.8A Pending CN114036388A (en) 2021-11-16 2021-11-16 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114036388A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637775A (en) * 2022-03-29 2022-06-17 哈尔滨工业大学 Query optimization system, method and equipment based on Monte Carlo tree search and reinforcement learning
CN116628448A (en) * 2023-05-26 2023-08-22 兰州理工大学 Sensor management method based on deep reinforcement learning in extended target
CN116628448B (en) * 2023-05-26 2023-11-28 兰州理工大学 Sensor management method based on deep reinforcement learning in extended target

Similar Documents

Publication Publication Date Title
CN110490717B (en) Commodity recommendation method and system based on user session and graph convolution neural network
US11538060B2 (en) Systems and methods for search query refinement
US11397772B2 (en) Information search method, apparatus, and system
US11811881B2 (en) Systems and methods for webpage personalization
EP4016432A1 (en) Method and apparatus for training fusion ordering model, search ordering method and apparatus, electronic device, storage medium, and program product
US11138681B2 (en) Inference model for traveler classification
EP3617952A1 (en) Information search method, apparatus and system
US8190537B1 (en) Feature selection for large scale models
US20220215296A1 (en) Feature effectiveness assessment method and apparatus, electronic device, and storage medium
CN114265979B (en) Method for determining fusion parameters, information recommendation method and model training method
CN105991397B (en) Information dissemination method and device
CN105208113A (en) Information pushing method and device
CN111723292B (en) Recommendation method, system, electronic equipment and storage medium based on graph neural network
CN114036388A (en) Data processing method and device, electronic equipment and storage medium
CN110851699A (en) Deep reinforcement learning-based information flow recommendation method, device, equipment and medium
CN112765452B (en) Search recommendation method and device and electronic equipment
US20180218063A1 (en) Systems and methods for automated recommendations
US20220269835A1 (en) Resource prediction system for executing machine learning models
CN111814050A (en) Tourism scene reinforcement learning simulation environment construction method, system, equipment and medium
US11256748B2 (en) Complex modeling computational engine optimized to reduce redundant calculations
CN115456708A (en) Recommendation model training method and device, electronic equipment and storage medium
CN112417304B (en) Data analysis service recommendation method and system for constructing data analysis flow
CN114329231A (en) Object feature processing method and device, electronic equipment and storage medium
CN113821722B (en) Data processing method, recommending device, electronic equipment and medium
US20230140148A1 (en) Methods for community search, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination