CN113283171A - Industrial platform resource optimal allocation device and method - Google Patents

Industrial platform resource optimal allocation device and method

Info

Publication number
CN113283171A
Authority
CN
China
Prior art keywords
resource
robot
appeal
unit
distribution system
Prior art date
Legal status
Pending
Application number
CN202110582489.8A
Other languages
Chinese (zh)
Inventor
吴帆 (Wu Fan)
郭李毅 (Guo Liyi)
郑臻哲 (Zheng Zhenzhe)
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202110582489.8A
Publication of CN113283171A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/04 Constraint-based CAD
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/08 Probabilistic or stochastic CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Educational Administration (AREA)
  • Feedback Control In General (AREA)

Abstract

An industrial platform resource optimization allocation device and method, comprising a content distribution system and a resource library. The content distribution system generates a resource prediction request for a robot, outputs it to the resource library, performs optimal resource allocation according to the resource library's feedback, and, while completing the robot service process, updates the neural network model in its appeal prediction unit based on newly added data. The resource library receives the resource prediction request sent by the content distribution system, predicts the potentially allocable optimal resource configuration, receives the resource application from the content distribution system's resource scheduling unit, and allocates resources based on that application. By modeling the robot's appeal and the optimization target, the method recommends to the robot the resources that can be allocated to it, learns the rationality of resource allocation from the robot's feedback, and breaks the information-asymmetry impasse between the server side and the robot.

Description

Industrial platform resource optimal allocation device and method
Technical Field
The invention relates to a technology in the field of industrial mass information processing, in particular to an industrial platform resource optimal allocation device and method.
Background
With the development of informatization, industrial systems are growing ever larger in scale. For example, in a large-scale distributed task or system (e.g., a crowd-sourcing task covering multiple regions, or the content distribution task of a recommendation system on the internet), robots (or agents) driven by intelligent algorithms must each complete their respective tasks. However, when the data scale is large, or when a robot cannot disclose all of its information to the server for some reason, the control server cannot store the information of all robots, and the server cannot simultaneously allocate the required resources and tasks to all of them.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides an industrial platform resource optimization allocation device and method.
The invention is realized by the following technical scheme:
The invention relates to an industrial platform resource optimization allocation device, comprising a content distribution system and a resource library, wherein: the content distribution system generates a resource prediction request for the robot and outputs it to the resource library, performs optimal resource allocation according to the resource library's feedback, and, while completing the robot service process, updates the neural network model in its appeal prediction unit based on newly added data; the resource library receives the resource prediction request sent by the content distribution system, predicts the potentially allocable optimal resource configuration, receives the resource application of the content distribution system's resource scheduling unit, and allocates resources based on that application.
The content distribution system comprises an interaction unit, an appeal prediction unit, a feature storage unit, a resource scheduling unit and a network training unit, wherein: the interaction unit receives the robot's resource request and sends the robot ID and budget to the appeal prediction unit; the appeal prediction unit sends the robot ID to the feature storage unit; the feature storage unit sends the robot features to the appeal prediction unit; the neural network in the appeal prediction unit predicts the robot's appeal based on those features and sends the appeal and budget to the resource library; the appeal prediction unit forwards the resource prediction result from the resource library to the interaction unit, which asks the robot whether it adopts the result; when the robot adopts the resource scheduling result, the robot-authorized scheduling result is sent to the resource scheduling unit; the resource scheduling unit sends a resource application request to the resource library; the resource scheduling unit sends the resources to the robot; and after the round of interaction ends, the interaction unit sends the latest round of interaction data to the feature storage unit.
The neural network model is trained in the following way: the network training unit sends a data request to the feature storage unit; the feature storage unit sends the training data to the network training unit; the network training unit trains the neural network model and updates the neural network model in the appeal prediction unit.
Technical effects
The invention as a whole addresses the defect in the prior art that, owing to limits on the robot's communication or expression capacity, or on the storage and computing capacity of the resource distribution system, the robot's appeal is difficult to express clearly and hence difficult to satisfy individually, leading to low system resource allocation efficiency.
Compared with the prior art, the method models the robot's appeal and the optimization target, recommends to the robot the resources that can be allocated to it, and learns the rationality of resource allocation from the robot's feedback. The designed system can collect the robot's demand information for different resources from its adoption behavior, breaking the information-asymmetry impasse between the server and the robot.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a schematic diagram of the internal structure of a content distribution system;
FIG. 3 is a graph showing the results of comparative experiments on the models of the examples;
in the figure: a) cumulative expected regret at different Dropout ratios, b) cumulative adoption rate at different Dropout ratios;
FIG. 4 is a diagram illustrating the impact of related information in accordance with an exemplary embodiment;
in the figure: a) impact of appeal-related information on the cumulative expected regret, b) impact of appeal-related information on the cumulative adoption rate.
Detailed Description
As shown in Fig. 1, the present embodiment relates to an industrial platform information optimized allocation apparatus, comprising a content distribution system and a resource library, wherein: the resource library returns resources to the content distribution system according to the resource budget applied for, and the content distribution system predicts the robot's appeal, performs optimal resource allocation for the robot, and updates the neural network model in appeal prediction unit 2 based on newly added data. The content distribution system receives the budget information of the robot's resource application, performs appeal prediction based on the robot's historical data, and allocates the optimal resource configuration to the robot.
As shown in Fig. 2, the content distribution system comprises interaction unit 1, appeal prediction unit 2, feature storage unit 3, resource scheduling unit 4 and network training unit 5, wherein: interaction unit 1 receives the robot's resource request and sends the robot ID and budget to appeal prediction unit 2; appeal prediction unit 2 sends the robot ID to feature storage unit 3; feature storage unit 3 sends the robot features to appeal prediction unit 2; the neural network in appeal prediction unit 2 predicts the robot's appeal based on those features and sends the appeal and budget to the resource library; the resource library calculates the potentially allocable resources and sends them to appeal prediction unit 2; appeal prediction unit 2 sends the resource prediction result to interaction unit 1, which asks the robot whether it adopts the result; when the robot adopts the resource scheduling result, interaction unit 1 sends the robot-authorized result to resource scheduling unit 4; resource scheduling unit 4 sends a resource application request to the resource library; the resource library allocates the corresponding resources to resource scheduling unit 4; resource scheduling unit 4 sends the resources to the robot; and when the round of interaction ends, interaction unit 1 sends the latest round of interaction data to feature storage unit 3.
The training process of the neural network model comprises the following steps: (a) the network training unit 5 sends a data request to the feature storage unit 3; (b) the feature storage unit sends the training data to the network training unit 5; (c) the network training unit trains the neural network model and updates the neural network model in the appeal prediction unit 2.
The embodiment relates to an industrial platform information optimized allocation process using the above apparatus, performed as follows: when a robot initiates a resource application request, the content distribution system parses the robot's relevant information from the request, generates an estimated robot appeal, sends the robot appeal, the budget allocated to the robot and other information to the resource library, and queries the allocable resources; the resource library estimates the obtainable resources according to the appeal and budget provided by the content distribution system and returns the predicted allocable resource v = [v_1, v_2, …, v_n]^T; the content distribution system sends this resource application result to the robot and, according to the robot's adoption feedback signal, distributes the real demand-based resource result to the robot through the resource library.
In conclusion, the system can collect the robot's preference information for different resources from its adoption behavior, break the information-asymmetry impasse between the server and the robot, and better configure the overall resources in the resource library.
The allocable resources refer to the resource result that the robot can obtain under various constraints such as budget, specifically v = [v_1, v_2, …, v_n]^T, where n denotes the number of resource classes and the value v_i denotes the amount of the i-th dimension resource.
The relevant information of the robot comprises: the robot's resource application budget and the robot's preferences for different resources, i.e. the appeal weight vector w = [w_1, w_2, …, w_n]^T, where w_i denotes the robot's preference weight for the i-th dimension resource.
This embodiment defines Π as a resource allocation method. When the robot's appeal vector is w*, the server can configure the robot's optimal resources based on that vector. Let the optimization objective be w*^T · v, i.e. the appeal-weighted sum of the resource result. For a robot with appeal w*, the optimal resource allocation strategy Π_{w*} recommended to it solves the optimization problem:

Π_{w*} = argmax_Π w*^T · v_Π

where the resource application result v_Π is the optimal solution reachable by the resource library under strategy Π, and w*^T · v_Π is the utility the robot can obtain under strategy Π. The weight vector w* can thus help the content distribution platform find the most satisfactory resource application result in the resource library.
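A minimal sketch of this selection step, assuming the resource library exposes a finite set of feasible allocations (the patent leaves the library's internal solver unspecified; the function name, candidate list and numbers below are illustrative):

```python
import numpy as np

def recommend_allocation(w_star: np.ndarray, candidates: list) -> np.ndarray:
    """Return the feasible allocation v maximizing the appeal-weighted
    utility w*^T v, mirroring the optimization problem above."""
    utilities = [float(w_star @ v) for v in candidates]
    return candidates[int(np.argmax(utilities))]

# Example: 3 resource classes; the robot weights the second resource highest.
w_star = np.array([0.2, 0.5, 0.3])
feasible = [np.array([4.0, 1.0, 2.0]), np.array([1.0, 5.0, 1.0])]
best = recommend_allocation(w_star, feasible)   # -> array([1., 5., 1.])
```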
When a recommendation is made, the robot adopts it if it is satisfied with the expected result; otherwise the robot skips the recommendation. Based on this observation, the present embodiment models the problem to be optimized as a contextual bandit problem and designs a concrete algorithm program, specifically comprising:
1) State: the robot-related information the platform can observe, e.g. robot features, the budget that may be allocated to the robot, and the robot's historical queries and adoptions.
2) Action: the estimated robot appeal vector. The action selection space of the algorithm program is a high-dimensional continuous space. The algorithm program sends a request to the resource library according to the constraint information in the state and its selected action, i.e. the estimated robot appeal vector, and obtains the resource recommendation predicted by the resource library.
3) Reward: this embodiment sets the reward as the adoption behavior of the robot.
Based on this modeling, the contextual bandit algorithm can continuously make strategy recommendations for visiting robots (a skeleton of one round is sketched after the following list). In each round of recommendation:
1) the algorithm program observes the state of the robot in the round of recommendations.
2) The algorithm program selects an appeal vector based on the state and transmits the appeal vector together with constraint information such as the budget to the resource library, which estimates the resource result. The algorithm program recommends this result to the robot and obtains the robot's feedback.
3) The algorithm program stores the observations of the round (robot state, appeal vector, resource allocation result, robot feedback) as training data to update its own intelligent recommendation strategy.
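A skeleton of one such interaction round; all four arguments are hypothetical stand-ins for components the patent describes only abstractly, and the method names are illustrative:

```python
def bandit_round(policy, resource_library, robot, replay_buffer):
    """One round of the recommendation loop described above."""
    x = robot.observe_state()                 # 1) state: features, budget, history
    w = policy.select_appeal_vector(x)        # 2) action: estimated appeal vector
    v = resource_library.predict_allocation(w, robot.budget)
    adopted = robot.recommend(v)              # robot feedback: adopt or skip
    replay_buffer.append((x, w, v, adopted))  # 3) store this round's observation
    policy.update(replay_buffer)              # refresh the recommendation strategy
    return adopted
```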
Estimating the action value: in a classic contextual bandit algorithm, the algorithm program pulls an arm based on some policy according to the observed context and learns the expected value of each selectable action. In the problem of this embodiment, the reward of the server side's selected action is whether the robot adopts the recommendation, and maximizing the reward corresponds to recommending the strategy the robot is most likely to adopt.
The action value estimation process in the embodiment includes:
1) action selection is performed based on the observable information and the action selection policy.
2) Establishing the relation among the selected action, the resource application result and the robot adoption rate.
In this embodiment, the relation between robot information and action selection is described first: from the information observable by the server, an appeal vector w = f(x) is obtained under a certain action selection policy, where the function f is a multilayer perceptron representing the mapping from the environment state x to the appeal information w, and the input x of f is the feature representation of the environment. In the problem of this embodiment, the output of the network is w, and the supervision signal of the network (i.e. the value of the action) is the behavior of the robot. Intuitively, let v be the optimal resource allocation result under w; the value w^T · v then reflects the utility that a robot with appeal w can obtain on the platform. The robot's adoption rate is therefore positively correlated with w^T · v to some degree.
In the present embodiment, p(adopt) = σ(w^T · v) represents the relation between the robot adoption rate and w^T · v, where σ is the sigmoid function with range [0, 1], and the optimal resource application result v based on w is also part of the model input.
On this basis, the network's estimate of the action value can be updated by gradient descent. In each round of gradient update, this embodiment updates the model parameters through the loss function L, specifically the cross-entropy between the adoption label and the predicted adoption rate:

L = -(1/N) Σ_{i=1}^{N} [ y_i · log p(x_i, v_i) + (1 - y_i) · log(1 - p(x_i, v_i)) ]

where the set D = {(x_i, v_i, y_i)}_{i=1}^{N} is the data set of size N in this round of updating, the environment feature x and the resource application result v are inputs of the model, p(x, v) is the adoption rate predicted by the model, and the label y is the adoption label. During training, the environment feature x is fed in first to obtain the model's appeal output w, the result v is obtained according to w, and finally the model's estimated adoption rate p(x, v) is obtained.
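A sketch of this model and one gradient step in PyTorch, assuming a small MLP for f and the cross-entropy loss above; the class name `AppealNet`, the layer sizes and the dummy batch are illustrative:

```python
import torch
import torch.nn as nn

class AppealNet(nn.Module):
    """MLP f mapping the environment state x to an appeal vector w;
    adoption probability is modeled as sigma(w^T v)."""
    def __init__(self, state_dim: int, n_resources: int,
                 hidden: int = 64, p_drop: float = 0.4):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Dropout(p_drop),                 # also used later for Thompson sampling
            nn.Linear(hidden, n_resources),
        )

    def forward(self, x: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        w = self.f(x)                           # appeal output w = f(x)
        return torch.sigmoid((w * v).sum(-1))   # p(adopt) = sigma(w^T v)

model = AppealNet(state_dim=16, n_resources=3)
loss_fn = nn.BCELoss()                          # the cross-entropy loss L above
optimizer = torch.optim.Adam(model.parameters())

# One gradient step on a dummy batch of (x, v, y) triples.
x = torch.randn(8, 16)                          # environment features
v = torch.rand(8, 3)                            # stored resource application results
y = torch.randint(0, 2, (8,)).float()           # adoption labels
optimizer.zero_grad()
loss = loss_fn(model(x, v), y)
loss.backward()
optimizer.step()
```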
Action selection algorithm: this embodiment uses Thompson sampling for action selection, a popular means of trading off Exploration and Exploitation. Generally speaking, Thompson sampling requires a Bayesian treatment of the model parameters. At each step, Thompson sampling re-samples a new set of model parameters and then selects actions based on that set. This can be seen as a randomized probe: more likely parameters are sampled more frequently and are therefore rejected or confirmed more quickly.
Thompson sampling comprises the following steps: sampling a new set of model parameters; selecting the action with the highest expected reward under the sampled parameters; and updating the model parameters.
Thompson sampling of a neural network model requires characterizing model uncertainty. Bayesian models provide a mathematical framework for reasoning about model uncertainty, but usually at prohibitive computational cost. Dropout temporarily discards a portion of the neurons of a neural network with a certain probability during training. Gal et al., in "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", propose using Dropout as a Bayesian approximation to represent model uncertainty in deep learning, and show that a nonlinear neural network of arbitrary depth with Dropout applied before every weight layer is mathematically equivalent to an approximation of a deep probabilistic Gaussian process. Furthermore, Dropout, as a simple and common technique for preventing neural network overfitting, has been widely used in training neural networks owing to its ease of implementation, efficiency and effectiveness. This embodiment therefore uses Dropout in the neural network for Thompson sampling, which is very simple yet effective.
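A sketch of this Dropout-based Thompson sampling, reusing the hypothetical AppealNet from the previous sketch: keeping Dropout active at inference makes each forward pass an approximate posterior draw, so each call returns one sampled action (appeal vector) rather than a deterministic estimate:

```python
import torch

def sample_appeal_vector(model: "AppealNet", x: torch.Tensor) -> torch.Tensor:
    """One Thompson-sampling action draw via Monte Carlo Dropout."""
    model.train()                 # keep Dropout on (deliberately not eval())
    with torch.no_grad():
        return model.f(x)         # one stochastic draw of w = f(x)

# Repeated calls on the same state can return different appeal vectors; the
# chosen w is sent to the resource library, and the observed feedback is used
# to update the model as in the previous sketch.
w_t = sample_appeal_vector(model, torch.randn(16))
```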
In the experiments, the input features of the model are the appeal-related features and the historical adoption information of the robot. The appeal-related features are concatenated to form one of the model's inputs. During training, this embodiment trains the network model with mini-batch gradient descent. To prevent the ratio of positive to negative samples seen by the model from drifting as training proceeds, which would affect model performance, this embodiment fixes the ratio of positive to negative samples in each training batch at 1:1 (see the sketch below). The optimizer used in model training is Adam, and the experimental results are shown in Table 1.
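A sketch of the 1:1 batch construction just described; the helper name and the index arrays are illustrative:

```python
import numpy as np

def balanced_batch(pos_idx: np.ndarray, neg_idx: np.ndarray,
                   batch_size: int, rng: np.random.Generator) -> np.ndarray:
    """Draw half of each training batch from adopted (positive) interactions
    and half from skipped (negative) ones, keeping the batch label ratio at
    1:1 regardless of how the collected data grows."""
    half = batch_size // 2
    pos = rng.choice(pos_idx, size=half, replace=len(pos_idx) < half)
    neg = rng.choice(neg_idx, size=half, replace=len(neg_idx) < half)
    return np.concatenate([pos, neg])

rng = np.random.default_rng(0)
batch = balanced_batch(np.array([0, 2, 5]), np.array([1, 3, 4, 6]), 4, rng)
```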
The optimization goal of the contextual bandit is to minimize the expected regret over T rounds of recommendation. This embodiment therefore evaluates model performance by the cumulative expected regret

R_T = Σ_{t=1}^{T} (p*_t - p_t)

and the cumulative adoption rate

A_T = (1/T) Σ_{t=1}^{T} p_t

where T denotes that the experiment has run T rounds of interaction, the value p*_t denotes the adoption rate of the recommendation made in round t based on the robot's internal appeal w*_t, and the value p_t denotes the adoption rate of the recommendation made with the w_t output by the action selection algorithm in round t.
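A sketch of how these two metrics could be computed from per-round adoption rates, under the reconstruction above; `p_opt` and `p_alg` are hypothetical arrays holding p*_t and p_t:

```python
import numpy as np

def cumulative_metrics(p_opt: np.ndarray, p_alg: np.ndarray):
    """p_opt[t]: adoption rate of the round-t recommendation based on the
    robot's internal appeal w*_t; p_alg[t]: adoption rate of the round-t
    recommendation based on the algorithm's output w_t."""
    regret = np.cumsum(p_opt - p_alg)                            # R_T for each T
    adoption = np.cumsum(p_alg) / np.arange(1, len(p_alg) + 1)   # A_T for each T
    return regret, adoption
```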
TABLE 1
(Comparative results for the random appeal recommendation baseline and the models with different Dropout ratios; the table is rendered as an image in the original document.)
The comparative experiment results are as follows. In a simulation experiment, this embodiment verifies the effectiveness of the contextual bandit algorithm. The comparison covers the effect of the model without Dropout and with Dropout at different ratios. This embodiment also introduces, as a weak baseline, a random appeal recommendation strategy that applies no appeal estimation algorithm. In each set of experiments, the algorithm program and the environment interact for 2000 rounds, the current cumulative expected regret and cumulative adoption rate are recorded every fixed number of rounds, and the results after the interaction are shown in Table 1. From the results, this embodiment finds that the random appeal recommendation strategy performs much worse on the evaluation indices, which indicates that the robot's appeal must be considered when recommending a strategy. Fig. 3 shows the cumulative expected regret and cumulative adoption rate curves at different Dropout ratios. Since different algorithms were found to converge to different locally optimal solutions in the experiments, and the expected regret grows approximately linearly at some slope after model convergence, this embodiment preprocesses the cumulative expected regret by y = log(x + 1) and normalizes the experimental results before plotting, in order to better expose the performance differences after convergence. From the trends in Fig. 3 and the real-time cumulative expected regret and cumulative adoption rate during the experiments, this embodiment finds that the increment of the cumulative expected regret of all models gradually decreases and converges during training, while the cumulative adoption rate of all models gradually increases and converges. This shows that although different models converge to different locally optimal solutions, all of them learn the robot's appeal to some degree and improve the performance of the recommendation system. For example, in Table 1, even the model without Dropout reduces the cumulative expected regret by 25.71% relative to the random appeal recommendation strategy (which has no learning module).
In the experiments, the action selection algorithm that uses Dropout for exploration outperforms the one without Dropout: action sampling with Dropout can be approximately regarded as Thompson sampling, which balances Exploration and Exploitation and samples the model's action space better, so the model converges to a better locally optimal solution. In the four sets of experiments with Dropout ratios of 20%, 40%, 60% and 80%, model performance first increases and then decreases as the Dropout ratio grows. This may be because, at a low Dropout ratio, the model adopts a more conservative exploration strategy and is more likely to converge to a worse locally optimal solution, while at a high Dropout ratio the model explores too frequently and cannot fully exploit the learned knowledge, degrading performance. With a Dropout ratio of 40%, the model performs better both during training and after convergence than with the other Dropout ratios, which shows that setting an appropriate Dropout ratio to balance exploration and exploitation can optimize model performance.
As shown in Fig. 3, the cumulative adoption rate may decrease in the early stage of interaction, which may be caused by the large uncertainty of the early model. After analyzing the real-time cumulative expected regret and cumulative adoption rate during the experiments, this embodiment finds that the increment of the cumulative expected regret drops markedly during the same period in which the cumulative adoption rate falls, indicating that the model learns the robot's appeal better through exploration.
To verify the generalization ability of the model, this embodiment performed a controlled experiment. The experimental group is the model with a Dropout ratio of 40%; the control group is the same model, but with the appeal-related information in its input randomized. The experimental results, processed in the same way as for Fig. 3, are shown in Fig. 4 and Table 1. They show that the model with appeal-related information in its input outperforms the model without it, which indicates that the model learns the robot's appeal better through the appeal-related information.
Conventional means do not establish an interaction process around the robot's satisfaction with resource allocation, do not model the robot's appeal preferences, do not use the robot's feedback signal on the resource allocation result to learn those preferences, do not use the exploration-and-exploitation paradigm of online learning to optimize personalized appeal recommendations for the robot, and do not achieve large-scale generalized application of a robot appeal recommendation strategy.
Compared with the prior art, the method significantly improves the satisfaction rate of the robot's appeal, the resource allocation efficiency, and the generalization of the recommendation strategy.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (6)

1. An industrial platform resource optimization allocation device, comprising a content distribution system and a resource library, wherein: the content distribution system generates a resource prediction request for the robot and outputs it to the resource library, performs optimal resource allocation according to the resource library's feedback, and, while completing the robot service process, updates the neural network model in its appeal prediction unit based on newly added data; and the resource library receives the resource prediction request sent by the content distribution system, predicts the potentially allocable optimal resource configuration, receives the resource application of the content distribution system's resource scheduling unit, and allocates resources based on that application.
2. The industrial platform resource optimization allocation device according to claim 1, wherein the content distribution system comprises an interaction unit, an appeal prediction unit, a feature storage unit, a resource scheduling unit and a network training unit, wherein: the interaction unit receives the robot's resource request and sends the robot ID and budget to the appeal prediction unit; the appeal prediction unit sends the robot ID to the feature storage unit; the feature storage unit sends the robot features to the appeal prediction unit; the neural network in the appeal prediction unit predicts the robot's appeal based on those features and sends the appeal and budget to the resource library; the appeal prediction unit forwards the resource prediction result from the resource library to the interaction unit, which asks the robot whether it adopts the result; when the robot adopts the resource scheduling result, the robot-authorized scheduling result is sent to the resource scheduling unit; the resource scheduling unit sends a resource application request to the resource library; the resource scheduling unit sends the resources to the robot; and after the round of interaction ends, the interaction unit sends the latest round of interaction data to the feature storage unit.
3. The device as claimed in claim 2, wherein the neural network model is trained by: the network training unit sends a data request to the feature storage unit; the feature storage unit sends the training data to the network training unit; the network training unit trains the neural network model and updates the neural network model in the appeal prediction unit.
4. A method for optimized distribution of industrial platform information using the device of any one of claims 1 to 3, wherein: when a robot issues a resource application request, the content distribution system analyzes the robot's relevant information from the request, generates an estimated robot appeal, sends the robot appeal, the budget allocated to the robot and other information to the resource library, and queries the allocable resources; the resource library estimates the obtainable resources according to the appeal and budget provided by the content distribution system and returns the predicted allocable resource v = [v_1, v_2, …, v_n]^T; and the content distribution system sends this resource application result to the robot and, according to the robot's adoption feedback signal, distributes the real demand-based resource result to the robot through the resource library.
5. The method for optimized distribution of industrial platform information as claimed in claim 4, wherein the allocable resources are the resource result that the robot can obtain under various constraints such as budget, specifically v = [v_1, v_2, …, v_n]^T, where n denotes the number of resource classes and the value v_i denotes the amount of the i-th dimension resource.
6. The method as claimed in claim 4, wherein the robot-related information comprises: the robot's resource application budget and the robot's preferences for different resources, i.e. the appeal weight vector w = [w_1, w_2, …, w_n]^T, where w_i denotes the robot's preference weight for the i-th dimension resource.
CN202110582489.8A 2021-05-27 2021-05-27 Industrial platform resource optimal allocation device and method Pending CN113283171A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110582489.8A CN113283171A (en) 2021-05-27 2021-05-27 Industrial platform resource optimal allocation device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110582489.8A CN113283171A (en) 2021-05-27 2021-05-27 Industrial platform resource optimal allocation device and method

Publications (1)

Publication Number Publication Date
CN113283171A true CN113283171A (en) 2021-08-20

Family

ID=77281828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110582489.8A Pending CN113283171A (en) 2021-05-27 2021-05-27 Industrial platform resource optimal allocation device and method

Country Status (1)

Country Link
CN (1) CN113283171A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100017516A1 (en) * 2008-07-16 2010-01-21 General Instrument Corporation Demand-driven optimization and balancing of transcoding resources
CN101836227A (en) * 2007-08-06 2010-09-15 汤姆森许可贸易公司 Method and system for product services analysis and optimization
CN107888669A (en) * 2017-10-31 2018-04-06 武汉理工大学 A kind of extensive resource scheduling system and method based on deep learning neutral net
CN111126641A (en) * 2019-11-25 2020-05-08 泰康保险集团股份有限公司 Resource allocation method and device
CN111491006A (en) * 2020-03-03 2020-08-04 天津大学 Load-aware cloud computing resource elastic distribution system and method
CN111930524A (en) * 2020-10-10 2020-11-13 上海兴容信息技术有限公司 Method and system for distributing computing resources
CN112291335A (en) * 2020-10-27 2021-01-29 上海交通大学 Optimized task scheduling method in mobile edge calculation
CN112418699A (en) * 2020-11-30 2021-02-26 腾讯科技(深圳)有限公司 Resource allocation method, device, equipment and storage medium
CN112565378A (en) * 2020-11-30 2021-03-26 中国科学院深圳先进技术研究院 Cloud native resource dynamic prediction method and device, computer equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101836227A (en) * 2007-08-06 2010-09-15 汤姆森许可贸易公司 Method and system for product services analysis and optimization
US20100017516A1 (en) * 2008-07-16 2010-01-21 General Instrument Corporation Demand-driven optimization and balancing of transcoding resources
CN107888669A (en) * 2017-10-31 2018-04-06 武汉理工大学 A kind of extensive resource scheduling system and method based on deep learning neutral net
CN111126641A (en) * 2019-11-25 2020-05-08 泰康保险集团股份有限公司 Resource allocation method and device
CN111491006A (en) * 2020-03-03 2020-08-04 天津大学 Load-aware cloud computing resource elastic distribution system and method
CN111930524A (en) * 2020-10-10 2020-11-13 上海兴容信息技术有限公司 Method and system for distributing computing resources
CN112291335A (en) * 2020-10-27 2021-01-29 上海交通大学 Optimized task scheduling method in mobile edge calculation
CN112418699A (en) * 2020-11-30 2021-02-26 腾讯科技(深圳)有限公司 Resource allocation method, device, equipment and storage medium
CN112565378A (en) * 2020-11-30 2021-03-26 中国科学院深圳先进技术研究院 Cloud native resource dynamic prediction method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIYI GUO: "A Deep Prediction Network for Understanding Advertiser Intent and Satisfaction", CIKM '20 *
LIYI GUO: "We Know What You Want: An Advertising Strategy Recommender System for Online Advertising", arXiv *
吴帆 (Fan Wu): "Research on Dynamic Spectrum Management Based on Game Theory", Journal of Computer Research and Development (计算机研究与发展) *

Similar Documents

Publication Publication Date Title
Gronauer et al. Multi-agent deep reinforcement learning: a survey
CN111556461A (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
Djigal et al. Machine and deep learning for resource allocation in multi-access edge computing: A survey
CN113568727A (en) Mobile edge calculation task allocation method based on deep reinforcement learning
JP2007317068A (en) Recommending device and recommending system
CN114490057A (en) MEC unloaded task resource allocation method based on deep reinforcement learning
CN115686846B (en) Container cluster online deployment method integrating graph neural network and reinforcement learning in edge calculation
Štula et al. Continuously self-adjusting fuzzy cognitive map with semi-autonomous concepts
Hafez et al. Topological Q-learning with internally guided exploration for mobile robot navigation
Kolomvatsos et al. A proactive statistical model supporting services and tasks management in pervasive applications
Iqbal et al. Intelligent multimedia content delivery in 5G/6G networks: a reinforcement learning approach
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN113283171A (en) Industrial platform resource optimal allocation device and method
CN114942799B (en) Workflow scheduling method based on reinforcement learning in cloud edge environment
US12019712B2 (en) Enhanced reinforcement learning algorithms using future state prediction scaled reward values
CN116367190A (en) Digital twin function virtualization method for 6G mobile network
CN115220818A (en) Real-time dependency task unloading method based on deep reinforcement learning
CN112632615B (en) Scientific workflow data layout method based on hybrid cloud environment
CN113157344B (en) DRL-based energy consumption perception task unloading method in mobile edge computing environment
Kim Reinforcement learning
CN115150335A (en) Optimal flow segmentation method and system based on deep reinforcement learning
CN111027709B (en) Information recommendation method and device, server and storage medium
CN114449536A (en) 5G ultra-dense network multi-user access selection method based on deep reinforcement learning
Mishra et al. Model-free reinforcement learning for mean field games
Kumaran et al. Deep Reinforcement Learning algorithms for Low Latency Edge Computing Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210820