CN115392337A - Reinforcement learning mobile crowdsourcing incentive method based on user reputation - Google Patents

Reinforcement learning mobile crowdsourcing incentive method based on user reputation

Info

Publication number
CN115392337A
CN115392337A (publication) · CN202210790320.6A (application)
Authority
CN
China
Prior art keywords
user
users
function
reputation
service provider
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210790320.6A
Other languages
Chinese (zh)
Inventor
李先贤
张嘉林
石贞奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202210790320.6A
Publication of CN115392337A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0207 Discounts or incentives, e.g. coupons or rebates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Primary Health Care (AREA)
  • Operations Research (AREA)
  • Tourism & Hospitality (AREA)
  • Algebra (AREA)
  • Human Resources & Organizations (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computing Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a reinforcement learning mobile crowdsourcing incentive method based on user reputation for a single-server multi-user MCS system. The method comprises the following steps: 1) a reputation evaluation stage; 2) a task issuing stage; 3) a training stage. The method fully considers both the quality of the data uploaded by a user and the user's willingness to participate, incorporates these two criteria into the user's reputation index, and thereby helps the service provider screen trustworthy users effectively, reducing cost and increasing revenue.

Description

Reinforcement learning mobile crowdsourcing incentive method based on user reputation
Technical Field
The invention relates to mobile crowdsourcing and reinforcement learning techniques in edge computing, and in particular to a reinforcement learning mobile crowdsourcing incentive method based on user reputation.
Background
Mobile crowd sensing (MCS) has become a popular and widely adopted mode of urban sensing and data collection. In recent years, mobile devices such as mobile phones, tablet computers, wearable devices and vehicle-mounted smart devices have become widespread, and these devices have sensing, computing and communication capabilities. Compared with a traditional sensor network, an MCS system formed by such mobile devices is highly mobile and intelligent, can achieve wider coverage, and can meet dynamic sensing requirements. The sensing and computing capability of an MCS system depends on a large number of mobile devices participating in sensing tasks and contributing their sensing data; in practice, however, users are reluctant to contribute without compensation, and participating in a task may leak their privacy. Designing a reasonable incentive mechanism that encourages mobile users to keep participating in sensing tasks is therefore a challenge.
In an edge computing scenario, an MCS system is typically reduced to a cloud service provider (SP) and users with terminal mobile devices; the SP issues sensing tasks and rewards for participation to motivate mobile users to complete the tasks. Dishonest users in the MCS may submit false sensing data and make the task result inaccurate, so the SP must select users to upload data based on their reputation values, a higher reputation indicating higher-quality data. Social network effects between mobile user devices also influence participants' sensing strategies: a user gains additional benefit from information provided or shared by local neighbors in the social network and is therefore more willing to participate in sensing tasks. A reputation mechanism consequently needs to account for the willingness of a user as determined by social network effects; this is why both user willingness and contributed data quality matter in an incentive mechanism.
Disclosure of Invention
The invention aims to provide a reinforcement learning mobile crowdsourcing incentive method based on user reputation, addressing the rationality of user incentive methods for mobile crowd sensing in edge computing and the problem that neither the data quality nor the degree of participation of users during a sensing task can otherwise be guaranteed. The method fully considers both the quality of the data uploaded by a user and the user's willingness to participate, incorporates these two criteria into the user's reputation index, and thereby helps the service provider screen trustworthy users effectively, reducing cost and increasing revenue.
The technical scheme for realizing the purpose of the invention is as follows:
A reinforcement learning mobile crowdsourcing incentive method based on user reputation operates in a single-server multi-user MCS system comprising a service provider SP, edge nodes and mobile users. All mobile users form the set M = {1, 2, ..., M}, within which there is a set of optimal users N = {1, 2, ..., N}. A continuous decision period is divided into equal time slots t ∈ T = {1, 2, ..., T}, and in each time slot t the perception level of user i is denoted x_i^t, the perception levels of the users other than i are denoted x_{-i}^t, and the perception levels of all mobile users are collected as x^t = (x_i^t, x_{-i}^t). In each time period the service provider issues a sensing task to an edge node, the edge node distributes the task to users in the social network, each user decides on and executes the received task according to its own perception level and those of the other users, and the sensing data are uploaded to the provider SP, which scores each user's reputation according to data quality and participation willingness and updates the reputation record. The method comprises the following steps:
1) Reputation evaluation stage:
The user's willingness to participate is determined by the social network effect. After receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and of the service provider. The utility function of user i is defined in formula (1); it consists of a revenue term and a cost term. The first term is the monetary reward obtained by user i, determined by its perception level x_i^t as a linear function of x_i^t, where R^t denotes the total task reward in time period t, shared by all users; the user's share is also weighted by the user's own reputation, so the higher the reputation of user i, the higher its reward. The second term is the cost of user i participating in the task, in which the user's unit sensing cost appears as a factor.
The utility function of the service provider SP is its revenue minus the total reward R^t paid to the users; the revenue of the SP is obtained from the users' perception levels and expressed by a function φ(t), as shown in formula (2), where λ is a system parameter. The function in formula (2) reflects the diminishing return of the service provider SP on the service contributed by user i, while the ln function reflects the provider's diminishing return on the number of mobile users. All locally participating users are assumed to be in a social network G = [g_ij]_{i,j∈N} with g_ij ∈ {0,1}: if user i is influenced by user j then g_ij = 1, otherwise g_ij = 0, and social relations are assumed reciprocal, g_ij = g_ji. According to the Erdős–Rényi (E-R) model of the social network, users are connected with probability μ; the larger μ is, the more tightly the nodes are connected and the more users influence a given user i. The willingness of user i to participate in a sensing task is defined as shown in formula (3).
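The display formulas (1)-(3) are reproduced only as images in this text. For readability, one plausible reconstruction consistent with the surrounding description (reputation-weighted sharing of the total reward, linear sensing cost, a concave provider revenue, and neighbour-driven willingness) is sketched below; the symbols r_i^t (reputation weight), c_i^t (unit sensing cost) and w_i^t (willingness) are introduced here for illustration and are not taken from the published formulas.

```latex
% Plausible reconstructions of formulas (1)-(3); the exact published forms may differ.
\begin{align*}
u_i^t &= \frac{r_i^t x_i^t}{\sum_{j \in N} r_j^t x_j^t}\, R^t \;-\; c_i^t x_i^t
      && \text{(user utility, cf. formula (1))}\\
u_{\mathrm{SP}}^t &= \lambda \sum_{i \in N} \ln\!\left(1 + x_i^t\right) \;-\; R^t
      && \text{(provider utility, cf. formula (2))}\\
w_i^t &= \sum_{j \in N} g_{ij}\, x_j^t
      && \text{(social-network willingness, cf. formula (3))}
\end{align*}
```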
A participant performance indicator is introduced: the reputation value is influenced by the reward obtained. Since the provider SP prefers users who require less reward, the higher the reward obtained by a user, the lower its performance indicator. The reputation feedback function of a mobile user is defined as shown in formula (4), which measures the extent to which user i completes the task compared with all selected participants, where R^t/N is the average reward each user receives from the total reward R^t. Under the same reward, the larger the perception level and the participation willingness of user i, the larger the reputation feedback value, and the more inclined the service provider SP is to select user i.
The user's reputation is updated as shown in formula (5), in which the new reputation value of user i is obtained from its historical reputation value and the feedback value ref_i^t, and α is the factor that determines how much reputation a participant gains or loses during the update. The arctan function is monotonically increasing: with positive feedback the user's reputation value increases, while with negative feedback it decreases faster. If the historical reputation value equals the preset value 0, the reputation value starts from 0.5.
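Formulas (4) and (5) are likewise only available as images. The following minimal Python sketch shows one way the described arctan-based reputation update could behave; the feedback definition, the scaling by π and the extra weight on negative feedback are assumptions, and only the qualitative properties stated above (monotone increase, faster decrease on negative feedback, a 0.5 start when the history is the preset value 0) are taken from the description.

```python
import math

def reputation_feedback(x_i, mean_x, reward_i, mean_reward):
    """Hypothetical feedback value ref_i^t (cf. formula (4)): relative task
    completion of user i, discounted by the reward it required."""
    completion = x_i / mean_x if mean_x > 0 else 0.0
    thrift = mean_reward / reward_i if reward_i > 0 else 0.0   # lower reward -> higher indicator
    return completion * thrift - 1.0                            # >0 positive, <0 negative feedback

def update_reputation(rep_hist, ref, alpha=1.0, negative_weight=2.0):
    """Hypothetical arctan update (cf. formula (5))."""
    if rep_hist == 0.0:                  # preset value 0: reputation starts from 0.5
        rep_hist = 0.5
    weight = alpha if ref >= 0 else alpha * negative_weight     # negative feedback decays faster
    return rep_hist + (1.0 / math.pi) * math.atan(weight * ref)

# Example: positive feedback raises an initial reputation above 0.5.
print(update_reputation(0.0, reputation_feedback(1.2, 1.0, 0.8, 1.0)))
```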
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to user reputation, and the maximum benefits of both the service provider and the users within the optimal user set are determined.
Both the service provider SP and the mobile users pursue their own benefit, so a reasonable pricing policy of the service provider must be determined first, after which each user determines its service-provision strategy. The Stackelberg game is a two-stage dynamic game with complete information; its main idea is that each side, given the strategy of the other, selects the strategy that maximizes its own utility, so that a Nash equilibrium is reached. At a Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally. The interaction between the service provider SP and the mobile users is therefore described as a two-stage Stackelberg game with a single leader and multiple followers. In the leader sub-game, the service provider decides the total reward offered for the task in order to obtain more service, as shown in formula (6). In the follower sub-game, since all users share the total reward, a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7).
Each user then determines its own optimal response to the strategy of the service provider SP, specifically as follows:
2-1) Follower sub-game: the total reward R^t determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger share of the reward. This competition is described as a non-cooperative game in which the strategy of each participating user is its perception level x_i^t and the goal of each user is to maximize its own utility. When every user selects its optimal strategy, the non-cooperative game reaches a stable state, the Nash equilibrium (NE), defined for the non-cooperative game among users as follows:
2-1-1) given the other users' strategies x_{-i}^{t*}, the strategy x_i^{t*} is the optimal response of user i to x_{-i}^{t*};
2-1-2) when no user can increase its utility by unilaterally changing its perception level while the others keep x_{-i}^{t*}, a Nash equilibrium exists in the non-cooperative game and is denoted (x_i^{t*}, x_{-i}^{t*}).
When the total reward R^t is given, a Nash equilibrium exists in the non-cooperative game. Whether the NE exists depends on the user utility function being a concave function of the strategy; by taking the first- and second-order derivatives of the user utility function, it is shown to be strictly concave, as in formula (8).
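The strict concavity argument behind formula (8) can be illustrated with the reconstructed utility given earlier. Writing S_{-i} = Σ_{j≠i} r_j^t x_j^t (a symbol introduced here for illustration), the first and second derivatives with respect to x_i^t are:

```latex
% Derivatives of the reconstructed utility, not of the published formula (8).
\frac{\partial u_i^t}{\partial x_i^t}
  = \frac{r_i^t S_{-i}}{\bigl(r_i^t x_i^t + S_{-i}\bigr)^{2}}\, R^t - c_i^t,
\qquad
\frac{\partial^2 u_i^t}{\partial (x_i^t)^2}
  = -\,\frac{2\,(r_i^t)^2 S_{-i}}{\bigl(r_i^t x_i^t + S_{-i}\bigr)^{3}}\, R^t < 0,
```

so under this assumed form the utility is strictly concave in x_i^t whenever S_{-i} > 0, which is the property that guarantees existence of the Nash equilibrium.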
2-2) Leader sub-game: in the Stackelberg game the leader, i.e. the service provider SP, offers the total reward R^t to the users, among whom a Nash equilibrium exists; the service provider SP then determines the value of R^t that maximizes its own utility, denoted R^{t(*)}. In the Stackelberg game, R^{t(*)} is the optimal strategy maximizing the utility of the provider SP, and when X^{t(*)} is the corresponding optimal perception-level profile, the Stackelberg equilibrium (R^{t(*)}, X^{t(*)}) exists. As in step 2-1), the second derivative of the utility function of the service provider SP is computed to show that it is strictly concave, which proves that the Stackelberg equilibrium exists.
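The two-level structure can be illustrated numerically. The sketch below is built on the same reconstructed utilities, so the functional forms, the grid ranges and λ are assumptions; it approximates the follower Nash equilibrium by iterated best response and the leader optimum R^{t(*)} by a simple grid search, whereas the patent derives these via the concavity arguments above.

```python
import numpy as np

def best_response(R, rep, cost, x, i, grid=np.linspace(0.01, 10.0, 400)):
    """Best response of user i over a grid, other perception levels fixed."""
    others = float(rep @ x) - rep[i] * x[i]                     # sum_{j!=i} r_j * x_j
    utils = rep[i] * grid / (rep[i] * grid + others) * R - cost[i] * grid
    return grid[int(np.argmax(utils))]

def follower_equilibrium(R, rep, cost, sweeps=100):
    """Iterated best response as a numerical stand-in for the follower NE."""
    x = np.ones_like(rep)
    for _ in range(sweeps):
        for i in range(len(x)):
            x[i] = best_response(R, rep, cost, x, i)
    return x

def leader_optimum(rep, cost, lam=5.0, R_grid=np.linspace(0.5, 30.0, 60)):
    """Leader sub-game: pick the total reward R^t maximizing the assumed SP utility."""
    def sp_utility(R):
        return lam * np.log1p(follower_equilibrium(R, rep, cost)).sum() - R
    R_star = max(R_grid, key=sp_utility)
    return R_star, follower_equilibrium(R_star, rep, cost)

# Toy example with three users of different reputation and cost.
rep = np.array([0.9, 0.7, 0.5]); cost = np.array([0.3, 0.4, 0.5])
print(leader_optimum(rep, cost))
```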
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behaviour is modelled as a Markov decision process (MDP) within the standard reinforcement learning framework of agent-environment interaction. The MDP is described as a five-tuple M = <S, A, U, P, γ>, and the state, action and reward of user i in the t-th time slot are denoted s_i^t, a_i^t and u_i^t, with the state space, action space, state transition probability and reward defined as follows:
State space: the state of user i in each time slot consists of the perception levels of the other users in the previous time slot, the perceived unit cost and the task reward R^t, as shown in formula (9).
Action space: in the t-th time slot, the action of user i is its perception level, i.e. a_i^t = x_i^t.
State transition probability: the dynamics of the environment are represented by the state transition probability of moving from s_i^t to s_i^{t+1} under action a_i^t.
Reward: the reward follows the utility, i.e. the reward function of user i is the user utility function.
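A minimal sketch of the per-slot environment transition implied by this MDP follows; it reuses the reconstructed reward-sharing utility, so the state layout and the sharing rule are assumptions rather than the published formulas (1) and (9).

```python
import numpy as np

def mcs_step(actions, unit_costs, reputations, total_reward):
    """One time-slot transition of the single-server multi-user MCS environment.
    actions[i] is user i's perception level x_i^t; each user's reward is its
    utility (assumed reward share minus sensing cost)."""
    pool = float(reputations @ actions)
    rewards = reputations * actions / pool * total_reward - unit_costs * actions
    # Next state of user i: the other users' perception levels, its unit cost,
    # and the task reward of the coming slot (cf. formula (9)).
    next_states = [
        {"x_others": np.delete(actions, i), "c": unit_costs[i], "R": total_reward}
        for i in range(len(actions))
    ]
    return next_states, rewards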
3-2) Perception strategy decision based on the PPO algorithm: the service provider and the optimal users select their own optimal strategies using the PPO algorithm, and a PPO-based deep reinforcement learning algorithm is adopted to optimize each user's sensing strategy. The goal of the PPO deep reinforcement learning algorithm is to find a set of optimal parameters {θ_i}, where θ_i represents the policy of user i. First, the policy is parameterized as π_{θ_i}(a_i^t | s_i^t); the optimization objective of the policy gradient is then as shown in formula (10). The state-value function and the action-value function are defined as in formula (11), in which the accumulated reward of user i from the t-th time slot onward is discounted by a factor γ ∈ [0, 1]: γ = 0 means the user cares only about the current utility and not the long-term utility, while γ = 1 means it cares about the cumulative utility from time slot t up to the final time slot T. The strategy is optimized with a policy-gradient method based on the actor-critic framework. The actor-critic model consists of two networks, an actor network π_{θ_i} with parameters θ_i and a critic network V_{ω_i} with parameters ω_i. The actor network takes the state s_i^t as input and outputs the perception level that maximizes the long-term accumulated return; the policy gradient is computed according to the stochastic policy gradient theorem, as shown in formula (12).
In formula (12), the expectation is taken with respect to the old sampling policy, whose parameters are those used to collect the trajectories, and the advantage function of the action and state measures how much better action a_i^t is than the average action in state s_i^t. The proximal policy optimization (PPO) method clips (prunes) this policy gradient as shown in formula (13), where the probability ratio between the new policy and the old sampling policy appears and η(x) is a piecewise clipping function defined over an interval.
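For reference, the standard clipped surrogate used by PPO, which the pruning step described around formula (13) resembles, is given below; ε is the clipping range and ρ_i^t the probability ratio between the new and the old policy. This is the textbook PPO form, not the patent's exact η.

```latex
% Standard PPO clipped surrogate (for reference; not the published formula (13)).
\rho_i^t = \frac{\pi_{\theta_i}\!\left(a_i^t \mid s_i^t\right)}
                {\pi_{\theta_i^{\mathrm{old}}}\!\left(a_i^t \mid s_i^t\right)},
\qquad
L^{\mathrm{CLIP}}(\theta_i)
  = \mathbb{E}\!\left[\min\!\left(\rho_i^t A_i^t,\;
      \mathrm{clip}\!\left(\rho_i^t,\, 1-\varepsilon,\, 1+\varepsilon\right) A_i^t\right)\right].
```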
The loss function used to update the critic network V_{ω_i} is given in formula (14); the gradient with respect to ω_i is computed as in formula (15), and likewise the gradient with respect to the actor network parameters θ_i is computed as in formula (16). D is the number of samples used for the policy-gradient estimate at each training step, i.e. the mini-batch size, and the actor and critic networks are updated after the dynamic game of D time slots: θ_i is updated by mini-batch stochastic gradient ascent and ω_i by mini-batch stochastic gradient descent, where l_{i,1} and l_{i,2} are the learning rates of the actor network and the critic network, respectively. The training procedure is as follows:
First, the parameters ω_i and θ_i of the actor-critic networks are initialized. At the start of each training period the buffer D is cleared and the state s_i^1 is initialized. For each time slot t, the state s_i^t is fed into the policy network and an action a_i^t is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends R^t to the mobile users, and each mobile user determines its perception level for time slot t + 1. Finally, the agent observes the new state and stores the transition in the experience replay buffer D; when the buffer is full, the networks are updated: for each user i, a mini-batch of experience is drawn from D to compute the estimated gradient of the critic network and the estimated gradient of the actor network, and the parameters θ_i and ω_i are updated with stochastic gradient steps. After determining its own strategy, each user contributes its sensing data to the service provider, which evaluates the sensing data uploaded through the edge node and updates the user's reputation.
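The update loop can be summarized in a short PyTorch sketch. Network sizes, the Gaussian policy, the advantage estimate and the optimizers are illustrative choices; only the overall actor-critic structure and the clipped update follow the description, not the patent's exact formulas (13)-(16).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi_theta_i(a|s) producing a Gaussian over the perception level."""
    def __init__(self, state_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.log_std = nn.Parameter(torch.zeros(1))

    def dist(self, s):
        return torch.distributions.Normal(self.mean(s), self.log_std.exp())

class Critic(nn.Module):
    """Value network V_omega_i(s)."""
    def __init__(self, state_dim):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, s):
        return self.v(s).squeeze(-1)

def ppo_update(actor, critic, opt_actor, opt_critic, batch, clip_eps=0.2):
    """One mini-batch update: clipped gradient ascent for the actor,
    squared-error gradient descent for the critic."""
    s, a, old_logp, ret = batch              # s: (B, d); a: (B, 1); old_logp, ret: (B,)
    adv = ret - critic(s).detach()           # advantage estimate A = R - V(s)

    ratio = (actor.dist(s).log_prob(a).squeeze(-1) - old_logp).exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    actor_loss = -torch.min(ratio * adv, clipped * adv).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    critic_loss = ((critic(s) - ret) ** 2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
```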
The method helps the service provider screen trustworthy users effectively, thereby reducing cost and increasing revenue, while protecting user privacy and maximizing the benefits of both the service provider and the users in a dynamic scenario.
Drawings
FIG. 1 is a schematic diagram of a single-server multi-user MCS system in an embodiment;
fig. 2 is a schematic diagram of a dynamic gaming process in the embodiment.
Detailed Description
The invention will be further illustrated by the following figures and examples, but is not limited thereto.
Example:
a reinforcement learning mobile crowdsourcing incentive method based on user reputation, comprising a single server multi-user MCS system, as shown in fig. 1, the single server multi-user MCS system is provided with a service provider SP, edge nodes and mobile users, assuming that all mobile users are M = {1, 2., M }, and there is a group of optimal users of N = {1, 2., N } in M, dividing a continuous decision cycle into equal time slots T, T = {1, 2., T }, and at time slot T, a perception level of each user i is represented as T
Figure RE-GDA0003819577010000073
The perception levels of other users than user i are
Figure RE-GDA0003819577010000074
The sensing levels of all mobile users are collected as
Figure RE-GDA0003819577010000075
The method comprises the steps that a service provider issues sensing tasks to edge nodes in each time period, the edge nodes distribute the tasks to users in a social network, the tasks received by the users determine and execute the tasks according to the sensing levels of the users and other users, sensing data are uploaded to a provider SP, the provider SP scores the reputation of the users according to the data quality and the wishes of the users and updates the reputation records, and the method is characterized in that the service provider sends the sensing tasks to the edge nodes in each time period, the edge nodes distribute the tasks to the users in the social network, the tasks received by the users are determined and executed according to the sensing levels of the users and other users, the provider SP scores the reputation of the users according to the data quality and the wishes of the users, and updates the reputation recordsThe method comprises the following steps:
1) Reputation evaluation stage:
The user's willingness to participate is determined by the social network effect. After receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and of the service provider. The utility function of user i is defined in formula (1); it consists of a revenue term and a cost term. The first term is the monetary reward obtained by user i, determined by its perception level x_i^t as a linear function of x_i^t, where R^t denotes the total task reward in time period t, shared by all users; the user's share is also weighted by the user's own reputation, so the higher the reputation of user i, the higher its reward. The second term is the cost of user i participating in the task, in which the user's unit sensing cost appears as a factor.
The utility function of the service provider SP is its revenue minus the total reward R^t paid to the users; the revenue of the SP is obtained from the users' perception levels and expressed by a function φ(t), as shown in formula (2), where λ is a system parameter. The function in formula (2) reflects the diminishing return of the service provider SP on the service contributed by user i, while the ln function reflects the provider's diminishing return on the number of mobile users. All locally participating users are assumed to be in a social network G = [g_ij]_{i,j∈N} with g_ij ∈ {0,1}: if user i is influenced by user j then g_ij = 1, otherwise g_ij = 0, and social relations are assumed reciprocal, g_ij = g_ji. According to the Erdős–Rényi (E-R) model of the social network, users are connected with probability μ; the larger μ is, the more tightly the nodes are connected and the more users influence a given user i. The willingness of user i to participate in a sensing task is defined as shown in formula (3).
A participant performance indicator is introduced: the reputation value is influenced by the reward obtained. Since the provider SP prefers users who require less reward, the higher the reward obtained by a user, the lower its performance indicator. The reputation feedback function of a mobile user is defined as shown in formula (4), which measures the extent to which user i completes the task compared with all selected participants, where R^t/N is the average reward each user receives from the total reward R^t. Under the same reward, the larger the perception level and the participation willingness of user i, the larger the reputation feedback value, and the more inclined the service provider SP is to select user i.
The user's reputation is updated as shown in formula (5), in which the new reputation value of user i is obtained from its historical reputation value and the feedback value ref_i^t, and α is the factor that determines how much reputation a participant gains or loses during the update. The arctan function is monotonically increasing: with positive feedback the user's reputation value increases, while with negative feedback it decreases faster. If the historical reputation value equals the preset value 0, the reputation value starts from 0.5.
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to user reputation, and the maximum benefits of both the service provider and the users within the optimal user set are determined.
Both the service provider SP and the mobile users pursue their own benefit, so a reasonable pricing policy of the service provider must be determined first, after which each user determines its service-provision strategy. The Stackelberg game is a two-stage dynamic game with complete information; its main idea is that each side, given the strategy of the other, selects the strategy that maximizes its own utility, so that a Nash equilibrium is reached. At a Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally. The interaction between the service provider SP and the mobile users is therefore described as a two-stage Stackelberg game with a single leader and multiple followers. In the leader sub-game, the service provider decides the total reward offered for the task in order to obtain more service, as shown in formula (6). In the follower sub-game, since all users share the total reward, a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7).
Each user then determines its own optimal response to the strategy of the service provider SP, specifically as follows:
2-1) Follower sub-game: the total reward R^t determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger share of the reward. This competition is described as a non-cooperative game in which the strategy of each participating user is its perception level x_i^t and the goal of each user is to maximize its own utility. When every user selects its optimal strategy, the non-cooperative game reaches a stable state, the Nash equilibrium (NE), defined for the non-cooperative game among users as follows:
2-1-1) given the other users' strategies x_{-i}^{t*}, the strategy x_i^{t*} is the optimal response of user i to x_{-i}^{t*};
2-1-2) when no user can increase its utility by unilaterally changing its perception level while the others keep x_{-i}^{t*}, a Nash equilibrium exists in the non-cooperative game and is denoted (x_i^{t*}, x_{-i}^{t*}).
When the total reward R^t is given, a Nash equilibrium exists in the non-cooperative game. Whether the NE exists depends on the user utility function being a concave function of the strategy; by taking the first- and second-order derivatives of the user utility function, it is shown to be strictly concave, as in formula (8).
2-2) Leader sub-game: in the Stackelberg game the leader, i.e. the service provider SP, offers the total reward R^t to the users, among whom a Nash equilibrium exists; the service provider SP then determines the value of R^t that maximizes its own utility, denoted R^{t(*)}. In the Stackelberg game, R^{t(*)} is the optimal strategy maximizing the utility of the provider SP, and when X^{t(*)} is the corresponding optimal perception-level profile, the Stackelberg equilibrium (R^{t(*)}, X^{t(*)}) exists. As in step 2-1), the second derivative of the utility function of the service provider SP is computed to show that it is strictly concave, which proves that the Stackelberg equilibrium exists.
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behaviour is modelled as a Markov decision process (MDP) within the standard reinforcement learning framework of agent-environment interaction. The MDP is described as a five-tuple M = <S, A, U, P, γ>, and the state, action and reward of user i in the t-th time slot are denoted s_i^t, a_i^t and u_i^t, with the state space, action space, state transition probability and reward defined as follows:
State space: the state of user i in each time slot consists of the perception levels of the other users in the previous time slot, the perceived unit cost and the task reward R^t, as shown in formula (9).
Action space: in the t-th time slot, the action of user i is its perception level, i.e. a_i^t = x_i^t.
State transition probability: the dynamics of the environment are represented by the state transition probability of moving from s_i^t to s_i^{t+1} under action a_i^t.
Reward: the reward follows the utility, i.e. the reward function of user i is the user utility function.
3-2) Perception strategy decision based on the PPO algorithm: the service provider and the optimal users select their own optimal strategies using the PPO algorithm, and a PPO-based deep reinforcement learning algorithm is adopted to optimize each user's sensing strategy. The goal of the PPO deep reinforcement learning algorithm is to find a set of optimal parameters {θ_i}, where θ_i represents the policy of user i. First, the policy is parameterized as π_{θ_i}(a_i^t | s_i^t); the optimization objective of the policy gradient is then as shown in formula (10). The state-value function and the action-value function are defined as in formula (11), in which the accumulated reward of user i from the t-th time slot onward is discounted by a factor γ ∈ [0, 1]: γ = 0 means the user cares only about the current utility and not the long-term utility, while γ = 1 means it cares about the cumulative utility from time slot t up to the final time slot T. The strategy is optimized with a policy-gradient method based on the actor-critic framework. The actor-critic model consists of two networks, an actor network π_{θ_i} with parameters θ_i and a critic network V_{ω_i} with parameters ω_i. The actor network takes the state s_i^t as input and outputs the perception level that maximizes the long-term accumulated return; the policy gradient is computed according to the stochastic policy gradient theorem, as shown in formula (12).
In formula (12), the expectation is taken with respect to the old sampling policy, whose parameters are those used to collect the trajectories, and the advantage function of the action and state measures how much better action a_i^t is than the average action in state s_i^t. The proximal policy optimization (PPO) method clips (prunes) this policy gradient as shown in formula (13), where the probability ratio between the new policy and the old sampling policy appears and η(x) is a piecewise clipping function defined over an interval.
The loss function used to update the critic network V_{ω_i} is given in formula (14); the gradient with respect to ω_i is computed as in formula (15), and likewise the gradient with respect to the actor network parameters θ_i is computed as in formula (16). D is the number of samples used for the policy-gradient estimate at each training step, i.e. the mini-batch size, and the actor and critic networks are updated after the dynamic game of D time slots: θ_i is updated by mini-batch stochastic gradient ascent and ω_i by mini-batch stochastic gradient descent, where l_{i,1} and l_{i,2} are the learning rates of the actor network and the critic network, respectively. The training procedure is as follows:
First, the parameters ω_i and θ_i of the actor-critic networks are initialized. At the start of each training period the buffer D is cleared and the state s_i^1 is initialized. For each time slot t, the state s_i^t is fed into the policy network and an action a_i^t is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends R^t to the mobile users, and each mobile user determines its perception level for time slot t + 1. Finally, the agent observes the new state and stores the transition in the experience replay buffer D; when the buffer is full, the networks are updated: for each user i, a mini-batch of experience is drawn from D to compute the estimated gradient of the critic network and the estimated gradient of the actor network, and the parameters θ_i and ω_i are updated with stochastic gradient steps. After determining its strategy, each user contributes its sensing data to the service provider, which evaluates the quality of the sensing data uploaded through the edge node and updates the user's reputation.

Claims (1)

1. A reinforcement learning mobile crowdsourcing incentive method based on user reputation, applied to a single-server multi-user MCS system comprising a service provider SP, edge nodes and mobile users, wherein all mobile users form the set M = {1, 2, ..., M}, within which there is a set of optimal users N = {1, 2, ..., N}; a continuous decision period is divided into equal time slots t ∈ T = {1, 2, ..., T}, and in each time slot t the perception level of user i is denoted x_i^t, the perception levels of the users other than i are denoted x_{-i}^t, and the perception levels of all mobile users are collected as x^t = (x_i^t, x_{-i}^t); characterized in that in each time period the service provider issues a sensing task to an edge node, the edge node distributes the task to users in the social network, each user decides on and executes the received task according to its own perception level and those of the other users, and the sensing data are uploaded to the provider SP, which scores each user's reputation according to data quality and participation willingness and updates the reputation record; and the method comprises the following steps:
1) Reputation evaluation stage:
the user's willingness to participate is determined by the social network effect; after receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and of the service provider; the utility function of user i is defined in formula (1) and consists of a revenue term and a cost term: the first term is the monetary reward obtained by user i, determined by its perception level x_i^t as a linear function of x_i^t, where R^t denotes the total task reward in time period t, shared by all users, and the user's share is also weighted by the user's own reputation, so the higher the reputation of user i, the higher its reward; the second term is the cost of user i participating in the task, in which the user's unit sensing cost appears as a factor;
the utility function of the service provider SP is its revenue minus the total reward R^t paid to the users; the revenue of the SP is obtained from the users' perception levels and expressed by a function φ(t), as shown in formula (2), where λ is a system parameter; the function in formula (2) reflects the diminishing return of the service provider SP on the service contributed by user i, while the ln function reflects the provider's diminishing return on the number of mobile users; all locally participating users are assumed to be in a social network G = [g_ij]_{i,j∈N} with g_ij ∈ {0,1}: if user i is influenced by user j then g_ij = 1, otherwise g_ij = 0, and social relations are assumed reciprocal, g_ij = g_ji; according to the Erdős–Rényi (E-R) model of the social network, users are connected with probability μ, and the larger μ is, the more tightly the nodes are connected and the more users influence a given user i; the willingness of user i to participate in a sensing task is defined as shown in formula (3);
a participant performance indicator is introduced, i.e. the reputation value is influenced by the reward obtained, and the reputation feedback function of a mobile user is defined as shown in formula (4), which measures the extent to which user i completes the task compared with all selected participants, where R^t/N is the average reward each user receives from the total reward R^t; under the same reward, the larger the perception level and the participation willingness of user i, the larger the reputation feedback value, and the more inclined the service provider SP is to select user i;
the user's reputation is updated as shown in formula (5), in which the new reputation value of user i is obtained from its historical reputation value and the feedback value ref_i^t, and α is the factor that determines how much reputation a participant gains or loses during the update; the arctan function is monotonically increasing, so with positive feedback the user's reputation value increases, while with negative feedback it decreases faster; if the historical reputation value equals the preset value 0, the reputation value starts from 0.5;
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to user reputation and determines the maximum benefit of both the service provider and the users in the optimal user set; at a Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally, and the interaction between the service provider SP and the mobile users is described as a two-stage Stackelberg game with a single leader and multiple followers; in the leader sub-game, the service provider decides the total reward offered for the task in order to obtain more service, as shown in formula (6); in the follower sub-game, all users share the total reward, a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7);
each user then determines its own optimal response to the strategy of the service provider SP, specifically as follows:
2-1) Follower sub-game: the total reward R^t determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger share of the reward; this competition is described as a non-cooperative game in which the strategy of each participating user is its perception level x_i^t and the goal of each user is to maximize its own utility; when every user selects its optimal strategy, the non-cooperative game reaches a stable state, the Nash equilibrium (NE), defined for the non-cooperative game among users as follows:
2-1-1) given the other users' strategies x_{-i}^{t*}, the strategy x_i^{t*} is the optimal response of user i to x_{-i}^{t*};
2-1-2) when no user can increase its utility by unilaterally changing its perception level while the others keep x_{-i}^{t*}, a Nash equilibrium exists in the non-cooperative game and is denoted (x_i^{t*}, x_{-i}^{t*});
when the total reward R^t is given, a Nash equilibrium exists in the non-cooperative game; whether the NE exists depends on the user utility function being a concave function of the strategy, and by taking the first- and second-order derivatives of the user utility function it is shown to be strictly concave, as in formula (8);
2-2) Leader sub-game: in the Stackelberg game the leader, i.e. the service provider SP, offers the total reward R^t to the users, among whom a Nash equilibrium exists; the service provider SP then determines the value of R^t that maximizes its own utility, denoted R^{t(*)}; in the Stackelberg game, R^{t(*)} is the optimal strategy maximizing the utility of the provider SP, and when X^{t(*)} is the corresponding optimal perception-level profile, the Stackelberg equilibrium (R^{t(*)}, X^{t(*)}) exists; as in step 2-1), the second derivative of the utility function of the service provider SP is computed to show that it is strictly concave, which proves that the Stackelberg equilibrium exists;
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behaviour is modelled as a Markov decision process (MDP) within the standard reinforcement learning framework of agent-environment interaction; the MDP is described as a five-tuple M = <S, A, U, P, γ>, and the state, action and reward of user i in the t-th time slot are denoted s_i^t, a_i^t and u_i^t, with the state space, action space, state transition probability and reward defined as follows:
state space: the state of user i in each time slot consists of the perception levels of the other users in the previous time slot, the perceived unit cost and the task reward R^t, as shown in formula (9);
action space: in the t-th time slot, the action of user i is its perception level, i.e. a_i^t = x_i^t;
state transition probability: the dynamics of the environment are represented by the state transition probability of moving from s_i^t to s_i^{t+1} under action a_i^t;
reward: the reward follows the utility, i.e. the reward function of user i is the user utility function;
3-2) Perception strategy decision based on the PPO algorithm: a PPO-based deep reinforcement learning algorithm is adopted to optimize each user's sensing strategy; the goal of the PPO deep reinforcement learning algorithm is to find a set of optimal parameters {θ_i}, where θ_i represents the policy of user i; first, the policy is parameterized as π_{θ_i}(a_i^t | s_i^t), and the optimization objective of the policy gradient is as shown in formula (10); the state-value function and the action-value function are defined as in formula (11), in which the accumulated reward of user i from the t-th time slot onward is discounted by a factor γ ∈ [0, 1]: γ = 0 means the user cares only about the current utility and not the long-term utility, while γ = 1 means it cares about the cumulative utility from time slot t up to the final time slot T; the strategy is optimized with a policy-gradient method based on the actor-critic framework, in which the actor-critic model consists of two networks, an actor network π_{θ_i} with parameters θ_i and a critic network V_{ω_i} with parameters ω_i; the actor network takes the state s_i^t as input and outputs the perception level that maximizes the long-term accumulated return, and the policy gradient is computed according to the stochastic policy gradient theorem, as shown in formula (12);
in formula (12), the expectation is taken with respect to the old sampling policy, whose parameters are those used to collect the trajectories, and the advantage function of the action and state measures how much better action a_i^t is than the average action in state s_i^t; the proximal policy optimization (PPO) method clips (prunes) this policy gradient as shown in formula (13), where the probability ratio between the new policy and the old sampling policy appears and η(x) is a piecewise clipping function defined over an interval;
the loss function used to update the critic network V_{ω_i} is given in formula (14); the gradient with respect to ω_i is computed as in formula (15), and likewise the gradient with respect to the actor network parameters θ_i is computed as in formula (16); D is the number of samples used for the policy-gradient estimate at each training step, i.e. the mini-batch size, and the actor and critic networks are updated after the dynamic game of D time slots: θ_i is updated by mini-batch stochastic gradient ascent and ω_i by mini-batch stochastic gradient descent, where l_{i,1} and l_{i,2} are the learning rates of the actor network and the critic network, respectively; the training procedure is as follows:
first, the parameters ω_i and θ_i of the actor-critic networks are initialized; at the start of each training period the buffer D is cleared and the state s_i^1 is initialized; for each time slot t, the state s_i^t is fed into the policy network and an action a_i^t is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends R^t to the mobile users, and each mobile user determines its perception level for time slot t + 1; finally, the agent observes the new state and stores the transition in the experience replay buffer D; when the buffer is full, the networks are updated: for each user i, a mini-batch of experience is drawn from D to compute the estimated gradient of the critic network and the estimated gradient of the actor network, and the parameters θ_i and ω_i are updated with stochastic gradient steps; after determining its own strategy, each user contributes its sensing data to the service provider, and the service provider evaluates the quality of the sensing data uploaded through the edge node and updates the user's reputation.
CN202210790320.6A 2022-07-06 2022-07-06 Reinforcement learning mobile crowdsourcing incentive method based on user reputation Pending CN115392337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210790320.6A CN115392337A (en) 2022-07-06 2022-07-06 Reinforced learning mobile crowdsourcing incentive method based on user reputation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210790320.6A CN115392337A (en) 2022-07-06 2022-07-06 Reinforced learning mobile crowdsourcing incentive method based on user reputation

Publications (1)

Publication Number Publication Date
CN115392337A true CN115392337A (en) 2022-11-25

Family

ID=84116151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210790320.6A Pending CN115392337A (en) 2022-07-06 2022-07-06 Reinforced learning mobile crowdsourcing incentive method based on user reputation

Country Status (1)

Country Link
CN (1) CN115392337A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116095721A (en) * 2023-04-07 2023-05-09 湖北工业大学 Mobile crowd-sourced network contract excitation method and system integrating perception communication
CN116095721B (en) * 2023-04-07 2023-06-27 湖北工业大学 Mobile crowd-sourced network contract excitation method and system integrating perception communication
CN116744289A (en) * 2023-06-02 2023-09-12 中国矿业大学 Intelligent position privacy protection method for 3D space mobile crowd sensing application
CN116744289B (en) * 2023-06-02 2024-02-09 中国矿业大学 Intelligent position privacy protection method for 3D space mobile crowd sensing application

Similar Documents

Publication Publication Date Title
CN115392337A (en) Reinforcement learning mobile crowdsourcing incentive method based on user reputation
CN107426621B (en) A kind of method and system showing any active ues image in mobile terminal direct broadcasting room
Nie et al. A socially-aware incentive mechanism for mobile crowdsensing service market
CN103647671A (en) Gur Game based crowd sensing network management method and system
CN113724096B (en) Group knowledge sharing method based on public evolution game model
CN114637911B (en) Method for recommending next interest point of attention fusion perception network
CN114124955B (en) Computing and unloading method based on multi-agent game
Zhang et al. Wireless service pricing competition under network effect, congestion effect, and bounded rationality
CN114912626A (en) Method for processing distributed data of federal learning mobile equipment based on Shapley value
CN113918829A (en) Content caching and recommending method based on federal learning in fog computing network
CN112291284B (en) Content pushing method and device and computer readable storage medium
CN110149161B (en) Multi-task cooperative spectrum sensing method based on Stackelberg game
Chen et al. An incentive mechanism for crowdsourcing systems with network effects
CN116761207B (en) User portrait construction method and system based on communication behaviors
Chen et al. Qoe-aware dynamic video rate adaptation
CN116073924B (en) Anti-interference channel allocation method and system based on Stackelberg game
CN116774584A (en) Unmanned aerial vehicle differentiated service track optimization method based on multi-agent deep reinforcement learning
CN116451800A (en) Multi-task federal edge learning excitation method and system based on deep reinforcement learning
CN114598655B (en) Reinforcement learning-based mobility load balancing method
Biczók et al. Incentivizing the global wireless village
CN114698125A (en) Method, device and system for optimizing computation offload of mobile edge computing network
CN111328107B (en) Multi-cloud heterogeneous mobile edge computing system architecture and energy optimization design method
CN117392483A (en) Album classification model training acceleration method, system and medium based on reinforcement learning
Back et al. Small Profits and Quick Returns: A Practical SocialWelfare Maximizing Incentive Mechanism for Deadline-Sensitive Tasks in Crowdsourcing
He et al. A leader–follower controlled Markov stopping game for delay tolerant and opportunistic resource sharing networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination