CN115392337A - Reinforced learning mobile crowdsourcing incentive method based on user reputation - Google Patents
Reinforced learning mobile crowdsourcing incentive method based on user reputation
- Publication number: CN115392337A
- Application number: CN202210790320.6A
- Authority: CN (China)
- Prior art keywords: user, users, function, reputation, service provider
- Prior art date: 2022-07-06
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0207—Discounts or incentives, e.g. coupons or rebates
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a reinforcement learning mobile crowdsourcing incentive method based on user reputation for a single-server multi-user MCS system. The method comprises three stages: 1) a reputation evaluation stage; 2) a task issuing stage; 3) a training stage. It fully accounts for both the quality of the data users upload and their willingness to participate, incorporating the two criteria into a user reputation index; this helps the service provider effectively screen trustworthy users, thereby reducing cost and obtaining higher profit.
Description
Technical Field
The invention relates to mobile crowdsourcing and reinforcement learning technologies in edge computing, and in particular to a reinforcement learning mobile crowdsourcing incentive method based on user reputation.
Background
Mobile crowd sensing (MCS) has become a popular and widely adopted paradigm for urban sensing and data collection. In recent years, mobile devices such as mobile phones, tablet computers, wearable devices, and vehicle-mounted smart devices have become widespread, and these devices have sensing, computing, and communication capabilities. Compared with a traditional sensor network, an MCS system formed by mobile devices is highly mobile and intelligent, achieves wider coverage, and can meet dynamic sensing requirements. The sensing and computing capability of an MCS system depends on a large number of mobile devices participating in sensing tasks and contributing their sensing data; in practice, however, users are reluctant to contribute for free, and participating in tasks may leak their privacy. Designing a reasonable incentive mechanism to encourage mobile users to keep participating in sensing tasks is therefore a challenge.
In edge computing scenarios, an MCS system is typically reduced to a cloud service provider (SP) and users with terminal mobile devices, where the SP issues sensing tasks and rewards to incentivize mobile users to complete the tasks. Dishonest users in the MCS may upload false sensing data and make the task results inaccurate, so the SP must select the users that upload data based on their reputation values, a higher reputation indicating higher-quality data. The social network effect among mobile user devices also shapes participants' sensing strategies: a user gains additional benefit from information provided or shared by local neighbors in the social network and is therefore more willing to participate in sensing tasks. It is thus necessary to take into account, within the reputation, the user willingness determined by the social network effect; this is the importance of considering both user willingness and contributed data quality in incentive design.
Disclosure of Invention
The invention aims to provide a reinforcement learning mobile crowdsourcing incentive method based on user reputation, addressing the rationality of user incentive methods for mobile crowd sensing in edge computing and the difficulty of guaranteeing users' data quality and degree of participation during sensing tasks. The method fully considers the quality of the data uploaded by users and their willingness to participate, incorporates the two criteria into a user reputation index, and helps the service provider effectively screen trustworthy users, thereby reducing cost and obtaining higher profit.
The technical scheme for realizing the purpose of the invention is as follows:
A reinforcement learning mobile crowdsourcing incentive method based on user reputation operates on a single-server multi-user MCS system, which is provided with a service provider SP, edge nodes and mobile users. The set of all mobile users is M = {1, 2, ..., M}, and a group of optimal users N = {1, 2, ..., N} exists in M; a continuous decision period is divided into equal time slots t ∈ T = {1, 2, ..., T}. In time slot t, the perception level of each user i is denoted $x_i^t$, the perception levels of the users other than i are denoted $x_{-i}^t$, and the perception levels of all mobile users are collected into $x^t = (x_i^t, x_{-i}^t)$. In each time period the service provider issues a perception task to the edge nodes, the edge nodes distribute the task to the users in the social network, the users determine and execute the received tasks according to their own perception levels and those of the other users, and the perception data are uploaded to the provider SP, which scores the users' reputations according to data quality and user willingness and updates the reputation records. The method comprises the following steps:
1) Reputation evaluation stage:
The participation willingness of a user is determined by the social network effect. After receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and the service provider. The utility function of user i is defined as shown in formula (1):

$$u_i^t(x_i^t, x_{-i}^t) = \frac{\rho_i^t x_i^t}{\sum_{j \in N} \rho_j^t x_j^t}\, R^t - c_i^t x_i^t \qquad (1)$$

The utility of user i consists of a revenue function and a cost function. The first part, $\frac{\rho_i^t x_i^t}{\sum_{j \in N} \rho_j^t x_j^t} R^t$, defines the monetary reward obtained by user i; it is determined by the perception level $x_i^t$ and is a linear function of $R^t$, the total task reward in time period t, which is shared by all users. The reward is also weighted by the user's own reputation $\rho_i^t$: the higher the reputation of user i, the higher the reward. The second part, $c_i^t x_i^t$, defines the cost for user i to participate in the task, where $c_i^t$ is the user's perceived unit cost. The utility function of the service provider SP is its revenue minus the total reward $R^t$ paid to the users, $u_{SP}^t = \phi(t) - R^t$, where the revenue function φ(t) is obtained from the users' perception levels, as shown in formula (2):

Here λ is a system parameter; a concave root function in φ(t) reflects the SP's diminishing return on the service of each user i, while an ln function reflects the SP's diminishing return on the number of mobile users. All locally participating users are assumed to be in a social network $G = [g_{ij}]_{i,j \in N}$ with $g_{ij} \in \{0, 1\}$: if user i is affected by user j, then $g_{ij} = 1$, otherwise $g_{ij} = 0$, and social relations are assumed reciprocal, $g_{ij} = g_{ji}$. According to the E-R model of social networks, all users are connected to the social network with probability μ; the larger the value of μ, the tighter the connection of each node and the more users besides user i affect user i.
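For concreteness, the reward-share and cost structure above can be expressed in a few lines of Python. This is a minimal sketch: formula (2) is not reproduced in the source, so the square-root-inside-logarithm revenue φ(t) below, as well as λ and the numeric values, are illustrative assumptions consistent with the stated diminishing returns.

```python
import numpy as np

def user_utility(i, x, rho, R, c):
    # Formula (1): reputation-weighted share of the total reward R^t
    # minus the sensing cost c_i^t * x_i^t.
    share = rho[i] * x[i] / np.dot(rho, x)
    return share * R - c[i] * x[i]

def sp_utility(x, R, lam=10.0):
    # Assumed phi(t): a sqrt term gives diminishing returns per user,
    # an ln term gives diminishing returns in the number of users.
    phi = lam * np.log(1.0 + np.sum(np.sqrt(x)))
    return phi - R

x = np.array([1.0, 2.0, 3.0])      # perception levels x_i^t
rho = np.array([0.6, 0.5, 0.8])    # reputations rho_i^t
c = np.array([0.1, 0.2, 0.3])      # unit costs c_i^t
print(user_utility(0, x, rho, R=5.0, c=c), sp_utility(x, R=5.0))
```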
the introduction of the participant performance indicators, i.e. the reputation values, is influenced by the awards obtained, since the provider SP is more inclined to users requiring less awards, the higher the awards obtained by the users, the lower the performance indicators, the performance index of which is
The reputation feedback function of the mobile user is defined as shown in equation (4):
equation (4) represents the extent to which user i completes the task, R, compared to all selected participants t N the average reward for each user i is from the total reward R t Following the same reward conditionAndthe value of credit feedback increases, the service provider SP is more inclined to select user i,
the reputation update of the user is as shown in equation (5):
in the formula (5)For the new reputation value obtained for user i,for historical reputation values, ref i t For feedback value, α is the factor that determines whether the participant will gain or lose new reputation value during reputation updating, the arctan function is a monotonically increasing function, with positive feedback from the participant, the user reputation value will increase and negative feedback will decrease faster, if anyThis means that the reputation value is when the preset value is 0Will start from 0.5;
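A minimal sketch of this reputation update follows; the exact formula (5) is not reproduced in the source, so the particular arctan composition below is an assumption that merely reproduces the stated behaviour (monotone in the feedback, bounded, and starting from 0.5 when the historical value is preset to 0).

```python
import math

def update_reputation(rho_prev, ref, alpha=1.0):
    # Assumed realization of formula (5): arctan is monotonically
    # increasing and bounded, so the reputation stays in (0, 1);
    # with rho_prev = 0 and ref = 0 the value starts at 0.5.
    return 0.5 + math.atan(rho_prev + alpha * ref) / math.pi

print(update_reputation(0.0, 0.0))    # 0.5, cold start
print(update_reputation(0.0, 1.5))    # positive feedback raises reputation
print(update_reputation(0.0, -1.5))   # negative feedback lowers it
```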
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to the users' reputations and determines the maximum profit of both the service provider and the users within the optimal user set.

Both the service provider SP and the mobile users pursue their own benefit, so a reasonable pricing policy of the service provider must be determined first, followed by the service-demand policy of each user. The Stackelberg game is a two-stage dynamic game with complete information; its main idea is that each party selects its own strategy given the strategy of the other so as to maximize its own utility, thereby reaching Nash equilibrium. At Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally. The interaction between the service provider SP and the mobile users is described as a two-stage Stackelberg game with a single leader and multiple followers. In the leader sub-game, the service provider decides the total task reward it offers so as to obtain more profit, as shown in formula (6):

$$\max_{R^t \ge 0}\; u_{SP}^t(R^t, x^t) \qquad (6)$$

In the follower sub-game, since all users share the total reward, a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7):

$$\max_{x_i^t \ge 0}\; u_i^t(x_i^t, x_{-i}^t, R^t) \qquad (7)$$
Each user determines its optimal response strategy given the strategy of the service provider SP, specifically as follows:

2-1) Follower sub-game: the total reward $R^t$ determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger reward. The competition among users is described as a non-cooperative game in which the strategy of each participating user is its perception level $x_i^t$, and the goal of each user is to maximize its own utility. When all users select their optimal strategies, the non-cooperative game reaches a stable state, namely Nash equilibrium (NE), which is defined for the non-cooperative game among the users as follows:

2-1-2) Nash equilibrium exists in the non-cooperative game and is denoted $x^{t*} = (x_i^{t*}, x_{-i}^{t*})$ when the utility of every user satisfies

$$u_i^t(x_i^{t*}, x_{-i}^{t*}) \ge u_i^t(x_i^t, x_{-i}^{t*}) \quad \forall\, x_i^t \ge 0,\ i \in N;$$

that is, when the total reward $R^t$ is given, Nash equilibrium exists in the non-cooperative game.

Whether NE exists in the non-cooperative game depends on the user utility function being a concave function of the user's strategy; by solving the first-order and second-order derivatives of the user utility function, it is shown to be strictly concave, as in formula (8):

$$\frac{\partial^2 u_i^t}{\partial (x_i^t)^2} < 0 \qquad (8)$$

2-2) Leader sub-game: in the Stackelberg game, the leader, i.e. the service provider SP, offers the total reward $R^t$ to the users, among whom Nash equilibrium holds, and determines the $R^t$ that achieves its maximum utility, denoted $R^{t*}$. In the Stackelberg game, $R^{t*}$ is the optimal strategy maximizing the utility of the provider SP, and when $X^{t*}$ is the optimal perception level, the Stackelberg equilibrium $(R^{t*}, X^{t*})$ exists. As in step 2-1), the second derivative of the SP utility function is computed to show that it is a strictly concave function, which proves that the Stackelberg equilibrium exists;
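Numerically, this two-stage game can be solved by iterated best response for the followers and a one-dimensional search for the leader. The sketch below reuses the assumed utility forms from the earlier snippet; the bounds, grid and iteration counts are arbitrary illustrative choices, not values from the invention.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def best_response(i, x, rho, R, c):
    # Optimal x_i^t of user i given the other users' levels (formula (7)).
    def neg_u(xi):
        x2 = x.copy(); x2[i] = xi
        return -(rho[i] * xi / np.dot(rho, x2) * R - c[i] * xi)
    return minimize_scalar(neg_u, bounds=(1e-6, 50.0), method="bounded").x

def follower_nash(rho, R, c, iters=50):
    # Iterated best response; it converges here because each utility is
    # strictly concave in the user's own strategy (formula (8)).
    x = np.ones_like(c)
    for _ in range(iters):
        for i in range(len(c)):
            x[i] = best_response(i, x, rho, R, c)
    return x

def leader_reward(rho, c, lam=10.0):
    # Leader sub-game (formula (6)): grid search over the total reward R^t.
    def sp_u(R):
        x = follower_nash(rho, R, c)
        return lam * np.log(1.0 + np.sum(np.sqrt(x))) - R
    return max(np.linspace(0.5, 20.0, 40), key=sp_u)

rho = np.array([0.6, 0.5, 0.8]); c = np.array([0.2, 0.3, 0.4])
R_star = leader_reward(rho, c)
print(R_star, follower_nash(rho, R_star, c))
```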
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behavior is modeled as a Markov decision process (MDP) and combined with the standard reinforcement learning framework of agent-environment interaction. The Markov decision process is described as a quintuple M = ⟨S, A, U, P, γ⟩, and the state, action and reward of the t-th time slot are defined as $s_i^t$, $a_i^t$ and $u_i^t$, where the state space, action space, state transition probability and reward are as follows:

State space: the state of user i in each time slot consists of the other users' perception levels in the previous time slot, the perceived unit cost, and the task reward, as shown in formula (9):

$$s_i^t = \bigl(x_{-i}^{t-1},\; c_i^t,\; R^t\bigr) \qquad (9)$$

Action space: in the t-th time slot, the action of user i is its perception level, i.e. $a_i^t = x_i^t$;

State transition probability: the dynamics of the environment are represented by the state transition probability $P(s^{t+1} \mid s^t, a^t)$;

Reward: the reward function of user i is the user utility function $u_i^t$;
3-2) Perception strategy decision based on the PPO algorithm: the service provider and the optimal users each select their own optimal strategy using the PPO algorithm; a deep reinforcement learning algorithm based on PPO (proximal policy optimization) is adopted to optimize the perception strategy of each user. Its goal is to find an optimal set of parameters $\{\theta_i^*\}_{i \in N}$, where $\theta_i$ represents the policy of user i. First, the policy is parameterized as $\pi_{\theta_i}(a_i^t \mid s_i^t)$; the optimization objective of the policy gradient is then as shown in formula (10):

The state-value function is defined as $V_{\omega_i}(s_i^t)$, whose value is the expected cumulative reward, as shown in formula (11):

$$V_{\omega_i}(s_i^t) = \mathbb{E}\bigl[U_i^t\bigr], \qquad U_i^t = \sum_{k=0}^{T-t} \gamma^k\, u_i^{t+k} \qquad (11)$$

In formula (11), $U_i^t$ is the cumulative reward of user i from the t-th time slot and $\gamma \in [0, 1]$ is a discount factor: γ = 0 means the user is concerned only with current utility and not with long-term utility, while γ = 1 means it is concerned with the cumulative effect from time slot t to the final time slot T. The strategy is optimized with a policy gradient method based on the actor-critic framework. The actor-critic model consists of two networks, an actor network $\pi_{\theta_i}$ with parameters $\theta_i$ and a critic network $V_{\omega_i}$ with parameters $\omega_i$. The actor network takes the state $s_i^t$ as input and solves for the optimal perception level that yields the maximum long-term cumulative profit; the policy gradient is calculated according to the stochastic policy gradient theorem, as shown in formula (12):

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\Bigl[\nabla_{\theta_i} \log \pi_{\theta_i}(a_i^t \mid s_i^t)\, A^t\Bigr] \qquad (12)$$

Here $A^t$ is the advantage function over actions and states, representing the advantage of action $a_i^t$ relative to state $s_i^t$. The proximal policy optimization (PPO) method is adopted to clip the policy gradient, as shown in formula (13):

$$L^{CLIP}(\theta_i) = \mathbb{E}\Bigl[\min\bigl(r^t(\theta_i) A^t,\; \operatorname{clip}\bigl(r^t(\theta_i), 1-\epsilon, 1+\epsilon\bigr) A^t\bigr)\Bigr], \quad r^t(\theta_i) = \frac{\pi_{\theta_i}(a_i^t \mid s_i^t)}{\pi_{\theta_i^{\mathrm{old}}}(a_i^t \mid s_i^t)} \qquad (13)$$
The loss function for updating the critic network $V_{\omega_i}$ is defined as formula (14):

$$L(\omega_i) = \mathbb{E}\Bigl[\bigl(U_i^t - V_{\omega_i}(s_i^t)\bigr)^2\Bigr] \qquad (14)$$

The estimated gradient of $\omega_i$ is then computed as shown in formula (15):

D is the number of samples used for the policy gradient estimate in each training step, i.e. the mini-batch size; the actor and critic networks are updated after the dynamic game over D time slots, using mini-batch stochastic gradient ascent for $\theta_i$ and mini-batch stochastic gradient descent for $\omega_i$:

$$\theta_i \leftarrow \theta_i + l_{i,1}\, \hat{\nabla}_{\theta_i} L^{CLIP}(\theta_i), \qquad \omega_i \leftarrow \omega_i - l_{i,2}\, \hat{\nabla}_{\omega_i} L(\omega_i)$$

where $l_{i,1}$ and $l_{i,2}$ are the learning rates of the actor network and the critic network, respectively.

First, the parameters $\omega_i$ and $\theta_i$ of the actor-critic networks are initialized. For each training period, the buffer D is cleared and the state is initialized to $s_i^1$. In each time slot t, $s_i^t$ is input into the policy network and an action $a_i^t$ is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends $R^t$ to the mobile users, and each mobile user determines its perception level $x_i^{t+1}$ for time slot t+1. Finally, the agent updates the new state and stores the transition in the experience replay buffer D, which is processed when full: for each user i, a mini-batch of experience is extracted from D to calculate the estimated gradient $\hat{\nabla}_{\omega_i}$ of the critic network and the estimated gradient $\hat{\nabla}_{\theta_i}$ of the actor network, and the parameters $\theta_i$ and $\omega_i$ are updated with stochastic gradients. After determining its own strategy, each user contributes its sensing data to the service provider, which evaluates the quality of the sensing data uploaded through the edge nodes and updates the user's reputation.
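The interaction loop of this training stage can be condensed into the following single-user PyTorch skeleton. The `env` object standing in for the SP and the other users, the network sizes, the learning rates and the horizon are all placeholders rather than the patented implementation, and `ppo_clip_loss` is the function sketched above.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    # pi_theta_i: maps the state to a Gaussian over the perception level.
    def __init__(self, s_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.Tanh(), nn.Linear(64, 2))
    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp())

class Critic(nn.Module):
    # V_omega_i: state-value estimate.
    def __init__(self, s_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

def train_user(env, s_dim, periods=100, slots=64, gamma=0.95):
    actor, critic = Actor(s_dim), Critic(s_dim)
    opt_a = torch.optim.Adam(actor.parameters(), lr=3e-4)   # l_{i,1}, ascent on theta_i
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)  # l_{i,2}, descent on omega_i
    for _ in range(periods):
        buf, s = [], env.reset()                 # env.reset() returns the state tensor s_i^1
        for _ in range(slots):                   # dynamic game over D time slots
            dist = actor(s)
            a = dist.sample()                    # action: perception level x_i^t
            logp = dist.log_prob(a).sum(-1)
            s_next, r = env.step(a)              # SP sends R^t; the utility is the reward
            buf.append((s, a, logp.detach(), r))
            s = s_next
        G, returns = 0.0, []                     # discounted returns U_i^t
        for _, _, _, r in reversed(buf):
            G = r + gamma * G
            returns.append(G)
        returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
        S = torch.stack([b[0] for b in buf])
        A = torch.stack([b[1] for b in buf])
        logp_old = torch.stack([b[2] for b in buf])
        adv = returns - critic(S).detach()       # advantage A^t = U_i^t - V(s_i^t)
        loss_a = ppo_clip_loss(actor(S).log_prob(A).sum(-1), logp_old, adv)
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()
        loss_c = ((returns - critic(S)) ** 2).mean()   # critic loss, formula (14)
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    return actor, critic
```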
The method helps the service provider effectively screen trustworthy users, thereby reducing cost and obtaining greater profit, while protecting user privacy and maximizing the benefits of both the service provider and the users in dynamic scenarios.
Drawings
FIG. 1 is a schematic diagram of a single-server multi-user MCS system in an embodiment;
FIG. 2 is a schematic diagram of the dynamic game process in the embodiment.
Detailed Description
The invention will be further illustrated by the following figures and examples, but is not limited thereto.
Example:
A reinforcement learning mobile crowdsourcing incentive method based on user reputation operates on a single-server multi-user MCS system. As shown in fig. 1, the single-server multi-user MCS system is provided with a service provider SP, edge nodes and mobile users. The set of all mobile users is M = {1, 2, ..., M}, and a group of optimal users N = {1, 2, ..., N} exists in M; a continuous decision period is divided into equal time slots t ∈ T = {1, 2, ..., T}, and in time slot t the perception level of each user i is denoted $x_i^t$, the perception levels of the users other than i are denoted $x_{-i}^t$, and the perception levels of all mobile users are collected into $x^t = (x_i^t, x_{-i}^t)$. In each time period the service provider issues sensing tasks to the edge nodes, the edge nodes distribute the tasks to the users in the social network, the users determine and execute the received tasks according to their own perception levels and those of the other users, and the sensing data are uploaded to the provider SP, which scores the users' reputations according to data quality and user willingness and updates the reputation records. The method comprises the following steps:
1) Reputation evaluation stage:
The participation willingness of a user is determined by the social network effect. After receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and the service provider. The utility function of user i is defined as shown in formula (1):

$$u_i^t(x_i^t, x_{-i}^t) = \frac{\rho_i^t x_i^t}{\sum_{j \in N} \rho_j^t x_j^t}\, R^t - c_i^t x_i^t \qquad (1)$$

The utility of user i consists of a revenue function and a cost function. The first part, $\frac{\rho_i^t x_i^t}{\sum_{j \in N} \rho_j^t x_j^t} R^t$, defines the monetary reward obtained by user i; it is determined by the perception level $x_i^t$ and is a linear function of $R^t$, the total task reward in time period t, which is shared by all users. The reward is also weighted by the user's own reputation $\rho_i^t$: the higher the reputation of user i, the higher the reward. The second part, $c_i^t x_i^t$, defines the cost for user i to participate in the task, where $c_i^t$ is the user's perceived unit cost. The utility function of the service provider SP is its revenue minus the total reward $R^t$ paid to the users, $u_{SP}^t = \phi(t) - R^t$, where the revenue function φ(t) is obtained from the users' perception levels, as shown in formula (2):

Here λ is a system parameter; a concave root function in φ(t) reflects the SP's diminishing return on the service of each user i, while an ln function reflects the SP's diminishing return on the number of mobile users. All locally participating users are assumed to be in a social network $G = [g_{ij}]_{i,j \in N}$ with $g_{ij} \in \{0, 1\}$: if user i is affected by user j, then $g_{ij} = 1$, otherwise $g_{ij} = 0$, and social relations are assumed reciprocal, $g_{ij} = g_{ji}$. According to the E-R model of social networks, all users are connected to the social network with probability μ; the larger the value of μ, the tighter the connection of each node and the more users besides user i affect user i. The willingness of user i to participate in a sensing task is defined as shown in formula (3):
the introduction of the participant performance indicators, i.e. the reputation values, is influenced by the awards obtained, since the provider SP is more inclined to users requiring less awards, the higher the awards obtained by the users, the lower the performance indicators, the performance index of which is
The reputation feedback function of the mobile user is defined as shown in equation (4):
equation (4) represents the degree to which user i completed the task compared to all selected participants, R t The average reward expressed as per user i is from the total reward R t Under the same reward condition, withAndthe value of credit feedback increases, the service provider SP is more inclined to select user i,
the reputation update of the user is shown in equation (5):
in the formula (5)For the new reputation value obtained for user i,for historical reputation values, ref i t For feedback value, α is the factor that determines whether the participant will gain or lose new reputation value during reputation updating, the arctan function is a monotonically increasing function, with positive feedback from the participant, the user reputation value will increase and negative feedback will decrease faster, if anyThis means that the reputation value is when the preset value is 0Will start from 0.5;
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to the users' reputations and determines the maximum profit of both the service provider and the users within the optimal user set.

Both the service provider SP and the mobile users pursue their own benefit, so a reasonable pricing policy of the service provider must be determined first, followed by the service-demand policy of each user. The Stackelberg game is a two-stage dynamic game with complete information; its main idea is that each party selects its own strategy given the strategy of the other so as to maximize its own utility, thereby reaching Nash equilibrium. At Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally. The interaction between the service provider SP and the mobile users is described as a two-stage Stackelberg game with a single leader and multiple followers. In the leader sub-game, the service provider decides the total task reward it offers so as to obtain more profit, as shown in formula (6):

$$\max_{R^t \ge 0}\; u_{SP}^t(R^t, x^t) \qquad (6)$$

In the follower sub-game, since all users share the total reward, a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7):

$$\max_{x_i^t \ge 0}\; u_i^t(x_i^t, x_{-i}^t, R^t) \qquad (7)$$
Each user determines its optimal response strategy given the strategy of the service provider SP, specifically as follows:

2-1) Follower sub-game: the total reward $R^t$ determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger reward. The competition among users is described as a non-cooperative game in which the strategy of each participating user is its perception level $x_i^t$, and the goal of each user is to maximize its own utility. When all users select their optimal strategies, the non-cooperative game reaches a stable state, namely Nash equilibrium (NE), which is defined for the non-cooperative game among the users as follows:

2-1-2) Nash equilibrium exists in the non-cooperative game and is denoted $x^{t*} = (x_i^{t*}, x_{-i}^{t*})$ when the utility of every user satisfies

$$u_i^t(x_i^{t*}, x_{-i}^{t*}) \ge u_i^t(x_i^t, x_{-i}^{t*}) \quad \forall\, x_i^t \ge 0,\ i \in N;$$

that is, when the total reward $R^t$ is given, Nash equilibrium exists in the non-cooperative game.

Whether NE exists in the non-cooperative game depends on the user utility function being a concave function of the user's strategy; by solving the first-order and second-order derivatives of the user utility function, it is shown to be strictly concave, as in formula (8):

$$\frac{\partial^2 u_i^t}{\partial (x_i^t)^2} < 0 \qquad (8)$$

2-2) Leader sub-game: in the Stackelberg game, the leader, i.e. the service provider SP, offers the total reward $R^t$ to the users, among whom Nash equilibrium holds, and determines the $R^t$ that achieves its maximum utility, denoted $R^{t*}$. In the Stackelberg game, $R^{t*}$ is the optimal strategy maximizing the utility of the provider SP, and when $X^{t*}$ is the optimal perception level, the Stackelberg equilibrium $(R^{t*}, X^{t*})$ exists. As in step 2-1), the second derivative of the SP utility function is computed to show that it is a strictly concave function, which proves that the Stackelberg equilibrium exists;
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behavior is modeled as a Markov decision process (MDP) and combined with the standard reinforcement learning framework of agent-environment interaction. The Markov decision process is described as a quintuple M = ⟨S, A, U, P, γ⟩, and the state, action and reward of the t-th time slot are defined as $s_i^t$, $a_i^t$ and $u_i^t$, where the state space, action space, state transition probability and reward are as follows:

State space: the state of user i in each time slot consists of the other users' perception levels in the previous time slot, the perceived unit cost, and the task reward, as shown in formula (9):

$$s_i^t = \bigl(x_{-i}^{t-1},\; c_i^t,\; R^t\bigr) \qquad (9)$$

Action space: in the t-th time slot, the action of user i is its perception level, i.e. $a_i^t = x_i^t$;

State transition probability: the dynamics of the environment are represented by the state transition probability $P(s^{t+1} \mid s^t, a^t)$;

Reward: the reward function of user i is the user utility function $u_i^t$;
3-2) Perception strategy decision based on the PPO algorithm: the service provider and the optimal users each select their own optimal strategy using the PPO algorithm; a deep reinforcement learning algorithm based on PPO (proximal policy optimization) is adopted to optimize the perception strategy of each user. Its goal is to find an optimal set of parameters $\{\theta_i^*\}_{i \in N}$, where $\theta_i$ represents the policy of user i. First, the policy is parameterized as $\pi_{\theta_i}(a_i^t \mid s_i^t)$; the optimization objective of the policy gradient is then as shown in formula (10):

The state-value function is defined as $V_{\omega_i}(s_i^t)$, whose value is the expected cumulative reward, as shown in formula (11):

$$V_{\omega_i}(s_i^t) = \mathbb{E}\bigl[U_i^t\bigr], \qquad U_i^t = \sum_{k=0}^{T-t} \gamma^k\, u_i^{t+k} \qquad (11)$$

In formula (11), $U_i^t$ is the cumulative reward of user i from the t-th time slot and $\gamma \in [0, 1]$ is a discount factor: γ = 0 means the user is concerned only with current utility and not with long-term utility, while γ = 1 means it is concerned with the cumulative effect from time slot t to the final time slot T. The strategy is optimized with a policy gradient method based on the actor-critic framework. The actor-critic model consists of two networks, an actor network $\pi_{\theta_i}$ with parameters $\theta_i$ and a critic network $V_{\omega_i}$ with parameters $\omega_i$. The actor network takes the state $s_i^t$ as input and solves for the optimal perception level that yields the maximum long-term cumulative profit; the policy gradient is calculated according to the stochastic policy gradient theorem, as shown in formula (12):

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\Bigl[\nabla_{\theta_i} \log \pi_{\theta_i}(a_i^t \mid s_i^t)\, A^t\Bigr] \qquad (12)$$

Here $A^t$ is the advantage function over actions and states, representing the advantage of action $a_i^t$ relative to state $s_i^t$. The proximal policy optimization (PPO) method is adopted to clip the policy gradient, as shown in formula (13):

$$L^{CLIP}(\theta_i) = \mathbb{E}\Bigl[\min\bigl(r^t(\theta_i) A^t,\; \operatorname{clip}\bigl(r^t(\theta_i), 1-\epsilon, 1+\epsilon\bigr) A^t\bigr)\Bigr], \quad r^t(\theta_i) = \frac{\pi_{\theta_i}(a_i^t \mid s_i^t)}{\pi_{\theta_i^{\mathrm{old}}}(a_i^t \mid s_i^t)} \qquad (13)$$

The loss function for updating the critic network $V_{\omega_i}$ is defined as formula (14):

$$L(\omega_i) = \mathbb{E}\Bigl[\bigl(U_i^t - V_{\omega_i}(s_i^t)\bigr)^2\Bigr] \qquad (14)$$

The estimated gradient of $\omega_i$ is then computed as shown in formula (15):

D is the number of samples used for the policy gradient estimate in each training step, i.e. the mini-batch size; the actor and critic networks are updated after the dynamic game over D time slots, using mini-batch stochastic gradient ascent for $\theta_i$ and mini-batch stochastic gradient descent for $\omega_i$:

$$\theta_i \leftarrow \theta_i + l_{i,1}\, \hat{\nabla}_{\theta_i} L^{CLIP}(\theta_i), \qquad \omega_i \leftarrow \omega_i - l_{i,2}\, \hat{\nabla}_{\omega_i} L(\omega_i)$$

where $l_{i,1}$ and $l_{i,2}$ are the learning rates of the actor network and the critic network, respectively.

First, the parameters $\omega_i$ and $\theta_i$ of the actor-critic networks are initialized. For each training period, the buffer D is cleared and the state is initialized to $s_i^1$. In each time slot t, $s_i^t$ is input into the policy network and an action $a_i^t$ is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends $R^t$ to the mobile users, and each mobile user determines its perception level $x_i^{t+1}$ for time slot t+1. Finally, the agent updates the new state and stores the transition in the experience replay buffer D, which is processed when full: for each user i, a mini-batch of experience is extracted from D to calculate the estimated gradient $\hat{\nabla}_{\omega_i}$ of the critic network and the estimated gradient $\hat{\nabla}_{\theta_i}$ of the actor network, and the parameters $\theta_i$ and $\omega_i$ are updated with stochastic gradients. After determining its own strategy, each user contributes its sensing data to the service provider, which evaluates the quality of the sensing data uploaded through the edge nodes and updates the user's reputation.
Claims (1)
1. A reinforcement learning mobile crowdsourcing incentive method based on user reputation, comprising a single-server multi-user MCS system, wherein the single-server multi-user MCS system is provided with a service provider SP, edge nodes and mobile users; the set of all mobile users is M = {1, 2, ..., M}, a group of optimal users N = {1, 2, ..., N} exists in M, a continuous decision period is divided into equal time slots t ∈ T = {1, 2, ..., T}, and in time slot t the perception level of each user i is denoted $x_i^t$, the perception levels of the users other than i are denoted $x_{-i}^t$, and the perception levels of all mobile users are collected into $x^t = (x_i^t, x_{-i}^t)$; characterized in that the service provider issues a perception task to the edge nodes in each time period, the edge nodes distribute the task to the users in the social network, the users determine and execute the received tasks according to their own perception levels and those of the other users and upload the perception data to the provider SP, and the provider SP scores the users' reputations according to data quality and user willingness and updates the reputation records; the method comprises the following steps:
1) Reputation evaluation stage:
The participation willingness of a user is determined by the social network effect. After receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and the service provider. The utility function of user i is defined as shown in formula (1):

$$u_i^t(x_i^t, x_{-i}^t) = \frac{\rho_i^t x_i^t}{\sum_{j \in N} \rho_j^t x_j^t}\, R^t - c_i^t x_i^t \qquad (1)$$

The utility of user i consists of a revenue function and a cost function. The first part, $\frac{\rho_i^t x_i^t}{\sum_{j \in N} \rho_j^t x_j^t} R^t$, defines the monetary reward obtained by user i; it is determined by the perception level $x_i^t$ and is a linear function of $R^t$, the total task reward in time period t, which is shared by all users. The reward is also weighted by the user's own reputation $\rho_i^t$: the higher the reputation of user i, the higher the reward. The second part, $c_i^t x_i^t$, defines the cost for user i to participate in the task, where $c_i^t$ is the user's perceived unit cost.

The utility function of the service provider SP is its revenue minus the total reward $R^t$ paid to the users, $u_{SP}^t = \phi(t) - R^t$, where the revenue function φ(t) is obtained from the users' perception levels, as shown in formula (2):

Here λ is a system parameter; a concave root function in φ(t) reflects the SP's diminishing return on the service of each user i, while an ln function reflects the SP's diminishing return on the number of mobile users. All locally participating users are assumed to be in a social network $G = [g_{ij}]_{i,j \in N}$ with $g_{ij} \in \{0, 1\}$: if user i is affected by user j, then $g_{ij} = 1$, otherwise $g_{ij} = 0$, and social relations are assumed reciprocal, $g_{ij} = g_{ji}$. According to the E-R model of social networks, all users are connected to the social network with probability μ; the larger the value of μ, the tighter the connection of each node and the more users besides user i affect user i. The willingness of user i to participate in a sensing task is defined as shown in formula (3):
A participant performance indicator is introduced: the reputation value is influenced by the reward obtained, and the higher the reward obtained by a user, the lower its performance index.

The reputation feedback function of a mobile user is defined as shown in formula (4):

Formula (4) measures the extent to which user i completes the task compared with all selected participants, where $R^t/n$ is the average reward per user drawn from the total reward $R^t$. Under the same reward, the reputation feedback value increases with the user's perception level and participation willingness, and the service provider SP becomes more inclined to select user i.
the reputation update of the user is shown in equation (5):
in the formula (5)For the new reputation value obtained for user i,for historical reputation values, ref i t For the feedback value, α is the factor for determining whether the participant gets or loses a new reputation value during the reputation update, the arctan (—) function is a monotonically increasing function, with positive feedback from the participant, the reputation value of the user will increase, while negative feedback will decrease faster, if anyThis means that the reputation value is when the preset value is 0Will start from 0.5;
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to the users' reputations and determines the maximum profit of both the service provider and the users within the optimal user set. At Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally. The interaction between the service provider SP and the mobile users is described as a two-stage Stackelberg game with a single leader and multiple followers; in the leader sub-game, the service provider decides the total task reward it offers so as to obtain more profit, as shown in formula (6):

$$\max_{R^t \ge 0}\; u_{SP}^t(R^t, x^t) \qquad (6)$$

In the follower sub-game, all users share the total reward, so a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7):

$$\max_{x_i^t \ge 0}\; u_i^t(x_i^t, x_{-i}^t, R^t) \qquad (7)$$
Each user determines its optimal response strategy given the strategy of the service provider SP, specifically as follows:

2-1) Follower sub-game: the total reward $R^t$ determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger reward. The competition among users is described as a non-cooperative game in which the strategy of each participating user is its perception level $x_i^t$, and the goal of each user is to maximize its own utility. When all users select their optimal strategies, the non-cooperative game reaches a stable state, namely Nash equilibrium (NE), which is defined for the non-cooperative game among the users as follows:

2-1-2) Nash equilibrium exists in the non-cooperative game and is denoted $x^{t*} = (x_i^{t*}, x_{-i}^{t*})$ when the utility of every user satisfies

$$u_i^t(x_i^{t*}, x_{-i}^{t*}) \ge u_i^t(x_i^t, x_{-i}^{t*}) \quad \forall\, x_i^t \ge 0,\ i \in N;$$

that is, when the total reward $R^t$ is given, Nash equilibrium exists in the non-cooperative game. Whether NE exists in the non-cooperative game depends on the user utility function being a concave function of the user's strategy; by solving the first-order and second-order derivatives of the user utility function, it is shown to be strictly concave, as in formula (8):

$$\frac{\partial^2 u_i^t}{\partial (x_i^t)^2} < 0 \qquad (8)$$

2-2) Leader sub-game: in the Stackelberg game, the leader, i.e. the service provider SP, offers the total reward $R^t$ to the users, among whom Nash equilibrium holds, and determines the $R^t$ that achieves its maximum utility, denoted $R^{t*}$. In the Stackelberg game, $R^{t*}$ is the optimal strategy maximizing the utility of the provider SP, and when $X^{t*}$ is the optimal perception level, the Stackelberg equilibrium $(R^{t*}, X^{t*})$ exists. As in step 2-1), the second derivative of the SP utility function is computed to show that it is a strictly concave function, which proves that the Stackelberg equilibrium exists;
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behavior is modeled as a Markov decision process (MDP) and combined with the standard reinforcement learning framework of agent-environment interaction. The Markov decision process is described as a quintuple M = ⟨S, A, U, P, γ⟩, and the state, action and reward of the t-th time slot are defined as $s_i^t$, $a_i^t$ and $u_i^t$, where the state space, action space, state transition probability and reward are as follows:

State space: the state of user i in each time slot consists of the other users' perception levels in the previous time slot, the perceived unit cost, and the task reward, as shown in formula (9):

$$s_i^t = \bigl(x_{-i}^{t-1},\; c_i^t,\; R^t\bigr) \qquad (9)$$

Action space: in the t-th time slot, the action of user i is its perception level, i.e. $a_i^t = x_i^t$;

State transition probability: the dynamics of the environment are represented by the state transition probability $P(s^{t+1} \mid s^t, a^t)$;

Reward: the reward function of user i is the user utility function $u_i^t$;
3-2) Perception strategy decision based on the PPO algorithm: a deep reinforcement learning algorithm based on PPO (proximal policy optimization) is adopted to optimize the perception strategy of each user; its goal is to find an optimal set of parameters $\{\theta_i^*\}_{i \in N}$, where $\theta_i$ represents the policy of user i. First, the policy is parameterized as $\pi_{\theta_i}(a_i^t \mid s_i^t)$; the optimization objective of the policy gradient is then as shown in formula (10):

The state-value function is defined as $V_{\omega_i}(s_i^t)$, whose value is the expected cumulative reward, as shown in formula (11):

$$V_{\omega_i}(s_i^t) = \mathbb{E}\bigl[U_i^t\bigr], \qquad U_i^t = \sum_{k=0}^{T-t} \gamma^k\, u_i^{t+k} \qquad (11)$$

In formula (11), $U_i^t$ is the cumulative reward of user i from the t-th time slot and $\gamma \in [0, 1]$ is a discount factor: γ = 0 means the user is concerned only with current utility and not with long-term utility, while γ = 1 means it is concerned with the cumulative effect from time slot t to the final time slot T. The strategy is optimized with a policy gradient method based on the actor-critic framework. The actor-critic model consists of two networks, an actor network $\pi_{\theta_i}$ with parameters $\theta_i$ and a critic network $V_{\omega_i}$ with parameters $\omega_i$. The actor network takes the state $s_i^t$ as input and solves for the optimal perception level that yields the maximum long-term cumulative profit; the policy gradient is calculated according to the stochastic policy gradient theorem, as shown in formula (12):

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\Bigl[\nabla_{\theta_i} \log \pi_{\theta_i}(a_i^t \mid s_i^t)\, A^t\Bigr] \qquad (12)$$

Here $A^t$ is the advantage function over actions and states, representing the advantage of action $a_i^t$ relative to state $s_i^t$. The proximal policy optimization (PPO) method is adopted to clip the policy gradient, as shown in formula (13):

$$L^{CLIP}(\theta_i) = \mathbb{E}\Bigl[\min\bigl(r^t(\theta_i) A^t,\; \operatorname{clip}\bigl(r^t(\theta_i), 1-\epsilon, 1+\epsilon\bigr) A^t\bigr)\Bigr], \quad r^t(\theta_i) = \frac{\pi_{\theta_i}(a_i^t \mid s_i^t)}{\pi_{\theta_i^{\mathrm{old}}}(a_i^t \mid s_i^t)} \qquad (13)$$
The loss function for updating the critic network $V_{\omega_i}$ is defined as formula (14), and the estimated gradient of $\omega_i$ is computed as shown in formula (15):

Likewise, the estimated gradient of the actor network parameters $\theta_i$ is computed as shown in formula (16):

D is the number of samples used for the policy gradient estimate in each training step, i.e. the mini-batch size; the actor and critic networks are updated after the dynamic game over D time slots, using mini-batch stochastic gradient ascent for $\theta_i$ and mini-batch stochastic gradient descent for $\omega_i$:

$$\theta_i \leftarrow \theta_i + l_{i,1}\, \hat{\nabla}_{\theta_i} L^{CLIP}(\theta_i), \qquad \omega_i \leftarrow \omega_i - l_{i,2}\, \hat{\nabla}_{\omega_i} L(\omega_i)$$

where $l_{i,1}$ and $l_{i,2}$ are the learning rates of the actor network and the critic network, respectively.

First, the parameters $\omega_i$ and $\theta_i$ of the actor-critic networks are initialized. For each training period, the buffer D is cleared and the state is initialized to $s_i^1$. In each time slot t, $s_i^t$ is input into the policy network and an action $a_i^t$ is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends $R^t$ to the mobile users, and each mobile user determines its perception level $x_i^{t+1}$ for time slot t+1. Finally, the agent updates the new state and stores the transition in the experience replay buffer D, which is processed when full: for each user i, a mini-batch of experience is extracted from D to calculate the estimated gradient $\hat{\nabla}_{\omega_i}$ of the critic network and the estimated gradient $\hat{\nabla}_{\theta_i}$ of the actor network, and the parameters $\theta_i$ and $\omega_i$ are updated with stochastic gradients. After determining its own strategy, each user contributes its sensing data to the service provider, which evaluates the quality of the sensing data uploaded through the edge nodes and updates the user's reputation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210790320.6A | 2022-07-06 | 2022-07-06 | Reinforced learning mobile crowdsourcing incentive method based on user reputation
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210790320.6A | 2022-07-06 | 2022-07-06 | Reinforced learning mobile crowdsourcing incentive method based on user reputation
Publications (1)
Publication Number | Publication Date |
---|---|
CN115392337A (en) | 2022-11-25
Family
ID=84116151
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210790320.6A (pending) | 2022-07-06 | 2022-07-06 | Reinforced learning mobile crowdsourcing incentive method based on user reputation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115392337A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116095721A (en) * | 2023-04-07 | 2023-05-09 | 湖北工业大学 | Mobile crowd-sourced network contract excitation method and system integrating perception communication |
CN116095721B (en) * | 2023-04-07 | 2023-06-27 | 湖北工业大学 | Mobile crowd-sourced network contract excitation method and system integrating perception communication |
CN116744289A (en) * | 2023-06-02 | 2023-09-12 | 中国矿业大学 | Intelligent position privacy protection method for 3D space mobile crowd sensing application |
CN116744289B (en) * | 2023-06-02 | 2024-02-09 | 中国矿业大学 | Intelligent position privacy protection method for 3D space mobile crowd sensing application |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |