CN115392337A - Reinforcement learning mobile crowdsourcing incentive method based on user reputation - Google Patents

Reinforcement learning mobile crowdsourcing incentive method based on user reputation

Info

Publication number
CN115392337A
CN115392337A (publication) · CN202210790320.6A (application)
Authority
CN
China
Prior art keywords
user
users
function
reputation
service provider
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210790320.6A
Other languages
Chinese (zh)
Inventor
李先贤
张嘉林
石贞奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202210790320.6A
Publication of CN115392337A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0207 Discounts or incentives, e.g. coupons or rebates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Primary Health Care (AREA)
  • Operations Research (AREA)
  • Tourism & Hospitality (AREA)
  • Algebra (AREA)
  • Human Resources & Organizations (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computing Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a reinforcement learning mobile crowdsourcing incentive method based on user reputation for a single-server multi-user MCS system. The method comprises the following steps: 1) a reputation evaluation stage; 2) a task issuing stage; 3) a training stage. The method fully considers both the quality of the data uploaded by a user and the user's willingness to participate, incorporates these two criteria into the user's reputation index, and thereby helps the service provider screen trustworthy users effectively, reducing cost and increasing revenue.

Description

Reinforcement learning mobile crowdsourcing incentive method based on user reputation
Technical Field
The invention relates to mobile crowdsourcing and reinforcement learning techniques in edge computing, and in particular to a reinforcement learning mobile crowdsourcing incentive method based on user reputation.
Background
Mobile crowd sensing (MCS) has become a popular and widely adopted mode of urban sensing and data collection. In recent years, mobile devices such as mobile phones, tablet computers, wearable devices and vehicle-mounted smart devices have become widespread, and these devices have sensing, computing and communication capabilities. Compared with a traditional sensor network, an MCS system formed by such mobile devices is highly mobile and intelligent, can achieve wider coverage, and can meet dynamic sensing requirements. The sensing and computing capability of an MCS system depends on a large number of mobile devices participating in sensing tasks and contributing their sensing data; in practice, however, users are reluctant to contribute without compensation, and participating in a task may leak their privacy. Designing a reasonable incentive mechanism that encourages mobile users to keep participating in sensing tasks is therefore a challenge.
In an edge computing scenario, an MCS system is typically reduced to a cloud service provider (SP) and users with terminal mobile devices; the SP issues sensing tasks and rewards for participation to motivate mobile users to complete the tasks. Dishonest users in the MCS may submit false sensing data and make the task result inaccurate, so the SP must select users to upload data based on their reputation values, a higher reputation indicating higher-quality data. Social network effects between mobile user devices also influence participants' sensing strategies: a user gains additional benefit from information provided or shared by local neighbors in the social network and is therefore more willing to participate in sensing tasks. A reputation mechanism consequently needs to account for the willingness of a user as determined by social network effects; this is why both user willingness and contributed data quality matter in an incentive mechanism.
Disclosure of Invention
The invention aims to provide a reinforcement learning mobile crowdsourcing incentive method based on user reputation, addressing the rationality of user incentive methods for mobile crowd sensing in edge computing and the problem that neither the data quality nor the degree of participation of users during a sensing task can otherwise be guaranteed. The method fully considers both the quality of the data uploaded by a user and the user's willingness to participate, incorporates these two criteria into the user's reputation index, and thereby helps the service provider screen trustworthy users effectively, reducing cost and increasing revenue.
The technical scheme for realizing the purpose of the invention is as follows:
A reinforcement learning mobile crowdsourcing incentive method based on user reputation operates in a single-server multi-user MCS system comprising a service provider SP, edge nodes and mobile users. All mobile users form the set M = {1, 2, ..., M}, within which there is a set of optimal users N = {1, 2, ..., N}. A continuous decision period is divided into equal time slots t ∈ T = {1, 2, ..., T}, and in each time slot t the perception level of user i is denoted x_i^t, the perception levels of the users other than i are denoted x_{-i}^t, and the perception levels of all mobile users are collected as x^t = (x_i^t, x_{-i}^t). In each time period the service provider issues a sensing task to an edge node, the edge node distributes the task to users in the social network, each user decides on and executes the received task according to its own perception level and those of the other users, and the sensing data are uploaded to the provider SP, which scores each user's reputation according to data quality and participation willingness and updates the reputation record. The method comprises the following steps:
1) Reputation evaluation stage:
The user's willingness to participate is determined by the social network effect. After receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and of the service provider. The utility function of user i is defined in formula (1); it consists of a revenue term and a cost term. The first term is the monetary reward obtained by user i, determined by its perception level x_i^t as a linear function of x_i^t, where R^t denotes the total task reward in time period t, shared by all users; the user's share is also weighted by the user's own reputation, so the higher the reputation of user i, the higher its reward. The second term is the cost of user i participating in the task, in which the user's unit sensing cost appears as a factor.
The utility function of the service provider SP is its revenue minus the total reward R^t paid to the users; the revenue of the SP is obtained from the users' perception levels and expressed by a function φ(t), as shown in formula (2), where λ is a system parameter. The function in formula (2) reflects the diminishing return of the service provider SP on the service contributed by user i, while the ln function reflects the provider's diminishing return on the number of mobile users. All locally participating users are assumed to be in a social network G = [g_ij]_{i,j∈N} with g_ij ∈ {0,1}: if user i is influenced by user j then g_ij = 1, otherwise g_ij = 0, and social relations are assumed reciprocal, g_ij = g_ji. According to the Erdős–Rényi (E-R) model of the social network, users are connected with probability μ; the larger μ is, the more tightly the nodes are connected and the more users influence a given user i. The willingness of user i to participate in a sensing task is defined as shown in formula (3).
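The display formulas (1)-(3) are reproduced only as images in this text. For readability, one plausible reconstruction consistent with the surrounding description (reputation-weighted sharing of the total reward, linear sensing cost, a concave provider revenue, and neighbour-driven willingness) is sketched below; the symbols r_i^t (reputation weight), c_i^t (unit sensing cost) and w_i^t (willingness) are introduced here for illustration and are not taken from the published formulas.

```latex
% Plausible reconstructions of formulas (1)-(3); the exact published forms may differ.
\begin{align*}
u_i^t &= \frac{r_i^t x_i^t}{\sum_{j \in N} r_j^t x_j^t}\, R^t \;-\; c_i^t x_i^t
      && \text{(user utility, cf. formula (1))}\\
u_{\mathrm{SP}}^t &= \lambda \sum_{i \in N} \ln\!\left(1 + x_i^t\right) \;-\; R^t
      && \text{(provider utility, cf. formula (2))}\\
w_i^t &= \sum_{j \in N} g_{ij}\, x_j^t
      && \text{(social-network willingness, cf. formula (3))}
\end{align*}
```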
A participant performance indicator is introduced: the reputation value is influenced by the reward obtained. Since the provider SP prefers users who require less reward, the higher the reward obtained by a user, the lower its performance indicator. The reputation feedback function of a mobile user is defined as shown in formula (4), which measures the extent to which user i completes the task compared with all selected participants, where R^t/N is the average reward each user receives from the total reward R^t. Under the same reward, the larger the perception level and the participation willingness of user i, the larger the reputation feedback value, and the more inclined the service provider SP is to select user i.
The user's reputation is updated as shown in formula (5), in which the new reputation value of user i is obtained from its historical reputation value and the feedback value ref_i^t, and α is the factor that determines how much reputation a participant gains or loses during the update. The arctan function is monotonically increasing: with positive feedback the user's reputation value increases, while with negative feedback it decreases faster. If the historical reputation value equals the preset value 0, the reputation value starts from 0.5.
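Formulas (4) and (5) are likewise only available as images. The following minimal Python sketch shows one way the described arctan-based reputation update could behave; the feedback definition, the scaling by π and the extra weight on negative feedback are assumptions, and only the qualitative properties stated above (monotone increase, faster decrease on negative feedback, a 0.5 start when the history is the preset value 0) are taken from the description.

```python
import math

def reputation_feedback(x_i, mean_x, reward_i, mean_reward):
    """Hypothetical feedback value ref_i^t (cf. formula (4)): relative task
    completion of user i, discounted by the reward it required."""
    completion = x_i / mean_x if mean_x > 0 else 0.0
    thrift = mean_reward / reward_i if reward_i > 0 else 0.0   # lower reward -> higher indicator
    return completion * thrift - 1.0                            # >0 positive, <0 negative feedback

def update_reputation(rep_hist, ref, alpha=1.0, negative_weight=2.0):
    """Hypothetical arctan update (cf. formula (5))."""
    if rep_hist == 0.0:                  # preset value 0: reputation starts from 0.5
        rep_hist = 0.5
    weight = alpha if ref >= 0 else alpha * negative_weight     # negative feedback decays faster
    return rep_hist + (1.0 / math.pi) * math.atan(weight * ref)

# Example: positive feedback raises an initial reputation above 0.5.
print(update_reputation(0.0, reputation_feedback(1.2, 1.0, 0.8, 1.0)))
```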
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to user reputation, and the maximum benefits of both the service provider and the users within the optimal user set are determined.
Both the service provider SP and the mobile users pursue their own benefit, so a reasonable pricing policy of the service provider must be determined first, after which each user determines its service-provision strategy. The Stackelberg game is a two-stage dynamic game with complete information; its main idea is that each side, given the strategy of the other, selects the strategy that maximizes its own utility, so that a Nash equilibrium is reached. At a Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally. The interaction between the service provider SP and the mobile users is therefore described as a two-stage Stackelberg game with a single leader and multiple followers. In the leader sub-game, the service provider decides the total reward offered for the task in order to obtain more service, as shown in formula (6). In the follower sub-game, since all users share the total reward, a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7).
Each user then determines its own optimal response to the strategy of the service provider SP, specifically as follows:
2-1) Follower sub-game: the total reward R^t determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger share of the reward. This competition is described as a non-cooperative game in which the strategy of each participating user is its perception level x_i^t and the goal of each user is to maximize its own utility. When every user selects its optimal strategy, the non-cooperative game reaches a stable state, the Nash equilibrium (NE), defined for the non-cooperative game among users as follows:
2-1-1) given the other users' strategies x_{-i}^{t*}, the strategy x_i^{t*} is the optimal response of user i to x_{-i}^{t*};
2-1-2) when no user can increase its utility by unilaterally changing its perception level while the others keep x_{-i}^{t*}, a Nash equilibrium exists in the non-cooperative game and is denoted (x_i^{t*}, x_{-i}^{t*}).
When the total reward R^t is given, a Nash equilibrium exists in the non-cooperative game. Whether the NE exists depends on the user utility function being a concave function of the strategy; by taking the first- and second-order derivatives of the user utility function, it is shown to be strictly concave, as in formula (8).
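The strict concavity argument behind formula (8) can be illustrated with the reconstructed utility given earlier. Writing S_{-i} = Σ_{j≠i} r_j^t x_j^t (a symbol introduced here for illustration), the first and second derivatives with respect to x_i^t are:

```latex
% Derivatives of the reconstructed utility, not of the published formula (8).
\frac{\partial u_i^t}{\partial x_i^t}
  = \frac{r_i^t S_{-i}}{\bigl(r_i^t x_i^t + S_{-i}\bigr)^{2}}\, R^t - c_i^t,
\qquad
\frac{\partial^2 u_i^t}{\partial (x_i^t)^2}
  = -\,\frac{2\,(r_i^t)^2 S_{-i}}{\bigl(r_i^t x_i^t + S_{-i}\bigr)^{3}}\, R^t < 0,
```

so under this assumed form the utility is strictly concave in x_i^t whenever S_{-i} > 0, which is the property that guarantees existence of the Nash equilibrium.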
2-2) Leader sub-game: in the Stackelberg game the leader, i.e. the service provider SP, offers the total reward R^t to the users, among whom a Nash equilibrium exists; the service provider SP then determines the value of R^t that maximizes its own utility, denoted R^{t(*)}. In the Stackelberg game, R^{t(*)} is the optimal strategy maximizing the utility of the provider SP, and when X^{t(*)} is the corresponding optimal perception-level profile, the Stackelberg equilibrium (R^{t(*)}, X^{t(*)}) exists. As in step 2-1), the second derivative of the utility function of the service provider SP is computed to show that it is strictly concave, which proves that the Stackelberg equilibrium exists.
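The two-level structure can be illustrated numerically. The sketch below is built on the same reconstructed utilities, so the functional forms, the grid ranges and λ are assumptions; it approximates the follower Nash equilibrium by iterated best response and the leader optimum R^{t(*)} by a simple grid search, whereas the patent derives these via the concavity arguments above.

```python
import numpy as np

def best_response(R, rep, cost, x, i, grid=np.linspace(0.01, 10.0, 400)):
    """Best response of user i over a grid, other perception levels fixed."""
    others = float(rep @ x) - rep[i] * x[i]                     # sum_{j!=i} r_j * x_j
    utils = rep[i] * grid / (rep[i] * grid + others) * R - cost[i] * grid
    return grid[int(np.argmax(utils))]

def follower_equilibrium(R, rep, cost, sweeps=100):
    """Iterated best response as a numerical stand-in for the follower NE."""
    x = np.ones_like(rep)
    for _ in range(sweeps):
        for i in range(len(x)):
            x[i] = best_response(R, rep, cost, x, i)
    return x

def leader_optimum(rep, cost, lam=5.0, R_grid=np.linspace(0.5, 30.0, 60)):
    """Leader sub-game: pick the total reward R^t maximizing the assumed SP utility."""
    def sp_utility(R):
        return lam * np.log1p(follower_equilibrium(R, rep, cost)).sum() - R
    R_star = max(R_grid, key=sp_utility)
    return R_star, follower_equilibrium(R_star, rep, cost)

# Toy example with three users of different reputation and cost.
rep = np.array([0.9, 0.7, 0.5]); cost = np.array([0.3, 0.4, 0.5])
print(leader_optimum(rep, cost))
```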
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behaviour is modelled as a Markov decision process (MDP) within the standard reinforcement learning framework of agent-environment interaction. The MDP is described as a five-tuple M = <S, A, U, P, γ>, and the state, action and reward of user i in the t-th time slot are denoted s_i^t, a_i^t and u_i^t, with the state space, action space, state transition probability and reward defined as follows:
State space: the state of user i in each time slot consists of the perception levels of the other users in the previous time slot, the perceived unit cost and the task reward R^t, as shown in formula (9).
Action space: in the t-th time slot, the action of user i is its perception level, i.e. a_i^t = x_i^t.
State transition probability: the dynamics of the environment are represented by the state transition probability of moving from s_i^t to s_i^{t+1} under action a_i^t.
Reward: the reward follows the utility, i.e. the reward function of user i is the user utility function.
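A minimal sketch of the per-slot environment transition implied by this MDP follows; it reuses the reconstructed reward-sharing utility, so the state layout and the sharing rule are assumptions rather than the published formulas (1) and (9).

```python
import numpy as np

def mcs_step(actions, unit_costs, reputations, total_reward):
    """One time-slot transition of the single-server multi-user MCS environment.
    actions[i] is user i's perception level x_i^t; each user's reward is its
    utility (assumed reward share minus sensing cost)."""
    pool = float(reputations @ actions)
    rewards = reputations * actions / pool * total_reward - unit_costs * actions
    # Next state of user i: the other users' perception levels, its unit cost,
    # and the task reward of the coming slot (cf. formula (9)).
    next_states = [
        {"x_others": np.delete(actions, i), "c": unit_costs[i], "R": total_reward}
        for i in range(len(actions))
    ]
    return next_states, rewards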
3-2) Perception strategy decision based on the PPO algorithm: the service provider and the optimal users select their own optimal strategies using the PPO algorithm, and a PPO-based deep reinforcement learning algorithm is adopted to optimize each user's sensing strategy. The goal of the PPO deep reinforcement learning algorithm is to find a set of optimal parameters {θ_i}, where θ_i represents the policy of user i. First, the policy is parameterized as π_{θ_i}(a_i^t | s_i^t); the optimization objective of the policy gradient is then as shown in formula (10). The state-value function and the action-value function are defined as in formula (11), in which the accumulated reward of user i from the t-th time slot onward is discounted by a factor γ ∈ [0, 1]: γ = 0 means the user cares only about the current utility and not the long-term utility, while γ = 1 means it cares about the cumulative utility from time slot t up to the final time slot T. The strategy is optimized with a policy-gradient method based on the actor-critic framework. The actor-critic model consists of two networks, an actor network π_{θ_i} with parameters θ_i and a critic network V_{ω_i} with parameters ω_i. The actor network takes the state s_i^t as input and outputs the perception level that maximizes the long-term accumulated return; the policy gradient is computed according to the stochastic policy gradient theorem, as shown in formula (12).
In formula (12), the expectation is taken with respect to the old sampling policy, whose parameters are those used to collect the trajectories, and the advantage function of the action and state measures how much better action a_i^t is than the average action in state s_i^t. The proximal policy optimization (PPO) method clips (prunes) this policy gradient as shown in formula (13), where the probability ratio between the new policy and the old sampling policy appears and η(x) is a piecewise clipping function defined over an interval.
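For reference, the standard clipped surrogate used by PPO, which the pruning step described around formula (13) resembles, is given below; ε is the clipping range and ρ_i^t the probability ratio between the new and the old policy. This is the textbook PPO form, not the patent's exact η.

```latex
% Standard PPO clipped surrogate (for reference; not the published formula (13)).
\rho_i^t = \frac{\pi_{\theta_i}\!\left(a_i^t \mid s_i^t\right)}
                {\pi_{\theta_i^{\mathrm{old}}}\!\left(a_i^t \mid s_i^t\right)},
\qquad
L^{\mathrm{CLIP}}(\theta_i)
  = \mathbb{E}\!\left[\min\!\left(\rho_i^t A_i^t,\;
      \mathrm{clip}\!\left(\rho_i^t,\, 1-\varepsilon,\, 1+\varepsilon\right) A_i^t\right)\right].
```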
The loss function used to update the critic network V_{ω_i} is given in formula (14); the gradient with respect to ω_i is computed as in formula (15), and likewise the gradient with respect to the actor network parameters θ_i is computed as in formula (16). D is the number of samples used for the policy-gradient estimate at each training step, i.e. the mini-batch size, and the actor and critic networks are updated after the dynamic game of D time slots: θ_i is updated by mini-batch stochastic gradient ascent and ω_i by mini-batch stochastic gradient descent, where l_{i,1} and l_{i,2} are the learning rates of the actor network and the critic network, respectively. The training procedure is as follows:
First, the parameters ω_i and θ_i of the actor-critic networks are initialized. At the start of each training period the buffer D is cleared and the state s_i^1 is initialized. For each time slot t, the state s_i^t is fed into the policy network and an action a_i^t is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends R^t to the mobile users, and each mobile user determines its perception level for time slot t + 1. Finally, the agent observes the new state and stores the transition in the experience replay buffer D; when the buffer is full, the networks are updated: for each user i, a mini-batch of experience is drawn from D to compute the estimated gradient of the critic network and the estimated gradient of the actor network, and the parameters θ_i and ω_i are updated with stochastic gradient steps. After determining its own strategy, each user contributes its sensing data to the service provider, which evaluates the sensing data uploaded through the edge node and updates the user's reputation.
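The update loop can be summarized in a short PyTorch sketch. Network sizes, the Gaussian policy, the advantage estimate and the optimizers are illustrative choices; only the overall actor-critic structure and the clipped update follow the description, not the patent's exact formulas (13)-(16).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi_theta_i(a|s) producing a Gaussian over the perception level."""
    def __init__(self, state_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.log_std = nn.Parameter(torch.zeros(1))

    def dist(self, s):
        return torch.distributions.Normal(self.mean(s), self.log_std.exp())

class Critic(nn.Module):
    """Value network V_omega_i(s)."""
    def __init__(self, state_dim):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, s):
        return self.v(s).squeeze(-1)

def ppo_update(actor, critic, opt_actor, opt_critic, batch, clip_eps=0.2):
    """One mini-batch update: clipped gradient ascent for the actor,
    squared-error gradient descent for the critic."""
    s, a, old_logp, ret = batch              # s: (B, d); a: (B, 1); old_logp, ret: (B,)
    adv = ret - critic(s).detach()           # advantage estimate A = R - V(s)

    ratio = (actor.dist(s).log_prob(a).squeeze(-1) - old_logp).exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    actor_loss = -torch.min(ratio * adv, clipped * adv).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    critic_loss = ((critic(s) - ret) ** 2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
```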
The method helps the service provider screen trustworthy users effectively, thereby reducing cost and increasing revenue, while protecting user privacy and maximizing the benefits of both the service provider and the users in a dynamic scenario.
Drawings
FIG. 1 is a schematic diagram of a single-server multi-user MCS system in an embodiment;
fig. 2 is a schematic diagram of a dynamic gaming process in the embodiment.
Detailed Description
The invention will be further illustrated by the following figures and examples, but is not limited thereto.
Example:
a reinforcement learning mobile crowdsourcing incentive method based on user reputation, comprising a single server multi-user MCS system, as shown in fig. 1, the single server multi-user MCS system is provided with a service provider SP, edge nodes and mobile users, assuming that all mobile users are M = {1, 2., M }, and there is a group of optimal users of N = {1, 2., N } in M, dividing a continuous decision cycle into equal time slots T, T = {1, 2., T }, and at time slot T, a perception level of each user i is represented as T
Figure RE-GDA0003819577010000073
The perception levels of other users than user i are
Figure RE-GDA0003819577010000074
The sensing levels of all mobile users are collected as
Figure RE-GDA0003819577010000075
The method comprises the steps that a service provider issues sensing tasks to edge nodes in each time period, the edge nodes distribute the tasks to users in a social network, the tasks received by the users determine and execute the tasks according to the sensing levels of the users and other users, sensing data are uploaded to a provider SP, the provider SP scores the reputation of the users according to the data quality and the wishes of the users and updates the reputation records, and the method is characterized in that the service provider sends the sensing tasks to the edge nodes in each time period, the edge nodes distribute the tasks to the users in the social network, the tasks received by the users are determined and executed according to the sensing levels of the users and other users, the provider SP scores the reputation of the users according to the data quality and the wishes of the users, and updates the reputation recordsThe method comprises the following steps:
1) Reputation evaluation stage:
The user's willingness to participate is determined by the social network effect. After receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and of the service provider. The utility function of user i is defined in formula (1); it consists of a revenue term and a cost term. The first term is the monetary reward obtained by user i, determined by its perception level x_i^t as a linear function of x_i^t, where R^t denotes the total task reward in time period t, shared by all users; the user's share is also weighted by the user's own reputation, so the higher the reputation of user i, the higher its reward. The second term is the cost of user i participating in the task, in which the user's unit sensing cost appears as a factor.
The utility function of the service provider SP is its revenue minus the total reward R^t paid to the users; the revenue of the SP is obtained from the users' perception levels and expressed by a function φ(t), as shown in formula (2), where λ is a system parameter. The function in formula (2) reflects the diminishing return of the service provider SP on the service contributed by user i, while the ln function reflects the provider's diminishing return on the number of mobile users. All locally participating users are assumed to be in a social network G = [g_ij]_{i,j∈N} with g_ij ∈ {0,1}: if user i is influenced by user j then g_ij = 1, otherwise g_ij = 0, and social relations are assumed reciprocal, g_ij = g_ji. According to the Erdős–Rényi (E-R) model of the social network, users are connected with probability μ; the larger μ is, the more tightly the nodes are connected and the more users influence a given user i. The willingness of user i to participate in a sensing task is defined as shown in formula (3).
A participant performance indicator is introduced: the reputation value is influenced by the reward obtained. Since the provider SP prefers users who require less reward, the higher the reward obtained by a user, the lower its performance indicator. The reputation feedback function of a mobile user is defined as shown in formula (4), which measures the extent to which user i completes the task compared with all selected participants, where R^t/N is the average reward each user receives from the total reward R^t. Under the same reward, the larger the perception level and the participation willingness of user i, the larger the reputation feedback value, and the more inclined the service provider SP is to select user i.
The user's reputation is updated as shown in formula (5), in which the new reputation value of user i is obtained from its historical reputation value and the feedback value ref_i^t, and α is the factor that determines how much reputation a participant gains or loses during the update. The arctan function is monotonically increasing: with positive feedback the user's reputation value increases, while with negative feedback it decreases faster. If the historical reputation value equals the preset value 0, the reputation value starts from 0.5.
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to user reputation, and the maximum benefits of both the service provider and the users within the optimal user set are determined.
Both the service provider SP and the mobile users pursue their own benefit, so a reasonable pricing policy of the service provider must be determined first, after which each user determines its service-provision strategy. The Stackelberg game is a two-stage dynamic game with complete information; its main idea is that each side, given the strategy of the other, selects the strategy that maximizes its own utility, so that a Nash equilibrium is reached. At a Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally. The interaction between the service provider SP and the mobile users is therefore described as a two-stage Stackelberg game with a single leader and multiple followers. In the leader sub-game, the service provider decides the total reward offered for the task in order to obtain more service, as shown in formula (6). In the follower sub-game, since all users share the total reward, a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7).
Each user then determines its own optimal response to the strategy of the service provider SP, specifically as follows:
2-1) Follower sub-game: the total reward R^t determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger share of the reward. This competition is described as a non-cooperative game in which the strategy of each participating user is its perception level x_i^t and the goal of each user is to maximize its own utility. When every user selects its optimal strategy, the non-cooperative game reaches a stable state, the Nash equilibrium (NE), defined for the non-cooperative game among users as follows:
2-1-1) given the other users' strategies x_{-i}^{t*}, the strategy x_i^{t*} is the optimal response of user i to x_{-i}^{t*};
2-1-2) when no user can increase its utility by unilaterally changing its perception level while the others keep x_{-i}^{t*}, a Nash equilibrium exists in the non-cooperative game and is denoted (x_i^{t*}, x_{-i}^{t*}).
When the total reward R^t is given, a Nash equilibrium exists in the non-cooperative game. Whether the NE exists depends on the user utility function being a concave function of the strategy; by taking the first- and second-order derivatives of the user utility function, it is shown to be strictly concave, as in formula (8).
2-2) Leader sub-game: in the Stackelberg game the leader, i.e. the service provider SP, offers the total reward R^t to the users, among whom a Nash equilibrium exists; the service provider SP then determines the value of R^t that maximizes its own utility, denoted R^{t(*)}. In the Stackelberg game, R^{t(*)} is the optimal strategy maximizing the utility of the provider SP, and when X^{t(*)} is the corresponding optimal perception-level profile, the Stackelberg equilibrium (R^{t(*)}, X^{t(*)}) exists. As in step 2-1), the second derivative of the utility function of the service provider SP is computed to show that it is strictly concave, which proves that the Stackelberg equilibrium exists.
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behaviour is modelled as a Markov decision process (MDP) within the standard reinforcement learning framework of agent-environment interaction. The MDP is described as a five-tuple M = <S, A, U, P, γ>, and the state, action and reward of user i in the t-th time slot are denoted s_i^t, a_i^t and u_i^t, with the state space, action space, state transition probability and reward defined as follows:
State space: the state of user i in each time slot consists of the perception levels of the other users in the previous time slot, the perceived unit cost and the task reward R^t, as shown in formula (9).
Action space: in the t-th time slot, the action of user i is its perception level, i.e. a_i^t = x_i^t.
State transition probability: the dynamics of the environment are represented by the state transition probability of moving from s_i^t to s_i^{t+1} under action a_i^t.
Reward: the reward follows the utility, i.e. the reward function of user i is the user utility function.
3-2) Perception strategy decision based on the PPO algorithm: the service provider and the optimal users select their own optimal strategies using the PPO algorithm, and a PPO-based deep reinforcement learning algorithm is adopted to optimize each user's sensing strategy. The goal of the PPO deep reinforcement learning algorithm is to find a set of optimal parameters {θ_i}, where θ_i represents the policy of user i. First, the policy is parameterized as π_{θ_i}(a_i^t | s_i^t); the optimization objective of the policy gradient is then as shown in formula (10). The state-value function and the action-value function are defined as in formula (11), in which the accumulated reward of user i from the t-th time slot onward is discounted by a factor γ ∈ [0, 1]: γ = 0 means the user cares only about the current utility and not the long-term utility, while γ = 1 means it cares about the cumulative utility from time slot t up to the final time slot T. The strategy is optimized with a policy-gradient method based on the actor-critic framework. The actor-critic model consists of two networks, an actor network π_{θ_i} with parameters θ_i and a critic network V_{ω_i} with parameters ω_i. The actor network takes the state s_i^t as input and outputs the perception level that maximizes the long-term accumulated return; the policy gradient is computed according to the stochastic policy gradient theorem, as shown in formula (12).
In formula (12), the expectation is taken with respect to the old sampling policy, whose parameters are those used to collect the trajectories, and the advantage function of the action and state measures how much better action a_i^t is than the average action in state s_i^t. The proximal policy optimization (PPO) method clips (prunes) this policy gradient as shown in formula (13), where the probability ratio between the new policy and the old sampling policy appears and η(x) is a piecewise clipping function defined over an interval.
The loss function used to update the critic network V_{ω_i} is given in formula (14); the gradient with respect to ω_i is computed as in formula (15), and likewise the gradient with respect to the actor network parameters θ_i is computed as in formula (16). D is the number of samples used for the policy-gradient estimate at each training step, i.e. the mini-batch size, and the actor and critic networks are updated after the dynamic game of D time slots: θ_i is updated by mini-batch stochastic gradient ascent and ω_i by mini-batch stochastic gradient descent, where l_{i,1} and l_{i,2} are the learning rates of the actor network and the critic network, respectively. The training procedure is as follows:
First, the parameters ω_i and θ_i of the actor-critic networks are initialized. At the start of each training period the buffer D is cleared and the state s_i^1 is initialized. For each time slot t, the state s_i^t is fed into the policy network and an action a_i^t is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends R^t to the mobile users, and each mobile user determines its perception level for time slot t + 1. Finally, the agent observes the new state and stores the transition in the experience replay buffer D; when the buffer is full, the networks are updated: for each user i, a mini-batch of experience is drawn from D to compute the estimated gradient of the critic network and the estimated gradient of the actor network, and the parameters θ_i and ω_i are updated with stochastic gradient steps. After determining its strategy, each user contributes its sensing data to the service provider, which evaluates the quality of the sensing data uploaded through the edge node and updates the user's reputation.

Claims (1)

1. A reinforcement learning mobile crowdsourcing incentive method based on user reputation, applied to a single-server multi-user MCS system comprising a service provider SP, edge nodes and mobile users, wherein all mobile users form the set M = {1, 2, ..., M}, within which there is a set of optimal users N = {1, 2, ..., N}; a continuous decision period is divided into equal time slots t ∈ T = {1, 2, ..., T}, and in each time slot t the perception level of user i is denoted x_i^t, the perception levels of the users other than i are denoted x_{-i}^t, and the perception levels of all mobile users are collected as x^t = (x_i^t, x_{-i}^t); characterized in that in each time period the service provider issues a sensing task to an edge node, the edge node distributes the task to users in the social network, each user decides on and executes the received task according to its own perception level and those of the other users, and the sensing data are uploaded to the provider SP, which scores each user's reputation according to data quality and participation willingness and updates the reputation record; and the method comprises the following steps:
1) Reputation evaluation stage:
the user's willingness to participate is determined by the social network effect; after receiving the data, the service provider SP updates the user's reputation, evaluates the historical reputation, and determines the utility functions of the users and of the service provider; the utility function of user i is defined in formula (1) and consists of a revenue term and a cost term: the first term is the monetary reward obtained by user i, determined by its perception level x_i^t as a linear function of x_i^t, where R^t denotes the total task reward in time period t, shared by all users, and the user's share is also weighted by the user's own reputation, so the higher the reputation of user i, the higher its reward; the second term is the cost of user i participating in the task, in which the user's unit sensing cost appears as a factor;
the utility function of the service provider SP is its revenue minus the total reward R^t paid to the users; the revenue of the SP is obtained from the users' perception levels and expressed by a function φ(t), as shown in formula (2), where λ is a system parameter; the function in formula (2) reflects the diminishing return of the service provider SP on the service contributed by user i, while the ln function reflects the provider's diminishing return on the number of mobile users; all locally participating users are assumed to be in a social network G = [g_ij]_{i,j∈N} with g_ij ∈ {0,1}: if user i is influenced by user j then g_ij = 1, otherwise g_ij = 0, and social relations are assumed reciprocal, g_ij = g_ji; according to the Erdős–Rényi (E-R) model of the social network, users are connected with probability μ, and the larger μ is, the more tightly the nodes are connected and the more users influence a given user i; the willingness of user i to participate in a sensing task is defined as shown in formula (3);
a participant performance indicator is introduced, i.e. the reputation value is influenced by the reward obtained, and the reputation feedback function of a mobile user is defined as shown in formula (4), which measures the extent to which user i completes the task compared with all selected participants, where R^t/N is the average reward each user receives from the total reward R^t; under the same reward, the larger the perception level and the participation willingness of user i, the larger the reputation feedback value, and the more inclined the service provider SP is to select user i;
the user's reputation is updated as shown in formula (5), in which the new reputation value of user i is obtained from its historical reputation value and the feedback value ref_i^t, and α is the factor that determines how much reputation a participant gains or loses during the update; the arctan function is monotonically increasing, so with positive feedback the user's reputation value increases, while with negative feedback it decreases faster; if the historical reputation value equals the preset value 0, the reputation value starts from 0.5;
2) Task issuing stage: the service provider selects the optimal users N = {1, 2, ..., N} from M = {1, 2, ..., M} according to user reputation and determines the maximum benefit of both the service provider and the users in the optimal user set; at a Nash equilibrium, neither the provider SP nor any user has an incentive to change its own strategy unilaterally, and the interaction between the service provider SP and the mobile users is described as a two-stage Stackelberg game with a single leader and multiple followers; in the leader sub-game, the service provider decides the total reward offered for the task in order to obtain more service, as shown in formula (6); in the follower sub-game, all users share the total reward, a non-cooperative game exists among the users, and each mobile user determines its perception level according to the reward factor and the cost factor, as shown in formula (7);
each user then determines its own optimal response to the strategy of the service provider SP, specifically as follows:
2-1) Follower sub-game: the total reward R^t determined by the service provider SP is distributed to the users according to each user's perception level and reputation weight, so the users compete for a larger share of the reward; this competition is described as a non-cooperative game in which the strategy of each participating user is its perception level x_i^t and the goal of each user is to maximize its own utility; when every user selects its optimal strategy, the non-cooperative game reaches a stable state, the Nash equilibrium (NE), defined for the non-cooperative game among users as follows:
2-1-1) given the other users' strategies x_{-i}^{t*}, the strategy x_i^{t*} is the optimal response of user i to x_{-i}^{t*};
2-1-2) when no user can increase its utility by unilaterally changing its perception level while the others keep x_{-i}^{t*}, a Nash equilibrium exists in the non-cooperative game and is denoted (x_i^{t*}, x_{-i}^{t*});
when the total reward R^t is given, a Nash equilibrium exists in the non-cooperative game; whether the NE exists depends on the user utility function being a concave function of the strategy, and by taking the first- and second-order derivatives of the user utility function it is shown to be strictly concave, as in formula (8);
2-2) Leader sub-game: in the Stackelberg game the leader, i.e. the service provider SP, offers the total reward R^t to the users, among whom a Nash equilibrium exists; the service provider SP then determines the value of R^t that maximizes its own utility, denoted R^{t(*)}; in the Stackelberg game, R^{t(*)} is the optimal strategy maximizing the utility of the provider SP, and when X^{t(*)} is the corresponding optimal perception-level profile, the Stackelberg equilibrium (R^{t(*)}, X^{t(*)}) exists; as in step 2-1), the second derivative of the utility function of the service provider SP is computed to show that it is strictly concave, which proves that the Stackelberg equilibrium exists;
3) Training stage, comprising the following steps:
3-1) Markov decision process definition: the user behaviour is modelled as a Markov decision process (MDP) within the standard reinforcement learning framework of agent-environment interaction; the MDP is described as a five-tuple M = <S, A, U, P, γ>, and the state, action and reward of user i in the t-th time slot are denoted s_i^t, a_i^t and u_i^t, with the state space, action space, state transition probability and reward defined as follows:
state space: the state of user i in each time slot consists of the perception levels of the other users in the previous time slot, the perceived unit cost and the task reward R^t, as shown in formula (9);
action space: in the t-th time slot, the action of user i is its perception level, i.e. a_i^t = x_i^t;
state transition probability: the dynamics of the environment are represented by the state transition probability of moving from s_i^t to s_i^{t+1} under action a_i^t;
reward: the reward follows the utility, i.e. the reward function of user i is the user utility function;
3-2) Perception strategy decision based on the PPO algorithm: a PPO-based deep reinforcement learning algorithm is adopted to optimize each user's sensing strategy; the goal of the PPO deep reinforcement learning algorithm is to find a set of optimal parameters {θ_i}, where θ_i represents the policy of user i; first, the policy is parameterized as π_{θ_i}(a_i^t | s_i^t), and the optimization objective of the policy gradient is as shown in formula (10); the state-value function and the action-value function are defined as in formula (11), in which the accumulated reward of user i from the t-th time slot onward is discounted by a factor γ ∈ [0, 1]: γ = 0 means the user cares only about the current utility and not the long-term utility, while γ = 1 means it cares about the cumulative utility from time slot t up to the final time slot T; the strategy is optimized with a policy-gradient method based on the actor-critic framework, in which the actor-critic model consists of two networks, an actor network π_{θ_i} with parameters θ_i and a critic network V_{ω_i} with parameters ω_i; the actor network takes the state s_i^t as input and outputs the perception level that maximizes the long-term accumulated return, and the policy gradient is computed according to the stochastic policy gradient theorem, as shown in formula (12);
in formula (12), the expectation is taken with respect to the old sampling policy, whose parameters are those used to collect the trajectories, and the advantage function of the action and state measures how much better action a_i^t is than the average action in state s_i^t; the proximal policy optimization (PPO) method clips (prunes) this policy gradient as shown in formula (13), where the probability ratio between the new policy and the old sampling policy appears and η(x) is a piecewise clipping function defined over an interval;
the loss function used to update the critic network V_{ω_i} is given in formula (14); the gradient with respect to ω_i is computed as in formula (15), and likewise the gradient with respect to the actor network parameters θ_i is computed as in formula (16); D is the number of samples used for the policy-gradient estimate at each training step, i.e. the mini-batch size, and the actor and critic networks are updated after the dynamic game of D time slots: θ_i is updated by mini-batch stochastic gradient ascent and ω_i by mini-batch stochastic gradient descent, where l_{i,1} and l_{i,2} are the learning rates of the actor network and the critic network, respectively; the training procedure is as follows:
first, the parameters ω_i and θ_i of the actor-critic networks are initialized; at the start of each training period the buffer D is cleared and the state s_i^1 is initialized; for each time slot t, the state s_i^t is fed into the policy network and an action a_i^t is sampled as the user's perception level; the service provider SP determines its own pricing policy and sends R^t to the mobile users, and each mobile user determines its perception level for time slot t + 1; finally, the agent observes the new state and stores the transition in the experience replay buffer D; when the buffer is full, the networks are updated: for each user i, a mini-batch of experience is drawn from D to compute the estimated gradient of the critic network and the estimated gradient of the actor network, and the parameters θ_i and ω_i are updated with stochastic gradient steps; after determining its own strategy, each user contributes its sensing data to the service provider, and the service provider evaluates the quality of the sensing data uploaded through the edge node and updates the user's reputation.
CN202210790320.6A 2022-07-06 2022-07-06 Reinforcement learning mobile crowdsourcing incentive method based on user reputation Pending CN115392337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210790320.6A CN115392337A (en) 2022-07-06 2022-07-06 Reinforced learning mobile crowdsourcing incentive method based on user reputation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210790320.6A CN115392337A (en) 2022-07-06 2022-07-06 Reinforced learning mobile crowdsourcing incentive method based on user reputation

Publications (1)

Publication Number Publication Date
CN115392337A true CN115392337A (en) 2022-11-25

Family

ID=84116151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210790320.6A Pending CN115392337A (en) 2022-07-06 2022-07-06 Reinforced learning mobile crowdsourcing incentive method based on user reputation

Country Status (1)

Country Link
CN (1) CN115392337A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116095721A (en) * 2023-04-07 2023-05-09 湖北工业大学 Mobile crowd-sourced network contract excitation method and system integrating perception communication
CN116095721B (en) * 2023-04-07 2023-06-27 湖北工业大学 Mobile crowd-sourced network contract excitation method and system integrating perception communication
CN116744289A (en) * 2023-06-02 2023-09-12 中国矿业大学 Intelligent position privacy protection method for 3D space mobile crowd sensing application
CN116744289B (en) * 2023-06-02 2024-02-09 中国矿业大学 Intelligent position privacy protection method for 3D space mobile crowd sensing application

Similar Documents

Publication Publication Date Title
CN115392337A (en) Reinforcement learning mobile crowdsourcing incentive method based on user reputation
CN107426621B (en) A kind of method and system showing any active ues image in mobile terminal direct broadcasting room
Nie et al. A socially-aware incentive mechanism for mobile crowdsensing service market
CN103647671A (en) Gur Game based crowd sensing network management method and system
CN113724096B (en) Group knowledge sharing method based on public evolution game model
CN114637911B (en) Method for recommending next interest point of attention fusion perception network
CN114124955B (en) Computing and unloading method based on multi-agent game
Zhang et al. Wireless service pricing competition under network effect, congestion effect, and bounded rationality
CN114912626A (en) Method for processing distributed data of federal learning mobile equipment based on Shapley value
CN113918829A (en) Content caching and recommending method based on federal learning in fog computing network
CN112291284B (en) Content pushing method and device and computer readable storage medium
CN110149161B (en) Multi-task cooperative spectrum sensing method based on Stackelberg game
Chen et al. An incentive mechanism for crowdsourcing systems with network effects
CN116761207B (en) User portrait construction method and system based on communication behaviors
Chen et al. Qoe-aware dynamic video rate adaptation
CN116073924B (en) Anti-interference channel allocation method and system based on Stackelberg game
CN116774584A (en) Unmanned aerial vehicle differentiated service track optimization method based on multi-agent deep reinforcement learning
CN116451800A (en) Multi-task federal edge learning excitation method and system based on deep reinforcement learning
CN114598655B (en) Reinforcement learning-based mobility load balancing method
Biczók et al. Incentivizing the global wireless village
CN114698125A (en) Method, device and system for optimizing computation offload of mobile edge computing network
CN111328107B (en) Multi-cloud heterogeneous mobile edge computing system architecture and energy optimization design method
CN117392483A (en) Album classification model training acceleration method, system and medium based on reinforcement learning
Back et al. Small Profits and Quick Returns: A Practical SocialWelfare Maximizing Incentive Mechanism for Deadline-Sensitive Tasks in Crowdsourcing
He et al. A leader–follower controlled Markov stopping game for delay tolerant and opportunistic resource sharing networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination