CN109496305A - Nash equilibrium strategy on continuous action space and social network public opinion evolution model - Google Patents
- Publication number: CN109496305A
- Application number: CN201880001570.9A
- Authority: CN (China)
- Prior art keywords: media, gossiper, opinion, agent, strategy
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — Physics
- G06 — Computing; calculating or counting
- G06Q — Information and communication technology [ICT] specially adapted for administrative, commercial, financial, managerial or supervisory purposes
- G06Q50/00 — ICT specially adapted for implementation of business processes of specific business sectors
- G06Q50/01 — Social networking
Abstract
The invention provides a Nash equilibrium strategy on a continuous action space and a social network public opinion evolution model, belonging to the field of reinforcement learning methods. The strategy of the invention comprises the following steps: initialize the parameters; randomly select an action x_i according to a normal distribution N(u_i, σ_i) with a certain exploration rate; execute the action and obtain a reward r_i from the environment; if the reward r_i received by agent i after executing action x_i is greater than the current cumulative average reward Q_i, then the learning rate of u_i is α_ub, otherwise the learning rate is α_us; update u_i, the variance σ_i and Q_i according to the selected learning rate, and finally update the cumulative average policy x̄_i; if the cumulative average policy x̄_i converges, output x̄_i as the final action of agent i. The beneficial effects of the invention are: in interaction with other agents, each agent can maximize its own interest and can ultimately learn a Nash equilibrium.
Description
Technical field
The present invention relates to a Nash equilibrium strategy, and more particularly to a Nash equilibrium strategy on a continuous action space. It further relates to a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space.
Background art

In a continuous action space environment, on the one hand, an agent has infinitely many actions to choose from, and traditional Q-table algorithms cannot store estimates of infinitely many returns; on the other hand, in a multi-agent environment, a continuous action space further increases the difficulty of the problem.

In the field of multi-agent reinforcement learning, the action space of an agent can be a discrete finite set or a continuous set. Because the essence of reinforcement learning is to find an optimum through continuous trial and error, while a continuous action space offers infinitely many action choices and a multi-agent environment increases the dimensionality of the action space, general reinforcement learning algorithms have difficulty learning the global optimum (or equilibrium).
At present most algorithms solve continuous problems using function approximation. Such algorithms fall into two classes: value approximation algorithms [1-5] and policy approximation algorithms [6-9]. Value approximation algorithms explore the action space and estimate the corresponding value function from the returns, while policy approximation algorithms define the policy as a probability distribution function over the continuous action space and learn the policy directly. The performance of these algorithms depends on the accuracy of the estimate of the value function or policy, and they are often inadequate for difficult problems such as nonlinear control. In addition, there is a class of sampling-based algorithms [10, 11] that maintain a discrete action set, select the optimal action in the set using a conventional discrete algorithm, and then resample the action set according to a new mechanism so as to gradually learn the optimum. These algorithms combine easily with conventional discrete algorithms, but their disadvantage is a longer convergence time. All of the above algorithms are designed to compute the optimal policy in a single-agent environment and cannot be applied directly to learning in a multi-agent environment.
In recent years much work has used agent-based simulation techniques to study public opinion in social networks [12-14]. Given different groups with different opinion distributions, the research asks whether the groups, in the course of mutual interaction, will eventually reach consensus, polarize, or remain fragmented [15]. The key to this problem is understanding the dynamics of opinion evolution, and thereby the underlying causes that drive opinions toward consensus [15]. For the opinion evolution problem in social networks, researchers have proposed a variety of multi-agent learning models [16-20] and studied the influence of factors such as information sharing and degree of communication on opinion evolution; in particular, [21-23] studied the influence of such factors. Works such as [14, 24-28] used evolutionary game theory to model how agent behaviors (such as defection and cooperation) evolve through interaction with peers. These works model the behavior of agents and assume that all agents are identical. In reality, however, individuals can play different roles in society (for example, leader or follower), which the above methods cannot model accurately. To this end, Quattrociocchi et al. [12] divided social groups into media and the public and modeled them separately, where the public's opinions are influenced by the media they follow and by other members of the public, and the media's opinions are influenced by outstanding members of the media. Later, Zhao et al. [29] proposed a leader-follower opinion model to explore the formation of public opinion. In both works, an agent adjusts its opinion by imitating a leader or a successful peer. Other imitation-based works include Local Majority [30], Conformity [31] and Imitating Neighbor [32]. However, in real environments, the strategies people adopt in making decisions are far more complex than simple imitation: people usually make decisions by continually interacting with an unknown environment and combining their own behavior with the knowledge they have acquired. Moreover, imitation-based strategies cannot guarantee that the algorithm learns the global optimum, because the quality of an agent's strategy depends on the strategy of the leader or the imitated agent, and the leader's strategy is not necessarily the best.
Summary of the invention
To solve the problems in the prior art, the present invention provides a Nash equilibrium strategy on a continuous action space, and further provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space.

The present invention includes the following steps:
(1) Set constants α_ub and α_us, where α_ub > α_us, and learning rates α_Q, α_σ ∈ (0, 1);
(2) Initialize the parameters, including the mean u_i of agent i's expected action, the cumulative average strategy x̄_i, the constant C, the variance σ_i and the cumulative average reward Q_i;
(3) Repeat the following steps until the cumulative average strategy x̄_i of agent i's sampled actions converges:
(3.1) With a certain exploration rate, randomly select an action x_i according to the normal distribution N(u_i, σ_i);
(3.2) Execute action x_i, then obtain the reward r_i from the environment;
(3.3) If the reward r_i received by agent i after executing action x_i is greater than the current cumulative average reward Q_i, then the learning rate of u_i is α_ub, otherwise the learning rate is α_us; update u_i according to the selected learning rate;
(3.4) Update the variance σ_i according to the learned u_i;
(3.5) If the reward r_i received by agent i after executing action x_i is greater than the current cumulative average reward Q_i, then the learning rate is α_ub, otherwise the learning rate is α_us; update Q_i according to the selected learning rate;
(3.6) Update x̄_i according to the constant C and the action x_i;
(4) Output the cumulative average strategy x̄_i as the final action of agent i.
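As a minimal illustrative sketch (not the patented implementation), the loop of steps (1)-(4) can be written as follows. The reward function, all constants, and the exact update rules for σ_i and x̄_i here are assumptions, since the corresponding formulas are not reproduced in this text:

```python
import random

def wols_cala(reward, a_ub=0.05, a_us=0.01, a_sigma=0.01,
              u=0.5, sigma=0.3, sigma_l=1e-3, n_steps=20000):
    """Sketch of steps (1)-(4): a WoLS-style single-agent learning loop.

    reward: callable mapping an action x to a scalar reward r.
    The sigma and running-average updates are illustrative assumptions,
    not the patent's exact formulas.
    """
    q = 0.0     # cumulative average reward Q_i
    xbar = u    # cumulative average strategy of sampled actions
    for t in range(1, n_steps + 1):
        x = random.gauss(u, max(sigma, sigma_l))   # step 3.1: sample action
        r = reward(x)                              # step 3.2: obtain reward
        alpha = a_ub if r > q else a_us            # steps 3.3/3.5: WoLS rate
        u += alpha * (r - q) * (x - u)             # step 3.3: move the mean
        sigma += a_sigma * abs(r - q) * ((x - u) ** 2 - sigma ** 2)  # step 3.4 (assumed form)
        sigma = min(max(sigma, sigma_l), 1.0)
        q += alpha * (r - q)                       # step 3.5: update Q_i
        xbar += (x - xbar) / t                     # step 3.6: running average
    return xbar

# Toy usage: a single-agent quadratic reward peaked at x = 0.7.
best = wols_cala(lambda x: -(x - 0.7) ** 2)
```

With this toy reward, the returned cumulative average strategy should settle near the maximizer 0.7.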
In a further refinement of the present invention, in step (3.3) and step (3.5), the update step size of Q is synchronized with the update step size of u: in a small neighbourhood of u_i, the mapping from u_i to Q_i can be linearized as Q_i = K u_i + C, where K is the slope of the linearization.
In a further refinement of the present invention, given a positive number σ_L and a positive number K, the Nash equilibrium strategies of two agents on the continuous action space can ultimately converge to a Nash equilibrium, where σ_L is the lower bound of the variance σ.
The present invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model includes two classes of agents: Gossiper agents, which simulate ordinary members of the public in a social network, and Media agents, which simulate media outlets or public figures whose aim is to attract the ordinary public. A Media agent uses the Nash equilibrium strategy on the continuous action space to compute the opinion that is optimal for its reward, updates its opinion and broadcasts it in the social network.
In a further refinement, the model includes the following steps:
S1: The opinion of each Gossiper and each Media is initialized to a random value on the action space [0, 1];
S2: In each interaction, each agent adjusts its own opinion according to the following strategies, until no agent changes its opinion any more;
S21: Any Gossiper agent randomly selects a neighbour in the Gossiper network according to a set probability, updates its opinion according to the BCM (bounded confidence model) strategy, and updates the Media it follows;
S22: A subset G′ of the Gossiper network G is randomly sampled, and the opinions of the Gossipers in G′ are broadcast to all Media;
S23: Any Media computes its optimal opinion using the Nash equilibrium strategy on the continuous action space, and broadcasts the updated opinion to the entire social network.
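The interaction loop S1-S2 can be sketched as follows. The network structure (fully connected), sampling sizes, and parameter values are illustrative assumptions, and the Media update here is a simple heuristic standing in for the WoLS-CALA strategy of S23:

```python
import random

def simulate(n_gossipers=30, n_media=2, d_g=0.3, a_g=0.3, rounds=200, seed=1):
    """Sketch of S1-S2: Gossiper opinions evolve by bounded confidence (BCM);
    Media opinions here simply track the mean of a sampled subset G'.
    (In the patent, Media instead learn via WoLS-CALA.)"""
    rng = random.Random(seed)
    gossipers = [rng.random() for _ in range(n_gossipers)]   # S1: init on [0, 1]
    media = [rng.random() for _ in range(n_media)]
    for _ in range(rounds):
        # S21: each Gossiper looks at a random Gossiper neighbour
        # (fully connected network assumed) and applies the BCM update.
        for i in range(n_gossipers):
            j = rng.randrange(n_gossipers)
            if i != j and abs(gossipers[j] - gossipers[i]) < d_g:
                gossipers[i] += a_g * (gossipers[j] - gossipers[i])
        # S22: sample a subset G' and broadcast its opinions to all Media.
        sample = rng.sample(gossipers, k=n_gossipers // 3)
        # S23 (heuristic stand-in): each Media moves toward the sample mean.
        target = sum(sample) / len(sample)
        media = [m + 0.1 * (target - m) for m in media]
    return gossipers, media

gossipers, media = simulate()
```

Because every update is a convex move toward another opinion, all opinions remain inside [0, 1] throughout the simulation.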
In a further refinement, in step S21, the operating method of a Gossiper agent is as follows:
A1: Opinion initialization: x_i^τ = x_i^{τ-1};
A2: Opinion update: when the difference between the agent's opinion and the selected agent's opinion is less than a given threshold, the agent's opinion is updated;
A3: The agent compares the difference between its own opinion and the opinions of the Media, and follows one Media selected according to a probability.

In a further refinement, in step A2, if the currently selected neighbour is Gossiper j and |x_j^τ - x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ - x_i^τ); if the currently selected neighbour is Media k and |y_k^τ - x_i^τ| < d_m, then x_i^τ ← x_i^τ + α_m(y_k^τ - x_i^τ), where d_g and d_m are the opinion thresholds set for the different types of neighbours, and α_g and α_m are the learning rates for the different types of neighbours.
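The A2 update rule can be written directly as a small function; the thresholds and learning rates are the d_g, d_m, α_g, α_m named above, with illustrative values:

```python
def bcm_update(x_i, x_neighbor, threshold, rate):
    """Bounded-confidence (BCM) update from step A2: move toward the
    neighbour's opinion only if it lies within the confidence threshold."""
    if abs(x_neighbor - x_i) < threshold:
        return x_i + rate * (x_neighbor - x_i)
    return x_i

# A Gossiper at 0.5 moves toward a neighbour at 0.6 (within d_g = 0.3),
# but ignores a Media at 0.95 (outside d_m = 0.3).
x = bcm_update(0.5, 0.6, threshold=0.3, rate=0.3)   # -> 0.53
x = bcm_update(x, 0.95, threshold=0.3, rate=0.5)    # unchanged
```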
In a further refinement, in step A3, Gossiper i follows Media k according to a probability P_ik, which is determined by the differences between Gossiper i's opinion and the opinions of the Media.

In a further refinement, in step S23, the current reward r_j of Media j is defined as the ratio of the number of Gossipers in G′ that choose to follow j to the total number of Gossipers in G′, where P_ij denotes the probability that Gossiper i follows Media j.
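Under the (assumed) convention that each sampled Gossiper follows exactly one Media, the reward definition in S23 can be sketched as:

```python
def media_reward(follow_choices, media_id):
    """Reward of a Media from step S23: the fraction of Gossipers in the
    sampled subset G' that currently follow this Media.
    follow_choices: one entry per Gossiper in G', giving the id of the
    Media that Gossiper follows (an assumed representation)."""
    return follow_choices.count(media_id) / len(follow_choices)

# Of 5 sampled Gossipers, 3 follow Media 0 and 2 follow Media 1.
r0 = media_reward([0, 0, 1, 0, 1], media_id=0)   # -> 0.6
```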
In a further refinement, the presence of a single Media accelerates the convergence of the Gossiper agents' opinions toward unity; in an environment with multiple competing Media, the dynamic change of each Gossiper agent's opinion is a weighted average of the influences of the Media.

Compared with the prior art, the beneficial effects of the present invention are: in a continuous action space environment, an agent can maximize its own interest while interacting with other agents, and can ultimately learn a Nash equilibrium.
Description of the drawings
Fig. 1 is a schematic diagram of two agents converging to a Nash equilibrium point for r = 0.7 > 2/3, a = 0.4, b = 0.6;
Fig. 2 is a schematic diagram of two agents converging to a Nash equilibrium point for r = 0.6 < 2/3, a = 0.4, b = 0.6;
Fig. 3 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a fully connected network with no Media;
Fig. 4 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a small-world network with no Media;
Fig. 5 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a fully connected network with one Media;
Fig. 6 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a small-world network with one Media;
Fig. 7 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a fully connected network with two competing Media;
Fig. 8 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a small-world network with two competing Media.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

The Nash equilibrium strategy on a continuous action space of the present invention extends the single-agent reinforcement learning algorithm CALA [7] (Continuous Action Learning Automaton). By introducing the WoLS (Win or Learn Slow) learning mechanism, the algorithm effectively handles the learning problem in multi-agent environments; the Nash equilibrium strategy of the present invention is therefore referred to as WoLS-CALA (Win or Learn Slow Continuous Action Learning Automaton). The present invention first describes CALA in detail.
The Continuous Action Learning Automaton (CALA) [7] is a policy-gradient reinforcement learning algorithm for learning problems on continuous action spaces, in which the agent's strategy is defined as the probability density function of a normal distribution N(u_t, σ_t) over the action space.

The policy update of a CALA agent is as follows: at time t, the agent selects an action x_t according to the normal distribution N(u_t, σ_t); it executes both x_t and u_t and obtains the corresponding rewards V(x_t) and V(u_t) from the environment, which means that the algorithm must execute two actions in each interaction with the environment; finally, it updates the mean and variance of the normal distribution N(u_t, σ_t) according to formulas (1) and (2), where α_u and α_σ are learning rates and K is a positive constant used to control the convergence of the algorithm. Specifically, the size of K is related to the number of learning steps and is typically set to the order of magnitude of 1/N, where N is the number of iterations, and σ_L is the lower bound of the variance σ. The algorithm keeps updating the mean and variance until u is constant and σ_t tends to σ_L. After convergence, the mean u points to an optimal solution of the problem. The size of σ in formula (2) determines the exploration ability of the CALA algorithm: the larger σ_t is, the more likely CALA is to find potentially better actions.
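The CALA update equations (1)-(2) are not reproduced in this text; the following sketch uses the update form reported in the CALA literature [7], and should be checked against the original before use:

```python
import random

def cala_step(u, sigma, reward, alpha_u=0.05, alpha_sigma=0.01,
              K=1e-3, sigma_l=1e-3):
    """One CALA update (form taken from the CALA literature, since the
    patent's formulas (1)-(2) are not reproduced in the text).
    reward: callable giving V(.) for an action. CALA evaluates both the
    sampled action x and the mean u on every step."""
    phi = max(sigma, sigma_l)
    x = random.gauss(u, phi)
    delta = (reward(x) - reward(u)) / phi            # scaled reward difference
    u_new = u + alpha_u * delta * (x - u) / phi
    sigma_new = (sigma + alpha_sigma * delta * (((x - u) / phi) ** 2 - 1)
                 - alpha_sigma * K * (sigma - sigma_l))
    return u_new, max(sigma_new, sigma_l)

# Repeatedly applying cala_step to V(x) = -(x - 0.7)**2 drives u toward 0.7
# while sigma shrinks toward its lower bound sigma_l.
u, sigma = 0.2, 0.3
for _ in range(20000):
    u, sigma = cala_step(u, sigma, lambda x: -(x - 0.7) ** 2)
```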
By definition, CALA is a policy-gradient learning algorithm. It has been proven theoretically that, when the reward function V(x) is sufficiently smooth, CALA can find a local optimum [7]. De Jong et al. [34] improved the reward function to extend CALA to multi-agent environments, and verified experimentally that their modified algorithm can converge to a Nash equilibrium. The WoLS-CALA proposed by the present invention introduces the "WoLS" mechanism to solve the multi-agent learning problem, and analyzes and proves theoretically that the algorithm can learn a Nash equilibrium on a continuous action space.

CALA requires the agent to obtain the rewards of the sampled action and the expected action simultaneously in each learning step, which is infeasible in most reinforcement learning environments: usually an agent can execute only one action per interaction with the environment. For this reason, the present invention extends CALA in two respects, Q-value function estimation and a variable learning rate, to obtain the WoLS-CALA algorithm.
1. Q-function estimation

In an independent multi-agent reinforcement learning environment, an agent selects one action at a time and then obtains a reward from the environment. Given the normal-distribution exploration scheme, a natural approach is to estimate the average reward of the expected action u with a Q value. Specifically, the expected reward Q_i of agent i's action u_i in formula (1) can be estimated by the incremental update in formula (3), where x_i^t is the sampled action at time t, r_i^t is the reward received by agent i when selecting action x_i^t, determined by the joint action of all agents at time t, and α_Q is agent i's learning rate for Q. The update in formula (3) is the standard reinforcement-learning method for estimating the value function of a single state; its essence is to use the running average of r_i to estimate Q_i. A further advantage is that Q_i can be updated one sample at a time, and the newly received reward always contributes a fixed fraction α_Q to the Q-value estimate.
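The incremental estimate described for formula (3) is the standard exponential recency-weighted average; a minimal sketch:

```python
def q_update(q, r, alpha_q=0.1):
    """Formula (3)-style incremental estimate: the new reward always
    contributes a fixed fraction alpha_q to the Q-value estimate."""
    return q + alpha_q * (r - q)

# Feeding a constant reward of 1.0 drives the estimate from 0 toward 1:
# after n updates the estimate equals 1 - (1 - alpha_q)**n.
q = 0.0
for _ in range(50):
    q = q_update(q, 1.0)
```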
According to formula (3), the update of u (formula (1)) and the update of σ (formula (2)) can be rewritten as formulas (4) and (5), where x_i^t is the sampled action at time t, r_i^t is the reward received by agent i when selecting action x_i^t, determined by the joint action of all agents at time t, and α_u^i and α_σ^i are agent i's learning rates for u_i and σ_i.

However, directly using the Q-function estimate in a multi-agent environment brings a new problem: the reward of an agent is affected by the other agents, and changes in the other agents' strategies make the environment non-stationary. The update in formula (4) cannot guarantee that u adapts to the dynamic changes of the environment. Consider a simple example. Suppose that at time t agent i has learned the currently optimal action u_i* and that Q_i is an accurate estimate of its value; by definition, at time t, the reward of an arbitrary x_i does not exceed that of u_i*. If the environment remains unchanged, this continues to hold. If, however, the environment changes so that u_i* is no longer the optimal action, then there exists some x_i whose reward exceeds Q_i; if the updates of formulas (4) and (5) are continued in this situation, u_i can move far away from x_i, whereas in theory u_i should stay close to x_i to guarantee accurate estimation. Because Q is a statistical estimate of r, Q updates more slowly than r changes; the stale relation then keeps holding over many samples during the update process, and u_i remains essentially unchanged near the old optimum, when in theory it should move in search of the new optimal action. The root cause of these problems is the non-stationarity of the multi-agent environment, with which traditional estimation methods (such as Q-learning) cannot cope effectively.
2. The WoLS rule and its analysis

In order to estimate the expected reward of u more accurately in a multi-agent environment, the present invention updates the expected action u with a variable learning rate. Formally, the learning rate of the expected action u_i is defined by formula (6), and the update of u_i is then given by formula (7). The WoLS rule can be interpreted intuitively as: if the reward V(x) of the agent's action x is greater than the reward V(u) of the expectation u, then it should learn fast, and otherwise slowly. WoLS is thus exactly the opposite of the WoLF (Win or Learn Fast) strategy [35]. The difference is that WoLF is designed to guarantee convergence, while the WoLS strategy of the present invention enables the algorithm to update u in the direction of increasing reward while guaranteeing a correct estimate of the expected reward of action u. By analyzing the intrinsic dynamics of the WoLS strategy, the following conclusion is obtained.
Theorem 1: On a continuous action space, the learning dynamics of the CALA algorithm with the WoLS rule can be approximated by a gradient ascent (GA) strategy.

Proof: By formula (4), x_t is the action selected by the agent at time t according to the normal distribution N(u_t, σ_t), and V(x_t) and V(u_t) are the rewards corresponding to actions x_t and u_t respectively. Define f(x) = E[V(x_t) | x_t = x] as the expected reward function of action x. Assuming α_u is infinitesimal, the dynamics of u_t in WoLS-CALA can be described by the ODE in formula (8), where N(u, σ_u) is the probability density function of the normal distribution (and dN(a, b) denotes the differential, with respect to a, of the normal density with mean a and variance b²). Substituting x = u + y, expanding f(x) in formula (8) as a Taylor series at y = 0, and simplifying yields formula (9). Note that in formula (9) the coefficient multiplying the gradient and σ² are always positive.

The update of the standard deviation σ can reuse the conclusion for the original CALA algorithm directly: given a sufficiently large positive number K, σ eventually converges to σ_L. Combining this with formula (9), the following conclusion is obtained: for a small positive number σ_L (for example 1/10000), after sufficient time the ODE for u_t can be approximated by formula (10), where the coefficient is a small positive constant and f′(u) is the gradient of f(u) at u. Formula (10) shows that u changes along the gradient direction of f(u), i.e. the direction of fastest increase of f(u); that is, the dynamics of u can be approximated by a gradient ascent strategy.

When there is only one agent, the dynamics of u finally converge to an optimum point, because when u = u* is an optimum point, the gradient f′(u*) = 0.
It can be seen from Theorem 1 that the learning dynamics of the expected action of a WoLS-CALA agent are similar to the gradient ascent strategy described above, i.e. their time derivatives all take the form of a positive coefficient times f′(u). If f(u) has multiple local optima, whether the algorithm finally converges to the global optimum depends on the algorithm's trade-off between exploration and exploitation [36], a dilemma that cannot be fully resolved in the field of reinforcement learning. To explore toward the global optimum, the usual approach is to take a large initial exploration rate σ (i.e. the standard deviation) and an especially small initial learning rate for σ, so as to guarantee that the algorithm takes a sufficient number of samples over the whole action space. Moreover, since the expected action u of the CALA algorithm with the WoLS rule can converge even when the standard deviation σ is not 0, the lower bound σ_L of the exploration rate σ can be set to a relatively large value to ensure sufficient exploration. In summary, with suitable parameter choices the algorithm can learn the global optimum.
Another problem is that a pure gradient ascent strategy may fail to converge in a multi-agent environment. The present invention therefore combines the PHC (Policy Hill Climbing) algorithm [35] and proposes an independent multi-agent reinforcement learning algorithm of Actor-Critic type, referred to as WoLS-CALA. The main idea of the Actor-Critic architecture is to learn policy evaluation and policy update in separate processes: the part that handles policy evaluation is called the Critic, and the part that handles the policy update is called the Actor. The specific learning process is as follows (Algorithm 1).

Algorithm 1: the learning strategy of WoLS-CALA agent i
For simplicity, Algorithm 1 uses two constants α_ub and α_us (α_ub > α_us) in place of the learning rate of u_i. If the return r_i received after agent i executes action x_i is greater than the current cumulative average return Q_i, the learning rate of u_i is α_ub ("winning"), otherwise it is α_us ("losing") (step 3.3). Because formulas (7) and (4) contain the denominator φ(σ_i^t), when the denominator is small even a tiny error can strongly affect the updates of u and σ; using two fixed step sizes in the experiments makes the update process of the algorithm easier to control and easier to implement. Note also that in step 3.5 the update step size of Q is synchronized with that of u: both are α_ub when r_i > Q_i, and both are α_us otherwise. Because α_ub and α_us are very small numbers, in a small neighbourhood of u_i the mapping from u_i to Q_i can be linearized as Q_i = K·u_i + C, where the slope K is the local derivative of Q_i with respect to u_i. This is done to estimate the expected return of u more accurately. Finally (step 4), the algorithm takes the convergence of the cumulative average strategy ū_i as the loop termination condition and outputs ū_i; this prevents the algorithm from failing to terminate in competitive environments, where u_i may exhibit periodic solutions. Note that ū_i and u_i carry different meanings: ū_i is the cumulative statistical average of the sampled actions of agent i, and in the multi-agent setting it finally converges to the Nash equilibrium strategy; u_i is the expected mean of the policy distribution of agent i, and in competitive environments it may oscillate periodically near the equilibrium point. A detailed explanation is given in Theorem 2 below.
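The update loop of Algorithm 1 can be sketched in Python. This is a minimal illustrative sketch only: the class name and all parameter names are assumptions, and the σ rule and the (r − Q)-scaled mean update are simplifications, since formulas (4) and (7) appear only as images in the source.

```python
import random

class WoLSCALA:
    """Minimal sketch of one WoLS-CALA agent; names and update rules are assumed."""
    def __init__(self, u=0.5, sigma=0.3, alpha_ub=0.05, alpha_us=0.01,
                 alpha_sigma=0.005, sigma_L=0.05):
        self.u, self.sigma = u, sigma          # mean and std-dev of the Gaussian policy
        self.Q = 0.0                           # cumulative average return
        self.u_bar, self.n = u, 1              # cumulative average of sampled actions
        self.alpha_ub, self.alpha_us = alpha_ub, alpha_us
        self.alpha_sigma = alpha_sigma         # small step for the variance update
        self.sigma_L = sigma_L                 # lower bound keeping exploration alive

    def act(self):
        # step 3.1: sample an action from N(u, max(sigma, sigma_L))
        return random.gauss(self.u, max(self.sigma, self.sigma_L))

    def update(self, x, r):
        # steps 3.3/3.5: "winning" step size alpha_ub when r beats Q, else alpha_us
        alpha = self.alpha_ub if r > self.Q else self.alpha_us
        phi = max(self.sigma, self.sigma_L)    # bounded denominator, cf. phi(sigma_i^t)
        delta = (x - self.u) / phi
        self.u += alpha * (r - self.Q) * delta # move mean toward better-than-average actions
        self.sigma = max(self.sigma + self.alpha_sigma * (delta * delta - 1.0),
                         self.sigma_L)         # step 3.4: adapt exploration, keep it >= sigma_L
        self.Q += alpha * (r - self.Q)         # step 3.5: Q uses the same step size as u
        self.n += 1
        self.u_bar += (x - self.u_bar) / self.n  # step 3.6 / step 4: output statistic u-bar
```

A typical use is to call `act`, evaluate the environment's return, and feed both back to `update`; the quantity reported on convergence is `u_bar`, not `u`, matching the termination condition of step 4.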
Because dynamic trajectories in higher-dimensional spaces may exhibit chaos, it is difficult to carry out a qualitative analysis of the dynamic behaviour of the algorithm with many agents. The dynamic analyses of related multi-agent algorithms in the field are essentially all based on two agents [35][37-39]. The main analysis here therefore also treats the case of two WoLS-CALA agents.
Theorem 2: Given a positive number σ_L and a sufficiently large positive number K, the strategies of two WoLS-CALA agents eventually converge to a Nash equilibrium.
Proof: Nash equilibria can be divided into two classes by the position of the equilibrium point: those on the boundary of the continuous action space (a bounded closed set) and those in its interior. Since a boundary equilibrium point is equivalent to an interior equilibrium point of a space of one lower dimension, this example concentrates on the second class. The dynamic behaviour of an ODE depends on the stability properties of its equilibrium points [40]; this example therefore first computes the equilibrium points of formula (10) and then analyzes their stability.
Let x_i^t be the action sampled at time t by agent i according to the normal distribution N(u_i, σ_i), and let Q_1 and Q_2 be the expected returns corresponding to the actions x_1 and x_2 respectively. If a point eq is an equilibrium point of equation (10), then the dynamics vanish at eq. According to nonlinear dynamics theory [40], the stability of the point eq is determined by the eigenvalues of the Jacobian matrix M of the dynamics at eq, whose off-diagonal entries (i ≠ j) are the cross partial derivatives of the expected returns with respect to the other agent's policy mean.
Furthermore, according to the definition of a Nash equilibrium, a Nash equilibrium point satisfies the properties in formula (12). Substituting formula (12) into M, the eigenvalues at a Nash equilibrium point fall into one of the following three cases:
(a) All eigenvalues of the matrix M have negative real parts. Such an equilibrium point is asymptotically stable, i.e., all trajectories near eq eventually converge to it.
(b) All eigenvalues of the matrix M have non-positive real parts and a pair of them is purely imaginary. Such an equilibrium point is stable, but the limit set of a nearby trajectory is a periodic solution. In addition, it is easy to prove that the time average along such a periodic trajectory converges to the Nash equilibrium. Since WoLS-CALA outputs the cumulative average ū, the algorithm can also handle this class of equilibrium points.
(c) The matrix M has an eigenvalue with positive real part, i.e., the equilibrium point is unstable. For such an equilibrium point, according to nonlinear dynamics theory [40], the trajectories around the unstable equilibrium point fall into two kinds: trajectories inside the stable manifold and all other trajectories. The stable manifold is the subspace spanned by the eigenvectors corresponding to the stable eigenvalues. Trajectories inside the stable manifold theoretically converge to this equilibrium point; however, owing to randomness and computational error, the probability that the algorithm stays inside this subspace is 0. All trajectories not belonging to the stable manifold move gradually away from this equilibrium point and finally converge to one of the other kinds of equilibrium points analyzed above, i.e., to a boundary equilibrium point or to an equilibrium point of type (a) or (b).
In addition, similarly to the single-agent environment, if there are multiple equilibrium points then, by the analysis given for Theorem 1, with a suitable exploration-exploitation rate (for example, σ_L sufficiently large, σ given a large initial value and a small learning rate), the algorithm converges to a Nash equilibrium point (the global optimum of each agent when the strategies of the other agents are held fixed). In conclusion, the present invention completes the proof that the algorithm converges to a Nash equilibrium.
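The point made in case (b) of the proof — that the cumulative average of an oscillating trajectory still converges to the equilibrium point — can be illustrated numerically. The circular orbit below is a toy trajectory, not the actual dynamics of formula (10); the equilibrium location and radius are arbitrary.

```python
import math

# toy periodic trajectory orbiting an equilibrium at (0.4, 0.6)
eq = (0.4, 0.6)
n = 100000                                  # 200 full periods of length 500
sx = sy = 0.0
for t in range(n):
    theta = 2.0 * math.pi * t / 500.0       # period-500 orbit of radius 0.1
    sx += eq[0] + 0.1 * math.cos(theta)
    sy += eq[1] + 0.1 * math.sin(theta)
avg = (sx / n, sy / n)                      # cumulative average, like the output u-bar
```

Although the trajectory itself never settles, its running average cancels the oscillation over each full period and recovers the equilibrium point, which is why the algorithm outputs ū rather than u.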
The present invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model contains two classes of agents: Gossiper agents, which simulate ordinary members of the public in a social network, and Media agents, which simulate media or public figures whose goal is to attract ordinary members of the public. The social network public opinion evolution model of the invention is therefore also called the Gossiper-Media model. The Media agents use the Nash equilibrium strategy on the continuous action space to compute the idea that maximizes their return, update their idea, and broadcast it in the social network. The present invention applies the WoLS-CALA algorithm to the study of opinion evolution in real social networks: by modeling the media in the network with WoLS-CALA, it explores the influence that competing media exert on public opinion.
The details are as follows:
1. Gossiper-Media model
The present invention proposes a multi-agent reinforcement learning framework, the Gossiper-Media model, to study the evolution of group opinion. The Gossiper-Media model contains two classes of agents, Gossiper agents and Media agents. Gossiper agents simulate ordinary members of the public in a real network, whose ideas (opinions) are influenced both by the Media and by other Gossipers; Media agents simulate the ideas of media or public figures in a social network whose goal is to attract the masses, and such agents actively choose their own ideas so as to maximize their own followers. Consider a network with N agents, where the number of Gossipers is |G| and the number of Media is |M| (N = G ∪ M). Full connectivity is assumed between Gossipers and Media, i.e., each Gossiper can select any Media to interact with, with equal probability. Full connectivity among the Gossipers themselves is not assumed: each Gossiper can only interact with its own neighbours, and the network among the Gossipers is determined by their social relations. In particular, in the simulation experiments below this example defines two kinds of Gossiper networks: the fully connected network and the small-world network. Denote the ideas of Gossiper i and Media j by x_i and y_j respectively. The interaction of the agents in the model follows Algorithm 2.
Algorithm 2: the learning model of ideas in the Gossiper-Media network
First, the idea of each Gossiper and Media is initialized to a random value in the action space [0,1] (step 1). Then, in each interaction, each agent adjusts its own idea according to its strategy until the algorithm converges (no agent changes its idea any more). Each Gossiper agent first selects an interaction partner: with probability ξ a random Gossiper among its neighbours, or with probability 1-ξ a random Media (step 2.1). The Gossiper then updates its idea according to Algorithm 3 and, based on the differences between its own idea and those of the Media, chooses to follow the Media whose idea is closest to its own. It is assumed that the Media can obtain the ideas of a random sample of Gossipers, denoted G', which is broadcast to all Media (step 2.2). Each Media then plays the game against the other Media using the WoLS-CALA algorithm to compute the idea that maximizes its own followers, and broadcasts its updated idea to the whole network (step 2.3). In principle each Media could sample on its own, so that their sets G' differ; this has little influence on the learning of the WoLS-CALA algorithm below, since the idea distribution of G' is theoretically identical to that of G. This environmental assumption of the invention is made mainly for simplicity, and it also reduces the possible uncertainty caused by random sampling.
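The interaction loop of Algorithm 2 can be sketched as follows. This is an illustrative sketch only: the Gossipers are fully connected for simplicity, and the media rule is a placeholder (drift toward the sampled mean) standing in for the WoLS-CALA game of step 2.3; all names and defaults are assumptions.

```python
import random

def bcm_step(x, x_other, d, alpha):
    # bounded-confidence update: move toward the partner only within threshold d
    return x + alpha * (x_other - x) if abs(x_other - x) <= d else x

def simulate(n_gossiper=50, n_media=2, rounds=200, xi=0.5,
             d_g=0.1, d_m=0.1, alpha=0.5, seed=0):
    rng = random.Random(seed)
    g = [rng.random() for _ in range(n_gossiper)]   # step 1: random initial ideas
    m = [0.5 for _ in range(n_media)]
    for _ in range(rounds):
        for i in range(n_gossiper):
            if rng.random() < xi:                   # step 2.1: talk to a random gossiper
                j = rng.randrange(n_gossiper)
                g[i] = bcm_step(g[i], g[j], d_g, alpha)
            else:                                   # ... or to a random media
                k = rng.randrange(n_media)
                g[i] = bcm_step(g[i], m[k], d_m, alpha)
        sample = rng.sample(g, int(0.8 * n_gossiper))  # step 2.2: shared sample G'
        # step 2.3 placeholder: each media drifts toward the sample mean
        # (stands in for the WoLS-CALA best-response computation)
        mean = sum(sample) / len(sample)
        m = [y + 0.1 * (mean - y) for y in m]
    return g, m
```

All ideas stay in [0,1] because every update is a convex combination of points in [0,1], matching the bounded action space of the model.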
1.1 Gossiper strategy
The strategy of each Gossiper consists of two parts: 1) how to update its own idea; 2) how to select the Media to follow. The details are as follows (Algorithm 3):
Algorithm 3: the strategy of Gossiper i in round τ
For Gossiper i, its idea is first initialized: x_i^τ = x_i^{τ-1} (step 1). Its idea is then updated according to the BCM (bounded confidence model) strategy [12,33] (step 2). The BCM is a typical model for describing group opinion; an agent based on the BCM is influenced only by agents whose ideas are similar to its own. In Algorithm 3, the Gossiper updates its idea only when the difference from the idea of the selected agent is less than the threshold d_g (or d_m), where d_g and d_m correspond to the selected agent being a Gossiper or a Media respectively. The size of the threshold d_g (or d_m) represents the degree to which a Gossiper accepts new ideas; intuitively, the larger d is, the more easily the Gossiper is influenced by other agents [41-43]. The Gossiper then compares the differences between its own idea and those of the Media, and selects one Media to follow probabilistically (step 3). Here P_ij^τ denotes the probability that Gossiper i selects and follows Media j at time τ; it satisfies the following properties:
(i) when |x_i - y_j| > d_m, P_ij = 0;
(ii) P_ij > 0 if and only if the idea y_j of Media j satisfies |x_i - y_j| ≤ d_m;
(iii) P_ij decreases as the distance |x_i - y_j| between the ideas x_i and y_j increases.
Note that if |x_i - y_j| > d_m holds for every Media j, then all P_ij are 0, which means it is possible that a Gossiper follows no Media at all. The parameter δ in the equation for λ_ij is a small positive number used to prevent the denominator of the fraction from being 0.
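Properties (i)-(iii) can be realized, for example, by inverse-distance weights. The exact λ_ij of the patent is not reproduced here (its formula appears only as an image in the source), so the weighting below is an assumed stand-in that satisfies the three stated properties and uses δ to keep the denominator positive.

```python
def follow_probs(x_i, media_ideas, d_m=0.1, delta=1e-6):
    """Follow probabilities P_ij for one gossiper; the inverse-distance
    lambda weighting is an assumption satisfying properties (i)-(iii)."""
    lam = [1.0 / (abs(x_i - y) + delta) if abs(x_i - y) <= d_m else 0.0
           for y in media_ideas]
    total = sum(lam)
    if total == 0.0:
        # no media within the confidence bound d_m: the gossiper follows none
        return [0.0] * len(media_ideas)
    return [l / total for l in lam]      # normalize to a probability vector
```

For instance, a media outside the bound gets probability exactly 0 (property (i)), and of two media inside the bound the closer one gets the larger probability (property (iii)).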
1.2 Media strategy
Given the sampled idea information of a group of Gossipers, each Media can appropriately adjust its own idea through learning to cater to the preferences of the Gossipers, and thereby attract more Gossipers to follow it. In a multi-agent system with several Media, a Nash equilibrium is the stable state finally reached through the mutual competition of the agents; in this state, no agent can obtain a higher return by unilaterally changing its own strategy. Since the action space of a Media is continuous (an idea is defined as any point in the interval [0,1]), the behaviour of a Media is modeled here with the WoLS-CALA algorithm; Algorithm 4 is the Media strategy built on WoLS-CALA.
Algorithm 4: the strategy of Media j in round τ
The current return r_j of Media j is defined as the ratio of the number of Gossipers in G' that choose to follow j to the total number of Gossipers in G'. Here λ_ij is defined as in Algorithm 3, and P_ij denotes the probability that Gossiper i follows Media j.
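Under the same assumed inverse-distance weighting as above (a stand-in for the patent's λ_ij, whose formula appears only as an image), the return r_j — the expected share of G' following Media j — can be sketched as:

```python
def media_reward(j, media_ideas, sample, d_m=0.1, delta=1e-6):
    """Expected fraction of the sampled gossipers G' that follow media j.
    The inverse-distance lambda weighting is an assumed stand-in."""
    total = 0.0
    for x in sample:
        lam = [1.0 / (abs(x - y) + delta) if abs(x - y) <= d_m else 0.0
               for y in media_ideas]
        s = sum(lam)
        if s > 0.0:
            total += lam[j] / s   # P_ij: probability that this gossiper follows j
    return total / len(sample)    # r_j: expected follower share within G'
```

A gossiper within the bound of only one media contributes its full weight to that media; a gossiper within the bound of no media contributes nothing to any r_j.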
2. Dynamic analysis of group opinion
Let {y_j}_{j∈M}, y_j ∈ (0,1), be the ideas of the Media. Assume the Gossiper network is infinitely large; the distribution of Gossiper ideas can then be represented by a continuous density function. Let p(x,t) denote the probability density function of the Gossiper idea distribution at time t. The evolution of Gossiper opinion can then be expressed through the partial derivative of p(x,t) with respect to time. This example first considers the case where there is only one Media.
Theorem 3: In a Gossiper-Media network containing only one Media, the evolution of the Gossiper idea distribution obeys the following formula,
where
I_1 = {x : |x - y| < (1 - α_m) d_m}, I_2 = {x : (1 - α_m) d_m ≤ |x - y| ≤ d_m}.
Proof: based on the mean-field (MF) approximation theory [44], the partial derivative with respect to t of the probability distribution of the ideas of BCM-based Gossipers can be expressed as follows [12],
where W_{x+y→x} denotes the probability that a Gossiper with idea x+y changes its idea to x, and W_{x+y→x} p(x+y) dy denotes the fraction of agents whose ideas transfer from the interval (x+y, x+y+dy) to x during the time interval (t, t+dt). Similarly, W_{x→x+y} denotes the probability that an agent with idea x changes its idea to x+y, and W_{x→x+y} p(x) dy denotes the fraction of Gossipers with idea x that transfer to the interval (x+y, x+y+dy).
According to the definition in Algorithm 3, a Gossiper is influenced by the ideas of other Gossipers with probability ξ, or makes its decision under the influence of the Media's ideas with probability 1-ξ. Splitting W_{x+y→x} and W_{x→x+y} into the part influenced by other Gossipers and the part influenced by the Media's ideas, denoted w^[g] and w^[m] respectively, W_{x→x+y} and W_{x+y→x} can be expressed as,
Substituting formula (18) into formula (17) gives the following.
Define
where Ψ_g(x,t) denotes the rate of change of the probability density function p(x,t) of Gossiper ideas caused by the influence of other Gossipers. Weisbuch G et al. [45] proved that Ψ_g(x,t) obeys the following formula,
where ∂²p/∂x² is the second-order partial derivative of p with respect to x, α_g is a real number between 0 and 0.5, and d_g is the threshold of the Gossipers.
Ψ_m(x,t) denotes the rate of change of the idea density function p(x,t) caused by the influence of the Media. Suppose the idea of Media j is u_j (u_j = x + d_j); the idea distribution of the Media can then be represented by the Dirac delta function q(x) = δ(x - u_j). The Dirac delta function δ(x) [46] is commonly used to model an infinitely tall, narrow spike (an impulse) and similar abstractions such as a point charge, a point mass or an electron, and is defined as follows.
The transfer rate from x+y to x can then be expressed as in formula (21), where δ(x - [(x+y) + α_m((x+z) - (x+y))]) represents the event that the idea x+y is influenced by the idea x+z and transfers to x, and q(x+z) is the density of the Media at the idea x+z. Similarly, w^[m]_{x→x+y} can be expressed as,
Combining formulas (21)-(22), computing and rearranging gives formula (23),
where I_1 = {x : |x - y| < (1 - α_m) d_m} and I_2 = {x : (1 - α_m) d_m ≤ |x - y| ≤ d_m}.
Combining this with formula (20) completes the proof.
From formula (14) it can be seen that the rate of change of p(x,t) is a weighted average of Ψ_g(x,t) and Ψ_m(x,t): the former represents the part of the opinion change driven by the Gossiper network, and the latter the part driven by the Media. The Gossiper-only term Ψ_g(x,t) was studied and analyzed in the work of Weisbuch G [45]. An important property obtained there is that, starting from any distribution, the points of local optimum of the density are gradually reinforced, which shows that in a pure Gossiper network the evolution of opinion always tends gradually toward consensus. In addition, Theorem 3 shows that both Ψ_g(x,t) and Ψ_m(x,t) are independent of the specific Gossiper network, which shows that when the network is infinite, the evolution of opinion is not affected by the network structure.
The second part of equation (14), Ψ_m(x,t) (formula (23)), is analyzed next. Assuming y is constant, analysis of (23) yields formula (24). Intuitively, formula (24) shows that the viewpoints of the Gossipers whose ideas are similar to the idea of the Media all converge to this Media, from which the following conclusion follows.
Corollary 1: The presence of one Media can accelerate the unification of Gossiper opinion.
Next this example considers the case of multiple Media. Define P_j(x) as the probability that a Gossiper whose idea is x is influenced by Media j. In an environment where the Gossipers face multiple competing Media, the dynamic change of ideas can then be expressed as a weighted average of the influences of the individual Media, and the following conclusion is obtained.
Corollary 2: The dynamic change of the distribution function of Gossiper ideas obeys the following formula,
where Ψ_g(x,t) and Ψ_m(x,t) are defined by formulas (20) and (23) respectively.
3. Simulation experiments and analysis
First it is verified that the WoLS-CALA algorithm can learn a Nash equilibrium. Then the experimental simulation of the Gossiper-Media model is presented to verify the preceding theoretical analysis.
3.1 Performance test of the WoLS-CALA algorithm
This example considers a simplified version of the Gossiper-Media model to test whether the WoLS-CALA algorithm can learn a Nash equilibrium strategy. Specifically, the problem of two Media competing for followers is modeled as the following objective optimization problem,
max (f_1(x,y), f_2(x,y))
s.t. x, y ∈ [0,1], (26)
(s.t. denotes the constraint conditions and is the standard notation for optimization problems), where r ∈ [0,1], and a, b ∈ [0,1] with |a - b| ≥ 0.2 are the ideas of the Gossipers.
Here the functions f_1(x,y) and f_2(x,y) simulate r in Algorithm 4, representing the returns of Media 1 and Media 2 under the joint action ⟨x, y⟩. This example uses two WoLS-CALA agents that control x and y respectively through independent learning, each maximizing its own reward function f_1(x,y) or f_2(x,y). In this model, according to the different forms of the Nash equilibrium, the ideas of the Gossipers can be divided into two classes:
(i) when r > 2/3 the equilibrium point is (a, a), and when r < 1/3 the equilibrium point is (b, b);
(ii) when 1/3 ≤ r ≤ 2/3 the equilibrium point is any point in the set {|x - a| < 0.1 ∧ |y - b| < 0.1} or {|x - b| < 0.1 ∧ |y - a| < 0.1}.
In the specific simulation experiments, this example takes one point from each of the two types, namely r = 0.7 > 2/3 and r = 0.6 < 2/3, and then observes whether the algorithm can learn the Nash equilibrium as expected when the idea distributions of the Gossipers differ. Table 1 gives the parameter settings of WoLS-CALA.
Table 1: parameter settings
Figures 1 and 2 show the simulation results of the two experiments. It is evident that in both experiments the Media agents converge to the Nash equilibrium after about 3000 learning steps: to ⟨0.4, 0.57⟩ when r = 0.6, and to ⟨0.4, 0.4⟩ when r = 0.7. As shown in Fig. 1, when r = 0.7 > 2/3, a = 0.4, b = 0.6, the two agents converge to the Nash equilibrium point (0.4, 0.4); as shown in Fig. 2, when r = 0.6 < 2/3, a = 0.4, b = 0.6, agent 1 converges to x = 0.4 and agent 2 converges to y = 0.57.
3.2 Experimental simulation of the Gossiper-Media model
This subsection presents the simulation results of the Gossiper-Media model. Experimental environments with 200 Gossipers and different numbers of Media are considered: (i) no Media; (ii) only one Media; (iii) two competing Media. For each environment, this example considers two representative Gossiper networks, the fully connected network and the small-world network [47]. Through these comparative experiments, this example explores the influence of the Media on the evolution of Gossiper opinion.
For fairness, each experimental environment uses the same parameter settings. The three environments use the same network and the same initial ideas of the Gossipers and Media. Here, the small-world network is generated at random by the Watts-Strogatz construction method [47] with rewiring probability p = 0.2. The initial idea of each Gossiper is sampled from the uniform distribution on the interval [0,1]; the initial idea of each Media is 0.5. Since too large a threshold would interfere with the observation of the experiments, the Gossiper-Media threshold d_m and the Gossiper-Gossiper threshold d_g are both set to a small positive number, 0.1. The learning rates α_g and α_m are set to 0.5. The set G' is sampled at random from G with |G'| = 80% |G|.
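The small-world network used here follows the Watts-Strogatz construction [47]: n nodes on a ring, each tied to its k nearest neighbours, with each ring edge rewired with probability p. A plain-Python sketch is below; the source does not state k, so k = 4 is chosen purely for illustration.

```python
import random

def watts_strogatz(n, k, p, seed=0):
    """Watts-Strogatz small-world graph as an adjacency-set dict (sketch)."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):                          # ring lattice: k nearest neighbours
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            adj[i].add(j); adj[j].add(i)
    for i in range(n):                          # rewire each ring edge with prob. p
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            if rng.random() < p:
                candidates = [m for m in range(n)
                              if m != i and m not in adj[i]]
                if candidates:                  # replace (i, j) by (i, new)
                    new = rng.choice(candidates)
                    adj[i].discard(j); adj[j].discard(i)
                    adj[i].add(new); adj[new].add(i)
    return adj

g = watts_strogatz(200, 4, 0.2)                 # 200 gossipers, as in the experiments
```

Rewiring swaps one endpoint of an edge without creating self-loops or duplicate edges, so the total number of edges (n·k/2) is preserved.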
Under each environment there are two kinds of Gossiper networks: the fully connected network and the small-world network. Figures 3-4 show the opinion evolution of the two networks without Media; Figures 5-6 show the opinion evolution with one Media; Figures 7-8 show the opinion evolution with two competing Media. From these figures it can first be seen that, under each of the three Media environments, the number of final convergence points is the same for the different Gossiper networks: 5 in the zero-Media environment, 4 in the one-Media environment, and 3 in the two-Media environment. This phenomenon is consistent with the conclusions of Theorem 3 and Corollary 2: the opinion dynamics of the Gossipers is independent of the topology of the Gossiper network, since the opinion dynamics under different networks can be modeled by the same formula.
Second, it can be observed from Figures 3-6 that, with one Media, the number of final convergence points of Gossiper opinion in both networks is reduced from 5 to 4. This shows that the presence of a Media can accelerate the unification of Gossiper opinion, in accordance with the conclusion of Corollary 1. Meanwhile, Figures 5-8 show that when the number of Media increases from 1 to 2, the number of final convergence points of Gossiper opinion in both networks is further reduced from 4 to 3, which shows that competing Media can further accelerate the unification of Gossiper opinion.
In addition, the experimental results also verify the performance of the WoLS-CALA algorithm. In Figures 5 and 6, the idea of the Media agent always stays around the idea held by the largest number of Gossipers (N_max = 69 in the fully connected network, N_max = 68 in the small-world network). This phenomenon matches the design expectation that a WoLS-CALA agent can learn the global optimum. In Figures 7 and 8 it can be seen that, with two Media, the idea of one Media stays around the idea held by the largest number of Gossipers (N_max = 89 in both networks), while the other Media stays around the idea held by the second-largest number of Gossipers (N'_max = 70 in the fully connected network, N'_max = 66 in the small-world network). This also matches the expectation of Theorem 2 that two WoLS-CALA agents eventually converge to a Nash equilibrium. In Figures 3-8 the ideas of the Media always oscillate slightly around the Gossiper ideas, because in the Gossiper-Media model the optimal strategy of a Media is not unique (every point within d_m of the Gossiper idea is an optimum for the Media).
4. Summary
The invention proposes WoLS-CALA, an independently learning multi-agent reinforcement learning algorithm for continuous action spaces, and demonstrates both by theoretical proof and by experimental verification that the algorithm can learn a Nash equilibrium. The algorithm is then applied to the study of opinion evolution in a network environment. The individuals in the social network are divided into two classes, Gossiper and Media, which are modeled separately: the Gossiper class represents ordinary members of the public, while the Media class, modeled with the WoLS-CALA algorithm, represents individuals such as social media whose goal is to attract public attention. By modeling the two kinds of agents separately, the present invention explores the influence that the competition of different numbers of Media exerts on Gossiper opinion. Both theory and experiment show that the competition of the Media accelerates the convergence of opinion.
The specific embodiments described above are preferred embodiments of the invention and do not limit its scope of practice. The scope of the present invention includes but is not limited to these embodiments, and all equivalent changes made according to the present invention fall within the scope of the present invention.
The references corresponding to the labels cited in the present invention are as follows:
[1] Pazis J, Lagoudakis M G. Binary Action Search for Learning Continuous-action Control Policies[C]. In Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009: 793–800.
[2]Pazis J,Lagoudakis M G.Reinforcement learning in multidimensional
continuous action spaces[C].In IEEE Symposiumon Adaptive Dynamic Programming&
Reinforcement Learning,2011:97–104.
[3]Sutton R S,Maei H R,Precup D,et al.Fast Gradient-descent Methods
for Temporal-difference Learning with Linear Function Approximation[C].In
Proceedings of the 26th Annual International Conference on Machine Learning,
2009:993–1000.
[4]Pazis J,Parr R.Generalized Value Functions for Large Action Sets
[C].In International Conference on Machine Learning,ICML 2011,Bellevue,
Washington,USA,2011:1185–1192.
[5]Lillicrap T P,Hunt J J,Pritzel A,et al.Continuous control with
deep reinforcement learning[J].Computer Science,2015,8(6):A187.
[6]KONDA V R.Actor-critic algorithms[J].SIAM Journal on Control and
Optimization,2003,42(4).
[7]Thathachar M A L,Sastry P S.Networks of Learning Automata:
Techniques for Online Stochastic Optimization[J].Kluwer Academic Publishers,
2004.
[8]Peters J,Schaal S.2008Special Issue:Reinforcement Learning of
Motor Skills with Policy Gradients[J].Neural Netw.,2008,21(4).
[9]van Hasselt H.Reinforcement Learning in Continuous State and
Action Spaces[M].In Reinforcement Learning:State-of-the-Art.Berlin,
Heidelberg:Springer Berlin Heidelberg,2012:207–251.
[10]Sallans B,Hinton G E.Reinforcement Learning with Factored States
and Actions [J].J.Mach.Learn.Res.,2004,5:1063–1088.
[11]Lazaric A,Restelli M,Bonarini A.Reinforcement Learning in
Continuous Action Spaces through Sequential Monte Carlo Methods[C].In
Conference on Neural Information Processing Systems,Vancouver,British
Columbia,Canada,2007:833–840.
[12]Quattrociocchi W,Caldarelli G,Scala A.Opinion dynamics on
interacting networks:media competition and social influence[J].Scientific
Reports,2014,4(21):4938–4938.
[13]Yang H X,Huang L.Opinion percolation in structured population[J]
.Computer Physics Communications,2015,192(2):124–129.
[14]Chao Y,Tan G,Lv H,et al.Modelling Adaptive Learning Behaviours
for Consensus Formation in Human Societies[J].Scientific Reports,2016,6:
27626.
[15]De Vylder B.The evolution of conventions in multi-agent systems
[J].Unpublished doctoral dissertation,Vrije Universiteit Brussel,Brussels,
2007.
[16]Holley R A,Liggett T M.Ergodic Theorems for Weakly Interacting
Infinite Systems and the Voter Model[J].Annals of Probability,1975,3(4):643–
663.
[17] Nowak A, Szamrej J, Latané B. From private attitude to public opinion: A dynamic theory of social impact[J]. Psychological Review, 1990, 97(3): 362–376.
[18]Tsang A,Larson K.Opinion dynamics of skeptical agents[C].In
Proceedings of the 2014international conference on Autonomous agents and
multi-agent systems,2014:277–284.
[19]Ghaderi J,Srikant R.Opinion dynamics in social networks with
stubborn agents:Equilibrium and convergence rate[J].Automatica,2014,50(12):
3209–3215.
[20]Kimura M,Saito K,Ohara K,et al.Learning to Predict Opinion Share
in Social Networks.[C].In Twenty-Fourth AAAI Conference on Artificial
Intelligence,AAAI 2010,Atlanta,Georgia,Usa,July,2010.
[21]Liakos P,Papakonstantinopoulou K.On the Impact of Social Cost in
Opinion Dynamics [C].In Tenth International AAAI Conference on Web and Social
Media ICWSM,2016.
[22]Bond R M,Fariss C J,Jones J J,et al.A 61-million-person
experiment in social influence and political mobilization[J].Nature,2012,489
(7415):295–8.
[23]Szolnoki A,Perc M.Information sharing promotes prosocial
behaviour[J].New Journal of Physics,2013,15(15):1–5.
[24]Hofbauer J,Sigmund K.Evolutionary games and population dynamics
[M].Cambridge;New York,NY:Cambridge University Press,1998.
[25]Tuyls K,Nowe A,Lenaerts T,et al.An Evolutionary Game Theoretic
Perspective on Learning in Multi-Agent Systems[J].Synthese,2004,139(2):297–
330.
[26] Szabó G, Fáth G. Evolutionary games on graphs[J]. Physics Reports, 2007, 446(4-6): 97–216.
[27]Han T A,Santos F C.The role of intention recognition in the
evolution of cooperative behavior[C].In International Joint Conference on
Artificial Intelligence,2011:1684–1689.
[28]Santos F P,Santos F C,Pacheco J M.Social Norms of Cooperation in
Small-Scale Societies[J].PLoS computational biology,2016,12(1):e1004709.
[29]Zhao Y,Zhang L,Tang M,et al.Bounded confidence opinion dynamics
with opinion leaders and environmental noises[J].Computers and Operations
Research,2016,74(C):205–213.
[30]Pujol J M,Delgado J,Sang,et al.The role of clustering on the
emergence of efficient social conventions[C].In International Joint
Conference on Artificial Intelligence,2005:965–970.
[31]Nori N,Bollegala D,Ishizuka M.Interest Prediction on Multinomial,
Time-Evolving Social Graph.[C].In IJCAI 2011,Proceedings of the International
Joint Conference on Artificial Intelligence,Barcelona,Catalonia,Spain,July,
2011:2507–2512.
[32]Fang H.Trust modeling for opinion evaluation by coping with
subjectivity and dishonesty[C].In International Joint Conference on
Artificial Intelligence,2013:3211–3212.
[33]Deffuant G,Neau D,Amblard F,et al.Mixing beliefs among
interacting agents[J].Advances in Complex Systems,2011,3(1n04):87–98.
[34]De Jong S,Tuyls K,Verbeeck K.Artificial agents learning human
fairness[C].In International Joint Conference on Autonomous Agents and
Multiagent Systems,2008:863–870.
[35] Bowling M, Veloso M. Multiagent learning using a variable learning rate[J]. Artificial Intelligence, 2002, 136(2): 215–250.
[36]Sutton R S,Barto A G.Reinforcement learning:an introduction[M]
.Cambridge,Mass:MIT Press,1998.
[37]Abdallah S,Lesser V.A Multiagent Reinforcement Learning Algorithm
with Non-linear Dynamics[J].J.Artif.Int.Res.,2008,33(1):521–549.
[38]Singh S P,Kearns M J,Mansour Y.Nash Convergence of Gradient
Dynamics in General-Sum Games[J],2000:541–548.
[39]Zhang C,Lesser V R.Multi-agent learning with policy prediction
[J],2010:927–934.
[40]Shilnikov L P,Shilnikov A L,Turaev D,et al.Methods of qualitative
theory in nonlinear dynamics/[M].World Scientific,1998.
[41]Dittmer J C.Consensus formation under bounded confidence[J]
.Nonlinear Analysis Theory Methods and Applications,2001,47(7):4615–4621.
[42] Lorenz J. Continuous opinion dynamics under bounded confidence: A survey[J]. International Journal of Modern Physics C, 2007, 18(12): 1819–1838.
[43]Krawczyk M J,Malarz K,Korff R,et al.Communication and trust in
the bounded confidence model[J].Computational Collective
Intelligence.Technologies and Applications,2010,6421:90–99.
[44]Lasry J M,Lions P L.Mean field games[J].Japanese Journal of
Mathematics,2007,2(1):229–260.
[45] Weisbuch G, Deffuant G, Amblard F, et al. Interacting Agents and Continuous Opinions Dynamics[M]. Springer Berlin Heidelberg, 2003.
[46]Hassani S.Dirac Delta Function[M].Springer New York,2000.
[47] Watts D J, Strogatz S H. Collective dynamics of 'small-world' networks[J]. Nature, 1998, 393(6684): 440–442.
Claims (10)
1. A Nash equilibrium strategy on a continuous action space, characterized by comprising the following steps:
(1) setting constants α_ub and α_us, where α_ub > α_us, and α_Q, α_σ ∈ (0,1) are learning rates;
(2) initializing parameters, wherein the parameters include the mean u_i of agent i's expected action, the cumulative average strategy x̄_i, a constant C, the variance σ_i, and the cumulative average return Q_i;
(3) repeating the following steps until the cumulative average strategy x̄_i of agent i's sampled actions converges:
(3.1) randomly selecting an action x_i according to the normal distribution N(u_i, σ_i) with a certain exploration rate;
(3.2) executing action x_i, and then obtaining a return r_i from the environment;
(3.3) if the return r_i received after agent i executes action x_i is greater than the current cumulative average return Q_i, taking the learning rate of u_i to be α_ub, and otherwise α_us; updating u_i according to the selected learning rate;
(3.4) updating the variance σ_i according to the learned u_i;
(3.5) if the return r_i received after agent i executes action x_i is greater than the current cumulative average return Q_i, taking the learning rate to be α_ub, and otherwise α_us; updating Q_i according to the selected learning rate;
(3.6) updating x̄_i according to the constant C and the action x_i;
(4) outputting the cumulative average strategy x̄_i as the final action of agent i.
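The loop of steps (1)-(4) can be sketched as follows. This is a minimal illustration, not the patented implementation: the exact update rules for u_i, σ_i, Q_i and x̄_i are not reproduced in this text, so the concrete forms below (win-or-learn-fast rate switching, multiplicative variance adjustment, a running average with pseudo-count C) and all parameter values are illustrative assumptions.

```python
import random

def learn_continuous_nash(env_reward, a_ub=0.1, a_us=0.01, a_sigma=0.05,
                          C=500, sigma_min=0.05, steps=5000, seed=0):
    """Sketch of the claimed loop. env_reward(x) -> float is the return of
    action x; a_ub > a_us are the fast/slow learning rates of (3.3)/(3.5);
    sigma_min plays the role of the variance lower bound sigma_L of claim 3."""
    rng = random.Random(seed)
    u, sigma = 0.5, 0.3          # (2) mean of the expected action, variance term
    Q = env_reward(u)            # (2) cumulative average return (illustrative init)
    avg = u                      # (2) cumulative average strategy
    for t in range(1, steps + 1):
        x = rng.gauss(u, sigma)                  # (3.1) sample from N(u, sigma)
        r = env_reward(x)                        # (3.2) return from the environment
        win = r > Q                              # better than the running average?
        a = a_ub if win else a_us                # (3.3)/(3.5) rate selection
        u += a * (x - u)                         # (3.3) update mean toward sample
        sigma = (max(sigma * (1 - a_sigma), sigma_min) if win
                 else min(sigma * (1 + a_sigma), 1.0))  # (3.4) adjust exploration
        Q += a * (r - Q)                         # (3.5) update cumulative avg return
        avg += (x - avg) / (t + C)               # (3.6) running average weighted by C
    return avg                                   # (4) final action
```

With a single-peaked return such as `lambda x: -(x - 0.8) ** 2`, the sampled actions concentrate near the maximizer and the returned cumulative average strategy drifts toward it.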
2. The Nash equilibrium strategy on a continuous action space according to claim 1, characterized in that: in steps (3.3) and (3.5), the update step size of Q and the update step size of u are synchronized; within a neighborhood of u_i, the mapping from u_i to Q_i can be linearized as Q_i = K·u_i + C, where K is the slope.
3. The Nash equilibrium strategy on a continuous action space according to claim 2, characterized in that: given a positive number σ_L and a positive number K, the Nash equilibrium strategies of two agents on the continuous action space finally converge to a Nash equilibrium, where σ_L is the lower bound of the variance σ.
4. A social network public opinion evolution model based on the Nash equilibrium strategy on a continuous action space according to any one of claims 1-3, characterized in that: the social network public opinion evolution model includes two classes of agents: Gossiper agents, which simulate the ordinary public in the social network, and Media agents, which simulate media or public figures in the social network whose purpose is to attract the ordinary public; wherein each Media agent uses the Nash equilibrium strategy on the continuous action space to calculate the optimal idea returned to it, updates its idea, and broadcasts it in the social network.
5. The social network public opinion evolution model according to claim 4, characterized by comprising the following steps:
S1: initializing the idea of each Gossiper and each Media to a random value on the action space [0,1];
S2: in each interaction, each agent adjusts its own idea according to the following strategies, until no agent changes its idea any longer:
S21: for any Gossiper agent, randomly selecting a neighbor in the Gossiper network according to a set probability, updating its idea according to the BCM (bounded confidence model) policy, and selecting a Media to follow;
S22: randomly sampling a subset G' of the Gossiper network G, and broadcasting the ideas of the Gossipers in the subset G' to all Media;
S23: for any Media, calculating the optimal idea returned to it using the Nash equilibrium strategy on the continuous action space, and broadcasting the updated idea in the entire social network.
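Steps S1-S23 can be sketched as a minimal simulation loop. The Media best response here is deliberately simplified to the mean of the sampled Gossiper ideas, standing in for the Nash-equilibrium calculation of claims 1-3, and the neighbor-selection scheme and all parameter values are illustrative assumptions.

```python
import random

def simulate_opinions(n_gossipers=50, n_media=1, rounds=200,
                      d_g=0.5, a_g=0.3, d_m=0.5, a_m=0.3,
                      sample_frac=0.2, seed=1):
    """Minimal sketch of S1-S23: Gossipers mix ideas under bounded confidence,
    Media update from a sampled subset G' and broadcast back."""
    rng = random.Random(seed)
    gossip = [rng.random() for _ in range(n_gossipers)]   # S1: random ideas on [0,1]
    media = [rng.random() for _ in range(n_media)]
    for _ in range(rounds):                               # S2: repeated interactions
        for i in range(n_gossipers):                      # S21: Gossiper updates
            if rng.random() < 0.5:                        # talk to a random Gossiper
                j = rng.randrange(n_gossipers)
                if abs(gossip[j] - gossip[i]) < d_g:
                    gossip[i] += a_g * (gossip[j] - gossip[i])
            else:                                         # listen to the closest Media
                k = min(range(n_media), key=lambda m: abs(media[m] - gossip[i]))
                if abs(media[k] - gossip[i]) < d_m:
                    gossip[i] += a_m * (media[k] - gossip[i])
        sample = rng.sample(gossip, int(sample_frac * n_gossipers))  # S22: subset G'
        for m in range(n_media):                          # S23: simplified best response
            media[m] = sum(sample) / len(sample)
    return gossip, media
```

Running it with one Media and confidence thresholds of 0.5 drives the Gossiper ideas toward consensus, consistent with claim 10's qualitative statement.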
6. The social network public opinion evolution model according to claim 5, characterized in that in step S21 the Gossiper agent operates as follows:
A1: idea initialization: x_i^τ = x_i^(τ-1);
A2: idea update: when the difference between the agent's idea and the idea of the selected agent is less than a given threshold, the agent updates its idea;
A3: the agent compares the differences between its own idea and the ideas of the Media, and follows one Media selected according to a probability.
7. The social network public opinion evolution model according to claim 6, characterized in that: in step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); if the currently selected neighbor is Media k with idea y_k^τ and |y_k^τ − x_i^τ| < d_m, then x_i^τ ← x_i^τ + α_m(y_k^τ − x_i^τ); wherein d_g and d_m are the idea thresholds set for the different types of neighbors, and α_g and α_m are the learning rates for the different types of neighbors.
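The bounded-confidence update of step A2 reduces to a one-line rule: move toward the neighbor's idea only when the two ideas differ by less than the threshold. The same function covers both neighbor types by passing (d_g, α_g) or (d_m, α_m).

```python
def bcm_update(x_i, x_neighbor, threshold, rate):
    """Claim 7's bounded-confidence update: x_i moves a fraction `rate`
    toward x_neighbor only if their ideas differ by less than `threshold`."""
    if abs(x_neighbor - x_i) < threshold:
        return x_i + rate * (x_neighbor - x_i)
    return x_i
```

For example, `bcm_update(0.2, 0.4, 0.5, 0.5)` moves the idea halfway to 0.3, while `bcm_update(0.2, 0.9, 0.5, 0.5)` leaves it unchanged because the gap 0.7 exceeds the threshold.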
8. The social network public opinion evolution model according to claim 7, characterized in that: in step A3, the Gossiper follows Media k according to a probability P_ik that depends on the differences between its own idea and the ideas of the respective Media.
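The exact formula for the follow probability P_ik of claim 8 is not reproduced in this text. One plausible form consistent with step A3 (closer Media are followed with higher probability) is a softmax over negative idea distances; the function below, including the sharpness parameter `beta`, is a hypothetical stand-in, not the claimed formula.

```python
import math

def follow_probabilities(x_i, media_ideas, beta=5.0):
    """Hypothetical follow probabilities: Media whose ideas are closer to the
    Gossiper's idea x_i receive exponentially larger weight."""
    weights = [math.exp(-beta * abs(y - x_i)) for y in media_ideas]
    total = sum(weights)
    return [w / total for w in weights]
```

The probabilities sum to one, and the Media nearest to the Gossiper's idea gets the largest share.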
9. The social network public opinion evolution model according to claim 8, characterized in that: in step S23, the current return r_j of Media j is defined as the ratio of the number of Gossipers in G' that choose to follow j to the total number of Gossipers in G', where P_ij denotes the probability that Gossiper i follows Media j.
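Read as an expectation, the return of claim 9 for Media j is simply the mean of the follow probabilities P_ij over the sampled subset G' (the expected fraction of sampled Gossipers that follow j). A one-line sketch:

```python
def media_reward(follow_probs):
    """Claim 9's return for Media j: the expected fraction of Gossipers in G'
    that follow j, given each sampled Gossiper's probability P_ij."""
    return sum(follow_probs) / len(follow_probs)
```

For instance, with four sampled Gossipers whose follow probabilities are 1.0, 0.0, 0.5 and 0.5, the Media's return is 0.5.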
10. The social network public opinion evolution model according to any one of claims 4-9, characterized in that: the presence of Media accelerates the convergence of the public opinion of the Gossiper agents toward consensus; in an environment in which multiple Media compete, the dynamic change of each Gossiper agent's idea is a weighted average of the influences of the respective Media.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/098101 WO2020024170A1 (en) | 2018-08-01 | 2018-08-01 | Nash equilibrium strategy and social network consensus evolution model in continuous action space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109496305A true CN109496305A (en) | 2019-03-19 |
CN109496305B CN109496305B (en) | 2022-05-13 |
Family
ID=65713809
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201880001570.9A Active CN109496305B (en) | 2018-08-01 | 2018-08-01 | Social network public opinion evolution method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109496305B (en) |
WO (1) | WO2020024170A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362754A (en) * | 2019-06-11 | 2019-10-22 | 浙江大学 | The method that social network information source is detected on line based on intensified learning |
CN111445291A (en) * | 2020-04-01 | 2020-07-24 | 电子科技大学 | Method for providing dynamic decision for social network influence maximization problem |
CN112862175A (en) * | 2021-02-01 | 2021-05-28 | 天津天大求实电力新技术股份有限公司 | Local optimization control method and device based on P2P power transaction |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112801299B (en) * | 2021-01-26 | 2023-12-01 | 西安电子科技大学 | Method, system and application for constructing game model of evolution of reward and punishment mechanism |
CN113572548B (en) * | 2021-06-18 | 2023-07-07 | 南京理工大学 | Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning |
CN113645589B (en) * | 2021-07-09 | 2024-05-17 | 北京邮电大学 | Unmanned aerial vehicle cluster route calculation method based on inverse fact policy gradient |
CN113568954B (en) * | 2021-08-02 | 2024-03-19 | 湖北工业大学 | Parameter optimization method and system for preprocessing stage of network flow prediction data |
CN113778619B (en) * | 2021-08-12 | 2024-05-14 | 鹏城实验室 | Multi-agent state control method, device and terminal for multi-cluster game |
CN113687657B (en) * | 2021-08-26 | 2023-07-14 | 鲁东大学 | Method and storage medium for multi-agent formation dynamic path planning |
CN114021456A (en) * | 2021-11-05 | 2022-02-08 | 沈阳飞机设计研究所扬州协同创新研究院有限公司 | Intelligent agent invalid behavior switching inhibition method based on reinforcement learning |
CN114065916A (en) * | 2021-11-11 | 2022-02-18 | 西安工业大学 | DQN-based agent training method |
CN114845359A (en) * | 2022-03-14 | 2022-08-02 | 中国人民解放军军事科学院战争研究院 | Multi-intelligent heterogeneous network selection method based on Nash Q-Learning |
CN115515101A (en) * | 2022-09-23 | 2022-12-23 | 西北工业大学 | Decoupling Q learning intelligent codebook selection method for SCMA-V2X system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090055268A1 (en) * | 2007-08-20 | 2009-02-26 | Ads-Vantage, Ltd. | System and method for auctioning targeted advertisement placement for video audiences |
CN103490413A (en) * | 2013-09-27 | 2014-01-01 | 华南理工大学 | Intelligent electricity generation control method based on intelligent body equalization algorithm |
CN106358308A (en) * | 2015-07-14 | 2017-01-25 | 北京化工大学 | Resource allocation method for reinforcement learning in ultra-dense network |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
CN107135224A (en) * | 2017-05-12 | 2017-09-05 | 中国人民解放军信息工程大学 | Cyber-defence strategy choosing method and its device based on Markov evolutionary Games |
US20180033081A1 (en) * | 2016-07-27 | 2018-02-01 | Aristotle P.C. Karas | Auction management system and method |
CN107832882A (en) * | 2017-11-03 | 2018-03-23 | 上海交通大学 | A kind of taxi based on markov decision process seeks objective policy recommendation method |
CN107979540A (en) * | 2017-10-13 | 2018-05-01 | 北京邮电大学 | A kind of load-balancing method and system of SDN network multi-controller |
CN109511277A (en) * | 2018-08-01 | 2019-03-22 | 东莞理工学院 | The cooperative method and system of multimode Continuous action space |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106936855B (en) * | 2017-05-12 | 2020-01-10 | 中国人民解放军信息工程大学 | Network security defense decision-making determination method and device based on attack and defense differential game |
CN108092307A (en) * | 2017-12-15 | 2018-05-29 | 三峡大学 | Layered distribution type intelligent power generation control method based on virtual wolf pack strategy |
2018
- 2018-08-01 WO PCT/CN2018/098101 (WO2020024170A1): Application Filing, active
- 2018-08-01 CN CN201880001570.9A (CN109496305B): Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090055268A1 (en) * | 2007-08-20 | 2009-02-26 | Ads-Vantage, Ltd. | System and method for auctioning targeted advertisement placement for video audiences |
CN103490413A (en) * | 2013-09-27 | 2014-01-01 | 华南理工大学 | Intelligent electricity generation control method based on intelligent body equalization algorithm |
CN106358308A (en) * | 2015-07-14 | 2017-01-25 | 北京化工大学 | Resource allocation method for reinforcement learning in ultra-dense network |
US20180033081A1 (en) * | 2016-07-27 | 2018-02-01 | Aristotle P.C. Karas | Auction management system and method |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
CN107135224A (en) * | 2017-05-12 | 2017-09-05 | 中国人民解放军信息工程大学 | Cyber-defence strategy choosing method and its device based on Markov evolutionary Games |
CN107979540A (en) * | 2017-10-13 | 2018-05-01 | 北京邮电大学 | A kind of load-balancing method and system of SDN network multi-controller |
CN107832882A (en) * | 2017-11-03 | 2018-03-23 | 上海交通大学 | A kind of taxi based on markov decision process seeks objective policy recommendation method |
CN109511277A (en) * | 2018-08-01 | 2019-03-22 | 东莞理工学院 | The cooperative method and system of multimode Continuous action space |
Non-Patent Citations (1)
Title |
---|
SONG Yujian et al.: "Resource leveling optimization of network planning using a multi-agent cuckoo algorithm", Computer Engineering and Applications * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362754A (en) * | 2019-06-11 | 2019-10-22 | 浙江大学 | The method that social network information source is detected on line based on intensified learning |
CN110362754B (en) * | 2019-06-11 | 2022-04-29 | 浙江大学 | Online social network information source detection method based on reinforcement learning |
CN111445291A (en) * | 2020-04-01 | 2020-07-24 | 电子科技大学 | Method for providing dynamic decision for social network influence maximization problem |
CN111445291B (en) * | 2020-04-01 | 2022-05-13 | 电子科技大学 | Method for providing dynamic decision for social network influence maximization problem |
CN112862175A (en) * | 2021-02-01 | 2021-05-28 | 天津天大求实电力新技术股份有限公司 | Local optimization control method and device based on P2P power transaction |
Also Published As
Publication number | Publication date |
---|---|
WO2020024170A1 (en) | 2020-02-06 |
CN109496305B (en) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109496305A (en) | Nash equilibrium strategy on continuous action space and social network public opinion evolution model | |
Wang et al. | R-MADDPG for partially observable environments and limited communication | |
Russell et al. | Q-decomposition for reinforcement learning agents | |
Busoniu et al. | A comprehensive survey of multiagent reinforcement learning | |
Zhang et al. | Collective behavior coordination with predictive mechanisms | |
Abed-Alguni et al. | A comparison study of cooperative Q-learning algorithms for independent learners | |
WO2019127945A1 (en) | Structured neural network-based imaging task schedulability prediction method | |
Simões et al. | Multi-agent actor centralized-critic with communication | |
Xu et al. | Learning multi-agent coordination for enhancing target coverage in directional sensor networks | |
Mehta | State-of-the-art reinforcement learning algorithms | |
CN114510012A (en) | Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning | |
JP7448683B2 (en) | Learning options for action selection using meta-gradient in multi-task reinforcement learning | |
Wang et al. | Distributed reinforcement learning for robot teams: A review | |
Liu et al. | Efficient exploration for multi-agent reinforcement learning via transferable successor features | |
Yun et al. | Multi-agent deep reinforcement learning using attentive graph neural architectures for real-time strategy games | |
Juang et al. | A self-generating fuzzy system with ant and particle swarm cooperative optimization | |
Choudhury et al. | Scalable Online planning for multi-agent MDPs | |
Han et al. | Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c | |
Zhou et al. | Strategic interaction multi-agent deep reinforcement learning | |
Abed-Alguni | Cooperative reinforcement learning for independent learners | |
Dias et al. | Quantum-inspired neuro coevolution model applied to coordination problems | |
Lima et al. | Formal analysis in a cellular automata ant model using swarm intelligence in robotics foraging task | |
Subramanian et al. | Efficient exploration in monte carlo tree search using human action abstractions | |
Zhan et al. | Dueling network architecture for multi-agent deep deterministic policy gradient | |
Martín H et al. | Learning autonomous helicopter flight with evolutionary reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||