CN109496305A - Nash equilibrium strategy on continuous action space and social network public opinion evolution model - Google Patents
- Publication number: CN109496305A
- Application number: CN201880001570.9A
- Authority: CN (China)
- Prior art keywords: media, gossiper, opinion, agent, strategy
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — Physics
- G06 — Computing; calculating or counting
- G06Q — Information and communication technology [ICT] specially adapted for administrative, commercial, financial, managerial or supervisory purposes
- G06Q50/00 — ICT specially adapted for implementation of business processes of specific business sectors
- G06Q50/01 — Social networking
Abstract
The invention provides a Nash equilibrium strategy on a continuous action space and a social network public opinion evolution model, belonging to the field of reinforcement learning methods. The strategy of the invention comprises the following steps: initialize the parameters; randomly select an action x_i according to a normal distribution N(u_i, σ_i) with a certain exploration rate; execute the action and obtain a reward r_i from the environment; if the reward r_i received by agent i after executing action x_i is greater than the current cumulative average reward Q_i, then the learning rate of u_i is α_ub, otherwise the learning rate is α_us; update u_i, the variance σ_i and Q_i according to the selected learning rate, and finally update the cumulative average policy x̄_i; if the cumulative average policy x̄_i converges, output x̄_i as the final action of agent i. The beneficial effects of the invention are: in interaction with other agents, each agent can maximize its own interest and can ultimately learn a Nash equilibrium.
Description
Technical field
The present invention relates to a Nash equilibrium strategy, and more particularly to a Nash equilibrium strategy on a continuous action space. It further relates to a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space.
Background art

In a continuous action space environment, on the one hand, an agent has infinitely many actions to choose from, and traditional Q-table algorithms cannot store estimates of infinitely many returns; on the other hand, in a multi-agent environment, a continuous action space further increases the difficulty of the problem.

In the field of multi-agent reinforcement learning, the action space of an agent can be a discrete finite set or a continuous set. Because the essence of reinforcement learning is to find an optimum through continuous trial and error, while a continuous action space offers infinitely many action choices and a multi-agent environment increases the dimensionality of the action space, general reinforcement learning algorithms have difficulty learning the global optimum (or equilibrium).
At present most algorithms solve continuous problems using function approximation. Such algorithms fall into two classes: value approximation algorithms [1-5] and policy approximation algorithms [6-9]. Value approximation algorithms explore the action space and estimate the corresponding value function from the returns, while policy approximation algorithms define the policy as a probability distribution function over the continuous action space and learn the policy directly. The performance of these algorithms depends on the accuracy of the estimate of the value function or policy, and they are often inadequate for difficult problems such as nonlinear control. In addition, there is a class of sampling-based algorithms [10, 11] that maintain a discrete action set, select the optimal action in the set using a conventional discrete algorithm, and then resample the action set according to a new mechanism so as to gradually learn the optimum. These algorithms combine easily with conventional discrete algorithms, but their disadvantage is a longer convergence time. All of the above algorithms are designed to compute the optimal policy in a single-agent environment and cannot be applied directly to learning in a multi-agent environment.
In recent years much work has used agent-based simulation techniques to study public opinion in social networks [12-14]. Given different groups with different opinion distributions, the research asks whether the groups, in the course of mutual interaction, will eventually reach consensus, polarize, or remain fragmented [15]. The key to this problem is understanding the dynamics of opinion evolution, and thereby the underlying causes that drive opinions toward consensus [15]. For the opinion evolution problem in social networks, researchers have proposed a variety of multi-agent learning models [16-20] and studied the influence of factors such as information sharing and degree of communication on opinion evolution; in particular, [21-23] studied the influence of such factors. Works such as [14, 24-28] used evolutionary game theory to model how agent behaviors (such as defection and cooperation) evolve through interaction with peers. These works model the behavior of agents and assume that all agents are identical. In reality, however, individuals can play different roles in society (for example, leader or follower), which the above methods cannot model accurately. To this end, Quattrociocchi et al. [12] divided social groups into media and the public and modeled them separately, where the public's opinions are influenced by the media they follow and by other members of the public, and the media's opinions are influenced by outstanding members of the media. Later, Zhao et al. [29] proposed a leader-follower opinion model to explore the formation of public opinion. In both works, an agent adjusts its opinion by imitating a leader or a successful peer. Other imitation-based works include Local Majority [30], Conformity [31] and Imitating Neighbor [32]. However, in real environments, the strategies people adopt in making decisions are far more complex than simple imitation: people usually make decisions by continually interacting with an unknown environment and combining their own behavior with the knowledge they have acquired. Moreover, imitation-based strategies cannot guarantee that the algorithm learns the global optimum, because the quality of an agent's strategy depends on the strategy of the leader or the imitated agent, and the leader's strategy is not necessarily the best.
Summary of the invention
To solve the problems in the prior art, the present invention provides a Nash equilibrium strategy on a continuous action space, and further provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space.

The present invention includes the following steps:
(1) Set constants α_ub and α_us, where α_ub > α_us, and learning rates α_Q, α_σ ∈ (0, 1);
(2) Initialize the parameters, including the mean u_i of agent i's expected action, the cumulative average strategy x̄_i, the constant C, the variance σ_i and the cumulative average reward Q_i;
(3) Repeat the following steps until the cumulative average strategy x̄_i of agent i's sampled actions converges:
(3.1) With a certain exploration rate, randomly select an action x_i according to the normal distribution N(u_i, σ_i);
(3.2) Execute action x_i, then obtain the reward r_i from the environment;
(3.3) If the reward r_i received by agent i after executing action x_i is greater than the current cumulative average reward Q_i, then the learning rate of u_i is α_ub, otherwise the learning rate is α_us; update u_i according to the selected learning rate;
(3.4) Update the variance σ_i according to the learned u_i;
(3.5) If the reward r_i received by agent i after executing action x_i is greater than the current cumulative average reward Q_i, then the learning rate is α_ub, otherwise the learning rate is α_us; update Q_i according to the selected learning rate;
(3.6) Update x̄_i according to the constant C and the action x_i;
(4) Output the cumulative average strategy x̄_i as the final action of agent i.
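As a minimal illustrative sketch (not the patented implementation), the loop of steps (1)-(4) can be written as follows. The reward function, all constants, and the exact update rules for σ_i and x̄_i here are assumptions, since the corresponding formulas are not reproduced in this text:

```python
import random

def wols_cala(reward, a_ub=0.05, a_us=0.01, a_sigma=0.01,
              u=0.5, sigma=0.3, sigma_l=1e-3, n_steps=20000):
    """Sketch of steps (1)-(4): a WoLS-style single-agent learning loop.

    reward: callable mapping an action x to a scalar reward r.
    The sigma and running-average updates are illustrative assumptions,
    not the patent's exact formulas.
    """
    q = 0.0     # cumulative average reward Q_i
    xbar = u    # cumulative average strategy of sampled actions
    for t in range(1, n_steps + 1):
        x = random.gauss(u, max(sigma, sigma_l))   # step 3.1: sample action
        r = reward(x)                              # step 3.2: obtain reward
        alpha = a_ub if r > q else a_us            # steps 3.3/3.5: WoLS rate
        u += alpha * (r - q) * (x - u)             # step 3.3: move the mean
        sigma += a_sigma * abs(r - q) * ((x - u) ** 2 - sigma ** 2)  # step 3.4 (assumed form)
        sigma = min(max(sigma, sigma_l), 1.0)
        q += alpha * (r - q)                       # step 3.5: update Q_i
        xbar += (x - xbar) / t                     # step 3.6: running average
    return xbar

# Toy usage: a single-agent quadratic reward peaked at x = 0.7.
best = wols_cala(lambda x: -(x - 0.7) ** 2)
```

With this toy reward, the returned cumulative average strategy should settle near the maximizer 0.7.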
In a further refinement of the present invention, in step (3.3) and step (3.5), the update step size of Q is synchronized with the update step size of u: in a small neighbourhood of u_i, the mapping from u_i to Q_i can be linearized as Q_i = K u_i + C, where K is the slope of the linearization.
In a further refinement of the present invention, given a positive number σ_L and a positive number K, the Nash equilibrium strategies of two agents on the continuous action space can ultimately converge to a Nash equilibrium, where σ_L is the lower bound of the variance σ.
The present invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model includes two classes of agents: Gossiper agents, which simulate ordinary members of the public in a social network, and Media agents, which simulate media outlets or public figures whose aim is to attract the ordinary public. A Media agent uses the Nash equilibrium strategy on the continuous action space to compute the opinion that is optimal for its reward, updates its opinion and broadcasts it in the social network.
In a further refinement, the model includes the following steps:
S1: The opinion of each Gossiper and each Media is initialized to a random value on the action space [0, 1];
S2: In each interaction, each agent adjusts its own opinion according to the following strategies, until no agent changes its opinion any more;
S21: Any Gossiper agent randomly selects a neighbour in the Gossiper network according to a set probability, updates its opinion according to the BCM (bounded confidence model) strategy, and updates the Media it follows;
S22: A subset G′ of the Gossiper network G is randomly sampled, and the opinions of the Gossipers in G′ are broadcast to all Media;
S23: Any Media computes its optimal opinion using the Nash equilibrium strategy on the continuous action space, and broadcasts the updated opinion to the entire social network.
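The interaction loop S1-S2 can be sketched as follows. The network structure (fully connected), sampling sizes, and parameter values are illustrative assumptions, and the Media update here is a simple heuristic standing in for the WoLS-CALA strategy of S23:

```python
import random

def simulate(n_gossipers=30, n_media=2, d_g=0.3, a_g=0.3, rounds=200, seed=1):
    """Sketch of S1-S2: Gossiper opinions evolve by bounded confidence (BCM);
    Media opinions here simply track the mean of a sampled subset G'.
    (In the patent, Media instead learn via WoLS-CALA.)"""
    rng = random.Random(seed)
    gossipers = [rng.random() for _ in range(n_gossipers)]   # S1: init on [0, 1]
    media = [rng.random() for _ in range(n_media)]
    for _ in range(rounds):
        # S21: each Gossiper looks at a random Gossiper neighbour
        # (fully connected network assumed) and applies the BCM update.
        for i in range(n_gossipers):
            j = rng.randrange(n_gossipers)
            if i != j and abs(gossipers[j] - gossipers[i]) < d_g:
                gossipers[i] += a_g * (gossipers[j] - gossipers[i])
        # S22: sample a subset G' and broadcast its opinions to all Media.
        sample = rng.sample(gossipers, k=n_gossipers // 3)
        # S23 (heuristic stand-in): each Media moves toward the sample mean.
        target = sum(sample) / len(sample)
        media = [m + 0.1 * (target - m) for m in media]
    return gossipers, media

gossipers, media = simulate()
```

Because every update is a convex move toward another opinion, all opinions remain inside [0, 1] throughout the simulation.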
In a further refinement, in step S21, the operating method of a Gossiper agent is as follows:
A1: Opinion initialization: x_i^τ = x_i^{τ-1};
A2: Opinion update: when the difference between the agent's opinion and the selected agent's opinion is less than a given threshold, the agent's opinion is updated;
A3: The agent compares the difference between its own opinion and the opinions of the Media, and follows one Media selected according to a probability.

In a further refinement, in step A2, if the currently selected neighbour is Gossiper j and |x_j^τ - x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ - x_i^τ); if the currently selected neighbour is Media k and |y_k^τ - x_i^τ| < d_m, then x_i^τ ← x_i^τ + α_m(y_k^τ - x_i^τ), where d_g and d_m are the opinion thresholds set for the different types of neighbours, and α_g and α_m are the learning rates for the different types of neighbours.
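The A2 update rule can be written directly as a small function; the thresholds and learning rates are the d_g, d_m, α_g, α_m named above, with illustrative values:

```python
def bcm_update(x_i, x_neighbor, threshold, rate):
    """Bounded-confidence (BCM) update from step A2: move toward the
    neighbour's opinion only if it lies within the confidence threshold."""
    if abs(x_neighbor - x_i) < threshold:
        return x_i + rate * (x_neighbor - x_i)
    return x_i

# A Gossiper at 0.5 moves toward a neighbour at 0.6 (within d_g = 0.3),
# but ignores a Media at 0.95 (outside d_m = 0.3).
x = bcm_update(0.5, 0.6, threshold=0.3, rate=0.3)   # -> 0.53
x = bcm_update(x, 0.95, threshold=0.3, rate=0.5)    # unchanged
```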
In a further refinement, in step A3, Gossiper i follows Media k according to a probability P_ik, which is determined by the differences between Gossiper i's opinion and the opinions of the Media.

In a further refinement, in step S23, the current reward r_j of Media j is defined as the ratio of the number of Gossipers in G′ that choose to follow j to the total number of Gossipers in G′, where P_ij denotes the probability that Gossiper i follows Media j.
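Under the (assumed) convention that each sampled Gossiper follows exactly one Media, the reward definition in S23 can be sketched as:

```python
def media_reward(follow_choices, media_id):
    """Reward of a Media from step S23: the fraction of Gossipers in the
    sampled subset G' that currently follow this Media.
    follow_choices: one entry per Gossiper in G', giving the id of the
    Media that Gossiper follows (an assumed representation)."""
    return follow_choices.count(media_id) / len(follow_choices)

# Of 5 sampled Gossipers, 3 follow Media 0 and 2 follow Media 1.
r0 = media_reward([0, 0, 1, 0, 1], media_id=0)   # -> 0.6
```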
In a further refinement, the presence of a single Media accelerates the convergence of the Gossiper agents' opinions toward unity; in an environment with multiple competing Media, the dynamic change of each Gossiper agent's opinion is a weighted average of the influences of the Media.

Compared with the prior art, the beneficial effects of the present invention are: in a continuous action space environment, an agent can maximize its own interest while interacting with other agents, and can ultimately learn a Nash equilibrium.
Description of the drawings
Fig. 1 is a schematic diagram of two agents converging to a Nash equilibrium point for r = 0.7 > 2/3, a = 0.4, b = 0.6;
Fig. 2 is a schematic diagram of two agents converging to a Nash equilibrium point for r = 0.6 < 2/3, a = 0.4, b = 0.6;
Fig. 3 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a fully connected network with no Media;
Fig. 4 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a small-world network with no Media;
Fig. 5 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a fully connected network with one Media;
Fig. 6 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a small-world network with one Media;
Fig. 7 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a fully connected network with two competing Media;
Fig. 8 is a schematic diagram of the opinion evolution of the Gossiper-Media model on a small-world network with two competing Media.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

The Nash equilibrium strategy on a continuous action space of the present invention extends the single-agent reinforcement learning algorithm CALA [7] (Continuous Action Learning Automaton). By introducing the WoLS (Win or Learn Slow) learning mechanism, the algorithm effectively handles the learning problem in multi-agent environments; the Nash equilibrium strategy of the present invention is therefore referred to as WoLS-CALA (Win or Learn Slow Continuous Action Learning Automaton). The present invention first describes CALA in detail.
The Continuous Action Learning Automaton (CALA) [7] is a policy-gradient reinforcement learning algorithm for learning problems on continuous action spaces, in which the agent's strategy is defined as the probability density function of a normal distribution N(u_t, σ_t) over the action space.

The policy update of a CALA agent is as follows: at time t, the agent selects an action x_t according to the normal distribution N(u_t, σ_t); it executes both x_t and u_t and obtains the corresponding rewards V(x_t) and V(u_t) from the environment, which means that the algorithm must execute two actions in each interaction with the environment; finally, it updates the mean and variance of the normal distribution N(u_t, σ_t) according to formulas (1) and (2), where α_u and α_σ are learning rates and K is a positive constant used to control the convergence of the algorithm. Specifically, the size of K is related to the number of learning steps and is typically set to the order of magnitude of 1/N, where N is the number of iterations, and σ_L is the lower bound of the variance σ. The algorithm keeps updating the mean and variance until u is constant and σ_t tends to σ_L. After convergence, the mean u points to an optimal solution of the problem. The size of σ in formula (2) determines the exploration ability of the CALA algorithm: the larger σ_t is, the more likely CALA is to find potentially better actions.
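The CALA update equations (1)-(2) are not reproduced in this text; the following sketch uses the update form reported in the CALA literature [7], and should be checked against the original before use:

```python
import random

def cala_step(u, sigma, reward, alpha_u=0.05, alpha_sigma=0.01,
              K=1e-3, sigma_l=1e-3):
    """One CALA update (form taken from the CALA literature, since the
    patent's formulas (1)-(2) are not reproduced in the text).
    reward: callable giving V(.) for an action. CALA evaluates both the
    sampled action x and the mean u on every step."""
    phi = max(sigma, sigma_l)
    x = random.gauss(u, phi)
    delta = (reward(x) - reward(u)) / phi            # scaled reward difference
    u_new = u + alpha_u * delta * (x - u) / phi
    sigma_new = (sigma + alpha_sigma * delta * (((x - u) / phi) ** 2 - 1)
                 - alpha_sigma * K * (sigma - sigma_l))
    return u_new, max(sigma_new, sigma_l)

# Repeatedly applying cala_step to V(x) = -(x - 0.7)**2 drives u toward 0.7
# while sigma shrinks toward its lower bound sigma_l.
u, sigma = 0.2, 0.3
for _ in range(20000):
    u, sigma = cala_step(u, sigma, lambda x: -(x - 0.7) ** 2)
```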
By definition, CALA is a policy-gradient learning algorithm. It has been proven theoretically that, when the reward function V(x) is sufficiently smooth, CALA can find a local optimum [7]. De Jong et al. [34] improved the reward function to extend CALA to multi-agent environments, and verified experimentally that their modified algorithm can converge to a Nash equilibrium. The WoLS-CALA proposed by the present invention introduces the "WoLS" mechanism to solve the multi-agent learning problem, and analyzes and proves theoretically that the algorithm can learn a Nash equilibrium on a continuous action space.

CALA requires the agent to obtain the rewards of the sampled action and the expected action simultaneously in each learning step, which is infeasible in most reinforcement learning environments: usually an agent can execute only one action per interaction with the environment. For this reason, the present invention extends CALA in two respects, Q-value function estimation and a variable learning rate, to obtain the WoLS-CALA algorithm.
1. Q-function estimation

In an independent multi-agent reinforcement learning environment, an agent selects one action at a time and then obtains a reward from the environment. Given the normal-distribution exploration scheme, a natural approach is to estimate the average reward of the expected action u with a Q value. Specifically, the expected reward Q_i of agent i's action u_i in formula (1) can be estimated by the incremental update in formula (3), where x_i^t is the sampled action at time t, r_i^t is the reward received by agent i when selecting action x_i^t, determined by the joint action of all agents at time t, and α_Q is agent i's learning rate for Q. The update in formula (3) is the standard reinforcement-learning method for estimating the value function of a single state; its essence is to use the running average of r_i to estimate Q_i. A further advantage is that Q_i can be updated one sample at a time, and the newly received reward always contributes a fixed fraction α_Q to the Q-value estimate.
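The incremental estimate described for formula (3) is the standard exponential recency-weighted average; a minimal sketch:

```python
def q_update(q, r, alpha_q=0.1):
    """Formula (3)-style incremental estimate: the new reward always
    contributes a fixed fraction alpha_q to the Q-value estimate."""
    return q + alpha_q * (r - q)

# Feeding a constant reward of 1.0 drives the estimate from 0 toward 1:
# after n updates the estimate equals 1 - (1 - alpha_q)**n.
q = 0.0
for _ in range(50):
    q = q_update(q, 1.0)
```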
According to formula (3), the update of u (formula (1)) and the update of σ (formula (2)) can be rewritten as formulas (4) and (5), where x_i^t is the sampled action at time t, r_i^t is the reward received by agent i when selecting action x_i^t, determined by the joint action of all agents at time t, and α_u^i and α_σ^i are agent i's learning rates for u_i and σ_i.

However, directly using the Q-function estimate in a multi-agent environment brings a new problem: the reward of an agent is affected by the other agents, and changes in the other agents' strategies make the environment non-stationary. The update in formula (4) cannot guarantee that u adapts to the dynamic changes of the environment. Consider a simple example. Suppose that at time t agent i has learned the currently optimal action u_i* and that Q_i is an accurate estimate of its value; by definition, at time t, the reward of an arbitrary x_i does not exceed that of u_i*. If the environment remains unchanged, this continues to hold. If, however, the environment changes so that u_i* is no longer the optimal action, then there exists some x_i whose reward exceeds Q_i; if the updates of formulas (4) and (5) are continued in this situation, u_i can move far away from x_i, whereas in theory u_i should stay close to x_i to guarantee accurate estimation. Because Q is a statistical estimate of r, Q updates more slowly than r changes; the stale relation then keeps holding over many samples during the update process, and u_i remains essentially unchanged near the old optimum, when in theory it should move in search of the new optimal action. The root cause of these problems is the non-stationarity of the multi-agent environment, with which traditional estimation methods (such as Q-learning) cannot cope effectively.
2. The WoLS rule and its analysis

In order to estimate the expected reward of u more accurately in a multi-agent environment, the present invention updates the expected action u with a variable learning rate. Formally, the learning rate of the expected action u_i is defined by formula (6), and the update of u_i is then given by formula (7). The WoLS rule can be interpreted intuitively as: if the reward V(x) of the agent's action x is greater than the reward V(u) of the expectation u, then it should learn fast, and otherwise slowly. WoLS is thus exactly the opposite of the WoLF (Win or Learn Fast) strategy [35]. The difference is that WoLF is designed to guarantee convergence, while the WoLS strategy of the present invention enables the algorithm to update u in the direction of increasing reward while guaranteeing a correct estimate of the expected reward of action u. By analyzing the intrinsic dynamics of the WoLS strategy, the following conclusion is obtained.
Theorem 1: On a continuous action space, the learning dynamics of the CALA algorithm with the WoLS rule can be approximated by a gradient ascent (GA) strategy.

Proof: By formula (4), x_t is the action selected by the agent at time t according to the normal distribution N(u_t, σ_t), and V(x_t) and V(u_t) are the rewards corresponding to actions x_t and u_t respectively. Define f(x) = E[V(x_t) | x_t = x] as the expected reward function of action x. Assuming α_u is infinitesimal, the dynamics of u_t in WoLS-CALA can be described by the ODE in formula (8), where N(u, σ_u) is the probability density function of the normal distribution (and dN(a, b) denotes the differential, with respect to a, of the normal density with mean a and variance b²). Substituting x = u + y, expanding f(x) in formula (8) as a Taylor series at y = 0, and simplifying yields formula (9). Note that in formula (9) the coefficient multiplying the gradient and σ² are always positive.

The update of the standard deviation σ can reuse the conclusion for the original CALA algorithm directly: given a sufficiently large positive number K, σ eventually converges to σ_L. Combining this with formula (9), the following conclusion is obtained: for a small positive number σ_L (for example 1/10000), after sufficient time the ODE for u_t can be approximated by formula (10), where the coefficient is a small positive constant and f′(u) is the gradient of f(u) at u. Formula (10) shows that u changes along the gradient direction of f(u), i.e. the direction of fastest increase of f(u); that is, the dynamics of u can be approximated by a gradient ascent strategy.

When there is only one agent, the dynamics of u finally converge to an optimum point, because when u = u* is an optimum point, the gradient f′(u*) = 0.
It can be seen from Theorem 1 that the learning dynamics of the expected action of a WoLS-CALA agent are similar to the gradient ascent strategy described above, i.e. their time derivatives all take the form of a positive coefficient times f′(u). If f(u) has multiple local optima, whether the algorithm finally converges to the global optimum depends on the algorithm's trade-off between exploration and exploitation [36], a dilemma that cannot be fully resolved in the field of reinforcement learning. To explore toward the global optimum, the usual approach is to take a large initial exploration rate σ (i.e. the standard deviation) and an especially small initial learning rate for σ, so as to guarantee that the algorithm takes a sufficient number of samples over the whole action space. Moreover, since the expected action u of the CALA algorithm with the WoLS rule can converge even when the standard deviation σ is not 0, the lower bound σ_L of the exploration rate σ can be set to a relatively large value to ensure sufficient exploration. In summary, with suitable parameter choices the algorithm can learn the global optimum.
Another problem is that a pure gradient ascent strategy may fail to converge in a multi-agent environment. The present invention therefore combines the PHC (Policy Hill Climbing) algorithm [35] and proposes an independent multi-agent reinforcement learning algorithm of Actor-Critic type, referred to as WoLS-CALA. The main idea of the Actor-Critic architecture is to learn policy evaluation and policy update in separate processes: the part that handles policy evaluation is called the Critic, and the part that handles the policy update is called the Actor. The specific learning process is as follows (Algorithm 1).

Algorithm 1: the learning strategy of WoLS-CALA agent i
For simplicity, Algorithm 1 uses two constants α_ub and α_us (α_ub > α_us) in place of the learning rate of u_i. If the return r_i received after agent i executes action x_i is greater than the current cumulative average return Q_i, the learning rate of u_i is α_ub ("winning"), otherwise it is α_us ("losing") (step 3.3). Because formulas (7) and (4) contain the denominator φ(σ_i^t), when the denominator is small even a tiny error can strongly affect the updates of u and σ; using two fixed step sizes in the experiments makes the update process of the algorithm easier to control and easier to implement. Note also that in step 3.5 the update step size of Q is synchronized with that of u: both are α_ub when r_i > Q_i, and both are α_us otherwise. Because α_ub and α_us are very small numbers, in a small neighbourhood of u_i the mapping from u_i to Q_i can be linearized as Q_i = K·u_i + C, where the slope K is the local derivative of Q_i with respect to u_i. This is done to estimate the expected return of u more accurately. Finally (step 4), the algorithm takes the convergence of the cumulative average strategy ū_i as the loop termination condition and outputs ū_i; this prevents the algorithm from failing to terminate in competitive environments, where u_i may exhibit periodic solutions. Note that ū_i and u_i carry different meanings: ū_i is the cumulative statistical average of the sampled actions of agent i, and in the multi-agent setting it finally converges to the Nash equilibrium strategy; u_i is the expected mean of the policy distribution of agent i, and in competitive environments it may oscillate periodically near the equilibrium point. A detailed explanation is given in Theorem 2 below.
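The update loop of Algorithm 1 can be sketched in Python. This is a minimal illustrative sketch only: the class name and all parameter names are assumptions, and the σ rule and the (r − Q)-scaled mean update are simplifications, since formulas (4) and (7) appear only as images in the source.

```python
import random

class WoLSCALA:
    """Minimal sketch of one WoLS-CALA agent; names and update rules are assumed."""
    def __init__(self, u=0.5, sigma=0.3, alpha_ub=0.05, alpha_us=0.01,
                 alpha_sigma=0.005, sigma_L=0.05):
        self.u, self.sigma = u, sigma          # mean and std-dev of the Gaussian policy
        self.Q = 0.0                           # cumulative average return
        self.u_bar, self.n = u, 1              # cumulative average of sampled actions
        self.alpha_ub, self.alpha_us = alpha_ub, alpha_us
        self.alpha_sigma = alpha_sigma         # small step for the variance update
        self.sigma_L = sigma_L                 # lower bound keeping exploration alive

    def act(self):
        # step 3.1: sample an action from N(u, max(sigma, sigma_L))
        return random.gauss(self.u, max(self.sigma, self.sigma_L))

    def update(self, x, r):
        # steps 3.3/3.5: "winning" step size alpha_ub when r beats Q, else alpha_us
        alpha = self.alpha_ub if r > self.Q else self.alpha_us
        phi = max(self.sigma, self.sigma_L)    # bounded denominator, cf. phi(sigma_i^t)
        delta = (x - self.u) / phi
        self.u += alpha * (r - self.Q) * delta # move mean toward better-than-average actions
        self.sigma = max(self.sigma + self.alpha_sigma * (delta * delta - 1.0),
                         self.sigma_L)         # step 3.4: adapt exploration, keep it >= sigma_L
        self.Q += alpha * (r - self.Q)         # step 3.5: Q uses the same step size as u
        self.n += 1
        self.u_bar += (x - self.u_bar) / self.n  # step 3.6 / step 4: output statistic u-bar
```

A typical use is to call `act`, evaluate the environment's return, and feed both back to `update`; the quantity reported on convergence is `u_bar`, not `u`, matching the termination condition of step 4.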
Because dynamic trajectories in higher-dimensional spaces may exhibit chaos, it is difficult to carry out a qualitative analysis of the dynamic behaviour of the algorithm with many agents. The dynamic analyses of related multi-agent algorithms in the field are essentially all based on two agents [35][37-39]. The main analysis here therefore also treats the case of two WoLS-CALA agents.
Theorem 2: Given a positive number σ_L and a sufficiently large positive number K, the strategies of two WoLS-CALA agents eventually converge to a Nash equilibrium.
Proof: Nash equilibria can be divided into two classes by the position of the equilibrium point: those on the boundary of the continuous action space (a bounded closed set) and those in its interior. Since a boundary equilibrium point is equivalent to an interior equilibrium point of a space of one lower dimension, this example concentrates on the second class. The dynamic behaviour of an ODE depends on the stability properties of its equilibrium points [40]; this example therefore first computes the equilibrium points of formula (10) and then analyzes their stability.
Let x_i^t be the action sampled at time t by agent i according to the normal distribution N(u_i, σ_i), and let Q_1 and Q_2 be the expected returns corresponding to the actions x_1 and x_2 respectively. If a point eq is an equilibrium point of equation (10), then the dynamics vanish at eq. According to nonlinear dynamics theory [40], the stability of the point eq is determined by the eigenvalues of the Jacobian matrix M of the dynamics at eq, whose off-diagonal entries (i ≠ j) are the cross partial derivatives of the expected returns with respect to the other agent's policy mean.
Furthermore, according to the definition of a Nash equilibrium, a Nash equilibrium point satisfies the properties in formula (12). Substituting formula (12) into M, the eigenvalues at a Nash equilibrium point fall into one of the following three cases:
(a) All eigenvalues of the matrix M have negative real parts. Such an equilibrium point is asymptotically stable, i.e., all trajectories near eq eventually converge to it.
(b) All eigenvalues of the matrix M have non-positive real parts and a pair of them is purely imaginary. Such an equilibrium point is stable, but the limit set of a nearby trajectory is a periodic solution. In addition, it is easy to prove that the time average along such a periodic trajectory converges to the Nash equilibrium. Since WoLS-CALA outputs the cumulative average ū, the algorithm can also handle this class of equilibrium points.
(c) The matrix M has an eigenvalue with positive real part, i.e., the equilibrium point is unstable. For such an equilibrium point, according to nonlinear dynamics theory [40], the trajectories around the unstable equilibrium point fall into two kinds: trajectories inside the stable manifold and all other trajectories. The stable manifold is the subspace spanned by the eigenvectors corresponding to the stable eigenvalues. Trajectories inside the stable manifold theoretically converge to this equilibrium point; however, owing to randomness and computational error, the probability that the algorithm stays inside this subspace is 0. All trajectories not belonging to the stable manifold move gradually away from this equilibrium point and finally converge to one of the other kinds of equilibrium points analyzed above, i.e., to a boundary equilibrium point or to an equilibrium point of type (a) or (b).
In addition, similarly to the single-agent environment, if there are multiple equilibrium points then, by the analysis given for Theorem 1, with a suitable exploration-exploitation rate (for example, σ_L sufficiently large, σ given a large initial value and a small learning rate), the algorithm converges to a Nash equilibrium point (the global optimum of each agent when the strategies of the other agents are held fixed). In conclusion, the present invention completes the proof that the algorithm converges to a Nash equilibrium.
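The point made in case (b) of the proof — that the cumulative average of an oscillating trajectory still converges to the equilibrium point — can be illustrated numerically. The circular orbit below is a toy trajectory, not the actual dynamics of formula (10); the equilibrium location and radius are arbitrary.

```python
import math

# toy periodic trajectory orbiting an equilibrium at (0.4, 0.6)
eq = (0.4, 0.6)
n = 100000                                  # 200 full periods of length 500
sx = sy = 0.0
for t in range(n):
    theta = 2.0 * math.pi * t / 500.0       # period-500 orbit of radius 0.1
    sx += eq[0] + 0.1 * math.cos(theta)
    sy += eq[1] + 0.1 * math.sin(theta)
avg = (sx / n, sy / n)                      # cumulative average, like the output u-bar
```

Although the trajectory itself never settles, its running average cancels the oscillation over each full period and recovers the equilibrium point, which is why the algorithm outputs ū rather than u.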
The present invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model contains two classes of agents: Gossiper agents, which simulate ordinary members of the public in a social network, and Media agents, which simulate media or public figures whose goal is to attract ordinary members of the public. The social network public opinion evolution model of the invention is therefore also called the Gossiper-Media model. The Media agents use the Nash equilibrium strategy on the continuous action space to compute the idea that maximizes their return, update their idea, and broadcast it in the social network. The present invention applies the WoLS-CALA algorithm to the study of opinion evolution in real social networks: by modeling the media in the network with WoLS-CALA, it explores the influence that competing media exert on public opinion.
The details are as follows:
1. Gossiper-Media model
The present invention proposes a multi-agent reinforcement learning framework, the Gossiper-Media model, to study the evolution of group opinion. The Gossiper-Media model contains two classes of agents, Gossiper agents and Media agents. Gossiper agents simulate ordinary members of the public in a real network, whose ideas (opinions) are influenced both by the Media and by other Gossipers; Media agents simulate the ideas of media or public figures in a social network whose goal is to attract the masses, and such agents actively choose their own ideas so as to maximize their own followers. Consider a network with N agents, where the number of Gossipers is |G| and the number of Media is |M| (N = G ∪ M). Full connectivity is assumed between Gossipers and Media, i.e., each Gossiper can select any Media to interact with, with equal probability. Full connectivity among the Gossipers themselves is not assumed: each Gossiper can only interact with its own neighbours, and the network among the Gossipers is determined by their social relations. In particular, in the simulation experiments below this example defines two kinds of Gossiper networks: the fully connected network and the small-world network. Denote the ideas of Gossiper i and Media j by x_i and y_j respectively. The interaction of the agents in the model follows Algorithm 2.
Algorithm 2: the learning model of ideas in the Gossiper-Media network
First, the idea of each Gossiper and Media is initialized to a random value in the action space [0,1] (step 1). Then, in each interaction, each agent adjusts its own idea according to its strategy until the algorithm converges (no agent changes its idea any more). Each Gossiper agent first selects an interaction partner: with probability ξ a random Gossiper among its neighbours, or with probability 1-ξ a random Media (step 2.1). The Gossiper then updates its idea according to Algorithm 3 and, based on the differences between its own idea and those of the Media, chooses to follow the Media whose idea is closest to its own. It is assumed that the Media can obtain the ideas of a random sample of Gossipers, denoted G', which is broadcast to all Media (step 2.2). Each Media then plays the game against the other Media using the WoLS-CALA algorithm to compute the idea that maximizes its own followers, and broadcasts its updated idea to the whole network (step 2.3). In principle each Media could sample on its own, so that their sets G' differ; this has little influence on the learning of the WoLS-CALA algorithm below, since the idea distribution of G' is theoretically identical to that of G. This environmental assumption of the invention is made mainly for simplicity, and it also reduces the possible uncertainty caused by random sampling.
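The interaction loop of Algorithm 2 can be sketched as follows. This is an illustrative sketch only: the Gossipers are fully connected for simplicity, and the media rule is a placeholder (drift toward the sampled mean) standing in for the WoLS-CALA game of step 2.3; all names and defaults are assumptions.

```python
import random

def bcm_step(x, x_other, d, alpha):
    # bounded-confidence update: move toward the partner only within threshold d
    return x + alpha * (x_other - x) if abs(x_other - x) <= d else x

def simulate(n_gossiper=50, n_media=2, rounds=200, xi=0.5,
             d_g=0.1, d_m=0.1, alpha=0.5, seed=0):
    rng = random.Random(seed)
    g = [rng.random() for _ in range(n_gossiper)]   # step 1: random initial ideas
    m = [0.5 for _ in range(n_media)]
    for _ in range(rounds):
        for i in range(n_gossiper):
            if rng.random() < xi:                   # step 2.1: talk to a random gossiper
                j = rng.randrange(n_gossiper)
                g[i] = bcm_step(g[i], g[j], d_g, alpha)
            else:                                   # ... or to a random media
                k = rng.randrange(n_media)
                g[i] = bcm_step(g[i], m[k], d_m, alpha)
        sample = rng.sample(g, int(0.8 * n_gossiper))  # step 2.2: shared sample G'
        # step 2.3 placeholder: each media drifts toward the sample mean
        # (stands in for the WoLS-CALA best-response computation)
        mean = sum(sample) / len(sample)
        m = [y + 0.1 * (mean - y) for y in m]
    return g, m
```

All ideas stay in [0,1] because every update is a convex combination of points in [0,1], matching the bounded action space of the model.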
1.1 Gossiper strategy
The strategy of each Gossiper consists of two parts: 1) how to update its own idea; 2) how to select the Media to follow. The details are as follows (Algorithm 3):
Algorithm 3: the strategy of Gossiper i in round τ
For Gossiper i, its idea is first initialized: x_i^τ = x_i^{τ-1} (step 1). Its idea is then updated according to the BCM (bounded confidence model) strategy [12,33] (step 2). The BCM is a typical model for describing group opinion; an agent based on the BCM is influenced only by agents whose ideas are similar to its own. In Algorithm 3, the Gossiper updates its idea only when the difference from the idea of the selected agent is less than the threshold d_g (or d_m), where d_g and d_m correspond to the selected agent being a Gossiper or a Media respectively. The size of the threshold d_g (or d_m) represents the degree to which a Gossiper accepts new ideas; intuitively, the larger d is, the more easily the Gossiper is influenced by other agents [41-43]. The Gossiper then compares the differences between its own idea and those of the Media, and selects one Media to follow probabilistically (step 3). Here P_ij^τ denotes the probability that Gossiper i selects and follows Media j at time τ; it satisfies the following properties:
(i) when |x_i - y_j| > d_m, P_ij = 0;
(ii) P_ij > 0 if and only if the idea y_j of Media j satisfies |x_i - y_j| ≤ d_m;
(iii) P_ij decreases as the distance |x_i - y_j| between the ideas x_i and y_j increases.
Note that if |x_i - y_j| > d_m holds for every Media j, then all P_ij are 0, which means it is possible that a Gossiper follows no Media at all. The parameter δ in the equation for λ_ij is a small positive number used to prevent the denominator of the fraction from being 0.
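Properties (i)-(iii) can be realized, for example, by inverse-distance weights. The exact λ_ij of the patent is not reproduced here (its formula appears only as an image in the source), so the weighting below is an assumed stand-in that satisfies the three stated properties and uses δ to keep the denominator positive.

```python
def follow_probs(x_i, media_ideas, d_m=0.1, delta=1e-6):
    """Follow probabilities P_ij for one gossiper; the inverse-distance
    lambda weighting is an assumption satisfying properties (i)-(iii)."""
    lam = [1.0 / (abs(x_i - y) + delta) if abs(x_i - y) <= d_m else 0.0
           for y in media_ideas]
    total = sum(lam)
    if total == 0.0:
        # no media within the confidence bound d_m: the gossiper follows none
        return [0.0] * len(media_ideas)
    return [l / total for l in lam]      # normalize to a probability vector
```

For instance, a media outside the bound gets probability exactly 0 (property (i)), and of two media inside the bound the closer one gets the larger probability (property (iii)).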
1.2 Media strategy
Given the sampled idea information of a group of Gossipers, each Media can appropriately adjust its own idea through learning to cater to the preferences of the Gossipers, and thereby attract more Gossipers to follow it. In a multi-agent system with several Media, a Nash equilibrium is the stable state finally reached through the mutual competition of the agents; in this state, no agent can obtain a higher return by unilaterally changing its own strategy. Since the action space of a Media is continuous (an idea is defined as any point in the interval [0,1]), the behaviour of a Media is modeled here with the WoLS-CALA algorithm; Algorithm 4 is the Media strategy built on WoLS-CALA.
Algorithm 4: the strategy of Media j in round τ
The current return r_j of Media j is defined as the ratio of the number of Gossipers in G' that choose to follow j to the total number of Gossipers in G'. Here λ_ij is defined as in Algorithm 3, and P_ij denotes the probability that Gossiper i follows Media j.
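Under the same assumed inverse-distance weighting as above (a stand-in for the patent's λ_ij, whose formula appears only as an image), the return r_j — the expected share of G' following Media j — can be sketched as:

```python
def media_reward(j, media_ideas, sample, d_m=0.1, delta=1e-6):
    """Expected fraction of the sampled gossipers G' that follow media j.
    The inverse-distance lambda weighting is an assumed stand-in."""
    total = 0.0
    for x in sample:
        lam = [1.0 / (abs(x - y) + delta) if abs(x - y) <= d_m else 0.0
               for y in media_ideas]
        s = sum(lam)
        if s > 0.0:
            total += lam[j] / s   # P_ij: probability that this gossiper follows j
    return total / len(sample)    # r_j: expected follower share within G'
```

A gossiper within the bound of only one media contributes its full weight to that media; a gossiper within the bound of no media contributes nothing to any r_j.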
2. Dynamic analysis of group opinion
Let {y_j}_{j∈M}, y_j ∈ (0,1), be the ideas of the Media. Assume the Gossiper network is infinitely large; the distribution of Gossiper ideas can then be represented by a continuous density function. Let p(x,t) denote the probability density function of the Gossiper idea distribution at time t. The evolution of Gossiper opinion can then be expressed through the partial derivative of p(x,t) with respect to time. This example first considers the case where there is only one Media.
Theorem 3: In a Gossiper-Media network containing only one Media, the evolution of the Gossiper idea distribution obeys the following formula,
where
I_1 = {x : |x - y| < (1 - α_m) d_m}, I_2 = {x : (1 - α_m) d_m ≤ |x - y| ≤ d_m}.
Proof: based on the mean-field (MF) approximation theory [44], the partial derivative with respect to t of the probability distribution of the ideas of BCM-based Gossipers can be expressed as follows [12],
where W_{x+y→x} denotes the probability that a Gossiper with idea x+y changes its idea to x, and W_{x+y→x} p(x+y) dy denotes the fraction of agents whose ideas transfer from the interval (x+y, x+y+dy) to x during the time interval (t, t+dt). Similarly, W_{x→x+y} denotes the probability that an agent with idea x changes its idea to x+y, and W_{x→x+y} p(x) dy denotes the fraction of Gossipers with idea x that transfer to the interval (x+y, x+y+dy).
According to the definition in Algorithm 3, a Gossiper is influenced by the ideas of other Gossipers with probability ξ, or makes its decision under the influence of the Media's ideas with probability 1-ξ. Splitting W_{x+y→x} and W_{x→x+y} into the part influenced by other Gossipers and the part influenced by the Media's ideas, denoted w^[g] and w^[m] respectively, W_{x→x+y} and W_{x+y→x} can be expressed as,
Substituting formula (18) into formula (17) gives the following.
Define
where Ψ_g(x,t) denotes the rate of change of the probability density function p(x,t) of Gossiper ideas caused by the influence of other Gossipers. Weisbuch G et al. [45] proved that Ψ_g(x,t) obeys the following formula,
where ∂²p/∂x² is the second-order partial derivative of p with respect to x, α_g is a real number between 0 and 0.5, and d_g is the threshold of the Gossipers.
Ψ_m(x,t) denotes the rate of change of the idea density function p(x,t) caused by the influence of the Media. Suppose the idea of Media j is u_j (u_j = x + d_j); the idea distribution of the Media can then be represented by the Dirac delta function q(x) = δ(x - u_j). The Dirac delta function δ(x) [46] is commonly used to model an infinitely tall, narrow spike (an impulse) and similar abstractions such as a point charge, a point mass or an electron, and is defined as follows.
The transfer rate from x+y to x can then be expressed as in formula (21), where δ(x - [(x+y) + α_m((x+z) - (x+y))]) represents the event that the idea x+y is influenced by the idea x+z and transfers to x, and q(x+z) is the density of the Media at the idea x+z. Similarly, w^[m]_{x→x+y} can be expressed as,
Combining formulas (21)-(22), computing and rearranging gives formula (23),
where I_1 = {x : |x - y| < (1 - α_m) d_m} and I_2 = {x : (1 - α_m) d_m ≤ |x - y| ≤ d_m}.
Combining this with formula (20) completes the proof.
From formula (14) it can be seen that the rate of change of p(x,t) is a weighted average of Ψ_g(x,t) and Ψ_m(x,t): the former represents the part of the opinion change driven by the Gossiper network, and the latter the part driven by the Media. The Gossiper-only term Ψ_g(x,t) was studied and analyzed in the work of Weisbuch G [45]. An important property obtained there is that, starting from any distribution, the points of local optimum of the density are gradually reinforced, which shows that in a pure Gossiper network the evolution of opinion always tends gradually toward consensus. In addition, Theorem 3 shows that both Ψ_g(x,t) and Ψ_m(x,t) are independent of the specific Gossiper network, which shows that when the network is infinite, the evolution of opinion is not affected by the network structure.
The second part of equation (14), Ψ_m(x,t) (formula (23)), is analyzed next. Assuming y is constant, analysis of (23) yields formula (24). Intuitively, formula (24) shows that the viewpoints of the Gossipers whose ideas are similar to the idea of the Media all converge to this Media, from which the following conclusion follows.
Corollary 1: The presence of one Media can accelerate the unification of Gossiper opinion.
Next this example considers the case of multiple Media. Define P_j(x) as the probability that a Gossiper whose idea is x is influenced by Media j. In an environment where the Gossipers face multiple competing Media, the dynamic change of ideas can then be expressed as a weighted average of the influences of the individual Media, and the following conclusion is obtained.
Corollary 2: The dynamic change of the distribution function of Gossiper ideas obeys the following formula,
where Ψ_g(x,t) and Ψ_m(x,t) are defined by formulas (20) and (23) respectively.
3. Simulation experiments and analysis
First it is verified that the WoLS-CALA algorithm can learn a Nash equilibrium. Then the experimental simulation of the Gossiper-Media model is presented to verify the preceding theoretical analysis.
3.1 Performance test of the WoLS-CALA algorithm
This example considers a simplified version of the Gossiper-Media model to test whether the WoLS-CALA algorithm can learn a Nash equilibrium strategy. Specifically, the problem of two Media competing for followers is modeled as the following objective optimization problem,
max (f_1(x,y), f_2(x,y))
s.t. x, y ∈ [0,1], (26)
(s.t. denotes the constraint conditions and is the standard notation for optimization problems), where r ∈ [0,1], and a, b ∈ [0,1] with |a - b| ≥ 0.2 are the ideas of the Gossipers.
Here the functions f_1(x,y) and f_2(x,y) simulate r in Algorithm 4, representing the returns of Media 1 and Media 2 under the joint action ⟨x, y⟩. This example uses two WoLS-CALA agents that control x and y respectively through independent learning, each maximizing its own reward function f_1(x,y) or f_2(x,y). In this model, according to the different forms of the Nash equilibrium, the ideas of the Gossipers can be divided into two classes:
(i) when r > 2/3 the equilibrium point is (a, a), and when r < 1/3 the equilibrium point is (b, b);
(ii) when 1/3 ≤ r ≤ 2/3 the equilibrium point is any point in the set {|x - a| < 0.1 ∧ |y - b| < 0.1} or {|x - b| < 0.1 ∧ |y - a| < 0.1}.
In the specific simulation experiments, this example takes one point from each of the two types, namely r = 0.7 > 2/3 and r = 0.6 < 2/3, and then observes whether the algorithm can learn the Nash equilibrium as expected when the idea distributions of the Gossipers differ. Table 1 gives the parameter settings of WoLS-CALA.
Table 1: parameter settings
Figures 1 and 2 show the simulation results of the two experiments. It is evident that in both experiments the Media agents converge to the Nash equilibrium after about 3000 learning steps: to ⟨0.4, 0.57⟩ when r = 0.6, and to ⟨0.4, 0.4⟩ when r = 0.7. As shown in Fig. 1, when r = 0.7 > 2/3, a = 0.4, b = 0.6, the two agents converge to the Nash equilibrium point (0.4, 0.4); as shown in Fig. 2, when r = 0.6 < 2/3, a = 0.4, b = 0.6, agent 1 converges to x = 0.4 and agent 2 converges to y = 0.57.
3.2 Experimental simulation of the Gossiper-Media model
This subsection presents the simulation results of the Gossiper-Media model. Experimental environments with 200 Gossipers and different numbers of Media are considered: (i) no Media; (ii) only one Media; (iii) two competing Media. For each environment, this example considers two representative Gossiper networks, the fully connected network and the small-world network [47]. Through these comparative experiments, this example explores the influence of the Media on the evolution of Gossiper opinion.
For fairness, each experimental environment uses the same parameter settings. The three environments use the same network and the same initial ideas of the Gossipers and Media. Here, the small-world network is generated at random by the Watts-Strogatz construction method [47] with rewiring probability p = 0.2. The initial idea of each Gossiper is sampled from the uniform distribution on the interval [0,1]; the initial idea of each Media is 0.5. Since too large a threshold would interfere with the observation of the experiments, the Gossiper-Media threshold d_m and the Gossiper-Gossiper threshold d_g are both set to a small positive number, 0.1. The learning rates α_g and α_m are set to 0.5. The set G' is sampled at random from G with |G'| = 80% |G|.
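The small-world network used here follows the Watts-Strogatz construction [47]: n nodes on a ring, each tied to its k nearest neighbours, with each ring edge rewired with probability p. A plain-Python sketch is below; the source does not state k, so k = 4 is chosen purely for illustration.

```python
import random

def watts_strogatz(n, k, p, seed=0):
    """Watts-Strogatz small-world graph as an adjacency-set dict (sketch)."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):                          # ring lattice: k nearest neighbours
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            adj[i].add(j); adj[j].add(i)
    for i in range(n):                          # rewire each ring edge with prob. p
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            if rng.random() < p:
                candidates = [m for m in range(n)
                              if m != i and m not in adj[i]]
                if candidates:                  # replace (i, j) by (i, new)
                    new = rng.choice(candidates)
                    adj[i].discard(j); adj[j].discard(i)
                    adj[i].add(new); adj[new].add(i)
    return adj

g = watts_strogatz(200, 4, 0.2)                 # 200 gossipers, as in the experiments
```

Rewiring swaps one endpoint of an edge without creating self-loops or duplicate edges, so the total number of edges (n·k/2) is preserved.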
Under each environment there are two kinds of Gossiper networks: the fully connected network and the small-world network. Figures 3-4 show the opinion evolution of the two networks without Media; Figures 5-6 show the opinion evolution with one Media; Figures 7-8 show the opinion evolution with two competing Media. From these figures it can first be seen that, under each of the three Media environments, the number of final convergence points is the same for the different Gossiper networks: 5 in the zero-Media environment, 4 in the one-Media environment, and 3 in the two-Media environment. This phenomenon is consistent with the conclusions of Theorem 3 and Corollary 2: the opinion dynamics of the Gossipers is independent of the topology of the Gossiper network, since the opinion dynamics under different networks can be modeled by the same formula.
Second, it can be observed from Figures 3-6 that, with one Media, the number of final convergence points of Gossiper opinion in both networks is reduced from 5 to 4. This shows that the presence of a Media can accelerate the unification of Gossiper opinion, in accordance with the conclusion of Corollary 1. Meanwhile, Figures 5-8 show that when the number of Media increases from 1 to 2, the number of final convergence points of Gossiper opinion in both networks is further reduced from 4 to 3, which shows that competing Media can further accelerate the unification of Gossiper opinion.
In addition, the experimental results also verify the performance of the WoLS-CALA algorithm. In Figures 5 and 6, the idea of the Media agent always stays around the idea held by the largest number of Gossipers (N_max = 69 in the fully connected network, N_max = 68 in the small-world network). This phenomenon matches the design expectation that a WoLS-CALA agent can learn the global optimum. In Figures 7 and 8 it can be seen that, with two Media, the idea of one Media stays around the idea held by the largest number of Gossipers (N_max = 89 in both networks), while the other Media stays around the idea held by the second-largest number of Gossipers (N'_max = 70 in the fully connected network, N'_max = 66 in the small-world network). This also matches the expectation of Theorem 2 that two WoLS-CALA agents eventually converge to a Nash equilibrium. In Figures 3-8 the ideas of the Media always oscillate slightly around the Gossiper ideas, because in the Gossiper-Media model the optimal strategy of a Media is not unique (every point within d_m of the Gossiper idea is an optimum for the Media).
4. Summary
The invention proposes WoLS-CALA, an independently learning multi-agent reinforcement learning algorithm for continuous action spaces, and demonstrates both by theoretical proof and by experimental verification that the algorithm can learn a Nash equilibrium. The algorithm is then applied to the study of opinion evolution in a network environment. The individuals in the social network are divided into two classes, Gossiper and Media, which are modeled separately: the Gossiper class represents ordinary members of the public, while the Media class, modeled with the WoLS-CALA algorithm, represents individuals such as social media whose goal is to attract public attention. By modeling the two kinds of agents separately, the present invention explores the influence that the competition of different numbers of Media exerts on Gossiper opinion. Both theory and experiment show that the competition of the Media accelerates the convergence of opinion.
The specific embodiments described above are preferred embodiments of the invention and do not limit its scope of practice. The scope of the present invention includes but is not limited to these embodiments, and all equivalent changes made according to the present invention fall within the scope of the present invention.
The references corresponding to the labels cited in the present invention are as follows:
[1] Pazis J, Lagoudakis M G. Binary Action Search for Learning Continuous-action Control Policies[C]. In Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009: 793–800.
[2]Pazis J,Lagoudakis M G.Reinforcement learning in multidimensional
continuous action spaces[C].In IEEE Symposiumon Adaptive Dynamic Programming&
Reinforcement Learning,2011:97–104.
[3]Sutton R S,Maei H R,Precup D,et al.Fast Gradient-descent Methods
for Temporal-difference Learning with Linear Function Approximation[C].In
Proceedings of the 26th Annual International Conference on Machine Learning,
2009:993–1000.
[4]Pazis J,Parr R.Generalized Value Functions for Large Action Sets
[C].In International Conference on Machine Learning,ICML 2011,Bellevue,
Washington,USA,2011:1185–1192.
[5]Lillicrap T P,Hunt J J,Pritzel A,et al.Continuous control with
deep reinforcement learning[J].Computer Science,2015,8(6):A187.
[6]KONDA V R.Actor-critic algorithms[J].SIAM Journal on Control and
Optimization,2003,42(4).
[7]Thathachar M A L,Sastry P S.Networks of Learning Automata:
Techniques for Online Stochastic Optimization[J].Kluwer Academic Publishers,
2004.
[8]Peters J,Schaal S.2008Special Issue:Reinforcement Learning of
Motor Skills with Policy Gradients[J].Neural Netw.,2008,21(4).
[9]van Hasselt H.Reinforcement Learning in Continuous State and
Action Spaces[M].In Reinforcement Learning:State-of-the-Art.Berlin,
Heidelberg:Springer Berlin Heidelberg,2012:207–251.
[10]Sallans B,Hinton G E.Reinforcement Learning with Factored States
and Actions [J].J.Mach.Learn.Res.,2004,5:1063–1088.
[11]Lazaric A,Restelli M,Bonarini A.Reinforcement Learning in
Continuous Action Spaces through Sequential Monte Carlo Methods[C].In
Conference on Neural Information Processing Systems,Vancouver,British
Columbia,Canada,2007:833–840.
[12]Quattrociocchi W,Caldarelli G,Scala A.Opinion dynamics on
interacting networks:media competition and social influence[J].Scientific
Reports,2014,4(21):4938–4938.
[13]Yang H X,Huang L.Opinion percolation in structured population[J]
.Computer Physics Communications,2015,192(2):124–129.
[14]Chao Y,Tan G,Lv H,et al.Modelling Adaptive Learning Behaviours
for Consensus Formation in Human Societies[J].Scientific Reports,2016,6:
27626.
[15]De Vylder B.The evolution of conventions in multi-agent systems
[J].Unpublished doctoral dissertation,Vrije Universiteit Brussel,Brussels,
2007.
[16]Holley R A,Liggett T M.Ergodic Theorems for Weakly Interacting
Infinite Systems and the Voter Model[J].Annals of Probability,1975,3(4):643–
663.
[17] Nowak A, Szamrej J, Latané B. From private attitude to public opinion: A dynamic theory of social impact[J]. Psychological Review, 1990, 97(3): 362–376.
[18]Tsang A,Larson K.Opinion dynamics of skeptical agents[C].In
Proceedings of the 2014international conference on Autonomous agents and
multi-agent systems,2014:277–284.
[19]Ghaderi J,Srikant R.Opinion dynamics in social networks with
stubborn agents:Equilibrium and convergence rate[J].Automatica,2014,50(12):
3209–3215.
[20]Kimura M,Saito K,Ohara K,et al.Learning to Predict Opinion Share
in Social Networks.[C].In Twenty-Fourth AAAI Conference on Artificial
Intelligence,AAAI 2010,Atlanta,Georgia,Usa,July,2010.
[21]Liakos P,Papakonstantinopoulou K.On the Impact of Social Cost in
Opinion Dynamics [C].In Tenth International AAAI Conference on Web and Social
Media ICWSM,2016.
[22]Bond R M,Fariss C J,Jones J J,et al.A 61-million-person
experiment in social influence and political mobilization[J].Nature,2012,489
(7415):295–8.
[23]Szolnoki A,Perc M.Information sharing promotes prosocial
behaviour[J].New Journal of Physics,2013,15(15):1–5.
[24]Hofbauer J,Sigmund K.Evolutionary games and population dynamics
[M].Cambridge;New York,NY:Cambridge University Press,1998.
[25]Tuyls K,Nowe A,Lenaerts T,et al.An Evolutionary Game Theoretic
Perspective on Learning in Multi-Agent Systems[J].Synthese,2004,139(2):297–
330.
[26] Szabó G, Fáth G. Evolutionary games on graphs[J]. Physics Reports, 2007, 446(4-6): 97–216.
[27]Han T A,Santos F C.The role of intention recognition in the
evolution of cooperative behavior[C].In International Joint Conference on
Artificial Intelligence,2011:1684–1689.
[28]Santos F P,Santos F C,Pacheco J M.Social Norms of Cooperation in
Small-Scale Societies[J].PLoS computational biology,2016,12(1):e1004709.
[29]Zhao Y,Zhang L,Tang M,et al.Bounded confidence opinion dynamics
with opinion leaders and environmental noises[J].Computers and Operations
Research,2016,74(C):205–213.
[30]Pujol J M,Delgado J,Sang,et al.The role of clustering on the
emergence of efficient social conventions[C].In International Joint
Conference on Artificial Intelligence,2005:965–970.
[31]Nori N,Bollegala D,Ishizuka M.Interest Prediction on Multinomial,
Time-Evolving Social Graph.[C].In IJCAI 2011,Proceedings of the International
Joint Conference on Artificial Intelligence,Barcelona,Catalonia,Spain,July,
2011:2507–2512.
[32]Fang H.Trust modeling for opinion evaluation by coping with
subjectivity and dishonesty[C].In International Joint Conference on
Artificial Intelligence,2013:3211–3212.
[33]Deffuant G,Neau D,Amblard F,et al.Mixing beliefs among
interacting agents[J].Advances in Complex Systems,2011,3(1n04):87–98.
[34]De Jong S,Tuyls K,Verbeeck K.Artificial agents learning human
fairness[C].In International Joint Conference on Autonomous Agents and
Multiagent Systems,2008:863–870.
[35] Bowling M, Veloso M. Multiagent learning using a variable learning rate[J]. Artificial Intelligence, 2002, 136(2): 215–250.
[36]Sutton R S,Barto A G.Reinforcement learning:an introduction[M]
.Cambridge,Mass:MIT Press,1998.
[37]Abdallah S,Lesser V.A Multiagent Reinforcement Learning Algorithm
with Non-linear Dynamics[J].J.Artif.Int.Res.,2008,33(1):521–549.
[38]Singh S P,Kearns M J,Mansour Y.Nash Convergence of Gradient
Dynamics in General-Sum Games[J],2000:541–548.
[39]Zhang C,Lesser V R.Multi-agent learning with policy prediction
[J],2010:927–934.
[40]Shilnikov L P,Shilnikov A L,Turaev D,et al.Methods of qualitative
theory in nonlinear dynamics/[M].World Scientific,1998.
[41]Dittmer J C.Consensus formation under bounded confidence[J]
.Nonlinear Analysis Theory Methods and Applications,2001,47(7):4615–4621.
[42] Lorenz J. Continuous opinion dynamics under bounded confidence: A survey[J]. International Journal of Modern Physics C, 2007, 18(12): 1819–1838.
[43]Krawczyk M J,Malarz K,Korff R,et al.Communication and trust in
the bounded confidence model[J].Computational Collective
Intelligence.Technologies and Applications,2010,6421:90–99.
[44]Lasry J M,Lions P L.Mean field games[J].Japanese Journal of
Mathematics,2007,2(1):229–260.
[45] Weisbuch G, Deffuant G, Amblard F, et al. Interacting Agents and Continuous Opinions Dynamics[M]. Springer Berlin Heidelberg, 2003.
[46]Hassani S.Dirac Delta Function[M].Springer New York,2000.
[47] Watts D J, Strogatz S H. Collective dynamics of 'small-world' networks[J]. Nature, 1998, 393(6684): 440–442.
Claims (10)
1. A Nash equilibrium strategy on a continuous action space, characterized by comprising the following steps:
(1) setting constants α_ub and α_us, where α_ub > α_us, and α_Q, α_σ ∈ (0,1) are learning rates;
(2) initializing parameters, wherein the parameters include the mean u_i of agent i's expected action, the cumulative average strategy x̄_i, a constant C, the variance σ_i, and the cumulative average return Q_i;
(3) repeating the following steps until the cumulative average strategy x̄_i of agent i's sampled actions converges:
(3.1) randomly selecting an action x_i according to the normal distribution N(u_i, σ_i) with a certain exploration rate;
(3.2) executing action x_i, and then obtaining a return r_i from the environment;
(3.3) if the return r_i received after agent i executes action x_i is greater than the current cumulative average return Q_i, taking the learning rate of u_i to be α_ub, and otherwise α_us; updating u_i according to the selected learning rate;
(3.4) updating the variance σ_i according to the learned u_i;
(3.5) if the return r_i received after agent i executes action x_i is greater than the current cumulative average return Q_i, taking the learning rate to be α_ub, and otherwise α_us; updating Q_i according to the selected learning rate;
(3.6) updating x̄_i according to the constant C and the action x_i;
(4) outputting the cumulative average strategy x̄_i as the final action of agent i.
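The loop of steps (1)-(4) can be sketched as follows. This is a minimal illustration, not the patented implementation: the exact update rules for u_i, σ_i, Q_i and x̄_i are not reproduced in this text, so the concrete forms below (win-or-learn-fast rate switching, multiplicative variance adjustment, a running average with pseudo-count C) and all parameter values are illustrative assumptions.

```python
import random

def learn_continuous_nash(env_reward, a_ub=0.1, a_us=0.01, a_sigma=0.05,
                          C=500, sigma_min=0.05, steps=5000, seed=0):
    """Sketch of the claimed loop. env_reward(x) -> float is the return of
    action x; a_ub > a_us are the fast/slow learning rates of (3.3)/(3.5);
    sigma_min plays the role of the variance lower bound sigma_L of claim 3."""
    rng = random.Random(seed)
    u, sigma = 0.5, 0.3          # (2) mean of the expected action, variance term
    Q = env_reward(u)            # (2) cumulative average return (illustrative init)
    avg = u                      # (2) cumulative average strategy
    for t in range(1, steps + 1):
        x = rng.gauss(u, sigma)                  # (3.1) sample from N(u, sigma)
        r = env_reward(x)                        # (3.2) return from the environment
        win = r > Q                              # better than the running average?
        a = a_ub if win else a_us                # (3.3)/(3.5) rate selection
        u += a * (x - u)                         # (3.3) update mean toward sample
        sigma = (max(sigma * (1 - a_sigma), sigma_min) if win
                 else min(sigma * (1 + a_sigma), 1.0))  # (3.4) adjust exploration
        Q += a * (r - Q)                         # (3.5) update cumulative avg return
        avg += (x - avg) / (t + C)               # (3.6) running average weighted by C
    return avg                                   # (4) final action
```

With a single-peaked return such as `lambda x: -(x - 0.8) ** 2`, the sampled actions concentrate near the maximizer and the returned cumulative average strategy drifts toward it.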
2. The Nash equilibrium strategy on a continuous action space according to claim 1, characterized in that: in steps (3.3) and (3.5), the update step size of Q and the update step size of u are synchronized; within a neighborhood of u_i, the mapping from u_i to Q_i can be linearized as Q_i = K·u_i + C, where K is the slope.
3. The Nash equilibrium strategy on a continuous action space according to claim 2, characterized in that: given a positive number σ_L and a positive number K, the Nash equilibrium strategies of two agents on the continuous action space finally converge to a Nash equilibrium, where σ_L is the lower bound of the variance σ.
4. A social network public opinion evolution model based on the Nash equilibrium strategy on a continuous action space according to any one of claims 1-3, characterized in that: the social network public opinion evolution model includes two classes of agents: Gossiper agents, which simulate the ordinary public in the social network, and Media agents, which simulate media or public figures in the social network whose purpose is to attract the ordinary public; wherein each Media agent uses the Nash equilibrium strategy on the continuous action space to calculate the optimal idea returned to it, updates its idea, and broadcasts it in the social network.
5. The social network public opinion evolution model according to claim 4, characterized by comprising the following steps:
S1: initializing the idea of each Gossiper and each Media to a random value on the action space [0,1];
S2: in each interaction, each agent adjusts its own idea according to the following strategies, until no agent changes its idea any longer:
S21: for any Gossiper agent, randomly selecting a neighbor in the Gossiper network according to a set probability, updating its idea according to the BCM (bounded confidence model) policy, and selecting a Media to follow;
S22: randomly sampling a subset G' of the Gossiper network G, and broadcasting the ideas of the Gossipers in the subset G' to all Media;
S23: for any Media, calculating the optimal idea returned to it using the Nash equilibrium strategy on the continuous action space, and broadcasting the updated idea in the entire social network.
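Steps S1-S23 can be sketched as a minimal simulation loop. The Media best response here is deliberately simplified to the mean of the sampled Gossiper ideas, standing in for the Nash-equilibrium calculation of claims 1-3, and the neighbor-selection scheme and all parameter values are illustrative assumptions.

```python
import random

def simulate_opinions(n_gossipers=50, n_media=1, rounds=200,
                      d_g=0.5, a_g=0.3, d_m=0.5, a_m=0.3,
                      sample_frac=0.2, seed=1):
    """Minimal sketch of S1-S23: Gossipers mix ideas under bounded confidence,
    Media update from a sampled subset G' and broadcast back."""
    rng = random.Random(seed)
    gossip = [rng.random() for _ in range(n_gossipers)]   # S1: random ideas on [0,1]
    media = [rng.random() for _ in range(n_media)]
    for _ in range(rounds):                               # S2: repeated interactions
        for i in range(n_gossipers):                      # S21: Gossiper updates
            if rng.random() < 0.5:                        # talk to a random Gossiper
                j = rng.randrange(n_gossipers)
                if abs(gossip[j] - gossip[i]) < d_g:
                    gossip[i] += a_g * (gossip[j] - gossip[i])
            else:                                         # listen to the closest Media
                k = min(range(n_media), key=lambda m: abs(media[m] - gossip[i]))
                if abs(media[k] - gossip[i]) < d_m:
                    gossip[i] += a_m * (media[k] - gossip[i])
        sample = rng.sample(gossip, int(sample_frac * n_gossipers))  # S22: subset G'
        for m in range(n_media):                          # S23: simplified best response
            media[m] = sum(sample) / len(sample)
    return gossip, media
```

Running it with one Media and confidence thresholds of 0.5 drives the Gossiper ideas toward consensus, consistent with claim 10's qualitative statement.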
6. The social network public opinion evolution model according to claim 5, characterized in that in step S21 the Gossiper agent operates as follows:
A1: idea initialization: x_i^τ = x_i^(τ-1);
A2: idea update: when the difference between the agent's idea and the idea of the selected agent is less than a given threshold, the agent updates its idea;
A3: the agent compares the differences between its own idea and the ideas of the Media, and follows one Media selected according to a probability.
7. The social network public opinion evolution model according to claim 6, characterized in that: in step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); if the currently selected neighbor is Media k with idea y_k^τ and |y_k^τ − x_i^τ| < d_m, then x_i^τ ← x_i^τ + α_m(y_k^τ − x_i^τ); wherein d_g and d_m are the idea thresholds set for the different types of neighbors, and α_g and α_m are the learning rates for the different types of neighbors.
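The bounded-confidence update of step A2 reduces to a one-line rule: move toward the neighbor's idea only when the two ideas differ by less than the threshold. The same function covers both neighbor types by passing (d_g, α_g) or (d_m, α_m).

```python
def bcm_update(x_i, x_neighbor, threshold, rate):
    """Claim 7's bounded-confidence update: x_i moves a fraction `rate`
    toward x_neighbor only if their ideas differ by less than `threshold`."""
    if abs(x_neighbor - x_i) < threshold:
        return x_i + rate * (x_neighbor - x_i)
    return x_i
```

For example, `bcm_update(0.2, 0.4, 0.5, 0.5)` moves the idea halfway to 0.3, while `bcm_update(0.2, 0.9, 0.5, 0.5)` leaves it unchanged because the gap 0.7 exceeds the threshold.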
8. The social network public opinion evolution model according to claim 7, characterized in that: in step A3, the Gossiper follows Media k according to a probability P_ik that depends on the differences between its own idea and the ideas of the respective Media.
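The exact formula for the follow probability P_ik of claim 8 is not reproduced in this text. One plausible form consistent with step A3 (closer Media are followed with higher probability) is a softmax over negative idea distances; the function below, including the sharpness parameter `beta`, is a hypothetical stand-in, not the claimed formula.

```python
import math

def follow_probabilities(x_i, media_ideas, beta=5.0):
    """Hypothetical follow probabilities: Media whose ideas are closer to the
    Gossiper's idea x_i receive exponentially larger weight."""
    weights = [math.exp(-beta * abs(y - x_i)) for y in media_ideas]
    total = sum(weights)
    return [w / total for w in weights]
```

The probabilities sum to one, and the Media nearest to the Gossiper's idea gets the largest share.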
9. The social network public opinion evolution model according to claim 8, characterized in that: in step S23, the current return r_j of Media j is defined as the ratio of the number of Gossipers in G' that choose to follow j to the total number of Gossipers in G', where P_ij denotes the probability that Gossiper i follows Media j.
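Read as an expectation, the return of claim 9 for Media j is simply the mean of the follow probabilities P_ij over the sampled subset G' (the expected fraction of sampled Gossipers that follow j). A one-line sketch:

```python
def media_reward(follow_probs):
    """Claim 9's return for Media j: the expected fraction of Gossipers in G'
    that follow j, given each sampled Gossiper's probability P_ij."""
    return sum(follow_probs) / len(follow_probs)
```

For instance, with four sampled Gossipers whose follow probabilities are 1.0, 0.0, 0.5 and 0.5, the Media's return is 0.5.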
10. The social network public opinion evolution model according to any one of claims 4-9, characterized in that: the presence of Media accelerates the convergence of the public opinion of the Gossiper agents toward consensus; in an environment in which multiple Media compete, the dynamic change of each Gossiper agent's idea is a weighted average of the influences of the respective Media.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/098101 WO2020024170A1 (en) | 2018-08-01 | 2018-08-01 | Nash equilibrium strategy and social network consensus evolution model in continuous action space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109496305A true CN109496305A (en) | 2019-03-19 |
CN109496305B CN109496305B (en) | 2022-05-13 |
Family
ID=65713809
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201880001570.9A Active CN109496305B (en) | 2018-08-01 | 2018-08-01 | Social network public opinion evolution method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109496305B (en) |
WO (1) | WO2020024170A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362754A (en) * | 2019-06-11 | 2019-10-22 | 浙江大学 | The method that social network information source is detected on line based on intensified learning |
CN111445291A (en) * | 2020-04-01 | 2020-07-24 | 电子科技大学 | Method for providing dynamic decision for social network influence maximization problem |
CN112862175A (en) * | 2021-02-01 | 2021-05-28 | 天津天大求实电力新技术股份有限公司 | Local optimization control method and device based on P2P power transaction |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112801299B (en) * | 2021-01-26 | 2023-12-01 | 西安电子科技大学 | Method, system and application for constructing game model of evolution of reward and punishment mechanism |
CN113572548B (en) * | 2021-06-18 | 2023-07-07 | 南京理工大学 | Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning |
CN113645589B (en) * | 2021-07-09 | 2024-05-17 | 北京邮电大学 | Unmanned aerial vehicle cluster route calculation method based on inverse fact policy gradient |
CN113568954B (en) * | 2021-08-02 | 2024-03-19 | 湖北工业大学 | Parameter optimization method and system for preprocessing stage of network flow prediction data |
CN113778619B (en) * | 2021-08-12 | 2024-05-14 | 鹏城实验室 | Multi-agent state control method, device and terminal for multi-cluster game |
CN113687657B (en) * | 2021-08-26 | 2023-07-14 | 鲁东大学 | Method and storage medium for multi-agent formation dynamic path planning |
CN114021456A (en) * | 2021-11-05 | 2022-02-08 | 沈阳飞机设计研究所扬州协同创新研究院有限公司 | Intelligent agent invalid behavior switching inhibition method based on reinforcement learning |
CN114065916A (en) * | 2021-11-11 | 2022-02-18 | 西安工业大学 | DQN-based agent training method |
CN114845359A (en) * | 2022-03-14 | 2022-08-02 | 中国人民解放军军事科学院战争研究院 | Multi-intelligent heterogeneous network selection method based on Nash Q-Learning |
CN115515101A (en) * | 2022-09-23 | 2022-12-23 | 西北工业大学 | Decoupling Q learning intelligent codebook selection method for SCMA-V2X system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090055268A1 (en) * | 2007-08-20 | 2009-02-26 | Ads-Vantage, Ltd. | System and method for auctioning targeted advertisement placement for video audiences |
CN103490413A (en) * | 2013-09-27 | 2014-01-01 | 华南理工大学 | Intelligent electricity generation control method based on intelligent body equalization algorithm |
CN106358308A (en) * | 2015-07-14 | 2017-01-25 | 北京化工大学 | Resource allocation method for reinforcement learning in ultra-dense network |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
CN107135224A (en) * | 2017-05-12 | 2017-09-05 | 中国人民解放军信息工程大学 | Cyber-defence strategy choosing method and its device based on Markov evolutionary Games |
US20180033081A1 (en) * | 2016-07-27 | 2018-02-01 | Aristotle P.C. Karas | Auction management system and method |
CN107832882A (en) * | 2017-11-03 | 2018-03-23 | 上海交通大学 | A kind of taxi based on markov decision process seeks objective policy recommendation method |
CN107979540A (en) * | 2017-10-13 | 2018-05-01 | 北京邮电大学 | A kind of load-balancing method and system of SDN network multi-controller |
CN109511277A (en) * | 2018-08-01 | 2019-03-22 | 东莞理工学院 | The cooperative method and system of multimode Continuous action space |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106936855B (en) * | 2017-05-12 | 2020-01-10 | 中国人民解放军信息工程大学 | Network security defense decision-making determination method and device based on attack and defense differential game |
CN108092307A (en) * | 2017-12-15 | 2018-05-29 | 三峡大学 | Layered distribution type intelligent power generation control method based on virtual wolf pack strategy |
2018
- 2018-08-01 WO PCT/CN2018/098101 (WO2020024170A1): Application Filing, active
- 2018-08-01 CN CN201880001570.9A (CN109496305B): Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090055268A1 (en) * | 2007-08-20 | 2009-02-26 | Ads-Vantage, Ltd. | System and method for auctioning targeted advertisement placement for video audiences |
CN103490413A (en) * | 2013-09-27 | 2014-01-01 | 华南理工大学 | Intelligent electricity generation control method based on intelligent body equalization algorithm |
CN106358308A (en) * | 2015-07-14 | 2017-01-25 | 北京化工大学 | Resource allocation method for reinforcement learning in ultra-dense network |
US20180033081A1 (en) * | 2016-07-27 | 2018-02-01 | Aristotle P.C. Karas | Auction management system and method |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
CN107135224A (en) * | 2017-05-12 | 2017-09-05 | 中国人民解放军信息工程大学 | Cyber-defence strategy choosing method and its device based on Markov evolutionary Games |
CN107979540A (en) * | 2017-10-13 | 2018-05-01 | 北京邮电大学 | A kind of load-balancing method and system of SDN network multi-controller |
CN107832882A (en) * | 2017-11-03 | 2018-03-23 | 上海交通大学 | A kind of taxi based on markov decision process seeks objective policy recommendation method |
CN109511277A (en) * | 2018-08-01 | 2019-03-22 | 东莞理工学院 | The cooperative method and system of multimode Continuous action space |
Non-Patent Citations (1)
Title |
---|
SONG Yujian et al.: "Resource leveling optimization of network planning using a multi-agent cuckoo algorithm", Computer Engineering and Applications * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362754A (en) * | 2019-06-11 | 2019-10-22 | 浙江大学 | The method that social network information source is detected on line based on intensified learning |
CN110362754B (en) * | 2019-06-11 | 2022-04-29 | 浙江大学 | Online social network information source detection method based on reinforcement learning |
CN111445291A (en) * | 2020-04-01 | 2020-07-24 | 电子科技大学 | Method for providing dynamic decision for social network influence maximization problem |
CN111445291B (en) * | 2020-04-01 | 2022-05-13 | 电子科技大学 | Method for providing dynamic decision for social network influence maximization problem |
CN112862175A (en) * | 2021-02-01 | 2021-05-28 | 天津天大求实电力新技术股份有限公司 | Local optimization control method and device based on P2P power transaction |
Also Published As
Publication number | Publication date |
---|---|
WO2020024170A1 (en) | 2020-02-06 |
CN109496305B (en) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109496305A (en) | Nash equilibrium strategy on continuous action space and social network public opinion evolution model | |
Wang et al. | R-MADDPG for partially observable environments and limited communication | |
Russell et al. | Q-decomposition for reinforcement learning agents | |
Busoniu et al. | A comprehensive survey of multiagent reinforcement learning | |
Zhang et al. | Collective behavior coordination with predictive mechanisms | |
Abed-Alguni et al. | A comparison study of cooperative Q-learning algorithms for independent learners | |
WO2019127945A1 (en) | Structured neural network-based imaging task schedulability prediction method | |
Simões et al. | Multi-agent actor centralized-critic with communication | |
Xu et al. | Learning multi-agent coordination for enhancing target coverage in directional sensor networks | |
Mehta | State-of-the-art reinforcement learning algorithms | |
CN114510012A (en) | Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning | |
JP7448683B2 (en) | Learning options for action selection using meta-gradient in multi-task reinforcement learning | |
Wang et al. | Distributed reinforcement learning for robot teams: A review | |
Liu et al. | Efficient exploration for multi-agent reinforcement learning via transferable successor features | |
Yun et al. | Multi-agent deep reinforcement learning using attentive graph neural architectures for real-time strategy games | |
Juang et al. | A self-generating fuzzy system with ant and particle swarm cooperative optimization | |
Choudhury et al. | Scalable Online planning for multi-agent MDPs | |
Han et al. | Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c | |
Zhou et al. | Strategic interaction multi-agent deep reinforcement learning | |
Abed-Alguni | Cooperative reinforcement learning for independent learners | |
Dias et al. | Quantum-inspired neuro coevolution model applied to coordination problems | |
Lima et al. | Formal analysis in a cellular automata ant model using swarm intelligence in robotics foraging task | |
Subramanian et al. | Efficient exploration in monte carlo tree search using human action abstractions | |
Zhan et al. | Dueling network architecture for multi-agent deep deterministic policy gradient | |
Martín H et al. | Learning autonomous helicopter flight with evolutionary reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||