CN111415198B - Tourist behavior preference modeling method based on reverse reinforcement learning - Google Patents

Tourist behavior preference modeling method based on reverse reinforcement learning Download PDF

Info

Publication number
CN111415198B
CN111415198B CN202010195068.5A CN202010195068A CN111415198B CN 111415198 B CN111415198 B CN 111415198B CN 202010195068 A CN202010195068 A CN 202010195068A CN 111415198 B CN111415198 B CN 111415198B
Authority
CN
China
Prior art keywords
tourist
data
ibeacon
function
return
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010195068.5A
Other languages
Chinese (zh)
Other versions
CN111415198A (en
Inventor
常亮
宣闻
宾辰忠
陈源鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202010195068.5A priority Critical patent/CN111415198B/en
Publication of CN111415198A publication Critical patent/CN111415198A/en
Application granted granted Critical
Publication of CN111415198B publication Critical patent/CN111415198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0203Market surveys; Market polls
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/14Travel agencies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/80Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a tourist behavior preference modeling method based on reverse reinforcement learning. Exhibits are located with iBeacon, and a smartphone combines the number of received photographing broadcasts with the iBeacon position identification; the resulting tour behavior data are uploaded and stored. The five elements of a Markov decision process are obtained, a Markov decision process model is constructed, a return function is built by a function approximation method, and the normalized photographing count and residence time are obtained and added to the return function. The tour data are converted into expert example data, a policy is computed with a Boltzmann distribution, and after the log-likelihood estimation function is obtained it is differentiated and the weight vector is updated. When the set condition is met, preference learning ends, so that accurate tourist preferences can be learned from limited tour data.

Description

Tourist behavior preference modeling method based on reverse reinforcement learning
Technical Field
The invention relates to the technical field of location awareness and machine learning, in particular to a guest behavior preference modeling method based on reverse reinforcement learning.
Background
Using travel recommendation technology to provide personalized services for users and improve recommendation performance and tourist satisfaction is one of the research hot spots in the current smart tourism field. In travel recommendation, it is important to understand tourist behavior patterns and learn tourist preferences. Current travel recommendation techniques mainly use data such as tourists' ratings of attractions or exhibits, check-in records, and visit frequency as the basis for judging how much a tourist prefers an exhibit. However, inside specific venues such as museums and theme parks, explicit rating data from tourists on attractions or exhibits is generally unavailable, so fine-grained preference learning cannot be performed and recommendations for the interior of such venues cannot be produced. Moreover, many recommendation algorithms require a large amount of tourist history data for training before preferences can be learned and recommendations made, yet tour data inside an exhibition hall is sparse and incomplete, so accurate preferences cannot be learned from the limited tour data.
Disclosure of Invention
The invention aims to provide a tourist behavior preference modeling method based on reverse reinforcement learning, which can learn accurate tourist preferences according to limited tourist tour data.
In order to achieve the above object, the present invention provides a modeling method for guest behavior preference based on reverse reinforcement learning, including:
based on the combination of iBeacon and a smart phone, the tourist behavior data of tourists are acquired and stored;
carrying out Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding photographing times and residence time into the return function, and converting the tour data into expert example data;
and utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories.
Wherein, acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smartphone comprises:
and acquiring and grouping iBeacon equipment in the indoor exhibition hall, simultaneously combining the Minor and the Major in the iBeacon protocol data to position the exhibited article, simultaneously receiving the broadcast signal of the iBeacon equipment by an application program in the smart phone, reading the sensor data, monitoring photographing broadcasting, and uploading the acquired data to a system server through a wireless network.
Wherein, acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smartphone further comprises:
according to the times of receiving photographing broadcasting and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibits, and stores the collected tourist behavior data through the file.
Wherein, performing Markov decision process modeling according to the tour behavior data and constructing a return function comprises:
and acquiring S, A, P, r and gamma five elements in a Markov decision process, constructing a Markov decision process model, and combining a set strategy to obtain an interaction sequence of the tourist, wherein S represents a recorded state space of the current browse exhibit of the tourist, A represents an action space of the exhibit to be browsed next by the tourist in a corresponding state, P represents a state transition probability, r represents a return function, and gamma represents a discount factor.
Wherein, according to the tour behavior data, performing Markov decision process modeling and constructing a return function, and further comprising:
and acquiring a characteristic base function, the number and weight vectors of the characteristic base and the characteristic vector of each state, and constructing a return function by utilizing a function approximation method.
Wherein, acquiring and adding photographing times and residence time into the return function, and converting the tour data into expert example data, comprises:
and acquiring photographing times and residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the obtained normalized times and residence time with instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained sightseeing behavior data into expert example data in a sequence format.
Wherein, performing preference learning on tourist tour trajectories by using a maximum likelihood reverse reinforcement learning algorithm comprises:
and calculating strategies by using Boltzmann distribution based on the accumulated return expectation of actions made in any state obtained by the expert example data, thereby obtaining a log-likelihood estimation function based on the existing expert example data.
Wherein, performing preference learning on tourist tour trajectories by using a maximum likelihood reverse reinforcement learning algorithm further comprises:
and deriving the log likelihood estimation function, updating the weight vector according to the gradient obtained by adding 0.01 times to the current weight vector until the absolute value of the difference of the current weight vector subtracted by the next weight vector is smaller than or equal to 0.01, finishing learning, outputting a weight vector value, and re-acquiring the accumulated return expectation until the absolute value is smaller than or equal to 0.01 if the absolute value is larger than 0.01.
According to the tourist behavior preference modeling method based on reverse reinforcement learning, exhibits are located with iBeacon, and the smartphone combines the number of received photographing broadcasts with the iBeacon position identification and uploads the information to the system server, where the tour behavior data are stored. The five elements S, A, P, r and gamma of a Markov decision process are obtained, a Markov decision process model is constructed, a return function is defined and approximated by a function approximation method, and the normalized photographing count and residence time are added to the return function. The tour data are converted into expert example data in a 'state-action-behavior feature' sequence format; the accumulated return expectation of the action taken in any state is obtained from the expert example data, a policy is computed with a Boltzmann distribution, and the log-likelihood estimation function based on the existing example data is obtained, differentiated, and used to update the weight vector. When the set condition is met, preference learning ends, so that tourist preferences can be learned from limited tour data.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic step diagram of a modeling method of guest behavior preference based on reverse reinforcement learning.
FIG. 2 is a flow chart of the overall structure of learning guest fine-grained preferences provided by the present invention.
FIG. 3 is a flow chart of data acquisition and processing provided by the present invention.
FIG. 4 is a flow chart of a method for constructing a Markov decision process model provided by the present invention.
FIG. 5 is a schematic diagram of a Markov decision interaction process provided by the present invention.
Fig. 6 is an overall flowchart of reverse reinforcement learning provided by the present invention.
Fig. 7 is a flowchart of a maximum likelihood reverse reinforcement learning algorithm provided by the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1 and 2, the invention provides a modeling method for guest behavior preference based on reverse reinforcement learning, comprising:
s101, based on combination of iBeacon and a smart phone, tourist behavior data of tourists are acquired and stored.
Specifically, the scene is first arranged in an indoor exhibition hall, following the data acquisition and processing flow shown in fig. 3. A tour guide APP is installed on each tourist's smartphone, and an iBeacon (a micro-positioning device based on Bluetooth Low Energy) is placed at the exhibition hall entrance and at every exhibit in the hall to acquire tourist position information. The iBeacon protocol data contains two identifiers, Minor and Major. In our application scenario the iBeacon devices are grouped: Major identifies the group an iBeacon device belongs to, and Minor distinguishes different iBeacon devices within the same group. That is, Minor is set to the exhibit ID and Major to the partition the exhibit belongs to, so the combination of the Minor and Major identifiers serves as the identification of the browsed exhibit and locates the exhibit the tourist is currently viewing. The tour APP on the tourist's smartphone receives the signals broadcast by the iBeacon devices and, together with the phone camera and acceleration sensor, collects various tour behavior data (such as photographing and residence time): the application receives the iBeacon broadcast signal, reads the sensor data, monitors photographing broadcasts, and finally uploads the collected data to the system server through a wireless network. When a tourist takes a photo, the application immediately detects the photographing behavior and sends a broadcast to the system server. The system server counts the number of photos taken of the target exhibit and the browsing time according to the number of received photographing broadcasts and the iBeacon position identification, and stores the collected tour behavior data in files. The stored data includes the timestamp sequence of interactions between the tourist and the iBeacon devices, the user's three-axis (X, Y, Z) acceleration data, and the identification of the browsed exhibit. Collecting the data by combining iBeacon with the smartphone is convenient, and the data set consists of real behavior data generated while tourists visit attractions inside the venue; since it also contains the tourists' browsing behavior, it is richer and more realistic.
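By way of illustration only, the following Python sketch shows how one record stored by the system server might be parsed; the field layout (timestamp, Major, Minor, three acceleration axes, photo flag) is an assumption introduced here, since the patent does not fix a concrete file format.

```python
# Minimal sketch of parsing a stored tour record (field layout is assumed).
from dataclasses import dataclass

@dataclass
class TourEvent:
    timestamp: float   # time of the iBeacon interaction
    major: int         # partition (group) the exhibit belongs to
    minor: int         # exhibit ID within the partition
    ax: float          # three-axis acceleration from the phone sensor
    ay: float
    az: float
    photo: bool        # True if a photographing broadcast was received

def parse_line(line: str) -> TourEvent:
    """Parse one comma-separated record, e.g. '1584600000.5,2,7,0.01,-0.02,9.81,1'."""
    ts, major, minor, ax, ay, az, photo = line.strip().split(",")
    return TourEvent(float(ts), int(major), int(minor),
                     float(ax), float(ay), float(az), photo == "1")

def exhibit_id(ev: TourEvent) -> tuple[int, int]:
    """The (Major, Minor) pair identifies the browsed exhibit."""
    return (ev.major, ev.minor)
```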
S102, carrying out Markov decision process modeling according to the tour behavior data and constructing a return function.
Specifically, the five elements S, A, P, r and gamma of a Markov decision process are obtained, a Markov decision process model is constructed, and the tourist's interaction sequence is obtained by combining a set policy, as shown in the flow chart for constructing the Markov decision process model in FIG. 4. The five elements of the Markov decision process are defined as follows: the state s represents the record of the exhibits the tourist has currently browsed, and the state space is S; the action a represents the next exhibit to be browsed by the tourist in state s, and the action space is A; the state transition probability P(s_{t+1} | s_t, a_t) represents the probability of moving from state s_t to state s_{t+1} by taking action a_t, where s_t ∈ S, a_t ∈ A. For example, if a tourist whose browsing record is s_1 wants to browse either exhibit a_2 or exhibit a_3, the state transition probabilities may be defined as P(s_2 | s_1, a_2) = 0.5 and P(s_3 | s_1, a_3) = 0.5. The return function r(s_t, a_t) represents the return obtained when the tourist, with current browsing record s_t, browses exhibit a_t, where s_t ∈ S, a_t ∈ A; this return value is proportional to the tourist preference value, that is, the higher the tourist's preference for exhibit a_t, the higher the return value. For ease of calculation we define r(s_t, a_t) ≤ 1. Finally, γ ∈ [0, 1] is the discount factor used to compute the cumulative return.
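For concreteness, the sketch below encodes the five-element Markov decision process (S, A, P, r, γ) described above, assuming the 15 exhibits of the embodiment; the deterministic transition table and placeholder return values are illustrative assumptions, not values taken from the patent.

```python
# Minimal sketch of the five-element MDP (S, A, P, r, gamma); numbers are illustrative.
N_EXHIBITS = 15
states = list(range(N_EXHIBITS + 1))      # s_0 = "just entered", s_i = record after exhibit i
actions = list(range(1, N_EXHIBITS + 1))  # a_i = browse exhibit i next

# State transition probabilities: P[s][a] is a dict of next-state probabilities.
# Here browsing exhibit a deterministically updates the record to that exhibit.
P = {s: {a: {a: 1.0} for a in actions} for s in states}

gamma = 0.9  # discount factor in [0, 1]

def r(s: int, a: int) -> float:
    """Return function placeholder, bounded by 1 as in the text; the real one is learned."""
    return 0.0
```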
The interaction process between the tourist and the exhibits in the exhibition hall can be regarded as a Markov decision process; this interaction is illustrated in the schematic of the Markov decision interaction process in fig. 5.

The tourist starts from entering the exhibition hall, where the browsing record defaults to s_0. When exhibit a_1 is browsed, the corresponding photographing count and residence time are generated; these are taken as feature values and added to the return function to calculate the return value r_1, and the tourist browsing record is updated to s_1. The tourist then browses the next exhibit a_2, the return value r_2 is calculated in the same way, and the browsing record correspondingly becomes s_2. The interaction continues in this manner, so the interaction sequence of a browsing tourist is as in (1), where s_0, s_1, s_2, ..., s_{t-1}, s_t ∈ S:

s_0, a_1, r_1, s_1, a_2, r_2, ..., s_{t-1}, a_t, r_t, s_t    (1)

The Markov property used here means that the record s_{t+1} of exhibits the tourist browses at the next time depends only on the exhibit record s_t browsed by the tourist at the current time and the exhibit a_t being browsed; all other historical browsing records can be discarded, as shown in formula (2), where P(s_{t+1} | s_t, a_t) is the transition probability of the tourist browsing exhibits:

P(s_{t+1} | s_t, a_t, ..., s_1, a_1) = P(s_{t+1} | s_t, a_t)    (2)
How action a_t is selected in each state is determined by the policy π. The policy is defined as π: S → A, a mapping from the state space of tourist browsing records to the next exhibit to be browsed. As formula (3) shows, the policy π is the conditional probability distribution over the action set given a state s, i.e. the policy π specifies the probability of each action in every state s; according to the tourist's browsing record s, the policy π can thus determine the exhibit a recommended to the tourist next:

π(a | s) = P(A_t = a | S_t = s)    (3)

For example, a tourist may browse exhibits using a policy with π(a_2 | s_1) = 0.3 and π(a_3 | s_1) = 0.7, which means that, given browsing record s_1, the probability that the tourist next browses exhibit a_2 is 0.3 and the probability of browsing exhibit a_3 is 0.7; clearly the tourist is more likely to browse exhibit a_3.

Based on a given policy π and the Markov decision process model, the interaction sequence τ of a tourist touring the exhibits can be determined:

τ = s_0, a_1, r_1, s_1, a_2, r_2, s_2, ..., s_{t-1}, a_t, r_t, s_t    (4)
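As an illustration of formulas (3) and (4), the sketch below samples one interaction sequence τ under a tabular policy π; it reuses the example probabilities π(a_2 | s_1) = 0.3 and π(a_3 | s_1) = 0.7 from the text, while the rest of the policy table and the zero return function are assumptions.

```python
# Sketch: sample an interaction sequence tau (formula (4)) under a policy pi (formula (3)).
import random

def sample_trajectory(pi, r, start_state=0, horizon=3, seed=0):
    """pi: dict state -> {action: prob}; r: return function r(s, a)."""
    rng = random.Random(seed)
    s, tau = start_state, [start_state]
    for _ in range(horizon):
        acts, probs = zip(*pi[s].items())
        a = rng.choices(acts, weights=probs, k=1)[0]  # draw next exhibit from pi(.|s)
        tau += [a, r(s, a)]
        s = a                 # browsing record is updated to the exhibit just browsed
        tau.append(s)
    return tau

pi = {0: {1: 1.0}, 1: {2: 0.3, 3: 0.7}, 2: {3: 1.0}, 3: {1: 1.0}}
print(sample_trajectory(pi, r=lambda s, a: 0.0))
```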
Since tourist preferences are unknown, i.e. the return function r(s_t, a_t) is unknown, we obtain the feature basis functions, the number of feature bases, the weight vector and the feature vector of each state, and use a function approximation method to approximate the return function parametrically; the approximate form is shown in formula (5):

R_θ(s, a) = θ^T φ(s, a) = θ_1 φ_1(s, a) + θ_2 φ_2(s, a) + ... + θ_d φ_d(s, a)    (5)

In the above, φ = (φ_1, φ_2, ..., φ_d)^T with φ: S × A → R^d is a fixed, finite set of feature basis functions, where d is the number of feature bases and φ_i is the feature vector of each state. θ = (θ_1, θ_2, ..., θ_d) is the weight vector over the feature bases. With this linear representation, the return function value can be changed by adjusting the weights.
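A minimal sketch of the linear approximation in formula (5) follows; the particular feature map φ (a one-hot indicator of the browsed exhibit plus the two behavior features introduced later) is an assumption made here for illustration.

```python
# Sketch of the linear return function R_theta(s, a) = theta^T phi(s, a) (formula (5)).
import numpy as np

N_EXHIBITS = 15
D = N_EXHIBITS + 2  # d feature bases: one per exhibit + photo count + dwell time (assumed)

def phi(s: int, a: int, img_norm: float = 0.0, stay_norm: float = 0.0) -> np.ndarray:
    """Feature vector phi(s, a) in R^d."""
    f = np.zeros(D)
    f[a - 1] = 1.0                 # indicator of the exhibit browsed next
    f[N_EXHIBITS] = img_norm       # normalized photographing count
    f[N_EXHIBITS + 1] = stay_norm  # normalized residence time
    return f

def R(theta: np.ndarray, s: int, a: int, img_norm=0.0, stay_norm=0.0) -> float:
    """Linear return: adjusting the weights theta changes the return value."""
    return float(theta @ phi(s, a, img_norm, stay_norm))

theta = np.zeros(D)  # weight vector over the feature bases, to be learned
```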
S103, acquiring and adding photographing times and residence time in the return function, and converting the tour data into expert example data.
Specifically, the photographing count and residence time when any exhibit is browsed are obtained and normalized separately, then added to the instantaneous return data of the corresponding state to obtain the return function value for that state; at the same time, the collected browsing behavior data are converted into expert example data in a 'state-action-behavior feature' sequence format. Since the tourist's preference is unknown, the return the tourist can obtain for the next browsed exhibit in the current browsing state s is unknown; that is, the return R(s, a) available to the tourist for choosing action a in state s is unknown. It is therefore necessary to learn the return function backwards from the trajectory data of expert examples (existing related tourists browsing the exhibits). During learning, the two tourist behavior features, photographing count and residence time, are added to the return function for training, and finally a return function R_θ(s, a) is learned by the reverse reinforcement learning algorithm. The overall flow of reverse reinforcement learning, shown in fig. 6, comprises the following steps:

In our application scenario there are 15 exhibits in total. We count two tourist behavior features for an exhibit in the current state s: the photographing count img_s and the residence time stay_s (in seconds). We therefore define the return function as the sum of the instantaneous return generated when viewing the exhibit and the return generated by the photographing count and the residence time when the tourist browses the exhibit in that state. For ease of calculation, the returns generated by the photographing count and residence time are normalized with formula (6), where x* denotes the photographing count or residence time in the current state, and min and max denote the minimum and maximum of the photographing count or residence time over all states:

x_norm = (x* − min) / (max − min)    (6)

The return function in the current state can then be expressed by formula (7):

R(s, a) = r(s, a) + img_s^norm + stay_s^norm    (7)
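The short sketch below applies the min-max normalization of formula (6) and forms the augmented return of formula (7); the variable names and the example numbers are assumptions used only to show the arithmetic.

```python
# Sketch of formula (6) normalization and the formula (7) augmented return.
def min_max_normalize(x: float, xs: list[float]) -> float:
    """(x - min) / (max - min) over the values observed across all states."""
    lo, hi = min(xs), max(xs)
    return 0.0 if hi == lo else (x - lo) / (hi - lo)

def augmented_return(r_sa: float, img_s: float, stay_s: float,
                     all_img: list[float], all_stay: list[float]) -> float:
    """R(s, a) = r(s, a) + normalized photo count + normalized dwell time."""
    return (r_sa
            + min_max_normalize(img_s, all_img)
            + min_max_normalize(stay_s, all_stay))

# Example: 3 photos and 120 s at this exhibit, against values observed over all states.
print(augmented_return(0.5, 3, 120, all_img=[0, 1, 3, 8], all_stay=[20, 60, 120, 300]))
```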
The existing tourist browsing trajectories are then processed into 'state-action-behavior feature' sequences as expert example data. Assume there are N tourist trajectory data D = {ζ_1, ..., ζ_N} and each trajectory has length H; then one trajectory data sequence can be represented as:

ζ_1 = ((s_1, a_1, img_1, stay_1), (s_2, a_2, img_2, stay_2), ..., (s_H, a_H, img_H, stay_H))

where s_H ∈ S and a_H ∈ A. In the present invention, we define each trajectory length H as 15. For example, the browsing trajectory of a tourist u is:

ζ_u = ((s_1, a_2, img_1, stay_1), (s_2, a_4, img_2, stay_2), (s_3, a_3, img_3, stay_3), ..., (s_15, a_15, img_15, stay_15))

which means that tourist u browses exhibit a_2 in state s_1, where the photographing count at exhibit a_2 is img_1 and the residence time is stay_1; the tourist then browses exhibit a_4, where the photographing count at exhibit a_4 is img_2 and the residence time is stay_2, and so on.
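A sketch of building such a 'state-action-behavior feature' sequence ζ from logged visits is given below, with trajectory length H = 15 as in the text; the raw input format (a chronological list of per-exhibit visit records) and the use of the previously browsed exhibit as the state label are assumptions for illustration.

```python
# Sketch: convert logged visits into an expert example sequence zeta.
H = 15  # trajectory length defined in the text

def build_trajectory(visits: list[dict]) -> list[tuple]:
    """visits: chronological records like {'exhibit': 4, 'img': 2, 'stay': 75.0}."""
    zeta, state = [], 0  # the browsing record starts from the entrance
    for v in visits[:H]:
        zeta.append((state, v["exhibit"], v["img"], v["stay"]))
        state = v["exhibit"]  # the browsing record advances to the exhibit just seen
    return zeta

demo = build_trajectory([
    {"exhibit": 2, "img": 1, "stay": 40.0},
    {"exhibit": 4, "img": 3, "stay": 95.0},
    {"exhibit": 3, "img": 0, "stay": 15.0},
])
print(demo)  # [(0, 2, 1, 40.0), (2, 4, 3, 95.0), (4, 3, 0, 15.0)]
```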
S104, utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories.
Specifically, the accumulated return expectation of the action taken in any state is obtained from the expert example data, and the policy is computed with a Boltzmann distribution, yielding a log-likelihood estimation function based on the existing expert example data. Maximum likelihood reverse reinforcement learning integrates the characteristics of other reverse reinforcement learning models and can estimate the return function even when expert trajectories are few: the maximum likelihood model is found from the expert trajectories, the initial return function is continuously adjusted, and the policy π is continuously optimized by gradients. The whole algorithm flow is shown in the maximum likelihood reverse reinforcement learning algorithm flow chart of fig. 7. The specific steps are as follows:
First, from the expert example data we obtain the accumulated return expectation Q of taking action a given the tourist's state s, which can be expressed by formula (8):

Q(s, a) = E[ Σ_{t≥0} γ^t R_θ(s_t, a_t) | s_0 = s, a_0 = a ]    (8)

In the MDP the action is defined as the next browsed exhibit, so the action space is not large; we therefore use the Boltzmann distribution as the policy π, which can be expressed by formula (9):

π_θ(a | s) = e^{βQ(s,a)} / Σ_{a'} e^{βQ(s,a')}    (9)

Under this policy, the log-likelihood estimation function based on the existing trajectory demonstration data of tourists browsing exhibits can be expressed by formula (10):

L(D | θ) = Σ_{ζ ∈ D} Σ_{(s,a) ∈ ζ} log π_θ(a | s)    (10)
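The sketch below implements the Boltzmann policy of formula (9) and the log-likelihood of formula (10) given a table of accumulated return expectations Q(s, a); in the actual method Q would be computed from the current return function R_θ, and the toy Q values and demonstration below are purely illustrative assumptions.

```python
# Sketch of the Boltzmann policy (formula (9)) and log-likelihood (formula (10)).
import numpy as np

def boltzmann_policy(Q_s: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """pi_theta(a|s) = exp(beta*Q(s,a)) / sum_a' exp(beta*Q(s,a'))."""
    z = beta * Q_s - np.max(beta * Q_s)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_likelihood(Q: dict, demos: list[list[tuple]], beta: float = 1.0) -> float:
    """L(D|theta) = sum over demonstrated (s, a) pairs of log pi_theta(a|s)."""
    ll = 0.0
    for zeta in demos:
        for (s, a, _img, _stay) in zeta:
            pi_s = boltzmann_policy(Q[s], beta)
            ll += np.log(pi_s[a - 1])     # actions a are 1-indexed exhibit IDs
    return ll

Q = {0: np.array([0.2, 0.5, 0.1]), 2: np.array([0.3, 0.3, 0.9])}
demos = [[(0, 2, 1, 40.0), (2, 3, 0, 15.0)]]
print(log_likelihood(Q, demos))
```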
The log-likelihood estimation function is then differentiated to obtain the gradient ∇_θ L(D | θ), and the weight vector is updated by adding 0.01 times the gradient to the current weight vector:

θ_{t+1} = θ_t + 0.01 ∇_θ L(D | θ_t)

This continues until the absolute value of the difference between the next weight vector and the current weight vector is less than or equal to 0.01, i.e. ||θ_{t+1} − θ_t|| ≤ 0.01; learning then finishes and the weight vector value θ = argmax_θ L(D | θ) is output. If the absolute value is greater than 0.01, i.e. ||θ_{t+1} − θ_t|| > 0.01, the accumulated return expectation is re-acquired until the absolute value is less than or equal to 0.01. On the basis of collecting real tourist behavior data, tourist behavior is combined with reverse reinforcement learning, and the designed reverse reinforcement learning algorithm performs fine-grained preference learning on the collected real behavior data.
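A minimal sketch of this outer learning loop follows: differentiate the log-likelihood, step the weight vector with step size 0.01, and stop when the change in the weight vector is at most 0.01. The numerical gradient and the toy objective are stand-ins (assumptions) for the analytic gradient and the Q re-evaluation used in the actual algorithm.

```python
# Sketch of the maximum likelihood reverse reinforcement learning update loop.
import numpy as np

def numerical_gradient(f, theta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Central-difference gradient of a scalar function f(theta)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def mlirl(log_likelihood_fn, theta0: np.ndarray,
          lr: float = 0.01, tol: float = 0.01, max_iter: int = 1000) -> np.ndarray:
    theta = theta0.copy()
    for _ in range(max_iter):
        # Re-acquire Q (inside log_likelihood_fn) and take one gradient step:
        # theta_{t+1} = theta_t + 0.01 * grad L(D | theta_t).
        grad = numerical_gradient(log_likelihood_fn, theta)
        theta_next = theta + lr * grad
        if np.linalg.norm(theta_next - theta) <= tol:  # ||theta_{t+1} - theta_t|| <= 0.01
            return theta_next                          # learning finished
        theta = theta_next
    return theta

# Toy usage: maximize a concave surrogate of L(D|theta) with optimum at theta = 1.
theta_hat = mlirl(lambda th: -np.sum((th - 1.0) ** 2), np.zeros(3))
print(theta_hat)
```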
The complete flow is as shown in the overall flow chart of learning tourist fine-grained preferences in fig. 2: based on the combination of iBeacon and the smartphone, the tour behavior data are collected and stored in text files; the five elements of the Markov decision process are obtained and defined and the Markov decision process model is constructed; the return function is constructed and the two normalized features, photographing count and residence time, are added to it; the tourist browsing trajectory data are taken as expert example data; and finally the maximum likelihood reverse reinforcement learning algorithm is used to learn tourist preferences, so that accurate tourist preferences can be learned from limited tour data.
According to the tourist behavior preference modeling method based on reverse reinforcement learning, exhibits are located with iBeacon, and the smartphone combines the number of received photographing broadcasts with the iBeacon position identification and uploads the information to the system server, where the tour behavior data are stored. The five elements S, A, P, r and gamma of the Markov decision process are obtained, the Markov decision process model is constructed, the return function is constructed by a function approximation method, and the normalized photographing count and residence time are added to the return function. The tour data are converted into expert example data in a 'state-action-behavior feature' sequence format; the accumulated return expectation of the action taken in any state is obtained from the expert example data, the policy is computed with a Boltzmann distribution, the log-likelihood estimation function based on the existing expert example data is obtained and differentiated, and the weight vector is updated. When the set condition is met, preference learning ends, so that accurate tourist preferences can be learned from limited tour data.
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.

Claims (4)

1. The guest behavior preference modeling method based on reverse reinforcement learning is characterized by comprising the following steps of:
based on the combination of iBeacon and a smart phone, the tourist behavior data of tourists are acquired and stored;
carrying out Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding photographing times and residence time into the return function, and converting the tour behavior data into expert example data;
utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories;
wherein acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smartphone comprises:
the method comprises the steps of acquiring and grouping iBeacon equipment in an indoor exhibition hall, simultaneously positioning exhibits by combining a Minor and a Major in iBeacon protocol data, simultaneously receiving broadcast signals of the iBeacon equipment by an application program in a smart phone, reading sensor data, monitoring photographing broadcasting, and uploading acquired data to a system server through a wireless network;
wherein acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smartphone further comprises:
according to the times of receiving photographing broadcasting and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibits and stores the collected tourist behavior data through the file;
the learning of preferences of tourist tour trajectories by using a maximum likelihood reverse reinforcement learning algorithm comprises the following steps:
the accumulated return expectation of actions made in any state based on the expert example data is obtained, and a Boltzmann distribution is adopted to calculate a strategy, so that a log-likelihood estimation function based on the existing expert example data is obtained;
and differentiating the log-likelihood estimation function to obtain the gradient, updating the weight vector by adding 0.01 times the gradient to the current weight vector, finishing learning and outputting the weight vector value when the absolute value of the difference between the next weight vector and the current weight vector is less than or equal to 0.01, and, if the absolute value is greater than 0.01, re-acquiring the accumulated return expectation until the absolute value is less than or equal to 0.01.
2. A method of modeling guest behavior preferences based on reverse reinforcement learning as defined in claim 1, wherein modeling a Markov decision process from the tour behavior data and constructing a return function comprises:
and acquiring S, A, P, r and gamma five elements in a Markov decision process, constructing a Markov decision process model, and combining a set strategy to obtain an interaction sequence of the tourist, wherein S represents a recorded state space of the current browse exhibit of the tourist, A represents an action space of the exhibit to be browsed next by the tourist in a corresponding state, P represents a state transition probability, r represents a return function, and gamma represents a discount factor.
3. A guest behavior preference modeling method based on reverse reinforcement learning as defined in claim 2, wherein the markov decision process modeling and the construction of a return function are performed according to the tour behavior data, further comprising:
and acquiring a characteristic base function, the number and weight vectors of the characteristic base and the characteristic vector of each state, and constructing a return function by utilizing a function approximation method.
4. A guest behavior preference modeling method based on reverse reinforcement learning as claimed in claim 3, wherein acquiring and adding the number of shots and the stay time in the return function, and converting the tour behavior data into expert example data, comprises:
and acquiring photographing times and residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the obtained normalized times and residence time with instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained sightseeing behavior data into expert example data in a sequence format.
CN202010195068.5A 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning Active CN111415198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010195068.5A CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010195068.5A CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Publications (2)

Publication Number Publication Date
CN111415198A CN111415198A (en) 2020-07-14
CN111415198B true CN111415198B (en) 2023-04-28

Family

ID=71494548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010195068.5A Active CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Country Status (1)

Country Link
CN (1) CN111415198B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158086B (en) * 2021-04-06 2023-05-05 浙江贝迩熊科技有限公司 Personalized customer recommendation system and method based on deep reinforcement learning
CN114355786A (en) * 2022-01-17 2022-04-15 北京三月雨文化传播有限责任公司 Big data-based regulation cloud system of multimedia digital exhibition hall
CN117033800A (en) * 2023-10-08 2023-11-10 法琛堂(昆明)医疗科技有限公司 Intelligent interaction method and system for visual cloud exhibition system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010048146A1 (en) * 2008-10-20 2010-04-29 Carnegie Mellon University System, method and device for predicting navigational decision-making behavior
CN107358471A (en) * 2017-07-17 2017-11-17 桂林电子科技大学 A kind of tourist resources based on visit behavior recommends method and system
CN108819948A (en) * 2018-06-25 2018-11-16 大连大学 Driving behavior modeling method based on reverse intensified learning
CN108875005A (en) * 2018-06-15 2018-11-23 桂林电子科技大学 A kind of tourist's preferential learning system and method based on visit behavior
WO2019145952A1 (en) * 2018-01-25 2019-08-01 Splitty Travel Ltd. Systems, methods and computer program products for optimization of travel technology target functions, including when communicating with travel technology suppliers under technological constraints
CN110288436A (en) * 2019-06-19 2019-09-27 桂林电子科技大学 A kind of personalized recommending scenery spot method based on the modeling of tourist's preference

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10872322B2 (en) * 2008-03-21 2020-12-22 Dressbot, Inc. System and method for collaborative shopping, business and entertainment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010048146A1 (en) * 2008-10-20 2010-04-29 Carnegie Mellon University System, method and device for predicting navigational decision-making behavior
CN107358471A (en) * 2017-07-17 2017-11-17 桂林电子科技大学 A kind of tourist resources based on visit behavior recommends method and system
WO2019145952A1 (en) * 2018-01-25 2019-08-01 Splitty Travel Ltd. Systems, methods and computer program products for optimization of travel technology target functions, including when communicating with travel technology suppliers under technological constraints
CN108875005A (en) * 2018-06-15 2018-11-23 桂林电子科技大学 A kind of tourist's preferential learning system and method based on visit behavior
CN108819948A (en) * 2018-06-25 2018-11-16 大连大学 Driving behavior modeling method based on reverse intensified learning
CN110288436A (en) * 2019-06-19 2019-09-27 桂林电子科技大学 A kind of personalized recommending scenery spot method based on the modeling of tourist's preference

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘建伟; 高峰; 罗雄麟. A survey of deep reinforcement learning based on value function and policy gradient. Chinese Journal of Computers. 2018, (06), full text. *
孙磊 et al. A tourist preference learning method based on tour behavior. Computer Engineering and Design. 2019, full text. *
宣闻. Research on fine-grained tourist behavior preference based on inverse reinforcement learning. China Master's Theses Full-text Database, Information Science and Technology. 2022, (No. 06), full text. *
范长杰. Research on planning problems based on Markov decision theory. China Doctoral Dissertations Full-text Database, Information Science and Technology / Basic Sciences. 2009, (No. 07), full text. *
陈希亮; 曹雷; 何明; 李晨溪; 徐志雄. A survey of deep inverse reinforcement learning. Computer Engineering and Applications. 2018, (05), full text. *

Also Published As

Publication number Publication date
CN111415198A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
CN111415198B (en) Tourist behavior preference modeling method based on reverse reinforcement learning
CN110609903B (en) Information presentation method and device
US9235263B2 (en) Information processing device, determination method, and non-transitory computer readable storage medium
JP4497236B2 (en) Detection information registration device, electronic device, detection information registration device control method, electronic device control method, detection information registration device control program, electronic device control program
CN107680010B (en) Scenic spot route recommendation method and system based on touring behavior
JP4902270B2 (en) How to assemble a collection of digital images
CN104737523B (en) The situational model in mobile device is managed by assigning for the situation label of data clustering
US8650242B2 (en) Data processing apparatus and data processing method
CN107018333A (en) Shoot template and recommend method, device and capture apparatus
CN103944804B (en) Contact recommending method and device
CN103455472B (en) Information processing apparatus and information processing method
CN101855633A (en) Video analysis apparatus and method for calculating inter-person evaluation value using video analysis
CN103914559A (en) Network user screening method and network user screening device
JPWO2014129042A1 (en) Information processing apparatus, information processing method, and program
CN107666540B (en) Terminal control method, device and storage medium
CN115654675A (en) Air conditioner operation parameter recommendation method and related equipment
CN113495487A (en) Terminal and method for adjusting operation parameters of target equipment
JP2016129309A (en) Object linking method, device and program
JP2022145054A (en) Recommendation information providing method and recommendation information providing system
JP2014225061A (en) Information provision device, information provision system, information provision method, and program
CN116503209A (en) Digital twin system based on artificial intelligence and data driving
JP2015153157A (en) virtual information management system
CN108616919A (en) A kind of public domain stream of people monitoring method and device
KR100880001B1 (en) Mobile device for managing personal life and method for searching information using the mobile device
CN113158086B (en) Personalized customer recommendation system and method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant