CN111415198B - Tourist behavior preference modeling method based on reverse reinforcement learning - Google Patents
- Publication number: CN111415198B (application CN202010195068.5A)
- Authority
- CN
- China
- Prior art keywords
- tourist
- data
- ibeacon
- function
- return
- Legal status: Active
Classifications
- G06Q30/0201 — Market modelling; Market analysis; Collecting market data
- G06Q30/0203 — Market surveys; Market polls
- G06Q30/0631 — Item recommendations
- G06Q50/14 — Travel agencies
- H04W4/80 — Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention discloses a tourist behavior preference modeling method based on reverse reinforcement learning. Exhibits are located with iBeacon, and a smart phone combines the number of photographing broadcasts received with the iBeacon position identification to upload and store tourist tour behavior data. The five elements of a Markov decision process are obtained and a Markov decision process model is constructed; a return function is built with a function approximation method, and the normalized photographing times and residence time are obtained and added to it. The tour data are converted into expert example data, a Boltzmann distribution is adopted to calculate the strategy, and after a log-likelihood estimation function is obtained, it is differentiated and the weight vector is updated. When the set condition is met, preference learning ends, so that accurate tourist preferences can be learned from limited tour data.
Description
Technical Field
The invention relates to the technical field of location awareness and machine learning, in particular to a tourist behavior preference modeling method based on reverse reinforcement learning.
Background
Using travel recommendation technology to provide personalized service for users and to improve recommendation performance and tourist satisfaction is one of the research hot spots in the current smart tourism field. In travel recommendation, it is important to understand tourist behavior patterns and learn tourist preferences. Current travel recommendation technology mainly uses data such as tourists' scores on attractions or exhibits, check-in data and visit frequency as the basis for judging the degree of tourist preference. However, inside specific scenic areas such as museums and theme parks, explicit tourist scoring data for attractions or exhibits is generally unavailable, so fine-grained preference learning cannot be performed and recommendations for the interior of such areas cannot be made. Moreover, many recommendation algorithms need a large amount of historical tourist data for training before preferences can be learned and recommendations made; yet the tour data inside an exhibition hall are scarce and incomplete, so accurate preferences cannot be learned from limited tour data.
Disclosure of Invention
The invention aims to provide a tourist behavior preference modeling method based on reverse reinforcement learning, which can learn accurate tourist preferences according to limited tourist tour data.
In order to achieve the above object, the present invention provides a modeling method for guest behavior preference based on reverse reinforcement learning, including:
based on the combination of iBeacon and a smart phone, the tourist behavior data of tourists are acquired and stored;
carrying out Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding photographing times and residence time into the return function, and converting the tour data into expert example data;
and utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories.
Wherein acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smart phone comprises:
and acquiring and grouping iBeacon equipment in the indoor exhibition hall, simultaneously combining the Minor and the Major in the iBeacon protocol data to position the exhibited article, simultaneously receiving the broadcast signal of the iBeacon equipment by an application program in the smart phone, reading the sensor data, monitoring photographing broadcasting, and uploading the acquired data to a system server through a wireless network.
Wherein acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smart phone further comprises:
according to the times of receiving photographing broadcasting and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibits, and stores the collected tourist behavior data through the file.
The method for modeling the Markov decision process according to the tour behavior data and constructing a return function comprises the following steps:
Acquiring the five elements S, A, P, r and γ of a Markov decision process, constructing a Markov decision process model, and obtaining the interaction sequence of the tourist by combining a set strategy, wherein S represents the state space of records of the exhibits currently browsed by the tourist, A represents the action space of exhibits to be browsed next in the corresponding state, P represents the state transition probability, r represents the return function, and γ represents the discount factor.
Wherein, according to the tour behavior data, performing Markov decision process modeling and constructing a return function, and further comprising:
and acquiring a characteristic base function, the number and weight vectors of the characteristic base and the characteristic vector of each state, and constructing a return function by utilizing a function approximation method.
The method for obtaining and adding photographing times and residence time in the return function and converting the tour data into expert example data comprises the following steps:
and acquiring photographing times and residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the obtained normalized times and residence time with instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained sightseeing behavior data into expert example data in a sequence format.
Wherein, the learning of preference is carried out on tourist tour trajectories by using a maximum likelihood reverse reinforcement learning algorithm, which comprises the following steps:
and calculating strategies by using Boltzmann distribution based on the accumulated return expectation of actions made in any state obtained by the expert example data, thereby obtaining a log-likelihood estimation function based on the existing expert example data.
Wherein, the learning of preference is carried out to tourist tour trajectories by utilizing a maximum likelihood reverse reinforcement learning algorithm, and the method further comprises the following steps:
Differentiating the log-likelihood estimation function to obtain a gradient, and updating the weight vector to the current weight vector plus 0.01 times the gradient; when the absolute value of the difference between the next weight vector and the current weight vector is smaller than or equal to 0.01, learning is finished and the weight vector value is output; if the absolute value is greater than 0.01, the cumulative return expectation is re-acquired until the absolute value is smaller than or equal to 0.01.
According to the tourist behavior preference modeling method based on reverse reinforcement learning provided by the invention, exhibits are located with iBeacon, and the smart phone combines the number of photographing broadcasts received with the iBeacon position identification and uploads this information to the system server, where the tour behavior data are stored. The five elements S, A, P, r and γ of the Markov decision process are obtained and a Markov decision process model is constructed; a return function is defined and approximated with a function approximation method, and the normalized photographing times and residence time are added to it. The tour data are converted into expert example data in "state-action-behavior feature" sequence format; meanwhile, the cumulative return expectation of actions taken in any state is obtained from the expert example data, a Boltzmann distribution is adopted to calculate the strategy, and the log-likelihood estimation function based on the existing example data is obtained, differentiated, and the weight vector updated. When the set conditions are met, preference learning ends, and accurate tourist preferences can be learned from limited tour data.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic step diagram of a modeling method of guest behavior preference based on reverse reinforcement learning.
FIG. 2 is a flow chart of the overall structure of learning guest fine-grained preferences provided by the present invention.
FIG. 3 is a flow chart of data acquisition and processing provided by the present invention.
FIG. 4 is a flow chart of a method for constructing a Markov decision process model provided by the present invention.
FIG. 5 is a schematic diagram of a Markov decision interaction process provided by the present invention.
Fig. 6 is an overall flowchart of reverse reinforcement learning provided by the present invention.
Fig. 7 is a flowchart of a maximum likelihood reverse reinforcement learning algorithm provided by the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1 and 2, the invention provides a modeling method for guest behavior preference based on reverse reinforcement learning, comprising:
s101, based on combination of iBeacon and a smart phone, tourist behavior data of tourists are acquired and stored.
Specifically, the scene is first arranged in an indoor exhibition hall, following the data acquisition and processing flow chart shown in fig. 3. A tour-guide APP is installed on each tourist's smart phone, and an iBeacon (a precise micro-positioning technology based on Bluetooth Low Energy) is arranged at the exhibition hall entrance and at every exhibit in the hall to acquire tourist position information. The iBeacon protocol data contain two identifiers, Minor and Major. In our application scenario the iBeacon devices are grouped: Major identifies which group a device belongs to, and Minor identifies different devices within the same group. That is, Minor is set to the ID of an exhibit and Major to the partition the exhibit belongs to, so the combination of the two identifiers serves as the identification of the browsed exhibit and locates the exhibit the tourist is currently viewing. The tour APP on the tourist's smart phone receives the signals sent by the iBeacon devices and, through the phone camera and acceleration sensor, collects various tour behavior data (such as photographing and residence time): the application receives the iBeacon broadcast signal, then reads the sensor data and monitors the photographing broadcast, and finally uploads the collected data to a system server through a wireless network.
When a tourist takes a picture, the application in the smart phone immediately detects the photographing behavior and sends a broadcast to the system server. The system server counts the photographing times, browsing time and the like of the tourist for the target exhibits according to the number of photographing broadcasts received and the iBeacon position identification, and stores the collected tour behavior data in files. The stored data comprise the timestamp sequence of the interaction between the tourist and the iBeacon, the user's three-axis (X, Y, Z) acceleration data, and the identification of the browsed exhibit. Collecting data by combining iBeacon with a smart phone is convenient, and the resulting data set consists of real behavior data generated while tourists visit attractions in the scenic area, including their browsing behavior, so the data are richer and more realistic.
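The aggregation step described above can be sketched in Python. The record layout `(timestamp, exhibit_id, event)` and the event names are assumptions for illustration — the patent only states that timestamps, photographing broadcasts and exhibit identifications are logged:

```python
from collections import defaultdict

# Hypothetical record format: (timestamp_sec, exhibit_id, event), where
# event is "enter", "photo", or "leave". The server-side counting of
# photo broadcasts and dwell time per exhibit would then reduce to:
def aggregate_tour_log(records):
    photos = defaultdict(int)      # photographing times per exhibit
    dwell = defaultdict(float)     # residence time (seconds) per exhibit
    entered = {}                   # exhibit_id -> timestamp of entry
    for ts, exhibit, event in records:
        if event == "enter":
            entered[exhibit] = ts
        elif event == "photo":
            photos[exhibit] += 1
        elif event == "leave" and exhibit in entered:
            dwell[exhibit] += ts - entered.pop(exhibit)
    return dict(photos), dict(dwell)

log = [(0, "a1", "enter"), (30, "a1", "photo"), (45, "a1", "photo"),
       (60, "a1", "leave"), (65, "a2", "enter"), (125, "a2", "leave")]
photos, dwell = aggregate_tour_log(log)
# photos == {"a1": 2}, dwell == {"a1": 60.0, "a2": 60.0}
```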
S102, carrying out Markov decision process modeling according to the tour behavior data and constructing a return function.
Specifically, the five elements S, A, P, r and γ of a Markov decision process are obtained, a Markov decision process model is constructed, and the interaction sequence of the tourist is obtained by combining a set strategy, as shown in the flow chart for constructing the Markov decision process model in FIG. 4. The five elements of the Markov decision process are defined as follows: the state s represents the record of the exhibits currently browsed by the tourist, with state space S; the action a represents the exhibit the tourist will browse next in state s, with action space A; the state transition probability P(s_{t+1}|s_t, a_t) represents the probability of transferring from state s_t to state s_{t+1} through action a_t, where s_t ∈ S and a_t ∈ A. For example, if a tourist with browsing record s_1 then wants to browse exhibit a_2 or exhibit a_3, the state transition probabilities may be defined as P(s_2|s_1, a_2) = 0.5 and P(s_3|s_1, a_3) = 0.5. The return function r(s_t, a_t) represents the return obtained by browsing exhibit a_t when the tourist's current browsing record is s_t, where s_t ∈ S and a_t ∈ A. This return value is proportional to the tourist preference value, that is, the higher the tourist's preference for exhibit a_t, the higher the return value; for ease of calculation we define r(s_t, a_t) ≤ 1. Finally, γ ∈ [0, 1] is the discount factor, used to calculate the cumulative return.
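The five-element tuple above can be written down as a minimal data sketch. The state and action names and the transition entries mirror the illustrative example in the text; the return function is left as a placeholder because in reverse reinforcement learning it is exactly the unknown to be learned:

```python
# Sketch of the MDP (S, A, P, r, gamma) described above.
S = ["s0", "s1", "s2", "s3"]          # browsing-record states
A = ["a1", "a2", "a3"]                # next-exhibit actions
P = {                                  # P[(s, a)] -> {s_next: probability}
    ("s1", "a2"): {"s2": 0.5},
    ("s1", "a3"): {"s3": 0.5},
}
gamma = 0.9                            # discount factor in [0, 1]

def r(s, a):
    """Return function r(s, a): unknown in IRL, placeholder here."""
    return 0.0
```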
The interaction process of the tourist and the exhibits in the exhibition hall can be regarded as a markov decision process, and then the interaction process of the tourist and the exhibits in the exhibition hall is described, and the schematic of the markov decision interaction process is shown in fig. 5:
The tourist starts from entering the exhibition hall, with the browsing record defaulting to s_0. When exhibit a_1 is browsed, corresponding photographing times and residence time are generated; these are taken as feature values and added to the return function to calculate the return value r_1, and the tourist browsing record is updated to s_1. The tourist then browses the next exhibit a_2, the return value r_2 is calculated in the same way, and the browsing record correspondingly changes to s_2; the interaction continues in this fashion, so the interaction sequence generated while the tourist browses is given by (1), where s_0, s_1, s_2, ..., s_{t-1}, s_t ∈ S:

s_0, a_1, r_1, s_1, a_2, r_2, ..., s_{t-1}, a_t, r_t, s_t    (1)
"Markov", as used herein, means that the exhibit record s_{t+1} browsed by the tourist at the next moment depends only on the exhibit record s_t browsed at the current moment and the exhibit a_t being browsed; all other historical browsing records can be discarded, as shown in formula (2), where P(s_{t+1}|s_t, a_t) is the transition probability of the tourist browsing exhibits:

P(s_{t+1}|s_t, a_t, ..., s_1, a_1) = P(s_{t+1}|s_t, a_t)    (2)
How action a_t is selected in each state is determined by the policy π. The policy is defined as π: S → A, the mapping from the state space of tourist browsing records to the next exhibit the tourist will browse. As formula (3) shows, the policy π is the conditional probability distribution over the action set given a state s, i.e. π specifies the probability of each action in every state s; according to the tourist's browsing record s, the policy π can determine the exhibit a recommended to the tourist next:

π(a|s) = P(A_t = a | S_t = s)    (3)

For example, a tourist may browse exhibits under a policy with π(a_2|s_1) = 0.3 and π(a_3|s_1) = 0.7, which means that given browsing record s_1, the probability of next browsing exhibit a_2 is 0.3 and the probability of browsing exhibit a_3 is 0.7; evidently the tourist is more likely to browse exhibit a_3.
Based on the given policy π and the Markov decision process model, the interaction sequence τ of a tourist touring exhibits can be determined:

τ = s_0, a_1, r_1, s_1, a_2, r_2, s_2, ..., s_{t-1}, a_t, r_t, s_t    (4)
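Generating such an interaction sequence under a fixed policy can be sketched as follows. The per-state action distributions, the zero reward, and the simplifying assumption that the browsing record becomes the state named after the chosen exhibit are all illustrative, not part of the patent:

```python
import random

# Sketch: rolling out an interaction sequence tau = s0, a1, r1, s1, ...
# as in equation (4), under an assumed policy pi given as
# {state: {action: probability}}.
def rollout(pi, reward, s0="s0", horizon=3, rng=random.Random(0)):
    tau, s = [s0], s0
    for _ in range(horizon):
        actions, probs = zip(*pi[s].items())
        a = rng.choices(actions, weights=probs)[0]
        s_next = a.replace("a", "s")   # browsing record tracks the chosen exhibit
        tau += [a, reward(s, a), s_next]
        s = s_next
    return tau

pi = {"s0": {"a1": 1.0}, "s1": {"a2": 0.3, "a3": 0.7},
      "s2": {"a3": 1.0}, "s3": {"a2": 1.0}}
tau = rollout(pi, reward=lambda s, a: 0.0)
```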
Since tourist preferences are unknown, i.e. the return function r(s_t, a_t) is unknown, we obtain the feature basis functions, the number of feature bases, the weight vector and the feature vector of each state, and use a function approximation method to approximate the return function parametrically. The linear approximation form is shown in formula (5):

r_θ(s, a) = θ^T φ(s, a) = Σ_{i=1}^{d} θ_i φ_i(s, a)    (5)

where φ = (φ_1, φ_2, ..., φ_d)^T with φ: S × A → R^d is a finite and fixed set of feature basis functions, d is the number of feature bases, and φ(s, a) is the feature vector of each state. θ = (θ_1, θ_2, ..., θ_d) is the weight vector over the feature bases. With this linear representation, we can adjust the weights to change the return function value.
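The linear form of equation (5) is easy to sketch. The one-hot feature map below is a hypothetical stand-in — the patent does not specify the feature bases — but it shows how adjusting θ changes the return value:

```python
# Sketch of the linear return-function approximation r_theta(s, a) =
# theta . phi(s, a) from equation (5), with an assumed one-hot feature map.
def phi(s, a, d=3):
    """Hypothetical one-hot feature vector over d feature bases."""
    vec = [0.0] * d
    vec[hash((s, a)) % d] = 1.0
    return vec

def r_theta(theta, s, a):
    # Dot product of the weight vector with the feature vector.
    return sum(t * f for t, f in zip(theta, phi(s, a)))

theta = [0.2, 0.5, 0.3]
```

Because the feature vector is one-hot, `r_theta` always returns one of the weights, so tuning θ directly tunes the per-feature return.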
S103, acquiring and adding photographing times and residence time in the return function, and converting the tour data into expert example data.
Specifically, the photographing times and residence time when any exhibit is browsed are obtained and normalized respectively, then added to the instantaneous return in the corresponding state to obtain the return function value in that state; at the same time, the collected browsing behavior data are converted into expert example data, whose sequence format is "state-action-behavior feature". Since the tourist's preference is unknown, the return available for the next browsed exhibit in the current browsing state s is also unknown — that is, the return R(s, a) obtained by selecting action a in state s is unknown. It is therefore necessary to learn the return function from the exhibit-browsing trajectory data of expert examples (existing relevant tourists). In the learning process, the two tourist behavior features of photographing times and residence time are added to the return function for training; finally, a return function R_θ(s, a) is learned by the reverse reinforcement learning algorithm, whose overall flow is shown in fig. 6 and comprises the following steps:
In our application scenario there are 15 exhibits in total. We count two tourist behavior features for an exhibit in the current state s: the photographing times img_s and the residence time stay_s (in seconds). We therefore define the return function as the sum of the instantaneous return generated when viewing the exhibit and the return generated by the photographing times and residence time when the tourist browses the exhibit in that state. For ease of calculation, the returns generated by the photographing times and the residence time are normalized by equation (6), where x* is the photographing times or residence time in the current state, and min and max are the minimum and maximum of the photographing times or residence time over all states:

x = (x* − min) / (max − min)    (6)

The return function in the current state can then be expressed by equation (7), where r'(s, a) is the instantaneous return and norm(·) denotes the normalization of equation (6):

r(s, a) = r'(s, a) + norm(img_s) + norm(stay_s)    (7)
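Equations (6) and (7) combine into a few lines of Python. The numeric ranges and the instantaneous return value below are illustrative only:

```python
# Sketch of equations (6)-(7): min-max normalisation of the photographing
# times and residence time, added to the instantaneous return.
def min_max(x, lo, hi):
    return 0.0 if hi == lo else (x - lo) / (hi - lo)

def reward_with_behavior(r_inst, img, stay, img_rng, stay_rng):
    return (r_inst
            + min_max(img, *img_rng)      # normalised photographing times
            + min_max(stay, *stay_rng))   # normalised residence time

# Illustrative values: 4 photos in a 0..8 range, 90 s dwell in 30..150 s.
r = reward_with_behavior(r_inst=0.2, img=4, stay=90,
                         img_rng=(0, 8), stay_rng=(30, 150))
# 0.2 + 0.5 + 0.5 = 1.2
```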
The existing tourist browsing trajectories are then processed into "state-action-behavior feature" sequences as expert example data. Assume N tourist trajectory data D = {ζ_1, ..., ζ_N}, each of length H; a trajectory data sequence can then be represented as:

ζ_1 = ((s_1, a_1, img_1, stay_1), (s_2, a_2, img_2, stay_2), ..., (s_H, a_H, img_H, stay_H))

where s_H ∈ S and a_H ∈ A. In the present invention we define each trajectory data length H as 15. For example, if the browsing trajectory of a tourist u is:

ζ_u = ((s_1, a_2, img_1, stay_1), (s_2, a_4, img_2, stay_2), (s_3, a_3, img_3, stay_3), ..., (s_15, a_15, img_15, stay_15))

then tourist u browsed exhibit a_2 in state s_1, with photographing times img_1 and residence time stay_1 at exhibit a_2; the tourist then browsed exhibit a_4, with photographing times img_2 and residence time stay_2 at exhibit a_4.
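The conversion into "state-action-behavior feature" tuples can be sketched as follows, assuming (as in the ζ_u example) that the browsing record advances to s_i at the i-th step; the input layout is a hypothetical simplification of the stored log:

```python
# Sketch: turning per-step tour records into expert-example tuples
# (s_i, a, img, stay), matching the trajectory format zeta above.
def to_expert_example(steps):
    """steps: list of (exhibit_id, img_count, stay_seconds) in visit order."""
    return [(f"s{i}", exhibit, img, stay)
            for i, (exhibit, img, stay) in enumerate(steps, start=1)]

zeta = to_expert_example([("a2", 3, 40), ("a4", 1, 25)])
# [("s1", "a2", 3, 40), ("s2", "a4", 1, 25)]
```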
S104, utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories.
Specifically, the cumulative return expectation of actions taken in any state is obtained from the expert example data, and the policy is calculated with a Boltzmann distribution, yielding a log-likelihood estimation function based on the existing expert example data. Maximum likelihood reverse reinforcement learning integrates the characteristics of other reverse reinforcement learning models and can estimate the return function even when expert trajectories are few: the maximum likelihood model is found from the expert trajectories, the initial return function is continuously adjusted, and the policy π is continuously optimized by gradients. The whole algorithm flow is shown in the maximum likelihood reverse reinforcement learning algorithm flow chart of fig. 7. The specific steps are as follows:
First, from the expert example data, the cumulative return expectation Q of taking action a in tourist state s is obtained; it can be represented by formula (8), reconstructed here in the standard discounted form:

Q(s, a) = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s, a_0 = a ]    (8)

In the MDP the action is defined as the next browsed exhibit, so the action space is not large; we therefore use the Boltzmann distribution as the policy π, calculated by equation (9):
π_θ(a|s) = e^{βQ(s,a)} / Σ_{a'} e^{βQ(s,a')}    (9)
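Equation (9) is a softmax over the Q-values of the candidate next exhibits. A minimal sketch, with illustrative Q-values and β:

```python
import math

# Sketch of equation (9): Boltzmann (softmax) policy over Q-values.
def boltzmann_policy(q_values, beta=1.0):
    """q_values: dict action -> Q(s, a); returns dict action -> pi(a|s)."""
    exps = {a: math.exp(beta * q) for a, q in q_values.items()}
    z = sum(exps.values())          # normalisation constant
    return {a: e / z for a, e in exps.items()}

pi = boltzmann_policy({"a1": 1.0, "a2": 1.0, "a3": 2.0}, beta=1.0)
```

Actions with higher cumulative return expectation receive proportionally higher probability, and β controls how sharply the policy concentrates on the best action.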
Under this policy, the log-likelihood estimation function based on the existing trajectory demonstration data of tourists browsing exhibits can be represented by formula (10):

L(D|θ) = Σ_{ζ∈D} Σ_{(s,a)∈ζ} log π_θ(a|s)    (10)

Differentiating the log-likelihood estimation function gives the gradient ∇_θ L(D|θ). The weight vector is then updated to the current weight vector plus 0.01 times the gradient, i.e. θ_{t+1} = θ_t + 0.01 ∇_θ L(D|θ_t), until the absolute value of the difference between the next weight vector and the current weight vector is at most 0.01, i.e. ‖θ_{t+1} − θ_t‖ ≤ 0.01; learning then ends and the weight vector value θ = argmax_θ L(D|θ) is output. If instead ‖θ_{t+1} − θ_t‖ > 0.01, the cumulative return expectation is re-acquired until the absolute value is at most 0.01. On the basis of collecting real tourist behavior data, tourist behavior is combined with reverse reinforcement learning, and the designed reverse reinforcement learning algorithm performs fine-grained preference learning on the collected real behavior data.
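The update-and-stop loop described above can be sketched generically. The toy concave objective at the bottom is a stand-in for the real log-likelihood of equation (10), whose gradient would come from the expert trajectories:

```python
# Sketch of the maximum-likelihood weight update: gradient ascent with
# step 0.01, stopping once the weight change is at most 0.01.
def mlirl(grad_log_likelihood, theta, lr=0.01, tol=0.01, max_iter=10000):
    for _ in range(max_iter):
        g = grad_log_likelihood(theta)
        theta_next = [t + lr * gi for t, gi in zip(theta, g)]
        if max(abs(a - b) for a, b in zip(theta_next, theta)) <= tol:
            return theta_next        # converged: output weight vector
        theta = theta_next           # else re-estimate and continue
    return theta

# Toy stand-in objective -(theta - 2)^2, gradient 2*(2 - theta):
theta_hat = mlirl(lambda th: [2 * (2.0 - th[0])], [0.0])
```

With this stand-in the weights climb toward the maximizer at 2 and the loop halts once successive iterates differ by at most the tolerance.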
The complete flow is as shown in the overall structure flow chart for learning tourist fine-grained preferences in fig. 2: based on the combination of iBeacon and a smart phone, tourist tour behavior data are collected and stored in text files; the five elements of the Markov decision process are obtained and defined and a Markov decision process model is constructed; a return function is constructed and the two normalized features of photographing times and residence time are added to it; the tourist browsing trajectory data are taken as expert example data; and finally tourist preferences are learned with the maximum likelihood reverse reinforcement learning algorithm, so that accurate tourist preferences can be learned from limited tour data.
According to the tourist behavior preference modeling method based on reverse reinforcement learning, exhibits are located with iBeacon, and the smart phone combines the number of photographing broadcasts received with the iBeacon position identification and uploads this information to the system server, where the tour behavior data are stored. The five elements S, A, P, r and γ of the Markov decision process are obtained and a Markov decision process model is constructed; a return function is built with a function approximation method, and the normalized photographing times and residence time are added to it. The tour data are converted into expert example data in "state-action-behavior feature" sequence format; meanwhile, the cumulative return expectation of actions taken in any state is obtained from the expert example data, a Boltzmann distribution is adopted to calculate the strategy, and the log-likelihood estimation function based on the existing expert example data is obtained, differentiated, and the weight vector updated. When the set condition is met, preference learning ends, and accurate tourist preferences can be learned from limited tour data.
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto; those skilled in the art will appreciate that equivalent changes to all or part of the procedures described above, made according to the claims, still fall within the scope of the present invention.
Claims (4)
1. A guest behavior preference modeling method based on reverse reinforcement learning, characterized by comprising the following steps:
acquiring and storing tour behavior data of tourists based on the combination of iBeacon and a smart phone;
carrying out Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring the photographing times and residence time and adding them to the return function, and converting the tour behavior data into expert example data;
utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories;
wherein acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and the smart phone comprises:
acquiring and grouping the iBeacon devices in an indoor exhibition hall, positioning the exhibits by combining the Minor and Major fields in the iBeacon protocol data, receiving the broadcast signals of the iBeacon devices with an application program on the smart phone while reading sensor data and monitoring photographing broadcasts, and uploading the acquired data to a system server through a wireless network;
wherein acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and the smart phone further comprises:
counting, by the system server, the photographing times of the tourists for target exhibits according to the number of photographing broadcasts received and the iBeacon position identifier, and storing the collected tourist behavior data in a file;
wherein learning the preferences from the tourist tour trajectories by using a maximum likelihood reverse reinforcement learning algorithm comprises the following steps:
obtaining the cumulative return expectation of an action made in any state based on the expert example data, and adopting a Boltzmann distribution to calculate the policy, so as to obtain a log-likelihood function of the existing expert example data;
and differentiating the log-likelihood function to obtain a gradient, and updating the weight vector by adding 0.01 times the gradient to the current weight vector; if the absolute value of the difference between the next weight vector and the current weight vector is smaller than or equal to 0.01, finishing learning and outputting the weight vector value; if the absolute value is larger than 0.01, re-acquiring the cumulative return expectation until the absolute value is smaller than or equal to 0.01.
2. The guest behavior preference modeling method based on reverse reinforcement learning as defined in claim 1, wherein carrying out Markov decision process modeling according to the tour behavior data and constructing a return function comprises:
acquiring the five elements S, A, P, r and gamma of a Markov decision process, constructing a Markov decision process model, and obtaining an interaction sequence of the tourist in combination with a set policy, wherein S represents the state space of the exhibit the tourist is currently browsing, A represents the action space of the exhibit the tourist browses next in a corresponding state, P represents the state transition probability, r represents the return function, and gamma represents the discount factor.
3. The guest behavior preference modeling method based on reverse reinforcement learning as defined in claim 2, wherein carrying out Markov decision process modeling according to the tour behavior data and constructing a return function further comprises:
acquiring the feature basis functions, the number of feature bases, the weight vector, and the feature vector of each state, and constructing the return function by a function approximation method.
4. The guest behavior preference modeling method based on reverse reinforcement learning as defined in claim 3, wherein acquiring the photographing times and residence time, adding them to the return function, and converting the tour behavior data into expert example data comprises:
acquiring the photographing times and residence time when any exhibit is browsed, carrying out normalization processing on each of them, and adding the normalized photographing times and residence time to the instantaneous return in the corresponding state to obtain the return function value for that state, while converting the obtained tour behavior data into expert example data in a sequence format.
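The return value described in claim 4 can be illustrated with a minimal sketch. Min-max normalization and the unweighted sum are assumptions, since the claims do not fix the normalization method or the exact way the normalized features are combined with the instantaneous return.

```python
import numpy as np

def min_max(x):
    """Min-max normalization to [0, 1] (an assumed choice of normalization)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def return_values(instant_return, photo_counts, dwell_times):
    """Per-state return: the instantaneous return plus the normalized
    photographing times and residence time of the browsed exhibit."""
    return (np.asarray(instant_return, dtype=float)
            + min_max(photo_counts) + min_max(dwell_times))
```

For example, with instantaneous returns [0.1, 0.2, 0.3], photo counts [0, 5, 10] and dwell times [30, 60, 90] seconds, the per-state returns come out as [0.1, 1.2, 2.3].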
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010195068.5A CN111415198B (en) | 2020-03-19 | 2020-03-19 | Tourist behavior preference modeling method based on reverse reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111415198A CN111415198A (en) | 2020-07-14 |
CN111415198B true CN111415198B (en) | 2023-04-28 |
Family
ID=71494548
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010195068.5A Active CN111415198B (en) | 2020-03-19 | 2020-03-19 | Tourist behavior preference modeling method based on reverse reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111415198B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113158086B (en) * | 2021-04-06 | 2023-05-05 | 浙江贝迩熊科技有限公司 | Personalized customer recommendation system and method based on deep reinforcement learning |
CN114355786A (en) * | 2022-01-17 | 2022-04-15 | 北京三月雨文化传播有限责任公司 | Big data-based regulation cloud system of multimedia digital exhibition hall |
CN117033800A (en) * | 2023-10-08 | 2023-11-10 | 法琛堂(昆明)医疗科技有限公司 | Intelligent interaction method and system for visual cloud exhibition system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010048146A1 (en) * | 2008-10-20 | 2010-04-29 | Carnegie Mellon University | System, method and device for predicting navigational decision-making behavior |
CN107358471A (en) * | 2017-07-17 | 2017-11-17 | 桂林电子科技大学 | A kind of tourist resources based on visit behavior recommends method and system |
CN108819948A (en) * | 2018-06-25 | 2018-11-16 | 大连大学 | Driving behavior modeling method based on reverse intensified learning |
CN108875005A (en) * | 2018-06-15 | 2018-11-23 | 桂林电子科技大学 | A kind of tourist's preferential learning system and method based on visit behavior |
WO2019145952A1 (en) * | 2018-01-25 | 2019-08-01 | Splitty Travel Ltd. | Systems, methods and computer program products for optimization of travel technology target functions, including when communicating with travel technology suppliers under technological constraints |
CN110288436A (en) * | 2019-06-19 | 2019-09-27 | 桂林电子科技大学 | A kind of personalized recommending scenery spot method based on the modeling of tourist's preference |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10872322B2 (en) * | 2008-03-21 | 2020-12-22 | Dressbot, Inc. | System and method for collaborative shopping, business and entertainment |
Non-Patent Citations (5)
Title |
---|
Liu Jianwei; Gao Feng; Luo Xionglin. A survey of deep reinforcement learning based on value function and policy gradient. Chinese Journal of Computers. 2018, (06). * |
Sun Lei et al. A tourist preference learning method based on tour behavior. Computer Engineering and Design. 2019. * |
Xuan Wen. Research on fine-grained tourist behavior preference based on inverse reinforcement learning. China Master's Theses Full-text Database, Information Science and Technology. 2022, (06). * |
Fan Changjie. Research on planning problems based on Markov decision theory. China Doctoral Dissertations Full-text Database, Information Science and Technology / Basic Science. 2009, (07). * |
Chen Xiliang; Cao Lei; He Ming; Li Chenxi; Xu Zhixiong. A survey of deep inverse reinforcement learning. Computer Engineering and Applications. 2018, (05). * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111415198B (en) | Tourist behavior preference modeling method based on reverse reinforcement learning | |
CN110609903B (en) | Information presentation method and device | |
US9235263B2 (en) | Information processing device, determination method, and non-transitory computer readable storage medium | |
JP4497236B2 (en) | Detection information registration device, electronic device, detection information registration device control method, electronic device control method, detection information registration device control program, electronic device control program | |
CN107680010B (en) | Scenic spot route recommendation method and system based on touring behavior | |
JP4902270B2 (en) | How to assemble a collection of digital images | |
CN104737523B (en) | The situational model in mobile device is managed by assigning for the situation label of data clustering | |
US8650242B2 (en) | Data processing apparatus and data processing method | |
CN107018333A (en) | Shoot template and recommend method, device and capture apparatus | |
CN103944804B (en) | Contact recommending method and device | |
CN103455472B (en) | Information processing apparatus and information processing method | |
CN101855633A (en) | Video analysis apparatus and method for calculating inter-person evaluation value using video analysis | |
CN103914559A (en) | Network user screening method and network user screening device | |
JPWO2014129042A1 (en) | Information processing apparatus, information processing method, and program | |
CN107666540B (en) | Terminal control method, device and storage medium | |
CN115654675A (en) | Air conditioner operation parameter recommendation method and related equipment | |
CN113495487A (en) | Terminal and method for adjusting operation parameters of target equipment | |
JP2016129309A (en) | Object linking method, device and program | |
JP2022145054A (en) | Recommendation information providing method and recommendation information providing system | |
JP2014225061A (en) | Information provision device, information provision system, information provision method, and program | |
CN116503209A (en) | Digital twin system based on artificial intelligence and data driving | |
JP2015153157A (en) | virtual information management system | |
CN108616919A (en) | A kind of public domain stream of people monitoring method and device | |
KR100880001B1 (en) | Mobile device for managing personal life and method for searching information using the mobile device | |
CN113158086B (en) | Personalized customer recommendation system and method based on deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||