CN111415198B - Tourist behavior preference modeling method based on reverse reinforcement learning - Google Patents

Tourist behavior preference modeling method based on reverse reinforcement learning Download PDF

Info

Publication number
CN111415198B
CN111415198B CN202010195068.5A CN202010195068A CN111415198B CN 111415198 B CN111415198 B CN 111415198B CN 202010195068 A CN202010195068 A CN 202010195068A CN 111415198 B CN111415198 B CN 111415198B
Authority
CN
China
Prior art keywords
tourist
data
ibeacon
function
return
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010195068.5A
Other languages
Chinese (zh)
Other versions
CN111415198A (en
Inventor
常亮
宣闻
宾辰忠
陈源鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202010195068.5A priority Critical patent/CN111415198B/en
Publication of CN111415198A publication Critical patent/CN111415198A/en
Application granted granted Critical
Publication of CN111415198B publication Critical patent/CN111415198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0203Market surveys; Market polls
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/14Travel agencies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/80Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a tourist behavior preference modeling method based on reverse reinforcement learning. Exhibits are located with iBeacon, and a smartphone combines the number of received photographing broadcasts with the iBeacon position identification; the resulting tour behavior data are uploaded and stored. The five elements of a Markov decision process are obtained, a Markov decision process model is constructed, a return function is built by a function approximation method, and the normalized photographing count and residence time are obtained and added to the return function. The tour data are converted into expert example data, a policy is computed with a Boltzmann distribution, and after the log-likelihood estimation function is obtained it is differentiated and the weight vector is updated. When the set condition is met, preference learning ends, so that accurate tourist preferences can be learned from limited tour data.

Description

Tourist behavior preference modeling method based on reverse reinforcement learning
Technical Field
The invention relates to the technical field of location awareness and machine learning, in particular to a guest behavior preference modeling method based on reverse reinforcement learning.
Background
Using travel recommendation technology to provide personalized services for users and improve recommendation performance and tourist satisfaction is one of the research hot spots in the current smart tourism field. In travel recommendation, it is important to understand tourist behavior patterns and learn tourist preferences. Current travel recommendation techniques mainly use data such as tourists' ratings of attractions or exhibits, check-in records, and visit frequency as the basis for judging how much a tourist prefers an exhibit. However, inside specific venues such as museums and theme parks, explicit rating data from tourists on attractions or exhibits is generally unavailable, so fine-grained preference learning cannot be performed and recommendations for the interior of such venues cannot be produced. Moreover, many recommendation algorithms require a large amount of tourist history data for training before preferences can be learned and recommendations made, yet tour data inside an exhibition hall is sparse and incomplete, so accurate preferences cannot be learned from the limited tour data.
Disclosure of Invention
The invention aims to provide a tourist behavior preference modeling method based on reverse reinforcement learning, which can learn accurate tourist preferences according to limited tourist tour data.
In order to achieve the above object, the present invention provides a modeling method for guest behavior preference based on reverse reinforcement learning, including:
based on the combination of iBeacon and a smart phone, the tourist behavior data of tourists are acquired and stored;
carrying out Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding photographing times and residence time into the return function, and converting the tour data into expert example data;
and utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories.
Wherein, acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smartphone comprises:
and acquiring and grouping iBeacon equipment in the indoor exhibition hall, simultaneously combining the Minor and the Major in the iBeacon protocol data to position the exhibited article, simultaneously receiving the broadcast signal of the iBeacon equipment by an application program in the smart phone, reading the sensor data, monitoring photographing broadcasting, and uploading the acquired data to a system server through a wireless network.
Wherein, acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smartphone further comprises:
according to the times of receiving photographing broadcasting and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibits, and stores the collected tourist behavior data through the file.
Wherein, performing Markov decision process modeling according to the tour behavior data and constructing a return function comprises:
and acquiring S, A, P, r and gamma five elements in a Markov decision process, constructing a Markov decision process model, and combining a set strategy to obtain an interaction sequence of the tourist, wherein S represents a recorded state space of the current browse exhibit of the tourist, A represents an action space of the exhibit to be browsed next by the tourist in a corresponding state, P represents a state transition probability, r represents a return function, and gamma represents a discount factor.
Wherein, according to the tour behavior data, performing Markov decision process modeling and constructing a return function, and further comprising:
and acquiring a characteristic base function, the number and weight vectors of the characteristic base and the characteristic vector of each state, and constructing a return function by utilizing a function approximation method.
Wherein, acquiring and adding photographing times and residence time into the return function, and converting the tour data into expert example data, comprises:
and acquiring photographing times and residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the obtained normalized times and residence time with instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained sightseeing behavior data into expert example data in a sequence format.
Wherein, performing preference learning on tourist tour trajectories by using a maximum likelihood reverse reinforcement learning algorithm comprises:
and calculating strategies by using Boltzmann distribution based on the accumulated return expectation of actions made in any state obtained by the expert example data, thereby obtaining a log-likelihood estimation function based on the existing expert example data.
Wherein, performing preference learning on tourist tour trajectories by using a maximum likelihood reverse reinforcement learning algorithm further comprises:
and deriving the log likelihood estimation function, updating the weight vector according to the gradient obtained by adding 0.01 times to the current weight vector until the absolute value of the difference of the current weight vector subtracted by the next weight vector is smaller than or equal to 0.01, finishing learning, outputting a weight vector value, and re-acquiring the accumulated return expectation until the absolute value is smaller than or equal to 0.01 if the absolute value is larger than 0.01.
According to the tourist behavior preference modeling method based on reverse reinforcement learning, exhibits are located with iBeacon, and the smartphone combines the number of received photographing broadcasts with the iBeacon position identification and uploads the information to the system server, where the tour behavior data are stored. The five elements S, A, P, r and gamma of a Markov decision process are obtained, a Markov decision process model is constructed, a return function is defined and approximated by a function approximation method, and the normalized photographing count and residence time are added to the return function. The tour data are converted into expert example data in a 'state-action-behavior feature' sequence format; the accumulated return expectation of the action taken in any state is obtained from the expert example data, a policy is computed with a Boltzmann distribution, and the log-likelihood estimation function based on the existing example data is obtained, differentiated, and used to update the weight vector. When the set condition is met, preference learning ends, so that tourist preferences can be learned from limited tour data.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic step diagram of a modeling method of guest behavior preference based on reverse reinforcement learning.
FIG. 2 is a flow chart of the overall structure of learning guest fine-grained preferences provided by the present invention.
FIG. 3 is a flow chart of data acquisition and processing provided by the present invention.
FIG. 4 is a flow chart of a method for constructing a Markov decision process model provided by the present invention.
FIG. 5 is a schematic diagram of a Markov decision interaction process provided by the present invention.
Fig. 6 is an overall flowchart of reverse reinforcement learning provided by the present invention.
Fig. 7 is a flowchart of a maximum likelihood reverse reinforcement learning algorithm provided by the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1 and 2, the invention provides a modeling method for guest behavior preference based on reverse reinforcement learning, comprising:
s101, based on combination of iBeacon and a smart phone, tourist behavior data of tourists are acquired and stored.
Specifically, the scene is first arranged in an indoor exhibition hall, following the data acquisition and processing flow shown in fig. 3. A tour guide APP is installed on each tourist's smartphone, and an iBeacon (a micro-positioning device based on Bluetooth Low Energy) is placed at the exhibition hall entrance and at every exhibit in the hall to acquire tourist position information. The iBeacon protocol data contains two identifiers, Minor and Major. In our application scenario the iBeacon devices are grouped: Major identifies the group an iBeacon device belongs to, and Minor distinguishes different iBeacon devices within the same group. That is, Minor is set to the exhibit ID and Major to the partition the exhibit belongs to, so the combination of the Minor and Major identifiers serves as the identification of the browsed exhibit and locates the exhibit the tourist is currently viewing. The tour APP on the tourist's smartphone receives the signals broadcast by the iBeacon devices and, together with the phone camera and acceleration sensor, collects various tour behavior data (such as photographing and residence time): the application receives the iBeacon broadcast signal, reads the sensor data, monitors photographing broadcasts, and finally uploads the collected data to the system server through a wireless network. When a tourist takes a photo, the application immediately detects the photographing behavior and sends a broadcast to the system server. The system server counts the number of photos taken of the target exhibit and the browsing time according to the number of received photographing broadcasts and the iBeacon position identification, and stores the collected tour behavior data in files. The stored data includes the timestamp sequence of interactions between the tourist and the iBeacon devices, the user's three-axis (X, Y, Z) acceleration data, and the identification of the browsed exhibit. Collecting the data by combining iBeacon with the smartphone is convenient, and the data set consists of real behavior data generated while tourists visit attractions inside the venue; since it also contains the tourists' browsing behavior, it is richer and more realistic.
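By way of illustration only, the following Python sketch shows how one record stored by the system server might be parsed; the field layout (timestamp, Major, Minor, three acceleration axes, photo flag) is an assumption introduced here, since the patent does not fix a concrete file format.

```python
# Minimal sketch of parsing a stored tour record (field layout is assumed).
from dataclasses import dataclass

@dataclass
class TourEvent:
    timestamp: float   # time of the iBeacon interaction
    major: int         # partition (group) the exhibit belongs to
    minor: int         # exhibit ID within the partition
    ax: float          # three-axis acceleration from the phone sensor
    ay: float
    az: float
    photo: bool        # True if a photographing broadcast was received

def parse_line(line: str) -> TourEvent:
    """Parse one comma-separated record, e.g. '1584600000.5,2,7,0.01,-0.02,9.81,1'."""
    ts, major, minor, ax, ay, az, photo = line.strip().split(",")
    return TourEvent(float(ts), int(major), int(minor),
                     float(ax), float(ay), float(az), photo == "1")

def exhibit_id(ev: TourEvent) -> tuple[int, int]:
    """The (Major, Minor) pair identifies the browsed exhibit."""
    return (ev.major, ev.minor)
```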
S102, carrying out Markov decision process modeling according to the tour behavior data and constructing a return function.
Specifically, the five elements S, A, P, r and gamma of a Markov decision process are obtained, a Markov decision process model is constructed, and the tourist's interaction sequence is obtained by combining a set policy, as shown in the flow chart for constructing the Markov decision process model in FIG. 4. The five elements of the Markov decision process are defined as follows: the state s represents the record of the exhibits the tourist has currently browsed, and the state space is S; the action a represents the next exhibit to be browsed by the tourist in state s, and the action space is A; the state transition probability P(s_{t+1} | s_t, a_t) represents the probability of moving from state s_t to state s_{t+1} by taking action a_t, where s_t ∈ S, a_t ∈ A. For example, if a tourist whose browsing record is s_1 wants to browse either exhibit a_2 or exhibit a_3, the state transition probabilities may be defined as P(s_2 | s_1, a_2) = 0.5 and P(s_3 | s_1, a_3) = 0.5. The return function r(s_t, a_t) represents the return obtained when the tourist, with current browsing record s_t, browses exhibit a_t, where s_t ∈ S, a_t ∈ A; this return value is proportional to the tourist preference value, that is, the higher the tourist's preference for exhibit a_t, the higher the return value. For ease of calculation we define r(s_t, a_t) ≤ 1. Finally, γ ∈ [0, 1] is the discount factor used to compute the cumulative return.
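For concreteness, the sketch below encodes the five-element Markov decision process (S, A, P, r, γ) described above, assuming the 15 exhibits of the embodiment; the deterministic transition table and placeholder return values are illustrative assumptions, not values taken from the patent.

```python
# Minimal sketch of the five-element MDP (S, A, P, r, gamma); numbers are illustrative.
N_EXHIBITS = 15
states = list(range(N_EXHIBITS + 1))      # s_0 = "just entered", s_i = record after exhibit i
actions = list(range(1, N_EXHIBITS + 1))  # a_i = browse exhibit i next

# State transition probabilities: P[s][a] is a dict of next-state probabilities.
# Here browsing exhibit a deterministically updates the record to that exhibit.
P = {s: {a: {a: 1.0} for a in actions} for s in states}

gamma = 0.9  # discount factor in [0, 1]

def r(s: int, a: int) -> float:
    """Return function placeholder, bounded by 1 as in the text; the real one is learned."""
    return 0.0
```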
The interaction process between the tourist and the exhibits in the exhibition hall can be regarded as a Markov decision process; this interaction is illustrated in the schematic of the Markov decision interaction process in fig. 5.

The tourist starts from entering the exhibition hall, where the browsing record defaults to s_0. When exhibit a_1 is browsed, the corresponding photographing count and residence time are generated; these are taken as feature values and added to the return function to calculate the return value r_1, and the tourist browsing record is updated to s_1. The tourist then browses the next exhibit a_2, the return value r_2 is calculated in the same way, and the browsing record correspondingly becomes s_2. The interaction continues in this manner, so the interaction sequence of a browsing tourist is as in (1), where s_0, s_1, s_2, ..., s_{t-1}, s_t ∈ S:

s_0, a_1, r_1, s_1, a_2, r_2, ..., s_{t-1}, a_t, r_t, s_t    (1)

The Markov property used here means that the record s_{t+1} of exhibits the tourist browses at the next time depends only on the exhibit record s_t browsed by the tourist at the current time and the exhibit a_t being browsed; all other historical browsing records can be discarded, as shown in formula (2), where P(s_{t+1} | s_t, a_t) is the transition probability of the tourist browsing exhibits:

P(s_{t+1} | s_t, a_t, ..., s_1, a_1) = P(s_{t+1} | s_t, a_t)    (2)
How action a_t is selected in each state is determined by the policy π. The policy is defined as π: S → A, a mapping from the state space of tourist browsing records to the next exhibit to be browsed. As formula (3) shows, the policy π is the conditional probability distribution over the action set given a state s, i.e. the policy π specifies the probability of each action in every state s; according to the tourist's browsing record s, the policy π can thus determine the exhibit a recommended to the tourist next:

π(a | s) = P(A_t = a | S_t = s)    (3)

For example, a tourist may browse exhibits using a policy with π(a_2 | s_1) = 0.3 and π(a_3 | s_1) = 0.7, which means that, given browsing record s_1, the probability that the tourist next browses exhibit a_2 is 0.3 and the probability of browsing exhibit a_3 is 0.7; clearly the tourist is more likely to browse exhibit a_3.

Based on a given policy π and the Markov decision process model, the interaction sequence τ of a tourist touring the exhibits can be determined:

τ = s_0, a_1, r_1, s_1, a_2, r_2, s_2, ..., s_{t-1}, a_t, r_t, s_t    (4)
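As an illustration of formulas (3) and (4), the sketch below samples one interaction sequence τ under a tabular policy π; it reuses the example probabilities π(a_2 | s_1) = 0.3 and π(a_3 | s_1) = 0.7 from the text, while the rest of the policy table and the zero return function are assumptions.

```python
# Sketch: sample an interaction sequence tau (formula (4)) under a policy pi (formula (3)).
import random

def sample_trajectory(pi, r, start_state=0, horizon=3, seed=0):
    """pi: dict state -> {action: prob}; r: return function r(s, a)."""
    rng = random.Random(seed)
    s, tau = start_state, [start_state]
    for _ in range(horizon):
        acts, probs = zip(*pi[s].items())
        a = rng.choices(acts, weights=probs, k=1)[0]  # draw next exhibit from pi(.|s)
        tau += [a, r(s, a)]
        s = a                 # browsing record is updated to the exhibit just browsed
        tau.append(s)
    return tau

pi = {0: {1: 1.0}, 1: {2: 0.3, 3: 0.7}, 2: {3: 1.0}, 3: {1: 1.0}}
print(sample_trajectory(pi, r=lambda s, a: 0.0))
```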
Since tourist preferences are unknown, i.e. the return function r(s_t, a_t) is unknown, we obtain the feature basis functions, the number of feature bases, the weight vector and the feature vector of each state, and use a function approximation method to approximate the return function parametrically; the approximate form is shown in formula (5):

R_θ(s, a) = θ^T φ(s, a) = θ_1 φ_1(s, a) + θ_2 φ_2(s, a) + ... + θ_d φ_d(s, a)    (5)

In the above, φ = (φ_1, φ_2, ..., φ_d)^T with φ: S × A → R^d is a fixed, finite set of feature basis functions, where d is the number of feature bases and φ_i is the feature vector of each state. θ = (θ_1, θ_2, ..., θ_d) is the weight vector over the feature bases. With this linear representation, the return function value can be changed by adjusting the weights.
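A minimal sketch of the linear approximation in formula (5) follows; the particular feature map φ (a one-hot indicator of the browsed exhibit plus the two behavior features introduced later) is an assumption made here for illustration.

```python
# Sketch of the linear return function R_theta(s, a) = theta^T phi(s, a) (formula (5)).
import numpy as np

N_EXHIBITS = 15
D = N_EXHIBITS + 2  # d feature bases: one per exhibit + photo count + dwell time (assumed)

def phi(s: int, a: int, img_norm: float = 0.0, stay_norm: float = 0.0) -> np.ndarray:
    """Feature vector phi(s, a) in R^d."""
    f = np.zeros(D)
    f[a - 1] = 1.0                 # indicator of the exhibit browsed next
    f[N_EXHIBITS] = img_norm       # normalized photographing count
    f[N_EXHIBITS + 1] = stay_norm  # normalized residence time
    return f

def R(theta: np.ndarray, s: int, a: int, img_norm=0.0, stay_norm=0.0) -> float:
    """Linear return: adjusting the weights theta changes the return value."""
    return float(theta @ phi(s, a, img_norm, stay_norm))

theta = np.zeros(D)  # weight vector over the feature bases, to be learned
```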
S103, acquiring and adding photographing times and residence time in the return function, and converting the tour data into expert example data.
Specifically, the photographing count and residence time when any exhibit is browsed are obtained and normalized separately, then added to the instantaneous return data of the corresponding state to obtain the return function value for that state; at the same time, the collected browsing behavior data are converted into expert example data in a 'state-action-behavior feature' sequence format. Since the tourist's preference is unknown, the return the tourist can obtain for the next browsed exhibit in the current browsing state s is unknown; that is, the return R(s, a) available to the tourist for choosing action a in state s is unknown. It is therefore necessary to learn the return function backwards from the trajectory data of expert examples (existing related tourists browsing the exhibits). During learning, the two tourist behavior features, photographing count and residence time, are added to the return function for training, and finally a return function R_θ(s, a) is learned by the reverse reinforcement learning algorithm. The overall flow of reverse reinforcement learning, shown in fig. 6, comprises the following steps:

In our application scenario there are 15 exhibits in total. We count two tourist behavior features for an exhibit in the current state s: the photographing count img_s and the residence time stay_s (in seconds). We therefore define the return function as the sum of the instantaneous return generated when viewing the exhibit and the return generated by the photographing count and the residence time when the tourist browses the exhibit in that state. For ease of calculation, the returns generated by the photographing count and residence time are normalized with formula (6), where x* denotes the photographing count or residence time in the current state, and min and max denote the minimum and maximum of the photographing count or residence time over all states:

x_norm = (x* − min) / (max − min)    (6)

The return function in the current state can then be expressed by formula (7):

R(s, a) = r(s, a) + img_s^norm + stay_s^norm    (7)
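The short sketch below applies the min-max normalization of formula (6) and forms the augmented return of formula (7); the variable names and the example numbers are assumptions used only to show the arithmetic.

```python
# Sketch of formula (6) normalization and the formula (7) augmented return.
def min_max_normalize(x: float, xs: list[float]) -> float:
    """(x - min) / (max - min) over the values observed across all states."""
    lo, hi = min(xs), max(xs)
    return 0.0 if hi == lo else (x - lo) / (hi - lo)

def augmented_return(r_sa: float, img_s: float, stay_s: float,
                     all_img: list[float], all_stay: list[float]) -> float:
    """R(s, a) = r(s, a) + normalized photo count + normalized dwell time."""
    return (r_sa
            + min_max_normalize(img_s, all_img)
            + min_max_normalize(stay_s, all_stay))

# Example: 3 photos and 120 s at this exhibit, against values observed over all states.
print(augmented_return(0.5, 3, 120, all_img=[0, 1, 3, 8], all_stay=[20, 60, 120, 300]))
```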
The existing tourist browsing trajectories are then processed into 'state-action-behavior feature' sequences as expert example data. Assume there are N tourist trajectory data D = {ζ_1, ..., ζ_N} and each trajectory has length H; then one trajectory data sequence can be represented as:

ζ_1 = ((s_1, a_1, img_1, stay_1), (s_2, a_2, img_2, stay_2), ..., (s_H, a_H, img_H, stay_H))

where s_H ∈ S and a_H ∈ A. In the present invention, we define each trajectory length H as 15. For example, the browsing trajectory of a tourist u is:

ζ_u = ((s_1, a_2, img_1, stay_1), (s_2, a_4, img_2, stay_2), (s_3, a_3, img_3, stay_3), ..., (s_15, a_15, img_15, stay_15))

which means that tourist u browses exhibit a_2 in state s_1, where the photographing count at exhibit a_2 is img_1 and the residence time is stay_1; the tourist then browses exhibit a_4, where the photographing count at exhibit a_4 is img_2 and the residence time is stay_2, and so on.
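A sketch of building such a 'state-action-behavior feature' sequence ζ from logged visits is given below, with trajectory length H = 15 as in the text; the raw input format (a chronological list of per-exhibit visit records) and the use of the previously browsed exhibit as the state label are assumptions for illustration.

```python
# Sketch: convert logged visits into an expert example sequence zeta.
H = 15  # trajectory length defined in the text

def build_trajectory(visits: list[dict]) -> list[tuple]:
    """visits: chronological records like {'exhibit': 4, 'img': 2, 'stay': 75.0}."""
    zeta, state = [], 0  # the browsing record starts from the entrance
    for v in visits[:H]:
        zeta.append((state, v["exhibit"], v["img"], v["stay"]))
        state = v["exhibit"]  # the browsing record advances to the exhibit just seen
    return zeta

demo = build_trajectory([
    {"exhibit": 2, "img": 1, "stay": 40.0},
    {"exhibit": 4, "img": 3, "stay": 95.0},
    {"exhibit": 3, "img": 0, "stay": 15.0},
])
print(demo)  # [(0, 2, 1, 40.0), (2, 4, 3, 95.0), (4, 3, 0, 15.0)]
```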
S104, utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories.
Specifically, the accumulated return expectation of the action taken in any state is obtained from the expert example data, and the policy is computed with a Boltzmann distribution, yielding a log-likelihood estimation function based on the existing expert example data. Maximum likelihood reverse reinforcement learning integrates the characteristics of other reverse reinforcement learning models and can estimate the return function even when expert trajectories are few: the maximum likelihood model is found from the expert trajectories, the initial return function is continuously adjusted, and the policy π is continuously optimized by gradients. The whole algorithm flow is shown in the maximum likelihood reverse reinforcement learning algorithm flow chart of fig. 7. The specific steps are as follows:
First, from the expert example data we obtain the accumulated return expectation Q of taking action a given the tourist's state s, which can be expressed by formula (8):

Q(s, a) = E[ Σ_{t≥0} γ^t R_θ(s_t, a_t) | s_0 = s, a_0 = a ]    (8)

In the MDP the action is defined as the next browsed exhibit, so the action space is not large; we therefore use the Boltzmann distribution as the policy π, which can be expressed by formula (9):

π_θ(a | s) = e^{βQ(s,a)} / Σ_{a'} e^{βQ(s,a')}    (9)

Under this policy, the log-likelihood estimation function based on the existing trajectory demonstration data of tourists browsing exhibits can be expressed by formula (10):

L(D | θ) = Σ_{ζ ∈ D} Σ_{(s,a) ∈ ζ} log π_θ(a | s)    (10)
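The sketch below implements the Boltzmann policy of formula (9) and the log-likelihood of formula (10) given a table of accumulated return expectations Q(s, a); in the actual method Q would be computed from the current return function R_θ, and the toy Q values and demonstration below are purely illustrative assumptions.

```python
# Sketch of the Boltzmann policy (formula (9)) and log-likelihood (formula (10)).
import numpy as np

def boltzmann_policy(Q_s: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """pi_theta(a|s) = exp(beta*Q(s,a)) / sum_a' exp(beta*Q(s,a'))."""
    z = beta * Q_s - np.max(beta * Q_s)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_likelihood(Q: dict, demos: list[list[tuple]], beta: float = 1.0) -> float:
    """L(D|theta) = sum over demonstrated (s, a) pairs of log pi_theta(a|s)."""
    ll = 0.0
    for zeta in demos:
        for (s, a, _img, _stay) in zeta:
            pi_s = boltzmann_policy(Q[s], beta)
            ll += np.log(pi_s[a - 1])     # actions a are 1-indexed exhibit IDs
    return ll

Q = {0: np.array([0.2, 0.5, 0.1]), 2: np.array([0.3, 0.3, 0.9])}
demos = [[(0, 2, 1, 40.0), (2, 3, 0, 15.0)]]
print(log_likelihood(Q, demos))
```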
The log-likelihood estimation function is then differentiated to obtain the gradient ∇_θ L(D | θ), and the weight vector is updated by adding 0.01 times the gradient to the current weight vector:

θ_{t+1} = θ_t + 0.01 ∇_θ L(D | θ_t)

This continues until the absolute value of the difference between the next weight vector and the current weight vector is less than or equal to 0.01, i.e. ||θ_{t+1} − θ_t|| ≤ 0.01; learning then finishes and the weight vector value θ = argmax_θ L(D | θ) is output. If the absolute value is greater than 0.01, i.e. ||θ_{t+1} − θ_t|| > 0.01, the accumulated return expectation is re-acquired until the absolute value is less than or equal to 0.01. On the basis of collecting real tourist behavior data, tourist behavior is combined with reverse reinforcement learning, and the designed reverse reinforcement learning algorithm performs fine-grained preference learning on the collected real behavior data.
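A minimal sketch of this outer learning loop follows: differentiate the log-likelihood, step the weight vector with step size 0.01, and stop when the change in the weight vector is at most 0.01. The numerical gradient and the toy objective are stand-ins (assumptions) for the analytic gradient and the Q re-evaluation used in the actual algorithm.

```python
# Sketch of the maximum likelihood reverse reinforcement learning update loop.
import numpy as np

def numerical_gradient(f, theta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Central-difference gradient of a scalar function f(theta)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def mlirl(log_likelihood_fn, theta0: np.ndarray,
          lr: float = 0.01, tol: float = 0.01, max_iter: int = 1000) -> np.ndarray:
    theta = theta0.copy()
    for _ in range(max_iter):
        # Re-acquire Q (inside log_likelihood_fn) and take one gradient step:
        # theta_{t+1} = theta_t + 0.01 * grad L(D | theta_t).
        grad = numerical_gradient(log_likelihood_fn, theta)
        theta_next = theta + lr * grad
        if np.linalg.norm(theta_next - theta) <= tol:  # ||theta_{t+1} - theta_t|| <= 0.01
            return theta_next                          # learning finished
        theta = theta_next
    return theta

# Toy usage: maximize a concave surrogate of L(D|theta) with optimum at theta = 1.
theta_hat = mlirl(lambda th: -np.sum((th - 1.0) ** 2), np.zeros(3))
print(theta_hat)
```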
The complete flow is as shown in the overall flow chart of learning tourist fine-grained preferences in fig. 2: based on the combination of iBeacon and the smartphone, the tour behavior data are collected and stored in text files; the five elements of the Markov decision process are obtained and defined and the Markov decision process model is constructed; the return function is constructed and the two normalized features, photographing count and residence time, are added to it; the tourist browsing trajectory data are taken as expert example data; and finally the maximum likelihood reverse reinforcement learning algorithm is used to learn tourist preferences, so that accurate tourist preferences can be learned from limited tour data.
According to the tourist behavior preference modeling method based on reverse reinforcement learning, exhibits are located with iBeacon, and the smartphone combines the number of received photographing broadcasts with the iBeacon position identification and uploads the information to the system server, where the tour behavior data are stored. The five elements S, A, P, r and gamma of the Markov decision process are obtained, the Markov decision process model is constructed, the return function is constructed by a function approximation method, and the normalized photographing count and residence time are added to the return function. The tour data are converted into expert example data in a 'state-action-behavior feature' sequence format; the accumulated return expectation of the action taken in any state is obtained from the expert example data, the policy is computed with a Boltzmann distribution, the log-likelihood estimation function based on the existing expert example data is obtained and differentiated, and the weight vector is updated. When the set condition is met, preference learning ends, so that accurate tourist preferences can be learned from limited tour data.
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.

Claims (4)

1. The guest behavior preference modeling method based on reverse reinforcement learning is characterized by comprising the following steps of:
based on the combination of iBeacon and a smart phone, the tourist behavior data of tourists are acquired and stored;
carrying out Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding photographing times and residence time into the return function, and converting the tour behavior data into expert example data;
utilizing a maximum likelihood reverse reinforcement learning algorithm to learn preferences of tourist tour trajectories;
wherein acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smartphone comprises:
the method comprises the steps of acquiring and grouping iBeacon equipment in an indoor exhibition hall, simultaneously positioning exhibits by combining a Minor and a Major in iBeacon protocol data, simultaneously receiving broadcast signals of the iBeacon equipment by an application program in a smart phone, reading sensor data, monitoring photographing broadcasting, and uploading acquired data to a system server through a wireless network;
wherein acquiring and storing the tour behavior data of tourists based on the combination of iBeacon and a smartphone further comprises:
according to the times of receiving photographing broadcasting and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibits and stores the collected tourist behavior data through the file;
the learning of preferences of tourist tour trajectories by using a maximum likelihood reverse reinforcement learning algorithm comprises the following steps:
the accumulated return expectation of actions made in any state based on the expert example data is obtained, and a Boltzmann distribution is adopted to calculate a strategy, so that a log-likelihood estimation function based on the existing expert example data is obtained;
and differentiating the log-likelihood estimation function to obtain the gradient, updating the weight vector by adding 0.01 times the gradient to the current weight vector, finishing learning and outputting the weight vector value when the absolute value of the difference between the next weight vector and the current weight vector is less than or equal to 0.01, and, if the absolute value is greater than 0.01, re-acquiring the accumulated return expectation until the absolute value is less than or equal to 0.01.
2. A method of modeling guest behavior preferences based on reverse reinforcement learning as defined in claim 1, wherein modeling a Markov decision process from the tour behavior data and constructing a return function comprises:
and acquiring S, A, P, r and gamma five elements in a Markov decision process, constructing a Markov decision process model, and combining a set strategy to obtain an interaction sequence of the tourist, wherein S represents a recorded state space of the current browse exhibit of the tourist, A represents an action space of the exhibit to be browsed next by the tourist in a corresponding state, P represents a state transition probability, r represents a return function, and gamma represents a discount factor.
3. A guest behavior preference modeling method based on reverse reinforcement learning as defined in claim 2, wherein the markov decision process modeling and the construction of a return function are performed according to the tour behavior data, further comprising:
and acquiring a characteristic base function, the number and weight vectors of the characteristic base and the characteristic vector of each state, and constructing a return function by utilizing a function approximation method.
4. A guest behavior preference modeling method based on reverse reinforcement learning as claimed in claim 3, wherein acquiring and adding the number of shots and the stay time in the return function, and converting the tour behavior data into expert example data, comprises:
and acquiring photographing times and residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the obtained normalized times and residence time with instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained sightseeing behavior data into expert example data in a sequence format.
CN202010195068.5A 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning Active CN111415198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010195068.5A CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010195068.5A CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Publications (2)

Publication Number Publication Date
CN111415198A CN111415198A (en) 2020-07-14
CN111415198B true CN111415198B (en) 2023-04-28

Family

ID=71494548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010195068.5A Active CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Country Status (1)

Country Link
CN (1) CN111415198B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158086B (en) * 2021-04-06 2023-05-05 浙江贝迩熊科技有限公司 Personalized customer recommendation system and method based on deep reinforcement learning
CN114355786A (en) * 2022-01-17 2022-04-15 北京三月雨文化传播有限责任公司 Big data-based regulation cloud system of multimedia digital exhibition hall
CN117033800A (en) * 2023-10-08 2023-11-10 法琛堂(昆明)医疗科技有限公司 Intelligent interaction method and system for visual cloud exhibition system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010048146A1 (en) * 2008-10-20 2010-04-29 Carnegie Mellon University System, method and device for predicting navigational decision-making behavior
CN107358471A (en) * 2017-07-17 2017-11-17 桂林电子科技大学 A kind of tourist resources based on visit behavior recommends method and system
CN108819948A (en) * 2018-06-25 2018-11-16 大连大学 Driving behavior modeling method based on reverse intensified learning
CN108875005A (en) * 2018-06-15 2018-11-23 桂林电子科技大学 A kind of tourist's preferential learning system and method based on visit behavior
WO2019145952A1 (en) * 2018-01-25 2019-08-01 Splitty Travel Ltd. Systems, methods and computer program products for optimization of travel technology target functions, including when communicating with travel technology suppliers under technological constraints
CN110288436A (en) * 2019-06-19 2019-09-27 桂林电子科技大学 A kind of personalized recommending scenery spot method based on the modeling of tourist's preference

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10872322B2 (en) * 2008-03-21 2020-12-22 Dressbot, Inc. System and method for collaborative shopping, business and entertainment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010048146A1 (en) * 2008-10-20 2010-04-29 Carnegie Mellon University System, method and device for predicting navigational decision-making behavior
CN107358471A (en) * 2017-07-17 2017-11-17 桂林电子科技大学 A kind of tourist resources based on visit behavior recommends method and system
WO2019145952A1 (en) * 2018-01-25 2019-08-01 Splitty Travel Ltd. Systems, methods and computer program products for optimization of travel technology target functions, including when communicating with travel technology suppliers under technological constraints
CN108875005A (en) * 2018-06-15 2018-11-23 桂林电子科技大学 A kind of tourist's preferential learning system and method based on visit behavior
CN108819948A (en) * 2018-06-25 2018-11-16 大连大学 Driving behavior modeling method based on reverse intensified learning
CN110288436A (en) * 2019-06-19 2019-09-27 桂林电子科技大学 A kind of personalized recommending scenery spot method based on the modeling of tourist's preference

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘建伟; 高峰; 罗雄麟. A survey of deep reinforcement learning based on value function and policy gradient. Chinese Journal of Computers. 2018, (06), full text. *
孙磊 et al. A tourist preference learning method based on tour behavior. Computer Engineering and Design. 2019, full text. *
宣闻. Research on fine-grained tourist behavior preference based on inverse reinforcement learning. China Master's Theses Full-text Database, Information Science and Technology. 2022, (No. 06), full text. *
范长杰. Research on planning problems based on Markov decision theory. China Doctoral Dissertations Full-text Database, Information Science and Technology / Basic Sciences. 2009, (No. 07), full text. *
陈希亮; 曹雷; 何明; 李晨溪; 徐志雄. A survey of deep inverse reinforcement learning. Computer Engineering and Applications. 2018, (05), full text. *

Also Published As

Publication number Publication date
CN111415198A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
CN111415198B (en) Tourist behavior preference modeling method based on reverse reinforcement learning
CN110609903B (en) Information presentation method and device
US9235263B2 (en) Information processing device, determination method, and non-transitory computer readable storage medium
JP4497236B2 (en) Detection information registration device, electronic device, detection information registration device control method, electronic device control method, detection information registration device control program, electronic device control program
CN107680010B (en) Scenic spot route recommendation method and system based on touring behavior
JP4902270B2 (en) How to assemble a collection of digital images
CN104737523B (en) The situational model in mobile device is managed by assigning for the situation label of data clustering
US8650242B2 (en) Data processing apparatus and data processing method
CN107018333A (en) Shoot template and recommend method, device and capture apparatus
CN103944804B (en) Contact recommending method and device
CN103455472B (en) Information processing apparatus and information processing method
CN101855633A (en) Video analysis apparatus and method for calculating inter-person evaluation value using video analysis
CN103914559A (en) Network user screening method and network user screening device
JPWO2014129042A1 (en) Information processing apparatus, information processing method, and program
CN107666540B (en) Terminal control method, device and storage medium
CN115654675A (en) Air conditioner operation parameter recommendation method and related equipment
CN113495487A (en) Terminal and method for adjusting operation parameters of target equipment
JP2016129309A (en) Object linking method, device and program
JP2022145054A (en) Recommendation information providing method and recommendation information providing system
JP2014225061A (en) Information provision device, information provision system, information provision method, and program
CN116503209A (en) Digital twin system based on artificial intelligence and data driving
JP2015153157A (en) virtual information management system
CN108616919A (en) A kind of public domain stream of people monitoring method and device
KR100880001B1 (en) Mobile device for managing personal life and method for searching information using the mobile device
CN113158086B (en) Personalized customer recommendation system and method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant