CN112818402B

CN112818402B - Method for realizing k anonymity of track data release based on point density segmentation track

Info

Publication number: CN112818402B
Application number: CN202110213797.3A
Authority: CN
Inventors: 徐红云; 杨丰源; 陆涛; 余宛书; 熊镔; 时浩南; 孙雨虹; 张紫怡
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2022-07-26
Anticipated expiration: 2041-02-26
Also published as: CN112818402A

Abstract

The invention discloses a method for realizing k-anonymity of track data release based on point-density track segmentation, comprising the following steps: 1) acquiring basic track data and establishing a track data set model; 2) building a DGH tree for the track loss model; 3) adding virtual points to the track data set model to generate a track data set model containing the virtual points and a virtual point mark data set model; 4) clustering the track data set model containing the virtual points, marking the cluster center to which each point belongs, and generating a mark data set model; 5) traversing the track data set model and segmenting the tracks according to the mark data set model to generate a segmented track data set model; 6) calculating the loss on the segmented track data set model with a dynamic sequence alignment algorithm, and clustering based on information loss with an iterative track k-anonymous clustering algorithm. The method segments tracks based on the point density of the track data set and reduces the information loss incurred during k-anonymization.

Description

Method for realizing k anonymity of track data release based on point density segmentation track

Technical Field

The invention relates to the technical field of track data privacy protection and release, in particular to a method for realizing k anonymity of track data release based on point density segmentation tracks.

Background

With the rapid development of technology, mobile devices have become widespread, and movement trajectory data are widely collected through the cellular networks that mobile devices must connect to and through the applications running on those devices. The rapid development of mass storage and data processing technology makes publishing such track data extremely convenient.

Publicly released track data not only plays an important role in the work of scientific research organizations, but is also important for operators, governments and similar bodies as a demonstration of data transparency. However, such track data may also be exploited by malicious attackers.

In track data distribution, the privacy protection target is mainly the corresponding relation between sensitive data in a user track and a user individual. In order to ensure that private data of people are not attacked or leaked under the condition of publishing track data publicly, different organizations use a plurality of methods to process the track data before publishing the data. For the privacy protection problem of track data, a large amount of research is carried out by scholars, and some privacy protection methods are proposed, which are specifically as follows:

1) Shaham et al., in "Privacy Preserving Location Data Publishing: A Machine Learning Approach", use a heuristic algorithm and the k-means algorithm, respectively, to cluster tracks; they propose processing track data with a regional generalization hierarchy tree, computing the loss from that tree, and generalizing the track data so as to achieve k'-anonymity (the prime distinguishes this k from the k of k-means). In practical use, however, this method can incur generalization loss at some points on a track.

2) Tamersoy et al., in "Anonymization of Longitudinal Electronic Medical Records", build on the concept of generalization and use a heuristic method to achieve k-anonymity of a data set, but the algorithm incurs large information loss while anonymizing the data.

3) Marco et al., in "Towards Privacy-preserving Publishing of Spatiotemporal Trajectory Data", propose the k-merge algorithm to solve the problem of efficient generalization encountered when anonymizing spatiotemporal trajectory data sets, and propose a method that achieves k-anonymity based on k-merge. The method anonymizes tracks while protecting user privacy against attack, but can cause a large amount of information loss.

Although the above methods can protect user privacy, they incur large information loss in the process.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, provides a method for realizing k anonymity of track data distribution based on point density segmentation tracks, solves the problem of large information loss of the existing algorithm to a certain extent, and reduces the information loss under the condition of protecting the privacy of users.

In order to realize the purpose, the technical scheme provided by the invention is as follows: a method for realizing k anonymity of track data distribution based on point density segmentation tracks comprises the following steps:

1) acquiring basic track data including longitude and latitude information of a track and a time sequence relation of a track point set, and establishing a track data set model T;

2) building a DGH (regional generalization hierarchy) tree of a track loss model by utilizing longitude and latitude information of a track;

3) adding virtual points between adjacent points of each track in the track data set model T to generate a track data set model $T_{virtual}$ containing the virtual points and a virtual point mark data set model $virtual$;

4) clustering the track data set model $T_{virtual}$ containing the virtual points as a single point set, marking the cluster center to which each point belongs, and generating a mark data set model $mark$;

5) traversing each track in the track data set model T and using the mark data set model $mark$ to judge whether adjacent points of the track belong to the same cluster center; if not, the track is split between them, and if so, it is kept intact, generating a segmented track data set model $T_{partition}$;

6) for the segmented track data set model $T_{partition}$, calculating the loss with a dynamic sequence alignment algorithm, clustering based on information loss with an iterative track k-anonymous clustering algorithm, and generating a k-anonymous data set to serve as the data set for release.

In step 1), the trajectory data set model T is defined as follows:

Definition 1, trajectory data set model: $T = [tr_1, tr_2, tr_3, \ldots, tr_n]$, where $tr_i$ is the point set of the $i$-th trajectory, $i = 1, 2, 3, \ldots, n$; $tr_i = [p_1, p_2, p_3, \ldots, p_m]$, where $p_j$ is the $j$-th point of trajectory $tr_i$, $j = 1, 2, 3, \ldots, m$; $p_j = (p_j.X, p_j.Y) \in tr_i$, where $p_j.X$ and $p_j.Y$ are the longitude and latitude of point $p_j$ in trajectory $tr_i$.

In step 2), the maximum value and the minimum value of longitude and latitude are respectively solved through the longitude and latitude information of the track to obtain the area range, then the area is uniformly divided, and a track loss model DGH (regional generalized hierarchy) tree is established, which comprises the following steps:

2.1) respectively solving the maximum value and the minimum value of the longitude and the latitude through the longitude and latitude information of the track, and defining a rectangular area P;

2.2) for the rectangular region P, dividing the rectangular region P into Nx sections with equal length in the transverse direction and Ny sections with equal length in the longitudinal direction;

2.3) respectively establishing transverse and longitudinal DGH (regional generalized hierarchical) trees through the Nx, the Ny and the track data set model T;

wherein, the DGH (regional generalization hierarchy) tree is defined as follows:

definition 2, regional generalized hierarchical tree: the position attribute in the map is divided into a plurality of equally-long cells, the cells are used as leaf nodes to establish a full binary tree, and if the number of the leaf nodes is not enough to fill the bottom layer of the binary tree, a plurality of invalid points are added for filling.
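As an illustration of Definition 2, the following sketch sizes a full binary tree over the equal-length cells of one axis. The function names (`dgh_height`, `padded_leaf_count`, `leaf_index`, `ancestor`) and the bit-shift numbering of nodes are assumptions made for this example; they are not taken from the patent.

```python
def dgh_height(n_cells: int) -> int:
    """Height of the smallest full binary tree with at least n_cells leaves."""
    return (n_cells - 1).bit_length()


def padded_leaf_count(n_cells: int) -> int:
    """Number of bottom-layer leaves after padding with invalid points."""
    return 2 ** dgh_height(n_cells)


def leaf_index(value: float, lo: float, hi: float, n_cells: int) -> int:
    """Index of the equal-length cell of [lo, hi] that contains value."""
    width = (hi - lo) / n_cells
    idx = int((value - lo) / width)
    # Clamp so that value == hi falls into the last cell.
    return min(max(idx, 0), n_cells - 1)


def ancestor(leaf: int, level: int) -> int:
    """Index, within its own layer, of the ancestor `level` layers above a leaf."""
    return leaf >> level
```

With 100 cells this gives a tree of height 7 whose bottom layer holds 128 leaves, 28 of which are invalid padding points.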

In step 3), each track in the track data set model T is traversed and virtual points are added between each pair of adjacent points in the track, generating the track data set model $T_{virtual}$ containing the virtual points and the virtual point mark data set model $virtual$; the trajectory data set model T, the trajectory data set model $T_{virtual}$ containing virtual points, the virtual point mark data set model $virtual$ and the virtual points are defined as follows:

Definition 3, trajectory data set model: $T = [tr_1, tr_2, tr_3, \ldots, tr_n]$, where $tr_i$ is the point set of the $i$-th trajectory, $i = 1, 2, 3, \ldots, n$; $tr_i = [p_1, p_2, p_3, \ldots, p_m]$, where $p_j$ is the $j$-th point of trajectory $tr_i$, $j = 1, 2, 3, \ldots, m$; $p_j = (p_j.X, p_j.Y) \in tr_i$, where $p_j.X$ and $p_j.Y$ are the longitude and latitude of point $p_j$ in trajectory $tr_i$;

Definition 4, trajectory data set model containing virtual points: $T_{virtual} = [trv_1, trv_2, trv_3, \ldots, trv_n]$, where $trv_i$ is the point set of the $i$-th trajectory containing virtual points; $trv_i = [pv_1, pv_2, pv_3, \ldots, pv_m]$, where $pv_j$ is the $j$-th point of trajectory $trv_i$; $pv_j = (pv_j.X, pv_j.Y) \in trv_i$, where $pv_j.X$ and $pv_j.Y$ are the longitude and latitude of point $pv_j$ in trajectory $trv_i$;

Definition 5, virtual point mark data set model: $virtual = [vir_1, vir_2, vir_3, \ldots, vir_n]$, where $vir_i$ is the virtual point mark list corresponding to the $i$-th trajectory of the trajectory data set model $T_{virtual}$ containing virtual points; $vir_i = [q_1, q_2, q_3, \ldots, q_m]$, where $q_j$ is the $j$-th element of $vir_i$; $q_j = 0$ denotes a real point and $q_j = 1$ denotes a virtual point; the virtual point mark data set model $virtual$ and the trajectory data set model $T_{virtual}$ correspond one-to-one by position: $vir_i$ corresponds to $trv_i$ and $q_j$ corresponds to $pv_j$, where $q_j \in vir_i$, $pv_j \in trv_i$, and $q_j$ and $pv_j$ are the $j$-th elements of $vir_i$ and $trv_i$ respectively;

Definition 6, virtual points: for the line segment formed between adjacent points of a track, starting from one endpoint, a virtual point is added at every fixed distance along the segment, so that line segments of different lengths have the same influence on the point density of the region in which they lie.
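The virtual-point insertion of Definition 6 can be sketched as follows. This is a minimal illustration, assuming planar Euclidean distance on the raw coordinates and a hypothetical `step` parameter for the fixed distance; the function name `add_virtual_points` is not from the patent. It returns one trajectory of $T_{virtual}$ together with its flag list from $virtual$ (0 for real points, 1 for virtual points).

```python
from math import hypot


def add_virtual_points(track, step):
    """Insert a virtual point every `step` units along each segment of a track.

    track: list of (x, y) real points in time order.
    Returns (points, flags); flags[k] is 0 for a real point, 1 for a virtual one.
    """
    points, flags = [], []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        points.append((x0, y0))
        flags.append(0)
        length = hypot(x1 - x0, y1 - y0)
        n_virtual = int(length / step)
        if n_virtual * step >= length:
            n_virtual -= 1  # do not place a virtual point on the segment's end point
        for k in range(1, n_virtual + 1):
            t = k * step / length
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            flags.append(1)
    if track:
        points.append(track[-1])
        flags.append(0)
    return points, flags
```

For example, a segment of length 4 with `step = 1.0` receives three virtual points, while a segment of length 1 receives none, so longer segments contribute proportionally more points to the density.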

In step 4), the trajectory data set model $T_{virtual}$ containing virtual points is clustered as a single point set, the generated cluster centers are numbered, and the number of the cluster center to which each point belongs is recorded in the mark data set model $mark$, comprising the following steps:

4.1) clustering the trajectory data set model $T_{virtual}$ containing virtual points as a single point set to generate cluster centers, numbering the cluster centers, and recording the number of the cluster center corresponding to each point in $T_{virtual}$;

4.2) traversing the trajectory data set model $T_{virtual}$ containing virtual points, judging through the virtual point mark data set model $virtual$ whether each point is a virtual point, and, for real points, recording in the mark data set model $mark$ the number of the cluster center to which the point belongs; the trajectory data set model $T_{virtual}$ containing virtual points and the mark data set model $mark$ are defined as follows:

Definition 7, trajectory data set model containing virtual points: $T_{virtual} = [trv_1, trv_2, trv_3, \ldots, trv_n]$, where $trv_i$ is the point set of the $i$-th trajectory containing virtual points, $i = 1, 2, 3, \ldots, n$; $trv_i = [pv_1, pv_2, pv_3, \ldots, pv_m]$, where $pv_j$ is the $j$-th point of trajectory $trv_i$, $j = 1, 2, 3, \ldots, m$; $pv_j = (pv_j.X, pv_j.Y) \in trv_i$, where $pv_j.X$ and $pv_j.Y$ are the longitude and latitude of point $pv_j$ in trajectory $trv_i$;

Definition 8, mark data set model: $mark = [mar_1, mar_2, mar_3, \ldots, mar_n]$, where $mar_i$ is the mark list corresponding to the $i$-th trajectory in the mark data set model $mark$ containing virtual points; $mar_i = [z_1, z_2, z_3, \ldots, z_m]$, where $z_j$ is the $j$-th element of $mar_i$; the mark data set model $mark$ and the trajectory data set model $T_{virtual}$ containing virtual points correspond one-to-one by position: $mar_i$ corresponds to $trv_i$ and $z_j$ corresponds to $pv_j$, where $z_j \in mar_i$, $pv_j \in trv_i$, and $z_j$ and $pv_j$ are the $j$-th elements of $mar_i$ and $trv_i$ respectively.
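Steps 4.1) and 4.2) can be sketched as below. The text does not name a clustering algorithm, so plain k-means with caller-supplied initial centers is used here purely as a stand-in; the function names and the choice to keep labels only for real points (following step 4.2) are assumptions of this example.

```python
def _assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)),
            key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
        for p in points
    ]


def kmeans_labels(points, centers, iterations=10):
    """Stand-in clustering: Lloyd's k-means from the given initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        labels = _assign(points, centers)
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep an empty cluster's center where it was
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return _assign(points, centers)


def build_mark(t_virtual, virtual, init_centers):
    """Cluster all points of T_virtual together, then keep labels of real points."""
    all_points = [p for track in t_virtual for p in track]
    labels = iter(kmeans_labels(all_points, init_centers))
    mark = []
    for flags in virtual:
        track_labels = [next(labels) for _ in flags]
        mark.append([lab for lab, q in zip(track_labels, flags) if q == 0])
    return mark
```

Clustering runs over real and virtual points together, so the virtual points shape the cluster centers even though only the labels of real points are kept for the later segmentation step.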

In step 5), the track data set model T and the mark data set model $mark$ are combined to judge whether the cluster center numbers of adjacent points of each track in T are the same; if they differ, the track is split between the two points, and if they are the same, the track is kept unchanged, generating the segmented track data set model $T_{partition}$; the track data set model T, the mark data set model $mark$ and the segmented track data set model $T_{partition}$ are defined as follows:

Definition 9, trajectory data set model: $T = [tr_1, tr_2, tr_3, \ldots, tr_n]$, where $tr_i$ is the point set of the $i$-th trajectory, $i = 1, 2, 3, \ldots, n$; $tr_i = [p_1, p_2, p_3, \ldots, p_m]$, where $p_j$ is the $j$-th point of trajectory $tr_i$, $j = 1, 2, 3, \ldots, m$; $p_j = (p_j.X, p_j.Y) \in tr_i$, where $p_j.X$ and $p_j.Y$ are the longitude and latitude of point $p_j$ in trajectory $tr_i$;

Definition 10, mark data set model: $mark = [mar_1, mar_2, mar_3, \ldots, mar_n]$, where $mar_i$ is the mark list corresponding to the $i$-th trajectory in the mark data set model $mark$ containing virtual points; $mar_i = [z_1, z_2, z_3, \ldots, z_m]$, where $z_j$ is the $j$-th element of $mar_i$; the mark data set model $mark$ and the trajectory data set model $T_{virtual}$ containing virtual points correspond one-to-one by position: $mar_i$ corresponds to $trv_i$ and $z_j$ corresponds to $pv_j$, where $z_j \in mar_i$, $pv_j \in trv_i$, and $z_j$ and $pv_j$ are the $j$-th elements of $mar_i$ and $trv_i$ respectively; the trajectory data set model $T_{virtual}$ containing virtual points is defined as follows:

Definition 11, trajectory data set model containing virtual points: $T_{virtual} = [trv_1, trv_2, trv_3, \ldots, trv_n]$, where $trv_i$ is the point set of the $i$-th trajectory containing virtual points; $trv_i = [pv_1, pv_2, pv_3, \ldots, pv_m]$, where $pv_j$ is the $j$-th point of trajectory $trv_i$; $pv_j = (pv_j.X, pv_j.Y) \in trv_i$, where $pv_j.X$ and $pv_j.Y$ are the longitude and latitude of point $pv_j$ in trajectory $trv_i$;

Definition 12, segmented trajectory data set model: $T_{partition} = [trp_1, trp_2, trp_3, \ldots, trp_n]$, where $trp_i$ is the point set of the $i$-th segmented trajectory; $trp_i = [pp_1, pp_2, pp_3, \ldots, pp_m]$, where $pp_j$ is the $j$-th point of trajectory $trp_i$; $pp_j = (pp_j.X, pp_j.Y) \in trp_i$, where $pp_j.X$ and $pp_j.Y$ are the longitude and latitude of point $pp_j$ in trajectory $trp_i$.
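The segmentation rule of step 5) can be sketched as a single pass over a track and its list of cluster center numbers. The function name `split_track` is hypothetical, and the sketch assumes the label list holds one entry per real point of the track.

```python
def split_track(track, labels):
    """Split a track wherever two adjacent points have different cluster labels."""
    if not track:
        return []
    segments = []
    current = [track[0]]
    for prev_label, label, point in zip(labels, labels[1:], track[1:]):
        if label != prev_label:
            segments.append(current)  # close the segment at a label change
            current = []
        current.append(point)
    segments.append(current)
    return segments
```

Applied to the label sequence `[27, 27, 8, 8, 33, 33, 33]`, the track is cut into three sub-tracks of lengths 2, 2 and 3.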

In step 6), the tracks of the segmented trajectory data set model $T_{partition}$ are clustered with the iterative track k-anonymous clustering algorithm; the segmented trajectory data set model $T_{partition}$, information loss, the dynamic sequence alignment algorithm, the progressive sequence alignment algorithm and the iterative track k-anonymous clustering algorithm are defined as follows:

Definition 13, segmented trajectory data set model: $T_{partition} = [trp_1, trp_2, trp_3, \ldots, trp_n]$, where $trp_i$ is the point set of the $i$-th segmented trajectory, $i = 1, 2, 3, \ldots, n$; $trp_i = [pp_1, pp_2, pp_3, \ldots, pp_m]$, where $pp_j$ is the $j$-th point of trajectory $trp_i$, $j = 1, 2, 3, \ldots, m$; $pp_j = (pp_j.X, pp_j.Y) \in trp_i$, where $pp_j.X$ and $pp_j.Y$ are the longitude and latitude of point $pp_j$ in trajectory $trp_i$;

Definition 14, information loss: the loss incurred when a node is generalized to its parent node or to a higher-level node. The information loss of generalizing $node_j$ to $node_i$ is calculated as:

$$Loss(node_i, node_j) = \log_2(LF(node_i)) - \log_2(LF(node_j)) \quad (1)$$

where $Loss(node_i, node_j)$ is the information loss incurred when $node_j$ is generalized to $node_i$, $node_i$ and $node_j$ are the numbers of the two nodes, and the function $LF()$ returns the number of bottom-layer leaf nodes under a node;
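Formula (1) can be evaluated directly once each node is identified by how many layers it sits above the leaves. The sketch below assumes the full binary tree of Definition 2, so that a node `level` layers above the leaves owns `2 ** level` leaves; the function names are illustrative only.

```python
from math import log2


def leaf_count(level: int) -> int:
    """LF(): bottom-layer leaves under a node `level` layers above the leaves."""
    return 2 ** level


def generalization_loss(ancestor_level: int, node_level: int) -> float:
    """Formula (1): log2(LF(ancestor)) - log2(LF(node))."""
    return log2(leaf_count(ancestor_level)) - log2(leaf_count(node_level))
```

Under this assumption the loss equals the number of layers climbed: generalizing a leaf to a node three layers up costs 3, and leaving a node unchanged costs 0.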

Definition 15, dynamic sequence alignment algorithm: the algorithm acts on any two tracks A and B, where the length of A is $a$ and the length of B is $b$, and uses dynamic programming with the following recurrence:

$$dp[i][j] = \min\big(dp[i-1][j-1] + Loss(node_i, node_j),\; dp[i-1][j] + Loss(node_i, node_{root}),\; dp[i][j-1] + Loss(node_j, node_{root})\big) \quad (2)$$

where $node_i$ and $node_j$ are the numbers of two nodes, $node_{root}$ is the root node, and $dp[][]$ is a two-dimensional matrix of size $(a+1) \times (b+1)$; $dp[i][j]$ is the entry in row $(i+1)$, column $(j+1)$ of the matrix, $dp[i-1][j-1]$ is the entry in row $i$, column $j$, $dp[i-1][j]$ is the entry in row $i$, column $(j+1)$, and $dp[i][j-1]$ is the entry in row $(i+1)$, column $j$; the recurrence yields a sequence alignment loss matrix $dp[][]$ of size $(a+1) \times (b+1)$, and backtracking through this matrix finds the strategy that minimizes the loss of merging the two tracks and generates the merged track;
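The recurrence of formula (2) can be sketched for one axis as follows. Several details are assumptions of this example rather than statements of the patent: tracks are lists of leaf cell indices in a full binary tree of the given `height`; matching two points costs the loss of generalizing both leaves to their lowest common ancestor; leaving a point unmatched costs the loss of generalizing it to the root; and the first row and column are initialized with accumulated root-generalization costs. Backtracking to build the merged track is omitted.

```python
def lca_level(a: int, b: int) -> int:
    """Layers above the leaves of the lowest common ancestor of leaves a and b."""
    level = 0
    while a != b:
        a >>= 1
        b >>= 1
        level += 1
    return level


def alignment_matrix(A, B, height):
    """Fill the (a+1) x (b+1) sequence alignment loss matrix dp."""
    a, b = len(A), len(B)
    gap = float(height)  # assumed: an unmatched leaf is generalized to the root
    dp = [[0.0] * (b + 1) for _ in range(a + 1)]
    for i in range(1, a + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, b + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, a + 1):
        for j in range(1, b + 1):
            # assumed: both leaves rise to their common ancestor
            match = 2.0 * lca_level(A[i - 1], B[j - 1])
            dp[i][j] = min(dp[i - 1][j - 1] + match,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return dp
```

The bottom-right entry `dp[a][b]` is the minimum alignment loss of the two tracks under these assumptions; identical tracks align at zero loss.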

Definition 16, Progressive Sequence Alignment (PSA) algorithm: the longest track in a group of tracks is selected as the base track; the remaining tracks of the group are then selected one at a time in any order, each track being selected only once, and merged with the base track; the track produced by each dynamic sequence alignment merge becomes the new base track;
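The control flow of Definition 16 can be sketched independently of how two tracks are merged. In the sketch below, `merge` is a caller-supplied function; the `toy_merge` shown is only a position-wise placeholder that replaces disagreeing or missing positions with `"*"`, and is not the dynamic sequence alignment merge described in the text.

```python
from itertools import zip_longest


def progressive_alignment(tracks, merge):
    """Fold a group of tracks into one, starting from the longest track."""
    order = sorted(range(len(tracks)), key=lambda i: len(tracks[i]), reverse=True)
    base = tracks[order[0]]
    for i in order[1:]:  # each remaining track is used exactly once
        base = merge(base, tracks[i])  # the merged track becomes the new base
    return base


def toy_merge(a, b):
    """Placeholder merge: keep agreeing positions, generalize the rest to '*'."""
    return [x if x == y else "*" for x, y in zip_longest(a, b, fillvalue="*")]
```

For instance, merging `[1, 2]`, `[1, 2, 3]` and `[1, 5, 3]` with `toy_merge` yields `[1, "*", "*"]`: only the first position agrees across all three tracks.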

definition 17, Iterative track k anonymous Clustering algorithm (Iterative track Clustering): in order to realize k-anonymity of the track, firstly, the number of generated clusters is determined according to a k value, empty clusters are created, and then, the following operations are performed by traversing each cluster: randomly extracting a track from the track set and placing the track into a cluster as a first track, performing a dynamic sequence alignment algorithm on all the remaining tracks and the track one by one to calculate the information loss of track alignment, and selecting k-1 tracks with the minimum information loss to place the k-1 tracks into the cluster; and after completing the operation of adding the tracks into all the clusters, calculating the final generalization loss of each track cluster through a progressive sequence alignment algorithm, and generating a k-anonymous track data set.
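The cluster-building loop of Definition 17 can be sketched as follows. The pairwise loss is passed in as `pair_loss`, standing in for the dynamic sequence alignment loss; the seeded random generator, the use of `len(tracks) // k` clusters, and returning leftover tracks separately are assumptions of this example, since the text does not state how leftovers are handled. The final generalization by progressive sequence alignment is not shown.

```python
import random


def iterative_k_clustering(tracks, k, pair_loss, seed=0):
    """Group track indices into clusters of size k by minimum pairwise loss."""
    rng = random.Random(seed)
    remaining = list(range(len(tracks)))
    clusters = []
    for _ in range(len(tracks) // k):
        # Randomly draw the first track of the cluster.
        first = remaining.pop(rng.randrange(len(remaining)))
        # Rank the other tracks by their alignment loss against it.
        remaining.sort(key=lambda t: pair_loss(tracks[first], tracks[t]))
        members, remaining = remaining[:k - 1], remaining[k - 1:]
        clusters.append([first] + members)
    return clusters, remaining
```

With toy one-dimensional "tracks" `[0, 1, 10, 11]`, `k = 2` and absolute difference as the loss, the two clusters are always `{0, 1}` and `{2, 3}` regardless of which track is drawn first.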

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The invention is the first to segment individual tracks based on the overall point density of the data set, and can effectively integrate data sets from different sources.

2. According to the invention, unnecessary information loss in the track anonymization process is reduced by cutting the long track and the zigzag track in the data set.

3. By segmenting tracks based on point density, the invention addresses the low data availability and low anonymization efficiency encountered when anonymizing medium-length and long tracks.

4. According to the method, the track data set is segmented through analysis of the point density, so that the phenomenon that tracks with overlarge length difference form a cluster in the track clustering process is avoided, and the running time of the clustering process is obviously reduced.

5. The method has wide use space in the field of data publishing, has strong adaptability to data sets from different sources, has high availability and short running time, and has wide prospect in the field of track privacy protection.

Drawings

FIG. 1 is a logic flow diagram of the method of the present invention.

FIG. 2(a) is a flow chart of the data preprocessing of the present invention.

Fig. 2(b) is a flow chart of the invention for building a DGH (regional generalized hierarchical) tree.

FIG. 2(c) is a flow chart showing the pretreatment process of the present invention.

FIG. 3 is a track clustering flow chart of the present invention.

FIG. 4 is a road network model graph constructed by experimentally selected trajectories according to the present invention.

FIG. 5 is a graph of the results of the segmentation of the k-anonymous data set from the experiments of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

As shown in fig. 1 to fig. 3, the method of this embodiment for realizing k-anonymity of track data release based on point-density track segmentation uses auxiliary devices such as mobile phone application software, vehicle-mounted signal units, roadside positioning signal units and a cloud server, and comprises the following steps:

1) acquiring basic track data including longitude and latitude information of a track and a time sequence relation of a track point set, and establishing a track data set model T; the track data are 270 user movement tracks taken from the Geolife data set within an area of about 1 km × 1 km in Beijing (longitude 116.300000° to 116.316000°, latitude 39.989500° to 40.000000°); the road network model formed by these 270 tracks is shown in fig. 4.

The trajectory dataset model T is defined as follows:

Definition 1, trajectory data set model: $T = [tr_1, tr_2, tr_3, \ldots, tr_n]$, where $tr_i$ is the point set of the $i$-th trajectory, $i = 1, 2, 3, \ldots, n$; $tr_i = [p_1, p_2, p_3, \ldots, p_m]$, where $p_j$ is the $j$-th point of trajectory $tr_i$, $j = 1, 2, 3, \ldots, m$; $p_j = (p_j.X, p_j.Y) \in tr_i$, where $p_j.X$ and $p_j.Y$ are the longitude and latitude of point $p_j$ in trajectory $tr_i$.

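2) Building the DGH (regional generalized hierarchy) tree of the track loss model from the longitude and latitude information of the tracks: the maximum and minimum values of longitude and latitude are found to obtain the area range, the area is then divided uniformly, and the DGH tree is built, comprising the following steps:

2.1) finding the maximum and minimum values of longitude and latitude from the longitude and latitude information of the tracks, and defining a rectangular area P;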

2.2) for the rectangular region P, dividing the rectangular region P into Nx sections with equal length in the transverse direction and dividing the rectangular region P into Ny sections with equal length in the longitudinal direction;

2.3) respectively establishing transverse and longitudinal DGH (regional generalized hierarchical) trees by the Nx, Ny and the track data set model T;

wherein, the DGH (regional generalized hierarchical) tree is defined as follows:

definition 2, regional generalization hierarchical tree: the position attribute in the map is divided into a plurality of equally-long cells, the cells are used as leaf nodes to establish a full binary tree, and if the number of the leaf nodes is not enough to fill the bottom layer of the binary tree, a plurality of invalid points are added for filling;

According to the selected number of divisions N and the longitude and latitude ranges $x_1 \sim x_2$ and $y_1 \sim y_2$, the height H of the DGH tree and the cell size d are calculated:

$$H = \log_2 N$$

(rounded up to an integer). With $N = 100$, longitude $x_1 = 116.300000°$, $x_2 = 116.316000°$ and latitude $y_1 = 39.989500°$, $y_2 = 40.000000°$, the calculation yields $H = 7$, $d_x = 0.000160$, $d_y = 0.000105$.
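The figures of this embodiment can be checked with a short worked calculation. It assumes that the tree height is rounded up to the next integer and that the cell size along each axis is the coordinate span divided by N; the cell-size formula itself is not reproduced in the text, so this is an inferred reading.

$$H = \lceil \log_2 100 \rceil = \lceil 6.64 \rceil = 7$$

$$\frac{x_2 - x_1}{N} = \frac{116.316000 - 116.300000}{100} = 0.000160, \qquad \frac{y_2 - y_1}{N} = \frac{40.000000 - 39.989500}{100} = 0.000105$$

A tree of height 7 has $2^7 = 128$ leaves, so 28 invalid points pad the bottom layer beyond the 100 real cells, in line with Definition 2.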

3) Adding virtual points between adjacent points of each track in the track data set model T to generate a track data set model $T_{virtual}$ containing the virtual points and a virtual point mark data set model $virtual$;

Each track in the track data set model T is traversed and virtual points are added between each pair of adjacent points in the track, generating the track data set model $T_{virtual}$ containing the virtual points and the virtual point mark data set model $virtual$; the trajectory data set model T, the trajectory data set model $T_{virtual}$ containing virtual points, the virtual point mark data set model $virtual$ and the virtual points are defined as follows:

Definition 3, trajectory data set model: $T = [tr_1, tr_2, tr_3, \ldots, tr_n]$, where $tr_i$ is the point set of the $i$-th trajectory, $i = 1, 2, 3, \ldots, n$; $tr_i = [p_1, p_2, p_3, \ldots, p_m]$, where $p_j$ is the $j$-th point of trajectory $tr_i$, $j = 1, 2, 3, \ldots, m$; $p_j = (p_j.X, p_j.Y) \in tr_i$, where $p_j.X$ and $p_j.Y$ are the longitude and latitude of point $p_j$ in trajectory $tr_i$;

Definition 4, trajectory data set model containing virtual points: $T_{virtual} = [trv_1, trv_2, trv_3, \ldots, trv_n]$, where $trv_i$ is the point set of the $i$-th trajectory containing virtual points; $trv_i = [pv_1, pv_2, pv_3, \ldots, pv_m]$, where $pv_j$ is the $j$-th point of trajectory $trv_i$; $pv_j = (pv_j.X, pv_j.Y) \in trv_i$, where $pv_j.X$ and $pv_j.Y$ are the longitude and latitude of point $pv_j$ in trajectory $trv_i$;

Definition 5, virtual point mark data set model: $virtual = [vir_1, vir_2, vir_3, \ldots, vir_n]$, where $vir_i$ is the virtual point mark list corresponding to the $i$-th trajectory of the trajectory data set model $T_{virtual}$ containing virtual points; $vir_i = [q_1, q_2, q_3, \ldots, q_m]$, where $q_j$ is the $j$-th element of $vir_i$; $q_j = 0$ denotes a real point and $q_j = 1$ denotes a virtual point; the virtual point mark data set model $virtual$ and the trajectory data set model $T_{virtual}$ correspond one-to-one by position: $vir_i$ corresponds to $trv_i$ and $q_j$ corresponds to $pv_j$, where $q_j \in vir_i$, $pv_j \in trv_i$, and $q_j$ and $pv_j$ are the $j$-th elements of $vir_i$ and $trv_i$ respectively;

Definition 6, virtual points: for the line segment formed between adjacent points of a track, starting from one endpoint, a virtual point is added at every fixed distance along the segment, so that line segments of different lengths have the same influence on the point density of the region in which they lie;

In this embodiment, the $virtual$ obtained in the experiment is [[0,1,1,1,0,1,1,0,1,1,0,1,1,0,1,0,...],...].

4) Clustering the trajectory data set model $T_{virtual}$ containing virtual points as a single point set, marking the cluster center to which each point belongs, and generating a mark data set model $mark$;

The trajectory data set model $T_{virtual}$ containing virtual points is clustered as a single point set, the generated cluster centers are numbered, and the number of the cluster center to which each point belongs is recorded in the mark data set model $mark$, comprising the following steps:

4.1) clustering the trajectory data set model $T_{virtual}$ containing virtual points as a single point set to generate cluster centers, numbering the cluster centers, and recording the number of the cluster center corresponding to each point in $T_{virtual}$;

4.2) traversing the trajectory data set model $T_{virtual}$ containing virtual points, judging through the virtual point mark data set model $virtual$ whether each point is a virtual point, and, for real points, recording in the mark data set model $mark$ the number of the cluster center to which the point belongs; the trajectory data set model $T_{virtual}$ containing virtual points and the mark data set model $mark$ are defined as follows:

Definition 7, trajectory data set model containing virtual points: $T_{virtual} = [trv_1, trv_2, trv_3, \ldots, trv_n]$, where $trv_i$ is the point set of the $i$-th trajectory containing virtual points, $i = 1, 2, 3, \ldots, n$; $trv_i = [pv_1, pv_2, pv_3, \ldots, pv_m]$, where $pv_j$ is the $j$-th point of trajectory $trv_i$, $j = 1, 2, 3, \ldots, m$; $pv_j = (pv_j.X, pv_j.Y) \in trv_i$, where $pv_j.X$ and $pv_j.Y$ are the longitude and latitude of point $pv_j$ in trajectory $trv_i$;

Definition 8, mark data set model: $mark = [mar_1, mar_2, mar_3, \ldots, mar_n]$, where $mar_i$ is the mark list corresponding to the $i$-th trajectory in the mark data set model $mark$ containing virtual points; $mar_i = [z_1, z_2, z_3, \ldots, z_m]$, where $z_j$ is the $j$-th element of $mar_i$; the mark data set model $mark$ and the trajectory data set model $T_{virtual}$ containing virtual points correspond one-to-one by position: $mar_i$ corresponds to $trv_i$ and $z_j$ corresponds to $pv_j$, where $z_j \in mar_i$, $pv_j \in trv_i$, and $z_j$ and $pv_j$ are the $j$-th elements of $mar_i$ and $trv_i$ respectively;

In this embodiment, the $mark$ obtained in the experiment is [[27,27,8,8,33,33,33,33,33,33,33,...],...].

5) The track data set model T and the mark data set model $mark$ are combined to judge whether the cluster center numbers of adjacent points of each track in T are the same; if they differ, the track is split between the two points, and if they are the same, the track is kept unchanged, generating the segmented track data set model $T_{partition}$; the track data set model T, the mark data set model $mark$ and the segmented track data set model $T_{partition}$ are defined as follows:

definition 9, trajectory data set model: t ═ tr ₁ ,tr2,tr ₃ ,...,tr _n ]Wherein tri represents a point set of the ith track, wherein i is 1, 2, 3, …, n; tr is _i ＝[p ₁ ,p ₂ ,p ₃ ,...,p _m ]Wherein p is _j Representative locus tr _i Wherein j is 1, 2, 3, …, m; p is a radical of _j ＝(p _j .X,p _j .Y)∈tr _i Wherein p is _j .X、p _j Y is the trace tr _i Midpoint p _j Longitude and latitude of (c);

definition 10: labeling the data set model: mark ═ mark [ [ mark ₁ ,mar ₂ ,mar ₃ ,...,mar _n ]Wherein mar _i The method comprises the steps that a virtual point mark list corresponding to the ith track in a mark data set model mark containing virtual points is obtained; mar _i ＝[z ₁ ,z ₂ ,z ₃ ,...,z _m ]，z _j Represents mar _i J-th point in (1); mark data set model mark and track data set model T containing virtual point _virtual There is a one-to-one mapping relationship over locations, e.g. mar _i Corresponds to trv _i ，z _j Corresponding to pv _j Wherein z is _j ∈mar _i ，pv _j ∈trv _i And z is _j And pv _j Respectively represent mar _i And trv _i The j-th number in (1); wherein, the trajectory data set model T containing virtual points _virtual The definition is as follows:

Definition 11, track data set model containing virtual points: T_virtual = [trv_1, trv_2, trv_3, ..., trv_n], where trv_i represents the point set of the ith track containing virtual points; trv_i = [pv_1, pv_2, pv_3, ..., pv_m], where pv_j represents the jth point of track trv_i; pv_j = (pv_j.X, pv_j.Y) ∈ trv_i, where pv_j.X and pv_j.Y are the longitude and latitude of point pv_j in track trv_i;

Definition 12, segmented track data set model: T_partition = [trp_1, trp_2, trp_3, ..., trp_n], where trp_i represents the point set of the ith segmented track; trp_i = [pp_1, pp_2, pp_3, ..., pp_m], where pp_j represents the jth point of track trp_i; pp_j = (pp_j.X, pp_j.Y) ∈ trp_i, where pp_j.X and pp_j.Y are the longitude and latitude of point pp_j in track trp_i;

The T_partition obtained in the experiment of this example is [[27,27],[8,8],[33,33,33,33,33,33,33,……],……].
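The segmentation rule of step 5) can be illustrated with a minimal Python sketch. It assumes that each track comes with a list of cluster center numbers aligned point by point with the track; the function names `split_track` and `partition_dataset` and the sample coordinates are hypothetical and do not appear in the patent.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def split_track(track: Sequence[Point], marks: Sequence[int]) -> List[List[Point]]:
    """Cut one track wherever two adjacent points carry different cluster numbers."""
    if len(track) != len(marks):
        raise ValueError("track and marks must have the same length")
    segments: List[List[Point]] = []
    previous = None  # cluster number of the previous point
    for point, label in zip(track, marks):
        if segments and label == previous:
            segments[-1].append(point)  # same cluster: extend the current segment
        else:
            segments.append([point])    # different cluster: start a new segment
        previous = label
    return segments


def partition_dataset(T, mark):
    """Apply split_track to every track and collect all segments in order."""
    return [seg for track, marks in zip(T, mark) for seg in split_track(track, marks)]


# Hypothetical five-point track whose cluster numbers follow the pattern 27, 27, 8, 8, 33.
track = [(113.30, 23.10), (113.31, 23.11), (113.40, 23.20), (113.41, 23.21), (113.50, 23.30)]
segments = split_track(track, [27, 27, 8, 8, 33])
print([len(s) for s in segments])  # [2, 2, 1]
```

With the cluster numbers 27, 27, 8, 8, 33 the track is cut into three segments, mirroring the grouping [27,27], [8,8], [33,...] in the experimental T_partition above.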

6) For the segmented track data set model T_partition, the loss is calculated with a dynamic sequence alignment algorithm, and clustering based on information loss is performed with an iterative track k-anonymous clustering algorithm to generate a k-anonymous data set, which serves as the data set for data release. The k-anonymous data set is shown in Figure 5, where tracks in the same color region form one k-anonymous set for track data release.

The tracks of the segmented track data set model T_partition are clustered with the iterative track k-anonymous clustering algorithm. The segmented track data set model T_partition, the information loss, the dynamic sequence alignment algorithm, the progressive sequence alignment algorithm, and the iterative track k-anonymous clustering algorithm are defined as follows:

Definition 13, segmented track data set model: T_partition = [trp_1, trp_2, trp_3, ..., trp_n], where trp_i represents the point set of the ith segmented track, i = 1, 2, 3, ..., n; trp_i = [pp_1, pp_2, pp_3, ..., pp_m], where pp_j represents the jth point of track trp_i, j = 1, 2, 3, ..., m; pp_j = (pp_j.X, pp_j.Y) ∈ trp_i, where pp_j.X and pp_j.Y are the longitude and latitude of point pp_j in track trp_i;

Definition 14, information loss: the loss incurred when a node is generalized to its parent node or to a higher-level node. The information loss of generalizing node_j to node_i is calculated as:

$$\mathrm{Loss}(\mathrm{node}_i, \mathrm{node}_j) = \log_2(\mathrm{LF}(\mathrm{node}_i)) - \log_2(\mathrm{LF}(\mathrm{node}_j)) \tag{1}$$

In the formula, Loss(node_i, node_j) is the information loss caused by generalizing node_j to node_i, node_i and node_j are the numbers of the two nodes, and the function LF() returns the number of bottom-level leaf nodes owned by a node;
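Formula (1) can be checked numerically with a short Python sketch. The node numbering is an assumption made only for illustration: the full binary DGH tree is numbered in heap order (root = 1, children of node n are 2n and 2n+1), and `height` is the level of the leaves; the helper names `leaf_count` and `loss` are hypothetical.

```python
import math


def leaf_count(node: int, height: int) -> int:
    """LF(): number of bottom-level leaves under `node` in a full binary tree.

    Assumes heap-order numbering (root = 1), so the level of a node is
    floor(log2(node)); this numbering is an illustrative assumption.
    """
    level = node.bit_length() - 1
    if node < 1 or level > height:
        raise ValueError("node is outside the tree")
    return 2 ** (height - level)


def loss(node_i: int, node_j: int, height: int) -> float:
    """Formula (1): loss of generalizing node_j to its ancestor node_i."""
    return math.log2(leaf_count(node_i, height)) - math.log2(leaf_count(node_j, height))


# Tree with 8 leaves (height 3): leaves are nodes 8..15, root is node 1.
print(loss(1, 8, 3))  # leaf generalized to the root: 3.0
print(loss(2, 4, 3))  # node generalized to its parent: 1.0
```

In a full binary tree each level of generalization doubles the number of leaves covered, so under this numbering the loss equals the number of levels climbed.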

Definition 15, dynamic sequence alignment algorithm: the algorithm acts on any two tracks A and B, where A has length a and B has length b; it uses dynamic programming, and the recurrence equation is as follows:

dp[i][j]＝min(dp[i-1][j-1]+Loss(node _i ,node _j ),

In the formula, node_i and node_j are the numbers of two nodes, node_root represents the root node, and dp[][] is a two-dimensional matrix of size (a+1) × (b+1); dp[i][j] represents the element in row (i+1), column (j+1) of dp[][]; dp[i-1][j-1] represents the element in row i, column j; dp[i-1][j] represents the element in row i, column (j+1); dp[i][j-1] represents the element in row (i+1), column j. The recurrence equation yields the (a+1) × (b+1) sequence alignment loss matrix dp[][]; backtracking through this matrix finds the strategy that minimizes the loss of merging the two tracks and generates the merged track;
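The following Python sketch shows how such an alignment loss matrix might be filled in. The recurrence as printed gives only the diagonal term dp[i-1][j-1] + Loss(node_i, node_j); the two gap terms built from dp[i-1][j] and dp[i][j-1] are an assumption inferred from the matrix entries named in the explanation, and their costs are therefore left to caller-supplied functions. The names `alignment_loss_matrix`, `match_cost` and `gap_cost` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

N = TypeVar("N")


def alignment_loss_matrix(
    a: Sequence[N],
    b: Sequence[N],
    match_cost: Callable[[N, N], float],
    gap_cost: Callable[[N], float],
) -> List[List[float]]:
    """Fill the (len(a)+1) x (len(b)+1) alignment loss matrix dp.

    match_cost(x, y): loss of merging point x of A with point y of B.
    gap_cost(x): loss of leaving point x unmatched (ASSUMED term; the
    printed recurrence is truncated before any such term appears).
    """
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):  # first column: all of A unmatched so far
        dp[i][0] = dp[i - 1][0] + gap_cost(a[i - 1])
    for j in range(1, cols):  # first row: all of B unmatched so far
        dp[0][j] = dp[0][j - 1] + gap_cost(b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j - 1] + match_cost(a[i - 1], b[j - 1]),  # merge both points
                dp[i - 1][j] + gap_cost(a[i - 1]),                  # skip a point of A
                dp[i][j - 1] + gap_cost(b[j - 1]),                  # skip a point of B
            )
    return dp


# Toy example with integers standing in for node numbers.
dp = alignment_loss_matrix([1, 2, 3], [1, 3], lambda x, y: abs(x - y), lambda x: 2.0)
print(dp[-1][-1])  # total alignment loss: 2.0
```

The bottom-right entry holds the minimum total loss; the backtracking step that reconstructs the merged track is omitted here.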

Definition 17, iterative track k-anonymous clustering algorithm (Iterative Track Clustering): to achieve k-anonymity of the tracks, the number of clusters to generate is first determined from the value of k and empty clusters are created; then each cluster is traversed and the following operations are performed: a track is randomly drawn from the track set and placed into the cluster as its first track; the dynamic sequence alignment algorithm is run between this track and each of the remaining tracks, one by one, to calculate the information loss of aligning them; and the k-1 tracks with the smallest information loss are placed into the cluster. After tracks have been added to all clusters, the final generalization loss of each track cluster is calculated with the progressive sequence alignment algorithm, and the k-anonymous track data set is generated.
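The cluster-filling loop of Definition 17 can be sketched in Python as follows. Several details are assumptions: the number of clusters is taken as the number of tracks divided by k (rounded down), tracks left over when the count is not a multiple of k are simply returned separately because the text does not say how they are handled, and the pairwise loss is supplied by the caller in place of the dynamic sequence alignment. The names `iterative_k_clusters` and `pair_loss` are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Track = TypeVar("Track")


def iterative_k_clusters(
    tracks: Sequence[Track],
    k: int,
    pair_loss: Callable[[Track, Track], float],
    seed: Optional[int] = None,
) -> Tuple[List[List[Track]], List[Track]]:
    """Greedily build clusters of k tracks; returns (clusters, leftover tracks)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    remaining = list(tracks)
    clusters: List[List[Track]] = []
    for _ in range(len(remaining) // k):  # ASSUMED cluster count: n // k
        # Randomly draw the first track of the cluster.
        first = remaining.pop(rng.randrange(len(remaining)))
        # Rank the other tracks by their alignment loss against the first track.
        remaining.sort(key=lambda t: pair_loss(first, t))
        # Take the k-1 tracks with the smallest loss.
        clusters.append([first] + remaining[: k - 1])
        remaining = remaining[k - 1:]
    return clusters, remaining


# Toy example: numbers stand in for tracks, absolute difference for alignment loss.
clusters, leftover = iterative_k_clusters([0, 1, 2, 10, 11, 12], 3, lambda x, y: abs(x - y), seed=0)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2], [10, 11, 12]]
```

The final step of the definition, computing each cluster's generalization loss by progressive sequence alignment, is not covered by this sketch.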

In conclusion, the invention provides a new method for track privacy protection: the tracks are segmented according to point density before the iterative track k-anonymous clustering is performed, which reduces the information loss of the track clustering procedure and effectively protects the track privacy of users. The method has practical value and is worth popularizing.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A method for realizing k anonymity of track data distribution based on a point density segmentation track is characterized by comprising the following steps:

2) building a track loss model DGH tree by utilizing longitude and latitude information of a track; wherein, the DGH tree is defined as follows:

Definition 2, regional generalization hierarchy tree: the position attribute in the map is represented by a DGH tree; the region is divided into a plurality of cells of equal length, and a full binary tree is then built with these cells as leaf nodes; if the number of leaf nodes is not enough to fill the bottom layer of the binary tree, invalid points are added as padding;

4) clustering the track data set model containing virtual points T_virtual by treating all of its points as a single point set, marking the cluster center to which each point belongs, and generating the mark data set model mark;

6) for the segmented track data set model T_partition, calculating loss by adopting a dynamic sequence alignment algorithm, clustering based on information loss by using an iterative track k-anonymous clustering algorithm, and generating a k-anonymous data set as the data set for data release;

$$\mathrm{Loss}(\mathrm{node}_i, \mathrm{node}_j) = \log_2(\mathrm{LF}(\mathrm{node}_i)) - \log_2(\mathrm{LF}(\mathrm{node}_j)) \tag{1}$$

in the formula, node_i and node_j are the numbers of two nodes, node_root represents the root node, and dp[][] is a two-dimensional matrix of size (a+1) × (b+1); dp[i][j] represents the element in row (i+1), column (j+1) of dp[][]; dp[i-1][j-1] represents the element in row i, column j; dp[i-1][j] represents the element in row i, column (j+1); dp[i][j-1] represents the element in row (i+1), column j; the recurrence equation yields the (a+1) × (b+1) sequence alignment loss matrix dp[][], and backtracking through this matrix finds the strategy that minimizes the loss of merging the two tracks and generates the merged track;

Definition 17, iterative track k-anonymous clustering algorithm: to achieve k-anonymity of the tracks, the number of clusters to generate is first determined from the value of k and empty clusters are created; then each cluster is traversed and the following operations are performed: a track is randomly drawn from the track set and placed into the cluster as its first track; the dynamic sequence alignment algorithm is run between this track and each of the remaining tracks, one by one, to calculate the information loss of aligning them; and the k-1 tracks with the smallest information loss are placed into the cluster; after tracks have been added to all clusters, the final generalization loss of each track cluster is calculated with the progressive sequence alignment algorithm, and the k-anonymous track data set is generated.

2. The method for realizing k anonymity of track data distribution based on the point density segmentation track as claimed in claim 1, wherein: in step 1), the trajectory dataset model T is defined as follows:

Definition 1, track data set model: T = [tr_1, tr_2, tr_3, ..., tr_n], where tr_i represents the point set of the ith track, i = 1, 2, 3, ..., n; tr_i = [p_1, p_2, p_3, ..., p_m], where p_j represents the jth point of track tr_i, j = 1, 2, 3, ..., m; p_j = (p_j.X, p_j.Y) ∈ tr_i, where p_j.X and p_j.Y are the longitude and latitude of point p_j in track tr_i.

3. The method for realizing k anonymity of track data distribution based on the point density segmentation track as claimed in claim 1, wherein: in the step 2), the maximum value and the minimum value of the longitude and the latitude are respectively solved through the longitude and latitude information of the track, so as to obtain the range of the area, and then the area is uniformly divided, so as to establish a track loss model DGH tree, which comprises the following steps:

2.3) establishing transverse and longitudinal DGH trees respectively through the Nx, Ny and the track data set model T.

4. The method for realizing k anonymity of track data distribution based on the point density segmentation track as claimed in claim 1, wherein: in step 3), each track in the track data set model T is traversed and a virtual point is added between each pair of adjacent points in the track, generating the track data set model containing virtual points T_virtual and the virtual point mark data set model virtual; the track data set model T, the track data set model containing virtual points T_virtual, the virtual point mark data set model virtual, and the virtual points are defined as follows:

Definition 3, track data set model: T = [tr_1, tr_2, tr_3, ..., tr_n], where tr_i represents the point set of the ith track, i = 1, 2, 3, ..., n; tr_i = [p_1, p_2, p_3, ..., p_m], where p_j represents the jth point of track tr_i, j = 1, 2, 3, ..., m; p_j = (p_j.X, p_j.Y) ∈ tr_i, where p_j.X and p_j.Y are the longitude and latitude of point p_j in track tr_i;

Definition 5, virtual point mark data set model: virtual = [vir_1, vir_2, vir_3, ..., vir_n], where vir_i is the virtual point mark list corresponding to the ith track of the track data set model containing virtual points T_virtual; vir_i = [q_1, q_2, q_3, ..., q_m], where q_j represents the jth element of vir_i; q_j = 0 indicates a real point and q_j = 1 indicates a virtual point; the virtual point mark data set model virtual and the track data set model containing virtual points T_virtual have a one-to-one positional mapping: vir_i corresponds to trv_i and q_j corresponds to pv_j, where q_j ∈ vir_i, pv_j ∈ trv_i, and q_j and pv_j denote the jth elements of vir_i and trv_i respectively;

5. The method for realizing k anonymity of track data distribution based on the point density segmentation track as claimed in claim 1, wherein: in step 4), the track data set model containing virtual points T_virtual is clustered by treating all of its points as a single point set, the generated cluster centers are numbered, and the number of the cluster center to which each point belongs is recorded in the mark data set model mark, which comprises the following steps:

4.1) clustering the track data set model containing virtual points T_virtual by treating all of its points as a single point set, generating the cluster centers, numbering them, and recording the number of the cluster center corresponding to each point in T_virtual;

4.2) traversing the track data set model containing virtual points T_virtual, judging through the virtual point mark data set model virtual whether each point is a virtual point, and, for real points, recording in the mark data set model mark the number of the cluster center to which the point belongs; the track data set model containing virtual points T_virtual and the mark data set model mark are defined as follows:

Definition 7, track data set model containing virtual points: T_virtual = [trv_1, trv_2, trv_3, ..., trv_n], where trv_i represents the point set of the ith track containing virtual points, i = 1, 2, 3, ..., n; trv_i = [pv_1, pv_2, pv_3, ..., pv_m], where pv_j represents the jth point of track trv_i, j = 1, 2, 3, ..., m; pv_j = (pv_j.X, pv_j.Y) ∈ trv_i, where pv_j.X and pv_j.Y are the longitude and latitude of point pv_j in track trv_i;

Definition 8, mark data set model: mark = [mar_1, mar_2, mar_3, ..., mar_n], where mar_i is the mark list of the ith track in the mark data set model mark, which contains virtual points; mar_i = [z_1, z_2, z_3, ..., z_m], where z_j represents the jth element of mar_i; the mark data set model mark and the track data set model containing virtual points T_virtual have a one-to-one positional mapping: mar_i corresponds to trv_i and z_j corresponds to pv_j, where z_j ∈ mar_i, pv_j ∈ trv_i, and z_j and pv_j denote the jth elements of mar_i and trv_i respectively.

6. The method for realizing k anonymity of track data distribution based on the point density segmentation track as claimed in claim 1, wherein: in step 5), by combining the track data set model T and the mark data set model mark, it is judged whether the cluster center numbers of adjacent points of each track in T are the same; if they are different, the track is segmented between the two points, and if they are the same, the track is left unsegmented, thereby generating the segmented track data set model T_partition; the track data set model T, the mark data set model mark, and the segmented track data set model T_partition are defined as follows:

Definition 9, track data set model: T = [tr_1, tr_2, tr_3, ..., tr_n], where tr_i represents the point set of the ith track, i = 1, 2, 3, ..., n; tr_i = [p_1, p_2, p_3, ..., p_m], where p_j represents the jth point of track tr_i, j = 1, 2, 3, ..., m; p_j = (p_j.X, p_j.Y) ∈ tr_i, where p_j.X and p_j.Y are the longitude and latitude of point p_j in track tr_i;

Definition 10, mark data set model: mark = [mar_1, mar_2, mar_3, ..., mar_n], where mar_i is the mark list of the ith track in the mark data set model mark, which contains virtual points; mar_i = [z_1, z_2, z_3, ..., z_m], where z_j represents the jth element of mar_i; the mark data set model mark and the track data set model containing virtual points T_virtual have a one-to-one positional mapping: mar_i corresponds to trv_i and z_j corresponds to pv_j, where z_j ∈ mar_i, pv_j ∈ trv_i, and z_j and pv_j denote the jth elements of mar_i and trv_i respectively; the track data set model containing virtual points T_virtual is defined as follows:

Definition 12, segmented track data set model: T_partition = [trp_1, trp_2, trp_3, ..., trp_n], where trp_i represents the point set of the ith segmented track; trp_i = [pp_1, pp_2, pp_3, ..., pp_m], where pp_j represents the jth point of track trp_i; pp_j = (pp_j.X, pp_j.Y) ∈ trp_i, where pp_j.X and pp_j.Y are the longitude and latitude of point pp_j in track trp_i.

7. The method for realizing k anonymity of track data distribution based on the point density segmentation track as claimed in claim 1, wherein: in step 6), the tracks of the segmented track data set model T_partition are clustered with the iterative track k-anonymous clustering algorithm; the segmented track data set model T_partition and the progressive sequence alignment algorithm are defined as follows:

Definition 13, segmented track data set model: T_partition = [trp_1, trp_2, trp_3, ..., trp_n], where trp_i represents the point set of the ith segmented track, i = 1, 2, 3, ..., n; trp_i = [pp_1, pp_2, pp_3, ..., pp_m], where pp_j represents the jth point of track trp_i, j = 1, 2, 3, ..., m; pp_j = (pp_j.X, pp_j.Y) ∈ trp_i, where pp_j.X and pp_j.Y are the longitude and latitude of point pp_j in track trp_i;

Definition 16, progressive sequence alignment algorithm: acting on a group of tracks, the longest track in the group is selected as the base track; the remaining tracks of the group are then selected one at a time, in any order, and each is merged with the base track exactly once by dynamic sequence alignment, the track produced by each merge becoming the new base track.