CN116010838A

CN116010838A - Vehicle track clustering method integrating density value and K-means algorithm

Info

Publication number: CN116010838A
Application number: CN202310031352.2A
Authority: CN
Inventors: 周旭; 徐腾鹏; 刘衍珩
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2023-01-10
Filing date: 2023-01-10
Publication date: 2023-04-25

Abstract

The invention discloses a vehicle track clustering method integrating a density value and a K-means algorithm, which comprises the following steps: step one, acquiring original track data, and preprocessing the original track data to obtain a track set; step two, sequentially calculating the track distances between the tracks in the track set; step three, determining the density of all tracks in the track set, and adding the track with the largest density value to a cluster center track set as the first center track; step four, sequentially calculating the weight of each remaining track in the track set being selected as a center track, taking the remaining track with the largest weight value as the next center track, and moving it into the cluster center track set until the capacity of the cluster center track set reaches K, which completes the center track selection; and step five, removing the track with the lowest density value from the track set, and performing K-means clustering until the number of iterations is reached or the elements in the track clusters no longer change, which ends the track clustering process. The method has the characteristic of high clustering precision.

Description

Vehicle track clustering method integrating density value and K-means algorithm

Technical Field

The invention relates to the technical field of track data mining, in particular to a vehicle track clustering method integrating a density value and a K-means algorithm.

Background

In recent years, with rapid development of wireless communication, GPS devices, cloud computing and storage technologies, acquisition, transmission and storage of location data become easier and faster, so that a huge amount of track data is generated. The data generally contains information such as user identification, longitude and latitude, time, altitude and the like, so the track data can be regarded as a time-position sequence of the moving object, and knowledge hidden behind the track data is mined and analyzed by using technologies such as clustering, classification and the like, so that schemes are provided for the fields such as city planning, position service, user recommendation and the like.

As the most popular unsupervised clustering algorithm, the K-means clustering algorithm has a simple principle, is easy to implement, converges quickly, and is widely applied to images, text, tracks and other fields. It still has certain limitations, however: the K value is difficult to choose well; it is sensitive to noise and abnormal points; the initial center point selection and the cluster center calculation during iteration can be inaccurate, so that the clustering converges to a local optimum or the initial center tracks are too concentrated and the clustering effect is poor; and the calculation process is time-consuming.

Domingo-Ferrer et al. propose applying the K-means clustering algorithm to tracks: the method randomly selects K initial center tracks, then calculates the distances from the remaining N-K tracks to the K initial center tracks and assigns each remaining track to the corresponding track cluster. However, randomly selecting the initial center tracks can make the clustering results unstable, and the method does not reject outlier tracks, which can degrade the clustering effect. The initial track selection method proposed by Wang et al. takes the average track of all tracks as the first initial center track, selects each subsequent center track as the track farthest from the previous center track, and then forms each center track and the K-1 tracks closest to it into a track cluster containing K tracks.

Outliers are a few objects that deviate from the normal dataset, failing to form clusters or even negatively affecting the clustering results. Dubey et al propose to screen outliers using a density-based outlier detection algorithm and select an initial center on the screened sample points in combination with a maximum-minimum distance method, but the algorithm does not solve the problem of trajectory clustering. Yu et al propose that the outlier track is removed first, then the K tracks with the longest duration are selected as the initial center tracks, and the clustering strategy proposed by the method emphasizes the duration of the track time, but the problem that the tracks with the long duration are too concentrated cannot be solved, and the K tracks with the longest time span may be very close to each other, so that the initial center tracks are densely distributed.

Disclosure of Invention

The invention aims to design and develop a vehicle track clustering method integrating a density value and a K-means algorithm, wherein the sparse distribution characteristics of a track data set are integrally determined through track density, outlier tracks are accurately removed, and the K-means clustering algorithm is combined, so that the tracks have the characteristics of large density value and uniform distribution, and the iteration precision and the clustering speed are improved.

The technical scheme provided by the invention is as follows:

a vehicle track clustering method integrating a density value and a K-means algorithm comprises the following steps:

step one, acquiring track data of vehicle movement, and preprocessing it to obtain a track set $TS = \{T_1, T_2, \ldots, T_i, \ldots, T_j, \ldots, T_n\}$;

wherein $T_i$ is the i-th track, $T_j$ is the j-th track, $i = 1, \ldots, n$, $j = 1, \ldots, n$, $T_i = \{i, p_1, p_2, \ldots, p_t, \ldots, p_m\}$, $T_j = \{j, q_1, q_2, \ldots, q_t, \ldots, q_l\}$, $p_t$ is the track point $(x_1, y_1, t)$ of track $T_i$ at time stamp $t$, and $q_t$ is the track point $(x_2, y_2, t)$ of track $T_j$ at time stamp $t$;

Step two, track distances are calculated between every two tracks in the track set in sequence, and track distances among all tracks in the track set are stored in a track distance matrix;

wherein the track distance satisfies:

$D(T_i, T_j) = D(p_t, q_t)$;

in the formula, $D(T_i, T_j)$ is the distance between track $T_i$ and track $T_j$, $D(p_t, q_t)$ is the conversion distance between track point $p_t$ and track point $q_t$, and the conversion distance between track point $p_t$ and track point $q_t$ is as follows:

in the formula, $d(p_t, q_t)$ is the actual distance between track point $p_t$ and track point $q_t$, and $\min\{d(p_{t-1}, q_{t-1}), d(p_{t-1}, q_t), d(p_t, q_{t-1})\}$ is the minimum distance between the alignment point pairs of track $T_i$ and track $T_j$;

step three, determining the densities of all tracks in the track set, and adding the track with the largest density value as the first center track to the cluster center track set $TScen = \{T_{c1}, T_{c2}, \ldots, T_{cz}, \ldots, T_{ck}\}$;

wherein the density of the tracks satisfies:

in the formula, $\rho(T_i)$ is the density of track $T_i$, $u(x)$ is a function value, $eps$ is the neighborhood distance, and $T_{cz}$ is the z-th center track, $z = 1, \ldots, k$;

step four, sequentially calculating the weight of each remaining track selected as a center track in the track set, taking the remaining track with the largest weight value in the track set as the next center track, and moving the remaining track into the cluster center track set until the capacity of the cluster center track set reaches K, and completing the selection of the center tracks;

wherein ,

step five, removing the track with the lowest density value from the track set, and performing K-means clustering until the number of iterations is reached or the elements in the track clusters no longer change, which ends the track clustering process.

Preferably, the preprocessing includes: track selection, track segment segmentation, and/or track filling.

Preferably, the second step further includes:

if two tracks in the track set have no overlapping time, their track distance is set to INF, and the distance from a track to itself is also set to INF.

Preferably, the determination of the coincidence time between every two tracks specifically includes:

if it is

there is no coincidence time between the two tracks; otherwise there is coincidence time between the two tracks, and $[a, b]$ is the coincidence time between the two tracks;

wherein $start_i$ is the start time of track $T_i$, $end_i$ is the end time of track $T_i$, $start_j$ is the start time of track $T_j$, $end_j$ is the end time of track $T_j$, $a = \max(start_i, start_j)$, and $b = \min(end_i, end_j)$.

Preferably, the neighborhood distance satisfies:

in the formula ,

is the average of the distance array.

Preferably, the distance array satisfies:

$D_\eta = \{d(x_{i\eta})\}$;

wherein $D_\eta$ is the distance array, $d(x_{i\eta})$ is the distance parameter, namely the $\eta$-th smallest value of row $i$ in the track distance matrix, $i$ is the track number, and $\eta$ is the Eps neighborhood parameter.

Preferably, the Eps neighborhood parameter satisfies:

where n is the number of tracks in the track set.

Preferably, the weight of the track selected as the center track satisfies:

$\omega(T_i) = d_{min}(T_i, TScen) \cdot \rho(T_i)$;

in the formula, $\omega(T_i)$ is the weight of track $T_i$ being selected as a center track, and $d_{min}(T_i, TScen)$ is the minimum center track distance from track $T_i$ to the cluster center track set $TScen$.

Preferably, the minimum center track distance from track $T_i$ to the cluster center track set $TScen$ satisfies:

$d_{min}(T_i, TScen) = \min(D(T_i, T_{cz})),\ T_{cz} \in TScen,\ z \in [1, k]$;

in the formula, $D(T_i, T_{cz})$ is the track distance from track $T_i$ to the center track $T_{cz}$ in the cluster center track set $TScen$.

The beneficial effects of the invention are as follows:

(1) According to the vehicle track clustering method integrating the density value and the K-means algorithm, which is designed and developed by the invention, the track density is determined through the track distance, so that the outlier track is determined, and the track clustering effect is improved.

(2) According to the vehicle track clustering method integrating the density value and the K-means algorithm designed and developed by the invention, when the initial K center tracks are selected, a track must simultaneously satisfy two conditions to be selected as an initial center track, namely a high density and a large distance from the existing center tracks, so that the initial center tracks can be selected more accurately.

(3) According to the vehicle track clustering method integrating the density value and the K-means algorithm, which is designed and developed by the invention, the influence of the track density is still considered when a new cluster center track is calculated in an iterative manner, so that the track with the largest track density value in the cluster is still selected as the center track. Therefore, the iteration speed is increased, the requirement that the track with a large density value is used as a center track is met, and the clustering accuracy is improved.

Drawings

FIG. 1 is a flow chart of a vehicle track clustering method integrating density values and a K-means algorithm.

FIG. 2 is a schematic diagram of the three-dimensional space-time trajectories of track $T_1$ and track $T_2$ according to the present invention.

Fig. 3 is a schematic diagram of a clustering flow in the fifth step of the present invention.

FIG. 4 is a trajectory spatiotemporal three-dimensional plot of a synthetic dataset according to the present invention.

FIG. 5 is a two-dimensional view of the trajectory space of the synthetic dataset of the present invention.

FIG. 6 is a space-time three-dimensional view of tracks in a first cluster of tracks of a synthetic dataset according to the invention.

FIG. 7 is a space-time three-dimensional view of tracks in a second cluster of tracks of a synthetic dataset according to the invention.

FIG. 8 is a space-time three-dimensional view of tracks in a third cluster of tracks of a synthetic dataset according to the invention.

FIG. 9 is a space-time three-dimensional view of a trace in a fourth cluster of traces of the synthetic dataset of the present invention.

FIG. 10 is a space-time three-dimensional view of a trace in a fifth cluster of traces of the synthetic dataset according to the invention.

FIG. 11 is a space-time three-dimensional view of a trace in a sixth cluster of traces of the synthetic dataset of the invention.

FIG. 12 is a space-time three-dimensional view of a trace in a seventh cluster of traces of the synthetic dataset of the invention.

FIG. 13 is a space-time three-dimensional view of tracks in the eighth cluster of tracks of the synthetic dataset according to the invention.

FIG. 14 is a space-time three-dimensional view of a trace in a ninth cluster of traces of the synthetic dataset according to the invention.

FIG. 15 is a trajectory spatiotemporal three-dimensional plot of a real dataset according to the present invention.

FIG. 16 is a two-dimensional view of the trajectory space of a real dataset according to the present invention.

FIG. 17 is a space-time three-dimensional view of tracks in a first cluster of tracks of a real dataset according to the invention.

FIG. 18 is a space-time three-dimensional view of tracks in a second cluster of tracks of a real dataset according to the invention.

FIG. 19 is a space-time three-dimensional view of tracks in a third cluster of tracks of a real dataset according to the invention.

FIG. 20 is a space-time three-dimensional view of a trace in a fourth trace cluster of a real dataset according to the invention.

FIG. 21 is a spatio-temporal three-dimensional view of a trace in a fifth cluster of traces of a real dataset according to the invention.

FIG. 22 is a space-time three-dimensional view of a trace in a sixth cluster of traces of a real dataset according to the invention.

FIG. 23 is a bar graph of the silhouette coefficient results of the present invention on the synthetic dataset for different values of K (the number of clusters).

FIG. 24 is a bar graph of the silhouette coefficient results of the present invention on the real dataset for different values of K (the number of clusters).

Detailed Description

The present invention is described in further detail below to enable those skilled in the art to practice the invention by reference to the specification.

As shown in FIG. 1, the vehicle track clustering method integrating the density value and the K-means algorithm provided by the invention first determines an Eps neighborhood and counts the density information of the tracks; it then filters out outlier tracks by their density values, eliminating the influence of outliers on the overall clustering; next, it selects the track with the largest density in the track data set as the first initial center track and, based on the track density and the minimum distance between each track and the existing center tracks, calculates track weights so that tracks with high density that are far from the existing center tracks are selected as new initial center tracks; in the iterative process, the density value is still used, and the track with the largest density value in each cluster serves as the center track for the next round of clustering. The method improves the track clustering effect through its strategies for detecting and filtering outliers, selecting initial center tracks, and iteratively calculating cluster center tracks. The method specifically comprises the following steps:

step one, acquiring track data (original tracks) of vehicle movement, and preprocessing it to obtain a track set $TS = \{T_1, T_2, \ldots, T_i, \ldots, T_j, \ldots, T_n\}$ formed by a plurality of tracks;

wherein $T_i$ is the i-th track, $T_j$ is the j-th track, $n$ is the number of tracks in the track set, and $i$ and $j$ are track numbers, $i = 1, \ldots, n$, $j = 1, \ldots, n$;

the tracks are all ordered sequences of track points, $T_i = \{i, p_1, p_2, \ldots, p_t, \ldots, p_m\}$, $T_j = \{j, q_1, q_2, \ldots, q_t, \ldots, q_l\}$, where $p_t$ is the track point of track $T_i$ at time stamp $t$ and $q_t$ is the track point of track $T_j$ at time stamp $t$. The track points are all space-time triplets, i.e. $p_t = (x_1, y_1, t)$ and $q_t = (x_2, y_2, t)$, representing that the position coordinates of the moving objects at time $t$ are $(x_1, y_1)$ and $(x_2, y_2)$, wherein the x-axis coordinate is the longitude of the position and the y-axis coordinate is the latitude of the position.

The preprocessing comprises performing track selection, track segment segmentation, track filling and the like on the original tracks. Track selection selects the tracks with more than 20 track points from the original tracks. Track segment segmentation cuts a track at the track points where the adjacent time interval (namely the difference between two adjacent time stamps) is large, forming several sub-tracks; specifically, if the time interval between two adjacent track points in a track is greater than a specified time threshold, the track is cut between the two points to obtain two sub-tracks. Track filling adds track points to a track by linear interpolation, raising the number of points of each track to the set maximum number of track points so that all tracks have the same number of points.

Adding track points to a track means inserting a new track point between the two track points with the largest time interval on the track, and repeating this until the two tracks have the same number of track points. Taking track $T_1$ and track $T_2$ as an example, $T_1 = \{1, p_1, p_2, \ldots, p_t, \ldots, p_m\}$ and $T_2 = \{2, q_1, q_2, \ldots, q_t, \ldots, q_l\}$, the number of track points in track $T_1$ is $m$, the number of track points in track $T_2$ is $l$, and $m < l$, so $(l - m)$ track points need to be inserted into track $T_1$ to make the numbers of track points of the two tracks the same. The method is as follows: first, find the two points with the largest time interval in track $T_1$, $p_1 = (x_3, y_3, t_1)$ and $p_2 = (x_4, y_4, t_2)$; then, according to the linear interpolation method, the three-dimensional coordinate of the point to be inserted is $p_{tm} = (x_{tm}, y_{tm}, t_t)$, wherein $t_t = (t_1 + t_2)/2$, $x_{tm} = (x_3 + x_4)/2$, and $y_{tm} = (y_3 + y_4)/2$, so that the number of points of track $T_1$ increases to $(m + 1)$; then continue to find the two points with the largest time interval in track $T_1$ and insert a track point by linear interpolation as in the previous steps; track points are inserted in this loop until track $T_1$ and track $T_2$ finally have the same number of points.

In this embodiment, the time threshold is 3 minutes, and the maximum track point number is 320 points.
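The track filling procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `fill_track`, the tuple layout `(x, y, t)` and the sample coordinates are assumptions made for the example, and ties between equally large time gaps are broken by taking the first one.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t): longitude, latitude, time stamp


def fill_track(points: List[Point], target_count: int) -> List[Point]:
    """Insert midpoints into the largest time gap until target_count points exist."""
    if len(points) < 2:
        raise ValueError("need at least two points to interpolate")
    filled = list(points)
    while len(filled) < target_count:
        # Index of the first point of the adjacent pair with the largest time interval.
        i = max(range(len(filled) - 1), key=lambda k: filled[k + 1][2] - filled[k][2])
        (x3, y3, t1), (x4, y4, t2) = filled[i], filled[i + 1]
        # Linear interpolation at the midpoint of the gap.
        filled.insert(i + 1, ((x3 + x4) / 2, (y3 + y4) / 2, (t1 + t2) / 2))
    return filled


# Hypothetical three-point track raised to five points.
track = [(0.0, 0.0, 0.0), (4.0, 8.0, 4.0), (5.0, 9.0, 5.0)]
filled = fill_track(track, 5)
```

In the embodiment the target would be the maximum track point number (320), so every track ends up with the same number of points.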

Step two, track similarity is the standard for measuring the degree of similarity between tracks and directly determines the track clustering effect. Track similarity is represented by the distance between tracks, for which the alignment points and preceding alignment points between tracks are determined:

Taking track $T_1$ and track $T_2$ as an example, $T_1 = \{1, p_1, p_2, \ldots, p_t, \ldots, p_m\}$, $T_2 = \{2, q_1, q_2, \ldots, q_t, \ldots, q_l\}$, and track point $p_t$ and track point $q_t$ are the points of tracks $T_1$ and $T_2$ at time stamp $t$ respectively, so they form a pair of alignment points at time stamp $t$. As shown in FIG. 2, track $T_1$ and track $T_2$ are two tracks aligned in time: the alignment point of track point $p_2$ on track $T_1$ is track point $q_2$ on track $T_2$, and the preceding alignment point of track point $p_2$ on track $T_1$ is track point $q_1$ on track $T_2$; similarly, the alignment point of track point $q_2$ on track $T_2$ is track point $p_2$ on track $T_1$, and the preceding alignment point of track point $q_2$ on track $T_2$ is track point $p_1$ on track $T_1$. That is, the preceding alignment point of a track point on one track is the point immediately before that point's alignment point on the other track.

The track distance between track $T_i$ and track $T_j$ is calculated as follows:

(1) Calculate the minimum distance $\min\{d(p_{t-1}, q_{t-1}), d(p_{t-1}, q_t), d(p_t, q_{t-1})\}$ between the alignment point pairs, then calculate the actual distance between track point $p_t$ and track point $q_t$ at the same time stamp, to obtain the conversion distance between track point $p_t$ and track point $q_t$:

in the formula, $d(p_t, q_t)$ is the actual distance between track point $p_t$ and track point $q_t$, and $D(p_t, q_t)$ is the conversion distance between track point $p_t$ and track point $q_t$;

sequentially calculating the track distances between every two tracks in the track set:

$D(T_i, T_j) = D(p_t, q_t)$;

in the formula, $D(T_i, T_j)$ is the distance between track $T_i$ and track $T_j$;

The above distance calculation already implies the temporal similarity of the tracks. Because the time factor is considered, however, two tracks may have no time overlap. To simplify subsequent calculation, a track distance that does not exist (i.e., between tracks with no time overlap) is set to infinity (INF), and the distance from a track to itself is also set to INF, so that these values do not take part in the size sorting of the track distances.

For each track $T_i$, the coincidence time with each remaining track $T_j$ in the track set is calculated. If there is no coincidence time between two tracks, the distance value is INF; if there is coincidence time between two tracks, the coincidence time period is determined and the track distance is then calculated according to the track distance formula; the distance from a track to itself is INF.

Specifically, calculating the coincidence time between each track and each remaining track $T_j$ in the track set includes:

extracting the start time and the end time of each track, and letting the start time and end time of track $T_i$ be $start_i$ and $end_i$ and the start time and end time of track $T_j$ be $start_j$ and $end_j$ respectively; if

the two tracks have no coincidence time; otherwise the two tracks have coincidence time, and $[a, b]$ is the coincidence time between the two tracks, where $a = \max(start_i, start_j)$ and $b = \min(end_i, end_j)$.

(2) For all tracks in the track set TS, the distances between all track pairs are calculated and saved into the track distance matrix $M$:

$M_{n \times n} = \{D(T_i, T_j) \mid i \in [1, n], j \in [1, n]\}$;

wherein $n$ is the number of tracks in the track set, and the element in the i-th row and j-th column of the track distance matrix is the distance value between track $T_i$ and track $T_j$.
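To make the overlap test and the point-alignment idea concrete, here is a hedged Python sketch. The conversion-distance formula itself is not reproduced in this text, so the code substitutes the classic dynamic time warping (DTW) recurrence, in which each cell adds the actual point distance to the minimum of its three predecessor cells; this is a stand-in, not the patent's exact formula. The names `overlap` and `dtw_distance`, the use of Euclidean distance for $d$, the test `a > b` for "no coincidence time", and returning INF when no sampled point falls inside $[a, b]$ are all assumptions.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, t)
INF = math.inf


def overlap(track_i: List[Point], track_j: List[Point]) -> Optional[Tuple[float, float]]:
    """Return the coincidence interval [a, b], or None when there is no time overlap."""
    a = max(track_i[0][2], track_j[0][2])    # a = max(start_i, start_j)
    b = min(track_i[-1][2], track_j[-1][2])  # b = min(end_i, end_j)
    # Assumption: a > b is taken to mean "no coincidence time".
    return None if a > b else (a, b)


def dtw_distance(track_i: List[Point], track_j: List[Point]) -> float:
    """DTW-style distance over the overlapping period; INF when there is no overlap."""
    window = overlap(track_i, track_j)
    if window is None:
        return INF
    a, b = window
    p = [pt for pt in track_i if a <= pt[2] <= b]
    q = [pt for pt in track_j if a <= pt[2] <= b]
    if not p or not q:
        return INF  # assumption: no sampled points inside the overlap
    # acc[r][c] is the accumulated cost of aligning p[:r] with q[:c].
    acc = [[INF] * (len(q) + 1) for _ in range(len(p) + 1)]
    acc[0][0] = 0.0
    for r in range(1, len(p) + 1):
        for c in range(1, len(q) + 1):
            d = math.hypot(p[r - 1][0] - q[c - 1][0], p[r - 1][1] - q[c - 1][1])
            acc[r][c] = d + min(acc[r - 1][c - 1], acc[r - 1][c], acc[r][c - 1])
    return acc[len(p)][len(q)]


# Hypothetical tracks: t1 and t2 run in parallel one unit apart, t3 starts later.
t1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (2.0, 0.0, 2.0)]
t2 = [(0.0, 1.0, 0.0), (1.0, 1.0, 1.0), (2.0, 1.0, 2.0)]
t3 = [(0.0, 0.0, 10.0), (1.0, 0.0, 11.0)]
```

With these inputs the parallel tracks accumulate one unit of distance per aligned pair, while `t1` and `t3` share no time and therefore get INF.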

The larger the inter-track distance is, the smaller the inter-track similarity is, and the smaller the inter-track distance is, the larger the inter-track similarity is.

The calculated track distance comprehensively considers the space-time characteristics of the track and the similarity of the track segments, and lays a foundation for the similarity measurement of the track clustering algorithm.
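A small Python sketch of how the track distance matrix M might be assembled is given here. It assumes the pairwise distance is symmetric and is supplied as a function that already returns INF for pairs without time overlap; the name `build_distance_matrix` and the toy one-dimensional "tracks" are illustrative only.

```python
import math
from typing import Callable, List, Sequence

INF = math.inf


def build_distance_matrix(tracks: Sequence, dist: Callable) -> List[List[float]]:
    """Fill an n x n matrix; the diagonal stays INF so it is ignored when sorting rows."""
    n = len(tracks)
    matrix = [[INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: dist is symmetric, so each pair is computed only once.
            matrix[i][j] = matrix[j][i] = dist(tracks[i], tracks[j])
    return matrix


# Toy stand-in: each "track" is a single number and the distance is the absolute difference.
toy_tracks = [0.0, 3.0, 7.0]
toy_matrix = build_distance_matrix(toy_tracks, lambda a, b: abs(a - b))
```

Keeping INF on the diagonal matches the text above: a track's distance to itself never appears among the smallest values of its row.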

Step three, after the track distance matrix M is obtained through calculation, K initial center tracks need to be selected. The process of selecting the initial center tracks by fusing the density value with the K-means algorithm (DBK-means) is as follows: first, an Eps neighborhood is determined based on the t-neighbor distance, and the density value of each track is calculated and stored in a density array; then, outlier tracks with too low a density value are removed, and the track with the largest track density is selected as the first initial center track; finally, using the calculated density values and the Eps neighborhood distance, the tracks that have both a high density and the greatest distance from the existing cluster center tracks are the ones most likely to be selected as the remaining center tracks. The specific process comprises the following steps:

(1) Calculating Eps neighborhood parameters:

wherein eta is an Eps neighborhood parameter, and n is the number of tracks in the track set.

(2) Sort each row of the track distance matrix and take the $\eta$-th smallest value as the distance parameter $d(x_{i\eta})$ to obtain the distance array:

$D_\eta = \{d(x_{i\eta})\}$;

where $D_\eta$ is the distance array.

(3) Calculating the Eps neighborhood distance:

(4) Taking track $T_i$ as the object of study, calculate the density of track $T_i$ with the Eps neighborhood distance as the radius:


in the formula, $\rho(T_i)$ is the density of track $T_i$, and $u(x)$ is a function value.

(5) Calculate the density value of each track in the track set by the above steps and store it in the track density array Den; then select the track with the largest density value as the first center track and add it to $TScen$, wherein $TScen = \{T_{c1}, T_{c2}, \ldots, T_{cz}, \ldots, T_{ck}\}$ is the set of cluster center tracks selected from the track set TS by the clustering algorithm, and $TScen$ is initially an empty set;

(6) Sequentially calculate the minimum distance from each remaining track in the track set to the center tracks:

$d_{min}(T_i, TScen) = \min(D(T_i, T_{cz})),\ T_{cz} \in TScen,\ z \in [1, k]$;

in the formula, $d_{min}(T_i, TScen)$ is the minimum center track distance from track $T_i$ to the cluster center track set $TScen$.

(7) According to $d_{min}(T_i, TScen)$, calculate the weight of each remaining track $T_i$ in the track set being selected as a center track:

$\omega(T_i) = d_{min}(T_i, TScen) \cdot \rho(T_i)$;

in the formula, $\omega(T_i)$ is the track weight, i.e. the weight of track $T_i$ in the track set being selected as a center track. The remaining track with the largest weight value in the track set is taken as the next initial center track and moved into $TScen = \{T_{c1}, T_{c2}, \ldots, T_{cz}, \ldots, T_{ck}\}$.

(8) When the capacity of TScen is equal to K, indicating that K initial center tracks are selected and completed;

wherein, the value of K is preset,

in the specific implementation, the value with the best effect is selected from the range of K values.
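Sub-steps (1) to (8) can be illustrated with the following Python sketch. Several formulas are not reproduced in this text, so the code rests on stated assumptions: $\eta$ is passed in as a parameter rather than derived from n; the Eps neighborhood distance is taken to be the plain mean of the distance array $D_\eta$; $u(x)$ is taken to be the unit step (1 when $x \ge 0$, else 0) applied to $eps - D(T_i, T_j)$; and a track with zero density is given weight 0 to avoid multiplying INF by 0. All function names and the one-dimensional toy data are hypothetical.

```python
import math
from typing import List

INF = math.inf


def eps_distance(matrix: List[List[float]], eta: int) -> float:
    """Mean of the eta-th smallest value (1-based) of every row of the distance matrix."""
    d_eta = [sorted(row)[eta - 1] for row in matrix]
    return sum(d_eta) / len(d_eta)


def densities(matrix: List[List[float]], eps: float) -> List[int]:
    """Count, for each track, the other tracks whose distance is within eps."""
    # Assumed step function u(x) = 1 when x >= 0, applied to eps - D(T_i, T_j).
    return [sum(1 for d in row if d <= eps) for row in matrix]


def select_centers(matrix: List[List[float]], density: List[int], k: int) -> List[int]:
    """Pick the densest track first, then repeatedly the track maximizing d_min * density."""
    centers = [max(range(len(density)), key=lambda i: density[i])]

    def weight(i: int) -> float:
        if density[i] == 0:
            return 0.0  # assumption: avoids INF * 0 for isolated tracks
        d_min = min(matrix[i][c] for c in centers)
        return d_min * density[i]

    while len(centers) < k:
        remaining = [i for i in range(len(density)) if i not in centers]
        centers.append(max(remaining, key=weight))
    return centers


# Toy data: one-dimensional positions stand in for tracks; index 6 is an outlier.
positions = [0, 1, 2, 10, 11, 12, 50]
M = [[INF if i == j else abs(a - b) for j, b in enumerate(positions)]
     for i, a in enumerate(positions)]
eps = eps_distance(M, eta=2)
rho = densities(M, eps)
centers = select_centers(M, rho, k=2)
```

In this toy run the isolated point at 50 has no neighbors within eps, so it is never chosen as a center even though it is the farthest from the first center; the second center comes from the other dense group instead.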

Step five, as shown in fig. 3, clustering of tracks is realized based on a K-means clustering algorithm:

(1) According to the track density array Den, remove the track with the lowest track density. The lower the density value of a track, the more the track deviates from the track clusters as a whole, which means that the track is an outlier track and has no positive effect on track clustering.

(2) Look up, in the track distance matrix M, the distances from the n-K remaining tracks to the K initial center tracks, and assign each remaining track to the track cluster whose center track is closest to it.

(3) Using the track density array Den, find in turn the track with the largest density value in each of the K track clusters as the center track of that cluster, and carry out the next round of clustering.

(4) Repeat process (2) and process (3); when the number of iterations is reached or the elements in the track clusters no longer change, the track clustering process ends.
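The iteration in processes (1) to (4) can be sketched in Python as below. It is an illustration under assumptions, not the patented code: tracks are referred to by their row index in M, the function name `cluster_tracks` is hypothetical, a single lowest-density track is dropped, ties are broken by first occurrence, and the toy matrix and density values are hard-coded for the example.

```python
import math

INF = math.inf


def cluster_tracks(matrix, density, centers, max_iter=100):
    """Assign tracks to the nearest center, then re-center on the densest member."""
    n = len(density)
    # (1) Remove the track with the lowest density value (treated as the outlier).
    outlier = min(range(n), key=lambda i: density[i])
    active = [i for i in range(n) if i != outlier]
    centers = [c for c in centers if c != outlier]
    previous = None
    for _ in range(max_iter):
        # (2) Each non-center track joins the cluster of its closest center track.
        clusters = {c: [c] for c in centers}
        for i in active:
            if i not in clusters:
                nearest = min(centers, key=lambda c: matrix[i][c])
                clusters[nearest].append(i)
        current = sorted(sorted(members) for members in clusters.values())
        # (4) Stop when cluster membership no longer changes.
        if current == previous:
            break
        previous = current
        # (3) The densest track of each cluster becomes its next center track.
        centers = [max(members, key=lambda i: density[i]) for members in clusters.values()]
    return previous


# Toy data: one-dimensional positions stand in for tracks; index 6 is an outlier.
positions = [0, 1, 2, 10, 11, 12, 50]
M = [[INF if i == j else abs(a - b) for j, b in enumerate(positions)]
     for i, a in enumerate(positions)]
den = [2, 2, 2, 2, 2, 2, 0]
result = cluster_tracks(M, den, centers=[0, 5])
```

Because the density values do not change between rounds, the centers stabilize quickly and the loop usually ends well before the iteration limit.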

Examples

In this embodiment, experiments were performed on two data sets to evaluate the clustering effect of the algorithm. The hardware environment of the experiment is an Intel(R) Core(TM) i7-9750H processor at 2.59 GHz with 8.00 GB of memory. The operating system is Microsoft Windows, the algorithm is written in Java, and the programming environment is IDEA.

(1) Synthetic data set:

The synthetic data set is generated by Thomas Brinkhoff's moving object generator based on a map of Oldenburg, Germany; the track set of the moving objects is obtained by setting the parameters of the generator. For experimental repeatability, the relevant parameters are as follows: 10 moving objects and 1 external object are generated per time stamp, where a moving object is an automobile and an external object is the weather condition (or another factor) of a certain area, which can influence the speed of the moving objects and whether they choose to reroute; there are 100 time stamps; the moving speed of the moving objects is 250; 'probability' is 1000; and the track points are generated in continuous time. With these parameter settings, 1000 synthetic tracks are generated in total, the number of locations on each track lies in [1, 100], and 45929 locations in Oldenburg are visited in total. To bring the simulation experiment closer to the experiment on the real data set, tracks with no more than 20 points are removed, so that the number of positions of every track lies in [21, 100]; 786 tracks remain after the removal, with 55.6 positions per track on average.

(2) Real data set:

The real track data set consists of 536 files containing the space-time information of 536 taxis collected over 30 days from May to June 2008 in the San Francisco Bay Area. Each file holds all the information of one taxi, namely the longitude and latitude of each sampling point, the time stamp, and whether passengers are carried. Since the continuous track information of a taxi over one month can hardly be regarded as a single track, the track needs to be divided by setting a time interval threshold. The track points between 12:04 on May 25 and 12:04 on May 26, 2008 were the most numerous, and were therefore taken as the track set for the experiment. The average time interval between successive positions in the data set was 88 seconds, so a time threshold of 3 minutes was set as the track segmentation time: if the time interval between successive positions was greater than 3 minutes, the track was cut between the two positions, which became the end of the previous track and the beginning of the next track. A total of 482 tracks were obtained.
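The segmentation rule used for the real data set (cut wherever successive positions are more than 3 minutes apart, then keep only tracks with more than 20 points) might look like this in Python. The function names, the `(x, y, t)` layout with `t` in seconds, and the sample coordinates are assumptions for illustration; they are not taken from the data set.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (longitude, latitude, time stamp in seconds)


def split_track(points: List[Point], max_gap: float = 180.0) -> List[List[Point]]:
    """Cut a point sequence wherever the gap between neighbors exceeds max_gap seconds."""
    if not points:
        return []
    segments = [[points[0]]]
    for prev, cur in zip(points, points[1:]):
        if cur[2] - prev[2] > max_gap:
            segments.append([])  # cur starts the next sub-track
        segments[-1].append(cur)
    return segments


def select_tracks(tracks: List[List[Point]], min_points: int = 20) -> List[List[Point]]:
    """Keep only tracks with more than min_points points."""
    return [t for t in tracks if len(t) > min_points]


# Made-up sample: a 250-second gap separates the third and fourth points.
raw = [(-122.40, 37.78, 0.0), (-122.41, 37.78, 60.0), (-122.42, 37.79, 150.0),
       (-122.45, 37.80, 400.0), (-122.46, 37.80, 470.0)]
parts = split_track(raw)
```

The sample is cut once, giving a three-point sub-track and a two-point sub-track; with the default `min_points=20` both would then be discarded as too short.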

In the algorithm of the invention, to ensure as far as possible that the clustering result is not affected by too few iterations, the number of iterations is set to 100.

The final experimental effect of the trajectory clustering algorithm can be observed through visualization software; in this example, the visualization is done in Python.

The algorithm of the invention is used to cluster the synthetic data set, and the clustering results for different K values are compared; the clustering result is better when K=9. FIG. 4 shows the space-time three-dimensional graph of the track clustering of the synthetic data set: because the time length of a track is at most 100, the spatial extent of one track can only occupy a local part of the city and cannot span the whole of Oldenburg. As the spatial two-dimensional graph in FIG. 5 shows, the factor that influences the track clustering is mainly the spatial factor, namely the x and y coordinates. As shown in FIGS. 6-14, the tracks as a whole are too concentrated in time, i.e., the average track time length exceeds 50% of the maximum time length, so the tracks mostly appear clustered at the spatial level.

The algorithm is used to cluster the real data set, and the clustering results for different K values are compared; the clustering result is better when K=6. FIG. 15 shows the space-time three-dimensional graph of the real data set. Unlike the tracks of the synthetic data set, the average time span of the tracks of the real data set exceeds 3 hours, which is enough for a taxi to cross the whole San Francisco Bay Area; and because the time span of the real data set used in the experiment is 24 hours, the time span of each track is only about 1/8 of the total time span, far lower than the relative track time span in the synthetic data set. FIG. 16 shows that the tracks spread over the whole city, so longitude and latitude coordinates are not the main factors affecting the clustering effect. As shown in FIGS. 17-22, the factor that influences the track clustering is mainly the time factor, namely the t coordinate, rather than the longitude and latitude coordinates; the tracks as a whole are too concentrated in space, so the tracks mostly appear clustered at the temporal level, forming separate clusters at different times.

Clustering uses distance as the measure for dividing the samples of a data set into a plurality of disjoint clusters. The aim is to make objects within the same cluster as similar as possible and objects in different clusters as different as possible. The silhouette coefficient (the Silhouette Index) measures how similar an object is to the cluster it belongs to compared with other clusters. It combines the degree of aggregation and the degree of separation, so it evaluates the effect of track clustering more comprehensively. The value range of the silhouette coefficient is [-1, 1], and the larger the value, the better the clustering effect.
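The silhouette coefficient described above can be illustrated with a minimal sketch. It assumes that the pairwise track distances have already been computed and are supplied as a square matrix, and it uses the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from track i to the other tracks of its own cluster and b(i) is the smallest mean distance from track i to the tracks of any other cluster. The function name `silhouette_coefficient` and the toy data are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Sequence


def silhouette_coefficient(dist: Sequence[Sequence[float]],
                           labels: Sequence[int]) -> float:
    """Mean silhouette value over all tracks.

    dist   -- precomputed pairwise distance matrix (assumed symmetric)
    labels -- cluster label of each track, same order as dist
    """
    n = len(labels)
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: aggregation, mean distance to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: separation, mean distance to the nearest other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for lab, members in clusters.items() if lab != labels[i])
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / n


# Toy example: four "tracks" reduced to points on a line (illustrative only).
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
good = silhouette_coefficient(toy_dist, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_coefficient(toy_dist, [0, 1, 0, 1])   # clusters interleaved
print(good, bad)
```

In this toy case the compact labeling yields a value close to 1 and the interleaved labeling yields a negative value, which matches the statement that a larger value indicates a better clustering effect.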

As shown in fig. 23, for the synthetic data set the silhouette coefficient decreases overall as the number of clusters increases, which is to be expected. As the number of clusters K increases, more track clusters are formed, so tracks that belong to the same track cluster are split into different track clusters. This tends to crowd the different clusters together, so that the degree of separation is greatly increased, the degree of aggregation is reduced, and the silhouette coefficient decreases.

As can be seen from fig. 23, the silhouette coefficient of the DBK-means algorithm of the invention is higher than that of the three comparison algorithms, which are as follows. The mdav algorithm is based on the K-means framework, but it selects the initial tracks at random without rejecting outlier tracks, and it also selects center tracks at random during iteration. The gcdm algorithm selects the track farthest from a center track as the center track of the current iteration, but does not consider the influence of the track distance factor on the selection of the initial center tracks. The tc_mftsm algorithm selects the K tracks with the longest duration as the initial center tracks and uses the average track of each track cluster as the center track during iterative clustering. All three algorithms lead to varying degrees of redundancy in the clustering results, which lowers the silhouette coefficient. The method of the invention outperforms the other three because, based on the computed track density values, it always selects as initial center tracks (core tracks) those tracks that are relatively far apart and have relatively high track density, so the resulting clustering is more effective.

As shown in fig. 24, the trend of the silhouette coefficient on the real data set shows the positive influence of the track clustering algorithm of the invention on track clustering: by defining a track density, the sparsity of each track is quantified; outlier tracks with low track density values are selectively removed; and the probability of each track being selected as a center track is calculated with reference to the track neighborhood distance and the track density value, thereby completing the clustering process.

Specifically, it can be seen from the graph that when the K value is small, the mdav, gcdm and tc_mftsm algorithms differ little from the DBK-means algorithm of the invention. When K is 10 and 15, the silhouette coefficient obtained by DBK-means is higher than that of the other three algorithms, and the clustering effect is better. The main reasons are that the algorithm uses the track density calculated in advance to filter out outlier tracks, which optimizes the clustering effect, and that it makes full use of both the track density and the distance to the existing center tracks when selecting the initial center tracks, so that tracks with higher track density and a more uniform distribution are selected as initial center tracks. During iteration, the influence of the track density on the track clusters is always taken into account, so the center track of the next iteration also has a high density, which positively affects the clustering effect.
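The center track selection discussed above can be sketched as follows, based on the weight defined in the claims, ω(T_i) = d_min(T_i, TScen) * ρ(T_i): the track with the largest density becomes the first center track, and the remaining track with the largest weight is then added repeatedly until K center tracks have been chosen. This is only an illustrative sketch under stated assumptions: the track distance matrix and the density values ρ are taken as precomputed inputs, ties are broken by the lowest index, and the function name `select_center_tracks` is hypothetical.

```python
from typing import List, Sequence


def select_center_tracks(dist: Sequence[Sequence[float]],
                         density: Sequence[float],
                         k: int) -> List[int]:
    """Return the indices of K center tracks.

    dist    -- precomputed pairwise track distances D(T_i, T_j) (assumed input)
    density -- precomputed track densities rho(T_i) (assumed input)
    k       -- required size of the cluster center track set TScen
    """
    n = len(density)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of tracks")

    # First center track: the track with the largest density value.
    first = max(range(n), key=lambda i: density[i])
    centers = [first]
    remaining = [i for i in range(n) if i != first]

    while len(centers) < k:
        def weight(i: int) -> float:
            # d_min: distance from track i to its nearest chosen center track
            d_min = min(dist[i][c] for c in centers)
            return d_min * density[i]

        # Next center track: the remaining track with the largest weight.
        nxt = max(remaining, key=weight)
        centers.append(nxt)
        remaining.remove(nxt)
    return centers


# Toy example: tracks reduced to points on a line, with made-up densities.
points = [0.0, 1.0, 10.0, 11.0, 50.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_density = [3.0, 2.0, 2.5, 2.0, 0.01]  # the last point acts as a sparse outlier
print(select_center_tracks(toy_dist, toy_density, 2))
```

Multiplying the distance by the density favors tracks that are both far from the centers already chosen and located in dense regions, so a distant but sparse outlier receives a small weight.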

The vehicle track clustering method integrating a density value and a K-means algorithm according to the invention solves the problem of framework optimization that arises when the K-means clustering algorithm is applied to the track field. Given a large amount of track data, such as pedestrian tracks and vehicle tracks, the method clusters the large number of disordered tracks into a plurality of track clusters, so that the track similarity within the same track cluster is high and the track similarity between different track clusters is low. The track flow within each cluster can be learned from the track clusters formed by the algorithm, which reveals the distribution of hot-spot traffic routes and rarely used routes in a city, provides data support for traffic monitoring departments, and offers ideas for improving urban bus line planning. In addition, location-based services (LBS) can be provided to users, because most of the track clusters formed follow hot-spot routes and hot-spot areas, which benefits user destination prediction and track recommendation.

Although embodiments of the present invention have been disclosed above, the invention is not limited to the applications listed in the description and the embodiments. It can be applied in any field to which it is suited, and further modifications can readily be made by those skilled in the art. Therefore, without departing from the general concept defined by the claims and their equivalents, the invention is not limited to the specific details and examples shown and described herein.

Claims

1. A vehicle track clustering method integrating a density value and a K-means algorithm is characterized by comprising the following steps:

wherein the track distance satisfies:

D(T_i, T_j) = D(p_t, q_t);

wherein the density of the tracks satisfies:

wherein ,

2. The vehicle track clustering method integrating a density value and a K-means algorithm according to claim 1, wherein the preprocessing comprises: track selection, track segment segmentation, and/or track filling.

3. The vehicle track clustering method integrating a density value and a K-means algorithm according to claim 2, wherein step two further comprises:

4. The vehicle track clustering method integrating a density value and a K-means algorithm according to claim 3, wherein determining the coincidence time between tracks comprises:

if it is

5. The vehicle track clustering method integrating a density value and a K-means algorithm according to claim 4, wherein the neighborhood distance satisfies:

in the formula ,

is the average of the distance array.

6. The vehicle track clustering method integrating a density value and a K-means algorithm according to claim 5, wherein the distance array satisfies:

Dη = {d(x_iη)};

7. The vehicle track clustering method integrating a density value and a K-means algorithm according to claim 6, wherein the Eps neighborhood parameter satisfies:

where n is the number of tracks in the track set.

8. The vehicle track clustering method integrating a density value and a K-means algorithm according to claim 7, wherein the weight of a track being selected as a center track satisfies:

ω(T_i) = d_min(T_i, TScen) * ρ(T_i);

9. The vehicle track clustering method integrating a density value and a K-means algorithm according to claim 8, wherein the minimum center track distance from the track T_i to the cluster center track set TScen satisfies:

d_min(T_i, TScen) = min(D(T_i, T_cz)) (T_cz ∈ TScen, z ∈ [1, k]);

where D(T_i, T_cz) is the track distance from the track T_i to the center track T_cz in the cluster center track set TScen.