CN112328728A - Clustering method and device for mining traveler track, electronic device and storage medium

Info

Publication number: CN112328728A
Application number: CN202011371345.XA
Authority: CN
Inventors: 张欣环; 刘宏杰; 吴金洪; 施俊庆; 毛程远; 孟国连
Original assignee: Zhejiang Normal University CJNU
Current assignee: Zhejiang Normal University CJNU
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2021-02-05

Abstract

The application relates to a clustering method, a clustering device, an electronic device and a storage medium for mining traveler trajectories. The clustering method for mining traveler trajectories comprises the following steps: acquiring traveler trajectory data; determining clustering parameters of the traveler trajectory data, wherein the clustering parameters comprise an optimal neighborhood radius and a minimum neighborhood point number; determining a clustering result of the traveler trajectory data according to the clustering parameters; evaluating the clustering result according to a preset evaluation index to obtain optimal clustering parameters, wherein the preset evaluation index comprises an internal and external duty ratio index, and the internal and external duty ratio index is evaluated by dividing the sum of the intra-class densities of any two classes by the arithmetic mean of the maximum of the merged densities of the two clusters; and determining the optimal clustering result of the traveler trajectory data according to the optimal clustering parameters. The method and the device solve the problem of low accuracy of the clustering parameters in the related art and improve the accuracy of the clustering parameters.

Description

Clustering method and device for mining traveler track, electronic device and storage medium

Technical Field

The present application relates to the field of urban traffic, and in particular to a clustering method, apparatus, electronic apparatus, and storage medium for mining traveler trajectories.

Background

Trajectory mining refers to clustering the movement trajectory points of a traveler into suitable areas on the basis of the traveler's long-term movement trajectory. In an urban public transport system, mining traveler trajectory data is one of the key technologies for constructing a customized public transport network, and is also the basis for optimizing the siting of bus stops. At present, the planning of bus lines and stops mostly targets the lowest operating cost and gives little consideration to the distance and time costs of travelers.

For example, in some related techniques, hot spots are mined by dividing a trajectory into sub-trajectory segments and then applying a density-based clustering algorithm to cluster the sub-trajectories. As another example, some related art proposes a grid-based movement trajectory mining algorithm, which first partitions the data into grids and then clusters each grid using DBSCAN. However, because the number of clusters is a required input for FCM clustering, three numbers are specified as clustering parameters. Moreover, both approaches merely slice or grid the trajectory data and then apply an existing clustering algorithm to the actual trajectory clustering scenario. In the course of research, it was found that, because the clustering algorithm itself is not improved in the related art, the accuracy of the clustering parameters is low and the resulting clustering is poor.

At present, no effective solution is provided for the problem of low accuracy of clustering parameters in the related technology.

Disclosure of Invention

The embodiments of the application provide a clustering method, a clustering device, an electronic device and a storage medium for mining traveler trajectories, so as to at least solve the problem of low accuracy of clustering parameters in the related art.

In a first aspect, an embodiment of the present application provides a clustering method for mining a traveler track, where the method includes:

acquiring the traveler track data;

determining clustering parameters of the traveler trajectory data, wherein the clustering parameters comprise: an optimal neighborhood radius and a minimum neighborhood point number;

determining a clustering result of the traveler trajectory data according to the clustering parameters;

evaluating the clustering result according to a preset evaluation index to obtain optimal clustering parameters, wherein the preset evaluation index comprises an internal and external duty ratio index, and the internal and external duty ratio index is evaluated by dividing the sum of the intra-class densities of any two classes by the arithmetic mean of the maximum of the merged densities of the two clusters;

and determining the optimal clustering result of the traveler track data according to the optimal clustering parameters.

In some embodiments, before determining the clustering parameters of the traveler trajectory data, the method further comprises:

preprocessing the traveler trajectory data, wherein the preprocessing comprises at least one of: data cleaning processing and data ETL processing.

In some of these embodiments, determining the clustering parameters of the traveler trajectory data comprises:

determining a walking distance interval of a traveler in a preset time period, and determining a stay count interval in a preset area in the preset time period;

and determining the optimal neighborhood radius according to the walking distance interval, and determining the minimum neighborhood point number according to the stay count interval.

In some embodiments, determining the clustering result of the traveler trajectory data according to the clustering parameters includes:

taking the clustering parameters as input parameters of a preset clustering model, and performing cyclic density clustering calculation to obtain the clustering result.

In some embodiments, evaluating the clustering result according to a preset evaluation index to obtain an optimal clustering parameter includes:

evaluating the clustering result according to the internal and external duty ratio index to obtain a three-dimensional surface graph of the internal and external duty ratio index, wherein an X coordinate of the three-dimensional surface graph is used for representing neighborhood radius, a Y coordinate is used for representing the minimum neighborhood point number, and a Z coordinate is used for representing the internal and external duty ratio index;

determining the optimal clustering parameters at the point where the Z-coordinate value of the three-dimensional surface graph of the internal and external duty ratio index is minimum, wherein the optimal clustering parameters comprise: the input parameters that optimize the internal and external duty ratio index.

In some embodiments, after determining the optimal clustering result of the traveler trajectory data according to the optimal clustering parameter, the method further includes:

and performing compactness evaluation, separation evaluation and DBI index evaluation on the optimal clustering result to determine the clustering effect of the optimal clustering result.

In some embodiments, the preset evaluation index further includes: a silhouette coefficient evaluation index and a DBI (Davies-Bouldin index) evaluation index; and evaluating the clustering result according to the preset evaluation index to obtain the optimal clustering parameters comprises the following steps:

evaluating the clustering result according to the silhouette coefficient evaluation index to obtain a three-dimensional surface graph of the input parameters of the silhouette coefficient evaluation index, wherein the X coordinate of the three-dimensional surface graph represents the neighborhood radius, the Y coordinate represents the minimum neighborhood point number, and the Z coordinate represents the silhouette coefficient;

evaluating the clustering result according to the DBI evaluation index to obtain a three-dimensional surface graph of the input parameters of the DBI evaluation index, wherein the X coordinate of the three-dimensional surface graph represents the neighborhood radius, the Y coordinate represents the minimum neighborhood point number, and the Z coordinate represents the DBI;

determining the optimal clustering parameters according to the three-dimensional surface graphs of the input parameters of the silhouette coefficient evaluation index, the DBI evaluation index and the internal and external duty ratio index, wherein the optimal clustering parameters comprise one of the following: the optimal input parameters of the silhouette coefficient evaluation index, the optimal input parameters of the DBI evaluation index, and the optimal input parameters of the internal and external duty ratio index.

In a second aspect, an embodiment of the present application further provides a clustering device for mining traveler trajectories, where the device includes:

the acquisition module is used for acquiring the traveler track data;

a first determining module, configured to determine clustering parameters of the traveler trajectory data, where the clustering parameters include: an optimal neighborhood radius and a minimum neighborhood point number;

the second determining module is used for determining a clustering result of the traveler track data according to the clustering parameters;

the evaluation module is used for evaluating the clustering result according to a preset evaluation index to obtain optimal clustering parameters, wherein the preset evaluation index comprises an internal and external duty ratio index, and the internal and external duty ratio index is evaluated by dividing the sum of the intra-class densities of any two classes by the arithmetic mean of the maximum of the merged densities of the two clusters;

and the third determining module is used for determining the optimal clustering result of the traveler trajectory data according to the optimal clustering parameters.

In a third aspect, an embodiment of the present application provides an electronic apparatus, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the clustering method for mining traveler trajectories as described in the first aspect.

In a fourth aspect, the present application provides a storage medium on which a computer program is stored, where the program, when executed by a processor, implements the clustering method for mining traveler trajectories as described in the first aspect.

Compared with the related art, the clustering method, device, electronic device and storage medium for mining traveler trajectories provided by the embodiments of the application operate as follows: acquiring traveler trajectory data; determining clustering parameters of the traveler trajectory data, wherein the clustering parameters comprise an optimal neighborhood radius and a minimum neighborhood point number; determining a clustering result of the traveler trajectory data according to the clustering parameters; evaluating the clustering result according to a preset evaluation index to obtain optimal clustering parameters, wherein the preset evaluation index comprises an internal and external duty ratio index, and the internal and external duty ratio index is evaluated by dividing the sum of the intra-class densities of any two classes by the arithmetic mean of the maximum of the merged densities of the two clusters; and determining the optimal clustering result of the traveler trajectory data according to the optimal clustering parameters. In this way, the problem of low accuracy of the clustering parameters in the related art is solved, and the accuracy of the clustering parameters is improved.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a block diagram of a hardware structure of a terminal of a clustering method for mining a traveler's trajectory according to an embodiment of the present invention;

FIG. 2 is a flow chart of a clustering method for mining a traveler's trajectory according to an embodiment of the present application;

FIG. 3 is a schematic diagram of inner and outer duty cycle regions according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a two-dimensional synthetic dataset according to an embodiment of the present application;

FIG. 5 is a histogram of clustering results according to an embodiment of the present application;

FIG. 6 is a schematic diagram of three-dimensional curved surfaces of different performance levels according to an embodiment of the present application;

FIG. 7 is a diagram illustrating clustering results for different performance indicators according to an embodiment of the present application;

FIG. 8 is a block diagram of a clustering device for mining traveler trajectories according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. The words "a," "an," "the," and similar words in this application do not denote a limitation of quantity, and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof, as used in this application, are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The words "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "A plurality" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The terms "first," "second," "third," and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.

The method provided by this embodiment can be executed in a terminal, a computer or a similar computing device. Taking execution on a terminal as an example, FIG. 1 is a block diagram of a hardware structure of the terminal of the clustering method for mining traveler trajectories according to the embodiment of the present invention. As shown in FIG. 1, the terminal may include one or more (only one is shown in FIG. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and is not intended to limit the structure of the terminal. For example, the terminal may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

The memory 104 may be used for storing computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the clustering method for mining traveler trajectories in the embodiment of the present invention. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thereby implementing the above-mentioned method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network interface controller (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the internet in a wireless manner.

This embodiment provides a clustering method for mining traveler trajectories. FIG. 2 is a flowchart of the clustering method for mining traveler trajectories according to an embodiment of the application. As shown in FIG. 2, the flow includes the following steps:

Step S201, acquiring traveler trajectory data.

In this step, the traveler track data may be obtained from a historical database in which the traveler track data is stored, or may be obtained in real time by obtaining the traveler track data collected by the APP of the user terminal, where the traveler track data may be track data of a traveler within a certain period of time preset by the user.

Step S202, determining clustering parameters of the traveler trajectory data, wherein the clustering parameters comprise: an optimal neighborhood radius and a minimum neighborhood point number.

It should be noted that the neighborhood radius may refer to a walking distance of the traveler, and the minimum neighborhood point number may refer to the number of times that the traveler stays in a certain area.

In this embodiment, the optimal neighborhood radius may be determined according to the walking distance of the traveler in a certain time, and the minimum neighborhood point number may be determined according to the number of times the traveler stays in a certain area, so as to realize the value taking of the clustering parameters.

Step S203, determining a clustering result of the traveler trajectory data according to the clustering parameters.

In this step, cyclic density clustering may be performed with the clustering parameters from step S202 to obtain the corresponding clustering result of the traveler trajectory data, where the clustering may be performed with the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise).
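As an illustration of the density clustering referred to in step S203, the following is a minimal sketch of the textbook DBSCAN procedure for two-dimensional points, written with the Python standard library only. The function names `region_query` and `dbscan`, the planar Euclidean distance, and the convention that a point counts as its own neighbor are assumptions made for this sketch; the application does not specify an implementation.

```python
import math

NOISE = -1  # label used here for points that belong to no cluster


def region_query(points, idx, eps):
    """Return indices of all points within eps of points[idx] (itself included)."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_pts):
    """Assign each point a cluster id (0, 1, ...) or NOISE.

    eps     : neighborhood radius (Eps)
    min_pts : minimum neighborhood point number (MinPts)
    """
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


if __name__ == "__main__":
    demo = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (50, 50)]
    print(dbscan(demo, eps=1.5, min_pts=3))
```

In a cyclic calculation, `dbscan` would simply be called once for each candidate pair of neighborhood radius and minimum neighborhood point number.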

Step S204, evaluating the clustering result according to a preset evaluation index to obtain optimal clustering parameters, wherein the preset evaluation index comprises an internal and external duty ratio index, and the internal and external duty ratio index is evaluated by dividing the sum of the intra-class densities of any two classes by the arithmetic mean of the maximum of the merged densities of the two clusters.

In the related art, only coefficients such as cluster cohesion and inter-cluster distance are considered during clustering, which imposes certain limitations on trajectory clustering. For example, when cluster cohesion is evaluated, the intra-cluster density is not considered and the relationship between the number of points in a cluster and the cluster size is ignored, which may lead to low accuracy of the clustering result. In this step, the clustering algorithm is improved by evaluating the clustering result with the internal and external duty ratio index; because this index takes the intra-cluster density into account, the problem of neglecting the intra-cluster density is avoided and the accuracy of the clustering result is improved.

It should be noted that, in this embodiment, the lower the value of the internal and external duty ratio index is, the better the clustering parameter is, that is, the better the clustering effect is.

Step S205, determining the optimal clustering result of the traveler trajectory data according to the optimal clustering parameters.

Through steps S201 to S205 above, the clustering parameters are determined according to the traveler trajectory data, the clustering result of the traveler trajectory data is determined according to the clustering parameters, and the clustering result is finally evaluated with the internal and external duty ratio index to obtain the optimal clustering parameters. The clustering algorithm is thereby improved, the problem in the related art that the intra-cluster density is not considered when the clustering parameters are calculated is avoided, and the accuracy of the clustering parameters is improved.

In some embodiments, before determining the clustering parameter of the traveler trajectory data, preprocessing may be further performed on the traveler trajectory data, where the preprocessing includes at least one of: data cleaning processing and data ETL processing.

In this embodiment, the acquired traveler trajectory data may contain damaged, duplicated or invalid data. Preprocessing the traveler trajectory data prevents such data from lowering the accuracy of the clustering parameters determined in step S202.

It should be noted that, in data cleaning, irrelevant data and duplicated data may be deleted from the traveler trajectory data, and noisy traveler trajectory data may be smoothed.
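The cleaning operations described above can be sketched as follows. This is only an illustrative sketch: the record layout `(timestamp, lat, lon)`, the coordinate-range validity check, and the centered moving average used for smoothing are assumptions of this example and are not specified in the application.

```python
def clean_track(records, window=3):
    """Clean one traveler's track.

    records : list of (timestamp, lat, lon) tuples (hypothetical layout)
    window  : size of the centered moving-average window used for smoothing
    """
    seen = set()
    valid = []
    for rec in records:
        _, lat, lon = rec
        if rec in seen:
            continue  # drop exact duplicates
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue  # drop records with impossible coordinates
        seen.add(rec)
        valid.append(rec)

    valid.sort(key=lambda r: r[0])  # order by timestamp

    half = window // 2
    smoothed = []
    for i, (t, _, _) in enumerate(valid):
        lo, hi = max(0, i - half), min(len(valid), i + half + 1)
        chunk = valid[lo:hi]
        smoothed.append((t,
                         sum(r[1] for r in chunk) / len(chunk),
                         sum(r[2] for r in chunk) / len(chunk)))
    return smoothed


if __name__ == "__main__":
    raw = [(2, 10.0, 20.0), (1, 10.0, 20.0), (2, 10.0, 20.0),
           (3, 999.0, 20.0), (4, 13.0, 23.0)]
    for row in clean_track(raw):
        print(row)
```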

Data ETL (Extract-Transform-Load): using the unique identification code of each user, all behavior trajectories of that user may be extracted from the data instances to construct a single-user data set; all users are traversed in a loop, and the resulting single-user data sets of the plurality of users serve as the candidate set for the whole cluster set. Finally, a plurality of candidates are extracted from the candidate set as experimental objects, ensuring that the trajectory data of each single user exceeds a preset value (for example, 1000), so as to construct the cluster set.
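The ETL step can be illustrated with the following sketch, which groups raw rows by user identifier and keeps only users whose track exceeds a preset number of points. The row layout `(user_id, timestamp, lat, lon)`, the function name `build_cluster_set`, and the rule of picking the first `sample_size` users in sorted order are assumptions of this example; the application does not state how the experimental objects are drawn from the candidate set.

```python
from collections import defaultdict


def build_cluster_set(rows, min_points=1000, sample_size=None):
    """Group rows per user and keep users with more than min_points records.

    rows        : iterable of (user_id, timestamp, lat, lon) tuples (hypothetical layout)
    min_points  : preset value a single user's track must exceed
    sample_size : number of candidate users to keep (None keeps all)
    """
    per_user = defaultdict(list)
    for user_id, t, lat, lon in rows:
        per_user[user_id].append((t, lat, lon))

    # Candidate set: one time-ordered data set per qualifying user.
    candidates = {uid: sorted(track)
                  for uid, track in per_user.items()
                  if len(track) > min_points}

    # Illustrative selection rule: first sample_size user ids in sorted order.
    chosen = sorted(candidates)[:sample_size]
    return {uid: candidates[uid] for uid in chosen}


if __name__ == "__main__":
    rows = [("a", i, 0.0, 0.0) for i in range(5)] + [("b", i, 1.0, 1.0) for i in range(2)]
    print(list(build_cluster_set(rows, min_points=3)))
```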

In different application scenarios, the optimal clustering input parameter values fluctuate within a certain range. Since the range of the input parameters determines both the execution efficiency of the clustering algorithm and the likelihood of finding the optimal clustering parameters, it is important to establish a suitable input parameter range before the clustering algorithm is executed. If the required number of points is too large, the data set may fail to form effective clusters; if it is too small, the clusters are too scattered to be of practical use. Furthermore, the distance between cluster points affects the compactness within a cluster. If the distance metric is too large, the clusters are too loose for different clusters to be distinguished effectively. If the distance metric is too small, the clustering distance is too tight, which may yield too many trivial, worthless clustering results.

Therefore, in order to avoid the above problem, in some embodiments, determining the clustering parameters of the traveler trajectory data may include the following steps: determining a walking distance interval of a traveler in a preset time period, and determining a stay count interval in a preset area in the preset time period; and determining the optimal neighborhood radius according to the walking distance interval, and determining the minimum neighborhood point number according to the stay count interval.

In this embodiment, the walking distance interval is used as the neighborhood radius interval and the stay count interval is used as the minimum neighborhood point number interval; the optimal neighborhood radius is then determined within the neighborhood radius interval, and the minimum neighborhood point number is determined within the minimum neighborhood point number interval. This avoids the above problems and improves the accuracy of the clustering parameters.
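One way to turn the two intervals into candidate clustering parameters is to enumerate a grid over them, as in the sketch below. The step sizes, the function name `parameter_grid`, and the example interval values are hypothetical; the application only states that the intervals bound the search.

```python
def parameter_grid(dist_interval, stay_interval, eps_step, min_pts_step=1):
    """Enumerate candidate (eps, min_pts) pairs.

    dist_interval : (low, high) walking distance interval, used as the eps range
    stay_interval : (low, high) stay count interval, used as the MinPts range
    eps_step      : spacing between eps candidates (assumed, not from the source)
    """
    d_lo, d_hi = dist_interval
    s_lo, s_hi = stay_interval
    n_eps = int(round((d_hi - d_lo) / eps_step)) + 1
    eps_values = [d_lo + k * eps_step for k in range(n_eps)]
    min_pts_values = list(range(s_lo, s_hi + 1, min_pts_step))
    return [(eps, m) for eps in eps_values for m in min_pts_values]


if __name__ == "__main__":
    # Illustrative numbers only: 100 to 300 (distance units) and 3 to 5 stays.
    for pair in parameter_grid((100.0, 300.0), (3, 5), eps_step=100.0):
        print(pair)
```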

In some embodiments, determining the clustering result of the traveler trajectory data according to the clustering parameters may include the following step: taking the clustering parameters as input parameters of a preset clustering model, and performing cyclic density clustering calculation to obtain the clustering result.

It should be noted that the preset clustering model may be configured by a user.

In some embodiments, evaluating the clustering result according to the preset evaluation index to obtain the optimal clustering parameters may include the following steps: evaluating the clustering result according to the internal and external duty ratio index to obtain a three-dimensional surface map of the internal and external duty ratio index, wherein the X coordinate of the three-dimensional surface map represents the neighborhood radius, the Y coordinate represents the minimum neighborhood point number, and the Z coordinate represents the internal and external duty ratio index; and determining the optimal clustering parameters at the point where the Z-coordinate value of the three-dimensional surface map of the internal and external duty ratio index is minimum, wherein the optimal clustering parameters comprise: the input parameters that optimize the internal and external duty ratio index.

In this embodiment, the optimal clustering parameters are automatically obtained by obtaining the three-dimensional surface maps of the internal and external duty ratio indexes and then determining the optimal clustering parameters according to the three-dimensional surface maps of the internal and external duty ratio indexes, so that manual participation is not required, and the tedious process of manual participation is reduced.
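The automatic selection can be sketched as a search for the lowest point of the surface Z = f(neighborhood radius, minimum neighborhood point number). In the sketch below, `score` is a placeholder callable standing in for "cluster with these parameters, then compute the evaluation index"; the internal and external duty ratio index itself is not implemented here, and the toy score in the usage example is purely illustrative.

```python
def best_parameters(eps_values, min_pts_values, score):
    """Evaluate score(eps, min_pts) on a grid and return the minimizing pair.

    score : callable returning the index value (lower is better), or None
            when the index is undefined for that pair (e.g. fewer than two clusters).
    Returns (best_pair, surface) where surface maps (eps, min_pts) -> Z value.
    """
    surface = {}
    for eps in eps_values:
        for min_pts in min_pts_values:
            z = score(eps, min_pts)
            if z is not None:
                surface[(eps, min_pts)] = z
    if not surface:
        raise ValueError("no parameter pair produced a valid index value")
    best_pair = min(surface, key=surface.get)
    return best_pair, surface


if __name__ == "__main__":
    # Toy stand-in for the evaluation index, with its minimum at (2, 4).
    toy = lambda eps, m: (eps - 2) ** 2 + (m - 4) ** 2
    best, surf = best_parameters([1, 2, 3], [3, 4, 5], toy)
    print(best, surf[best])
```

The `surface` dictionary holds exactly the X, Y and Z values that would be plotted as the three-dimensional surface map.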

In some embodiments, after the optimal clustering result of the traveler trajectory data is determined according to the optimal clustering parameters, compactness evaluation, separation evaluation and DBI index evaluation can be performed on the optimal clustering result to determine the clustering effect of the optimal clustering result.

In this embodiment, compactness evaluation, separation evaluation and DBI evaluation are performed on the optimal clustering result to determine its clustering effect. These evaluations reflect the intra-cluster cohesion and the inter-cluster distance of the clustering result, which makes it convenient for a user to judge the clustering effect.
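For reference, the standard Davies-Bouldin index combines a per-cluster scatter (a compactness measure) with the distance between cluster centroids (a separation measure); lower values indicate better clustering. The sketch below follows that standard definition for two-dimensional points. Excluding noise points (label -1) and the helper names are assumptions of this example, and the application's own compactness and separation measures may be defined differently.

```python
import math


def centroid(pts):
    """Arithmetic mean of a list of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def davies_bouldin(points, labels, noise_label=-1):
    """Davies-Bouldin index: mean over clusters of max_j (S_i + S_j) / M_ij.

    S_i  : mean distance from the points of cluster i to its centroid (compactness)
    M_ij : distance between the centroids of clusters i and j (separation)
    Assumes distinct centroids; noise points are ignored.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != noise_label:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    scatter = {lab: sum(math.dist(p, cents[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}

    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(davies_bouldin(pts, [0, 0, 1, 1]))
```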

In addition to evaluating the clustering result with the internal and external duty ratio index as in the above embodiments, in some embodiments the preset evaluation index may further include: a silhouette coefficient evaluation index and a DBI evaluation index. Evaluating the clustering result according to the preset evaluation index to obtain the optimal clustering parameters then further comprises the following steps: evaluating the clustering result according to the silhouette coefficient evaluation index to obtain a three-dimensional surface map of the input parameters of the silhouette coefficient evaluation index, wherein the X coordinate of the three-dimensional surface map represents the neighborhood radius, the Y coordinate represents the minimum neighborhood point number, and the Z coordinate represents the silhouette coefficient; evaluating the clustering result according to the DBI evaluation index to obtain a three-dimensional surface map of the input parameters of the DBI evaluation index, wherein the X coordinate of the three-dimensional surface map represents the neighborhood radius, the Y coordinate represents the minimum neighborhood point number, and the Z coordinate represents the DBI; evaluating the clustering result according to the internal and external duty ratio index to obtain a three-dimensional surface map of the internal and external duty ratio index, wherein the X coordinate of the three-dimensional surface map represents the neighborhood radius, the Y coordinate represents the minimum neighborhood point number, and the Z coordinate represents the internal and external duty ratio index; and determining the optimal clustering parameters according to the three-dimensional surface maps of the input parameters of the silhouette coefficient evaluation index, the DBI evaluation index and the internal and external duty ratio index, wherein the optimal clustering parameters comprise one of the following: the optimal input parameters of the silhouette coefficient evaluation index, the optimal input parameters of the DBI evaluation index, and the optimal input parameters of the internal and external duty ratio index.

In this embodiment, the clustering result is evaluated by combining the contour coefficient evaluation index, the DBI index evaluation index, and the internal and external duty ratio index, so as to obtain a three-dimensional surface graph for each index; the optimal clustering parameter is finally obtained from these three-dimensional surface graphs, so that the accuracy of the clustering parameter can be further improved.

The embodiments of the present application are described and illustrated below by means of preferred embodiments.

In the related art, applying the DBSCAN algorithm requires two important parameters: the neighborhood radius, Eps, and the minimum neighborhood point number, MinPts, that a given point must have within its neighborhood to be a core object. DBSCAN has found wide application in many fields because of its simplicity and its ability to detect clusters of different sizes and shapes. However, the traditional DBSCAN algorithm depends heavily on the experience of the user when selecting the clustering parameters; if the user does not have enough practical experience to determine appropriate parameter values, improper input values may degrade the quality of the clustering result. To overcome this defect, on one hand, some related technologies combine a K-nearest neighbor algorithm with the DBSCAN algorithm to determine the clustering parameters, thereby realizing parameter-free clustering, or combine dominant sets (Dsets) with the DBSCAN algorithm to search for parameter values automatically. These methods, however, need to process the data at least twice, and their complicated steps are not suitable for large-scale data. On the other hand, some related technologies improve the clustering effect by improving the validity index of the clustering algorithm (the Dunn index, the DBI (Davies-Bouldin index) and the contour (Silhouette) coefficient are three basic indexes for evaluating unlabeled clustering). For example, some related technologies design a new clustering validity index, called the tight-separation ratio (CSP) index, to evaluate the clustering results generated by the AHC algorithm and determine the optimal number of clusters; others, in place of the plain Euclidean distance, attempt to capture the data density along the line segment connecting cluster means in order to estimate the distance between the cluster means. As a further example, in some related technologies, in order to mine effective and latent information about ship motion patterns from AIS data, track segments are clustered with a DBSCAN-like algorithm to obtain typical ship motion tracks; however, this method tends to focus on detecting spatial anomalies of the track, ignores temporal anomalies of the track, and suffers from problems such as low detection accuracy.

Meanwhile, the validity indexes for clustering parameters in the prior art are generally designed for two-dimensional artificial data sets; they consider only the degree of cohesion within clusters and the distance between clusters, and ignore the intra-cluster density. As a result, the clusters obtained in the related art may be elongated, strip-shaped clusters, which are not what is actually required. Therefore, in view of the defects of the indexes related to clustering parameters in the related art, it is necessary to improve both the DBSCAN algorithm and its validity index in order to correctly find the optimal clustering parameters of a traveler position information data set, thereby improving the accuracy of the clustering parameters.

The embodiment of the application improves DBSCAN: by clustering the data with the improved DBSCAN clustering algorithm, the input parameters can be determined automatically, which avoids the low clustering parameter accuracy that results from determining the clustering parameters according to the practical experience of users.

In the improved DBSCAN algorithm provided in the embodiment of the present application, a clustering result generated in a clustering process is used as an input parameter of an evaluation function, and then an evaluation result is obtained, as shown in table 1.

Table 1. Data sheet of the improved DBSCAN clustering algorithm

Wherein, the input parameters in table 1 include the following:

(1) D is the current input data set, e.g., D1(x₁, y₁), where x and y represent the plane coordinates of the points in the set.

(2) MaxEps is the maximum distance between two plane coordinate points and can be flexibly determined according to practical significance.

(3) MinEps is the minimum distance between two planar coordinate points and can be flexibly determined according to practical significance.

(4) E represents the distance between any two points in the set that is used as the candidate neighborhood radius; its value ranges between MinEps and MaxEps.

(5) MaxNum sets the upper limit of the clustering threshold, because if the threshold is too large, the data set may not form a valid cluster.

(6) MinNum sets the lower limit of the clustering threshold. If the threshold is too small, too many clusters may result, or even a single point may become a class, so that no meaningful final result is obtained.

(7) M is the candidate point-count threshold of a cluster, and ranges between MinNum and MaxNum.

The output parameters in table 1 include the following:

(1) ResultC is a clustering result, and different clustering results can be obtained by using different input parameters.

(2) MinIedci is the minimum value of the internal and external duty cycle index (IEDCI), initially set to infinity.

(3) BestEps is the optimum value for E, initially set to 0.

(4) BestMinPts is the optimum value for M, initially set to 0.
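As a non-limiting illustration, the input and output parameters listed above could be held in two small Python records. The class names SweepInputs and SweepOutputs and the update helper are hypothetical and do not appear in Table 1; only the field names and the initial values follow the text.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SweepInputs:
    """Input parameters named in the text (class name is hypothetical)."""
    D: list          # data set of (x, y) plane coordinates
    MinEps: float    # lower bound of the candidate neighborhood radius E
    MaxEps: float    # upper bound of the candidate neighborhood radius E
    MinNum: int      # lower bound of the candidate point-count threshold M
    MaxNum: int      # upper bound of the candidate point-count threshold M


@dataclass
class SweepOutputs:
    """Output parameters with the initial values stated in the text."""
    ResultC: list = field(default_factory=list)  # clustering result
    MinIedci: float = math.inf                   # smallest index value so far
    BestEps: float = 0.0                         # best E so far
    BestMinPts: int = 0                          # best M so far

    def update(self, value, eps, min_pts, labels):
        """Keep a candidate only if its index value is strictly smaller."""
        if value < self.MinIedci:
            self.MinIedci = value
            self.BestEps = eps
            self.BestMinPts = min_pts
            self.ResultC = labels
            return True
        return False
```

Initializing MinIedci to infinity means the first evaluated candidate always replaces the initial values, after which only strictly better candidates are kept.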

It should be noted that different input parameters (clustering parameters) generate different clustering results. In order not to miss any candidate parameter values, the improved DBSCAN algorithm in the embodiment of the present application provides a range of input parameters (clustering parameters), traverses all parameter values in the range, and generates a clustering result for each of them. An optimal evaluation value is finally obtained by evaluating the clustering results, and the optimal input parameters (optimal clustering parameters) are obtained by tracing back from that optimal evaluation value to the parameters that produced it.

The flow of the improved DBSCAN algorithm comprises the following steps:

Step 1: establishing an input parameter range.

In different application scenarios, the optimal clustering input parameter values fluctuate within a certain range. Establishing a suitable input parameter range before the algorithm is executed is important, because the range determines both the efficiency of the algorithm and the likelihood of finding an optimum. If the point-count threshold is too large, the data set may not form effective clusters; if it is too small, the clusters are too scattered to be of practical use. Furthermore, the distance between cluster points affects the compactness within a cluster. If the distance metric is too large, the clusters are too loose for different clusters to be distinguished effectively. If the distance metric is too small, the clustering distance is too tight, possibly yielding too many trivial, worthless clustering results. Therefore, in the early stage of clustering, the maximum and minimum values of Eps and MinPts are determined first, so as to construct the effective range of the clustering parameters.

Step 2: generating clustering results.

Cyclic density clustering is performed with the neighborhood radius range from step 1 as the input parameter until the clustering calculation over all tracks of the travelers in the 6 months is finished, and each clustering result (ResultC) is stored.

Step 3: evaluating the clustering results.

Each clustering result is evaluated with evaluation indexes such as the contour coefficient, the DBI (Davies-Bouldin index), and the internal and external duty cycle index (IEDCI) proposed herein. The best clustering parameters under the evaluation indexes are stored as BestEps and BestMinPts.

Step 4: obtaining the optimal clustering result.

The optimal clustering result is calculated with BestEps and BestMinPts from step 3 as input parameters. This clustering result is the clustering of the actual activity track of the traveler, and provides the start and end points of all possible trips of the traveler for subsequent research.
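The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation of the embodiment: dbscan is a plain textbook DBSCAN over planar points, sweep is a hypothetical helper that traverses the parameter range, and noise_fraction is only a stand-in score so that the example runs; an evaluation index such as the IEDCI would be passed in its place.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # former noise becomes a border point
            if labels[j] is not None:
                continue                    # already assigned, do not expand
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is a core point: keep expanding
    return labels


def noise_fraction(points, labels):
    """Stand-in score (assumption): share of points labelled as noise."""
    return labels.count(-1) / len(labels)


def sweep(points, eps_values, min_pts_values, score):
    """Steps 2 to 4: cluster for every (Eps, MinPts), score, keep the minimum."""
    best = (math.inf, 0, 0, [])   # (value, BestEps, BestMinPts, ResultC)
    for eps in eps_values:
        for min_pts in min_pts_values:
            labels = dbscan(points, eps, min_pts)
            value = score(points, labels)
            if value < best[0]:
                best = (value, eps, min_pts, labels)
    return best
```

Because the score is passed in as a function, the same traversal can be reused unchanged for the contour coefficient, the DBI or the IEDCI, provided each is expressed so that a smaller value is better.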

The duty-ratio-based cluster evaluation index in step 3 can be realized by the following method:

In some embodiments, an evaluation index of the clustering parameters is selected to evaluate the quality of the clustering result, which may also be referred to as clustering validity analysis. A good cluster partition should generally have the following characteristics: samples in different clusters are as different as possible, and samples in the same cluster are as similar as possible.

Research on historical tracks of travelers shows that the factors influencing clustering results include not only the degree of cohesion of clusters and the boundary distance between clusters, but also the number of track points in the clusters. Traditional evaluation indexes consider only factors such as clustering cohesion and clustering distance, and are therefore limited when applied to track clustering. When evaluating the cohesion of a cluster, they do not consider the intra-cluster density and ignore the relation between the number of points in a cluster and the size of the cluster. In irregular clustering, the influence of a single variable is often too large, and the optimum often falls on a boundary point of the parameter range, so that the optimal clustering parameters cannot be selected.

To address the problem that existing evaluation indexes are not suitable for density-based clustering of geographic position information, the embodiment of the application provides a validity evaluation index, the IEDCI (internal and external duty cycle index), based on the duty ratio inside and outside clusters. The formula of the internal and external duty ratio is as follows:

According to equation (1), the internal and external duty ratio involves three regions, S_i, S_j and S_{i+j}, as shown in FIG. 3, where S_i may represent the area enclosed by the outermost points of the i-th class, S_j may represent the area enclosed by the outermost points of the j-th class, and S_{i+j} may represent the area enclosed by the outermost points after the two classes are merged. By using the duty ratio to balance the relationship between the intra-class distance and the inter-class distance, the embodiment of the present application can avoid the improper situations in which every point becomes its own class or all points fall into one class. It should be noted that area is a two-dimensional criterion that can be used to evaluate the degree of dispersion of the two classes, which effectively avoids the influence of extreme linear distances that may exist between points of the two classes.
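The three areas can be computed, for instance, as sketched below. The sketch assumes that "the area enclosed by the outermost points" is taken as the area of the convex hull of the class, which the text does not state explicitly; the function names convex_hull, enclosed_area and three_areas are hypothetical. It computes only S_i, S_j and S_{i+j} and does not attempt to restate equation (1).

```python
def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order.

    Points must be (x, y) tuples. Collinear boundary points are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def enclosed_area(points):
    """Shoelace area of the convex hull; 0.0 if the hull is degenerate."""
    hull = convex_hull(points)
    n = len(hull)
    if n < 3:
        return 0.0
    twice = sum(
        hull[k][0] * hull[(k + 1) % n][1] - hull[(k + 1) % n][0] * hull[k][1]
        for k in range(n)
    )
    return abs(twice) / 2.0


def three_areas(class_i, class_j):
    """Return (S_i, S_j, S_{i+j}) for two classes given as lists of (x, y)."""
    return (
        enclosed_area(class_i),
        enclosed_area(class_j),
        enclosed_area(class_i + class_j),
    )
```

For two unit squares whose left edges are three units apart, this yields areas of 1 and 1 for the individual classes and 4 for the merged class, since the merged hull also covers the empty space between them.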

After the internal and external duty ratio is determined, an evaluation index, the IEDCI (internal and external duty cycle index), can be proposed on the basis of it, with the following formula:

where n_i and n_j are the numbers of points in the i-th and j-th clusters, and k is the current number of clusters.

The maximum term in the formula represents the maximum of the ratios over any two different clusters. F(k) is the result of the duty-ratio-based evaluation index: the sum of the intra-class densities of any two classes divided by the arithmetic mean of the maximum of the two-cluster merged densities. A smaller F(k) indicates a better classification result. Different numbers of clusters may lead to different results. The clustering parameters (clustering threshold MinPts and neighborhood radius Eps) are optimal when the value of F(k) is at its minimum.
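Because the formula itself is not reproduced in this text, the following sketch illustrates only the aggregation described in words above: for each cluster, the maximum of a pairwise ratio over all other clusters is taken, and these maxima are averaged over the k clusters. The pairwise ratio is deliberately left as a caller-supplied function pair_ratio(i, j), which is a hypothetical placeholder for the density ratio of the embodiment; mean_of_pairwise_max and pick_best are likewise hypothetical names.

```python
def mean_of_pairwise_max(k, pair_ratio):
    """Arithmetic mean over clusters i of the max over j != i of pair_ratio(i, j).

    pair_ratio is an assumption: any function returning the ratio for the
    pair of clusters (i, j); its exact form is not fixed by this sketch.
    """
    if k < 2:
        raise ValueError("at least two clusters are needed to form a pair")
    total = 0.0
    for i in range(k):
        total += max(pair_ratio(i, j) for j in range(k) if j != i)
    return total / k


def pick_best(candidates):
    """Choose the (eps, min_pts, f_value) triple with the smallest f_value."""
    return min(candidates, key=lambda c: c[2])
```

The same mean-of-maxima structure is used by the Davies-Bouldin index, which makes values of the two indexes comparable in the sense that a smaller value is better for both.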

In order to find the optimal input parameters and the optimal clustering results, the clustering results generated by different input parameters are evaluated through effectiveness evaluation indexes based on clustering points and clustering duty ratios, the current optimal input parameters (optimal clustering parameters) are determined according to previous feedback, and the accuracy of the clustering parameters is improved.

The examples of the present application are described and illustrated below with reference to a few experimental examples.

In some embodiments, a data set for conducting an experiment (a traveler's trajectory data set for the experiment) may be first determined, including at least one of: a simulation data set, a case data set.

Wherein, the acquisition of the simulation data set can be realized by the following modes:

The simulation data sets consist of random numbers generated by computer simulation; each data set can have 1200 points, and each point is represented in coordinate form and assigned to a cluster. As shown in FIG. 4, the data sets may be (a) clear clusters, (b) fuzzy clusters, (c) halo clusters, and (d) non-clusters; in these data sets, the structures of the clear clusters and the fuzzy clusters may be convex, the structure of the halo clusters may be annular, and the structure of the non-clusters may be scattered.

Wherein the case data set may be obtained by:

The case traveler trajectory data used in the embodiment of the application may come from an APP on the user terminal (for example, the Yi Bus mobile APP), where Yi Bus is a mobile APP that can be used to query traffic information such as stations, line transfers, and real-time arrival predictions. In the present embodiment, the location information data of 500 users in city G over the last 6 months (for example, January to June 2020) is used. The 500 users correspond to 500 TXT-formatted files, each containing all the location information of one traveler in the 6 months. The trajectory data of each traveler can be represented by the x and y coordinates of the trajectory points. Further, since the data set represents the trajectory points of real travelers, the structure of the data is more diverse than that of the computer-generated simulation data sets; for example, the structure of the data can include, but is not limited to, linear, circular, convex, and scattered shapes. The data structure of the case data set in this embodiment is shown in Table 2, where UID is the unique identifier of the SIM card of a user, LNG is the longitude of the current user location, LAT is the latitude of the current user location, and UP_TIME is the coordinate upload time.

Table 2. Data structure of the city G bus trip data

Because the data collected by the APP may contain damaged data, repeated data, invalid data and the like, the data need to be preprocessed. In the embodiment of the application, the data can be preprocessed in the following two ways:

Method one, data cleaning: the preprocessing mainly deletes irrelevant data and repeated data and smooths noisy data.

Method two, data ETL (Extract-Transform-Load): all behavior tracks of a user are extracted from the data instances by using the unique identification code of the user, and a single-user data set is constructed; all users are traversed in a loop, and the resulting single-user data sets of the plurality of users serve as the candidate set of the whole cluster set. Finally, a plurality of candidates are extracted from the candidate set as experimental objects, ensuring that the number of trajectory data points of each single user is greater than a preset value (for example, 1000), and the cluster set is constructed.
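The two preprocessing methods can be illustrated with the following minimal sketch. It assumes that each record has already been read into a 4-tuple (UID, LNG, LAT, UP_TIME); the layout of the TXT files is not specified in the text, so no parsing is shown. The function names clean and build_user_sets are hypothetical, the validity check on coordinate ranges is an assumption about what counts as invalid data, and smoothing of noisy data is not covered.

```python
from collections import defaultdict


def clean(records):
    """Drop malformed, out-of-range and exactly repeated records."""
    seen, kept = set(), []
    for rec in records:
        if len(rec) != 4:
            continue                      # damaged record
        uid, lng, lat, up_time = rec
        try:
            lng, lat = float(lng), float(lat)
        except (TypeError, ValueError):
            continue                      # non-numeric coordinates
        # Assumption: coordinates outside the valid range are invalid data.
        if not (-180.0 <= lng <= 180.0 and -90.0 <= lat <= 90.0):
            continue
        key = (uid, lng, lat, up_time)
        if key in seen:
            continue                      # repeated record
        seen.add(key)
        kept.append(key)
    return kept


def build_user_sets(records, min_points=1000):
    """Group records by UID and keep users with more than min_points points."""
    per_user = defaultdict(list)
    for uid, lng, lat, up_time in records:
        per_user[uid].append((up_time, lng, lat))
    # Sort each track by upload time so that points appear in travel order.
    return {
        uid: sorted(track)
        for uid, track in per_user.items()
        if len(track) > min_points
    }
```

Applying clean before build_user_sets ensures that the preset minimum number of points per user is counted on valid, de-duplicated records only.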

After the experimental data set is selected, clustering parameters can be compared in different experimental modes, and the comparison mode can include the following steps:

In the above embodiment, in traveler track mining, Eps is the walking distance of the traveler and MinPts is the number of times the traveler stays in a certain area, so both have practical significance. Therefore, the parameter ranges can be defined according to their practical meaning. Statistics of the existing data show, for example, that travelers mostly walk between 20 and 110 meters. Thus, experiments in the examples of the present application may set the range of Eps to (20, 110), and subsequent experimental tests may also be based on this range.
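Since Eps is expressed in meters while the case data are stored as longitude and latitude, a distance in meters between two track points is needed before a point can be tested against Eps. The text does not state how this distance is computed; the sketch below assumes the haversine great-circle formula on a spherical Earth with a mean radius of 6371000 m, and the names haversine_m and within_eps are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (spherical assumption)


def haversine_m(lng1, lat1, lng2, lat2):
    """Great-circle distance in meters between two (LNG, LAT) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_eps(p, q, eps_m):
    """True if track points p and q, given as (LNG, LAT), are within eps_m meters."""
    return haversine_m(p[0], p[1], q[0], q[1]) <= eps_m
```

Over distances of tens to hundreds of meters, a planar projection of the coordinates would serve equally well; the haversine form is used here only because it needs no projection parameters.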

Clustering with too few or too many points has no practical meaning: with too small a threshold, a cluster may consist only of noise points, while with too large a threshold it is difficult to find any cluster. Therefore, in the experiments in the examples of the present application, the range of MinPts may be set to (8, 13), and subsequent experimental tests may be based on this range.
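The candidate parameter pairs to be traversed can then be enumerated from the two ranges, for example as sketched below. The ranges 20 to 110 for Eps and 8 to 13 for MinPts follow the text; the 5-meter step for Eps, the inclusion of both endpoints, and the function names are assumptions made for illustration only.

```python
from itertools import product


def eps_candidates(min_eps=20, max_eps=110, step=5):
    """Candidate neighborhood radii in meters (step size is an assumption)."""
    return list(range(min_eps, max_eps + 1, step))


def min_pts_candidates(min_num=8, max_num=13):
    """Candidate point-count thresholds, endpoints included (assumption)."""
    return list(range(min_num, max_num + 1))


def parameter_grid(min_eps=20, max_eps=110, step=5, min_num=8, max_num=13):
    """All (Eps, MinPts) pairs to be clustered and evaluated."""
    return list(product(eps_candidates(min_eps, max_eps, step),
                        min_pts_candidates(min_num, max_num)))
```

A coarser step reduces the number of clustering runs at the cost of resolution in Eps, which is one way a calculation step length can trade accuracy for efficiency.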

In order to verify the performance of the parameters automatically selected by the improved DBSCAN algorithm, the embodiment of the present application may generate a clustering result by using the case data set, and compare the clustering result with other parameters, such as empirical values and statistical values in the related art.

The results over all input parameters are counted to obtain the most common clustering parameters, as shown in FIG. 5, a histogram of clustering statistics in which "Number of clusters" represents the number of clusters and "Frequency of clusters" represents how often that number of clusters occurs. The median (60, 12) of the current input parameters may be used as the statistical input parameters (an Eps value of 60 and a MinPts value of 12). The Eps and MinPts values obtained from user experience are 85 and 10, respectively. The Eps and MinPts values obtained by the improved DBSCAN algorithm of the embodiment of the application are 65 and 12, respectively.

The case data set contains the anchor point (location) information of 500 individuals. Clustering results can also be evaluated using compactness, separation, and the DBI. The compactness and the DBI represent the cohesion of the classes, and the separation represents the distance between the classes; the smaller the compactness and DBI values and the larger the separation value, the better the clustering effect. As can be seen from Table 3, the clustering parameters automatically generated in the embodiment of the present application achieve a better clustering effect in terms of separation and DBI. Compared with parameters input according to the practical experience of a user in the related art, the clustering effect is improved by the method in the embodiment of the application.
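For illustration, the three measures could be computed as sketched below. The Davies-Bouldin index follows its standard definition with centroid distances; the text does not define compactness and separation precisely, so the sketch assumes compactness to be the mean point-to-centroid distance averaged over clusters and separation to be the mean distance between cluster centroids. All function names are hypothetical.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Mean point of a cluster given as a list of (x, y) tuples."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)


def scatter(cluster):
    """Mean distance from the points of a cluster to its centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)


def compactness(clusters):
    """Assumed definition: scatter averaged over clusters (smaller is better)."""
    return sum(scatter(c) for c in clusters) / len(clusters)


def separation(clusters):
    """Assumed definition: mean centroid-to-centroid distance (larger is better)."""
    if len(clusters) < 2:
        raise ValueError("separation needs at least two clusters")
    pairs = list(combinations([centroid(c) for c in clusters], 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def davies_bouldin(clusters):
    """Standard Davies-Bouldin index (smaller is better)."""
    k = len(clusters)
    if k < 2:
        raise ValueError("the index needs at least two clusters")
    cents = [centroid(c) for c in clusters]
    scats = [scatter(c) for c in clusters]
    total = 0.0
    for i in range(k):
        total += max((scats[i] + scats[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k
```

Each function takes the clusters as lists of points with noise points already removed, so the same clustering result can be scored by all three measures side by side.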

Table 3. Experimental results for different performance parameters

To verify the performance of the IEDCI, embodiments of the present application may use the simulation data sets and the case data set to generate clustering results, respectively, and compare the IEDCI with other validity indexes, including the DBI and the contour coefficient.

In this embodiment, the simulation data set is used to evaluate the performance of different algorithms, and the evaluation process includes the following steps:

The clustering results of a plurality of simulation data sets are evaluated by using compactness and separation in the embodiment of the application. Table 4 shows the compactness evaluation results of the three evaluation indexes; from the results in Table 4, the IEDCI gives better values for the clear cluster, fuzzy cluster, and non-cluster data sets. Table 5 shows the separation evaluation results of the three evaluation functions, from which it can be seen that the IEDCI gives better evaluation values for the clear cluster and non-cluster data sets.

Table 4. Compactness evaluation results for different performance indexes

Table 5. Separation evaluation results for different performance indexes

The case data set may be used in this embodiment to evaluate the performance of the algorithm, and the evaluation process is as follows:

Firstly, selecting the optimal input (optimal clustering parameters): the clustering algorithm in Table 1 above is executed using the three evaluation indexes of contour coefficient, DBI, and IEDCI. After traversing all possible values in the parameter range, the algorithm obtains the optimal input parameters corresponding to the three evaluation functions, as shown in Table 6. In this embodiment, a three-dimensional surface graph may be used to explain the process of obtaining the optimal input parameters, as shown in FIG. 6. In FIG. 6, the x-axis represents all possible values of Eps, the y-axis represents all possible values of MinPts, and the z-axis represents the corresponding value of the evaluation index. The Eps and MinPts values are best when the index value is minimal. With parameter selection carried out over the same input range, the optimal parameter value given by the contour coefficient lies at a boundary point of the range, whereas both the DBI and the IEDCI obtain their optimal parameter values inside the range, where the value of the evaluation index is minimized.
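Reading the optimum off such a surface amounts to locating the grid point with the smallest index value and checking whether it lies on the edge of the parameter range. A minimal sketch follows, assuming the surface is stored as a dictionary that maps each (Eps, MinPts) pair to its index value; the names best_on_surface and on_boundary are hypothetical.

```python
def best_on_surface(surface):
    """Return the (eps, min_pts) key whose index value is smallest."""
    return min(surface, key=surface.get)


def on_boundary(surface, point):
    """True if point lies on the edge of the rectangular parameter range."""
    eps_values = [eps for eps, _ in surface]
    min_pts_values = [m for _, m in surface]
    eps, min_pts = point
    return (eps in (min(eps_values), max(eps_values))
            or min_pts in (min(min_pts_values), max(min_pts_values)))
```

An optimum reported on the boundary suggests that the index is still decreasing beyond the range, so the minimum found there is a property of the chosen range rather than of the data.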

Table 6. Optimal MinPts and Eps values

Secondly, clustering results: three different clustering results are generated using the optimal input values (optimal clustering parameters) of the three evaluation indexes, as shown in FIG. 7. It can be seen from FIG. 7 that, for the cluster points in the same range, the result produced by the contour coefficient evaluation index aggregates the discrete points within the red ellipse into one class. However, judged against the practical situation of the traveler track, this clustering result is poor, because the resulting activity range of the traveler is too large. In the DBI clustering result, the red ellipse is divided into two parts. Similarly, in FIG. 7(b), the distance from point A to point B is far beyond the range of human activity (e.g., 500 meters). In the clustering result of the improved algorithm in the embodiment of the application, the range of the activities of the travelers is smaller than the radius of the resident track. Therefore, the improved clustering algorithm has higher precision and a better effect in practical application.

Thirdly, evaluating the optimal clustering result: the generated optimal clustering results were evaluated for compactness and separation, and the evaluation results are shown in table 7. On the basis of fully considering the influence of clustering density and clustering distance, the optimal clustering result obtained in the embodiment of the application has higher separability and smaller compactness, so that the method is more suitable for the actual situation of human activities in track clustering and has higher accuracy.

Table 7. Evaluation results for separation and compactness

In the embodiment of the application, the input parameters of the clustering algorithm (DBSCAN) are evaluated based on an improved evaluation index (the internal and external duty ratio index), which balances the intra-class distance and the inter-class distance, so that the optimal input parameters for clustering traveler position information are obtained and the problem of inaccurate parameters caused by reliance on manual experience is solved. In addition, the scheme provided in the embodiment of the application is verified on bus trip data, and experiments show that the algorithm provided by the embodiment can find the optimal input parameter values on a trajectory data set. Calculation of the compactness and separation of the clustering results, and comparison with the DBI (Davies-Bouldin index) and the contour coefficient, show that the optimal parameter values obtained with the IEDCI evaluation index give a smaller intra-cluster value and a larger inter-cluster value. Therefore, the improved clustering algorithm provided by the embodiment of the application performs well in mining trip starting points from clusters of traveler trajectory data.

The scheme provided in the embodiment of the application can be used for clustering traveler position information to obtain a travel starting point and a travel finishing point, and can be popularized to the routing problems of logistics and supply chain management, automobile dynamic routing, gas station planning and the like in some embodiments.

The size of the cluster is limited due to the range of motion of the person or vehicle. Therefore, in some embodiments, the SIM card positioning information of the user may also be added to the experimental data to enrich the data diversity, since the frequency of use of APP directly determines the cluster density of the current cluster. Secondly, the overall calculation efficiency can be improved by introducing the calculation step length into the calculation process.

The embodiment also provides a clustering device for mining traveler trajectories, which is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the devices described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 8 is a block diagram illustrating a clustering apparatus for mining a pedestrian trajectory according to an embodiment of the present application, where the apparatus includes, as shown in fig. 8:

an obtaining module 81, configured to obtain traveler trajectory data;

a first determining module 82, coupled to the obtaining module 81, configured to determine a clustering parameter of the traveler trajectory data, where the clustering parameter includes: the optimal neighborhood radius and the minimum neighborhood number;

a second determining module 83, coupled to the first determining module 82, configured to determine a clustering result of the traveler trajectory data according to the clustering parameters;

an evaluation module 84, coupled to the second determining module 83, configured to evaluate the clustering result according to a preset evaluation index to obtain an optimal clustering parameter, where the preset evaluation index includes an internal and external duty ratio index, and the evaluation result of the internal and external duty ratio index is: the sum of the intra-class densities of any two classes divided by the arithmetic mean of the maximum of the two-cluster merged densities;

and a third determining module 85, coupled to the evaluating module 84, configured to determine an optimal clustering result of the traveler trajectory data according to the optimal clustering parameter.

In some of these embodiments, the apparatus further comprises: the preprocessing module is used for preprocessing the traveler trajectory data, wherein the preprocessing comprises at least one of the following steps: data cleaning processing and data ETL processing.

In some of these embodiments, the first determination module 82 includes: the first determining unit is used for determining a walking distance interval of a traveler in a preset time period and determining a staying time interval in a preset area in the preset time period; and the second determining unit is used for determining the optimal neighborhood radius according to the walking distance interval and determining the minimum neighborhood point number according to the staying time interval.

In some of these embodiments, the second determining module 83 includes: a calculating unit, configured to perform cyclic density clustering calculation by taking the clustering parameters as input parameters of a preset clustering model to obtain a clustering result.

In some of these embodiments, the evaluation module 84 includes: a first evaluation unit, configured to evaluate the clustering result according to the internal and external duty ratio index to obtain a three-dimensional surface map of the internal and external duty ratio index, wherein the X coordinate of the three-dimensional surface map expresses the neighborhood radius, the Y coordinate expresses the minimum neighborhood point number, and the Z coordinate expresses the internal and external duty ratio index; and a third determining unit, configured to determine the optimal clustering parameter at the point where the value on the Z coordinate of the three-dimensional surface map of the internal and external duty ratio index is minimum, where the optimal clustering parameter includes: the optimal input parameter under the internal and external duty ratio index.

In some of these embodiments, the apparatus further comprises: a fourth determining module, configured to perform compactness evaluation, separation evaluation and DBI index evaluation on the optimal clustering result and to determine the clustering effect of the optimal clustering result.

In some of these embodiments, the evaluation module 84 further includes: a second evaluation unit, configured to evaluate the clustering result according to the contour coefficient evaluation index to obtain a three-dimensional surface map of the contour coefficient over the input parameters, wherein the X coordinate of the three-dimensional surface map expresses the neighborhood radius, the Y coordinate expresses the minimum neighborhood point number, and the Z coordinate expresses the contour coefficient; a third evaluation unit, configured to evaluate the clustering result according to the DBI index evaluation index to obtain a three-dimensional surface map of the DBI index over the input parameters, wherein the X coordinate expresses the neighborhood radius, the Y coordinate expresses the minimum neighborhood point number, and the Z coordinate expresses the DBI index; a fourth evaluation unit, configured to evaluate the clustering result according to the internal and external duty ratio index to obtain a three-dimensional surface map of the internal and external duty ratio index, wherein the X coordinate expresses the neighborhood radius, the Y coordinate expresses the minimum neighborhood point number, and the Z coordinate expresses the internal and external duty ratio index; and a fourth determining unit, configured to determine the optimal clustering parameter according to the three-dimensional surface maps of the contour coefficient evaluation index, the DBI index evaluation index, and the internal and external duty ratio index, where the optimal clustering parameter includes one of the following: the optimal input parameter under the contour coefficient evaluation index, the optimal input parameter under the DBI index evaluation index, and the optimal input parameter under the internal and external duty ratio index.

The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.

The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

Step S201: acquiring traveler trajectory data.

It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.

In addition, in combination with the clustering method for mining traveler trajectories in the above embodiments, an embodiment of the present application may provide a storage medium for implementing the method. The storage medium has a computer program stored thereon; when executed by a processor, the computer program implements the clustering method for mining traveler trajectories of any of the above embodiments.

It should be understood by those skilled in the art that the technical features of the above-described embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these features are described; however, any combination of features that is not contradictory should be construed as being within the scope of the present disclosure.

The above-mentioned embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A clustering method for mining traveler trajectories, the method comprising:

acquiring traveler trajectory data;

2. A method of clustering mined traveler trajectories according to claim 1, characterized in that before determining the clustering parameters of the traveler trajectory data, the method further comprises:

3. The method of claim 1, wherein determining clustering parameters for the traveler trajectory data comprises:

4. The method of claim 1, wherein determining the clustering result of the traveler trajectory data according to the clustering parameters comprises:

5. The method of claim 1, wherein evaluating the clustering results according to a predetermined evaluation index to obtain an optimal clustering parameter comprises:

6. The method of claim 1, wherein after determining an optimal clustering result for the traveler trajectory data based on the optimal clustering parameters, the method further comprises:

7. The clustering method for mining traveler trajectories as claimed in claim 1, wherein the preset evaluation index further comprises: a contour coefficient evaluation index and a DBI evaluation index; and wherein evaluating the clustering result according to the preset evaluation index to obtain the optimal clustering parameter comprises:

8. A clustering apparatus for mining traveler trajectories, the apparatus comprising:

an acquisition module, configured to acquire the traveler trajectory data;

9. An electronic device comprising a memory and a processor, characterized in that the memory has a computer program stored therein, and the processor is configured to execute the computer program to perform the clustering method for mining traveler trajectories of any one of claims 1 to 7.

10. A storage medium in which a computer program is stored, wherein the computer program is configured, when run, to execute the clustering method for mining traveler trajectories of any one of claims 1 to 7.