CN112990976B

CN112990976B - Commercial site selection method, system, equipment and medium based on open source data mining

Info

Publication number: CN112990976B
Application number: CN202110332552.2A
Authority: CN
Inventors: 魏宗财; 刘雨飞; 唐琦婧; 魏纾晴; 彭丹丽; 陈旭华; 刘晨瑜
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-03-29
Filing date: 2021-03-29
Publication date: 2024-06-14
Anticipated expiration: 2041-03-29
Also published as: CN112990976A

Abstract

The invention discloses a commercial site selection method, system, equipment and medium based on open source data mining. The method comprises the following steps: acquiring data of a target area through a multi-source open data platform; dividing the target area into numbered grids, and constructing an index system for cluster-based site selection according to the acquired data; preprocessing the data of the target area; spatially linking the preprocessed data to the divided grids, and computing the value of each influencing factor in the index system for each grid; tabulating the values of the influencing factors by grid number, and analyzing them with a two-step clustering algorithm; and, according to the results of the two-step clustering analysis, giving site selection suggestions for commercial sites of different categories and different scales. Based on open source data mining combined with two-step clustering analysis, the method can provide assistance and reference for siting commercial sites of different scales and categories in a city.

Description

Commercial site selection method, system, equipment and medium based on open source data mining

Technical Field

The invention relates to a commercial site selection method, in particular to a commercial site selection method, a system, equipment and a medium based on open source data mining.

Background

Site selection for commercial sites is of great significance. From the perspective of macroscopic urban planning, commercial sites are an important component of high-quality urban development and affect both the vitality of a city and the travel of its citizens; a reasonable layout of commercial sites can improve the operating efficiency of a city. From the perspective of individual enterprises and operators, a commercial site is the basic unit of operation and development, and the compatibility between a location in the city and a given type of business is a key factor in site selection. Compared with other operating factors, a site is long-term and fixed: when the external environment changes, other operating factors can be adjusted, but a site, once determined, is difficult to change. A suitable site therefore benefits enterprises and individuals over the long term.

Existing commercial site selection methods mainly consider factors such as population, traffic, existing commercial aggregation and store rent. These are core indexes for commercial site selection, but they are not comprehensive. In addition, existing methods use little data, do not consider the city as a whole, and find it difficult to determine with a unified standard whether a location is suitable for commercial sites of different types and scales. Scholars have used shared bicycle travel data to analyze the traffic hot spot areas of a city and have verified a high correlation between traffic hot spots and business; the compatibility between different locations in a city and business is likewise a key factor in commercial site selection.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a commercial site selection method, system, equipment and medium based on open source data mining. Data obtained by open source data mining are analyzed with a two-step clustering algorithm, and the analysis results can provide assistance and reference for siting commercial sites of different scales and categories in a city.

A first object of the present invention is to provide a method for site selection for commercial sites based on open source data mining.

A second object of the present invention is to provide a commercial site selection system based on open source data mining.

A third object of the present invention is to provide a computer device.

A fourth object of the present invention is to provide a storage medium.

The first object of the present invention can be achieved by adopting the following technical scheme:

a method of site selection for commercial sites based on open source data mining, the method comprising:

acquiring data of a target area through a multi-source data open platform;

performing grid division and numbering on the target area, and constructing an index system for cluster-based site selection according to the acquired data;

preprocessing the data of the target area;

spatially linking the preprocessed data to the divided grids, and computing the value of each influencing factor in the index system;

tabulating the value of each influencing factor in the index system by grid number, and analyzing the table using a two-step clustering algorithm;

and giving, according to the analysis results of the two-step clustering algorithm, site selection suggestions for commercial sites of different categories and different scales.

Further, performing grid division and numbering on the target area, where each grid is a basic unit for site selection, and constructing an index system for cluster-based site selection according to the acquired data specifically comprises:

extracting the boundary of the target area, creating grid polygon features that cover the boundary, and clipping them to the boundary to obtain numbered grids;

dividing the clustering indexes, according to the factors to be considered in commercial site selection, into six categories: a population factor, a shared bicycle travel factor, a store rent factor, a comprehensive traffic factor, a commercial aggregation factor and a land use factor.
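The grid creation and numbering step can be sketched as follows. This is a minimal illustration, not the GIS workflow itself: it assumes projected coordinates in metres, covers only the bounding box of the target area (clipping to the actual boundary is omitted), and uses the 300 m cell size of the embodiment. The function names `make_grid` and `cell_number` are hypothetical.

```python
import math

def make_grid(xmin, ymin, xmax, ymax, cell=300.0):
    """Return numbered square cells covering a bounding box (projected metres).

    Cells are numbered row by row starting at 1. Clipping to the real
    boundary polygon is NOT done here (assumption: bounding box only).
    """
    ncols = math.ceil((xmax - xmin) / cell)
    nrows = math.ceil((ymax - ymin) / cell)
    grid = {}
    number = 1
    for r in range(nrows):
        for c in range(ncols):
            x0 = xmin + c * cell
            y0 = ymin + r * cell
            grid[number] = (x0, y0, x0 + cell, y0 + cell)
            number += 1
    return grid, ncols

def cell_number(x, y, xmin, ymin, ncols, cell=300.0):
    """Grid number of the cell that contains the point (x, y)."""
    col = int((x - xmin) // cell)
    row = int((y - ymin) // cell)
    return row * ncols + col + 1

# A 1000 m x 650 m box needs 4 columns and 3 rows of 300 m cells.
grid, ncols = make_grid(0, 0, 1000, 650)
print(len(grid))
print(cell_number(650, 310, 0, 0, ncols))
```

Numbering cells at creation time gives every later factor value a stable key, which is what allows the per-grid values to be collected into one table for clustering.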

Further, the data of the target area comprise population density, shared bicycle travel, store rent, urban road traffic, commercial POI and land use data; the commercial POI data comprise catering, financial and shopping commercial POI data;

the preprocessing the data of the target area specifically includes:

classifying the population density raster data into five classes using the natural breaks classification method, assigning values from low to high to the corresponding intervals to obtain a population reclassification raster, and thereby converting population density into an ordered categorical variable;

applying a line tracking interval tool to the shared bicycle travel data, and performing line tracking analysis based on start and end points to obtain shared bicycle path line data;

processing the store rent data with a kernel density estimation method, using the rent field as the value field, to obtain a store rent evaluation raster; classifying the store rent evaluation raster using the natural breaks classification method, assigning values from low to high to the corresponding intervals to obtain a store rent reclassification raster, and thereby converting store rent into an ordered categorical variable;

processing the commercial POI data with the kernel density estimation method to obtain commercial aggregation distribution rasters for the catering, shopping and finance categories respectively;

classifying the catering, shopping and finance commercial aggregation distribution rasters using the natural breaks classification method, assigning values from low to high to the corresponding intervals to obtain commercial aggregation reclassification rasters for the three categories, and converting all reclassification rasters into ordered categorical variables;

for the urban arterial road data, urban secondary arterial road data, urban subway station POI data and urban bus station POI data in the urban road traffic data, establishing distance-based buffers for the urban arterial roads, urban secondary arterial roads, bus stations and subway stations respectively, assigning values to the buffers according to distance, and thereby converting the four distances into ordered categorical variables;

and converting the current land use map into vector data, and assigning values to commercial and non-commercial parcels respectively to obtain land use classification polygon features.

Further, spatially linking the preprocessed data to the divided grids and computing the value of each influencing factor in the index system specifically comprises:

converting the population reclassification raster, the store rent reclassification raster, and the catering, shopping and finance commercial aggregation reclassification rasters into polygon features, and spatially linking each of them to the divided grids to obtain the population density factor evaluation, the store rent factor evaluation, and the catering, shopping and finance commercial aggregation evaluations respectively, wherein the link value of a single grid is the mean of the raster values within that grid;

spatially linking the shared bicycle path line data to the divided grids to obtain the shared bicycle travel path length factor evaluation, wherein the link value of a single grid is the total path length within that grid;

applying a multi-factor weighted overlay analysis method to the buffers of the urban arterial roads, urban secondary arterial roads, bus stations and subway stations to obtain a comprehensive traffic factor map;

converting the comprehensive traffic factor map into polygon features and spatially linking it to the divided grids to obtain the urban road traffic factor evaluation, wherein the link value of a single grid is the mean of the raster values within that grid;

and spatially linking the land use classification polygon features to the divided grids to obtain the land use evaluation, wherein the link value of a single grid is the mean of the feature values within that grid.

Further, the multi-factor weighted overlay analysis method specifically comprises:

overlaying four factors, namely the arterial road distance, the secondary arterial road distance, the subway station distance and the bus station distance, to obtain the comprehensive traffic factor evaluation, the evaluation model being:

$$S = \sum_{i=1}^{n} W_i X_i$$

where S is the final comprehensive traffic factor evaluation, W_i is the weight of the i-th factor, and X_i is the i-th variable factor; the arterial road distance weight is 0.3, the secondary arterial road distance weight is 0.2, the bus station distance weight is 0.2, and the subway station distance weight is 0.3.
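As an illustration of the weighted overlay, the sketch below combines four per-cell distance scores with the stated weights (0.3, 0.2, 0.2, 0.3). It assumes each factor has already been converted to an ordered value from 1 to 5; the dictionary keys and the function name `traffic_score` are hypothetical.

```python
# Weights as stated in the text; the key names are illustrative only.
WEIGHTS = {"arterial": 0.3, "secondary": 0.2, "bus": 0.2, "subway": 0.3}

def traffic_score(scores, weights=WEIGHTS):
    """Weighted sum S = sum(W_i * X_i) over the four distance factors."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return sum(weights[name] * scores[name] for name in weights)

# One grid cell: close to an arterial road, far from a subway station.
cell = {"arterial": 5, "secondary": 3, "bus": 4, "subway": 2}
print(traffic_score(cell))  # about 3.5
```

Because the weights sum to 1, the combined score stays on the same 1 to 5 scale as the individual factors.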

Further, tabulating the value of each influencing factor in the index system by grid number and analyzing with the two-step clustering algorithm specifically comprises:

compiling, by grid number, the values of the population density factor evaluation, the shared bicycle travel path length factor, the store rent factor evaluation, the traffic factor evaluation, the catering, finance and shopping commercial aggregation evaluations, and the land use evaluation into a table;

analyzing the table with the two-step clustering algorithm to generate a clustering result table in which each cluster number corresponds to the grid numbers assigned to it;

obtaining a final result map according to the clustering result table;

and giving site selection suggestions for commercial sites of different categories and different scales according to the clustering result table and the spatial distribution shown in the final result map.

Further, the two-step clustering algorithm comprises a pre-clustering stage and a clustering stage, wherein distance measures are used in the pre-clustering stage and the clustering stage;

the pre-clustering stage comprises: adopting the idea of CF tree growth from the BIRCH algorithm, reading the data points in the data set one by one, and, while the CF tree is being generated, pre-clustering the data points in dense areas to form a number of small sub-clusters;

the clustering stage comprises: taking the sub-clusters produced by the pre-clustering stage as objects, and merging them one by one using an agglomerative method until the expected number of clusters is reached.
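The two stages can be sketched in simplified form as follows. This is an assumption-laden illustration rather than the algorithm as implemented in any statistics package: the CF tree is replaced by a flat list of sub-clusters, each summarized by a count and a linear sum, and plain Euclidean distance between centroids is used in both stages. All names (`SubCluster`, `pre_cluster`, `agglomerate`, `two_step`) and the `threshold` parameter are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class SubCluster:
    """Simplified clustering feature: count, linear sum, member row ids."""
    def __init__(self, idx, point):
        self.n = 1
        self.ls = list(point)
        self.members = [idx]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def absorb(self, other):
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.members += other.members

def pre_cluster(points, threshold):
    """Stage 1: read points one by one; a point joins the nearest
    sub-cluster if it lies within `threshold`, else it starts a new one."""
    subs = []
    for idx, p in enumerate(points):
        nearest = min(subs, key=lambda s: dist(s.centroid(), p), default=None)
        if nearest is not None and dist(nearest.centroid(), p) <= threshold:
            nearest.absorb(SubCluster(idx, p))
        else:
            subs.append(SubCluster(idx, p))
    return subs

def agglomerate(subs, k):
    """Stage 2: repeatedly merge the two closest sub-clusters until k remain."""
    subs = list(subs)
    while len(subs) > k:
        pairs = ((a, b) for a in range(len(subs)) for b in range(a + 1, len(subs)))
        i, j = min(pairs, key=lambda ab: dist(subs[ab[0]].centroid(),
                                              subs[ab[1]].centroid()))
        subs[i].absorb(subs[j])
        del subs[j]
    return subs

def two_step(points, threshold, k):
    """Return a 1-based cluster label for every input row."""
    clusters = agglomerate(pre_cluster(points, threshold), k)
    labels = [None] * len(points)
    for label, cluster in enumerate(clusters, start=1):
        for idx in cluster.members:
            labels[idx] = label
    return labels

rows = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 4), (4, 5), (4, 4)]
print(two_step(rows, threshold=1.05, k=2))
```

With these seven rows the first stage forms four small sub-clusters and the second stage merges them into two, so the low-valued and high-valued rows end up with different labels.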

The second object of the invention can be achieved by adopting the following technical scheme:

a commercial site selection system based on open source data mining, the system comprising:

a data acquisition module, used for acquiring data of the target area through the multi-source open data platform;

a grid division module, used for performing grid division and numbering on the target area and constructing an index system for cluster-based site selection according to the acquired data;

a data preprocessing module, used for preprocessing the data of the target area;

a statistics module, used for spatially linking the preprocessed data to the divided grids and computing the value of each influencing factor in the index system;

a statistics and analysis module, used for tabulating the value of each influencing factor in the index system by grid number and analyzing the table using a two-step clustering algorithm;

and a site selection suggestion module, used for giving site selection suggestions for commercial sites of different categories and different scales according to the analysis results of the two-step clustering algorithm.

The third object of the present invention can be achieved by adopting the following technical scheme:

a computer device comprises a processor and a memory for storing a program executable by the processor, wherein the processor realizes the commercial site selection method when executing the program stored by the memory.

The fourth object of the present invention can be achieved by adopting the following technical scheme:

a storage medium storing a program which, when executed by a processor, implements the commercial site selection method described above.

Compared with the prior art, the invention has the following beneficial effects:

Based on data obtained by open source data mining, the invention considers multiple factors comprehensively at the scale of the whole city, including the compatibility between different locations in the city and business, which is a key factor in commercial site selection. Combined with a two-step clustering algorithm, it provides assistance and reference for siting commercial sites of different types and scales in a city.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of the commercial site selection method based on open source data mining according to embodiment 1 of the present invention.

Fig. 2 is a population density factor evaluation map of embodiment 1 of the present invention.

Fig. 3 is a shared bicycle travel path length map of embodiment 1 of the present invention.

Fig. 4 is a store rent factor evaluation map of embodiment 1 of the present invention.

Fig. 5 is a catering commercial aggregation evaluation map of embodiment 1 of the present invention.

Fig. 6 is a financial commercial aggregation evaluation map of embodiment 1 of the present invention.

Fig. 7 is a shopping commercial aggregation evaluation map of embodiment 1 of the present invention.

Fig. 8 is a traffic factor evaluation map of embodiment 1 of the present invention.

Fig. 9 is a land use evaluation map of embodiment 1 of the present invention.

Fig. 10 is a schematic diagram of the model of embodiment 1 of the present invention.

Fig. 11 is a cluster quality diagram of embodiment 1 of the present invention.

Fig. 12 is a schematic diagram of a clustering result table in embodiment 1 of the present invention.

Fig. 13 is a block diagram of a commercial site selection based on a two-step clustering algorithm according to embodiment 1 of the present invention.

Fig. 14 is a block diagram of a commercial site selection system based on open source data mining according to embodiment 2 of the present invention.

Fig. 15 is a block diagram showing the structure of a computer device according to embodiment 3 of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by persons of ordinary skill in the art without making any inventive effort based on the embodiments of the present invention are within the scope of protection of the present invention.

Example 1

This embodiment takes Tianhe District, Guangzhou as an example and provides a method for auxiliary planning and site selection of commercial sites based on open source data mining. The embodiment is described below with reference to the accompanying drawings.

Fig. 1 is a flow chart of a method for commercial site selection based on open source data mining according to embodiment 1 of the present invention.

S101, acquiring data of a target area through a multi-source data open platform.

The data comprise population density, shared bicycle travel, store rent, urban road traffic, commercial POI and land use data. The commercial POI data comprise catering, financial and shopping commercial POI data. The urban road traffic data comprise urban road data, bus station distance and subway station distance, and the urban road data comprise urban arterial road distance and urban secondary arterial road distance.

In one embodiment, for the data acquisition in step S101, the specific implementation method is as follows:

Population density data for Tianhe District, Guangzhou are obtained from the WorldPop website. Shared bicycle trip point data for Tianhe District, in Excel format, are the data for 16 September 2019 provided by the Mobike app. The store rent data, urban road data, urban bus station data, urban subway station data, catering commercial POI data, financial commercial POI data and shopping commercial POI data for Tianhe District are obtained from the Amap (Gaode Map) database. The current land use map of Tianhe District is obtained from documents published on government agency websites.

S102, carrying out grid division and numbering on the target area, and constructing an index system for clustering and site selection according to the acquired data.

In one embodiment, in step S102 the target area is divided into grids according to the administrative boundary of Tianhe District, the grids are numbered, and a clustering index system is constructed. The specific implementation is as follows:

The administrative boundary of Tianhe District, Guangzhou is extracted and imported into GIS software; 300 m × 300 m grid polygon features are created to cover the whole administrative boundary and clipped to the boundary, giving 1729 numbered grids. According to the factors to be considered in commercial site selection, the clustering indexes are divided into six categories (population factor, shared bicycle travel factor, store rent factor, comprehensive traffic factor, commercial aggregation factor and land use factor) comprising 10 indexes; the specific indexes are shown in Table 1 below:

TABLE 1 Cluster index table

S103, preprocessing the data of the target area.

In one embodiment, the population density, shared bicycle trip point, store rent, urban road traffic, commercial POI and land use data acquired for Tianhe District are preprocessed in step S103 as follows:

For the population factor, the population density raster data of Tianhe District are imported into GIS software, and the reclassification tool is used to divide the data into five classes based on the natural breaks classification method. The intervals are assigned the values 1, 2, 3, 4 and 5 from low to high to obtain the population reclassification raster of Tianhe District, which converts population density into an ordered categorical variable and facilitates the subsequent cluster analysis.

The natural breaks classification method identifies class intervals based on the natural groupings inherent in the data, so that similar values are grouped together as appropriately as possible and the differences between classes are maximized. The data are divided into several classes whose boundaries are set where the differences in data values are relatively large.

To evaluate a classification, the sum of squared deviations from the array mean (SDAM) is calculated first. For an array of values recorded as A_array, the mean is:

$$\bar{X}_{array} = \frac{1}{n}\sum_{i=1}^{n} X_i \tag{1}$$

and its sum of squared deviations from the mean (SDAM) is:

$$SDAM = \sum_{i=1}^{n}\left(X_i - \bar{X}_{array}\right)^2 \tag{2}$$

In formulas (1) and (2), n is the number of elements in the array and X_i is the value of the i-th element.

For each possible combination of class ranges, the sum of squared deviations from the class means (SDCM) is calculated, and the smallest value is found and recorded as SDCM_min. When the n elements are divided into k classes, each classification yields k subsets. For one such classification, [X_1, X_2, …, X_i], [X_{i+1}, X_{i+2}, …, X_j], …, [X_{j+1}, X_{j+2}, …, X_n], the sum of squared deviations of each subset, SDAM_i, SDAM_j, …, SDAM_n, is calculated, and these are summed to give SDCM_1:

$$SDCM_1 = SDAM_i + SDAM_j + \dots + SDAM_n \tag{3}$$

The same calculation is carried out for every other way of dividing the elements into k classes, giving SDCM_2, … in turn. The smallest of these values is selected as the final result SDCM_min and is verified by goodness of fit.

The goodness of variance fit gvf_i of each classification is calculated as:

$$gvf_i = 1 - \frac{SDCM_i}{SDAM} \tag{4}$$

gvf_i ranges from 1 (perfect fit) to 0 (poor fit); the higher the value, the larger the difference between classes. Testing shows that the classification obtained from SDCM_min has the largest gvf value, which supports the conclusion that the result of the natural breaks classification method is ideal.
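The search for the classification with the smallest SDCM can be illustrated with an exhaustive sketch. It enumerates every way of cutting the sorted values into k contiguous classes, which is only practical for very small arrays; GIS software uses a more efficient procedure. The goodness of variance fit is computed here as 1 - SDCM/SDAM, which is the usual definition and is assumed to be the one intended. The example uses three classes instead of five for brevity, and the function names are hypothetical.

```python
from itertools import combinations

def sdam(values):
    """Sum of squared deviations of `values` from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def natural_breaks(values, k):
    """Brute-force search for the k-class split with minimal SDCM.

    Returns the classes and the goodness of variance fit (GVF).
    Exponential in len(values): a sketch for tiny inputs only.
    """
    data = sorted(values)
    n = len(data)
    best_sdcm, best_classes = None, None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        classes = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        sdcm = sum(sdam(c) for c in classes)
        if best_sdcm is None or sdcm < best_sdcm:
            best_sdcm, best_classes = sdcm, classes
    gvf = 1 - best_sdcm / sdam(data)
    return best_classes, gvf

classes, gvf = natural_breaks([1, 2, 2, 3, 10, 11, 12, 30, 31, 33], 3)
print(classes)
print(round(gvf, 4))
```

The class index of each value (1 for the lowest class, up to k for the highest) is then the ordered value assigned during reclassification.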

For the shared bicycle travel factor, the weekend shared bicycle trip point data of Tianhe District, Guangzhou are imported into ArcGIS software, and a line tracking interval tool is used to perform line tracking analysis based on start and end points to obtain shared bicycle path line feature data. Commercial site selection is closely related to residents' travel, and weekend shared bicycle paths reflect to some extent the paths of residents' leisure travel; therefore, the higher the line density of shared bicycle paths, that is, the denser the paths, the higher the commercial value.

For the commercial aggregation factors, the catering, shopping and finance commercial POI points of Tianhe District are processed with the kernel density estimation method to obtain commercial aggregation distribution rasters for catering, shopping and finance respectively. Commercial aggregation is an important reference factor for commercial site selection. The higher the aggregation of, for example, catering businesses, the larger the corresponding customer flow, and the businesses form a virtuous competitive cycle with their surroundings; in general, therefore, the higher the aggregation of the relevant category, the more suitable the location is for commercial sites.

For the store rent factor, the store POI points are processed with the kernel density estimation method using the rent field as the value field, yielding a store rent factor evaluation map for Tianhe District. Store rent is of great significance for commercial site selection.

Kernel density estimation (KDE) estimates the density of a point or line pattern using a moving cell. Given sample points x_1, x_2, …, x_n, kernel estimation is used to model the detailed distribution of the attribute variable. For two-dimensional data, the dimension d is 2, and a commonly used kernel density estimation function is:

$$f(x, y) = \frac{1}{n h^{2}} \sum_{i=1}^{n} K\!\left(\frac{\sqrt{(x - x_i)^{2} + (y - y_i)^{2}}}{h}\right) \tag{5}$$

where K(·) is the kernel function, (x − x_i)² + (y − y_i)² is the squared distance between the points (x_i, y_i) and (x, y), h is the bandwidth, and n is the number of points in the study range.

In kernel density estimation, the bandwidth is a free parameter that determines the amount of smoothing; a bandwidth that is too large or too small distorts the result f(x). Adopting Silverman's rule of thumb, under the assumption that f(x) is normal, the optimal bandwidth calculation can be simplified, following the work of A. P. Ker and B. K. Goodwin, to:

$$h = \left(\frac{4\sigma^{5}}{3n}\right)^{1/5} \approx 1.06\,\sigma\,n^{-1/5} \tag{6}$$

where σ is the sample standard deviation.
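A small sketch of two-dimensional kernel density estimation with a rule-of-thumb bandwidth follows. Two assumptions are made that the text does not confirm: the kernel is taken to be the quartic (biweight) kernel normalized over the unit disc, and Silverman's rule is used in its common one-dimensional form h = 1.06 · σ · n^(-1/5) with σ the sample standard deviation. The function names are hypothetical.

```python
import math
import statistics

def silverman_bandwidth(samples):
    """Common 1-D form of Silverman's rule of thumb (assumed form)."""
    sigma = statistics.stdev(samples)
    return 1.06 * sigma * len(samples) ** (-1 / 5)

def quartic_kernel(u):
    """Quartic kernel for 2-D data; integrates to 1 over the unit disc
    and is zero for scaled distances u >= 1 (assumed kernel choice)."""
    return (3 / math.pi) * (1 - u * u) ** 2 if u < 1 else 0.0

def kde_2d(x, y, points, h, kernel=quartic_kernel):
    """Density at (x, y): (1 / (n h^2)) * sum of K(distance / h)."""
    n = len(points)
    total = sum(kernel(math.hypot(x - px, y - py) / h) for px, py in points)
    return total / (n * h * h)

# Three POIs close together and one isolated POI (coordinates in metres).
pois = [(0, 0), (100, 0), (0, 100), (1000, 1000)]
print(kde_2d(50, 50, pois, h=300) > kde_2d(600, 600, pois, h=300))
print(round(silverman_bandwidth([1, 2, 3, 4, 5]), 3))
```

To use a value field such as rent, each kernel term would be multiplied by that point's field value; the sketch above counts every point equally.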

After the catering, shopping and finance commercial aggregation distribution rasters and the store rent evaluation raster of Tianhe District are obtained, the reclassification tool is used to divide each of them into five classes based on the natural breaks classification method, and the intervals are assigned the values 1, 2, 3, 4 and 5 from low to high. This yields the catering, shopping and finance commercial aggregation reclassification rasters and the store rent reclassification raster of Tianhe District, converting them into ordered categorical variables and facilitating the subsequent cluster analysis.

For the traffic factors, the arterial road data, secondary arterial road data, urban subway station POI data and urban bus station POI data of Tianhe District are imported into ArcGIS software, and the multiple ring buffer tool is selected. Buffers are established at 25 m, 50 m, 75 m, 100 m and 125 m for urban arterial roads; at 20 m, 40 m, 60 m, 80 m and 100 m for urban secondary arterial roads; at 30 m, 60 m, 90 m, 120 m and 150 m for bus stations; and at 50 m, 100 m, 150 m, 200 m and 250 m for subway stations. Because commercial site selection is closely related to road traffic accessibility, the closer a commercial site is to public transport stations and arterial roads, the higher its accessibility and the more suitable the location. The buffers of each factor are therefore assigned the values 1, 2, 3, 4 and 5 from far to near, converting the factors into ordered categorical variables and facilitating the subsequent cluster analysis.
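The ring distances and the far-to-near scoring can be expressed compactly as a lookup. The sketch below assumes the distance from a location to the nearest feature of each kind is already known, treats ring boundaries as inclusive, and returns 0 for locations outside the outermost ring; the last two choices, the key names and the function name `buffer_score` are assumptions not stated in the text.

```python
from bisect import bisect_left

# Outer radius of each ring in metres, nearest ring first.
RINGS = {
    "arterial": [25, 50, 75, 100, 125],
    "secondary": [20, 40, 60, 80, 100],
    "bus": [30, 60, 90, 120, 150],
    "subway": [50, 100, 150, 200, 250],
}

def buffer_score(kind, distance_m, outside=0):
    """Score 5 for the nearest ring down to 1 for the farthest ring.

    `outside` is returned beyond the last ring (assumed value).
    """
    rings = RINGS[kind]
    ring = bisect_left(rings, distance_m)  # first ring whose radius >= distance
    if ring == len(rings):
        return outside
    return 5 - ring

print(buffer_score("arterial", 10))
print(buffer_score("subway", 120))
print(buffer_score("bus", 200))
```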

For the land use factor, the current land use map of Tianhe District is converted into vector data and imported into ArcGIS software. On the principle that the more closely a parcel is related to commercial use, the higher its land use classification value, commercial parcels are assigned the value 5 and non-commercial parcels the value 1, yielding the land use classification polygon features.

Thus, the preliminary processing of the data of each clustering factor is completed.

S104, respectively linking the divided grids according to the preprocessed data, and counting the value of each influence factor in the index system.

In one embodiment, in step S104 the preprocessed data are linked to the grids of Tianhe District and the value of each influencing factor is computed as follows:

The population reclassification raster of Tianhe District obtained in step S103 is converted into polygon features using the raster-to-polygon tool in GIS and spatially linked to the 300 m × 300 m grids; the link value of a single grid is the mean of the raster values within the grid, based on the corresponding ordered categorical variable, giving Fig. 2;

The weekend shared bicycle path line feature data of Tianhe District obtained in step S103 are spatially linked to the 300 m × 300 m grids; the link value of a single grid is the total path length within the grid, giving Fig. 3;

The store rent reclassification raster of Tianhe District obtained in step S103 is converted into polygon features using the raster-to-polygon tool in GIS and spatially linked to the 300 m × 300 m grids; the link value of a single grid is the mean of the raster values within the grid, based on the corresponding ordered categorical variable, giving Fig. 4;

The catering, finance and shopping commercial aggregation reclassification rasters of Tianhe District obtained in step S103 are each converted into polygon features using the raster-to-polygon tool in ArcGIS and spatially linked to the 300 m × 300 m grids of Tianhe District; the link value of a single grid is the mean of the raster values within the grid, based on the corresponding ordered categorical variable, giving Figs. 5, 6 and 7;

Using the buffers of the main roads, secondary roads and public transport stations of the Tianhe area obtained in step S103, a multi-factor weighted overlay analysis is applied to four factors (main road distance, secondary road distance, subway station distance and bus stop distance) to obtain a comprehensive traffic factor evaluation. The evaluation model is as follows:

$$S = \sum_{i=1}^{n} w_i x_i \tag{7}$$

In formula (7), S is the final comprehensive traffic factor evaluation, w_i is the weight of the i-th factor, and x_i is the i-th variable factor. With a main road distance weight of 0.3, a secondary road distance weight of 0.2, a bus stop distance weight of 0.2 and a subway station distance weight of 0.3, the comprehensive traffic factor evaluation of the Tianhe area is obtained. The evaluation raster is then converted into polygon features using the raster-to-polygon tool and spatially joined with the 300 m × 300 m grid of the Tianhe area; the join value of each grid cell is the mean of the raster values within the cell, giving Fig. 8;
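As a worked illustration of the weighted overlay, the sketch below assumes the evaluation is a plain weighted sum of the four distance factors, using the weights stated above. The factor names and the example cell scores are hypothetical.

```python
# Weights taken from the text; the dictionary keys are illustrative names.
WEIGHTS = {
    "main_road": 0.3,
    "secondary_road": 0.2,
    "bus_stop": 0.2,
    "subway_station": 0.3,
}


def traffic_score(factors, weights=WEIGHTS):
    """Weighted sum S = sum(w_i * x_i) over the four distance factors."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return sum(weights[name] * factors[name] for name in weights)


# Hypothetical reclassified scores for one raster cell.
cell = {"main_road": 5, "secondary_road": 3, "bus_stop": 4, "subway_station": 2}
score = traffic_score(cell)
print(round(score, 2))
```

For this cell the contributions are 1.5, 0.6, 0.8 and 0.6, so the score is 3.5.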

The land-use classification polygon features of the Tianhe area obtained in step S103 are spatially joined with the 300 m × 300 m grid; the join value of each grid cell is the mean of the feature values within the cell, giving Fig. 9;

S105, counting the value of each influence factor in the index system according to the number of the grid, and analyzing by using a two-step clustering algorithm.

All influencing factor values are tabulated by the grid cell numbers of the Tianhe area and analyzed with a two-step clustering algorithm; based on the clustering results, planning and site selection suggestions are given for businesses of different categories and scales in the Tianhe area.

In one embodiment, step S105 is implemented as follows:

The values of Fig. 2, Fig. 3, Fig. 4, Fig. 5, Fig. 6, Fig. 7, Fig. 8 and Fig. 9 obtained in step S104 are entered into a single Excel table by grid number; the table is then imported into SPSS, and the two-step clustering tool is selected for analysis.
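A minimal sketch of assembling the per-grid table, assuming each factor is available as a mapping from grid number to value. The text uses Excel and SPSS; the CSV output here is only a stand-in, and `build_table`, `to_csv` and the factor names are hypothetical.

```python
import csv
import io


def build_table(factor_maps):
    """factor_maps: {factor_name: {grid_id: value}} -> list of row dicts, one per grid cell."""
    grid_ids = sorted(set().union(*(m.keys() for m in factor_maps.values())))
    rows = []
    for gid in grid_ids:
        row = {"grid_id": gid}
        for name, values in factor_maps.items():
            row[name] = values.get(gid)  # None marks a cell with no value for this factor
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with grid_id as the first column."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = build_table({"rent": {1: 3, 2: 4}, "population": {1: 5}})
csv_text = to_csv(rows)
print(csv_text)
```

Keeping the grid number as the key column is what later allows the cluster labels to be joined back to the grid layer.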

The two-step clustering algorithm is divided into two stages:

Pre-clustering stage. Following the CF-tree growth idea of the BIRCH algorithm, the data points in the data set are read one by one, and points in dense regions are clustered in advance while the CF tree is built, forming many small sub-clusters.

Clustering stage. Taking the sub-clusters produced by the pre-clustering stage as objects, the sub-clusters are merged one by one using an agglomerative method until the desired number of clusters is reached.

Distance measures are used in both stages, mainly the Euclidean distance and the log-likelihood distance.
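The two stages can be illustrated with a deliberately simplified sketch. It is not the SPSS procedure: the CF tree is replaced by a flat list of sub-cluster summaries (count and linear sum), the distance is Euclidean between centroids, and the number of final clusters is passed in rather than estimated. The names `pre_cluster`, `merge_subclusters` and `threshold` are assumptions.

```python
import math


def pre_cluster(points, threshold):
    """Stage 1: one sequential pass; each sub-cluster is kept as a summary [n, linear_sum]."""
    subclusters = []
    for p in points:
        best, best_d = None, threshold
        for cf in subclusters:
            centroid = [s / cf[0] for s in cf[1]]
            d = math.dist(p, centroid)
            if d <= best_d:
                best, best_d = cf, d
        if best is None:
            subclusters.append([1, list(p)])  # start a new sub-cluster
        else:
            best[0] += 1                      # absorb the point into the nearest sub-cluster
            best[1] = [s + v for s, v in zip(best[1], p)]
    return subclusters


def merge_subclusters(subclusters, n_clusters):
    """Stage 2: agglomerative merging of the closest pair of centroids down to n_clusters."""
    clusters = [[n, list(ls)] for n, ls in subclusters]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = [s / clusters[i][0] for s in clusters[i][1]]
                cj = [s / clusters[j][0] for s in clusters[j][1]]
                d = math.dist(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = [clusters[i][0] + clusters[j][0],
                       [a + b for a, b in zip(clusters[i][1], clusters[j][1])]]
        del clusters[j]
    return [[s / n for s in ls] for n, ls in clusters]


points = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (6, 6), (0.1, 0.3)]
subs = pre_cluster(points, threshold=1.0)
centers = merge_subclusters(subs, n_clusters=2)
print(len(subs), centers)
```

Because stage 2 operates on sub-cluster summaries rather than individual points, its cost depends on the number of sub-clusters, which is the main reason for pre-clustering.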

The Euclidean distance is the distance between two class centers, where a class center is the mean of all variables in the class. Suppose a data set Q contains m samples, each with n variable indices. Then:

$$Q = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

In this matrix (which is not retained during the calculation), x_ij is the observation of the j-th variable of the i-th sample (1 ≤ i ≤ m; 1 ≤ j ≤ n), and the observation x_i = (x_i1, x_i2, ..., x_ik, ..., x_in) of each sample can be regarded as a point in n-dimensional space. Before clustering, k observations are selected (or set by the system) as the initial cluster centers, and each observation is assigned to the class whose center is nearest, forming the k classes of the first iteration. The mean of each variable is then computed from the observations in each class; the n means of each class define a point in n-dimensional space, and these k points are the class centers for the second iteration. Iteration continues in this way until the specified number of iterations is reached or the stopping criterion is met, at which point the iteration stops and clustering is complete.
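The assign-and-update iteration described above can be sketched as follows. This is a minimal illustration that assumes the initial centers are supplied by the caller and that iteration stops when the centers no longer change; `k_means` and `max_iter` are hypothetical names.

```python
import math


def k_means(points, centers, max_iter=100):
    """Assign each point to the nearest center, recompute centers as class means, repeat."""
    groups = []
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # A class that received no points keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # stopping criterion: centers no longer move
            break
        centers = new_centers
    return centers, groups


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, groups = k_means(points, centers=[points[0], points[2]])
print(centers)
```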

In this process, the Euclidean distance is denoted by d_ij; it is the square root of the squared Euclidean distance:

$$d_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \tag{8}$$

The log-likelihood distance can handle both continuous and categorical variables. It is a probability-based distance: the distance between two classes is related to the decrease in log-likelihood that occurs when the two classes are merged into one. When the log-likelihood is calculated, continuous variables are ideally assumed to follow a normal distribution, categorical variables a multinomial distribution, and all variables are assumed to be independent of each other. The distance between class j and class s is defined as d(j, s):

$$d(j,s) = \xi_j + \xi_s - \xi_{\langle j,s \rangle} \tag{9}$$
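To make the distance concrete, the sketch below evaluates it for continuous variables only. The text does not define the ξ term, so the code assumes the form commonly given in descriptions of the two-step procedure, ξ_v = −N_v Σ_k ½ log(σ̂_k² + σ̂_vk²), where σ̂_k² is the variance of variable k over all data and σ̂_vk² its variance within class v. The categorical (entropy) term is omitted, and all function names are hypothetical.

```python
import math


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def xi(cluster, global_var):
    """Assumed form: xi_v = -N_v * sum_k 0.5 * log(global_var_k + cluster_var_k)."""
    n = len(cluster)
    columns = list(zip(*cluster))
    return -n * sum(0.5 * math.log(g + variance(col)) for g, col in zip(global_var, columns))


def log_likelihood_distance(cj, cs, global_var):
    """d(j, s) = xi_j + xi_s - xi_<j,s>, where <j,s> is the class formed by merging j and s."""
    return xi(cj, global_var) + xi(cs, global_var) - xi(cj + cs, global_var)


a = [(0.0,), (1.0,)]
b = [(10.0,), (11.0,)]
global_var = [variance([p[0] for p in a + b])]
d_far = log_likelihood_distance(a, b, global_var)
d_same = log_likelihood_distance(a, a, global_var)
print(d_far, d_same)
```

Merging two well-separated classes inflates the within-class variance and gives a positive distance, whereas merging a class with an identical copy of itself gives zero.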

In this process, the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) is calculated for each candidate number of classes to obtain an initial estimate of the number of classes; the final number of clusters is then determined as the one that maximizes the distance between the two closest classes among the initial classes. Assuming the number of clusters is J, the calculation formulas are as follows:

Here N is the total number of observations, K^A is the total number of continuous variables used in the process, K^B is the total number of categorical variables used in the process, and L_k is the number of categories of the k-th categorical variable.
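For reference, descriptions of the two-step procedure commonly write the two criteria for J clusters in the form below. This is offered as an assumption consistent with the symbols defined above (with ξ_j as in formula (9)), not as a reproduction of the patent's own formulas; m_J is an auxiliary symbol for the number of free parameters.

```latex
\begin{align}
\mathrm{BIC}(J) &= -2 \sum_{j=1}^{J} \xi_j + m_J \log N \\
\mathrm{AIC}(J) &= -2 \sum_{j=1}^{J} \xi_j + 2\, m_J \\
m_J &= J \left( 2K^{A} + \sum_{k=1}^{K^{B}} (L_k - 1) \right)
\end{align}
```

Under this form, each cluster contributes two parameters (mean and variance) per continuous variable and L_k − 1 free category probabilities per categorical variable.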

Categorical variable selection: 7 items in total, namely the store rent factor, the dining commercial aggregation degree, the shopping commercial aggregation degree, the financial commercial aggregation degree, the traffic factor evaluation, the population density factor evaluation and the land-use factor evaluation.

Continuous variable selection: 1 item, namely the shared bicycle travel path length factor.

The analysis results are shown in Fig. 10, Fig. 11, Fig. 12 and Table 2 below. The model summary and the cluster quality are shown in Fig. 10 and Fig. 11, respectively. The clustering yields 6 classes, and the values of the individual factors for each class are listed in Table 2. A clustering result table is generated in SPSS, as shown in Fig. 12, in which each class number corresponds to the ID of a 300 m × 300 m grid cell; the table is re-imported into ArcGIS to obtain the final result, as shown in Fig. 13.
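A per-class summary of the kind reported for each class, together with the lookup used to join class numbers back to the grid, can be sketched as follows. The row layout, the column names and `class_profiles` are assumptions for illustration; the text produces these outputs with SPSS and ArcGIS, and the values below are made up.

```python
from collections import defaultdict


def class_profiles(rows, factors):
    """rows: dicts with 'grid_id', 'cluster' and factor values. Returns {cluster: {factor: mean}}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row)
    return {
        cluster: {f: sum(r[f] for r in members) / len(members) for f in factors}
        for cluster, members in grouped.items()
    }


# Hypothetical clustering output: one row per grid cell.
rows = [
    {"grid_id": 1, "cluster": 1, "rent": 4, "traffic": 5},
    {"grid_id": 2, "cluster": 1, "rent": 2, "traffic": 3},
    {"grid_id": 3, "cluster": 2, "rent": 1, "traffic": 1},
]
profiles = class_profiles(rows, ["rent", "traffic"])

# Lookup from grid ID to class number, used to join the result back to the grid layer.
cluster_by_grid = {r["grid_id"]: r["cluster"] for r in rows}
print(profiles, cluster_by_grid)
```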

Table 2. Values of the individual factors for each class

S106, according to analysis results of the two-step clustering algorithm, site selection suggestions of commercial sites with different categories and different scales are given.

From the spatial distribution of fig. 13 and the data of fig. 12, site selection recommendations for different categories, different sizes of commercial sites are shown in table 3 below:

Table 3. Site selection suggestions for commercial sites of different categories and scales

It should be noted that while the method operations of the above embodiments are described in a particular order, this does not require or imply that the operations must be performed in that particular order or that all of the illustrated operations be performed in order to achieve desirable results. Rather, the depicted steps may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, or one step decomposed into multiple steps to perform.

Example 2:

As shown in Fig. 14, this embodiment provides a commercial site selection system based on open source data mining. The system includes an acquisition data module 1401, a grid division module 1402, a data preprocessing module 1403, a statistics module 1404, a statistics and analysis module 1405 and a site selection suggestion module 1406. The specific functions of the modules are as follows:

The acquisition data module 1401 is configured to acquire data of a target area through a multi-source data open platform;

The grid division module 1402 is configured to divide the target area into grids, number the grids, and construct an index system for cluster-based site selection according to the acquired data;

The data preprocessing module 1403 is configured to preprocess the data of the target area;

The statistics module 1404 is configured to link the preprocessed data to the divided grids and count the value of each influencing factor in the index system;

The statistics and analysis module 1405 is configured to tabulate the value of each influencing factor in the index system by grid number and analyze it with a two-step clustering algorithm;

The site selection suggestion module 1406 is configured to give site selection suggestions for commercial sites of different categories and scales according to the analysis results of the two-step clustering algorithm.

For the specific implementation of each module in this embodiment, refer to embodiment 1 above; details are not repeated here. It should be noted that the system provided in this embodiment is illustrated only by the above division into functional modules; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.

Example 3:

The present embodiment provides a computer device, which may be a computer. As shown in Fig. 15, it includes a processor 1502, a memory, an input device 1503, a display 1504 and a network interface 1505 connected through a system bus 1501. The processor provides computing and control capabilities. The memory includes a nonvolatile storage medium 1506 and an internal memory 1507; the nonvolatile storage medium 1506 stores an operating system, a computer program and a database, and the internal memory 1507 provides an environment for running the operating system and the computer program stored in the nonvolatile storage medium. When the processor 1502 executes the computer program stored in the memory, the commercial site selection method of embodiment 1 above is implemented as follows:

acquiring data of a target area through a multi-source data open platform;

preprocessing the data of the target area;

Example 4:

the present embodiment provides a storage medium, which is a computer-readable storage medium storing a computer program that, when executed by a processor, implements the commercial site selection method of the above embodiment 1, as follows:

acquiring data of a target area through a multi-source data open platform;

preprocessing the data of the target area;

The computer readable storage medium of the present embodiment may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In summary, the method acquires multiple index data through the multi-source data open platform, counts the value of each influence factor in each index, counts the values of all influence factors into a table, analyzes the table by using a two-step clustering algorithm, and gives business site selection suggestions of different categories and different scales according to the analysis result of the clustering algorithm. The method can provide assistance and reference for planning and site selection of different types and different scales of commercial sites in cities.

The above examples only represent possible embodiments of the invention, which are described more specifically and in detail but are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims

1. A method for site selection of a commercial site based on open source data mining, the method comprising:

Acquiring data of a target area through a multi-source data open platform, wherein the data of the target area comprise population density, shared bicycle travel, store rentals, urban road traffic, commercial POIs and land utilization data; the commercial POI data comprise restaurant commercial POI, financial commercial POI and shopping commercial POI data;

preprocessing the data of the target area;

according to the analysis result of the two-step clustering algorithm, site selection suggestions of commercial sites with different categories and different scales are given;

the preprocessing the data of the target area specifically includes:

and converting the land use current situation map into vector data, and respectively assigning values to the commercial land parcels and the non-commercial land parcels to obtain land use classification surface elements.

2. The method for site selection of a commercial site according to claim 1, wherein the step of dividing the target area into grids, numbering the grids, and constructing an index system for cluster-based site selection according to the acquired data comprises the following steps:

3. The method for site selection of a commercial site according to claim 1, wherein the step of linking the preprocessed data to the divided grids and counting the value of each influencing factor in the index system comprises the following steps:

4. The method for site selection of a commercial site according to claim 3, characterized in that the multi-factor weighted overlay analysis is:

5. The method for site selection of a commercial site according to claim 3, wherein counting the value of each influencing factor in the index system according to the number of the grid and analyzing it with a two-step clustering algorithm specifically comprises:

Obtaining a final result diagram according to the clustering result table;

6. The method for locating a commercial site according to any one of claims 1 to 5, wherein the two-step clustering algorithm includes a pre-clustering stage and a clustering stage, wherein distance measures are used in both the pre-clustering stage and the clustering stage;

7. A commercial site selection system based on open source data mining, the system comprising:

The acquisition data module is used for acquiring data of a target area through the multi-source data open platform, wherein the data of the target area comprise population density, shared bicycle travel, store rentals, urban road traffic, commercial POIs and land utilization data; the commercial POI data comprise restaurant commercial POI, financial commercial POI and shopping commercial POI data;

the site selection suggestion module is used for providing site selection suggestions of the commercial sites with different categories and different scales according to the analysis results of the two-step clustering algorithm;

the preprocessing the data of the target area specifically includes:

8. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the commercial site selection method of any one of claims 1-6.

9. A storage medium storing a program which, when executed by a processor, implements the method of commercial site selection of any one of claims 1 to 6.