CN113205368B

CN113205368B - Industrial and commercial customer clustering method based on time sequence water consumption data

Info

Publication number: CN113205368B
Application number: CN202110569868.3A
Authority: CN
Inventors: 朱波; 穆利; 吴铭; 王亚琦; 陶鹏
Original assignee: Hefei Water Group Co ltd
Current assignee: Hefei Water Group Co ltd
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2022-11-29
Anticipated expiration: 2041-05-25
Also published as: CN113205368A

Abstract

The invention discloses a method for clustering industrial and commercial customers based on time-series water consumption data, comprising the following steps: 1. constructing daily water consumption data of industrial and commercial customers and preprocessing the data; 2. learning a representation of the time-series water consumption data with an LSTM model; 3. clustering the industrial and commercial customers by water use trend; 4. within each trend cluster, clustering the industrial and commercial customers again by water use range; 5. visually displaying the clustering result. Through the LSTM model, the invention learns the water use patterns and trend information hidden in the time-series water consumption data of industrial and commercial customers and uses them as the water use feature representation of each customer; combined with the kmeans algorithm, it then clusters the customers on the two factors of water use trend and water use range accurately and quickly.

Description

Industrial and commercial customer clustering method based on time sequence water consumption data

Technical Field

The invention relates to the technical field of user clustering, in particular to a time sequence data-based industrial and commercial customer clustering method.

Background

In existing research on user clustering, the kmeans algorithm performs well on static data. However, it measures the similarity between industrial and commercial customers with the Euclidean distance, which ignores the order of the time points: it captures only similarity in the amount of water used and cannot describe how water use characteristics change over time.

The water consumption data of industrial and commercial customers is time-series data: consumption is recorded at fixed intervals, such as daily, weekly or monthly, and the records hide latent time-series water use characteristics of many customers, such as the water use period and the water use pattern. The state of a sample at a given moment in time-series data is related to its state at the preceding and following moments, so analyzing the change rules in the data while taking those neighboring states into account is a difficult problem.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing a method for clustering industrial and commercial customers based on time-series data. An LSTM model learns and represents the water use patterns hidden in the customers' time-series water consumption data, and the kmeans algorithm then clusters the customers on two aspects, water use trend and water use range. This improves clustering accuracy and helps mine the change rules and trends hidden in the time-series water consumption data, so that the water use patterns of industrial and commercial customers are characterized accurately and completely.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention relates to a clustering method of industrial and commercial enterprises based on time sequence water consumption data, which is characterized by comprising the following steps:

step1, constructing daily water consumption data of industrial and commercial enterprises;

step 1.1, obtaining remote water meter data of industrial and commercial enterprises, and extracting industrial and commercial enterprise id, water meter updating time, accumulated water flow, industrial and commercial enterprise remote water meter address and industrial and commercial enterprise name in the remote water meter data;

step 1.2, carrying out longitude and latitude conversion on the industrial and commercial tenant remote water meter address to obtain longitude and latitude information of the industrial and commercial tenant;

step 1.3, dividing the remote water meter data according to the industrial and commercial customer id to obtain m parts of water meter data files named by the industrial and commercial customer id, and arranging all data in the water meter data files according to the sequence of water meter updating time; wherein m represents the total number of industrial and commercial businesses;

step 1.4, for each industrial and commercial customer and each day, taking the difference between the accumulated water flow at the last water meter update time of the day and the accumulated water flow at the first water meter update time of the day, thereby constructing the daily water consumption vectors of the m industrial and commercial customers over t days, $x^i = (x^i_1, x^i_2, \ldots, x^i_t)$, wherein $x^i_t$ represents the daily water consumption value of the ith industrial and commercial customer on day t, and t represents the number of water consumption days; the sample feature set formed by the t-day water consumption vectors of the m customers is recorded as $X = \{x^i \mid i = 1, 2, \ldots, m\}$;

step 1.5, detecting and processing abnormal values in the sample feature set X to obtain the sample feature set X' after abnormal-value processing;

step 1.6, processing the missing values of the sample feature set X' to obtain the sample feature set X'' after missing-value processing;

step2, representing the characteristics of time sequence water consumption data based on an LSTM model;

step 2.1, normalizing the sample feature set X'' obtained after missing-value processing to obtain the normalized sample feature set, recorded as $\tilde{X} = \{\tilde{x}^i \mid i = 1, 2, \ldots, m\}$, wherein $\tilde{x}^i = (\tilde{x}^i_1, \tilde{x}^i_2, \ldots, \tilde{x}^i_t)$ represents the normalized daily water consumption vector of the ith industrial and commercial customer over t days, and $\tilde{x}^i_t$ represents the normalized daily water consumption value of the ith industrial and commercial customer on day t;

step 2.2, pre-training an LSTM model;

the normalized sample feature set is divided into a training set and a verification set, and the epoch value, the batch-size value and the prediction step size of the LSTM model training are determined;

the training set is input into the LSTM model to obtain a prediction sequence for the verification set, and the root-mean-square error between the prediction sequence output by the LSTM model and the verification set is calculated, which completes one round of training of the LSTM model; training stops when the number of training rounds reaches the epoch value, and the trained LSTM model is used as the customer time-series water use feature extraction model;

step 2.3, inputting the daily water consumption data of all industrial and commercial customers into the customer time-series water use feature extraction model, and outputting the water use feature vectors $Y = \{y^i \mid i = 1, 2, \ldots, m\}$ of the customers, wherein $y^i = (y^i_1, y^i_2, \ldots, y^i_n)$ represents the water use feature vector of the ith industrial and commercial customer, $y^i_n$ represents the nth-dimension feature value of the ith industrial and commercial customer, and n represents the dimension of the water use feature vector;

step3, applying the kmeans clustering algorithm to the water use feature vectors $Y = \{y^i \mid i = 1, 2, \ldots, m\}$ to cluster the industrial and commercial customers by water use trend;

step 3.1, determining the optimal number of clusters by combining the elbow method and the silhouette coefficient method, and recording it as K;

step 3.2, based on the optimal number of clusters K, taking the water use feature vector $y^i$ of the ith industrial and commercial customer as a sample to be measured and inputting it into the kmeans algorithm, so that the customers represented in Y are grouped into K clusters, and randomly initializing the coordinates of the K cluster centers using formula (1):

$$\mu_k = (\mu_{k,1}, \mu_{k,2}, \ldots, \mu_{k,n}), \quad k = 1, 2, \ldots, K \tag{1}$$

in formula (1), $\mu_k$ represents the center of the kth cluster, and $\mu_{k,n}$ represents the coordinate value of the nth dimension of the kth cluster center;

step 3.3, calculating the Euclidean distance $d(y^i, \mu_k)$ from the sample to be measured $y^i$ to the kth cluster center $\mu_k$ using formula (2), thereby obtaining the Euclidean distance from $y^i$ to the center of each cluster:

$$d(y^i, \mu_k) = \sqrt{\sum_{l=1}^{n} \left( y^i_l - \mu_{k,l} \right)^2} \tag{2}$$

step 3.4, according to the Euclidean distances from the sample to be measured $y^i$ to the centers of the clusters, assigning $y^i$ to the cluster with the shortest Euclidean distance;

step 3.5, after all samples to be measured have been assigned to their clusters, K classes are obtained, and the set of industrial and commercial customers in the kth class is obtained as $C_k = \{y^{k,j} \mid j = 1, 2, \ldots, S_k\}$, wherein $y^{k,j} = (y^{k,j}_1, y^{k,j}_2, \ldots, y^{k,j}_n)$ represents the feature vector of the jth industrial and commercial customer in the kth class, $S_k$ represents the number of customers in the kth class, and $y^{k,j}_n$ represents the nth-dimension feature value of the jth customer in the kth class;

calculating the mean of the feature vectors of the customers in the kth class using formula (6), thereby obtaining the updated cluster center $\mu'_k = (\mu'_{k,1}, \mu'_{k,2}, \ldots, \mu'_{k,n})$, and assigning it to $\mu_k$, $k = 1, 2, \ldots, K$:

$$\mu'_{k,l} = \frac{1}{S_k} \sum_{j=1}^{S_k} y^{k,j}_l, \quad l = 1, 2, \ldots, n \tag{6}$$

in formula (6), $\mu'_{k,n}$ represents the coordinate value of the nth dimension of the updated kth cluster center;

Step 3.6, repeating the steps 3.3 to 3.5 until the cluster center is not changed any more, outputting the final cluster center and the industrial business id in each class, and accordingly grouping the industrial businesses with similar water use trends into one class;

step4, clustering industrial and commercial businesses based on the water consumption range;

step 4.1, based on the result of the water use trend clustering of the industrial and commercial customers, obtaining the set of customers in each class, $B_k = \{b^j \mid j = 1, 2, \ldots, S'_k\}$, wherein $b^j$ represents the jth industrial and commercial customer in the kth class, $k \in \{1, 2, \ldots, K\}$, and $S'_k$ represents the number of customers in the kth class when the clustering algorithm converges;

step 4.2, obtaining the normalized real daily water consumption values of the jth customer $b^j$ in the kth class, recorded as $\tilde{x}^{k,j} = (\tilde{x}^{k,j}_1, \tilde{x}^{k,j}_2, \ldots, \tilde{x}^{k,j}_t)$, wherein $\tilde{x}^{k,j}_t$ represents the normalized real daily water consumption value of $b^j$ on day t, $j = 1, 2, \ldots, S'_k$; repeating the process of step 3.1 to step 3.6 to re-cluster the real daily water consumption values of the customers within each class, so that customers with similar water use trends and similar water consumption are grouped into one class;

step5, visualizing a clustering result;

step 5.1, calculating the average value of the daily water consumption of the industrial and commercial customers in each class by taking the average value vector of the daily water consumption vectors of all the industrial and commercial customers in each class in the clustering result of the step4 as a class center, and respectively classifying and visualizing the water consumption condition, the class center and the daily water consumption average value of the industrial and commercial customers in each class by drawing a two-dimensional coordinate system;

step 5.2, acquiring the class centers in all the clusters calculated in the step 5.1, and simultaneously visualizing the K class centers by drawing a two-dimensional coordinate system;

step 5.3, obtaining the industrial and commercial customer ids in each class of the clustering result of step 4.2, and drawing a map from the longitude and latitude information of the customers, so as to visualize the geographical position of the customers in each class.

Compared with the prior art, the invention has the beneficial effects that:

1. according to the invention, the difference processing is carried out on the accumulated flow data, and the data completion is carried out on the missing value and the abnormal value by adopting the adjacent data, so that the daily water consumption data of each industrial and commercial company is constructed, thereby completing the pretreatment process of the time-series water consumption data, greatly improving the quality of a data mining mode, and being beneficial to improving the efficiency of actual data mining.

2. The method constructs and trains an LSTM model, learns rich water consumption patterns hidden in the time sequence water consumption data of the industrial and commercial enterprises, and represents the time sequence data as a static feature vector with specified dimensionality, so that the feature representation of the water consumption trend of the industrial and commercial enterprises is realized, the water consumption change rule of the industrial and commercial enterprises can be accurately described, and the accuracy of a subsequent clustering algorithm is improved;

3. the method combines a kmeans algorithm to carry out similarity calculation on the water use characteristic vectors of the industrial and commercial customers represented by the LSTM model, thereby completing the clustering of the industrial and commercial customers based on the water use trend, clustering the industrial and commercial customers with similar water use trends into one class, and being beneficial to mining and analyzing different water use change rules presented in different clustering results;

4. the invention aims at industrial and commercial enterprises with similar water use trends, and uses the kmeans algorithm again based on the real daily water use data to finish the clustering of the industrial and commercial enterprises based on the aspect of the water use range, thereby clustering the industrial and commercial enterprises with similar water use trends and water use ranges into a class, being beneficial to analyzing and comparing the difference of the water use ranges of different industrial and commercial enterprises in the clustering results with similar water use trends, and further accurately and completely depicting the water use modes of the industrial and commercial enterprises.

Drawings

FIG. 1 is a process flow diagram for industrial and commercial customer clustering in accordance with the present invention;

FIG. 2 is a diagram of the single cell state structure of the long short term memory model (LSTM) of the present invention;

FIG. 3 is a flow chart of the kmeans clustering algorithm of the present invention.

Detailed Description

In this embodiment, a method for clustering industrial and commercial businesses based on time-series water consumption data, specifically, as shown in fig. 1, is performed according to the following steps:

step1, building daily water consumption data of industrial and commercial businesses;

step 1.1, obtaining remote water meter data of industrial and commercial customers, and extracting industrial and commercial customer id, water meter updating time, accumulated water flow, industrial and commercial customer remote water meter address and industrial and commercial customer name in the remote water meter data;

in a specific implementation, the remote water meter data records the accumulated water flow of all industrial and commercial customers over the 366 days from 2020-01-01 to 2020-12-31, with each water meter updating its accumulated flow every hour on the hour. The raw data is stored unordered in one large csv file, so the required columns (customer id, water meter update time, accumulated water flow, remote water meter address and customer name) are extracted first to reduce the memory footprint of the file;

step 1.2, carrying out longitude and latitude conversion on the remote water meter address of the industrial and commercial tenant to obtain longitude and latitude information of the industrial and commercial tenant;

in this embodiment, the address of each customer is converted into longitude and latitude information by calling the high-resolution map API, in the following steps:

Step1, obtaining the URL of the address lookup on the high-resolution map and filling the address into the keyword parameter of the URL, thereby obtaining the request URL for that address.

Step2, sending a request to the URL: the page content corresponding to the URL is obtained with requests.get(url).text in Python, which returns it as string data.

Step3, since the page data obtained in Step2 is returned in json format, parsing it with the json.loads() method in Python, which converts it into dictionary data.

Step4, extracting the longitude and latitude information from the dictionary obtained in Step3 by its keys and values.
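The four conversion steps can be sketched as follows. This is a minimal illustration, not the embodiment's actual code: the endpoint `https://example.com/geocode`, the `address` and `key` parameter names, and the `results[0]["location"]` response layout are hypothetical placeholders that would have to be replaced with the real interface of the map provider, and the sample response is made up.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and response layout, for illustration only.
GEOCODE_URL = "https://example.com/geocode"


def build_geocode_url(address, api_key):
    """Step1: place the address in the query string of the request URL."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})


def parse_location(page_text):
    """Step3 and Step4: parse the JSON text and extract (longitude, latitude)."""
    data = json.loads(page_text)           # JSON string -> dict
    results = data.get("results", [])      # assumed key holding the matches
    if not results:
        return None                        # address could not be resolved
    lon, lat = results[0]["location"].split(",")
    return float(lon), float(lat)


def geocode(address, api_key):
    """Step2: fetch the page as text, then hand it to the parser."""
    import requests  # third-party package, imported lazily
    page_text = requests.get(build_geocode_url(address, api_key), timeout=10).text
    return parse_location(page_text)


# Made-up response in the assumed layout; no network access is needed here.
sample = '{"results": [{"location": "117.227,31.820"}]}'
print(parse_location(sample))
```

Keeping the parser separate from the network call makes it possible to check the extraction logic on saved responses before querying a rate-limited service for thousands of addresses.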

Step 1.3, remote water meter data are divided according to industrial and commercial customer id to obtain m water meter data files named by the industrial and commercial customer id, and all data in the water meter data files are arranged according to the sequence of water meter updating time; wherein m represents the total number of industrial and commercial businesses;

in a specific implementation, the data remaining after the key columns are extracted is still very large, which would greatly reduce processing efficiency. The large csv file is therefore split by the customer id column, and each resulting file is named after the id of its customer, which yields a separate remote water meter record file for each industrial and commercial customer.

Step 1.4, for each industrial and commercial customer and each day, taking the difference between the accumulated water flow at the last water meter update time of the day and the accumulated water flow at the first water meter update time of the day, thereby constructing the daily water consumption vectors of the m industrial and commercial customers over t days, $x^i = (x^i_1, x^i_2, \ldots, x^i_t)$, wherein $x^i_t$ represents the daily water consumption value of the ith industrial and commercial customer on day t, and t represents the number of water consumption days; the sample feature set formed by the t-day water consumption vectors of the m customers is recorded as $X = \{x^i \mid i = 1, 2, \ldots, m\}$;

In this embodiment, there are 2158 industrial and commercial customers in total with complete water records for the 366 days from 2020-01-01 to 2020-12-31. Assuming that the accumulated flow of customer A at 00:00 on January 1, 2020 is x and its accumulated flow at 24:00 on January 1, 2020 is y, the daily water consumption value of customer A on January 1, 2020 is (y - x); the daily water consumption data of all customers for the 366 days is calculated in the same way.

because the remote water meter sometimes fails to record the accumulated flow at the first moment or at the last moment of a day, abnormal values occur when the daily water consumption is calculated. Such abnormal values are replaced with a special value (null);

in a specific implementation, an abnormal remote water meter record can leave the accumulated flow of a customer at 00:00 or at 24:00 equal to 0, so that a correct daily water consumption value cannot be calculated. A conditional check is therefore applied when the daily water consumption is calculated: if the accumulated flow at 00:00 == 0 or the accumulated flow at 24:00 == 0, the daily water consumption is first set to null; otherwise, daily water consumption = accumulated flow at 24:00 - accumulated flow at 00:00;

the remote water meter records may also lack every water consumption record of a customer for a given day, so that the daily water consumption of that day cannot be calculated. The missing daily water consumption is likewise assigned a special value (null);

in a specific implementation, an abnormal remote water meter record can cause all data of a customer for a given day to be lost, so that the daily water consumption of the missing day cannot be calculated. A conditional check is therefore applied: if the record for (xx year-xx month-xx day) is null, the daily water consumption of the missing day is set to null;
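The difference calculation and the two null rules can be sketched together. The function name `daily_consumption`, the `(timestamp, accumulated_flow)` input layout and the timestamp format are assumptions made for illustration; the sketch uses the first and last readings of each day, as in step 1.4, and represents the null value with `None`.

```python
from collections import defaultdict
from datetime import datetime


def daily_consumption(readings, days):
    """Return one daily value per entry of `days`, or None where it is unknown.

    readings: list of (timestamp string "YYYY-MM-DD HH:MM", accumulated flow)
    days:     list of "YYYY-MM-DD" strings to report on, in order
    """
    by_day = defaultdict(list)
    for stamp, flow in readings:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        by_day[moment.strftime("%Y-%m-%d")].append((moment, flow))

    result = []
    for day in days:
        rows = sorted(by_day.get(day, []))   # order by update time
        if not rows:
            result.append(None)              # whole day missing -> null
            continue
        first, last = rows[0][1], rows[-1][1]
        if first == 0 or last == 0:
            result.append(None)              # abnormal reading -> null
        else:
            result.append(last - first)      # last minus first reading
    return result


readings = [
    ("2020-01-01 00:00", 100.0), ("2020-01-01 23:00", 112.5),
    ("2020-01-02 00:00", 0), ("2020-01-02 23:00", 120.0),
]
print(daily_consumption(readings, ["2020-01-01", "2020-01-02", "2020-01-03"]))
```

Deferring the fill to a later pass, rather than guessing a value here, keeps abnormal and missing days distinguishable from genuine zero-consumption days.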

in this embodiment, the null values can be filled with the fillna() method of pandas in Python. Industrial and commercial customers are generally large water users whose water consumption is relatively stable, so the daily water consumption values of adjacent days are similar. The method can therefore be called as fillna(method="ffill", axis=1), which fills a null value with the previous (column) value in the same row, or as fillna(method="backfill", axis=1), which fills it with the next (column) value in the same row;
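The effect of the two fill directions can be shown without pandas. The helpers below are a plain-Python sketch of the same idea applied to one customer's row of daily values, with `None` standing for the null value; chaining a forward fill with a backward fill, so that a null on the first day is also covered, is an assumption of this sketch rather than something the embodiment states.

```python
def forward_fill(values):
    """Replace each None with the closest earlier non-null value, if any."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def backward_fill(values):
    """Replace each None with the closest later non-null value, if any."""
    return forward_fill(values[::-1])[::-1]


def fill_nulls(values):
    """Forward fill first, then backward fill whatever is still null."""
    return backward_fill(forward_fill(values))


row = [None, 5.0, None, 7.0, None]
print(forward_fill(row))
print(fill_nulls(row))
```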

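Step 2.1, normalizing the sample feature set X'' obtained after missing-value processing to obtain the normalized sample feature set, recorded as $\tilde{X} = \{\tilde{x}^i \mid i = 1, 2, \ldots, m\}$, wherein $\tilde{x}^i = (\tilde{x}^i_1, \tilde{x}^i_2, \ldots, \tilde{x}^i_t)$ represents the normalized daily water consumption vector of the ith industrial and commercial customer over t days, and $\tilde{x}^i_t$ represents the normalized daily water consumption value of the ith industrial and commercial customer on day t;

the formula of the normalization processing is as follows:

$$\tilde{x} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

in formula (1), $\tilde{x}$ is the normalized value of the variable, $x$ is the original value of the variable, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values in the raw data, respectively;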
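A minimal sketch of this min-max normalization is given below. It assumes that the maximum and minimum are taken over one customer's own series, which the text does not specify, and it maps a constant series to zeros to avoid dividing by zero, which is also an assumption.

```python
def min_max_normalize(values):
    """Scale a series to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant series: the formula would divide by zero (assumed handling).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([10.0, 20.0, 15.0]))
```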

step 2.2, pre-training an LSTM model;

the normalized sample feature set is divided into a training set and a verification set, and the epoch value, the batch-size value and the prediction step size of the LSTM model training are determined;

the training set is input into the LSTM model to obtain a prediction sequence for the verification set, and the root-mean-square error between the prediction sequence output by the LSTM model and the verification set is calculated, which completes one round of training of the model; training stops when the number of training rounds reaches the predetermined epoch value, and the trained LSTM model is used as the customer time-series water use feature extraction model;
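The split and the error measure can be sketched as below. The text does not say how the set is divided, so the chronological split and the 0.8 ratio in `split_series` are assumptions; `rmse` is the usual root-mean-square error between a predicted and an observed sequence.

```python
import math


def split_series(series, train_ratio=0.8):
    """Split a series chronologically into a training and a verification part."""
    cut = int(len(series) * train_ratio)
    return series[:cut], series[cut:]


def rmse(predicted, actual):
    """Root-mean-square error between two sequences of equal length."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


train, verify = split_series(list(range(10)))
print(train, verify)
print(rmse([3.0, 3.0], [0.0, 0.0]))
```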

in specific implementation, the structure diagram of the single cell state of the LSTM model is shown in fig. 2, and the pre-training LSTM model comprises the following steps:

step1: training the forget gate, whose process is expressed as:

$$f_t = \sigma(W_{ft} x_t + W_{fh} h_{t-1} + b_f)$$

in the formula, $x_t$ is the input sample, $f_t$ is the forget gate value, $\sigma(\cdot)$ represents the activation function, for which sigmoid is used, $W_{ft}$ and $W_{fh}$ are the weight coefficients between the forget gate and $x_t$ and $h_{t-1}$ respectively, $h_{t-1}$ represents the hidden state at time t-1, and $b_f$ is the forget gate bias coefficient;

step2: training the input gate, whose process is expressed as:

$$g_t = \sigma(W_{gt} x_t + W_{gh} h_{t-1} + b_g)$$

in the formula, $g_t$ represents the input gate value, $W_{gt}$ and $W_{gh}$ are the weight coefficients between the input gate and $x_t$ and $h_{t-1}$ respectively, and $b_g$ is the input gate bias coefficient;

step3: updating the memory cell, whose process is expressed as:

$$s_t = f_t s_{t-1} + g_t \tanh(W_{st} x_t + W_{sh} h_{t-1} + b_s)$$

in the formula, $s_t$ is the cell state, $W_{st}$ and $W_{sh}$ are the weight coefficients between the cell and $x_t$ and $h_{t-1}$ respectively, and $b_s$ is the bias coefficient of the cell;

step4: updating the current state of the output gate, wherein the activation function is a tanh function;

step5: and repeating step1 to step4 until the model converges.
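One forward pass through the cell described in step1 to step4 can be sketched with scalar weights. The forget gate, input gate and cell update follow the three formulas above. The text gives no formula for the output gate, so the lines computing `o_t` and `h_t` use the standard LSTM form and are an assumption, as are the parameter names `W_ot`, `W_oh` and `b_o`; training of the weights is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, s_prev, p):
    """One time step of a scalar LSTM cell; p maps parameter names to floats."""
    f_t = sigmoid(p["W_ft"] * x_t + p["W_fh"] * h_prev + p["b_f"])   # forget gate
    g_t = sigmoid(p["W_gt"] * x_t + p["W_gh"] * h_prev + p["b_g"])   # input gate
    candidate = math.tanh(p["W_st"] * x_t + p["W_sh"] * h_prev + p["b_s"])
    s_t = f_t * s_prev + g_t * candidate                             # cell state
    # Assumed standard output gate; not spelled out in the text.
    o_t = sigmoid(p["W_ot"] * x_t + p["W_oh"] * h_prev + p["b_o"])
    h_t = o_t * math.tanh(s_t)                                       # hidden state
    return h_t, s_t


def encode(series, p):
    """Run the cell over a whole series and return the final hidden state."""
    h, s = 0.0, 0.0
    for x in series:
        h, s = lstm_step(x, h, s, p)
    return h


names = ["W_ft", "W_fh", "b_f", "W_gt", "W_gh", "b_g",
         "W_st", "W_sh", "b_s", "W_ot", "W_oh", "b_o"]
zero_params = {name: 0.0 for name in names}
print(lstm_step(1.0, 0.0, 2.0, zero_params))
```

In a real model the weights are matrices and the hidden state is a vector; a hidden size of 64 would correspond to the 64-dimensional feature vector mentioned in step 2.3.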

Step 2.3, inputting the daily water consumption data of all industrial and commercial customers into the customer time-series water use feature extraction model, and outputting the water use feature vectors $Y = \{y^i \mid i = 1, 2, \ldots, m\}$ of the customers, wherein $y^i = (y^i_1, y^i_2, \ldots, y^i_n)$ represents the water use feature vector of the ith industrial and commercial customer, $y^i_n$ represents the nth-dimension feature value of the ith industrial and commercial customer, and n represents the dimension of the water use feature vector; in one embodiment, the trained model is saved in a test.h5 file, which stores information such as the parameters of the finally determined LSTM model; the dimension of the model output vector is set to 64, that is, the 366-dimensional original daily water consumption data is converted into a 64-dimensional feature vector that serves as the feature of each industrial and commercial customer.

Step3, applying the kmeans clustering algorithm to the water use feature vectors $Y = \{y^i \mid i = 1, 2, \ldots, m\}$ to cluster the industrial and commercial customers by water use trend, as shown in fig. 3;

step 3.1, determining the optimal number of clusters by combining the elbow method and the silhouette coefficient method, and recording it as K; the formula of the elbow method is as follows:

$$SSE = \sum_{i=1}^{K} \sum_{p \in C_i} \lVert p - m_i \rVert^2 \tag{2}$$

in formula (2), K is the number of clusters, $C_i$ denotes the ith cluster, p is a sample point in $C_i$, $m_i$ is the centroid of $C_i$ (the mean of all samples in $C_i$), and SSE is the sum of squared clustering errors over all samples;

as the number of clusters k increases, the samples are divided more finely and the degree of aggregation of each cluster increases, so the sum of squared errors SSE decreases; when k is smaller than the real number of clusters, increasing k greatly increases the aggregation of each cluster, so SSE drops sharply; once k reaches the real number of clusters, the drop in SSE shrinks abruptly and the curve flattens as k continues to increase. The plot of SSE against k therefore has the shape of an elbow, and the k value at the elbow is the real number of clusters of the data;
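The SSE of a given partition can be computed directly from its definition. In this sketch a partition is passed as a list of clusters, each cluster being a list of points; that input layout and the function names are illustrative choices.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    count = len(points)
    return [sum(p[d] for p in points) / count for d in range(len(points[0]))]


def sse(clusters):
    """Sum over all clusters of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        m = centroid(points)
        total += sum(sum((x - c) ** 2 for x, c in zip(p, m)) for p in points)
    return total


two_clusters = [[[0.0, 0.0], [2.0, 0.0]], [[10.0, 10.0]]]
one_cluster = [[[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]]
print(sse(two_clusters), sse(one_cluster))
```

Evaluating `sse` on the kmeans result for each candidate k and plotting the values gives the elbow curve described above.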

next, the formula of the silhouette coefficient method is as follows:

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{3}$$

in formula (3), S(i) represents the silhouette coefficient of the ith sample; a(i) is the intra-cluster dissimilarity, the mean distance from the ith sample to the other samples in its own cluster; b(i) is the inter-cluster dissimilarity, the minimum over all other clusters of the mean distance from the ith sample to the samples of that cluster; the mean of S(i) over all samples is called the silhouette coefficient of the clustering result;

S(i) ∈ [-1, 1]; the closer the silhouette coefficient is to 1, the more reasonably sample i is clustered, so the value of k with the larger silhouette coefficient is selected;
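A direct, unoptimized sketch of the silhouette coefficient follows. Scoring a sample as 0 when it is alone in its cluster, when there is no other cluster, or when both dissimilarities are zero is a common convention adopted here as an assumption; the text does not cover these cases.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean of S(i) = (b(i) - a(i)) / max(a(i), b(i)) over all samples."""
    scores = []
    for i, p in enumerate(points):
        dists = {}                                  # label -> distances from p
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(euclidean(p, q))
        own = dists.pop(labels[i], [])
        if not own or not dists:
            scores.append(0.0)                      # assumed edge-case convention
            continue
        a = sum(own) / len(own)                     # intra-cluster dissimilarity
        b = min(sum(d) / len(d) for d in dists.values())  # nearest other cluster
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

The well-separated labeling scores close to 1 and the interleaved labeling scores below 0, which is the behavior the selection rule relies on.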

in this embodiment, SSE and the silhouette coefficient can be calculated for each K value and plotted against K at the same time; the optimal K value is then determined by combining the elbow of the SSE curve with a local optimum of the silhouette coefficient curve, so that the similarity within each class of the clustering result is as high as possible and the similarity between classes is as low as possible;

step 3.2, based on the optimal number of clusters K, taking the water use feature vector $y^i$ of the ith industrial and commercial customer as a sample to be measured and inputting it into the kmeans algorithm, so that the customers represented in Y are grouped into K clusters, and randomly initializing the coordinates of the K cluster centers using formula (1):

$$\mu_k = (\mu_{k,1}, \mu_{k,2}, \ldots, \mu_{k,n}), \quad k = 1, 2, \ldots, K \tag{1}$$

in formula (1), $\mu_k$ represents the center of the kth cluster, and $\mu_{k,n}$ represents the coordinate value of the nth dimension of the kth cluster center;

step 3.3, calculating the Euclidean distance from the sample to be measured $y^i$ to the kth cluster center $\mu_k$, thereby obtaining the Euclidean distance from $y^i$ to the center of each cluster:

$$d(y^i, \mu_k) = \sqrt{\sum_{l=1}^{n} \left( y^i_l - \mu_{k,l} \right)^2}$$

step 3.4, according to the Euclidean distances from the sample to be measured $y^i$ to the centers of the clusters, assigning $y^i$ to the cluster with the shortest Euclidean distance;

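step 3.5, after all samples to be measured have been assigned to their clusters, K classes are obtained, and the set of industrial and commercial customers in the kth class is obtained as $C_k = \{y^{k,j} \mid j = 1, 2, \ldots, S_k\}$, wherein $y^{k,j} = (y^{k,j}_1, y^{k,j}_2, \ldots, y^{k,j}_n)$ represents the feature vector of the jth industrial and commercial customer in the kth class, and $S_k$ represents the number of customers in the kth class;

calculating the mean of the feature vectors of the customers in the kth class using formula (6), thereby obtaining the updated cluster center $\mu'_k = (\mu'_{k,1}, \mu'_{k,2}, \ldots, \mu'_{k,n})$, and assigning it to $\mu_k$, $k = 1, 2, \ldots, K$:

$$\mu'_{k,l} = \frac{1}{S_k} \sum_{j=1}^{S_k} y^{k,j}_l, \quad l = 1, 2, \ldots, n \tag{6}$$

in formula (6), $\mu'_{k,n}$ represents the coordinate value of the nth dimension of the updated kth cluster center;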

in a specific implementation, step3 starts from the water use trend feature vectors of the industrial and commercial customers that the long short-term memory model learned and output in step2, and clusters the customers with the kmeans algorithm using the Euclidean distance as the similarity measure. By this point the LSTM network has learned what to store, discard and read in the long-term state of each customer's time-series water consumption sequence, and has thereby captured the long-term water use trend in the data, so running the kmeans algorithm on the 64-dimensional feature vectors output by the model clusters the customers by water use trend;
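The loop of steps 3.2 to 3.6 can be sketched in plain Python as follows. Drawing the initial centers from the samples themselves, keeping the old center when a cluster becomes empty, and the `max_iter` cap are assumptions of this sketch; the text only states that the centers are initialized randomly and that iteration stops when they no longer change.

```python
import math
import random


def kmeans(samples, k, seed=0, max_iter=100):
    """Cluster samples into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Step 3.2: random initial centers, here picked among the samples.
    centers = [list(s) for s in rng.sample(samples, k)]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Steps 3.3 and 3.4: assign each sample to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(s, centers[c]))
                  for s in samples]
        # Step 3.5: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])     # empty cluster: keep old center
        # Step 3.6: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(samples, k=2)
print(labels, sorted(centers))
```

In the method itself the samples would be the 64-dimensional feature vectors produced by the LSTM model rather than these toy points.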

step4, clustering industrial and commercial customers based on the range of water consumption;

step 4.1, acquiring the industrial and commercial businesses in each class based on the result of the water use trend clustering of the industrial and commercial businessesSet of (A) is B _k ＝{b ^j |j＝1,2,...,S′ _k In which b is ^j Represents the jth industrial business in the kth class, K ∈ {1, 2., K }, S' _k Representing the number of industrial and commercial businesses in the kth class when the clustering algorithm converges;

step 4.2, obtaining the normalized real daily water consumption value on day t of the jth industrial and commercial customer $b^j$ in the kth class, recorded as

which represents the normalized real daily water consumption value on day t of the jth industrial and commercial customer $b^j$ in the kth class, $j = 1, 2, \ldots, S'_k$; repeating the process from step 3.1 to step 3.6 to re-cluster the real daily water consumption values of the industrial and commercial customers in each class, so that customers with similar water use trends and similar water consumption are clustered into one class;

in this embodiment, the clustering result of step 3 places many industrial and commercial customers whose water use trends are very similar but whose water use ranges differ widely in the same class, whereas the aim is to group customers with similar water use trends and only slightly different water use ranges. Therefore, once the trend-based clustering of step 3 is complete, the kmeans algorithm is applied again within each class, this time to the original daily water use data of the customers in that class. The water use pattern in the time-series data no longer needs to be captured at this stage, so the sensitivity of the kmeans algorithm to numerical magnitude is exploited to separate customers by the size of their water use, and each trend-based class is re-clustered by water use range. During this clustering the optimal number of clusters is again determined by combining the elbow method with the silhouette coefficient method, so that intra-class similarity is as high as possible and inter-class similarity is as low as possible. Finally, industrial and commercial customers with similar water use trends and similar water use ranges are gathered into one class;
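The silhouette coefficient used when choosing the number of clusters can be computed as in the following sketch. It assumes Euclidean distance and the common convention that a sample in a singleton cluster contributes a score of 0; the function names are hypothetical and the elbow method is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(samples, labels):
    """Mean silhouette coefficient over all samples.

    For each sample, a is the mean distance to the other members of its
    own cluster and b is the smallest mean distance to any other cluster;
    the sample's score is (b - a) / max(a, b).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: singleton clusters score 0
        a = sum(euclidean(samples[i], samples[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(samples[i], samples[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(samples)


# Toy daily water use values for four customers in one trend class.
usage = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_score(usage, [0, 0, 1, 1])  # split by magnitude
bad = silhouette_score(usage, [0, 1, 0, 1])   # split that mixes magnitudes
```

A candidate number of clusters whose labelling scores closer to 1 separates the water use ranges more cleanly; here the magnitude-based split scores far higher than the mixed one.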

step5, visualizing a clustering result;

step 5.1, taking the mean vector of the daily water consumption vectors of all the industrial and commercial customers in each class of the clustering result of step 4 as the class center, calculating the mean daily water consumption of the industrial and commercial customers in each class, and visualizing, class by class, the water consumption of the customers, the class center and the mean daily water consumption in a two-dimensional coordinate system;

in the specific embodiment, the date is taken as an x axis, the daily water consumption value is taken as a y axis, and the real daily water consumption information, the class center coordinates and the average daily water consumption value of all the industrial and commercial businesses in the class are visualized;

in the embodiment, all class center coordinate vectors are visualized by taking the date as an x axis and the daily water consumption value as a y axis;
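The quantities plotted in step 5 can be sketched as follows. This is an illustrative reading under stated assumptions: the class center is the element-wise mean of the customers' daily series, the "mean daily water consumption" is taken to be a single scalar averaged over all customers and days, and the plotting helper uses matplotlib, which the text does not name. All function names are hypothetical.

```python
def class_center(series_list):
    """Element-wise mean of equal-length daily series (the class center)."""
    n = len(series_list)
    return [sum(day_values) / n for day_values in zip(*series_list)]


def class_daily_mean(series_list):
    """Scalar mean of daily water use over all customers and all days."""
    values = [v for series in series_list for v in series]
    return sum(values) / len(values)


def plot_class(dates, series_list, out_path="class.png"):
    """Draw every customer's series, the class center and the mean level."""
    # Imported lazily so the helpers above work without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for series in series_list:
        ax.plot(dates, series, color="lightgray", linewidth=0.8)
    ax.plot(dates, class_center(series_list), color="red", label="class center")
    ax.axhline(class_daily_mean(series_list), linestyle="--", label="daily mean")
    ax.set_xlabel("date")
    ax.set_ylabel("daily water consumption")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
    return out_path


# Two customers, three days of (normalized) daily water use.
series = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
center = class_center(series)
mean_level = class_daily_mean(series)
```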

step 5.3, acquiring the ids of the industrial and commercial customers in each class of the clustering result of step 4.2, and drawing a map according to their longitude and latitude information, so as to visualize the geographical positions of the industrial and commercial customers in each class.

In specific implementation, visualizing the geographical positions of the industrial and commercial customers within each class comprises the following steps:

step1: generating a longitude and latitude dictionary from the ids of the industrial and commercial customers in each class and the longitude and latitude information corresponding to each id, wherein the key is the customer id and the value is the longitude and latitude;

step2: drawing a canvas, namely the area in which the map is displayed, by importing the Geo package of the pyecharts drawing tool; displaying the number of industrial and commercial customers in the class directly above this area, obtained by counting the data records in each class, where each customer is represented as one record; and setting formats such as the size, background color and font size of the canvas;

step3: using the geo.add() function with the parameter maptype = '合肥' (Hefei) to load the map resource package of Hefei city onto the canvas;

step4: on the Hefei city map drawn in step3, marking the entries of the longitude and latitude dictionary obtained in step1 one by one as scatter points, and setting formats such as the size, shape and color of the scatter points;

step5: acquiring the name of each industrial and commercial customer according to its id, and displaying the name and the longitude and latitude information of the customer on its scatter point in the form of a legend;

step6: saving the map and visualization results drawn in step1 to step5 as an html file, thereby completing the visualization of the geographical positions of the industrial and commercial customers in each class.
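The map-drawing steps above can be sketched as follows. This is a hedged illustration rather than the embodiment's code: it assumes pyecharts 1.x or later, where the map type is passed to `add_schema()` instead of `geo.add()` as in the text (which appears to match older pyecharts releases), and it assumes the bundled map data includes a map named '合肥'. The helper names, the input dictionaries and the output file name are hypothetical.

```python
def build_coord_dict(customer_ids, coords_by_id):
    """Step1: map each customer id in a class to its (longitude, latitude).

    Ids without known coordinates are skipped (an assumption).
    """
    return {cid: coords_by_id[cid] for cid in customer_ids if cid in coords_by_id}


def render_class_map(coord_dict, names_by_id, out_path="class_map.html"):
    """Step2 to step6: draw the class members as scatter points and save html."""
    # Imported lazily so build_coord_dict works without pyecharts installed.
    from pyecharts import options as opts
    from pyecharts.charts import Geo

    # Step2: the canvas, with its size set through InitOpts.
    geo = Geo(init_opts=opts.InitOpts(width="1000px", height="700px"))
    # Step3: load the city map (assumed to be available as '合肥').
    geo.add_schema(maptype="合肥")

    # Step4 and step5: register each customer's coordinates under its name.
    # Customers sharing a name would overwrite each other's coordinates.
    data = []
    for cid, (lng, lat) in coord_dict.items():
        label = names_by_id.get(cid, str(cid))
        geo.add_coordinate(label, lng, lat)
        data.append((label, 1))
    geo.add("customers", data, type_="scatter", symbol_size=8)
    geo.set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    # Step2: show the number of customers in the class above the map.
    geo.set_global_opts(
        title_opts=opts.TitleOpts(title=f"Customers in class: {len(data)}")
    )
    # Step6: save the result as an html file.
    geo.render(out_path)
    return out_path


# Hypothetical ids and coordinates, for illustration only.
coords = {"a": (117.2, 31.8), "b": (117.3, 31.9)}
coord_dict = build_coord_dict(["a", "b", "x"], coords)
```

Calling `render_class_map(coord_dict, {"a": "Customer A", "b": "Customer B"})` would then write the map for one class; one html file per class keeps the classes visually separate.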

Claims

1. A method for clustering industrial and commercial businesses based on time sequence water consumption data is characterized by comprising the following steps:

step 1.3, dividing the remote water meter data according to the industrial and commercial customer id to obtain m parts of water meter data files named by the industrial and commercial customer id, and arranging all data in the water meter data files according to the sequence of water meter updating time; wherein m represents the total number of industrial businesses;

wherein

step 1.5, detecting and handling outliers in the sample feature set $X$ to obtain the sample feature set $X'$ after outlier handling;

step 1.6, handling missing values in the outlier-handled sample feature set $X'$ to obtain the sample feature set $X'$ after missing-value handling;

wherein

expressing the daily water consumption value of the ith industrial company on the tth day after normalization;

step 2.2, pre-training an LSTM model;

the normalized sample characteristic set

inputting the verification set into the LSTM model to obtain a prediction sequence for the verification set, then calculating the error between the prediction sequence output by the LSTM model and the verification set using the root mean square error, thereby completing one round of training of the LSTM model; stopping training when the number of training rounds reaches the epoch value, thereby obtaining the trained LSTM model, which serves as the time-series water use feature extraction model for industrial and commercial customers;

step 2.3, inputting the daily water consumption data of all industrial and commercial customers into the time-series water use feature extraction model, thereby outputting the water use feature vectors of the industrial and commercial customers, $Y = \{y^i \mid i = 1, 2, \ldots, m\}$; wherein $y^i$ represents the water use feature vector of the ith industrial and commercial customer, and

step 3.2, based on the optimal clustering number K, taking the water use feature vector $y^i$ of the ith industrial and commercial customer as the sample to be tested and inputting it into the kmeans algorithm, so that the industrial and commercial customers in the water use feature vector set $Y$ are clustered into K clusters, the coordinates of the K cluster centers being randomly initialized using formula (1):

in formula (1),

the center of the k-th cluster is indicated,

a coordinate value representing the nth dimension of the kth class center;

Euclidean distance of

Thereby obtaining an updated cluster center of

And assign a value to

k＝1,2,...,K；

In formula (6),

a coordinate value representing the nth dimension of the updated kth class center;

step 4.1, based on the result of the water use trend clustering of the industrial and commercial customers, acquiring the set of industrial and commercial customers in each class as $B_k = \{b^j \mid j = 1, 2, \ldots, S'_k\}$, where $b^j$ represents the jth industrial and commercial customer in the kth class, $k \in \{1, 2, \ldots, K\}$, and $S'_k$ represents the number of industrial and commercial customers in the kth class when the clustering algorithm converges;

step 4.2, obtaining the normalized real daily water consumption value on day t of the jth industrial and commercial customer $b^j$ in the kth class, recorded as

which represents the normalized real daily water consumption value on day t of the jth industrial and commercial customer $b^j$ in the kth class, $j = 1, 2, \ldots, S'_k$; repeating the process from step 3.1 to step 3.6 to re-cluster the real daily water consumption values of the industrial and commercial customers in each class, so that customers with similar water use trends and similar water consumption are gathered into one class;

step5, visualizing a clustering result;