CN108734217A

CN108734217A - Customer segmentation method and device based on cluster analysis

Info

Publication number: CN108734217A
Application number: CN201810496620.7A
Authority: CN
Inventors: 王新刚; 王琳琳; 孙涛; 姜雪松; 耿玉水; 鲁芹; 李爱民
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2018-05-22
Filing date: 2018-05-22
Publication date: 2018-11-02

Abstract

The invention discloses a customer segmentation method and device based on cluster analysis. The method comprises: obtaining a raw customer information data set and applying numerical preprocessing to obtain data samples, then reducing dimensionality and extracting features from the data samples with an autoencoder; computing the weight of each attribute feature of the autoencoder-processed data samples using the coefficient of variation method, and computing the distance between sample points using a weighted Euclidean distance formula; computing the average distance between all data samples, traversing the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, counting the neighbor points of every sample and sorting the counts in descending order, determining k initial cluster center points, and clustering the remaining data according to the weighted Euclidean distance formula, thereby completing customer segmentation.

Description

Customer segmentation method and device based on cluster analysis

Technical field

The invention belongs to the technical field of market statistics and marketing, and relates to a customer segmentation method and device based on cluster analysis.

Background

With the rapid development of science and technology and the widespread use of computers, the network has quietly penetrated every aspect of our daily lives. Today, using data mining technology to discover valuable information in every field has become increasingly important: it can both summarize past developments and predict future trends in the data. Customer segmentation is one important research area. Cluster analysis divides customers into different classes according to their similarities and differences, which helps an enterprise identify different types of customers, formulate differentiated sales plans, and increase its profit. How to segment customers has therefore become key to an enterprise obtaining greater profit. At present, customer segmentation systems have the following main problems:

First, when processing customer information, a customer segmentation system faces a large data volume, many attributes, and high data dimensionality. Clustering directly on the raw data makes customer segmentation inefficient, makes the computation cumbersome, and makes segmentation take too long.

Second, in customer segmentation research the traditional k-means clustering algorithm is one of the most commonly applied algorithms, but it treats all attributes of the data samples equally during clustering and ignores the differences between attributes. Different attributes differ in importance and therefore affect the clustering result differently; in general, important attributes have a larger effect on the clustering result.

Third, the traditional k-means algorithm is sensitive to the choice of initial cluster center points, and choosing them at random is blind. In general, the quality of the initial cluster center points strongly affects the clustering result: a poor choice may lead the clustering to a local optimum rather than the global optimum, and it also increases the number of iterations and slows the convergence of the algorithm.

In summary, the prior art still lacks an effective solution to the problem of how to better perform customer segmentation.

Summary of the invention

To address the deficiencies of the prior art and solve the problem of how to better perform customer segmentation, the present invention provides a customer segmentation method and device based on cluster analysis. In customer segmentation, customers are grouped into different classes according to their consumption habits, which provides a basis for proposing different marketing strategies for different classes of customers.

The first object of the present invention is to provide a customer segmentation method based on cluster analysis.

To achieve the above object, the present invention adopts the following technical solution:

A customer segmentation method based on cluster analysis, the method comprising:

obtaining a customer information data set, applying numerical preprocessing to obtain data samples, and reducing dimensionality and extracting features from the data samples with an autoencoder;

computing the weight of each attribute feature of the autoencoder-processed data samples using the coefficient of variation method, and computing the distance between sample points using a weighted Euclidean distance formula;

computing the average distance between all data samples, traversing the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, counting the neighbor points of every sample and sorting the counts in descending order, determining the initial cluster center points, and clustering the remaining data points to obtain different classes of customers, thereby completing customer segmentation.

As a further preferred scheme, in the method, the specific steps of numerical preprocessing include:

converting non-numeric data to numeric form;

processing the numeric data with a standardization formula;

processing the standardized data with a normalization formula to obtain the data samples.

As a further preferred scheme, the specific steps of reducing dimensionality and extracting features from the data samples with the autoencoder include:

inputting the unlabeled original data samples into the encoder of the autoencoder for compressed encoding to obtain the code;

decoding the code with the decoder of the autoencoder to obtain new data samples;

computing the error between the new data samples and the original data samples, adjusting the weight parameters of the encoder and decoder of the autoencoder according to the error, and reducing dimensionality and extracting features from the data samples with the autoencoder after parameter adjustment.

As a further preferred scheme, in the method, the improved k-means algorithm includes:

computing the weight of each attribute feature of the autoencoder-processed data samples using the coefficient of variation method, computing the distance between samples using the weighted Euclidean distance formula, and computing the average distance between all data samples;

traversing the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, counting the neighbor points of every sample and sorting the counts in descending order, and determining the initial cluster center points.

As a further preferred scheme, in the improved k-means algorithm, the specific steps of computing the weight of each attribute feature of the autoencoder-processed data samples using the coefficient of variation method include:

obtaining the attribute value matrix of the data samples processed by the autoencoder;

computing the coefficient of variation of each attribute dimension in the attribute value matrix;

computing the weight of each attribute feature from the obtained coefficients of variation.

As a further preferred scheme, in the improved k-means algorithm, the coefficient of variation of each attribute dimension is computed from the standard deviation and the mean of that attribute's values in the attribute value matrix.

As a further preferred scheme, in the improved k-means algorithm, the specific steps of computing the distance between sample points using the weighted Euclidean distance formula include:

weighting the Euclidean distance with the computed weight of each attribute dimension;

computing the distance between data sample points using the weighted Euclidean distance formula.

As a further preferred scheme, in the improved k-means algorithm, the specific steps of determining the initial cluster center points include:

selecting any data sample point, finding all sample points whose distance to it is less than the average distance, taking them as the neighbor points of that data sample point, and counting the neighbor points;

traversing the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, counting the neighbor points of every sample, and sorting the counts in descending order;

selecting the sample point with the largest number of neighbor points as the first initial cluster center point, ignoring any sample point that is a neighbor point of an already selected initial cluster center point, and continuing in this way through all sample points until k initial cluster center points are determined.

The second object of the present invention is to provide a computer-readable storage medium.

A computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the customer segmentation method based on cluster analysis.

The third object of the present invention is to provide a terminal device.

A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement the instructions, and the computer-readable storage medium being configured to store a plurality of instructions adapted to be loaded by the processor to execute the customer segmentation method based on cluster analysis.

Beneficial effects of the present invention:

1. The customer segmentation method and device based on cluster analysis of the present invention introduce the autoencoder to reduce the dimensionality of the data samples and extract their features, so that the features obtained after dimensionality reduction represent the characteristics of the original data samples as fully as possible. The autoencoder is effective for feature extraction from the raw data, handles high-dimensional data more efficiently given the characteristics of streaming data, and works better in customer information processing.

2. The customer segmentation method and device based on cluster analysis of the present invention introduce the coefficient of variation method to reflect the importance of different attributes to the clustering result. An attribute with a larger degree of dispersion plays a larger role in clustering. Starting from the data of the sample set, the invention obtains the weight of each attribute feature from its coefficient of variation, applies the weights in the Euclidean distance formula as the weighting coefficients of the attributes, and computes the distance between samples with the weighted Euclidean distance. Clustering with this formula can improve the clustering result.

3. The new method of selecting cluster center points adopted by the customer segmentation method and device based on cluster analysis of the present invention not only avoids the randomness of choosing initial cluster center points but also reflects the distribution of the data samples. It is less likely to produce a locally optimal clustering and compensates for the shortcomings of the traditional k-means algorithm.

Description of the drawings

The accompanying drawings, which form a part of this application, provide a further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it.

Fig. 1 is a flow chart of the customer segmentation method based on cluster analysis of the present invention;

Fig. 2 is a flow chart of the autoencoder of the present invention.

Detailed description of the embodiments:

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used in this embodiment have the same meaning as commonly understood by a person of ordinary skill in the technical field of the application.

It should be noted that the terms used here are only for describing specific embodiments and are not intended to limit the illustrative embodiments of the application. As used here, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

It should be noted that the flow charts and block diagrams in the drawings show the architecture, functions, and operations of possible implementations of methods and systems according to various embodiments of the present disclosure. Each box in a flow chart or block diagram may represent a module, a program segment, or a portion of code, which may include one or more executable instructions for implementing the logical function specified in each embodiment. In some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. Each box in a flow chart and/or block diagram, and any combination of such boxes, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another. The invention is further described below with reference to the drawings and embodiments.

Embodiment 1:

The purpose of Embodiment 1 is to provide a customer segmentation method based on cluster analysis.

As shown in Fig. 1, the method comprises:

Step (1): obtain the raw customer information data set, apply numerical preprocessing to obtain data samples, and reduce dimensionality and extract features from the data samples with an autoencoder;

Step (2): process the autoencoder-processed data with the improved k-means algorithm to obtain the clustering result, completing customer segmentation;

Step (2-1): compute the weight of each attribute feature of the autoencoder-processed data samples using the coefficient of variation method, and compute the distance between samples using the weighted Euclidean distance formula;

Step (2-2): compute the average distance between all data samples, traverse the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, count the neighbor points of every sample and sort the counts in descending order, determine the initial cluster center points, and then cluster the remaining data points according to the weighted Euclidean distance, completing customer segmentation.
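The final clustering pass of step (2-2) can be sketched as follows. This is a minimal illustration, not the patent's own implementation: the text only states that the remaining points are clustered by weighted Euclidean distance, so the standard k-means assign-and-update iteration is assumed here, and the names `weighted_kmeans`, `center_idx`, and `max_iter` are hypothetical.

```python
import numpy as np

def weighted_kmeans(X, w, center_idx, max_iter=100):
    """Cluster rows of X, starting from the samples listed in center_idx.

    Assumption: a standard k-means loop (assign, then recompute means),
    with the plain Euclidean distance replaced by the weighted one.
    """
    centers = X[center_idx].astype(float)
    labels = None
    for _ in range(max_iter):
        # Weighted Euclidean distance from every sample to every center.
        diff = X[:, None, :] - centers[None, :, :]
        dist = np.sqrt(np.sum(w * diff ** 2, axis=2))
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its current members.
        for c in range(len(centers)):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels, centers

# Toy data: two well-separated groups of three customers each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
w = np.array([0.5, 0.5])          # illustrative attribute weights
labels, centers = weighted_kmeans(X, w, center_idx=[0, 3])
print(labels, centers)
```

Here the initial centers are passed in as sample indices, so any selection rule, including the neighbor-count rule of step (2-2), can supply them.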

In step (1) of this embodiment, the specific steps of numerical preprocessing include:

First step: most data sets contain not only numeric data but also other types of data, such as character data, so the non-numeric data is first converted to numeric form. For example, the gender attribute has two values, male and female. After conversion, male is represented by 0 and female by 1, so that gender = 0 denotes male and gender = 1 denotes female.

Second step: process the numeric data with the standardization formula:

$$x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\theta_j}$$

where $x'_{ij}$ is the standardized result, $x_{ij}$ is the numeric value to be processed, $\bar{x}_j$ is the mean of the numeric data to be processed, and $\theta_j$ is the standard deviation of the numeric data to be processed.

Third step: process the standardized data with the normalization formula:

$$x''_{ij} = \frac{x'_{ij} - \min\{x'_{ij}\}}{\max\{x'_{ij}\} - \min\{x'_{ij}\}}$$

where $\min\{x'_{ij}\}$ is the minimum of the standardized results and $\max\{x'_{ij}\}$ is the maximum of the standardized results.

The resulting data set is more convenient for the feature extraction in the subsequent steps.
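The three preprocessing steps can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names, the sample values, and the use of the population standard deviation are assumptions, since the text does not fix them.

```python
import numpy as np

def encode_categories(values, mapping):
    """Step 1: map non-numeric values to numbers, e.g. male -> 0, female -> 1."""
    return np.array([mapping[v] for v in values], dtype=float)

def standardize(X):
    """Step 2: per attribute, subtract the mean and divide by the standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)                  # population standard deviation (assumption)
    std = np.where(std == 0, 1.0, std)   # guard: leave constant columns at zero
    return (X - mean) / std

def min_max_normalize(X):
    """Step 3: rescale each attribute to the range [0, 1]."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    span = np.where(hi - lo == 0, 1.0, hi - lo)  # guard against division by zero
    return (X - lo) / span

# Hypothetical customer records: gender, age, yearly spend.
gender = encode_categories(["male", "female", "female", "male"],
                           {"male": 0, "female": 1})
age = np.array([23.0, 35.0, 41.0, 58.0])
spend = np.array([120.0, 560.0, 300.0, 980.0])

X = np.column_stack([gender, age, spend])
samples = min_max_normalize(standardize(X))
print(samples)
```

After both steps every attribute lies in [0, 1], so no single attribute dominates the later distance computation merely because of its unit.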

In step (1) of this embodiment, the specific steps of reducing dimensionality and extracting features from the data samples with the autoencoder include:

The autoencoder consists mainly of two parts, an encoder network and a decoder network. The encoder compresses and encodes the input samples, with the aim of representing the high-dimensional original data as fully as possible with a lower-dimensional vector. The decoder decodes the resulting new samples and restores them to the original data as closely as possible.

Their working process can be described as follows:

First step: input the unlabeled original data samples into the encoder for encoding to obtain the code.

Second step: decode the code with the decoder.

Third step: compute the error between the new sample information and the original sample information, and adjust the weight parameters of the encoder and decoder according to the error so that the reconstruction error is minimized. The code obtained at that point is the feature representation of the original samples.

The implementation process of the autoencoder is shown in Fig. 2.
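The encode, decode, and adjust cycle described above can be illustrated with a very small autoencoder written directly in NumPy. The architecture here (one tanh encoder layer, one linear decoder layer, mean squared reconstruction error, plain gradient descent) and all names and hyperparameters are assumptions chosen for brevity; the patent does not specify them.

```python
import numpy as np

def train_autoencoder(X, code_dim, epochs=500, lr=0.1, seed=0):
    """Train a one-hidden-layer autoencoder; return (encode function, loss history)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W_enc = rng.normal(0.0, 0.1, size=(m, code_dim))
    b_enc = np.zeros(code_dim)
    W_dec = rng.normal(0.0, 0.1, size=(code_dim, m))
    b_dec = np.zeros(m)
    losses = []
    for _ in range(epochs):
        code = np.tanh(X @ W_enc + b_enc)   # first step: encode
        X_hat = code @ W_dec + b_dec        # second step: decode
        err = X_hat - X                     # third step: reconstruction error
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared error with respect to each parameter.
        g_out = 2.0 * err / err.size
        g_W_dec = code.T @ g_out
        g_b_dec = g_out.sum(axis=0)
        g_pre = (g_out @ W_dec.T) * (1.0 - code ** 2)   # tanh derivative
        g_W_enc = X.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        # Adjust encoder and decoder weights according to the error.
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec

    def encode(Z):
        return np.tanh(Z @ W_enc + b_enc)

    return encode, losses

# Synthetic data: 6 observed attributes driven by 2 hidden factors.
rng = np.random.default_rng(1)
X = rng.random((200, 2)) @ rng.random((2, 6))
encode, losses = train_autoencoder(X, code_dim=2)
codes = encode(X)            # low-dimensional features used for clustering
print(codes.shape, losses[0], losses[-1])
```

In practice a deep learning framework would normally replace the hand-written gradients; the point of the sketch is only that the code layer, not the reconstruction, is what gets passed on to the clustering step.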

This embodiment introduces the autoencoder. In the big data era we constantly face massive, real-time, high-dimensional streaming data. As one of the effective methods of deep learning, the autoencoder reduces the dimensionality of the data samples and extracts their features, so that the features obtained after dimensionality reduction represent the characteristics of the original data samples as fully as possible. The autoencoder is effective for feature extraction from the raw data and, given the characteristics of streaming data, works better in the field of market user segmentation.

In step (2-1) of this embodiment, the coefficient of variation method is introduced to reflect the importance of different attributes to the clustering result. In general, an attribute with a larger degree of dispersion plays a larger role in clustering. Starting from the data of the sample set, the invention therefore obtains the weight of each attribute feature from its coefficient of variation, applies the weights in the Euclidean distance formula as the weighting coefficients of the attributes, and computes the distance between samples with the weighted Euclidean distance. The details are as follows:

Suppose the data set X is the set of data samples to be clustered and consists of n data objects of dimension m. Its attribute value matrix is expressed as:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$

The i-th data sample in X can be written as $x_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{ij}, \ldots, x_{im})$, where $x_{ij}$ is the value of the j-th attribute of the i-th data object, i = 1, 2, ..., n; j = 1, 2, ..., m.

1. First, compute the coefficient of variation of each attribute. The coefficient of variation is the ratio of the standard deviation to the mean. Let $v_j$ denote the coefficient of variation of the j-th attribute:

$$v_j = \frac{S_j}{\bar{x}_j}$$

where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ is the mean of the j-th attribute and $S_j$ is the standard deviation of the j-th attribute.

2. Next, compute the weight $w_j$ of each attribute from the obtained coefficients of variation:

$$w_j = \frac{v_j}{\sum_{k=1}^{m} v_k}$$

where 1 ≤ j ≤ m.

3. Finally, use the computed weight of each attribute dimension to weight the Euclidean distance, and compute the distance between data sample points with the weighted Euclidean distance. The weighted Euclidean distance can be expressed as:

$$d_w(x_a, x_b) = \sqrt{\sum_{j=1}^{m} w_j \,(x_{aj} - x_{bj})^2}$$

where $x_a$ and $x_b$ are two data sample points.
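The weighting steps 1 to 3 can be sketched as below. The function names are hypothetical, and two details are assumptions rather than statements from the text: the sample standard deviation (`ddof=1`) is used, and every attribute mean is assumed to be positive so that the ratio of standard deviation to mean is well defined.

```python
import numpy as np

def cv_weights(X):
    """Attribute weights from the coefficient of variation (std / mean)."""
    mean = X.mean(axis=0)           # assumed positive for every attribute
    std = X.std(axis=0, ddof=1)     # sample standard deviation (assumption)
    v = std / mean                  # coefficient of variation per attribute
    return v / v.sum()              # normalize so the weights sum to 1

def weighted_euclidean(a, b, w):
    """Euclidean distance with each squared difference scaled by its weight."""
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))

# Two attributes on different scales but with the same relative dispersion.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
w = cv_weights(X)
d = weighted_euclidean(X[0], X[2], w)
print(w, d)
```

Because the coefficient of variation is a ratio, the two columns above receive equal weight even though one is ten times larger in absolute terms; an attribute only gains weight when its dispersion is large relative to its own mean.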

In step (2-2) of this embodiment, because the traditional k-means algorithm is sensitive to the choice of initial cluster center points and blind random selection can produce a poor clustering result, a new method of selecting the initial cluster center points is proposed. The detailed process is as follows:

1. Compute the distance between every two data samples in the data set, using the weighted Euclidean distance given above.

2. Compute the average distance between all data samples according to the following formula:

$$Dis_{(Average)} = \frac{1}{A_n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_w(x_i, x_j)$$

where n is the number of data sample points and $A_n^2 = n(n-1)$ is the number of permutations of any 2 samples taken from the data set.

3. Select any data sample point $x_i$ (1 ≤ i ≤ n) and find all sample points whose distance to $x_i$ is less than the average distance $Dis_{(Average)}$. Such sample points are called the neighbor points of $x_i$; count them. Proceed in the same way to count the neighbor points of every sample in the data set, and sort the samples by their number of neighbor points from high to low.

4. Select the sample point with the largest number of neighbor points as the 1st initial cluster center point and the sample point with the 2nd largest number as the 2nd initial cluster center point, and continue down the ranking. If the sample $x_j$ (1 ≤ j ≤ n) ranked p-th by number of neighbor points is a neighbor point of an already selected cluster center point, ignore it and examine the sample point $x_z$ (1 ≤ z ≤ n) ranked (p+1)-th. If $x_z$ is not a neighbor point of any existing cluster center point, take $x_z$ as an initial cluster center point. Continue in this way until all k initial cluster center points are found.
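Steps 1 to 4 above can be sketched as follows. The function names are hypothetical, ties in the neighbor count are broken by sample order (the text does not say how ties are resolved), and the function simply returns fewer than k centers if the ranking is exhausted first, a case the text does not address.

```python
import numpy as np

def pairwise_weighted_distances(X, w):
    """Step 1: weighted Euclidean distance between every pair of samples."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(w * diff ** 2, axis=2))

def average_distance(D):
    """Step 2: sum over ordered pairs divided by n(n-1); the diagonal is zero."""
    n = D.shape[0]
    return D.sum() / (n * (n - 1))

def select_initial_centers(D, k):
    """Steps 3 and 4: rank samples by neighbor count, skip neighbors of chosen centers."""
    n = D.shape[0]
    avg = average_distance(D)
    is_neighbor = (D < avg) & ~np.eye(n, dtype=bool)   # a point is not its own neighbor
    counts = is_neighbor.sum(axis=1)
    order = np.argsort(-counts, kind="stable")         # descending, ties by index
    centers = []
    for i in order:
        if any(is_neighbor[c, i] for c in centers):
            continue                                   # neighbor of a chosen center
        centers.append(int(i))
        if len(centers) == k:
            break
    return centers

# One-dimensional toy data with two obvious groups; weight 1.0 keeps the
# arithmetic easy to follow (illustrative value only).
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
w = np.array([1.0])
D = pairwise_weighted_distances(X, w)
centers = select_initial_centers(D, k=2)
print(average_distance(D), centers)
```

In this toy set every point has exactly two neighbors, so sample 0 is picked first, samples 1 and 2 are skipped as its neighbors, and sample 3 becomes the second center; the two centers therefore land in different groups, which is the behavior the selection rule is meant to produce.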

Embodiment 2:

The purpose of Embodiment 2 is to provide a computer-readable storage medium.

A computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the following processing:

Step (2-2): compute the average distance between all data samples, traverse the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, count the neighbor points of every sample and sort the counts in descending order, determine the initial cluster center points, and then cluster the remaining data points according to the weighted Euclidean distance, completing customer segmentation.

Embodiment 3:

The purpose of Embodiment 3 is to provide a customer segmentation device based on cluster analysis.

A customer segmentation device based on cluster analysis, comprising a processor and a computer-readable storage medium, the processor being configured to implement the instructions, and the computer-readable storage medium being configured to store a plurality of instructions adapted to be loaded by the processor to execute the following processing:

When run in a device, these computer-executable instructions cause the device to perform the methods or processes described in the embodiments of the present disclosure.

In this embodiment, a computer program product may include a computer-readable storage medium carrying computer-readable program instructions for performing various aspects of the present disclosure. The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium may be, for example but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove with instructions stored on it, and any suitable combination of the above. The computer-readable storage medium as used here is not to be interpreted as a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer-readable program instructions described here may be downloaded from a computer-readable storage medium to the respective computing/processing devices, or to an external computer or external storage device through a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in that computing/processing device.

The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as C++ and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is customized using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.

It should be noted that although several modules or submodules of the device are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of the present disclosure, the features and functions of two or more of the modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided and embodied by multiple modules.

The foregoing are only preferred embodiments of the application and are not intended to limit it. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within its protection scope. Therefore, the present invention is not limited to the embodiments shown here, but is to be accorded the widest scope consistent with the principles and novel features disclosed here.

Claims

1. A customer segmentation method based on cluster analysis, characterized in that the method comprises:

obtaining a raw customer information data set, applying numerical preprocessing to obtain data samples, and reducing dimensionality and extracting features from the data samples with an autoencoder;

computing the weight of each attribute feature of the autoencoder-processed data samples using the coefficient of variation method, and computing the distance between sample points using a weighted Euclidean distance formula;

computing the average distance between all data samples, traversing the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, counting the neighbor points of every sample and sorting the counts in descending order, determining the initial cluster center points, and clustering the remaining data according to the weighted Euclidean distance, thereby completing customer segmentation.

2. The method according to claim 1, characterized in that, in the method, the specific steps of numerical preprocessing include:

converting non-numeric data to numeric form;

processing the numeric data with a standardization formula;

3. The method according to claim 1, characterized in that the specific steps of reducing dimensionality and extracting features from the data samples with the autoencoder include:

inputting the unlabeled original data samples into the encoder of the autoencoder for compressed encoding to obtain the code;

computing the error between the new data samples and the original data samples, adjusting the weight parameters of the encoder and decoder of the autoencoder according to the error, and reducing dimensionality and extracting features from the data samples with the autoencoder after parameter adjustment.

4. The method according to claim 1, characterized in that, in the method, the improved k-means algorithm includes:

computing the weight of each attribute feature of the autoencoder-processed data samples using the coefficient of variation method, computing the distance between samples using the weighted Euclidean distance formula, and computing the average distance between all data samples;

traversing the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, counting the neighbor points of every sample and sorting the counts in descending order, and determining the initial cluster center points.

5. The method according to claim 4, characterized in that, in the improved k-means algorithm, the specific steps of computing the weight of each attribute feature of the autoencoder-processed data samples using the coefficient of variation method include:

6. The method according to claim 4, characterized in that, in the improved k-means algorithm, the coefficient of variation of each attribute dimension is computed from the standard deviation and the mean of that attribute's values in the attribute value matrix.

7. The method according to claim 4, characterized in that, in the improved k-means algorithm, the specific steps of computing the distance between sample points using the weighted Euclidean distance formula include:

8. The method according to claim 4, characterized in that, in the improved k-means algorithm, the specific steps of determining the initial cluster center points include:

selecting any data sample point, finding all sample points whose distance to it is less than the average distance, taking them as the neighbor points of that data sample point, and counting the neighbor points;

traversing the data samples to find, for each sample point, the neighbor points whose distance to it is less than the average distance, counting the neighbor points of every sample, and sorting the counts in descending order;

selecting the sample point with the largest number of neighbor points as the first initial cluster center point, ignoring any sample point that is a neighbor point of an already selected initial cluster center point, and continuing in this way through all sample points until k initial cluster center points are determined.

9. A computer-readable storage medium storing a plurality of instructions, characterized in that the instructions are adapted to be loaded by a processor of a terminal device to execute the method according to any one of claims 1-8.

10. A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement the instructions, and the computer-readable storage medium being configured to store a plurality of instructions, characterized in that the instructions are used to execute the method according to any one of claims 1-8.