CN106649517A

CN106649517A - Data mining method, device and system

Info

Publication number: CN106649517A
Application number: CN201610901862.0A
Authority: CN
Inventors: 侯捷; 李爱华; 葛胜利
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2016-10-17
Filing date: 2016-10-17
Publication date: 2017-05-10

Abstract

The invention provides a data mining method, device and system, and relates to the field of big data. The data mining method comprises the following steps: obtaining predetermined behavior data of users; classifying the users according to the generation time and the quantity of each user's predetermined behavior data to determine a target user set; generating a single-user feature vector for each user in the target user set according to the predetermined behavior data; and grading the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets. With this method, users are first classified and clustering is then carried out within one category, so that suitable target users can be selected for cluster analysis. On the one hand, the analysis is better targeted and the volume of data to be processed is reduced; on the other hand, interference from the data of other user categories with the clustering result is eliminated, so that the division of user groups is more accurate.

Description

Data mining method, device and system

Technical field

The present invention relates to the field of big data, and in particular to a data mining method, device and system.

Background art

In big data applications, a user population is often divided into several classes according to various behavioral characteristics of the users, so that precise, personalized services can be provided according to the characteristics of each user group. Clustering is one way of dividing a user population. Clustering is the process of grouping data objects into classes such that objects in the same class are highly similar to one another while objects in different classes are highly dissimilar. Dissimilarity is usually measured by distance. Cluster analysis is widely used in many fields, such as market research, data analysis and pattern recognition.

However, how well a clustering operation divides a user population by behavioral characteristics depends to a large extent on the quality of the underlying data. Existing user-group division based on clustering algorithms often fails to reflect users' behavioral characteristics well, and the clustering is inaccurate, which makes it difficult to use the clustering result to provide precise, personalized services to user groups.

Summary of the invention

An object of the present invention is to improve the accuracy of user group division.

According to one aspect of the present invention, a data mining method is provided, comprising: obtaining predetermined behavior data of users, the predetermined behavior data including effectiveness data of a predetermined behavior and the generation time of the predetermined behavior; classifying the users according to the generation time and the quantity of each user's predetermined behavior data to determine a target user set; generating a single-user feature vector for each user in the target user set according to the predetermined behavior data; and grading the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets.

Optionally, the predetermined behavior data further include a predetermined condition mark and effectiveness deduction data, and first predetermined behavior data are identified according to the predetermined condition mark. The single-user feature vector includes a first feature vector index, a second feature vector index, a third feature vector index, a fourth feature vector index, a fifth feature vector index and/or a sixth feature vector index. Generating the single-user feature vector of each user in the target user set according to the predetermined behavior data includes: determining the first feature vector index of a user from the ratio of the quantity of the user's first predetermined behavior data to the quantity of the user's predetermined behavior data; determining, for each item of the user's predetermined behavior data, the ratio of the effectiveness deduction data to the effectiveness data, and averaging these ratios to determine the second feature vector index of the user; determining the third feature vector index of the user from the ratio of the sum of the user's effectiveness deduction data to the sum of the user's effectiveness data; determining the fourth feature vector index of the user from the sum of the user's effectiveness deduction data; determining the fifth feature vector index of the user from the quantity of the user's first predetermined behavior data; and/or determining the sixth feature vector index of the user from the ratio of the number of time periods in which the user has first predetermined behavior data to the number of time periods elapsed since the user joined the network.

Optionally, grading the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets includes: determining high-density-region users according to the single-user feature vector of each user; selecting, from the high-density-region users, users to serve as initial cluster centers, the number of initial cluster centers being equal to the predetermined number of grades; and determining the graded user sets with the K-means algorithm starting from the initial cluster centers.

Optionally, selecting the initial cluster centers from the high-density-region users includes: selecting, according to the single-user feature vectors, the high-density-region user with the largest density parameter as the first initial cluster center; selecting, from the high-density-region users, the user farthest from the first initial cluster center as the second initial cluster center; selecting, from the high-density-region users, the user farthest from the set consisting of the first and second initial cluster centers as the third initial cluster center; and so on until all initial cluster centers have been determined.

Optionally, the method further includes excluding abnormal users from the target user set, the abnormal users including users whose sum of effectiveness deduction data exceeds a predetermined quantile. Grading the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets then includes: grading the target user set with a clustering algorithm according to the single-user feature vectors of the users remaining in the target user set after the abnormal users are excluded, to determine the graded user sets; and selecting a graded user set for each abnormal user based on a predetermined policy and merging the abnormal user into that graded user set.

Optionally, the method further includes standardizing the feature vector indexes in the single-user feature vectors. Grading the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets then includes: grading the target user set with a clustering algorithm according to the standardized single-user feature vectors to determine the graded user sets.

With this method, users are first classified and user clustering is carried out within one category, so that suitable target users can be selected for cluster analysis. On the one hand, the analysis is better targeted and the amount of data to be processed is reduced; on the other hand, interference from the data of other user categories with the clustering result is excluded, so that the user population is divided more accurately, which makes it easier to provide precise, personalized services based on the result of the division.

According to another aspect of the present invention, a data mining device is provided, comprising: a data acquisition module for obtaining predetermined behavior data of users, the predetermined behavior data including effectiveness data of a predetermined behavior and the generation time of the predetermined behavior; a user classification module for classifying the users according to the generation time and the quantity of each user's predetermined behavior data to determine a target user set; a feature vector generation module for generating a single-user feature vector for each user in the target user set according to the predetermined behavior data; and a user grading module for grading the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets.

Optionally, the user grading module includes: a high-density user determining unit for determining high-density-region users according to the single-user feature vector of each user; an initial center determining unit for selecting, from the high-density-region users, users to serve as initial cluster centers, the number of initial cluster centers being equal to the predetermined number of grades; and a clustering unit for determining the graded user sets with the K-means algorithm starting from the initial cluster centers.

Optionally, the initial center determining unit is configured to: select, according to the single-user feature vectors, the high-density-region user with the largest density parameter as the first initial cluster center; select, from the high-density-region users, the user farthest from the first initial cluster center as the second initial cluster center; select, from the high-density-region users, the user farthest from the set consisting of the first and second initial cluster centers as the third initial cluster center; and so on until all initial cluster centers have been determined.

Optionally, the device further includes an abnormal user exclusion module for excluding abnormal users from the target user set, the abnormal users including users whose sum of effectiveness deduction data exceeds a predetermined quantile. The user grading module is configured to: grade the target user set with a clustering algorithm according to the single-user feature vectors of the users remaining in the target user set after the abnormal users are excluded, to determine the graded user sets; and select a graded user set for each abnormal user based on a predetermined policy and merge the abnormal user into that graded user set.

Optionally, the device further includes a standardization module for standardizing the feature vector indexes in the single-user feature vectors. The user grading module is configured to grade the target user set with a clustering algorithm according to the standardized single-user feature vectors to determine the graded user sets.

Such a device first classifies users and carries out user clustering within one category, so that suitable target users can be selected for cluster analysis. On the one hand, the analysis is better targeted and the amount of data to be processed is reduced; on the other hand, interference from the data of other user categories with the clustering result is excluded, so that the user population is divided more accurately, which makes it easier to provide precise, personalized services based on the result of the division.

According to a further aspect of the present invention, a data mining system is provided, comprising a memory and a processor coupled to the memory, the processor being configured to perform any one of the methods described above based on instructions stored in the memory.

Such a system first classifies users and carries out user clustering within one category, so that suitable target users can be selected for cluster analysis. On the one hand, the analysis is better targeted and the amount of data to be processed is reduced; on the other hand, interference from the data of other user categories with the clustering result is excluded, so that the user population is divided more accurately, which makes it easier to provide precise, personalized services based on the result of the division.

Brief description of the drawings

The accompanying drawings described herein are provided for a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of one embodiment of the data mining method of the present invention.

Fig. 2 is a flowchart of one embodiment of user clustering in the data mining method of the present invention.

Fig. 3 is a flowchart of another embodiment of the data mining method of the present invention.

Fig. 4 is a schematic diagram of one embodiment of the data mining device of the present invention.

Fig. 5 is a schematic diagram of one embodiment of the user grading module in the data mining device of the present invention.

Fig. 6 is a schematic diagram of another embodiment of the data mining device of the present invention.

Fig. 7 is a schematic diagram of one embodiment of the data mining system of the present invention.

Fig. 8 is a schematic diagram of another embodiment of the data mining system of the present invention.

Detailed description

The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.

A flowchart of one embodiment of the data mining method of the present invention is shown in Fig. 1.

In step 101, predetermined behavior data of users are obtained. The predetermined behavior data include effectiveness data of a predetermined behavior and the generation time of the predetermined behavior. The same user may have multiple items of predetermined behavior data, each including its generation time and effectiveness data. In one embodiment, the predetermined behavior data of multiple users may be obtained.

In step 102, the users are classified according to the generation time and the quantity of each user's predetermined behavior data to determine a target user set. In one embodiment, users may be classified by the generation time of the predetermined behavior data, by the quantity of predetermined behavior data generated, or by a combination of the two for a finer classification. One or more categories may be selected as target user sets as required.

In step 103, a single-user feature vector is generated for each user in the target user set according to the predetermined behavior data. In one embodiment, the single-user feature vector may be determined from the quantity of predetermined behavior data, the effectiveness data of the predetermined behavior data, the time interval in which the generation time falls, and so on.

In step 104, the target user set is graded with a clustering algorithm according to the single-user feature vectors to determine graded user sets, where the number of graded user sets equals the predetermined number of grades. In one embodiment, initial cluster centers may be selected, the number of selected initial cluster centers being the same as the predetermined number of grades, and the clustering operation is carried out with the K-means algorithm.

In one embodiment, a predetermined time threshold and a predetermined quantity threshold may be set to classify users. If the generation time of a user's predetermined behavior data is earlier than the predetermined time threshold and the quantity of predetermined behavior data is greater than the predetermined quantity threshold, the user is determined to be a first-class user. If the generation time of the predetermined behavior data is earlier than the predetermined time threshold and the quantity of predetermined behavior data is not greater than the predetermined quantity threshold, the user is determined to be a second-class user. If the user has predetermined behavior data whose generation time is no earlier than the predetermined time threshold, and the quantity of predetermined behavior data generated no earlier than the predetermined time threshold is greater than the predetermined quantity threshold, the user is determined to be a third-class user. If the user has predetermined behavior data whose generation time is no earlier than the predetermined time threshold, and the quantity of predetermined behavior data generated no earlier than the predetermined time threshold is not greater than the predetermined quantity threshold, the user is determined to be a fourth-class user.
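The four-way classification above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `classify_user`, the use of calendar dates as generation times, and the reading that first- and second-class users are those whose records were all generated before the threshold are assumptions made for the example.

```python
from datetime import date


def classify_user(record_dates, time_threshold, count_threshold):
    """Return the class (1 to 4) of one user, or None if the user has no records.

    record_dates: generation times of the user's predetermined behavior data.
    Assumption: classes 1 and 2 apply when every record predates the threshold.
    """
    if not record_dates:
        return None
    # Records generated no earlier than the time threshold.
    recent = [d for d in record_dates if d >= time_threshold]
    if not recent:
        # All records are earlier than the threshold: compare the total count.
        return 1 if len(record_dates) > count_threshold else 2
    # Some records are no earlier than the threshold: compare the recent count.
    return 3 if len(recent) > count_threshold else 4


threshold = date(2016, 1, 1)
users = {
    "u1": [date(2015, 3, 1), date(2015, 5, 2), date(2015, 7, 9)],
    "u2": [date(2015, 3, 1)],
    "u3": [date(2015, 12, 1), date(2016, 2, 1), date(2016, 3, 1), date(2016, 4, 1)],
    "u4": [date(2016, 2, 1)],
}
classes = {uid: classify_user(dates, threshold, 2) for uid, dates in users.items()}
```

Any one of the resulting classes, or several of them, could then be taken as the target user set.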

With this method, users can be classified according to the generation time and quantity of their predetermined behavior data, and the users of the required category are selected as the target user set. Alternatively, the user set of each category can be clustered separately so that the users of every category are graded. Users are thus graded among users of the same category, which improves the accuracy of user grading.

In one embodiment, users who have produced no predetermined behavior data for a long period may be excluded. Because such users have been inactive for a long time, analysing their behavior and mining their data is of little value; excluding them reduces the amount of computation, reduces their influence on the grading result, and lowers operating costs when the data are applied.

In one embodiment, the predetermined behavior data further include a predetermined condition mark and effectiveness deduction data. The effectiveness deduction data may be the effectiveness deducted because the predetermined behavior meets the predetermined condition, for example the amount by which the effectiveness data fall below the standard effectiveness data. In one embodiment, whether a predetermined behavior meets the predetermined condition can be judged from the predetermined condition mark of its predetermined behavior data, and the predetermined behavior data of a predetermined behavior that meets the predetermined condition may be called first predetermined behavior data. The single-user feature vector can reflect the proportion of predetermined behaviors that meet the predetermined condition and the effect they produce, so that data mining can be used to analyse users' behavioral characteristics, in particular their sensitivity to the predetermined condition. In one embodiment, the first feature vector index of a user may be determined from the ratio of the quantity of the user's first predetermined behavior data to the quantity of the user's predetermined behavior data. In another embodiment, the ratio of the effectiveness deduction data to the effectiveness data may be determined for each item of the user's predetermined behavior data, and the ratios averaged to determine the second feature vector index of the user. In yet another embodiment, the third feature vector index of the user may be determined from the ratio of the sum of the user's effectiveness deduction data to the sum of the user's effectiveness data. In another embodiment, the fourth feature vector index of the user may be determined from the sum of the user's effectiveness deduction data. The fifth feature vector index of the user may also be determined from the quantity of the user's first predetermined behavior data. In addition, the sixth feature vector index of the user may be determined from the ratio of the number of time periods in which the user has first predetermined behavior data to the number of time periods elapsed since the user joined the network.
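The six feature vector indexes can be illustrated with a short sketch. The record layout (dictionary keys `effect`, `deduction`, `flagged`, `period`) and the function name `feature_vector` are hypothetical and chosen only for this example; the sketch also assumes that the user has at least one record and that every record has non-zero effectiveness data.

```python
def feature_vector(records, periods_since_joining):
    """Build the six-index single-user feature vector for one user.

    records: list of dicts with hypothetical keys
      'effect'    - effectiveness data of the behavior
      'deduction' - effectiveness deduction data of the behavior
      'flagged'   - True if the predetermined condition mark is set
      'period'    - identifier of the time period the behavior falls in
    """
    first = [r for r in records if r["flagged"]]  # first predetermined behavior data
    total_effect = sum(r["effect"] for r in records)
    total_deduction = sum(r["deduction"] for r in records)

    f1 = len(first) / len(records)                                          # share of flagged records
    f2 = sum(r["deduction"] / r["effect"] for r in records) / len(records)  # mean per-record ratio
    f3 = total_deduction / total_effect                                     # ratio of the sums
    f4 = total_deduction                                                    # total deduction
    f5 = len(first)                                                         # number of flagged records
    f6 = len({r["period"] for r in first}) / periods_since_joining          # share of active periods
    return [f1, f2, f3, f4, f5, f6]


records = [
    {"effect": 100.0, "deduction": 20.0, "flagged": True, "period": 1},
    {"effect": 50.0, "deduction": 0.0, "flagged": False, "period": 1},
    {"effect": 200.0, "deduction": 50.0, "flagged": True, "period": 3},
    {"effect": 50.0, "deduction": 10.0, "flagged": True, "period": 3},
]
vector = feature_vector(records, periods_since_joining=4)
```

Note that the second and third indexes differ: the second averages the per-record ratios, while the third divides the summed deductions by the summed effectiveness data, so large records weigh more in the third.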

Composing the feature vector from multiple feature vector indexes accurately characterizes a user's sensitivity to the predetermined condition, so that in the cluster calculation users are graded in a way that clearly reflects differences in their sensitivity to the predetermined condition. This makes it easier to build targeted applications on the graded users and to provide them with targeted services.

A flowchart of one embodiment of user grading in the data mining method of the present invention is shown in Fig. 2.

In step 201, high-density-region users are determined according to the single-user feature vector of each user. In one embodiment, taking a user's single-user feature vector point as the center, the radius of the region that contains the single-user feature vector points of a predetermined number of other users is determined; if the radius is smaller than a predetermined threshold, the user is considered a high-density-region user. In another embodiment, taking a user's single-user feature vector point as the center, the number of other users' single-user feature vector points within a region of predetermined radius is determined; if that number reaches a predetermined quantity, the user is considered a high-density-region user.
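The second variant of step 201 (counting neighbors inside a fixed radius) can be sketched as below. The function name `high_density_users` and the parameter names are assumptions for illustration, and the brute-force pairwise comparison is chosen for clarity rather than efficiency.

```python
import math


def high_density_users(vectors, radius, min_neighbors):
    """Return the ids of users with at least min_neighbors other users within radius.

    vectors: dict mapping user id -> single-user feature vector (list of floats).
    """
    dense = []
    for uid, v in vectors.items():
        # Count the other users whose feature vector point lies inside the region.
        neighbors = sum(
            1
            for other, w in vectors.items()
            if other != uid and math.dist(v, w) <= radius
        )
        if neighbors >= min_neighbors:
            dense.append(uid)
    return dense


vectors = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [0.0, 0.1], "d": [5.0, 5.0]}
dense = high_density_users(vectors, radius=0.5, min_neighbors=2)
```

Here user `d` has no neighbors within the radius and is therefore left out as an isolated point.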

In step 202, users are selected from the high-density-region users as initial cluster centers, the number of initial cluster centers being equal to the predetermined number of grades. For example, if the users in the target user set are to be divided into five grades by clustering, five initial cluster centers need to be chosen in the high-density region.

In step 203, the graded user sets are determined with the K-means algorithm starting from the initial cluster centers.

In general, high-density data regions are separated by low-density data regions, and data points located in low-density regions are usually called isolated points. Most existing clustering algorithms choose the initial cluster centers at random, which ignores the distribution of the data. Because the choice of initial cluster centers in the K-means algorithm affects the result, choosing them at random can strongly affect the final clustering. The method of this embodiment ensures that the initial cluster centers are high-density-region users and avoids the inaccurate user grading that results when isolated users are used as initial cluster centers.

In one embodiment, the computation may be based on the users' single-user feature vectors. The data point with the largest density parameter among the high-density-region users is selected as the first initial cluster center and removed from the high-density-region users. The user farthest from the first initial cluster center is then selected from the high-density-region users as the second initial cluster center and removed from the high-density-region users. Next, the user farthest from the set consisting of the first and second initial cluster centers is selected from the high-density-region users as the third initial cluster center and removed from the high-density-region users. This continues until all initial cluster centers have been determined.
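The selection procedure can be sketched as a farthest-first traversal over the high-density-region users. The text does not define the density parameter, so the sketch simply takes it as a precomputed number per user in which a larger value means a denser neighborhood; that input, like the function name `initial_centers`, is an assumption. The distance from a user to the set of already chosen centers is taken as the distance to the nearest chosen center.

```python
import math


def initial_centers(candidates, density, k):
    """Pick k initial cluster centers from the high-density-region users.

    candidates: dict user id -> feature vector of a high-density-region user.
    density:    dict user id -> density parameter (assumed: larger means denser).
    """
    remaining = dict(candidates)
    # First center: the candidate with the largest density parameter.
    first = max(remaining, key=lambda u: density[u])
    centers = [first]
    del remaining[first]
    while len(centers) < k and remaining:
        # Next center: the candidate farthest from the set of chosen centers,
        # where distance to a set is the distance to its nearest member.
        nxt = max(
            remaining,
            key=lambda u: min(math.dist(remaining[u], candidates[c]) for c in centers),
        )
        centers.append(nxt)
        del remaining[nxt]
    return centers


candidates = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [10.0, 0.0], "d": [5.0, 8.0]}
density = {"a": 5, "b": 4, "c": 3, "d": 3}
centers = initial_centers(candidates, density, k=3)
```

In this example `b` is never chosen because it sits next to the first center `a`, which is the intended effect of spreading the initial centers apart.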

With this method, the users that are farthest from one another among the high-density-region users are selected as initial cluster centers. On the one hand, this prevents isolated users from being chosen as initial cluster centers and distorting the clustering result. On the other hand, because initial cluster centers that are farthest from one another are more representative than randomly chosen ones, the initial cluster centers obtained by this method are also more representative, which improves the clustering and yields more representative user grading results.

In one embodiment, the distance between two points may be calculated as the Euclidean distance, using the following formula:

dist(x, y) = sqrt((x₁ - y₁)² + (x₂ - y₂)² + … + (x_n - y_n)²)

where x and y denote two points; (x₁, x₂, …, x_n) is the feature vector of x, and x₁, x₂, …, x_n are the feature vector indexes of x; (y₁, y₂, …, y_n) is the feature vector of y, and y₁, y₂, …, y_n are the feature vector indexes of y; and n is the number of indexes in the feature vector.

The distance between a data point x and a data point set z is the distance between x and the nearest data point in the set, calculated as follows:

dist(x, z) = min(dist(x, y)), y ∈ z

where y is any point in z.

The distance between two data point sets x and y is the distance between the nearest pair of data points taken one from each set, calculated as follows:

dist(x, y) = min(dist(u, v)), u ∈ x, v ∈ y

where u is any point in x and v is any point in y.
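The three distance definitions (point to point, point to set, set to set) translate directly into code. The function names below are illustrative choices, and points are represented as plain lists of feature vector indexes.

```python
import math


def dist_points(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def dist_point_set(x, z):
    """Distance from point x to set z: distance to the nearest point of z."""
    return min(dist_points(x, y) for y in z)


def dist_sets(xs, ys):
    """Distance between two sets: distance between their closest pair of points."""
    return min(dist_points(u, v) for u in xs for v in ys)


d_points = dist_points([0, 0], [3, 4])
d_point_set = dist_point_set([0, 0], [[3, 4], [1, 0]])
d_sets = dist_sets([[0, 0], [0, 1]], [[3, 4], [0, 3]])
```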

With this method, the density parameter of each data point can be calculated, and the initial cluster centers are then determined from the distances between data points, the distances between data points and sets, and the distances between sets.

In the K-means algorithm, the Euclidean distance from each data point to each of the k initial cluster centers is calculated, and each data point is assigned to the same cluster as its nearest initial cluster center. It is then judged whether the stopping condition, namely that the cluster centers no longer change, has been reached. If the stopping condition is met, the algorithm exits; otherwise the cluster center of each cluster is updated by taking the mean of all points in the cluster as the new cluster center, and the above calculation is repeated until the cluster centers no longer change. In this way the clustering operation is completed and the graded user sets are obtained.
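The K-means loop described above can be sketched in plain Python as follows. The function name `k_means`, the `max_iter` safety cap and the rule of keeping the old center when a cluster becomes empty are assumptions added for the example; the text itself specifies only the assign, test and update cycle.

```python
import math


def k_means(points, centers, max_iter=100):
    """Cluster points starting from the given initial centers.

    Returns (final_centers, clusters); clusters[j] lists the points assigned
    to final_centers[j]. max_iter is an assumed safety cap on the iterations.
    """
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each new center is the mean of the points in its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        # Stopping condition: the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [[0, 0], [0, 2], [10, 0], [10, 2]]
final_centers, clusters = k_means(points, centers=[[0, 0], [10, 0]])
```

Each resulting cluster corresponds to one graded user set, so the number of initial centers passed in fixes the number of grades.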

In one embodiment, the feature indexes of different users often contain extremely large and extremely small values that deviate far from the normal level; these extreme values are usually called outliers. To keep the outliers from affecting the subsequent clustering, they may be identified before clustering. In one embodiment, users whose sum of effectiveness deduction data exceeds a predetermined quantile are treated as abnormal users and are removed from the target user set used for the clustering computation. After the target user set has been graded with a clustering algorithm according to the single-user feature vectors of the users remaining after the abnormal users are excluded, and the graded user sets have been determined, a similar graded user set may be selected for each abnormal user and the abnormal user merged into it. For example, users whose sum of effectiveness deduction data exceeds the predetermined quantile are merged into the graded user set that is extremely sensitive to the predetermined condition, and users whose effectiveness deduction data are 0 are merged into the graded user set that is extremely insensitive to the predetermined condition. In one embodiment, the second feature vector index a mentioned above may be used to classify abnormal users, as shown in Table 1, where a_i is the second feature vector index of user i:

Value of a_i	Sensitivity to the predetermined condition
a_i ≥ mean of a + standard deviation of a	Extremely sensitive
mean of a ≤ a_i < mean of a + standard deviation of a	Highly sensitive
mean of a - standard deviation of a ≤ a_i < mean of a	Moderately sensitive
a_i < mean of a - standard deviation of a	Slightly sensitive
a_i = 0	Insensitive

Table 1: Classification of abnormal users
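The exclusion of abnormal users and the classification in Table 1 can be sketched as follows. Several details are assumptions made for the example: the function names, the nearest-rank convention used to turn the predetermined quantile into a cutoff value, and the choice to test a_i = 0 first so that it takes precedence over the "slightly sensitive" row.

```python
import math


def split_abnormal(deduction_sums, quantile):
    """Split users into (normal, abnormal) by their total effectiveness deduction.

    deduction_sums: dict user id -> sum of effectiveness deduction data.
    The nearest-rank quantile is an assumed convention.
    """
    ordered = sorted(deduction_sums.values())
    rank = max(1, math.ceil(quantile * len(ordered)))
    cutoff = ordered[rank - 1]
    abnormal = {u for u, s in deduction_sums.items() if s > cutoff}
    normal = set(deduction_sums) - abnormal
    return normal, abnormal


def sensitivity(a_i, mean_a, std_a):
    """Map the second feature vector index of a user to a row of Table 1."""
    if a_i == 0:  # checked first (assumption) because it overlaps the last range
        return "insensitive"
    if a_i >= mean_a + std_a:
        return "extremely sensitive"
    if a_i >= mean_a:
        return "highly sensitive"
    if a_i >= mean_a - std_a:
        return "moderately sensitive"
    return "slightly sensitive"


sums = {"u1": 1, "u2": 2, "u3": 3, "u4": 4, "u5": 5, "u6": 6, "u7": 50, "u8": 80}
normal, abnormal = split_abnormal(sums, quantile=0.75)
labels = [sensitivity(a, mean_a=0.2, std_a=0.1) for a in (0.35, 0.25, 0.15, 0.05, 0.0)]
```

The label returned for an abnormal user indicates which graded user set that user would be merged into after the remaining users have been clustered.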

With this method, on the one hand the influence of abnormal users on the cluster calculation is excluded; on the other hand, abnormal users are still taken into account rather than simply discarded, which widens the coverage of the user grading results and avoids leaving some users out of the analysis.

In one embodiment, the feature vector index data need to be standardized before the clustering algorithm is run, in order to eliminate the effect of differing dimensions (units of measurement) on the clustering result. For example, some feature vector indexes are percentages, some are quantities and some are effectiveness values, and these indexes cannot be compared directly; they therefore need to be converted into comparable, standardized feature vector index data from which the effect of dimension has been removed. In one embodiment, the data may be standardized by standard-deviation standardization, which means subtracting from each feature vector index value the mean of that feature vector index and then dividing by its standard deviation. The mean measures the central tendency of the data distribution and is calculated as:

Mean = (X₁ + X₂ + … + X_n) / n

The standard deviation measures the dispersion of the data and is calculated as:

Standard deviation

The standard-deviation standardization formula is:

X_scale_i = (X_i - mean) / standard deviation

which gives the standardized feature vector index data, where X₁ … X_i … X_n are the feature vector index data, i is a natural number from 1 to n, n is the number of users in the target user set taking part in the clustering, and X_scale_i is the feature vector index datum obtained by standardizing X_i.
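Standard-deviation standardization of one feature vector index across all users can be sketched as below. The text does not state whether the standard deviation divides by n or by n - 1; the sketch assumes the population form (division by n) and also assumes the values are not all identical, since the standard deviation would then be zero.

```python
import math


def standardize(values):
    """Standardize one feature vector index across the users being clustered.

    values: the index value of every user, X_1 ... X_n.
    Assumption: population standard deviation (divide by n).
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    # Subtract the mean, then divide by the standard deviation.
    return [(x - mean) / std for x in values]


scaled = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

Applying the same function to each index in turn puts percentages, counts and effectiveness values on a common scale before distances are computed.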

With this method, the feature vector index data are standardized before the cluster calculation is carried out, which eliminates the effect of differing dimensions on the clustering and improves the accuracy and reliability of user grading.

A flowchart of another embodiment of the data mining method of the present invention is shown in Fig. 3.

In step 301, the predefined action data of user are obtained, predefined action data include the effectiveness data of predefined action With the generation time of predefined action.Same user can have a plurality of predefined action data, including the generation of the predefined action data Time and effectiveness data.In one embodiment, it is possible to obtain the predefined action data of multi-user.

In step 302, according to the quantity for generating time and predefined action data of the predefined action data of each user User is classified, determines that targeted customer gathers.In one embodiment, can be according to the generation time of predefined action data Classified, it is also possible to which the generation quantity according to predefined action data is classified, or both are more careful with reference to carrying out Classification.One or more classification can as required be selected respectively as targeted customer's set.

In step 303, according to predefined action data genaration targeted customer set in each user single user feature to Amount.In one embodiment, can be according to the quantity of predefined action data, the effectiveness data of predefined action data, generation time Residing time interval etc. determines single user characteristic vector.

In step 304, users whose sum of effectiveness deduction data exceeds a predetermined quantile are treated as abnormal users, and the abnormal users are removed from the target user set used for the clustering operation.

In step 305, the feature-vector index data are standardized to eliminate the influence of differing dimensions on the clustering result.

In step 306, according to the standardized single-user feature vectors, the target user set from which the abnormal users have been removed is graded based on a clustering algorithm, and graded user sets are determined, where the number of graded user sets equals the predetermined number of grades. In one embodiment, as many initial cluster centers as the predetermined number of grades may be selected, and the clustering operation performed using the K-means algorithm. In one embodiment, a similar graded user set may also be selected for each abnormal user, and the abnormal user merged into that graded user set.
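By way of illustration only, a K-means pass that starts from given initial cluster centers can be sketched as follows. This is the textbook algorithm with Euclidean distance, not necessarily the exact variant used in the embodiments; the function name `kmeans` and the `max_iter` parameter are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Cluster points starting from the supplied initial centers.

    Returns (labels, centers): labels[i] is the index of the cluster
    to which points[i] is assigned.
    """
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers

labels, final_centers = kmeans(
    [(0, 0), (0, 1), (10, 10), (10, 11)],
    centers=[(0, 0), (10, 10)],
)
```

The number of supplied centers fixes the number of graded user sets, which is how the predetermined number of grades is enforced.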

With this method, users are first classified and user clustering is carried out within one class, which excludes the interference of user data from other classes with the clustering effect, makes the division of user groups more accurate, and facilitates precise, personalized services based on the result of that division; the initial cluster centers are guaranteed to be high-density-region users, which avoids inaccurate user grading caused by isolated points serving as initial cluster centers; while the influence of abnormal users on the cluster computation is excluded, the abnormal users are still taken into account, which preserves the coverage of the user grading result; and the influence of differing dimensions on the clustering effect is eliminated, which improves the accuracy and reliability of user grading.

In one embodiment, the sensitivity of the different graded user sets to a predetermined condition may be determined from the final cluster centers of the graded user sets. In one embodiment, the cluster center of each graded user set may be summed over all feature-vector index dimensions and the sums sorted by size: the cluster center with the largest sum corresponds to extreme sensitivity to the predetermined condition, and so on, down to the cluster center with the smallest sum, which corresponds to insensitivity to the predetermined condition. In this way, the graded user sets are given practical meaning, so that the different graded user sets can be understood intuitively and targeted applications and services can be provided for them.
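By way of illustration only, the ranking of final cluster centers can be sketched as follows. The function name `label_grades` and the grade names are hypothetical, and the sketch assumes that every feature-vector index increases with sensitivity and that the centers are expressed in standardized units.

```python
def label_grades(centers, grade_names):
    """Map each cluster index to a grade name.

    Clusters are ranked by the sum of their center's coordinates,
    largest first; grade_names must be ordered from most to least sensitive.
    """
    order = sorted(range(len(centers)), key=lambda k: sum(centers[k]), reverse=True)
    return {cluster: grade_names[rank] for rank, cluster in enumerate(order)}

grades = label_grades(
    [[0.1, 0.2], [1.5, 2.0], [0.5, 0.4]],
    ["extremely sensitive", "moderately sensitive", "insensitive"],
)
```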

In the field of e-commerce, clustering can be performed on the various behavioral characteristics of users to divide the purchasing user population into several classes, which helps market analysts and operations staff understand the characteristics of the customer base clearly and carry out precise, personalized marketing. Promotion sensitivity is an index that measures how sensitive a user is to promotional offers of various kinds. Some users pay close attention to heavily discounted goods and often buy them repeatedly, or make a purchase with a coupon as soon as the system issues one to them, which shows that such users are relatively sensitive to promotions; other users do not buy because goods are on promotion and take no interest in coupons, which shows that such users are not sensitive to promotional offers. Based on such behavioral characteristics, users can be divided into different groups, which facilitates precision marketing and personalized recommendation, so that users can be guided to purchase again and turnover can be increased.

In the prior art, all users in the system database are selected, two indices are computed (the discount-amount ratio and the discounted-order ratio), initial cluster centers are chosen at random, and users are divided into three classes: highly sensitive to promotions, slightly sensitive to promotions, and insensitive to promotions.

In one embodiment of the present invention, a selection may be made within the user population; for example, users with shopping behavior in the past 3 years are taken as the target group for promotion-sensitivity identification. On the one hand, this satisfies user coverage; on the other hand, identifying the promotion sensitivity of users who have not shopped in the past 3 years is of little value, since it is difficult to guide such users to purchase again through marketing, and doing so would waste marketing resources. The users with shopping behavior in the past 3 years are then subdivided. Using two indices, the time of the user's last purchase and the shopping frequency, these users can be divided into four major classes: users who purchased only once within the past year; users with repeat purchases within the past year; users whose last purchase occurred more than a year ago and who had purchased only once before then; and users whose last purchase occurred more than a year ago and who had repeat purchases before then. Each of the four major classes is then subdivided, according to the actual application scenario, into 5 grades: extremely sensitive, highly sensitive, moderately sensitive, slightly sensitive, and insensitive. In one embodiment, the users of one major class may be chosen for subdivision, or the users of each major class may be subdivided separately. The purpose of such fine-grained division is to let the business application side operate more precisely, finely and individually, so as to meet marketing needs as fully as possible.
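By way of illustration only, the four major classes can be sketched as follows. The function name `classify`, the use of 365-day years, and the reading that class 1 counts purchases within the past year while classes 3 and 4 count purchases within the 3-year window are all assumptions; the text does not fix these details.

```python
from datetime import date, timedelta

def classify(purchase_dates, today):
    """Return the major class (1-4), or None if the user has no purchase
    in the past three years and is therefore outside the target group."""
    one_year_ago = today - timedelta(days=365)
    three_years_ago = today - timedelta(days=3 * 365)
    recent = [d for d in purchase_dates if d >= three_years_ago]
    if not recent:
        return None
    if max(recent) >= one_year_ago:
        in_year = [d for d in recent if d >= one_year_ago]
        return 1 if len(in_year) == 1 else 2  # single vs. repeat purchase in the past year
    return 3 if len(recent) == 1 else 4       # last purchase more than a year ago

today = date(2017, 1, 1)
```

Each class returned here would then be treated as its own target user set and graded separately.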

In one embodiment, a richer set of feature-vector indices may be used to distinguish the promotion-sensitivity types of users, as shown in Table 2.

Table 2: Feature-vector indices selected for user promotion-sensitivity type

Consider, for example, the following case: some users have made only one purchase, and the discount on that order amounts to 80% of the original price, but the original price is only 10 yuan; other users have made many purchases, every one of them a discounted order, with a total discount amounting to 50% of the original price, but the original price is as high as 100,000 yuan. In such a case, judging a user's promotion-sensitivity type solely by the discounted-order ratio and the discount-amount ratio is inaccurate. The method of the embodiments of the present invention can use richer indices to measure a user's promotion sensitivity, which is more reasonable and more accurate.

In one embodiment, outliers may be selected according to the total discount amount. For example, if analysis of the data distribution of each feature shows that the total discount amount contains some extremely large values, users whose discount amount exceeds the 0.995 quantile of the discount amount may be classified as abnormal users. These users do not take part in clustering, but after clustering ends they can be assigned, according to their average per-order discount amount ratio, to one of the graded user sets, as shown in Table 3:

Average per-order discount amount a_i of user i	Promotion sensitivity
a_i ≥ mean(a) + std(a)	Extremely sensitive
mean(a) ≤ a_i < mean(a) + std(a)	Highly sensitive
mean(a) − std(a) ≤ a_i < mean(a)	Moderately sensitive
a_i < mean(a) − std(a)	Slightly sensitive
a_i = 0	Insensitive

Table 3: Rules for assigning abnormal users to promotion-sensitivity grades in clustering

Here, a denotes the average per-order discount amount of a single user.
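By way of illustration only, the outlier split and the assignment rules of Table 3 can be sketched as follows. The function names are hypothetical, the nearest-rank quantile is one of several reasonable quantile definitions, and checking a_i = 0 before the other thresholds is an assumption made so that the rows of Table 3 do not overlap.

```python
import math

def split_abnormal(total_discounts, q=0.995):
    """Split user ids into (normal, abnormal) by the q-quantile of total discount."""
    ordered = sorted(total_discounts.values())
    # Nearest-rank quantile (an assumption; other definitions exist).
    cutoff = ordered[max(0, math.ceil(q * len(ordered)) - 1)]
    normal = [u for u, v in total_discounts.items() if v <= cutoff]
    abnormal = [u for u, v in total_discounts.items() if v > cutoff]
    return normal, abnormal

def grade_abnormal(a_i, mean_a, std_a):
    """Assign an abnormal user to a grade from the user's average
    per-order discount amount a_i, following Table 3."""
    if a_i == 0:
        return "insensitive"
    if a_i >= mean_a + std_a:
        return "extremely sensitive"
    if a_i >= mean_a:
        return "highly sensitive"
    if a_i >= mean_a - std_a:
        return "moderately sensitive"
    return "slightly sensitive"

normal, abnormal = split_abnormal({"a": 1.0, "b": 2.0, "c": 3.0, "d": 100.0}, q=0.75)
```

Users in `abnormal` are left out of the cluster computation and graded afterwards with `grade_abnormal`, so they remain covered by the final result.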

The original implementation does not handle outliers, and outliers can greatly affect the clustering effect, which leads to poor clustering results. With the method of the embodiments of the present invention, outliers can be identified in light of the specific business application scenario; once identified, they are not simply discarded but are also assigned a promotion-sensitivity type, which improves the user coverage of the model.

A schematic diagram of one embodiment of the data mining device of the present invention is shown in Figure 4. The data acquisition module 401 can obtain predetermined behavior data of users; the predetermined behavior data include effectiveness data of the predetermined behavior and the generation time of the predetermined behavior. One user may have multiple items of predetermined behavior data, each including its generation time and effectiveness data. In one embodiment, the predetermined behavior data of multiple users may be obtained. The user classification module 402 can classify users according to the generation time of each user's predetermined behavior data and the quantity of the predetermined behavior data, and determine a target user set. In one embodiment, classification may be based on the generation time of the predetermined behavior data, on the quantity of predetermined behavior data generated, or on a combination of the two for a finer classification. One or more classes may be selected as required, each serving as a target user set. The feature vector generation module 403 can generate a single-user feature vector for each user in the target user set according to the predetermined behavior data. In one embodiment, the single-user feature vector may be determined from the quantity of predetermined behavior data, the effectiveness data of the predetermined behavior data, the time interval in which the generation time falls, and the like. The user grading module 404 can grade the target user set based on a clustering algorithm according to the single-user feature vectors and determine graded user sets, where the number of graded user sets equals the predetermined number of grades. In one embodiment, initial cluster centers may be selected, the number of selected initial cluster centers being equal to the predetermined number of grades, and the clustering operation performed using the K-means algorithm.

Such a device can first classify users and carry out user clustering within one class, so that suitable target users can be selected for cluster analysis. On the one hand, this is more targeted and reduces the amount of data to be processed; on the other hand, it excludes the interference of user data from other classes with the clustering effect, makes the division of user groups more accurate, and facilitates precise, personalized services based on the result of that division.

Such a device can classify users according to the generation time of the predetermined behavior data and the quantity of the predetermined behavior data, and either select the user set of the required class as the target user set or perform the clustering operation separately on the user set of each class to grade the users of each class. Grading is thus carried out among users of the same class, which improves the accuracy of user grading.

In one embodiment, the user classification module 402 may exclude users who have generated no predetermined behavior data over a long period. Because such users have been inactive for a long time, analyzing their behavior and mining their data is of little value; excluding them reduces the amount of computation, reduces the impact on the grading effect, and lowers operating costs when the data are applied.

By forming the feature vector from multiple feature-vector indices, the characteristics of a user's sensitivity to the predetermined condition can be depicted accurately, so that differences in users' sensitivity to the predetermined condition show up clearly in the user grading produced by the cluster computation. This facilitates targeted applications based on the graded users and targeted services for users.

A schematic diagram of one embodiment of the user grading module in the data mining device of the present invention is shown in Figure 5. The high-density user determining unit 501 can determine high-density-region users according to the single-user feature vector of each user. In one embodiment, taking a user's single-user feature vector point as the center, the radius of the region containing a predetermined number of other users' single-user feature vector points is determined; if the radius is less than a predetermined threshold, the user is considered a high-density-region user. In one embodiment, taking a user's single-user feature vector point as the center, the number of other users' single-user feature vector points within a region of predetermined radius is determined; if that number reaches a predetermined quantity, the user is considered a high-density-region user. The initial center determining unit 502 can select users to serve as initial cluster centers from among the high-density-region users, the number of initial cluster centers being equal to the predetermined number of grades. For example, if the users in the target user set are to be divided into five grades by clustering, 5 initial cluster centers need to be chosen in the high-density region. The clustering unit 503 can determine the graded user sets based on the K-means algorithm according to the initial cluster centers.

Such a device ensures that the initial cluster centers are high-density-region users, which avoids inaccurate user grading caused by isolated users serving as initial cluster centers.

In one embodiment, the initial center determining unit 502 can operate on the single-user feature vectors of the users as follows: select the data point with the largest density parameter among the high-density-region users as the first initial cluster center, and delete the first initial cluster center from the high-density-region users; select, from the high-density-region users, the user farthest from the first initial cluster center as the second initial cluster center, and delete the second initial cluster center from the high-density-region users; select, from the high-density-region users, the user farthest from the set consisting of the first initial cluster center and the second initial cluster center as the third initial cluster center, and delete the third initial cluster center from the high-density-region users; and so on, until all initial cluster centers have been determined.

Such a device can select users that are farthest from one another among the high-density-region users as initial cluster centers. On the one hand, this avoids the effect on the clustering result of choosing an isolated user as an initial cluster center; on the other hand, because initial cluster centers that are farthest from one another are more representative than randomly chosen ones, the initial cluster centers obtained in this way are also more representative, which optimizes the clustering effect and yields more representative user grading results.
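By way of illustration only, the selection of initial cluster centers from high-density-region users can be sketched as follows. The function names are hypothetical. Two assumptions are made: the density parameter is taken to be the number of other points within a fixed radius, and the distance from a point to the set of already chosen centers is taken to be its distance to the nearest chosen center.

```python
import math

def high_density_points(points, radius, min_neighbors):
    """Return (point, neighbor_count) for every point that has at least
    min_neighbors other points within the given radius."""
    dense = []
    for i, p in enumerate(points):
        count = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if count >= min_neighbors:
            dense.append((p, count))
    return dense

def initial_centers(points, k, radius, min_neighbors):
    """Pick k mutually distant initial centers among high-density points."""
    dense = high_density_points(points, radius, min_neighbors)
    if len(dense) < k:
        raise ValueError("not enough high-density points for k centers")
    # First center: the point with the largest density parameter.
    first = max(dense, key=lambda pc: pc[1])[0]
    centers = [first]
    candidates = [p for p, _ in dense if p is not first]
    while len(centers) < k:
        # Next center: the candidate farthest from the chosen set,
        # measured as distance to the nearest chosen center (assumption).
        nxt = max(candidates, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
        candidates.remove(nxt)
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
seeds = initial_centers(pts, k=2, radius=1.5, min_neighbors=2)
```

In this example the isolated point (50, 50) has no neighbors within the radius, so it can never be chosen as an initial cluster center even though it is the farthest point from everything else.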

In one embodiment, the feature indices of different users often contain extremely large and extremely small values that deviate far from the normal level; such extreme values are generally called outliers. To keep these outliers from affecting the subsequent clustering effect, they can be identified before clustering. In one embodiment, users whose sum of effectiveness deduction data exceeds a predetermined quantile may be treated as abnormal users, and the abnormal users removed from the target user set used for the clustering operation. After the target user set has been graded based on a clustering algorithm according to the single-user feature vectors of the users remaining in the target user set once the abnormal users are excluded, and the graded user sets have been determined, a similar graded user set may be selected for each abnormal user and the abnormal user merged into that graded user set. For example, users whose sum of effectiveness deduction data exceeds the predetermined quantile are merged into the graded user set that is extremely sensitive to the predetermined condition, and users whose effectiveness deduction data are 0 are merged into the graded user set that is extremely insensitive to the predetermined condition. In one embodiment, the graded user set to which an abnormal user belongs may be determined from the relationship between the user's value of the above-mentioned second feature-vector index and the mean and standard deviation of the second feature-vector index in the target user set.

On the one hand, such a device can exclude the influence of abnormal users on the cluster computation; on the other hand, it still takes the abnormal users into account rather than simply discarding them, which improves the coverage of the user grading result and avoids leaving some users unanalyzed.

In one embodiment, the feature-vector index data need to be standardized before the clustering algorithm is run, in order to eliminate the influence of differing dimensions on the clustering result. For example, some feature-vector indices are percentages, some are quantities and some are effectiveness values; these indices cannot be compared directly, so they need to be converted into comparable, standardized feature-vector index data from which the influence of dimension has been removed. In one embodiment, a standardization module may be included for standardizing the data. In one embodiment, the standardization module may standardize the data using the standard-deviation standardization method. Standard-deviation standardization means subtracting from the feature-vector index data the mean of that feature-vector index data and then dividing by its standard deviation. The mean measures the central tendency of the data distribution; the calculation formulas are:

Average

Standard deviation

According to the standard-deviation standardization formula:

The standardized feature-vector index data are obtained, where X₁ … X_i … X_n are the feature-vector index data, i is a natural number, n is the number of users in the target user set participating in clustering, and X_scale_i is the feature-vector index data obtained by standardizing X_i.
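For reference, the mean, the standard deviation and the standardization described above can be written as follows. The symbols \bar{X} and S are notation assumed here, and the standard deviation is shown in its population form (dividing by n); the text does not state whether n or n − 1 is used.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^{2}}, \qquad
X_{\mathrm{scale}\,i} = \frac{X_i - \bar{X}}{S}
```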

Such a device carries out the cluster computation only after the feature-vector index data have been standardized, which eliminates the influence of differing dimensions on the clustering effect and improves the accuracy and reliability of user grading.

A schematic diagram of another embodiment of the data mining device of the present invention is shown in Figure 6. The structure and function of the data acquisition module 601, the user classification module 602 and the feature vector generation module 603 are similar to those in the embodiment of Figure 4. The data mining device further includes an abnormal user exclusion module 605 and a standardization module 606. The abnormal user exclusion module 605 can treat users whose sum of effectiveness deduction data exceeds a predetermined quantile as abnormal users and remove the abnormal users from the target user set used for the clustering operation. The standardization module 606 can standardize the feature-vector index data to eliminate the influence of differing dimensions on the clustering result. The user grading module 604 can, according to the standardized single-user feature vectors, grade the target user set from which the abnormal users have been removed based on a clustering algorithm and determine the graded user sets; it can also select a similar graded user set for each abnormal user and merge the abnormal user into that graded user set.

Such a device can first classify users and carry out user clustering within one class, which excludes the interference of user data from other classes with the clustering effect, makes the division of user groups more accurate, and facilitates precise, personalized services based on the result of that division; the initial cluster centers are guaranteed to be high-density-region users, which avoids inaccurate user grading caused by isolated points serving as initial cluster centers; while the influence of abnormal users on the cluster computation is excluded, the abnormal users are still taken into account, which preserves the coverage of the user grading result; and the influence of differing dimensions on the clustering effect is eliminated, which improves the accuracy and reliability of user grading.

In one embodiment, the user grading module 604 can determine the sensitivity of the different graded user sets to a predetermined condition from the final cluster centers of the graded user sets. In one embodiment, the cluster center of each graded user set may be summed over all feature-vector index dimensions and the sums sorted by size: the cluster center with the largest sum corresponds to extreme sensitivity to the predetermined condition, and so on, down to the cluster center with the smallest sum, which corresponds to insensitivity to the predetermined condition. Such a device gives the graded user sets practical meaning, so that the different graded user sets can be understood intuitively and targeted applications and services can be provided for them.

In one embodiment, for use in various application scenarios, the graded user set data may be processed into normalized data tables and stored in a file system, where they can be called directly by a database system or pushed to business applications through an application programming interface, which facilitates targeted applications based on user behavior characteristics.

A schematic diagram of one embodiment of the data mining system of the present invention is shown in Figure 7. The data mining system includes a memory 701 and a processor 702, where:

The memory 701 may be a magnetic disk, flash memory or any other non-volatile storage medium. The memory stores the instructions for system operation.

The processor 702 is coupled to the memory 701 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 702 is configured to execute the instructions stored in the memory, thereby obtaining the graded user sets efficiently and accurately.

A schematic diagram of another embodiment of the data mining system of the present invention is shown in Figure 8.

The data mining device 800 includes a memory 810 and a processor 820. The processor 820 may include processors 820a, 820b … 820n. The processors 820a-820n are coupled to the memory 810 through a bus 830. A data mining system based on such a distributed arrangement can compute rapidly, which improves the operating efficiency of data mining. The data mining system 800 may also be connected to an external storage device 850 through a storage interface 840 in order to call external data, and may also be connected to a network or to another computer system (not shown) through a network interface 860. These are not described in detail here.

In this embodiment, data and instructions are stored in the memory and the instructions are processed by the processor, which achieves efficient and accurate user grading and makes it easier to provide corresponding services according to user behavior characteristics.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the specific embodiments of the present invention may still be modified, or some of the technical features replaced by equivalents; without departing from the spirit of the technical solutions of the present invention, all such modifications and replacements shall be covered by the scope of the technical solutions claimed by the present invention.

Claims

1. A data mining method, characterized by comprising:

obtaining predetermined behavior data of users, the predetermined behavior data including effectiveness data of the predetermined behavior and the generation time of the predetermined behavior;

classifying the users according to the generation time of the predetermined behavior data of each user and the quantity of the predetermined behavior data, and determining a target user set;

generating a single-user feature vector for each user in the target user set according to the predetermined behavior data;

grading the target user set based on a clustering algorithm according to the single-user feature vectors, and determining graded user sets.

2. The method according to claim 1, characterized in that:

the predetermined behavior data further include a predetermined condition identifier and effectiveness deduction data, and first predetermined behavior data are identified according to the predetermined condition identifier;

the single-user feature vector includes a first feature-vector index, a second feature-vector index, a third feature-vector index, a fourth feature-vector index, a fifth feature-vector index and/or a sixth feature-vector index;

generating the single-user feature vector for each user in the target user set according to the predetermined behavior data comprises:

determining the first feature-vector index of the user according to the ratio of the quantity of the first predetermined behavior data of the user to the quantity of the predetermined behavior data;

determining the ratio of the effectiveness deduction data to the effectiveness data for each item of predetermined behavior data of the user, and averaging the ratios to determine the second feature-vector index of the user;

determining the third feature-vector index of the user according to the ratio of the sum of the effectiveness deduction data of the user to the sum of the effectiveness data;

determining the fourth feature-vector index of the user according to the sum of the effectiveness deduction data of the user;

determining the fifth feature-vector index of the user according to the quantity of the first predetermined behavior data of the user; and/or

determining the sixth feature-vector index of the user according to the ratio of the number of time periods in which the user has first predetermined behavior data to the number of time periods elapsed since the user joined the network.

3. The method according to claim 1, characterized in that grading the target user set based on a clustering algorithm according to the single-user feature vectors and determining graded user sets comprises:

determining high-density-region users according to the single-user feature vector of each user;

selecting users to serve as initial cluster centers from among the high-density-region users, the number of initial cluster centers being equal to the predetermined number of grades;

determining the graded user sets based on the K-means algorithm according to the initial cluster centers.

4. The method according to claim 3, characterized in that selecting initial cluster centers from among the high-density-region users comprises:

selecting, according to the single-user feature vectors, the user with the largest density parameter among the high-density-region users as a first initial cluster center;

selecting, from the high-density-region users, the user farthest from the first initial cluster center as a second initial cluster center;

selecting, from the high-density-region users, the user farthest from the set consisting of the first initial cluster center and the second initial cluster center as a third initial cluster center;

and so on, until all initial cluster centers have been determined.

5. The method according to claim 2, characterized by further comprising:

excluding abnormal users from the target user set, the abnormal users including users whose sum of effectiveness deduction data exceeds a predetermined quantile;

wherein grading the target user set based on a clustering algorithm according to the single-user feature vectors and determining graded user sets comprises:

grading the target user set based on a clustering algorithm according to the single-user feature vectors of the users in the target user set after the abnormal users are excluded, and determining graded user sets;

selecting a graded user set for each abnormal user based on a predetermined policy, and merging the abnormal user into that graded user set.

6. The method according to claim 1, characterized by further comprising: standardizing the feature-vector indices in the single-user feature vectors;

grading the target user set based on a clustering algorithm according to the standardized single-user feature vectors, and determining graded user sets.

7. A data mining device, characterized by comprising:

a data acquisition module, configured to obtain predetermined behavior data of users, the predetermined behavior data including effectiveness data of the predetermined behavior and the generation time of the predetermined behavior;

a user classification module, configured to classify the users according to the generation time of the predetermined behavior data of each user and the quantity of the predetermined behavior data, and to determine a target user set;

a feature vector generation module, configured to generate a single-user feature vector for each user in the target user set according to the predetermined behavior data;

a user grading module, configured to grade the target user set based on a clustering algorithm according to the single-user feature vectors, and to determine graded user sets.

8. The device according to claim 7, characterized in that

9. The device according to claim 7, characterized in that the user grading module comprises:

a high-density user determining unit, configured to determine high-density-region users according to the single-user feature vector of each user;

an initial center determining unit, configured to select users to serve as initial cluster centers from among the high-density-region users, the number of initial cluster centers being equal to the predetermined number of grades;

a clustering unit, configured to determine the graded user sets based on the K-means algorithm according to the initial cluster centers.

10. The device according to claim 9, characterized in that the initial center determining unit is configured to:

select the user with the largest density parameter among the high-density-region users as a first initial cluster center;

and so on, until all initial cluster centers have been determined.

11. The device according to claim 8, characterized by further comprising:

an abnormal user exclusion module, configured to exclude abnormal users from the target user set, the abnormal users including users whose sum of effectiveness deduction data exceeds a predetermined quantile;

the user grading module being configured to:

12. The device according to claim 7, characterized by further comprising:

a standardization module, configured to standardize the feature-vector indices in the single-user feature vectors;

the user grading module being configured to grade the target user set based on a clustering algorithm according to the standardized single-user feature vectors, and to determine graded user sets.

13. A data mining system, characterized by comprising:

a memory; and

a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the method according to any one of claims 1 to 6.