CN106649517A - Data mining method, device and system - Google Patents
- Publication number
- CN106649517A, CN201610901862.0A
- Authority
- CN
- China
- Prior art keywords
- user
- data
- predefined action
- characteristic vector
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a data mining method, device and system, and relates to the field of big data. The data mining method comprises: obtaining predetermined behavior data of users; classifying the users according to the generation time and the quantity of each user's predetermined behavior data to determine a target user set; generating a single-user feature vector for each user in the target user set according to the predetermined behavior data; and grading the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets. With this method, users are first classified and clustering is then carried out within one class, so that suitable target users can be selected for cluster analysis. On the one hand, this makes the analysis more targeted and reduces the amount of data to be processed; on the other hand, it removes the interference of different classes of user data with the clustering result, so that the division of user groups is more accurate.
Description
Technical field
The present invention relates to the field of big data, and in particular to a data mining method, device and system.
Background
In big data applications, a user population is often divided into several classes according to the users' behavior features, so that precise, personalized services can be provided for each user group. Clustering is one way of dividing a user population. It is the process of grouping data objects into classes so that objects in the same class are highly similar while objects in different classes are highly dissimilar. Dissimilarity is usually measured by distance. Cluster analysis is widely used in many fields, such as market research, data analysis and pattern recognition.
However, the effect of dividing a user population by behavior features in a clustering operation depends to a large extent on the quality of the underlying data. Existing user group division based on clustering algorithms often fails to reflect users' behavior features well and suffers from inaccurate clustering, which makes it difficult to use the clustering result to provide precise, personalized services for user groups.
Summary of the invention
An object of the present invention is to improve the accuracy of user group division.
According to one aspect of the present invention, a data mining method is proposed, comprising: obtaining predetermined behavior data of users, the predetermined behavior data including effectiveness data of a predetermined behavior and the generation time of the predetermined behavior; classifying the users according to the generation time and the quantity of each user's predetermined behavior data to determine a target user set; generating a single-user feature vector for each user in the target user set according to the predetermined behavior data; and grading the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets.
Optionally, the predetermined behavior data further include a predetermined-condition flag and effectiveness deduction data, and first predetermined behavior data are identified according to the predetermined-condition flag. The single-user feature vector includes a first, second, third, fourth, fifth and/or sixth feature vector index. Generating the single-user feature vector of each user in the target user set according to the predetermined behavior data includes: determining the user's first feature vector index from the ratio of the quantity of the user's first predetermined behavior data to the quantity of the user's predetermined behavior data; determining, for each item of the user's predetermined behavior data, the ratio of the effectiveness deduction data to the effectiveness data, and averaging these ratios to determine the user's second feature vector index; determining the user's third feature vector index from the ratio of the sum of the user's effectiveness deduction data to the sum of the user's effectiveness data; determining the user's fourth feature vector index from the sum of the user's effectiveness deduction data; determining the user's fifth feature vector index from the quantity of the user's first predetermined behavior data; and/or determining the user's sixth feature vector index from the ratio of the number of time periods in which the user has first predetermined behavior data to the number of time periods that have elapsed since the user joined the network.
Optionally, grading the target user set with a clustering algorithm according to the single-user feature vectors to determine the graded user sets includes: determining high-density-region users according to the single-user feature vector of each user; selecting, from the high-density-region users, users to serve as initial cluster centers, the number of initial cluster centers being equal to a predetermined number of grades; and determining the graded user sets with the K-means algorithm, starting from the initial cluster centers.
Optionally, selecting the initial cluster centers from the high-density-region users includes: selecting, according to the single-user feature vectors, the high-density-region user with the largest density parameter as the first initial cluster center; selecting the high-density-region user farthest from the first initial cluster center as the second initial cluster center; selecting the high-density-region user farthest from the set consisting of the first and second initial cluster centers as the third initial cluster center; and so on until all initial cluster centers have been determined.
Optionally, the method further includes excluding abnormal users from the target user set, the abnormal users including users whose sum of effectiveness deduction data is greater than a predetermined quantile. Grading the target user set with a clustering algorithm according to the single-user feature vectors to determine the graded user sets then includes: grading the target user set with a clustering algorithm according to the single-user feature vectors of the users remaining in the target user set after the abnormal users have been excluded, to determine the graded user sets; and selecting a graded user set for each abnormal user based on a predetermined policy and merging the abnormal user into that graded user set.
Optionally, the method further includes standardizing the feature vector indexes in the single-user feature vectors. Grading the target user set with a clustering algorithm according to the single-user feature vectors to determine the graded user sets then includes: grading the target user set with a clustering algorithm according to the standardized single-user feature vectors to determine the graded user sets.
With this method, users are first classified and clustering is then carried out within one class, so that suitable target users can be selected for cluster analysis. On the one hand, this makes the analysis more targeted and reduces the amount of data to be processed; on the other hand, it excludes the interference of other classes of user data with the clustering result, so that the user group division is more accurate and precise, personalized services can readily be provided according to the result of the division.
According to another aspect of the present invention, a data mining device is proposed, comprising: a data acquisition module, configured to obtain predetermined behavior data of users, the predetermined behavior data including effectiveness data of a predetermined behavior and the generation time of the predetermined behavior; a user classification module, configured to classify the users according to the generation time and the quantity of each user's predetermined behavior data to determine a target user set; a feature vector generation module, configured to generate a single-user feature vector for each user in the target user set according to the predetermined behavior data; and a user grading module, configured to grade the target user set with a clustering algorithm according to the single-user feature vectors to determine graded user sets.
Optionally, the predetermined behavior data further include a predetermined-condition flag and effectiveness deduction data, and first predetermined behavior data are identified according to the predetermined-condition flag. The single-user feature vector includes a first, second, third, fourth, fifth and/or sixth feature vector index. Generating the single-user feature vector of each user in the target user set according to the predetermined behavior data includes: determining the user's first feature vector index from the ratio of the quantity of the user's first predetermined behavior data to the quantity of the user's predetermined behavior data; determining, for each item of the user's predetermined behavior data, the ratio of the effectiveness deduction data to the effectiveness data, and averaging these ratios to determine the user's second feature vector index; determining the user's third feature vector index from the ratio of the sum of the user's effectiveness deduction data to the sum of the user's effectiveness data; determining the user's fourth feature vector index from the sum of the user's effectiveness deduction data; determining the user's fifth feature vector index from the quantity of the user's first predetermined behavior data; and/or determining the user's sixth feature vector index from the ratio of the number of time periods in which the user has first predetermined behavior data to the number of time periods that have elapsed since the user joined the network.
Optionally, the user grading module includes: a high-density user determination unit, configured to determine high-density-region users according to the single-user feature vector of each user; an initial center determination unit, configured to select, from the high-density-region users, users to serve as initial cluster centers, the number of initial cluster centers being equal to the predetermined number of grades; and a clustering unit, configured to determine the graded user sets with the K-means algorithm, starting from the initial cluster centers.
Optionally, the initial center determination unit is configured to: select, according to the single-user feature vectors, the high-density-region user with the largest density parameter as the first initial cluster center; select the high-density-region user farthest from the first initial cluster center as the second initial cluster center; select the high-density-region user farthest from the set consisting of the first and second initial cluster centers as the third initial cluster center; and so on until all initial cluster centers have been determined.
Optionally, the device further includes an abnormal user exclusion module, configured to exclude abnormal users from the target user set, the abnormal users including users whose sum of effectiveness deduction data is greater than a predetermined quantile. The user grading module is configured to: grade the target user set with a clustering algorithm according to the single-user feature vectors of the users remaining in the target user set after the abnormal users have been excluded, to determine the graded user sets; and select a graded user set for each abnormal user based on a predetermined policy and merge the abnormal user into that graded user set.
Optionally, the device further includes a standardization module, configured to standardize the feature vector indexes in the single-user feature vectors. The user grading module is configured to grade the target user set with a clustering algorithm according to the standardized single-user feature vectors to determine the graded user sets.
Such a device first classifies users and then clusters users within one class, so that suitable target users can be selected for cluster analysis. On the one hand, this makes the analysis more targeted and reduces the amount of data to be processed; on the other hand, it excludes the interference of other classes of user data with the clustering result, so that the user group division is more accurate and precise, personalized services can readily be provided according to the result of the division.
According to a further aspect of the present invention, a data mining system is proposed, comprising a memory and a processor coupled to the memory, the processor being configured to perform any one of the methods described above based on instructions stored in the memory.
Such a system first classifies users and then clusters users within one class, so that suitable target users can be selected for cluster analysis. On the one hand, this makes the analysis more targeted and reduces the amount of data to be processed; on the other hand, it excludes the interference of other classes of user data with the clustering result, so that the user group division is more accurate and precise, personalized services can readily be provided according to the result of the division.
Description of the drawings
The drawings described herein are provided for a further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of one embodiment of the data mining method of the present invention.
Fig. 2 is a flowchart of one embodiment of user clustering in the data mining method of the present invention.
Fig. 3 is a flowchart of another embodiment of the data mining method of the present invention.
Fig. 4 is a schematic diagram of one embodiment of the data mining device of the present invention.
Fig. 5 is a schematic diagram of one embodiment of the user grading module in the data mining device of the present invention.
Fig. 6 is a schematic diagram of another embodiment of the data mining device of the present invention.
Fig. 7 is a schematic diagram of one embodiment of the data mining system of the present invention.
Fig. 8 is a schematic diagram of another embodiment of the data mining system of the present invention.
Detailed description
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
A flowchart of one embodiment of the data mining method of the present invention is shown in Fig. 1.
In step 101, predetermined behavior data of users are obtained; the predetermined behavior data include effectiveness data of the predetermined behavior and the generation time of the predetermined behavior. The same user may have multiple items of predetermined behavior data, each with its own generation time and effectiveness data. In one embodiment, the predetermined behavior data of multiple users may be obtained.
In step 102, users are classified according to the generation time and the quantity of each user's predetermined behavior data, and a target user set is determined. In one embodiment, users may be classified by the generation time of their predetermined behavior data, by the quantity of their predetermined behavior data, or by a combination of the two for a finer classification. One or more classes may each be selected as a target user set as required.
In step 103, a single-user feature vector is generated for each user in the target user set according to the predetermined behavior data. In one embodiment, the single-user feature vector may be determined from the quantity of the predetermined behavior data, the effectiveness data of the predetermined behavior data, the time interval in which the generation time falls, and so on.
In step 104, the target user set is graded with a clustering algorithm according to the single-user feature vectors, and graded user sets are determined, where the number of graded user sets equals the predetermined number of grades. In one embodiment, initial cluster centers may be selected, their number being equal to the predetermined number of grades, and the clustering operation is then performed with the K-means algorithm.
With this method, users are first classified and clustering is then carried out within one class, so that suitable target users can be selected for cluster analysis. On the one hand, this makes the analysis more targeted and reduces the amount of data to be processed; on the other hand, it excludes the interference of other classes of user data with the clustering result, so that the user group division is more accurate and precise, personalized services can readily be provided according to the result of the division.
In one embodiment, a predetermined time threshold and a predetermined quantity threshold may be set to classify users. If the generation time of a user's predetermined behavior data is earlier than the predetermined time threshold and the quantity of the predetermined behavior data is greater than the predetermined quantity threshold, the user is determined to be a first-class user. If the generation time of the predetermined behavior data is earlier than the predetermined time threshold and the quantity of the predetermined behavior data is not greater than the predetermined quantity threshold, the user is determined to be a second-class user. If the user has predetermined behavior data whose generation time is no earlier than the predetermined time threshold, and the quantity of predetermined behavior data generated no earlier than the predetermined time threshold is greater than the predetermined quantity threshold, the user is determined to be a third-class user. If the user has predetermined behavior data whose generation time is no earlier than the predetermined time threshold, and the quantity of predetermined behavior data generated no earlier than the predetermined time threshold is not greater than the predetermined quantity threshold, the user is determined to be a fourth-class user.
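The four-way split described above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `classify_user`, its parameters and the handling of users with no records are assumptions, and "earlier than the time threshold" is read here as "all of the user's records are earlier than the threshold".

```python
from datetime import datetime

def classify_user(times, time_threshold, count_threshold):
    """Return the class (1 to 4) of one user.

    times: generation times of the user's predetermined behavior data.
    time_threshold, count_threshold: the two predetermined thresholds.
    """
    # Records generated no earlier than the time threshold.
    recent = [t for t in times if t >= time_threshold]
    if not recent:
        # Every record is earlier than the threshold (assumed reading).
        # A user with no records at all falls into class 2 here (assumption).
        return 1 if len(times) > count_threshold else 2
    # At least one record is no earlier than the threshold; only those
    # recent records are counted against the quantity threshold.
    return 3 if len(recent) > count_threshold else 4

# Hypothetical example data.
threshold = datetime(2016, 1, 1)
old_active = [datetime(2015, m, 1) for m in (1, 2, 3)]
new_light = [datetime(2016, 2, 1)]
print(classify_user(old_active, threshold, 2))  # class 1
print(classify_user(new_light, threshold, 2))   # class 4
```

Any one of the resulting classes, or each class in turn, can then be passed on as a target user set.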
In this way, users can be classified according to the generation time and the quantity of their predetermined behavior data. The users of a required class can be selected as the target user set, or the clustering operation can be performed separately on the users of each class, so that users are graded within their own class, which improves the accuracy of user grading.
In one embodiment, users who have produced no predetermined behavior data over a long period may be excluded. Because such users have been inactive for a long time, analyzing their behavior and mining their data is of little value. Excluding them reduces the amount of computation, reduces their influence on the grading result, and can lower operating costs when the data are applied.
In one embodiment, the predetermined behavior data further include a predetermined-condition flag and effectiveness deduction data. The effectiveness deduction data may be the effectiveness deducted because the predetermined behavior meets a predetermined condition, for example the amount by which the effectiveness data fall below the standard effectiveness data. In one embodiment, the predetermined-condition flag of an item of predetermined behavior data indicates whether the predetermined behavior meets the predetermined condition, and the predetermined behavior data of behaviors that meet the predetermined condition are referred to as first predetermined behavior data. The single-user feature vector can reflect the proportion and the impact of the behaviors that meet the predetermined condition, so that data mining can analyze the user's behavior features, in particular the user's sensitivity to the predetermined condition. In one embodiment, the user's first feature vector index may be determined from the ratio of the quantity of the user's first predetermined behavior data to the quantity of the user's predetermined behavior data. In another embodiment, the ratio of the effectiveness deduction data to the effectiveness data may be determined for each item of the user's predetermined behavior data, and these ratios averaged to determine the user's second feature vector index. In yet another embodiment, the user's third feature vector index may be determined from the ratio of the sum of the user's effectiveness deduction data to the sum of the user's effectiveness data. In another embodiment, the user's fourth feature vector index may be determined from the sum of the user's effectiveness deduction data. The user's fifth feature vector index may be determined from the quantity of the user's first predetermined behavior data. In addition, the user's sixth feature vector index may be determined from the ratio of the number of time periods in which the user has first predetermined behavior data to the number of time periods that have elapsed since the user joined the network.
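The six indexes can be computed from one user's records as in the sketch below. The `Record` layout, the field names and the `periods_since_join` parameter are hypothetical, and the sketch assumes every record has non-zero effectiveness data and that the user has at least one record.

```python
from dataclasses import dataclass

@dataclass
class Record:
    effectiveness: float   # effectiveness data of the behavior (assumed non-zero)
    deduction: float       # effectiveness deduction data, 0 if none
    meets_condition: bool  # predetermined-condition flag
    period: int            # index of the time period of the generation time

def feature_vector(records, periods_since_join):
    """Return the six feature vector indexes for one user."""
    first = [r for r in records if r.meets_condition]  # first predetermined behavior data
    n = len(records)
    total_deduction = sum(r.deduction for r in records)
    total_effectiveness = sum(r.effectiveness for r in records)
    f1 = len(first) / n                                          # share of first records
    f2 = sum(r.deduction / r.effectiveness for r in records) / n # mean per-record ratio
    f3 = total_deduction / total_effectiveness                   # ratio of the sums
    f4 = total_deduction                                         # sum of deductions
    f5 = len(first)                                              # count of first records
    f6 = len({r.period for r in first}) / periods_since_join     # share of periods with a first record
    return [f1, f2, f3, f4, f5, f6]

# Hypothetical example: four records over four elapsed periods.
records = [
    Record(100, 20, True, 0),
    Record(50, 0, False, 0),
    Record(200, 20, True, 2),
    Record(50, 0, False, 3),
]
print(feature_vector(records, periods_since_join=4))
```

Note that the second and third indexes differ: the second averages per-record ratios, so small records weigh as much as large ones, while the third is dominated by records with large effectiveness data.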
A feature vector composed of several feature vector indexes can accurately describe a user's sensitivity to the predetermined condition. In the cluster calculation, users are therefore graded in a way that clearly reflects differences in their sensitivity to the predetermined condition, which makes it easier to build targeted applications on the graded users and to serve users in a targeted way.
A flowchart of one embodiment of user grading in the data mining method of the present invention is shown in Fig. 2.
In step 201, high-density-region users are determined according to the single-user feature vector of each user. In one embodiment, taking the point of a user's single-user feature vector as the center, the radius of the region that contains the single-user feature vector points of a predetermined number of other users is determined; if this radius is less than a predetermined threshold, the user is considered a high-density-region user. In another embodiment, taking the point of a user's single-user feature vector as the center, the number of other users' single-user feature vector points within a region of predetermined radius is determined; if this number reaches a predetermined quantity, the user is considered a high-density-region user.
In step 202, users are selected from the high-density-region users to serve as initial cluster centers, the number of initial cluster centers being equal to the predetermined number of grades. For example, if the users in the target user set are to be divided into five grades by clustering, five initial cluster centers need to be chosen in the high-density region.
In step 203, the graded user sets are determined with the K-means algorithm, starting from the initial cluster centers.
Generally, high-density data regions are separated by low-density data regions, and the data points located in low-density regions are usually called isolated points. Most existing clustering algorithms choose the initial cluster centers at random, which ignores the distribution of the data. Because the choice of initial cluster centers affects the result of the K-means algorithm, choosing them at random can greatly affect the final clustering result. The method of this embodiment ensures that the initial cluster centers are high-density-region users and avoids the inaccurate user grading caused by using isolated users as initial cluster centers.
In one embodiment, the computation is based on the users' single-user feature vectors. The data point with the largest density parameter among the high-density-region users is selected as the first initial cluster center and removed from the high-density-region users. The high-density-region user farthest from the first initial cluster center is then selected as the second initial cluster center and removed from the high-density-region users. The high-density-region user farthest from the set consisting of the first and second initial cluster centers is selected as the third initial cluster center and removed from the high-density-region users. This continues until all initial cluster centers have been determined.
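The selection procedure above amounts to a farthest-first traversal seeded with the densest point. A minimal sketch follows, assuming that a larger density parameter means a denser neighborhood and that the distance from a point to the set of chosen centers is the distance to its nearest member; the function and parameter names are hypothetical.

```python
import math

def select_initial_centers(points, density, k):
    """Pick k initial cluster centers from the high-density points.

    points:  feature vectors of the high-density-region users.
    density: one density parameter per point (larger means denser, assumed).
    """
    remaining = list(range(len(points)))
    # First center: the point with the largest density parameter.
    first = max(remaining, key=lambda i: density[i])
    centers = [first]
    remaining.remove(first)
    while len(centers) < k and remaining:
        # Distance to the center set = distance to the nearest chosen center.
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[c]) for c in centers),
        )
        centers.append(nxt)
        remaining.remove(nxt)  # a chosen center is deleted from the candidates
    return [points[c] for c in centers]

# Hypothetical high-density points and density parameters.
points = [(0, 0), (1, 0), (0, 1), (10, 0), (0, 12)]
density = [3, 2, 2, 1, 1]
print(select_initial_centers(points, density, k=3))
```

If fewer than k candidates are available, the sketch simply returns all of them; how that case should be handled is not stated in the text.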
In this way, users that are farthest from one another can be selected from the high-density-region users as initial cluster centers. On the one hand, this prevents an isolated user from being selected as an initial cluster center and distorting the clustering result. On the other hand, because initial cluster centers that are farthest from one another are more representative than randomly chosen ones, the initial cluster centers obtained by this method are more representative, which improves the clustering result and yields more representative user grading results.
In one embodiment, the distance between two points may be calculated as the Euclidean distance, with the following formula:
$$\mathrm{dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where $x$ and $y$ identify two points; $(x_1, x_2, \ldots, x_n)$ is the feature vector of $x$ and $x_1, x_2, \ldots, x_n$ are its feature vector indexes; $(y_1, y_2, \ldots, y_n)$ is the feature vector of $y$ and $y_1, y_2, \ldots, y_n$ are its feature vector indexes; and $n$ is the number of indexes in the feature vector.
The distance between a data point $x$ and a data point set $z$ is the distance between $x$ and the nearest data point in $z$, calculated as follows:
$$\mathrm{dist}(x, z) = \min_{y \in z} \mathrm{dist}(x, y)$$
where $y$ ranges over the points in $z$.
The distance between two data point sets $x$ and $y$ is the distance between the two nearest data points taken one from each set, calculated as follows:
$$\mathrm{dist}(x, y) = \min_{u \in x,\ v \in y} \mathrm{dist}(u, v)$$
where $u$ ranges over the points in $x$ and $v$ ranges over the points in $y$.
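The three distance definitions translate directly into code. The sketch below uses hypothetical function names and represents points as tuples and sets as lists of tuples.

```python
import math

def dist_points(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dist_point_set(x, z):
    """Distance from point x to set z: distance to the nearest point in z."""
    return min(dist_points(x, y) for y in z)

def dist_sets(xs, ys):
    """Distance between two sets: distance between their closest pair of points."""
    return min(dist_points(u, v) for u in xs for v in ys)

print(dist_points((0, 0), (3, 4)))                       # 5.0
print(dist_point_set((0, 0), [(3, 4), (6, 8)]))          # 5.0
print(dist_sets([(0, 0), (1, 0)], [(4, 0), (9, 0)]))     # 3.0
```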
In this way, the density parameter of each data point can be calculated, and the initial cluster centers can then be determined from the distances between data points, between data points and sets, and between sets.
In the K-means algorithm, the Euclidean distance from each data point to each of the k initial cluster centers is calculated, and each data point is assigned to the cluster of its nearest initial cluster center. It is then checked whether the stopping condition, namely that the cluster centers no longer change, has been reached. If the stopping condition is met, the algorithm exits; otherwise the center of each cluster is updated to the mean of all points in that cluster, and the above calculation is repeated until the cluster centers no longer change. In this way, the clustering operation is completed and the graded user sets are obtained.
In one embodiment, the feature indices of different users often contain extremely large and extremely small values that deviate far from the normal level; such extreme values are generally called outliers. To keep these outliers from affecting the subsequent clustering, they can be identified before clustering. In one embodiment, users whose sum of utility deduction data exceeds a predetermined quantile are treated as abnormal users and are deleted from the target user set used for the clustering operation. After the target user set from which the abnormal users have been excluded is graded with the clustering algorithm according to the single-user feature vectors, and the graded user sets have been determined, a similar graded user set can be selected for each abnormal user and the abnormal user merged into it. For example, users whose sum of utility deduction data exceeds the predetermined quantile are merged into the graded user set that is extremely sensitive to the predetermined condition, and users whose utility deduction data is 0 are merged into the graded user set that is extremely insensitive to the predetermined condition. In one embodiment, the second feature vector index a mentioned above can be used to classify the abnormal users, as shown in Table 1, where the second feature vector index of user i is a_i:
Condition on a_i | Classification |
a_i ≥ mean of a + standard deviation of a | Extremely sensitive to the predetermined condition |
mean of a ≤ a_i < mean of a + standard deviation of a | Highly sensitive to the predetermined condition |
mean of a - standard deviation of a ≤ a_i < mean of a | Moderately sensitive to the predetermined condition |
a_i < mean of a - standard deviation of a | Slightly sensitive to the predetermined condition |
a_i = 0 | Insensitive to the predetermined condition |
Table 1. Classification of abnormal users
In this way, the influence of abnormal users on the cluster calculation can be excluded on the one hand; on the other hand, the abnormal users are still taken into account rather than simply discarded, which widens the coverage of the user grading results and avoids leaving some users unanalyzed.
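The classification rule for abnormal users can be sketched as below. This is a hedged illustration: the function name `classify_abnormal_user` is hypothetical, the population standard deviation is an assumption (the description does not say which form is used), and the zero case is checked first as a design choice so that a user with a_i = 0 is always reported as insensitive even when 0 also falls below the mean minus the standard deviation.

```python
import statistics


def classify_abnormal_user(a_i, a_values):
    """Map an abnormal user's second feature index a_i to a sensitivity level.

    a_values holds the same index for the users of the target user set.
    The population standard deviation is an assumption of this sketch.
    """
    mean = statistics.fmean(a_values)
    std = statistics.pstdev(a_values)
    if a_i == 0:  # checked first so that zero is always "insensitive"
        return "insensitive"
    if a_i >= mean + std:
        return "extremely sensitive"
    if a_i >= mean:
        return "highly sensitive"
    if a_i >= mean - std:
        return "moderately sensitive"
    return "slightly sensitive"
```

Each abnormal user can then be merged into the graded user set whose level matches the returned label.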
In one embodiment, the feature vector index data need to be standardized before the clustering algorithm is run, in order to eliminate the influence of different dimensions (units of measure) on the clustering result. For example, some feature vector indices are percentages, some are quantities, and some are utilities; these indices cannot be compared directly and therefore need to be converted into comparable, standardized feature vector index data free of dimensional effects. In one embodiment, the data can be standardized with the standard deviation standardization method, in which the mean of a feature vector index is subtracted from each value of that index and the result is divided by the standard deviation of the index. The mean measures the central tendency of the data distribution and is calculated as:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
The standard deviation measures the dispersion of the data and is calculated as:
$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}$$
The standard deviation standardization formula
$$X_{scale_i} = \frac{X_i - \bar{X}}{S}$$
gives the standardized feature vector index data, where X_1, …, X_i, …, X_n are the feature vector index data, i is a natural number from 1 to n, n is the number of users in the target user set taking part in the clustering, and X_{scale_i} is the feature vector index data obtained by standardizing X_i.
In this way, the cluster calculation is carried out only after the feature vector index data have been standardized, which eliminates the influence of different dimensions on the clustering effect and improves the accuracy and reliability of user grading.
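Standard deviation standardization of one feature vector index can be sketched as follows. The function name `standardize` is hypothetical, and two choices are assumptions of this sketch rather than statements of the description: the population form of the standard deviation, and returning zeros for a constant index.

```python
import statistics


def standardize(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population form; an assumption
    if std == 0:
        # A constant index cannot be scaled; map it to zeros (assumed handling).
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

Applied column by column, this puts percentages, counts and utility amounts on a common scale before distances are computed.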
A flow chart of another embodiment of the data mining method of the present invention is shown in Fig. 3.
In step 301, predetermined behavior data of users are obtained. The predetermined behavior data include the utility data of the predetermined behavior and the generation time of the predetermined behavior. One user may have multiple pieces of predetermined behavior data, each including its generation time and utility data. In one embodiment, the predetermined behavior data of multiple users can be obtained.
In step 302, users are classified according to the generation time of each user's predetermined behavior data and the quantity of the predetermined behavior data, and the target user set is determined. In one embodiment, users can be classified by the generation time of the predetermined behavior data, by the quantity of predetermined behavior data generated, or by a combination of both for a finer classification. One or more classes can be selected as target user sets as required.
In step 303, the single-user feature vector of each user in the target user set is generated from the predetermined behavior data. In one embodiment, the single-user feature vector can be determined from the quantity of predetermined behavior data, the utility data of the predetermined behavior data, the time interval in which the generation time falls, and so on.
In step 304, users whose sum of utility deduction data exceeds a predetermined quantile are treated as abnormal users and are deleted from the target user set used for the clustering operation.
In step 305, the feature vector index data are standardized to eliminate the influence of different dimensions on the clustering result.
In step 306, according to the standardized single-user feature vectors, the target user set from which the abnormal users have been deleted is graded with a clustering algorithm to determine the graded user sets, where the number of graded user sets equals the predetermined number of grades. In one embodiment, as many initial cluster centers as the predetermined number of grades can be selected and the clustering operation performed with the K-means algorithm. In one embodiment, a similar graded user set can also be selected for each abnormal user, and the abnormal user merged into that graded user set.
In this way, users are first classified and clustering is carried out within one class, which excludes the interference of user data of other classes with the clustering effect, divides the user groups more accurately, and makes it easier to provide precise, personalized services based on the division. The initial cluster centers are guaranteed to be users in high-density regions, which avoids inaccurate user grading caused by isolated points serving as initial cluster centers. The influence of abnormal users on the cluster calculation is excluded while the abnormal users are still taken into account, which preserves the coverage of the user grading results. The influence of different dimensions on the clustering effect is eliminated, improving the accuracy and reliability of user grading.
In one embodiment, the sensitivity of the different graded user sets to the predetermined condition can be determined from the final cluster centers of the graded user sets. In one embodiment, the cluster center of each graded user set is summed over all feature vector index dimensions, and the sums are sorted by value: the cluster center with the largest sum corresponds to extreme sensitivity to the predetermined condition, and so on, down to the cluster center with the smallest sum, which corresponds to insensitivity to the predetermined condition. In this way, the graded user sets are given a practical meaning and the differences between them can be understood intuitively, so that the graded user sets can be applied and served in a targeted manner.
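The ranking of cluster centers by the sum of their coordinates can be sketched as below. The function name `label_clusters` and the `LEVELS` list are hypothetical; the five level names follow the grades used elsewhere in this description, and the sketch assumes exactly one level per cluster.

```python
LEVELS = [
    "extremely sensitive",
    "highly sensitive",
    "moderately sensitive",
    "slightly sensitive",
    "insensitive",
]


def label_clusters(centers, levels=LEVELS):
    """Return {cluster index: level}, ranking centers by the sum of coordinates."""
    if len(centers) != len(levels):
        raise ValueError("need exactly one level per cluster center")
    # Largest coordinate sum first, so it receives the most sensitive level.
    order = sorted(range(len(centers)), key=lambda j: sum(centers[j]), reverse=True)
    return {j: levels[rank] for rank, j in enumerate(order)}
```

Summing is meaningful here only because the feature indices have been standardized and all of them grow with sensitivity; that reading is an interpretation of the paragraph above.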
In the e-commerce field, purchasing users can be clustered by their various behavioral features and divided into several classes, which helps market analysts and operations staff understand the characteristics of the customer base clearly and carry out precise, personalized marketing. Promotion sensitivity is an index that measures how sensitive a user is to promotional offers of all kinds. Some users pay close attention to heavily discounted goods and often buy them repeatedly, or make a purchase with a coupon as soon as the system issues one; this shows that such users are relatively sensitive to promotions. Other users do not buy merely because goods are on promotion and take no interest in issued coupons; this shows that such users are insensitive to promotional offers. Users can be divided into different groups based on these behavioral features, which facilitates precision marketing and personalized recommendation, guiding users to purchase again and raising turnover.
In the prior art, all users in the system database are selected, two indices are calculated (the discount amount ratio and the discount order ratio), and, with randomly selected initial cluster centers, users are divided into three classes: highly sensitive to promotions, slightly sensitive to promotions, and insensitive to promotions.
In one embodiment of the present invention, a selection can be made within the customer base. For example, users with shopping behavior in the last 3 years are taken as the target group for promotion sensitivity identification. On the one hand this satisfies user coverage; on the other hand, identifying the promotion sensitivity of users who have not shopped in the last 3 years is of little value, because such users are hard to bring back to repurchase through marketing, and attempting to do so wastes marketing resources. The users with shopping behavior in the last 3 years are then subdivided. According to two indices, the time of the user's last purchase and the shopping frequency, these users can be divided into four major classes: users who bought only once in the last year; users with repurchase behavior in the last year; users whose last purchase occurred more than one year ago and who bought only once before that; and users whose last purchase occurred more than one year ago and who had repurchase behavior before that. Then, according to the actual application scenario, each of the four major classes is subdivided into 5 grades: extremely sensitive, highly sensitive, moderately sensitive, slightly sensitive, and insensitive. In one embodiment, the users of one major class can be selected for subdivision, or the users of each major class can be subdivided separately. The purpose of dividing users this finely is to let the business side operate in a more precise, fine-grained and personalized way and meet marketing needs as fully as possible.
In one embodiment, a richer set of feature vector indices can be used to distinguish the promotion-sensitivity types of users, as shown in Table 2.
Table 2. Feature vector indices chosen for user promotion-sensitivity types
In some cases, for example, one user has bought only once, and the discount on that order was 80% of the original price, but the original price was only 10 yuan; another user has bought many times with a discount on every order, and the total discount amounts to 50% of the original price, but the original price is as high as 100,000 yuan. Judging a user's promotion-sensitivity type solely from the discount order ratio and the discount amount ratio is then inaccurate. The method in the embodiments of the present invention can use richer indices to measure a user's promotion sensitivity, which is more reasonable and accurate.
In one embodiment, outliers can be chosen according to the total discount amount. For example, if analysis of the data distribution of each feature shows that the total discount amount contains some extremely large values, users whose discount amount exceeds the 0.995 quantile of the discount amount can be classed as abnormal users. These users do not take part in the clustering, but after the clustering ends they can be classified according to the average per-order discount amount ratio to determine which graded user set each belongs to, as shown in Table 3:
Average per-order discount amount a_i of user i | Classification |
a_i ≥ mean of a + standard deviation of a | Extremely sensitive |
mean of a ≤ a_i < mean of a + standard deviation of a | Highly sensitive |
mean of a - standard deviation of a ≤ a_i < mean of a | Moderately sensitive |
a_i < mean of a - standard deviation of a | Slightly sensitive |
a_i = 0 | Insensitive |
Table 3. Classification of abnormal users in promotional-offer sensitivity clustering
where a is the average per-order discount amount of a single user.
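Separating abnormal users at the 0.995 quantile of the total discount amount can be sketched as below. The function name `split_by_quantile` and the dictionary layout are assumptions of this sketch. It relies on `statistics.quantiles` from the Python standard library: with `n=200` the last cut point is the 0.995 quantile, computed with that function's default interpolation method, which the description does not specify.

```python
import statistics


def split_by_quantile(totals, n_cuts=200):
    """Split {user: total discount} into (normal, abnormal) users.

    With n_cuts=200 the last cut point is the 0.995 quantile; users strictly
    above it are treated as abnormal and kept out of the clustering.
    """
    cutoff = statistics.quantiles(totals.values(), n=n_cuts)[-1]
    normal = {u: v for u, v in totals.items() if v <= cutoff}
    abnormal = {u: v for u, v in totals.items() if v > cutoff}
    return normal, abnormal
```

The abnormal users returned here are the ones that would later be assigned to a graded user set by the rule in Table 3.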
The original implementation does not handle outliers, and outliers can affect the clustering effect greatly, leading to poor clustering results. With the method in the embodiments of the present invention, outliers can be identified in light of the specific business scenario, and once identified they are not simply discarded but are also assigned a promotion-sensitivity type, which increases the user coverage of the model.
A schematic diagram of one embodiment of the data mining device of the present invention is shown in Fig. 4. The data acquisition module 401 can obtain the predetermined behavior data of users; the predetermined behavior data include the utility data of the predetermined behavior and the generation time of the predetermined behavior. One user may have multiple pieces of predetermined behavior data, each including its generation time and utility data. In one embodiment, the predetermined behavior data of multiple users can be obtained. The user classification module 402 can classify users according to the generation time of each user's predetermined behavior data and the quantity of the predetermined behavior data, and determine the target user set. In one embodiment, users can be classified by the generation time of the predetermined behavior data, by the quantity of predetermined behavior data generated, or by a combination of both for a finer classification. One or more classes can be selected as target user sets as required. The feature vector generation module 403 can generate the single-user feature vector of each user in the target user set from the predetermined behavior data. In one embodiment, the single-user feature vector can be determined from the quantity of predetermined behavior data, the utility data of the predetermined behavior data, the time interval in which the generation time falls, and so on. The user grading module 404 can grade the target user set with a clustering algorithm according to the single-user feature vectors and determine the graded user sets, where the number of graded user sets equals the predetermined number of grades. In one embodiment, initial cluster centers can be selected, their number being equal to the predetermined number of grades, and the clustering operation performed with the K-means algorithm.
Such a device can first classify users and then cluster the users within one class, so that suitable target users are selected for cluster analysis. On the one hand this is more targeted and reduces the amount of data to be processed; on the other hand it excludes the interference of user data of other classes with the clustering effect, divides the user groups more accurately, and makes it easier to provide precise, personalized services based on the division.
In one embodiment, a predetermined time threshold and a predetermined quantity threshold can be set to classify users. If the generation times of the user's predetermined behavior data are earlier than the predetermined time threshold and the quantity of predetermined behavior data is greater than the predetermined quantity threshold, the user is determined to be a first-class user. If the generation times of the predetermined behavior data are earlier than the predetermined time threshold and the quantity of predetermined behavior data is not greater than the predetermined quantity threshold, the user is determined to be a second-class user. If there are predetermined behavior data whose generation time is no earlier than the predetermined time threshold, and the quantity of predetermined behavior data generated no earlier than the predetermined time threshold is greater than the predetermined quantity threshold, the user is determined to be a third-class user. If there are predetermined behavior data whose generation time is no earlier than the predetermined time threshold, and the quantity of predetermined behavior data generated no earlier than the predetermined time threshold is not greater than the predetermined quantity threshold, the user is determined to be a fourth-class user.
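The four-way classification above can be sketched as follows. This is one reading of the rule, stated as an assumption: a user with at least one record at or after the time threshold is judged only on those recent records, and a user with no such record is judged on all of their records. The function name `classify_user` and its parameters are hypothetical.

```python
from datetime import datetime


def classify_user(times, time_threshold, count_threshold):
    """Return the class (1 to 4) of a user from the generation times of their records."""
    if not times:
        raise ValueError("user has no behavior records")
    recent = [t for t in times if t >= time_threshold]
    if recent:
        # At least one record is no earlier than the time threshold.
        return 3 if len(recent) > count_threshold else 4
    # Every record is earlier than the time threshold.
    return 1 if len(times) > count_threshold else 2
```

With a time threshold one year in the past and a quantity threshold of 1, this reproduces a split by recency (last behavior within or before the past year) and by whether the behavior was repeated.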
Such a device can classify users according to the generation time and the quantity of the predetermined behavior data, and either select the users of the required class as the target user set or perform the clustering operation on the user set of each class separately, so that users of the same class are graded together and the accuracy of user grading is improved.
In one embodiment, the user classification module 402 can exclude users who have generated no predetermined behavior data over a long period. Because such users have been inactive for a long time, analyzing their behavior and mining their data is of little value; excluding them reduces the amount of computation, reduces the impact on the grading effect, and lowers operating costs when the data are applied.
In one embodiment, the predetermined behavior data also include a predetermined condition identifier and utility deduction data. The utility deduction data can be the utility deducted because the predetermined behavior meets the predetermined condition, for example the amount by which the utility data is reduced relative to the standard utility data. In one embodiment, whether a predetermined behavior meets the predetermined condition can be judged from the predetermined condition identifier of the predetermined behavior data, and the predetermined behavior data of a predetermined behavior that meets the predetermined condition can be called first predetermined behavior data. The single-user feature vector can reflect the proportion of predetermined behaviors that meet the predetermined condition and the effect they produce, so that user behavior features, in particular sensitivity to the predetermined condition, can be analyzed through data mining. In one embodiment, the first feature vector index of the user can be determined from the ratio of the quantity of the user's first predetermined behavior data to the quantity of the user's predetermined behavior data. In another embodiment, the ratio of the utility deduction data to the utility data can be determined for each piece of the user's predetermined behavior data, and the mean of these ratios taken as the second feature vector index of the user. In yet another embodiment, the third feature vector index of the user can be determined from the ratio of the sum of the user's utility deduction data to the sum of the user's utility data. In another embodiment, the fourth feature vector index of the user can be determined from the sum of the user's utility deduction data, and the fifth feature vector index of the user from the quantity of the user's first predetermined behavior data. In addition, the sixth feature vector index of the user can be determined from the ratio of the number of time periods in which the user has first predetermined behavior data to the number of time periods that have elapsed since the user joined the network.
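The six feature vector indices described above can be computed along the following lines. The `Record` layout, its field names, and the function name `feature_vector` are hypothetical; the sketch also assumes that the user has at least one record, that every record has a utility greater than zero, and that time periods are identified by integers.

```python
from dataclasses import dataclass


@dataclass
class Record:
    utility: float         # utility data of the behavior (assumed > 0)
    deduction: float       # utility deduction data
    meets_condition: bool  # predetermined condition identifier
    period: int            # index of the time period in which it was generated


def feature_vector(records, periods_since_join):
    """Return the six feature vector indices of one user as a tuple."""
    first = [r for r in records if r.meets_condition]  # first-type records
    n = len(records)
    total_utility = sum(r.utility for r in records)
    total_deduction = sum(r.deduction for r in records)
    return (
        len(first) / n,                                     # 1: share of first-type records
        sum(r.deduction / r.utility for r in records) / n,  # 2: mean per-record ratio
        total_deduction / total_utility,                    # 3: overall deduction ratio
        total_deduction,                                    # 4: total deduction
        len(first),                                         # 5: count of first-type records
        len({r.period for r in first}) / periods_since_join,  # 6: share of active periods
    )
```

The indices mix ratios, amounts and counts, which is why standardization is applied before clustering.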
Building the feature vector from multiple feature vector indices accurately characterizes the user's sensitivity to the predetermined condition, so that the cluster calculation clearly separates users by their degree of sensitivity to the predetermined condition. This makes it easier to build targeted applications on the graded users and to provide users with targeted services.
A schematic diagram of one embodiment of the user grading module in the data mining device of the present invention is shown in Fig. 5. The high-density user determining unit 501 can determine the high-density region users according to the single-user feature vector of each user. In one embodiment, taking the single-user feature vector point of a user as the center point, the radius of the region that contains a predetermined number of single-user feature vector points of other users is determined; if the radius is smaller than a predetermined threshold, the user is considered a high-density region user. In one embodiment, taking the single-user feature vector point of a user as the center point, the number of single-user feature vector points of other users within a region of predetermined radius is determined; if that number reaches a predetermined quantity, the user is considered a high-density region user. The initial center determining unit 502 can select, from the high-density region users, the users that serve as initial cluster centers; the number of initial cluster centers equals the predetermined number of grades. For example, if the users in the target user set are to be divided into five grades by clustering, 5 initial cluster centers need to be chosen in the high-density regions. The clustering unit 503 can determine the graded user sets with the K-means algorithm starting from the initial cluster centers.
Such a device ensures that the initial cluster centers are high-density region users, which avoids inaccurate user grading caused by isolated users serving as initial cluster centers.
In one embodiment, the initial center determining unit 502 can operate on the single-user feature vectors of the users as follows. The data point with the largest density parameter among the high-density region users is selected as the first initial cluster center and deleted from the high-density region users. The user farthest from the first initial cluster center is selected from the high-density region users as the second initial cluster center and deleted from the high-density region users. The user farthest from the set formed by the first and second initial cluster centers is selected from the high-density region users as the third initial cluster center and deleted from the high-density region users. The process continues in the same way until all initial cluster centers have been determined.
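The selection of initial cluster centers can be sketched as below. Several points are assumptions of this sketch rather than statements of the description: the density of a point is taken to be the number of other points within a fixed radius, a point counts as high-density when that number reaches `min_neighbors`, and the distance from a point to the set of already chosen centers is the distance to its nearest member. The names `neighbor_count` and `initial_centers` are hypothetical.

```python
import math


def neighbor_count(i, points, radius):
    """Number of other points within `radius` of points[i] (used as the density)."""
    return sum(
        1 for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= radius
    )


def initial_centers(points, k, radius, min_neighbors):
    """Pick k initial centers among high-density points, spread far apart."""
    density = [neighbor_count(i, points, radius) for i in range(len(points))]
    # Keep only high-density points so isolated points cannot become centers.
    dense = [i for i, d in enumerate(density) if d >= min_neighbors]
    if len(dense) < k:
        raise ValueError("not enough high-density points for k centers")
    first = max(dense, key=lambda i: density[i])  # densest point
    chosen = [first]
    dense.remove(first)
    while len(chosen) < k:
        # Next center: the dense point farthest from the set of chosen centers.
        nxt = max(
            dense,
            key=lambda i: min(math.dist(points[i], points[c]) for c in chosen),
        )
        chosen.append(nxt)
        dense.remove(nxt)
    return [points[i] for i in chosen]
```

The returned points can be passed directly to a K-means routine as its starting centers.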
Such a device can select, among the high-density region users, the users that are farthest from one another as initial cluster centers. On the one hand this prevents isolated users from being chosen as initial cluster centers and distorting the clustering result; on the other hand, because initial cluster centers that are far apart are more representative than randomly chosen ones, the initial cluster centers obtained in this way optimize the clustering effect and yield more representative user grading results.
In one embodiment, the feature indices of different users often contain extremely large and extremely small values that deviate far from the normal level; such extreme values are generally called outliers. To keep these outliers from affecting the subsequent clustering, they can be identified before clustering. In one embodiment, users whose sum of utility deduction data exceeds a predetermined quantile are treated as abnormal users and are deleted from the target user set used for the clustering operation. After the target user set from which the abnormal users have been excluded is graded with the clustering algorithm according to the single-user feature vectors, and the graded user sets have been determined, a similar graded user set can be selected for each abnormal user and the abnormal user merged into it. For example, users whose sum of utility deduction data exceeds the predetermined quantile are merged into the graded user set that is extremely sensitive to the predetermined condition, and users whose utility deduction data is 0 are merged into the graded user set that is extremely insensitive to the predetermined condition. In one embodiment, the graded user set to which an abnormal user belongs can be determined by comparing the value of the user's second feature vector index described above with the mean and standard deviation of the second feature vector index in the target user set.
Such a device can, on the one hand, exclude the influence of abnormal users on the cluster calculation; on the other hand, it still takes the abnormal users into account rather than simply discarding them, which widens the coverage of the user grading results and avoids leaving some users unanalyzed.
In one embodiment, the feature vector index data need to be standardized before the clustering algorithm is run, in order to eliminate the influence of different dimensions (units of measure) on the clustering result. For example, some feature vector indices are percentages, some are quantities, and some are utilities; these indices cannot be compared directly and therefore need to be converted into comparable, standardized feature vector index data free of dimensional effects. In one embodiment, the device can include a standardization module for standardizing the data. In one embodiment, the standardization module can standardize the data with the standard deviation standardization method, in which the mean of a feature vector index is subtracted from each value of that index and the result is divided by the standard deviation of the index. The mean measures the central tendency of the data distribution and is calculated as:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
The standard deviation measures the dispersion of the data and is calculated as:
$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}$$
The standard deviation standardization formula
$$X_{scale_i} = \frac{X_i - \bar{X}}{S}$$
gives the standardized feature vector index data, where X_1, …, X_i, …, X_n are the feature vector index data, i is a natural number, n is the number of users in the target user set taking part in the clustering, and X_{scale_i} is the feature vector index data obtained by standardizing X_i.
Such a device carries out the cluster calculation only after the feature vector index data have been standardized, which eliminates the influence of different dimensions on the clustering effect and improves the accuracy and reliability of user grading.
A schematic diagram of another embodiment of the data mining device of the present invention is shown in Fig. 6. The structure and function of the data acquisition module 601, the user classification module 602 and the feature vector generation module 603 are similar to those in the embodiment of Fig. 4. The data mining device also includes an abnormal user exclusion module 605 and a standardization module 606. The abnormal user exclusion module 605 can treat users whose sum of utility deduction data exceeds a predetermined quantile as abnormal users and delete them from the target user set used for the clustering operation. The standardization module 606 can standardize the feature vector index data to eliminate the influence of different dimensions on the clustering result. The user grading module 604 can grade the target user set from which the abnormal users have been deleted with a clustering algorithm, according to the standardized single-user feature vectors, and determine the graded user sets; it can also select a similar graded user set for each abnormal user and merge the abnormal user into that graded user set.
Such a device can first classify users and then cluster the users within one class, which excludes the interference of user data of other classes with the clustering effect, divides the user groups more accurately, and makes it easier to provide precise, personalized services based on the division. The initial cluster centers are guaranteed to be high-density region users, which avoids inaccurate user grading caused by isolated points serving as initial cluster centers. The influence of abnormal users on the cluster calculation is excluded while the abnormal users are still taken into account, which preserves the coverage of the user grading results. The influence of different dimensions on the clustering effect is eliminated, improving the accuracy and reliability of user grading.
In one embodiment, the user grading module 604 can determine the sensitivity of the different graded user sets to the predetermined condition from the final cluster centers of the graded user sets. In one embodiment, the cluster center of each graded user set is summed over all feature vector index dimensions, and the sums are sorted by value: the cluster center with the largest sum corresponds to extreme sensitivity to the predetermined condition, and so on, down to the cluster center with the smallest sum, which corresponds to insensitivity to the predetermined condition. Such a device gives the graded user sets a practical meaning and makes the differences between them intuitive, so that the graded user sets can be applied and served in a targeted manner.
In one embodiment, so that the results can be used in various application scenarios, the graded user set data can be processed into normalized data tables and stored in a file system, where they can be called directly by a database system or pushed to business applications through an application programming interface, making it convenient to build targeted applications on user behavior features.
A schematic diagram of one embodiment of the data mining system of the present invention is shown in Fig. 7. The data mining system includes a memory 701 and a processor 702, where:
The memory 701 can be a magnetic disk, flash memory or any other non-volatile storage medium. The memory stores the instructions for the operation of the system.
The processor 702 is coupled to the memory 701 and can be implemented as one or more integrated circuits, such as a microprocessor or a microcontroller. The processor 702 executes the instructions stored in the memory, thereby obtaining the graded user sets efficiently and accurately.
A schematic diagram of another embodiment of the data mining system of the present invention is shown in Figure 8.
The data mining system 800 includes a memory 810 and a processor 820. The processor 820 may include processors 820a, 820b ... 820n, which are coupled to the memory 810 by a bus 830. A data mining system based on this distributed arrangement can compute quickly, improving the operating efficiency of data mining. The data mining system 800 may also connect to an external storage device 850 through a storage interface 840 in order to call external data, and may connect to a network or another computer system (not shown) through a network interface 860. These are not described in detail here.
In this embodiment, instructions are stored in the memory and then processed by the processor, thereby achieving efficient and accurate user grading and making it easy to provide corresponding services according to user behavior features.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the specific embodiments of the present invention may still be modified, or some of the technical features may be replaced by equivalents; without departing from the spirit of the technical solutions of the present invention, all such changes shall be covered by the scope of the technical solutions claimed in the present invention.
Claims (13)
1. A data mining method, characterized by comprising:
obtaining predefined action data of users, the predefined action data including effectiveness data of the predefined action and the occurrence time of the predefined action;
classifying the users according to the occurrence time of each user's predefined action data and the quantity of the predefined action data, and determining a target user set;
generating, according to the predefined action data, a single-user feature vector for each user in the target user set;
grading the target user set based on a clustering algorithm according to the single-user feature vectors, and determining hierarchic user sets.
2. The method according to claim 1, characterized in that
the predefined action data further include a predetermined condition mark and effectiveness deduction data, and first predefined action data are identified according to the predetermined condition mark;
the single-user feature vector includes a first feature vector index, a second feature vector index, a third feature vector index, a fourth feature vector index, a fifth feature vector index and/or a sixth feature vector index;
generating, according to the predefined action data, the single-user feature vector of each user in the target user set comprises:
determining the first feature vector index of the user according to the ratio of the quantity of the user's first predefined action data to the quantity of the user's predefined action data;
determining, for each item of the user's predefined action data, the ratio of the effectiveness deduction data to the effectiveness data, averaging the ratios, and determining the second feature vector index of the user;
determining the third feature vector index of the user according to the ratio of the sum of the user's effectiveness deduction data to the sum of the user's effectiveness data;
determining the fourth feature vector index of the user according to the sum of the user's effectiveness deduction data;
determining the fifth feature vector index of the user according to the quantity of the user's first predefined action data; and/or
determining the sixth feature vector index of the user according to the ratio of the number of time periods in which the user has first predefined action data to the number of time periods elapsed since the user joined the network.
3. The method according to claim 1, characterized in that grading the target user set based on a clustering algorithm according to the single-user feature vectors and determining hierarchic user sets comprises:
determining high-density region users according to the single-user feature vector of each user;
selecting, from the high-density region users, users to serve as initial cluster centers, the number of initial cluster centers being equal to a predetermined number of grades;
determining the hierarchic user sets based on the K-means algorithm according to the initial cluster centers.
4. The method according to claim 3, characterized in that selecting the initial cluster centers from the high-density region users comprises:
selecting, according to the single-user feature vectors, the user with the largest density parameter among the high-density region users as the first initial cluster center;
selecting, from the high-density region users, the user farthest from the first initial cluster center as the second initial cluster center;
selecting, from the high-density region users, the user farthest from the set formed by the first initial cluster center and the second initial cluster center as the third initial cluster center;
and so on, until all initial cluster centers are determined.
5. The method according to claim 2, characterized by further comprising:
excluding abnormal users from the target user set, the abnormal users including users whose sum of effectiveness deduction data exceeds a predetermined quantile;
wherein grading the target user set based on a clustering algorithm according to the single-user feature vectors and determining hierarchic user sets comprises:
grading the target user set based on the clustering algorithm according to the single-user feature vectors of the users in the target user set after the abnormal users are excluded, and determining the hierarchic user sets;
selecting a hierarchic user set for each abnormal user based on a predetermined policy, and merging the abnormal user into that hierarchic user set.
6. The method according to claim 1, characterized by further comprising: performing data standardization on the feature vector indices in the single-user feature vector;
wherein grading the target user set based on a clustering algorithm according to the single-user feature vectors and determining hierarchic user sets comprises:
grading the target user set based on the clustering algorithm according to the standardized single-user feature vectors, and determining the hierarchic user sets.
7. A data mining device, characterized by comprising:
a data acquisition module, configured to obtain predefined action data of users, the predefined action data including effectiveness data of the predefined action and the occurrence time of the predefined action;
a user classification module, configured to classify the users according to the occurrence time of each user's predefined action data and the quantity of the predefined action data, and to determine a target user set;
a feature vector generation module, configured to generate, according to the predefined action data, a single-user feature vector for each user in the target user set;
a user grading module, configured to grade the target user set based on a clustering algorithm according to the single-user feature vectors, and to determine hierarchic user sets.
8. The device according to claim 7, characterized in that
the predefined action data further include a predetermined condition mark and effectiveness deduction data, and first predefined action data are identified according to the predetermined condition mark;
the single-user feature vector includes a first feature vector index, a second feature vector index, a third feature vector index, a fourth feature vector index, a fifth feature vector index and/or a sixth feature vector index;
generating, according to the predefined action data, the single-user feature vector of each user in the target user set comprises:
determining the first feature vector index of the user according to the ratio of the quantity of the user's first predefined action data to the quantity of the user's predefined action data;
determining, for each item of the user's predefined action data, the ratio of the effectiveness deduction data to the effectiveness data, averaging the ratios, and determining the second feature vector index of the user;
determining the third feature vector index of the user according to the ratio of the sum of the user's effectiveness deduction data to the sum of the user's effectiveness data;
determining the fourth feature vector index of the user according to the sum of the user's effectiveness deduction data;
determining the fifth feature vector index of the user according to the quantity of the user's first predefined action data; and/or
determining the sixth feature vector index of the user according to the ratio of the number of time periods in which the user has first predefined action data to the number of time periods elapsed since the user joined the network.
9. The device according to claim 7, characterized in that the user grading module comprises:
a high-density user determining unit, configured to determine high-density region users according to the single-user feature vector of each user;
an initial center determining unit, configured to select, from the high-density region users, users to serve as initial cluster centers, the number of initial cluster centers being equal to a predetermined number of grades;
a clustering unit, configured to determine the hierarchic user sets based on the K-means algorithm according to the initial cluster centers.
10. The device according to claim 9, characterized in that the initial center determining unit is configured to:
select the user with the largest density parameter among the high-density region users as the first initial cluster center;
select, from the high-density region users, the user farthest from the first initial cluster center as the second initial cluster center;
select, from the high-density region users, the user farthest from the set formed by the first initial cluster center and the second initial cluster center as the third initial cluster center;
and so on, until all initial cluster centers are determined.
11. The device according to claim 8, characterized by further comprising:
an abnormal user exclusion module, configured to exclude abnormal users from the target user set, the abnormal users including users whose sum of effectiveness deduction data exceeds a predetermined quantile;
wherein the user grading module is configured to:
grade the target user set based on the clustering algorithm according to the single-user feature vectors of the users in the target user set after the abnormal users are excluded, and determine the hierarchic user sets;
select a hierarchic user set for each abnormal user based on a predetermined policy, and merge the abnormal user into that hierarchic user set.
12. The device according to claim 7, characterized by further comprising:
a standardization module, configured to perform data standardization on the feature vector indices in the single-user feature vector;
wherein the user grading module is configured to grade the target user set based on the clustering algorithm according to the standardized single-user feature vectors, and to determine the hierarchic user sets.
13. A data mining system, characterized by comprising:
a memory; and
a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the method according to any one of claims 1 to 6.
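To make the feature construction recited in the claims more concrete, the following Python sketch computes the six feature vector indices for one user, separates abnormal users by a quantile of the total effectiveness deduction, and standardizes the remaining vectors before clustering. It is only an illustration under stated assumptions: the record fields (`amount`, `deduction`, `flagged`, `period`), the nearest-rank quantile, and z-score standardization are choices made here for the example, since the claims do not fix a particular quantile estimator or standardization formula. All function names are hypothetical.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class ActionRecord:
    amount: float     # effectiveness data of one predefined action
    deduction: float  # effectiveness deduction data of that action
    flagged: bool     # True if the predetermined condition mark is present
    period: int       # index of the time period in which the action occurred


def single_user_features(records: Sequence[ActionRecord], periods_elapsed: int) -> List[float]:
    """Compute the six feature vector indices for one user."""
    if not records or periods_elapsed <= 0:
        raise ValueError("need at least one record and a positive period count")
    first = [r for r in records if r.flagged]  # first predefined action data
    total_amount = sum(r.amount for r in records)
    total_deduction = sum(r.deduction for r in records)
    ratios = [r.deduction / r.amount for r in records if r.amount > 0]
    return [
        len(first) / len(records),                                    # index 1
        sum(ratios) / len(ratios) if ratios else 0.0,                 # index 2
        total_deduction / total_amount if total_amount > 0 else 0.0,  # index 3
        total_deduction,                                              # index 4
        float(len(first)),                                            # index 5
        len({r.period for r in first}) / periods_elapsed,             # index 6
    ]


def split_abnormal(
    features: Dict[str, List[float]], q: float
) -> Tuple[Dict[str, List[float]], Dict[str, List[float]]]:
    """Separate users whose total deduction (index 4) exceeds the q-quantile."""
    totals = sorted(f[3] for f in features.values())
    # Assumed estimator: nearest-rank quantile.
    threshold = totals[max(1, math.ceil(q * len(totals))) - 1]
    normal = {u: f for u, f in features.items() if f[3] <= threshold}
    abnormal = {u: f for u, f in features.items() if f[3] > threshold}
    return normal, abnormal


def standardize(features: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Assumed standardization: per-dimension z-score; constant columns map to 0."""
    cols = list(zip(*features.values()))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return {
        u: [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(f, means, stds)]
        for u, f in features.items()
    }
```

In this sketch the abnormal users returned by `split_abnormal` are simply set aside; merging them back into a hierarchic user set after clustering depends on the predetermined policy, which is not specified in the claims and is therefore left out.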
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610901862.0A CN106649517A (en) | 2016-10-17 | 2016-10-17 | Data mining method, device and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610901862.0A CN106649517A (en) | 2016-10-17 | 2016-10-17 | Data mining method, device and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106649517A (en) | 2017-05-10 |
Family
ID=58855799
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610901862.0A Pending CN106649517A (en) | 2016-10-17 | 2016-10-17 | Data mining method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649517A (en) |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1237912; Country of ref document: HK |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: WD; Ref document number: 1237912; Country of ref document: HK |