CN109582741A - Feature data processing method and apparatus - Google Patents

Feature data processing method and apparatus

Info

Publication number
CN109582741A
CN109582741A (application CN201811359743.2A; granted publication CN109582741B)
Authority
CN
China
Prior art keywords
sample set
data
specified
scaled
outlier data
Prior art date
Legal status
Granted
Application number
CN201811359743.2A
Other languages
Chinese (zh)
Other versions
CN109582741B (en)
Inventor
刘松吟
董扬
Current Assignee
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811359743.2A
Publication of CN109582741A
Application granted
Publication of CN109582741B
Legal status: Active
Anticipated expiration

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this specification disclose a feature data processing method and apparatus. The method comprises: determining outlier data in a specified feature of a sample set; scaling the outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data remains larger than the non-outlier data in the specified feature of the pre-scaling sample set; clustering the scaled sample set; and, based on the plurality of clusters obtained, separately normalizing the specified feature data of the scaled sample set within the specified feature interval corresponding to each cluster.

Description

Feature data processing method and device
Technical Field
The embodiment of the specification relates to the field of data processing, in particular to a feature data processing method and device.
Background
With the development of the internet, users generate ever more feature data while using it, and this feature data can be widely used and converted into useful information, for example access-loyalty scores, user-value scores, or board-browsing-stickiness scores computed from feature data such as a user's purchase amount, number of purchases, and board browsing records. Such score values can serve as a reference for product operation and can also be used, as discretized data, for model training.
In the process of scoring users, it has been found that much feature data, such as users' fund purchase amounts, is clearly long-tail distributed: the purchase amounts of a large number of users are concentrated in a small interval at the head, while a small number of users have purchase amounts far larger than the average; the data of these few users can be called outlier data.
In the prior art, when feature data with an obvious long-tail distribution is normalized, the distribution of the normalized feature data is still long-tailed. The normalized feature data is therefore concentrated in a very small value range, the discrimination between the normalized values remains very small, and users cannot be evaluated intuitively and reasonably.
Disclosure of Invention
The embodiments of this specification provide a feature data processing method and device to solve the problem that normalized feature data has low discrimination due to the long-tail distribution of the feature data.
The embodiment of the specification adopts the following technical scheme:
in a first aspect, a feature data processing method is provided, including:
determining outlier data in specified features of a sample set;
scaling outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than non-outlier data in the specified features of the pre-scaled sample set;
clustering the scaled sample set;
based on the clustered clusters, respectively carrying out normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster.
In a second aspect, there is provided a feature data processing apparatus comprising:
the outlier data determining module is used for determining outlier data in the specified characteristics of the sample set;
the outlier data scaling module is used for scaling outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than non-outlier data in the specified features of the pre-scaled sample set;
the clustering processing module is used for clustering the scaled sample set;
and the normalization processing module is used for respectively normalizing the specified characteristic data of the scaled sample set in the specified characteristic interval corresponding to each cluster based on the clustered clusters.
In a third aspect, an electronic device is provided, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor performing the operations of:
determining outlier data in specified features of a sample set;
scaling outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than non-outlier data in the specified features of the pre-scaled sample set;
clustering the scaled sample set;
based on the clustered clusters, respectively carrying out normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster.
In a fourth aspect, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the operations of:
determining outlier data in specified features of a sample set;
scaling outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than non-outlier data in the specified features of the pre-scaled sample set;
clustering the scaled sample set;
based on the clustered clusters, respectively carrying out normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster.
At least one technical solution adopted in the embodiments of this specification can achieve the following beneficial effects: by determining the outlier data in the specified feature of the sample set and scaling it, the discrimination of the normalized feature data can be improved; at the same time, the scaled outlier data is still larger than the non-outlier data, which preserves the difference between them; in addition, the embodiments use a clustering method to obtain a plurality of clusters and normalize the specified feature data within the specified feature interval corresponding to each cluster, which fully reflects the aggregation pattern of the users and further improves the discrimination of the normalized feature data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating a feature data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a feature data processing method according to another embodiment of the present disclosure;
FIG. 3 is a block diagram of a feature data processing apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a hardware structure of an electronic device for implementing various embodiments of the present specification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, an embodiment of this specification provides a feature data processing method 100 for solving the problem that normalized feature data has low discrimination due to the long-tail distribution of the feature data. Embodiment 100 includes the following steps:
s102: outlier data in specified features of the sample set is determined.
The sample set mentioned in the embodiment of the present disclosure may be a user sample set, or may be a sample set of other types (e.g., animals and plants), and the embodiment will be described with the user sample set as an example.
The user sample set typically includes a large number of user samples. For each user sample, characteristics with multiple dimensions are generally included, such as the user's age, occupation, fund purchase amount, fund purchase times, online purchase amount, website login times, account registration age, and the like.
The designated feature mentioned in the embodiments of the present specification may be one of the above multidimensional features, for example, the designated feature is a fund purchase amount of the user, and for example, the designated feature is an online purchase amount of the user, and the like.
A specified feature usually contains outlier data. For example, if the specified feature is users' fund purchase amount (which may be a total amount), the purchase amounts of a large number of user samples may be distributed between 0 and 10,000 yuan while the purchase amounts of a very small number of user samples lie far beyond this interval. In this specification, feature data that differs greatly from the majority of the data in the specified feature may be determined to be outlier data; for example, fund purchase amounts larger than 10,000 yuan may be determined to be outlier data.
Alternatively, embodiments of this specification may employ Chebyshev's theorem, or the maximum-minimum (interquartile-range) rule of box plots, to determine the outlier data in a specified feature of the sample set.
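As a hedged illustration of the box-plot alternative mentioned above, the sketch below flags values beyond the upper box-plot fence (Q3 + 1.5·IQR) as outlier data; the function name and the choice of checking only the upper fence are illustrative assumptions for non-negative, right-tailed data such as purchase amounts.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Box-plot rule: flag values above Q3 + k*IQR (the box-plot 'maximum')
    as outlier data. Returns (outliers, non_outliers). Only the upper fence
    is checked, since long-tail feature data such as purchase amounts is
    non-negative."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    upper = q3 + k * (q3 - q1)
    return [v for v in values if v > upper], [v for v in values if v <= upper]
```

For example, in a sample of 100 amounts spread over 0-99 yuan plus one amount of 10,000 yuan, only the 10,000-yuan value is flagged.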
S104: and scaling the outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than the non-outlier data in the specified features of the pre-scaled sample set.
As described above, outlier data is generally much larger than non-outlier data; this step therefore scales the outlier data, that is, reduces its value, while leaving the non-outlier data unprocessed, thereby obtaining the scaled sample set.
It should be noted that, in the embodiments of the present specification, both the outlier data and the non-outlier data are for the above-mentioned specified features, and other feature data of the sample set is not generally processed.
In the embodiments of this specification, the scaled outlier data remains larger than the non-outlier data in the specified feature of the pre-scaling sample set. For example, if the feature data of users whose fund purchase amount lies in the range 0-10,000 is called non-outlier data and feature data larger than 10,000 is called outlier data, then the outlier data is still larger than 10,000 after scaling, but usually much smaller than its pre-scaling value.
In addition, for outlier data with different sizes, the size relationship of the scaled data still maintains the size relationship before scaling, for example, for outlier data, the scaled data of 20,000 is still smaller than the scaled data of 40,000, and thus the difference between the data can be ensured.
By scaling the outlier data, the long tail effect can be greatly reduced, and meanwhile, the data after scaling the outlier data is larger than the non-outlier data in the specified features of the sample set before scaling, and the difference between the data can still be ensured.
Optionally, in an embodiment of this specification, a scaling factor may be predetermined; the scaling factor may be computed logarithmically, and the outlier data is then scaled based on this factor.
S106: and clustering the scaled sample set.
The embodiments of this specification may apply effective clustering algorithms, including the K-means algorithm, the expectation-maximization (EM) algorithm, and the density-based clustering algorithm DBSCAN, to cluster the scaled sample set into a plurality of clusters. Each cluster generally contains a number of user samples with high mutual similarity; for example, the user samples in one cluster may have similar user value, similar stickiness to the website boards they browse, or similar access loyalty.
In this step, the clustering of the scaled sample set may use not only the specified feature mentioned above but also a plurality of features other than the specified feature. For example, if the specified feature is the user's fund purchase amount, clustering may use the fund purchase amount together with features such as the user's age, occupation, fund purchase frequency, login frequency, and registration age. This improves the accuracy of the clustering result, that is, it ensures that the similarity between users within one cluster is high while the similarity between user samples in different clusters is weak.
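As a minimal sketch of this clustering step over multi-feature sample vectors, the code below implements Lloyd's k-means; in practice a library implementation (K-means, EM, or DBSCAN as named above) would be used, and the function name, iteration count, and seeding here are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means sketch. `points` is a list of equal-length
    feature vectors (e.g. [purchase_amount, age, login_count, ...])."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each sample to its nearest center (squared Euclidean distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # recompute each center as the mean of its members
                centers[i] = [sum(dim) / len(cl) for dim in zip(*cl)]
    return centers, clusters
```

On two well-separated groups of one-dimensional samples, the centers converge to the group means, which is what makes the later interval division meaningful.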
S108: based on the clustered clusters, respectively carrying out normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster.
Step S106 clusters the scaled sample set into a plurality of clusters, each generally containing a number of user samples. In this step, the center point of each cluster may first be obtained; the clusters are then sorted by center point in ascending order, and the boundaries of the sorted clusters are fine-tuned to obtain a plurality of specified feature intervals.
All the specified feature data then falls into the corresponding specified feature interval based on the new boundaries; generally speaking, one cluster corresponds to one specified feature interval. The specified feature data mentioned here includes not only the non-outlier data mentioned above but also the scaled outlier data.
In the embodiment of the specification, a clustering method is adopted to obtain a plurality of specified characteristic intervals, so that the user samples in the sample set are divided into a plurality of intervals, the aggregation rule of the users can be fully reflected, and the discrimination between the specified characteristic data of the users is improved.
After the designated feature intervals are obtained through the above operations, normalization processing may be performed on the designated feature data falling into the designated feature intervals, for example, if 5 designated feature intervals are obtained, normalization processing may be performed on all the designated feature data falling into each designated feature interval.
This step may specifically use max-min standardization, z-score standardization, equal-frequency binning, and the like to normalize the specified feature data.
Optionally, as an embodiment, this step may normalize the specified feature data in the specified feature interval corresponding to each cluster based on the following formula:

x̃_i = (j − 1) + (x_i − min_j) / (max_j − min_j)

where j represents the number of the specified feature interval corresponding to each cluster and takes a value between 1 and k, with k being the total number of specified feature intervals;
x_i represents the ith item of specified feature data in the jth specified feature interval before normalization;
x̃_i represents the ith item of specified feature data in the jth specified feature interval after normalization;
max_j represents the maximum value of the jth specified feature interval;
min_j represents the minimum value of the jth specified feature interval.
With this formula, the score of each user sample within its specified feature interval is obtained at the same time as normalization: the score of a user sample in the 1st specified feature interval is between 0 and 1; the score in the 2nd interval is between 1 and 2; and so on, up to the kth interval, whose scores lie between k − 1 and k.
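The per-interval scoring described above can be sketched as follows; the formula matches the stated property that data in the jth interval maps to a score in [j − 1, j], and the function name and boundary handling are illustrative assumptions.

```python
def interval_score(x, boundaries):
    """Map x to a score in [0, k] given sorted interval boundaries
    (y0, y1, ..., yk): data in the j-th interval [y_{j-1}, y_j] maps to
    (j - 1) + (x - y_{j-1}) / (y_j - y_{j-1}), i.e. into [j - 1, j]."""
    k = len(boundaries) - 1
    for j in range(1, k + 1):
        lo, hi = boundaries[j - 1], boundaries[j]
        if lo <= x <= hi:
            return (j - 1) + (x - lo) / (hi - lo)
    raise ValueError("x lies outside all intervals")
```

For boundaries (0, 10, 100, 1000), an amount of 5 scores 0.5 (first interval), 55 scores 1.5 (second), and 550 scores 2.5 (third).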
These scores can also reflect user value, user loyalty, and the like. A score can be used directly as reference data for operations, or as input for model training or inference.
In the feature data processing method provided by the embodiments of this specification, determining the outlier data in the specified feature of the sample set and scaling it improves the discrimination of the normalized feature data; at the same time, the scaled outlier data is still larger than the non-outlier data, which preserves the difference between them; in addition, a clustering method is used to obtain a plurality of clusters and the specified feature data within the specified feature interval corresponding to each cluster is normalized separately, which fully reflects the aggregation pattern of the users and further improves the discrimination of the normalized feature data.
The embodiments of this specification effectively scale the feature data of the specified feature twice. The first scaling is applied only to the outlier data, which removes the long-tail character while still preserving the differences between the feature data; the second scaling is applied to all the feature data (the non-outlier data together with the already-scaled outlier data) in the form of normalization, which improves the discrimination of the normalized feature data.
As an example of the improved discrimination after normalization: two users with total fund purchase amounts of 1,000 yuan and 100,000 yuan obviously belong to two categories; but if the whole sample set contains a user with a total purchase amount of 10,000,000 yuan, the first two users are likely to be mistakenly classified into one category under the influence of the outlier data, that is, the discrimination between 1,000 yuan and 100,000 yuan becomes small, which is obviously unreasonable.
Conventional data normalization approaches include data binning, for example equal-width binning. Equal-width binning fixes the number of bins and divides the data into bins of equal width based on the maximum and minimum of the data set. It is sensitive to outlier data: under a long-tail distribution the bin width becomes large, the head bins contain most of the data while the tail bins contain almost none, the discrimination of the data in the head bins cannot be guaranteed, and 1,000-yuan and 100,000-yuan data may well be placed in the same bin. In the embodiments of this specification, the outlier data in the specified feature of the sample set is determined in advance and scaled, which reduces the influence of the long-tail distribution as far as possible and improves the discrimination of the normalized feature data.
The embodiments of this specification scale the outlier data and then cluster, which eliminates the influence of the long-tail distribution while fully reflecting the aggregation pattern of the users; for example, two users with total fund purchases of 1,000 yuan and 100,000 yuan are clustered into two categories, and the discrimination between the data is maintained.
In the step S102 of the embodiment 100, the outlier data in the specified feature of the sample set is determined, which may specifically adopt the following method:
determining the mean and standard deviation of the specified feature of the sample set; and
determining the feature data in the specified feature of the sample set that lies outside m standard deviations of the mean as outlier data, where m is a positive number.
Considering that typical feature data is positive, the feature data outside m standard deviations of the mean usually lies above the mean plus m standard deviations.
With this method of determining outlier data, the mean and standard deviation of the specified feature are fully utilized, which improves the accuracy of the outlier data obtained.
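The mean-and-standard-deviation rule just described can be sketched as below; the function name is illustrative, and checking only the upper side follows the note above that typical feature data is positive.

```python
import statistics

def mean_std_outliers(values, m=3):
    """Flag feature data above mean + m*stdev as outlier data, per the
    Chebyshev-style rule in the text. Only the upper side is checked,
    since typical feature data is positive with a long right tail.
    Returns (outliers, threshold)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    threshold = mu + m * sigma
    return [v for v in values if v > threshold], threshold
```

With 99 purchase amounts of 100 yuan and one of 100,000 yuan, only the 100,000-yuan value exceeds the μ + 3σ threshold.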
Further, step S104 of embodiment 100 mentions scaling the outlier data in the sample set; specifically, the outlier data may be scaled based on the following formula:

x̃_i = (μ + mσ) × scale_i, where scale_i = 1 + ln(x_i / (μ + mσ))

wherein:
x̃_i represents the ith item of scaled outlier data;
x_i represents the ith item of outlier data in the sample set;
μ represents the mean of the specified feature of the sample set;
σ represents the standard deviation of the specified feature of the sample set.

This formula is equivalent to applying logarithmic processing to the outlier data. The scaling dimension scale_i in the formula is greater than 1, which guarantees that the scaled outlier data is larger than the non-outlier data, since all data within the μ + mσ range is non-outlier.
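A sketch of a scaling of this form is given below. The exact expression is an illustrative assumption chosen to match the properties stated in the text: it is logarithmic, its scale factor exceeds 1 for any outlier, the mapping is monotone (so the pre-scaling order of outliers is preserved), and the result stays above the non-outlier ceiling μ + mσ.

```python
import math

def scale_outlier(x, mu, sigma, m=3):
    """Compress an outlier logarithmically while keeping it above the
    non-outlier ceiling mu + m*sigma. The specific expression is an
    illustrative assumption: scale = 1 + ln(x / (mu + m*sigma)) > 1
    whenever x exceeds the ceiling, and the mapping is monotone."""
    ceiling = mu + m * sigma
    if x <= ceiling:
        return x  # non-outlier data is left unchanged
    scale = 1 + math.log(x / ceiling)
    return ceiling * scale
```

For μ = 100, σ = 50, m = 3 (ceiling 250), an outlier of 20,000 shrinks to roughly 1,346 while an outlier of 40,000 shrinks to roughly 1,519: both stay above 250, and their order is preserved.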
In some of the above embodiments, before normalizing the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster, the method may further include the following steps:
sequencing the clustered clusters based on the central points of the clustered clusters;
and finely adjusting the boundaries of the sorted clusters to obtain the designated characteristic interval corresponding to each cluster.
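The two steps above (sorting the clusters by center point, then fine-tuning their boundaries into intervals) can be sketched as follows; placing each internal boundary at the midpoint between adjacent centers is an illustrative choice, since the text only says the boundaries are fine-tuned.

```python
def intervals_from_centers(centers, data_min, data_max):
    """Derive interval boundaries (y0, ..., yk) from k cluster center
    points: sort the centers, place each internal boundary at the midpoint
    between adjacent centers, and close the ends with the data min/max.
    The midpoint rule is an illustrative assumption."""
    cs = sorted(centers)
    bounds = [data_min]
    for a, b in zip(cs, cs[1:]):
        bounds.append((a + b) / 2)
    bounds.append(data_max)
    return bounds
```

For example, cluster centers 5, 50, and 500 over data spanning 0 to 1,000 yield the boundaries (0, 27.5, 275, 1000), i.e. three specified feature intervals.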
Through this embodiment, a plurality of specified feature intervals is obtained based on the clustering method; specified feature data of different kinds can be divided into different intervals, which improves the discrimination between the data.
In some of the above embodiments, before normalizing the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster, the method may further include the following steps:
determining whether the specified feature data in the clusters after clustering is long-tail distributed; and
determining, based on the judgment result, whether to re-determine the outlier data in the specified feature of the sample set. Optionally, this includes:
if the specified feature data falling into the clusters after clustering is long-tail distributed, reducing the value of m and re-determining the outlier data in the specified feature of the sample set based on the reduced value of m.
Through the embodiment, the influence caused by the outlier data can be further avoided, and the discrimination between the data is improved.
Optionally, the clustering process performed on the scaled sample set in the several embodiments includes:
and clustering the zoomed sample set according to the characteristics of the zoomed sample set and a preset clustering algorithm.
As shown in fig. 2, an embodiment of this specification provides a feature data processing method 200 for solving the problem that normalized feature data has low discrimination due to the long-tail distribution of the feature data. Embodiment 200 includes the following steps:
s202: and determining a correlation coefficient matrix among the plurality of characteristics of the sample set, and screening the plurality of characteristics based on the correlation coefficient matrix.
The sample set in the embodiment of the present specification may be specifically a user sample set, each user sample generally includes features of multiple dimensions, and in this step, the features may be filtered, and redundant features may be eliminated, so that the processing efficiency of clustering in subsequent steps is improved.
This step may specifically determine a correlation coefficient matrix between the plurality of features of the sample set. In this matrix, the closer the correlation coefficient between two features is to 1 or −1, the stronger the correlation between them; the closer it is to 0, the weaker the correlation.
Based on the correlation coefficient matrix, if the correlation coefficient between a certain feature and one of the remaining features is close to 1 or −1, that feature may be removed. The correlation coefficient may specifically be the Pearson coefficient; optionally, the embodiment may also obtain a correlation matrix using a chi-square test, an R-squared test, and the like.
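The Pearson-based screening described above can be sketched as below; the greedy keep-or-drop rule and the 0.95 threshold are illustrative assumptions, and the function names are not from the original.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length feature columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

def drop_redundant(features, threshold=0.95):
    """Greedy screening sketch: walk the feature columns in order and drop
    any column whose |correlation| with an already-kept column exceeds the
    threshold. `features` maps feature name -> list of values."""
    kept = {}
    for name, col in features.items():
        if all(abs(pearson(col, kc)) <= threshold for kc in kept.values()):
            kept[name] = col
    return kept
```

For instance, a column that is an exact multiple of another (correlation 1) is dropped, while a weakly correlated column is kept, which reduces redundant work in the later clustering step.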
S204: determining outlier data in the specified characteristics of the sample set, and scaling the outlier data to obtain a scaled sample set.
In the embodiments of this specification, the specified feature is described by taking the user's fund purchase amount as an example. This step may first obtain the fund purchase amounts (x_1, x_2, …, x_n) in the sample set and then, according to Chebyshev's theorem, determine the feature data lying outside m standard deviations σ of the mean μ as outlier data.
In the embodiments of this specification, since the fund purchase amount is positive, the feature data outside m standard deviations σ of the mean μ consists of large values, and negative values generally do not occur. Of course, in other embodiments, feature data outside m standard deviations σ of the mean μ may also take negative values.
After the outlier data is determined, it may be scaled based on the following formula:

x̃_i = x_i,                  if x_i ≤ μ + mσ
x̃_i = (μ + mσ) × scale_i,   if x_i > μ + mσ

The scaling dimension scale in the formula applies logarithmic processing, which removes the long-tail character of the specified feature data while still preserving the differences between the data:

scale_i = 1 + ln(x_i / (μ + mσ))

As can be seen from the two equations above, the specified feature data lying within m standard deviations σ of the mean μ (that is, the non-outlier data) is not processed; only the outlier data is scaled.
The scaling dimension scale is greater than 1, so the outlier data is still larger than the non-outlier data after being scaled, which further preserves the differences between the data. In addition, through the dimension scale, outlier data of different sizes keeps its pre-scaling order after scaling, which also preserves the differences between the data.
In the two formulas above:
x_i represents the ith item of specified feature data of the sample set (which may be outlier data or non-outlier data);
μ represents the mean of the specified feature data;
σ represents the standard deviation of the specified feature data;
x̃_i represents the ith item of processed specified feature data (including the scaled outlier data and the unscaled non-outlier data).
S206: and clustering the scaled sample set to obtain a plurality of clusters.
This step may specifically cluster the scaled sample set based on the processed specified feature data and a plurality of other feature data to obtain k clusters, so that the processed specified feature data (x̃_1, x̃_2, …, x̃_n) also falls into the k clusters.
The center points (y_1, y_2, …, y_k) of the k clusters and the amount of data in each cluster are then obtained. Optionally, the embodiment may further determine, based on the data count of each cluster, whether to re-determine the outlier data in the specified feature of the sample set. Specifically:
if the count distribution of the clusters still exhibits an obvious long tail, it needs to be determined whether some outlier data was missed because the value of m in step S204 was too large, and the outlier data needs to be re-screened with a reduced value of m.
The final clustering results are then sorted by center point in ascending order.
S208: and determining the designated characteristic interval corresponding to each cluster.
This step may fine-tune the clustering boundaries based on the cluster center points (y_1, y_2, …, y_k) using a visual division method, obtaining the boundaries (y_0, y_1, y_2, …, y_k), where [y_0, y_1] forms the first interval, [y_1, y_2] the second interval, …, and [y_{k−1}, y_k] the kth interval.
When fine-tuning the boundaries, the principle that the boundary division should have a clear meaning for reading and understanding may be followed; for example, the interval [985, 10976] is fine-tuned to [1000, 10000].
After the k intervals are obtained, all the processed specified feature data (x̃_1, x̃_2, …, x̃_n) falls into the corresponding interval based on the new boundaries.
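The readable-boundary fine-tuning illustrated above (985 → 1000, 10976 → 10000) can be sketched as rounding each boundary to one significant figure; this particular rounding rule is an illustrative interpretation of the fine-tuning principle, not the patent's prescribed method.

```python
import math

def round_boundary(x):
    """Round a boundary to one significant figure so interval edges read
    naturally (e.g. 985 -> 1000, 10976 -> 10000). The one-significant-figure
    rule is an illustrative assumption."""
    if x == 0:
        return 0
    exp = math.floor(math.log10(abs(x)))
    return round(x / 10 ** exp) * 10 ** exp
```

Applied to the raw boundaries (985, 10976) this yields (1000, 10000), matching the example in the text.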
S210: and normalizing the specified feature data falling into the specified feature interval.
Assuming the number of specified feature intervals (intervals for short) is k: for each of the k intervals, the specified feature data falling in it is again subjected to outlier determination and outlier scaling according to step S204, to ensure that no outlier data exists within any interval.
Then the specified feature data falling into each specified feature interval is subjected to max-min linear normalization; data x_i falling into the jth interval [y_{j−1}, y_j] is transformed as:

x̃_i = (j − 1) + (x_i − y_{j−1}) / (y_j − y_{j−1})

In this way, the processed specified feature data (x̃_1, x̃_2, …, x̃_n) falls into [0, k], and the influence of the long-tail data is removed.
In the formula above, j represents the number of the interval and takes a value between 1 and k, with k being the total number of intervals;
x_i represents the ith item of specified feature data in the jth interval before normalization;
x̃_i represents the ith item of specified feature data in the jth interval after normalization;
y_j represents the maximum value of the jth interval;
y_{j−1} represents the minimum value of the jth interval.
Through this formula, the score of each user sample within its specified feature interval is obtained at the same time as normalization; for example, the score of a user sample in the 1st specified feature interval lies between 0 and 1; the score in the 2nd specified feature interval lies between 1 and 2; …; and the score in the k-th specified feature interval lies between k-1 and k.
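The per-interval transformation can be sketched directly; the boundary values below are illustrative (matching the [1000, 10000] fine-tuning example earlier), and `normalize_by_interval` is an assumed helper name:

```python
import numpy as np

def normalize_by_interval(x, boundaries):
    """Max-min normalize each value within its interval so that data in
    the j-th interval [y(j-1), yj] map to scores in [j-1, j].
    `boundaries` is the full boundary list (y0, y1, ..., yk)."""
    b = np.asarray(boundaries, dtype=float)
    x = np.asarray(x, dtype=float)
    j = np.clip(np.digitize(x, b[1:-1]), 0, len(b) - 2)  # 0-based interval
    lo, hi = b[j], b[j + 1]
    return j + (x - lo) / (hi - lo)

scores = normalize_by_interval([0.0, 500.0, 1000.0, 5500.0, 10000.0],
                               boundaries=[0.0, 1000.0, 10000.0])
print(scores)  # scores 0, 0.5, 1, 1.5, 2 — each within [j-1, j]
```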
These scores can also reflect user value, user loyalty, and the like. The score can be used directly as reference data for operations and maintenance, or as input for model training or model use.
In the embodiment of this specification, the data are divided into several groups by a clustering algorithm so that each group can be normalized separately, preserving the discrimination between data while eliminating the influence of long-tail data as far as possible.
In the feature data processing method provided in the embodiments of this specification, determining outlier data in the specified features of the sample set and scaling that outlier data improves the discrimination of the feature data after normalization; meanwhile, the scaled outlier data remain larger than the non-outlier data, so the difference between the two is preserved; in addition, a clustering method is used to obtain multiple clusters, and normalization is performed on the specified feature data within the specified feature interval corresponding to each cluster, so that the aggregation pattern of the users is fully reflected and the discrimination of the normalized feature data is further improved.
Using the method provided in the embodiments of this specification, user value was scored based on the user's fund purchase amount, and the proportions of users in each score band from 0 to 5 were [58.64%, 27.79%, 8.74%, 4.32%, 0.52%]. Without this method, the proportions obtained with equal-width buckets were [99.98%, 0.02%, 0, 0, 0].
For evaluating a user's fund purchase amount, both the difference between low-net-value and high-net-value users and the degree of differentiation within the same group must be considered. Equal-width bucketing is clearly unreasonable here: under the influence of long-tail data it classifies the vast majority of users into the low-net-value category. The division method of this embodiment is more accurate: it eliminates the influence of the long-tail distribution while still reflecting the differing purchasing power of users at different levels.
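The failure mode of equal-width bucketing on long-tail data can be reproduced on synthetic data; the log-normal parameters below are illustrative and the resulting figures are not the fund-purchase data cited above:

```python
import numpy as np

rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=6.0, sigma=2.0, size=10_000)  # synthetic long tail

k = 5
width = amounts.max() / k                       # equal-width bucket size
bucket = np.minimum((amounts // width).astype(int), k - 1)
equal_width_share = np.bincount(bucket, minlength=k) / len(amounts)
# Nearly all samples fall into the first (lowest) bucket, mirroring the
# [99.98%, 0.02%, 0, 0, 0] behavior described in the text.
```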
The above description introduces embodiments of the feature data processing method in detail. This specification further provides a feature data processing apparatus; as shown in fig. 3, the apparatus 300 includes:
an outliers determination module 302 that may be used to determine outliers in specified features of the sample set;
the outlier scaling module 304 may be configured to scale outlier data in the sample set to obtain a scaled sample set, where the scaled outlier data is larger than non-outlier data in the specified feature of the sample set before scaling;
an aggregation processing module 306, configured to perform clustering processing on the scaled sample set;
the normalization processing module 308 may be configured to, based on the clustered multiple clusters, perform normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster, respectively.
The feature data processing apparatus provided in the embodiments of this specification improves the discrimination of the normalized feature data by determining outlier data in the specified features of the sample set and scaling that outlier data; meanwhile, the scaled outlier data remain larger than the non-outlier data, so the difference between the two is preserved; in addition, a clustering method is used to obtain multiple clusters, and normalization is performed on the specified feature data within the specified feature interval corresponding to each cluster, so that the aggregation pattern of the users is fully reflected and the discrimination of the normalized feature data is further improved.
Optionally, as an embodiment, the outlier data determining module 302 determining outlier data in the specified features of the sample set includes:
determining a mean and a standard deviation of specified features of the sample set;
and determining, as outlier data, the feature data in the specified features of the sample set that lie more than m standard deviations from the mean, where m is a positive number.
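The rule implemented by module 302 can be sketched as follows; the function name and sample values are illustrative assumptions:

```python
import numpy as np

def find_outliers(x, m=3.0):
    """Flag feature values lying more than m standard deviations from
    the mean, returning the boolean mask together with mean and std."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    mask = np.abs(x - mu) > m * sigma
    return mask, mu, sigma

values = np.array([1.0, 2.0, 2.5, 3.0, 2.2, 1.8, 2.1, 100.0])
mask, mu, sigma = find_outliers(values, m=2.0)
print(values[mask])  # only the long-tail point 100.0 is flagged
```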
Optionally, as an embodiment, the scaling of the outlier data in the sample set by the outlier data scaling module 304 includes: the outlier scaling module 304 scales outlier data in the sample set based on the following formula:
where x̂i represents the i-th item of the scaled outlier data;
xi represents the i-th outlier data item in the sample set;
μ represents the mean of the specified features of the sample set;
σ represents the standard deviation of the specified feature of the sample set.
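The scaling formula itself is not reproduced in this text. One scaling that satisfies the stated requirement (scaled outliers stay above all non-outlier data, i.e. above μ + mσ, while no longer dominating the range) is a logarithmic compression above that threshold; this is an illustrative assumption, not the patent's own formula:

```python
import numpy as np

def scale_outliers(x, m=3.0):
    """Compress values above mu + m*sigma so they remain larger than
    all non-outlier data but no longer stretch the feature range.
    Illustrative only: the patent's own formula is not reproduced here."""
    x = np.asarray(x, dtype=float).copy()
    mu, sigma = x.mean(), x.std()
    cap = mu + m * sigma
    high = x > cap
    x[high] = cap + np.log1p(x[high] - cap)  # still > cap, strongly compressed
    return x

scaled = scale_outliers(np.array([1.0, 2.0, 2.5, 3.0, 2.2, 1.8, 2.1, 1000.0]),
                        m=2.0)
```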
Optionally, as an embodiment, the apparatus 300 further includes a specified feature interval obtaining module (not shown), which may be configured to:
sorting the clusters obtained from clustering based on their center points;
and fine-tuning the boundaries of the sorted clusters to obtain the designated feature interval corresponding to each cluster.
Optionally, as an embodiment, the apparatus 300 further includes a first determining module (not shown), which may be configured to:
determining whether the specified feature data in the clusters after the clustering processing follow a long-tail distribution;
based on the determination, it is determined whether to re-determine outlier data in the specified features of the sample set.
Optionally, as an embodiment, the determining whether to re-determine outlier data in the specified features of the sample set based on the determination result includes:
and if the specified feature data falling into the clusters after the clustering processing follow a long-tail distribution, reducing the value of m, and re-determining outlier data in the specified features of the sample set based on the reduced value of m.
Optionally, as an embodiment, the clustering the scaled sample set by the aggregation processing module 306 includes:
and clustering the scaled sample set according to the features of the scaled sample set and a preset clustering algorithm.
Optionally, as an embodiment, the apparatus 300 further includes a feature filtering module (not shown), which may be configured to:
determining a matrix of correlation coefficients between a plurality of features of the sample set;
and screening a plurality of characteristics of the sample set based on the correlation coefficient matrix.
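A sketch of the correlation-based screening, assuming a greedy rule that drops one feature from each highly correlated pair; the 0.9 threshold, function name, and feature names are illustrative:

```python
import numpy as np

def screen_features(X, names, threshold=0.9):
    """Keep a feature only if its absolute correlation with every
    already-kept feature is below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < threshold for i in keep):
            keep.append(j)
    return [names[j] for j in keep]

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.01, size=200)  # near-duplicate of a
c = rng.normal(size=200)
kept = screen_features(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept)  # b is dropped as redundant with a
```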
Optionally, as an embodiment, the normalizing the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster by the normalization processing module 308 respectively includes:
respectively carrying out normalization processing on the designated feature data in the designated feature interval corresponding to each cluster based on the following formula:
x̂i = (j - 1) + (xi - yj-1)/(yj - yj-1)
where j represents the number of the designated feature interval corresponding to each cluster;
xi represents the i-th item of the designated feature data in the j-th designated feature interval before normalization;
x̂i represents the i-th item of the designated feature data in the j-th designated feature interval after normalization;
yj represents the maximum value of the j-th designated feature interval;
yj-1 represents the minimum value of the j-th designated feature interval.
Optionally, as an embodiment, the apparatus 300 further includes a second determining module (not shown), which may be configured to:
determining whether outlier data exist in the designated feature interval corresponding to each cluster;
and if so, scaling the outlier data within the designated feature interval.
The feature data processing apparatus 300 of the embodiments of this specification corresponds to the flows of the feature data processing methods 100 and 200 described above; each unit/module and the other operations and/or functions of the apparatus 300 respectively implement the corresponding flows of methods 100 and 200, and are not described again here for brevity.
An electronic device according to an embodiment of this specification will be described in detail below with reference to fig. 4. Referring to fig. 4, at the hardware level the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. As shown in fig. 4, the memory may include volatile memory such as Random-Access Memory (RAM), and may also include non-volatile memory such as at least one disk memory. Of course, the electronic device may also include the hardware needed for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in fig. 4, but that does not indicate only one bus or one type of bus.
And the memory is used for storing programs. In particular, the program may include program code comprising computer operating instructions. The memory may include both memory and non-volatile storage and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it, forming the feature data processing apparatus at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the operations of method embodiments 100 and 200 described herein.
The methods and apparatuses disclosed in the embodiments of figs. 1 to 2 may be applied to or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
The electronic device shown in fig. 4 may also execute the methods shown in fig. 1 to fig. 2, and implement the functions of the feature data processing method in the embodiments shown in fig. 1 to fig. 2, which are not described herein again in this specification.
Of course, besides the software implementation, the electronic device of the present application does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.
The embodiments of this specification further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the processes of the method embodiments 100 and 200, and can achieve the same technical effects, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (13)

1. A method of feature data processing, comprising:
determining outlier data in specified features of a sample set;
scaling outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than non-outlier data in the specified features of the pre-scaled sample set;
clustering the scaled sample set;
based on the clustered clusters, respectively carrying out normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster.
2. The method of claim 1, the determining outlier data in a specified feature of a sample set comprising:
determining a mean and a standard deviation of specified features of the sample set;
and determining, as outlier data, the feature data in the specified features of the sample set that lie more than m standard deviations from the mean, wherein m is a positive number.
3. The method of claim 2, scaling outlier data in the sample set comprising: scaling outlier data in the sample set based on the following formula:
wherein x̂i represents the i-th item of the scaled outlier data;
xi represents the i-th outlier data item in the sample set;
μ represents the mean of the specified features of the sample set;
σ represents the standard deviation of the specified feature of the sample set.
4. The method according to claim 3, before normalizing the specified feature data in the specified feature interval corresponding to each cluster of the scaled sample set, the method further comprising:
sequencing the clustered clusters based on the central points of the clustered clusters;
and finely adjusting the boundaries of the sorted clusters to obtain the designated characteristic interval corresponding to each cluster.
5. The method according to claim 4, before normalizing the specified feature data in the specified feature interval corresponding to each cluster of the scaled sample set, the method further comprising:
judging whether the specified feature data in the clusters after the clustering processing follow a long-tail distribution;
based on the determination, it is determined whether to re-determine outlier data in the specified features of the sample set.
6. The method of claim 5, the determining whether to re-determine outlier data in the specified features of the sample set based on the determination comprising:
and if the specified feature data falling into the clusters after the clustering processing follow a long-tail distribution, reducing the value of m, and re-determining outlier data in the specified features of the sample set based on the reduced value of m.
7. The method of any of claims 1 to 6, the clustering the scaled sample set comprising:
and clustering the scaled sample set according to the features of the scaled sample set and a preset clustering algorithm.
8. The method of claim 7, prior to the determining outlier data in the specified features of the sample set, the method further comprising:
determining a matrix of correlation coefficients between a plurality of features of the sample set;
and screening a plurality of characteristics of the sample set based on the correlation coefficient matrix.
9. The method according to claim 1, wherein the normalizing the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster respectively comprises:
respectively carrying out normalization processing on the designated feature data in the designated feature interval corresponding to each cluster based on the following formula:
x̂i = (j - 1) + (xi - yj-1)/(yj - yj-1)
wherein j represents the number of the designated feature interval corresponding to each cluster;
xi represents the i-th item of the designated feature data in the j-th designated feature interval before normalization;
x̂i represents the i-th item of the designated feature data in the j-th designated feature interval after normalization;
yj represents the maximum value of the j-th designated feature interval;
yj-1 represents the minimum value of the j-th designated feature interval.
10. The method according to claim 9, before normalizing the specified feature data in the specified feature interval corresponding to each cluster of the scaled sample set, the method further comprising:
judging whether outlier data exist in the designated feature interval corresponding to each cluster;
and if so, scaling the outlier data within the designated feature interval.
11. A feature data processing apparatus comprising:
the outlier data determining module is used for determining outlier data in the specified characteristics of the sample set;
the outlier data scaling module is used for scaling outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than non-outlier data in the specified features of the pre-scaled sample set;
the aggregation processing module is used for clustering the scaled sample set;
and the normalization processing module is used for respectively normalizing the specified characteristic data of the scaled sample set in the specified characteristic interval corresponding to each cluster based on the clustered clusters.
12. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor performing the operations of:
determining outlier data in specified features of a sample set;
scaling outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than non-outlier data in the specified features of the pre-scaled sample set;
clustering the scaled sample set;
based on the clustered clusters, respectively carrying out normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster.
13. A computer-readable storage medium having a computer program stored thereon, which when executed by a processor, performs operations comprising:
determining outlier data in specified features of a sample set;
scaling outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than non-outlier data in the specified features of the pre-scaled sample set;
clustering the scaled sample set;
based on the clustered clusters, respectively carrying out normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster.
CN201811359743.2A 2018-11-15 2018-11-15 Feature data processing method and device Active CN109582741B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811359743.2A CN109582741B (en) 2018-11-15 2018-11-15 Feature data processing method and device


Publications (2)

Publication Number Publication Date
CN109582741A true CN109582741A (en) 2019-04-05
CN109582741B CN109582741B (en) 2023-09-05

Family

ID=65922485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811359743.2A Active CN109582741B (en) 2018-11-15 2018-11-15 Feature data processing method and device

Country Status (1)

Country Link
CN (1) CN109582741B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555717A (en) * 2019-07-29 2019-12-10 华南理工大学 method for mining potential purchased goods and categories of users based on user behavior characteristics
CN111243743A (en) * 2020-01-17 2020-06-05 深圳前海微众银行股份有限公司 Data processing method, device, equipment and computer readable storage medium
CN111581499A (en) * 2020-04-21 2020-08-25 北京龙云科技有限公司 Data normalization method, device and equipment and readable storage medium
WO2020258670A1 (en) * 2019-06-28 2020-12-30 平安科技(深圳)有限公司 Network access abnormality determination method and apparatus, server, and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070033122A1 (en) * 2005-08-04 2007-02-08 First American Real Estate Solutions, Lp Method and apparatus for computing selection criteria for an automated valuation model
CN104216941A (en) * 2013-05-31 2014-12-17 三星Sds株式会社 Data analysis apparatus and method
CN104462802A (en) * 2014-11-26 2015-03-25 浪潮电子信息产业股份有限公司 Method for analyzing outlier data in large-scale data
US20150286707A1 (en) * 2014-04-08 2015-10-08 International Business Machines Corporation Distributed clustering with outlier detection
CN105378714A (en) * 2013-06-14 2016-03-02 微软技术许可有限责任公司 Fast grouping of time series
CN105612554A (en) * 2013-10-11 2016-05-25 冒纳凯阿技术公司 Method for characterizing images acquired through video medical device
CN106156142A (en) * 2015-04-13 2016-11-23 深圳市腾讯计算机***有限公司 The processing method of a kind of text cluster, server and system
CN106203103A (en) * 2016-06-23 2016-12-07 百度在线网络技术(北京)有限公司 The method for detecting virus of file and device
US20170124478A1 (en) * 2015-10-30 2017-05-04 Citrix Systems, Inc. Anomaly detection with k-means clustering and artificial outlier injection
CN106649517A (en) * 2016-10-17 2017-05-10 北京京东尚科信息技术有限公司 Data mining method, device and system
CN107330092A (en) * 2017-07-04 2017-11-07 广西电网有限责任公司电力科学研究院 A kind of production business noise data detection and separation method
CN107644032A (en) * 2016-07-21 2018-01-30 中兴通讯股份有限公司 Outlier detection method and apparatus



Also Published As

Publication number Publication date
CN109582741B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN109544166B (en) Risk identification method and risk identification device
CN109582741A (en) Characteristic treating method and apparatus
CN108763952B (en) Data classification method and device and electronic equipment
CN109711440B (en) Data anomaly detection method and device
CN110489449B (en) Chart recommendation method and device and electronic equipment
CN110019785B (en) Text classification method and device
CN111353850B (en) Risk identification strategy updating method and device and risk merchant identification method and device
CN110751515A (en) Decision-making method and device based on user consumption behaviors, electronic equipment and storage medium
CN112966189B (en) Fund product recommendation system
CN110930218B (en) Method and device for identifying fraudulent clients and electronic equipment
CN110633989A (en) Method and device for determining risk behavior generation model
CN110827086A (en) Product marketing prediction method and device, computer equipment and readable storage medium
CN106878242B (en) Method and device for determining user identity category
CN105989066A (en) Information processing method and device
CN115858774A (en) Data enhancement method and device for text classification, electronic equipment and medium
CN108229564B (en) Data processing method, device and equipment
CN112243247B (en) Base station optimization priority determining method and device and computing equipment
CN110458581B (en) Method and device for identifying abnormal business turnover of merchants
CN113177603B (en) Training method of classification model, video classification method and related equipment
CN115238194A (en) Book recommendation method, computing device and computer storage medium
CN113284027A (en) Method for training group recognition model, and method and device for recognizing abnormal group
CN110458416B (en) Wind control method and wind control device
CN112685650A (en) Commodity searching method, system, equipment and readable storage medium
CN111461892A (en) Method and device for selecting derived variables of risk identification model
CN114820003A (en) Pricing information abnormality identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: P.O. Box 847, Fourth Floor, Capital Building, Grand Cayman, Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant