CN110288468B - Data feature mining method and device, electronic equipment and storage medium - Google Patents

Data feature mining method and device, electronic equipment and storage medium

Info

Publication number
CN110288468B
Authority
CN
China
Prior art keywords
data
cluster
standard
analyzed
feature
Prior art date
Legal status
Active
Application number
CN201910630499.7A
Other languages
Chinese (zh)
Other versions
CN110288468A (en)
Inventor
叶素兰
李国才
刘卉
王秋施
贾怡
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Publication of CN110288468A
Application granted
Publication of CN110288468B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213: Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/03: Credit; Loans; Processing thereof
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Accounting & Taxation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Finance (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Technology Law (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data analysis and discloses a data feature mining method, a data feature mining device, an electronic device and a storage medium. The method comprises the following steps: collecting data samples in a vertical field that meet a predetermined standard and constructing a training data set; processing the training data set to obtain a standard data model; analyzing data to be analyzed to construct a feature vector of the data to be analyzed; and inputting the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined standard. By constructing standard sample feature vectors and clustering them to obtain the standard data model, the user features of any data to be analyzed can be obtained by analysis with the standard data model, so that the probability that the data to be analyzed meets the predetermined standard is obtained accurately.

Description

Data feature mining method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data analysis technologies, and in particular, to a data feature mining method, a data feature mining device, an electronic device, and a storage medium.
Background
In a vertical field, in order to predict the possible behavior of a data sample, an experienced practitioner usually mines sample features based on business experience and builds a sample database from those features and historical sample data, which is then used to predict the possible behavior of a data sample. However, if the data to be analyzed is new to the vertical field, it has no historical behavior, so its behavior cannot be predicted from the sample database. Moreover, this approach relies on human analysis, is strongly limited by human cognition, and has low accuracy. Therefore, the traditional data feature mining method cannot effectively reveal the features of a data sample, and its prediction accuracy for the behavior of the data sample is low.
Disclosure of Invention
The invention provides a data feature mining method, a device, an electronic device and a storage medium, which are used to solve the problems that traditional data feature mining methods cannot reveal data sample features and have low prediction accuracy for data sample behavior.
The first aspect of the embodiment of the invention discloses a data feature mining method, which comprises the following steps:
collecting data samples meeting preset standards in the vertical field, and constructing a training data set based on the data samples meeting the preset standards; the training data set comprises a plurality of sample feature vectors, each sample feature vector corresponding to a data sample meeting a predetermined criterion;
processing the training data set to obtain a standard data model;
analyzing the data to be analyzed to construct a feature vector of the data to be analyzed;
and inputting the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed accords with the preset standard.
The second aspect of the embodiment of the invention discloses a data feature mining device, which comprises:
the training unit is used for collecting data samples meeting preset standards in the vertical field and constructing a training data set based on the data samples meeting the preset standards; the training data set comprises a plurality of sample feature vectors, each sample feature vector corresponding to a data sample meeting a predetermined criterion;
The clustering unit is used for processing the training data set to obtain a standard data model;
the construction unit is used for analyzing the data to be analyzed to construct the feature vector of the data to be analyzed;
and the analysis unit is used for inputting the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed accords with the preset standard.
A third aspect of the embodiment of the present invention discloses an electronic device, including:
a processor;
a memory storing computer-readable instructions which, when executed by the processor, implement the data feature mining method disclosed in the first aspect of the embodiment of the invention.
A fourth aspect of the embodiments of the present invention discloses a computer-readable storage medium storing a computer program that causes a computer to execute a data feature mining method disclosed in the first aspect of the embodiments of the present invention.
The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:
the data feature mining method provided by the invention comprises the following steps: collecting data samples in the vertical field that meet the predetermined standard and constructing a training data set based on those data samples; processing the training data set to obtain a standard data model; analyzing data to be analyzed to construct a feature vector of the data to be analyzed; and inputting the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined standard.
According to the method, the standard sample feature vectors are constructed and clustered with a k-means clustering algorithm to obtain the standard data model, so that once data to be analyzed is obtained, its user features can be derived by analysis with the standard data model and the probability that the data to be analyzed meets the predetermined standard can be obtained accurately.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic view of an apparatus according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data feature mining method disclosed in an embodiment of the present invention;
FIG. 3 is a flow chart of another data feature mining method disclosed in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a data feature mining apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another data feature mining apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The implementation environment of the invention may be an electronic device, such as a smartphone, a tablet computer or a desktop computer. Data samples meeting the predetermined standard may be, for example, data samples blacklisted in an industry or data samples exhibiting a particular behavior.
Fig. 1 is a schematic structural diagram of a data feature mining apparatus according to an embodiment of the present invention. The data feature mining apparatus 100 may be the above-described electronic device. As shown in fig. 1, the data feature mining apparatus 100 may include one or more of the following components: a processing component 102, a memory 104, a power supply component 106, a multimedia component 108, an audio component 110, a sensor component 114, and a communication component 116.
The processing component 102 generally controls overall operation of the data feature mining apparatus 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations, among others. The processing component 102 may include one or more processors 118 to execute instructions to perform all or part of the steps of the methods described below. Further, the processing component 102 can include one or more modules to facilitate interactions between the processing component 102 and other components. For example, the processing component 102 may include a multimedia module for facilitating interaction between the multimedia component 108 and the processing component 102.
The memory 104 is configured to store various types of data to support operation of the data feature mining apparatus 100. Examples of such data include instructions for any application or method operating on the data feature mining apparatus 100. The memory 104 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. Also stored in the memory 104 are one or more modules configured to be executed by the one or more processors 118 to perform all or part of the steps in the methods described below.
The power supply assembly 106 provides power to the various components of the data feature mining apparatus 100. Power components 106 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for data feature mining apparatus 100.
The multimedia component 108 includes a screen between the data feature mining apparatus 100 and the user that provides an output interface. In some embodiments, the screen may include a liquid crystal display (Liquid Crystal Display, LCD for short) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation. The screen may also include an organic electroluminescent display (Organic Light Emitting Display, OLED for short).
The audio component 110 is configured to output and/or input audio signals. For example, the audio component 110 includes a Microphone (MIC) configured to receive external audio signals when the data feature mining apparatus 100 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 104 or transmitted via the communication component 116. In some embodiments, the audio component 110 further comprises a speaker for outputting audio signals.
The sensor assembly 114 includes one or more sensors for providing status assessments of various aspects of the data feature mining apparatus 100. For example, the sensor assembly 114 may detect an on/off state of the data feature mining apparatus 100 and the relative positioning of its components, and may also detect a change in position of the data feature mining apparatus 100 or of one of its components, as well as a temperature change of the data feature mining apparatus 100. In some embodiments, the sensor assembly 114 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 116 is configured to facilitate wired or wireless communication between the data feature mining apparatus 100 and other devices. The data feature mining apparatus 100 may access a Wireless network based on a communication standard, such as WiFi (Wireless-Fidelity). In an embodiment of the present invention, the communication component 116 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an embodiment of the present invention, the communication component 116 further includes a near field communication (Near Field Communication, abbreviated as NFC) module for facilitating short range communications. For example, the NFC module may be implemented based on radio frequency identification (Radio Frequency Identification, RFID) technology, infrared data association (Infrared Data Association, irDA) technology, ultra Wideband (UWB) technology, bluetooth technology, and other technologies.
In an exemplary embodiment, the data feature mining apparatus 100 may be implemented by one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field programmable gate arrays, controllers, microcontrollers, microprocessors, or other electronic elements for performing the methods described below.
Referring to fig. 2, fig. 2 is a flow chart of a data feature mining method according to an embodiment of the present invention. The data feature mining method as shown in fig. 2 may include the steps of:
201. Data samples meeting a predetermined standard in the vertical field are collected, and a training data set is constructed based on the data samples meeting the predetermined standard.
In an embodiment of the present invention, the training data set includes a plurality of sample feature vectors, each sample feature vector corresponding to a data sample meeting a predetermined criterion.
In the embodiment of the invention, when the training data set is constructed, data samples meeting a predetermined standard in a certain vertical field need to be collected first. For example, if the vertical field is the credit industry, the data samples meeting the predetermined standard may be blacklisted users of the credit industry, and the user samples included in the blacklist are then the data samples.
As an alternative embodiment, collecting data samples meeting the predetermined standard in the vertical field and constructing a training data set based on those data samples may be implemented as follows: collect data samples in the vertical field that meet the predetermined standard; process the data samples according to a preset screening rule to obtain standard data samples; analyze the standard data samples to obtain the personal information, device fingerprints and behavior data contained in the standard data and set them as feature indexes of the standard data; and construct standard sample feature vectors from the feature indexes of the standard data to obtain the training data set. The preset screening rule is used to screen out data samples whose data format is non-standard.
Specifically, assume the vertical field is the credit industry and the data samples meeting the predetermined standard are blacklist data samples of the credit industry; the collected blacklist data samples may then be blacklisted user samples of the credit industry, user samples on the publicly available list of dishonest judgment debtors, and so on. First, detailed information of each blacklisted user sample is collected, such as contact information, personal information, identity proof, proof of work income, bank card statements, loan application materials, and information on the credit loan business applied for. The detailed information is then processed according to the preset screening rule: code written according to the rule screens the detailed information, so that invalid blacklisted user samples with missing or wrongly formatted detailed information are filtered out, yielding standard data with complete sample information and the corresponding standard data samples. Next, feature indexes are extracted from the standard data samples. For example, a first feature index is set to personal information such as the user's age, education and occupation; a second feature index is set to the user's device fingerprint information, such as the device identification code, the physical address of the user device and the Wi-Fi addresses commonly used by the device; and a third feature index is set to behavior information such as the user's transaction application frequency and geographic movement data. The extracted feature indexes are assembled into the feature vector of the standard data sample. For example, the feature vector extracted for standard data A may be (junior college education, male, employee; 192.168.1.1, 1234567890112345, 192.168.1.2; applied for a credit loan 3 times per year, located in Guangzhou in 2017). The feature vectors of a number of standard data samples are extracted and packaged together to obtain the training data set.
It can be seen that, by implementing the embodiment of the invention, invalid data samples are screened out of the scattered data samples to obtain the standard data samples, and detailed feature vectors of the standard data samples are extracted, so that the standard data samples and the user features corresponding to them are obtained effectively.
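For illustration only, the following Python sketch outlines how the screening and feature-vector construction described above might look in code. The field names, the concrete screening rule, and the assumption that categorical indexes are later numerically encoded are choices made for this example and are not part of the patent.

```python
# Illustrative sketch only: field names and the screening rule are assumptions.
REQUIRED_FIELDS = [
    "education", "gender", "occupation",        # personal information indexes
    "device_id", "device_mac", "wifi_address",  # device fingerprint indexes
    "loan_applications_per_year", "region",     # behavior data indexes
]

def screen_samples(raw_samples):
    """Apply the preset screening rule: drop samples with missing or empty fields."""
    return [s for s in raw_samples
            if all(s.get(f) not in (None, "") for f in REQUIRED_FIELDS)]

def build_feature_vector(sample):
    """Assemble personal-information, device-fingerprint and behavior indexes into
    one standard sample feature vector (categorical values would normally be
    numerically encoded before clustering)."""
    return [sample[f] for f in REQUIRED_FIELDS]

def build_training_set(raw_samples):
    """Screen the collected blacklist samples and package their feature vectors."""
    return [build_feature_vector(s) for s in screen_samples(raw_samples)]
```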
202. The training data set is processed to obtain a standard data model.
In the embodiment of the invention, because the training data set is large and the standard sample feature vectors it contains need to be classified, a k-means clustering algorithm can be used to train on the training data set and divide the standard sample feature vectors into a plurality of cluster sets, so that the standard sample feature vectors within each cluster set have high similarity.
As an alternative embodiment, processing the training data set to obtain the standard data model may be implemented as follows:
select a preset number of standard sample feature vectors and set them as cluster center points;
then, as long as the resulting cluster sets are not yet final, repeat the following steps:
set the standard sample feature vectors in the training data set other than the cluster center points as cluster distribution points;
for each cluster distribution point, calculate the weighted Euclidean distance between the cluster distribution point and each cluster center point according to the weights of the feature indexes in the standard sample feature vector corresponding to that cluster distribution point;
according to these weighted Euclidean distances, cluster each cluster distribution point into the cluster set of the cluster center point to which its weighted Euclidean distance is shortest, so as to obtain the preset number of cluster sets;
average the cluster distribution points in each cluster set together with the cluster center point of that cluster set, and take the resulting mean as the new cluster center point of the cluster set;
when the new cluster center point of every cluster set determined in the current round is the same as the cluster center point determined in the previous round, determine the preset number of cluster sets obtained in this round as the final cluster sets;
finally, determine the probability that the data corresponding to each final cluster set meets the predetermined standard based on the number of standard sample feature vectors contained in that final cluster set, and take the final cluster sets together with their probabilities of meeting the predetermined standard as the standard data model.
Specifically, the bank may select a number of representative standard sample feature vectors as cluster center points according to the personal experience of experts, meaning that the data samples corresponding to these cluster center points are blacklist data samples with representative user features and are therefore likely to commit fraud when applying for credit business. After the cluster center points are set, the remaining standard sample feature vectors are set as cluster distribution points, and the training data set is processed with the k-means clustering algorithm: the weighted Euclidean distance between each cluster distribution point and each cluster center point is obtained, and each cluster distribution point is clustered into the cluster set of the cluster center point to which its weighted Euclidean distance is shortest, yielding a plurality of cluster sets. For example, if the standard sample feature vector that an expert sets as the center point of cluster set B has the feature indexes (senior high school education, unemployed, applies for a credit loan more than 3 times per year), then the standard sample feature vectors in that cluster set all share these feature indexes, and the differences between them appear in other feature indexes such as the user device fingerprint. After the cluster sets are obtained, the standard sample feature vectors in each cluster set are averaged, and the cluster distribution point corresponding to that mean is set as the new cluster center point of the cluster set. For example, if averaging cluster set B yields the feature indexes (senior high school education, unemployed, applies for a credit loan 3 times per year on average), the feature vector corresponding to these indexes is set as the new center point of cluster set B. All cluster sets are repeatedly averaged in this way to obtain new cluster center points, until the new cluster center point of every cluster set is the same as the cluster center point determined in the previous round, at which point the preset number of cluster sets obtained in this round are determined as the final cluster sets.
It can be seen that, with the cluster center points selected by experts according to business experience and the k-means clustering algorithm, the standard sample feature vectors in the training data set are clustered into a plurality of cluster sets, so that the users corresponding to the standard sample feature vectors are reasonably classified according to the feature indexes of the cluster center points.
Before the weighted Euclidean distance between a cluster distribution point and a cluster center point is calculated from the weights of the feature indexes in the standard sample feature vector, the weight of each feature index is determined according to expert rules, so that feature indexes that the expert rules regard as strongly indicating that the predetermined standard is met receive higher weights than feature indexes regarded as weakly indicating it. Specifically, when the weighted Euclidean distance between a cluster distribution point and a cluster center point is calculated, the expert rules assign a weight to each feature index. For example, whether a user decides to commit fraud has an obvious correlation with the user's education level, while the user's region information has little bearing on that decision; therefore, among the user's feature indexes, education level is given a higher weight than region information. The weighted Euclidean distance between each feature vector and the cluster center point of each cluster set is calculated by the following formula:
$$ d = \sqrt{\sum_{i=1}^{n} \omega_i \,(x_i - y_i)^2} $$

where n feature indexes are used, ω_1, …, ω_n are the weights corresponding to the n feature indexes, x_1, …, x_n are the n feature index values of the feature vector, y_1, …, y_n are the n feature index values of the cluster center point, and d is the weighted Euclidean distance.
It can be seen that by using the above embodiment, the weighted euclidean distance between each feature vector and the cluster center point of each cluster set can be accurately calculated, so that each feature vector is accurately clustered into an appropriate cluster set.
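As a concrete illustration of the formula and clustering steps above, the following Python sketch implements the weighted Euclidean distance and a k-means loop seeded with the chosen cluster center points. It assumes the feature vectors have already been numerically encoded and that the weights and initial centers are supplied; it is a sketch under those assumptions, not the patent's reference implementation.

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """d = sqrt(sum_i w_i * (x_i - y_i)^2), the weighted distance defined above."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

def weighted_kmeans(vectors, initial_centers, weights, max_iter=100):
    """Cluster the training vectors around the given center points, repeating
    assignment and re-averaging until the centers stop changing, mirroring the
    convergence test described in the text."""
    vectors = np.asarray(vectors, dtype=float)
    centers = np.asarray(initial_centers, dtype=float)
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(max_iter):
        # Assignment: each cluster distribution point joins its nearest center.
        labels = np.array([
            int(np.argmin([weighted_euclidean(v, c, weights) for c in centers]))
            for v in vectors
        ])
        # Update: the new center of each cluster set is the mean of its members.
        new_centers = np.array([
            vectors[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```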
As an alternative embodiment, determining the probability that each final cluster set corresponds to meeting the predetermined standard based on the number of standard sample feature vectors it contains may be implemented as follows: for each final cluster set, calculate the ratio of the number of standard sample feature vectors contained in that final cluster set to the total number of standard sample feature vectors contained in the training data set, and take this ratio as the probability that the final cluster set corresponds to meeting the predetermined standard. Specifically, if cluster set B contains 100 standard sample feature vectors and the training data set contains 200 standard sample feature vectors in total, then P(B) = 100/200 = 50%, that is, the fraud probability of the users corresponding to cluster set B is considered to be 50%. It can be seen that, by implementing this embodiment, the probability that a user corresponding to a standard sample feature vector in a cluster set performs the user behavior associated with that cluster set is obtained conveniently.
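A minimal sketch of this proportion-based probability, assuming the cluster labels produced by the k-means sketch above; the standard data model then pairs each final cluster set with this value.

```python
from collections import Counter

def cluster_probabilities(labels):
    """P(cluster) = members of the cluster / total training vectors,
    e.g. 100 members out of 200 training vectors gives 0.5 (50%)."""
    total = len(labels)
    counts = Counter(labels)
    return {cluster: count / total for cluster, count in counts.items()}
```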
It should be understood that the way a probability is calculated differs between user behaviors and between vertical fields; for example, the way a user duration probability is calculated in the insurance industry clearly differs from the way a user fraud probability is calculated in the credit industry. The fraud probability of the credit industry is used here only as an example, and the calculation methods of other vertical fields are not limited.
203. User data of the data to be analyzed is analyzed to construct feature vectors of the data to be analyzed.
In the embodiment of the present invention, after the data to be analyzed is obtained, the feature indexes in the data to be analyzed are extracted by referring to the processing method of step 201, so as to obtain a feature vector of the data to be analyzed in a standard format that the standard data model can process, and the process goes to step 204.
204. The feature vector of the data to be analyzed is input into the standard data model to obtain the probability that the data to be analyzed meets the predetermined standard.
In the embodiment of the invention, the feature vector of the data to be analyzed is input into the standard data model so that it can be clustered into one of the cluster sets of the standard data model; the probability that this cluster set corresponds to meeting the predetermined standard is then the probability that the data to be analyzed meets the predetermined standard.
As an optional implementation, after the distance between the feature vector of the data to be analyzed and the cluster center point of each final cluster set has been calculated, and before the final cluster set whose center point is closest to the feature vector of the data to be analyzed is determined as the final cluster set to which that feature vector belongs, the following check is performed: if the shortest distance between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets is greater than the maximum distance between a cluster center point and the cluster distribution points of its final cluster set, it is determined that the data to be analyzed does not meet the predetermined standard. Specifically, when this shortest distance exceeds that maximum distance, the feature vector of the data to be analyzed differs greatly from the standard sample feature vectors at the cluster distribution points and should not be clustered into any cluster set; in that case the user features of the data to be analyzed can be considered inconsistent with the user features corresponding to the cluster sets, and the feature vector of the data to be analyzed is judged not to meet the predetermined standard. It can be seen that this decision process screens out feature vectors of data to be analyzed that do not meet the predetermined standard, rather than simply clustering them.
As another optional implementation, when the shortest distance between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets is greater than the maximum distance between a cluster center point and the cluster distribution points of its final cluster set, the user data corresponding to the feature vector of the data to be analyzed is pushed to an expert agent terminal so that an expert can identify the data to be analyzed. In this way, manual judgment by an expert avoids misjudging cases where the feature vectors of a certain type of user are so particular that they cannot form an independent cluster set.
As an optional implementation, if the shortest distance is not greater than that maximum distance, the final cluster set whose center point is closest to the feature vector of the data to be analyzed is determined as the final cluster set to which that feature vector belongs; the probability that this final cluster set corresponds to meeting the predetermined standard is taken as the probability that the data to be analyzed meets the predetermined standard; and the feature vector of the data to be analyzed is added to the final cluster set to which it belongs. The step of calculating the weighted Euclidean distances between the cluster distribution points and the cluster center points is then executed again, so as to update the standard data model. Specifically, when the feature vector of the data to be analyzed can be clustered into a cluster set, the probability that the users corresponding to that cluster set meet the predetermined standard is set as the probability that the data to be analyzed meets the predetermined standard. In addition, after the feature vector of the data to be analyzed is added to the cluster set, the mean of the cluster set changes; as the amount of data to be analyzed grows, the existing cluster center point no longer reflects the actual situation of the cluster set. Therefore, whenever a feature vector of data to be analyzed is added to a cluster set, the cluster set needs to be averaged again to obtain a new cluster center point. In this way the standard data model is updated, and the continuously changing feature vectors of data to be analyzed can still be processed normally by the standard data model. It can be seen that updating the standard data model in real time keeps the standard data model in step with the feature vectors of the data to be analyzed and realizes a machine-learning function.
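The inference and model-update behaviour described in the last three paragraphs could be sketched as follows. The helper names, the use of a single largest intra-cluster radius as the rejection threshold, and the expert-review hook are one reading of the text and are assumptions of this example.

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Same weighted distance as in the earlier sketch."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

def analyze(vector, centers, clusters, probabilities, weights, max_radius,
            push_to_expert=None):
    """Score a feature vector against the standard data model.
    `clusters` maps cluster id -> list of member vectors, `probabilities` maps
    cluster id -> probability of meeting the predetermined standard, and
    `max_radius` is the largest distance from any center to a member of its
    cluster (the rejection threshold assumed here)."""
    distances = {k: weighted_euclidean(vector, c, weights)
                 for k, c in enumerate(centers)}
    nearest = min(distances, key=distances.get)
    if distances[nearest] > max_radius:
        # Too far from every final cluster set: judged not to meet the
        # predetermined standard, or optionally pushed to an expert agent.
        if push_to_expert is not None:
            push_to_expert(vector)
        return None
    # Join the nearest cluster, inherit its probability, and refresh its center
    # so the standard data model stays up to date.
    clusters[nearest].append(list(vector))
    centers[nearest] = np.mean(clusters[nearest], axis=0)
    return probabilities[nearest]
```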
It can be seen that, by implementing the method described in fig. 2, a standard data model can be obtained with the algorithm after the standard data samples are analyzed and the training data set is constructed, and once the feature vector corresponding to the data to be analyzed is input into the standard data model, the probability that the data to be analyzed meets the predetermined standard is obtained. This improves the efficiency of identifying the data to be analyzed and reduces the losses caused by manual identification errors. If the invention is implemented in the credit loan industry, the user features of the data to be analyzed can be obtained by timely analysis, the probability that the data to be analyzed will commit fraud can be obtained accurately, and automatic screening of the data to be analyzed can be realized.
Referring to fig. 3, fig. 3 is a flow chart of another data feature mining method according to an embodiment of the present invention. As shown in fig. 3, the data feature mining method may include the steps of:
301. Data samples meeting a predetermined standard in the vertical field are collected, and a training data set is constructed based on the data samples meeting the predetermined standard.
In an embodiment of the present invention, the training data set includes a plurality of sample feature vectors, each sample feature vector corresponds to one data sample, where the sample feature vector is composed of feature indexes of a plurality of dimensions.
302. A Gaussian function is used to analyze the feature indexes included in the standard sample feature vectors, probability density distribution samples of a preset number of standard sample feature vectors are obtained, and the standard sample feature vector with the highest probability in each probability density distribution sample is set as a cluster center point.
In the embodiment of the invention, although selecting cluster center points based on expert experience has a certain rationality, expert experience depends heavily on what the expert already knows; when a novel user behavior appears, expert experience cannot identify it in time, so the selected cluster center points perform poorly and the results produced by the resulting standard data model are naturally inaccurate. Therefore, this embodiment of the invention analyzes the feature indexes included in the sample feature vectors of the training data set with a Gaussian function to obtain the cluster center points, rather than selecting them according to expert experience.
As an optional implementation, a Gaussian function is used to analyze the feature indexes included in the standard sample feature vectors, probability density distribution samples of a preset number of standard sample feature vectors are obtained, and the standard sample feature vector with the highest probability in each probability density distribution sample is set as a cluster center point. Specifically, analyzing the feature indexes of the standard sample feature vectors in the training data set with a Gaussian function yields a number of probability density distribution samples. Within each probability density distribution sample, the standard sample feature vector with the highest probability can be understood as the one whose feature indexes a large number of other standard sample feature vectors resemble. The highest-probability standard sample feature vector of each probability density distribution sample is therefore set as the cluster center point of one cluster set, and the number of probability density distribution samples obtained with the Gaussian function is the number of cluster center points. It can be seen that, with this embodiment of the invention, good cluster center points are obtained before the training data set is trained, which avoids the situation in which cluster center points selected according to expert experience lead, through the limitations of that experience, to poor clustering in the standard data model and to misidentification of the data to be analyzed.
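The patent only states that "a Gaussian function" is used to obtain the probability density distribution samples; one hedged reading, shown below, fits a Gaussian mixture with the preset number of components and takes, for each component, the training vector of highest density as a cluster center point. The use of scikit-learn and SciPy, and the mixture-model interpretation itself, are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def gaussian_cluster_centers(vectors, preset_number, random_state=0):
    """Fit `preset_number` Gaussian components to the standard sample feature
    vectors and return, for each component, the sample with the highest
    probability density as a cluster center point."""
    X = np.asarray(vectors, dtype=float)
    gmm = GaussianMixture(n_components=preset_number,
                          random_state=random_state).fit(X)
    centers = []
    for mean, cov in zip(gmm.means_, gmm.covariances_):
        density = multivariate_normal(mean=mean, cov=cov,
                                      allow_singular=True).pdf(X)
        centers.append(X[int(np.argmax(density))])
    return np.array(centers)
```

Centers obtained this way could then be passed as the initial cluster center points of the weighted k-means sketch given earlier, in place of the expert-chosen ones.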
303. The training data set is processed to obtain a standard data model.
304. The data to be analyzed is analyzed to construct the feature vector of the data to be analyzed.
305. The feature vector of the data to be analyzed is input into the standard data model to obtain the probability that the data to be analyzed meets the predetermined standard.
It can be seen that, by implementing the method described in fig. 3, the cluster center points of the standard data model are selected reasonably with a Gaussian function, which avoids the situation in which subjective factors of expert experience affect the standard data model so that the data to be analyzed cannot be identified accurately.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a data feature mining apparatus according to an embodiment of the present invention. As shown in fig. 4, the data feature mining apparatus may include: a training unit 401, a clustering unit 402, a construction unit 403 and an analysis unit 404, wherein,
a training unit 401, configured to collect data samples meeting a predetermined standard in the vertical field, and construct a training data set based on the data samples meeting the predetermined standard; the training data set comprises a plurality of sample feature vectors, each sample feature vector corresponding to a data sample meeting a predetermined standard;
a clustering unit 402, configured to process the training data set to obtain a standard data model;
A construction unit 403, configured to analyze data to be analyzed to construct feature vectors of the data to be analyzed;
and the analysis unit 404 is configured to input the feature vector of the data to be analyzed into the standard data model, so as to obtain a probability that the data to be analyzed meets a predetermined standard.
In the embodiment of the present invention, after the training unit 401 constructs the training data set, the clustering unit 402 processes the training data set to obtain a standard data model; the construction unit 403 then constructs the feature vector of the data to be analyzed and inputs it to the analysis unit 404 to obtain the probability that the data to be analyzed meets the predetermined standard.
As an alternative embodiment, the training unit 401 collecting data samples meeting the predetermined standard in the vertical field and constructing a training data set based on those data samples may be implemented as follows: the training unit 401 collects data samples in the vertical field that meet the predetermined standard; processes the data samples according to a preset screening rule to obtain standard data samples; analyzes the standard data samples to obtain the personal information, device fingerprints and behavior data contained in the standard data and sets them as feature indexes of the standard data; and constructs standard sample feature vectors from the feature indexes of the standard data to obtain the training data set. The preset screening rule is used to screen out data samples whose data format is non-standard.
Specifically, assume the vertical field is the credit industry and the data samples meeting the predetermined standard are blacklist data samples of the credit industry; the blacklist data samples collected by the training unit 401 may then be blacklisted user samples of the credit industry, user samples on the publicly available list of dishonest judgment debtors, and so on. The training unit 401 first collects detailed information of each blacklisted user, such as contact information, personal information, identity proof, proof of work income, bank card statements, loan application materials, and information on the credit loan business applied for, and then processes the detailed information according to the preset screening rule: code written according to the rule screens the detailed information, so that invalid blacklisted user samples with missing or wrongly formatted detailed information are filtered out, yielding standard data with complete sample information and the corresponding standard data samples. The training unit 401 then extracts feature indexes from the standard data samples. For example, a first feature index is set to personal information such as the user's age, education and occupation; a second feature index is set to the user's device fingerprint information, such as the device identification code, the physical address of the user device and the Wi-Fi addresses commonly used by the device; and a third feature index is set to behavior information such as the user's transaction application frequency and geographic movement data. The extracted feature indexes are assembled into the feature vector of the standard data sample. For example, the feature vector extracted for standard data A may be (junior college education, male, employee; 192.168.1.1, 1234567890112345, 192.168.1.2; applied for a credit loan 3 times per year, located in Guangzhou in 2017). The feature vectors of a number of standard data samples are extracted and packaged together to obtain the training data set.
It can be seen that, when the embodiment of the invention is implemented, the training unit 401 screens invalid data samples out of the scattered data samples to obtain the standard data samples and extracts detailed feature vectors of the standard data samples, so that the standard data samples and the user features corresponding to them are obtained effectively.
As an alternative embodiment, the clustering unit 402 processing the training data set to obtain a standard data model may be implemented as follows:
the clustering unit 402 selects a preset number of standard sample feature vectors and sets them as cluster center points;
then, as long as the resulting cluster sets are not yet final, the following steps are repeated:
set the standard sample feature vectors in the training data set other than the cluster center points as cluster distribution points;
for each cluster distribution point, calculate the weighted Euclidean distance between the cluster distribution point and each cluster center point according to the weights of the feature indexes in the standard sample feature vector corresponding to that cluster distribution point;
according to these weighted Euclidean distances, cluster each cluster distribution point into the cluster set of the cluster center point to which its weighted Euclidean distance is shortest, so as to obtain the preset number of cluster sets;
average the cluster distribution points in each cluster set together with the cluster center point of that cluster set, and take the resulting mean as the new cluster center point of the cluster set;
when the new cluster center point of every cluster set determined in the current round is the same as the cluster center point determined in the previous round, determine the preset number of cluster sets obtained in this round as the final cluster sets;
finally, determine the probability that the users corresponding to each final cluster set meet the predetermined standard based on the number of standard sample feature vectors contained in that final cluster set, and take the final cluster sets together with their probabilities of meeting the predetermined standard as the standard data model.
Specifically, the clustering unit 402 may select a number of representative standard sample feature vectors as cluster center points according to the personal experience of experts, meaning that the users corresponding to these cluster center points are blacklisted users with representative user features and are therefore more likely to commit fraud when applying for credit business. After the cluster center points are set, the clustering unit 402 sets the remaining standard sample feature vectors as cluster distribution points and processes the training data set with the k-means clustering algorithm: the weighted Euclidean distance between each cluster distribution point and each cluster center point is obtained, and the clustering unit 402 clusters each cluster distribution point into the cluster set of the cluster center point to which its weighted Euclidean distance is shortest, yielding a plurality of cluster sets. For example, if the standard sample feature vector that an expert sets as the center point of cluster set B has the feature indexes (senior high school education, unemployed, applies for a credit loan more than 3 times per year), then the standard sample feature vectors in that cluster set all share these feature indexes, and the differences between them appear in other feature indexes such as the user device fingerprint. After the cluster sets are obtained, the standard sample feature vectors in each cluster set are averaged, and the cluster distribution point corresponding to that mean is set as the new cluster center point of the cluster set. For example, if averaging cluster set B yields the feature indexes (senior high school education, unemployed, applies for a credit loan 3 times per year on average), the feature vector corresponding to these indexes is set as the new center point of cluster set B. All cluster sets are repeatedly averaged in this way to obtain new cluster center points, until the new cluster center point of every cluster set is the same as the cluster center point determined in the previous round, at which point the preset number of cluster sets obtained in this round are determined as the final cluster sets.
It can be seen that, with the cluster center points selected by experts according to business experience and the k-means clustering algorithm, the clustering unit 402 clusters the standard sample feature vectors in the training data set into a plurality of cluster sets, so that the users corresponding to the standard sample feature vectors are reasonably classified according to the feature indexes of the cluster center points.
As an alternative embodiment, the clustering unit 402 determining, based on the number of standard sample feature vectors contained in each final cluster set, the probability that each final cluster set corresponds to meeting the predetermined standard may be implemented as follows: the clustering unit 402 calculates, for each final cluster set, the ratio of the number of standard sample feature vectors it contains to the total number of standard sample feature vectors contained in the training data set, and takes this ratio as the probability that the final cluster set corresponds to meeting the predetermined standard. Specifically, if cluster set B contains 100 standard sample feature vectors and the training data set contains 200 standard sample feature vectors in total, then P(B) = 100/200 = 50%, that is, the fraud probability of the users corresponding to cluster set B is considered to be 50%. It can be seen that, by implementing this embodiment, the clustering unit 402 conveniently obtains the probability that a user corresponding to a standard sample feature vector in a cluster set performs the user behavior associated with that cluster set.
As an alternative embodiment, after the clustering unit 402 calculates the distance between the feature vector of the data to be analyzed and the cluster center point of each final cluster set, and before the clustering unit 402 determines the final cluster set whose center point is closest to the feature vector of the data to be analyzed as the final cluster set to which that feature vector belongs, the following check is performed: if the shortest distance between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets is greater than the maximum distance between a cluster center point and the cluster distribution points of its final cluster set, the analysis unit 404 determines that the data to be analyzed does not meet the predetermined standard. Specifically, when this shortest distance exceeds that maximum distance, the feature vector of the data to be analyzed differs greatly from the standard sample feature vectors at the cluster distribution points and should not be clustered into any cluster set; in that case the user features of the data to be analyzed can be considered inconsistent with the user features corresponding to the cluster sets, and the analysis unit 404 judges that the feature vector of the data to be analyzed does not meet the predetermined standard. It can be seen that this decision process screens out feature vectors of data to be analyzed that do not meet the predetermined standard, rather than simply clustering them.
As another alternative implementation, when the shortest distance between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets is greater than the maximum distance between each cluster center point and the cluster distribution points of its final cluster set, the analysis unit 404 pushes the user data corresponding to the feature vector of the data to be analyzed to an expert agent terminal, so that an expert can identify the data to be analyzed. Manual judgment by an expert thus avoids misjudging cases in which the feature vectors of a certain type of user are so unusual that they cannot form an independent cluster set.
As an alternative implementation, if the shortest distance is not greater than the maximum distance, the analysis unit 404 determines that the final cluster set corresponding to the cluster center point closest to the feature vector of the data to be analyzed is the final cluster set to which that feature vector belongs, takes the probability of meeting the predetermined standard corresponding to that final cluster set as the probability that the data to be analyzed meets the predetermined standard, and adds the feature vector of the data to be analyzed to that final cluster set; the analysis unit 404 then performs the step of calculating the weighted Euclidean distances between the cluster distribution points and the cluster center points, so as to update the standard data model. Specifically, when the feature vector of the data to be analyzed can be clustered into a certain cluster set, the analysis unit 404 sets the probability that the user corresponding to that feature vector meets the predetermined standard to the probability corresponding to that cluster set; in addition, after the feature vector is added to the cluster set, the mean value of the cluster set changes. Updating the standard data model in real time therefore keeps the model consistent with the feature vectors of the data to be analyzed and realizes machine learning.
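A minimal sketch of this update step, again with the illustrative names used above: the new feature vector is appended to the final cluster set it belongs to, the cluster center point of that set is recomputed as the new mean, and the per-set probabilities are refreshed; recomputing only the affected center rather than rerunning the full clustering is an assumption made to keep the example short.

```python
import numpy as np

def update_model(x, set_idx, vectors, labels, centers):
    # Add the feature vector of the data to be analyzed to its final cluster set.
    vectors = np.vstack([vectors, np.asarray(x, dtype=float)[None, :]])
    labels = np.append(labels, set_idx)
    # The mean of that set changes, so recompute its cluster center point to keep
    # the standard data model in step with the newly analyzed data.
    centers = np.asarray(centers, dtype=float).copy()
    centers[set_idx] = vectors[labels == set_idx].mean(axis=0)
    # Refresh the probability of meeting the predetermined standard for each set
    # (cluster_probabilities comes from the earlier sketch).
    return vectors, labels, centers, cluster_probabilities(labels)
```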
It can be seen that, by implementing the apparatus described in fig. 4, data samples meeting the predetermined standard are analyzed to construct the training data set, a standard data model can be obtained through the clustering algorithm, and after the feature vector corresponding to the data to be analyzed is input into the standard data model, the probability that the data to be analyzed meets the predetermined standard can be obtained. The efficiency of identifying the data to be analyzed is improved, and losses caused by manual identification errors are reduced.
Referring to fig. 5, fig. 5 is a schematic structural diagram of another data feature mining apparatus according to an embodiment of the present invention. The data feature mining apparatus shown in fig. 5 is obtained by optimizing the data feature mining apparatus shown in fig. 4. Compared with the apparatus shown in fig. 4, the apparatus shown in fig. 5 may further include: a cluster center unit 405, wherein,
and the cluster center unit 405 is configured to analyze the feature indexes included in the standard sample feature vectors by using a Gaussian function, obtain a preset number of probability density distribution samples of the standard sample feature vectors, and set the standard sample feature vector with the highest probability in each probability density distribution sample as a cluster center point.
In the embodiment of the present invention, a Gaussian function is used to analyze the feature indexes included in the standard sample feature vectors in the training data set so as to obtain the cluster center points.
As an alternative embodiment, the cluster center unit 405 analyzes the feature indexes included in the standard sample feature vectors by using a Gaussian function, obtains a preset number of probability density distribution samples of the standard sample feature vectors, and sets the standard sample feature vector with the highest probability in each probability density distribution sample as a cluster center point. Specifically, the cluster center unit 405 analyzes the feature indexes of the standard sample feature vectors in the training data set with a Gaussian function to obtain a plurality of probability density distribution samples; in each probability density distribution sample, the standard sample feature vector with the highest probability is the one whose feature indexes resemble those of a large number of other standard sample feature vectors, so the highest-probability vector in each probability density distribution sample can be set as the cluster center point of a cluster set, and the number of probability density distribution samples obtained by the Gaussian function equals the number of cluster center points. In this way, the cluster center unit 405 obtains good cluster center points before the training data set is trained, avoiding the situation in which cluster center points selected purely from expert experience lead, through the limits of that experience, to poor clustering of the standard data model and misidentification of the data to be analyzed.
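The patent does not tie the Gaussian analysis to a specific algorithm; one plausible realization, sketched below, treats a Gaussian mixture model with a preset number of components as the probability density distribution samples and picks, for each component, the standard sample feature vector with the highest membership probability as the cluster center point. The use of scikit-learn's GaussianMixture is our assumption, not something named in the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_center_points(vectors, preset_number, seed=0):
    # Fit a preset number of Gaussian components over the feature indexes of the
    # standard sample feature vectors.
    vectors = np.asarray(vectors, dtype=float)
    gmm = GaussianMixture(n_components=preset_number, random_state=seed).fit(vectors)
    # resp[i, k]: probability that vector i belongs to component k.
    resp = gmm.predict_proba(vectors)
    # For each component, take the standard sample feature vector with the highest
    # probability as the cluster center point of the corresponding cluster set.
    center_idx = resp.argmax(axis=0)
    return vectors[center_idx]
```

These center points can then be fed to weighted_kmeans from the first sketch in place of expert-selected ones.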
Therefore, by implementing the apparatus described in fig. 5, the cluster center points of the standard data model are selected reasonably by using the Gaussian function, avoiding the situation in which subjective factors in expert experience affect the standard data model and the data to be analyzed cannot be identified accurately.
The invention also provides an electronic device, comprising:
a processor;
a memory having stored thereon computer readable instructions which, when executed by a processor, implement a data feature mining method as previously described.
The electronic device may be the data feature mining apparatus 100 shown in fig. 1.
In an exemplary embodiment, the invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data feature mining method as previously described.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A data feature mining method, comprising:
Collecting data samples meeting preset standards in the vertical field, and constructing a training data set based on the data samples meeting the preset standards; the training data set comprises a plurality of standard sample feature vectors, each standard sample feature vector corresponds to one data sample meeting a preset standard, the standard sample feature vector comprises feature indexes extracted from the data samples meeting the preset standard, and the feature indexes comprise personal information, equipment fingerprints and behavior data;
selecting a preset number of standard sample feature vectors to be set as clustering center points;
for each cluster center, when the resulting cluster set is not the final cluster set, the following steps are performed each time:
respectively setting the standard sample feature vectors except the clustering center point in the training data set as clustering distribution points;
for each cluster distribution point, calculating the weighted Euclidean distance between the cluster distribution point and each cluster center point according to the weight of each characteristic index in the standard sample characteristic vector corresponding to the cluster distribution point;
clustering each cluster distribution point into a cluster set corresponding to the cluster center point with the shortest weighted Euclidean distance of the cluster distribution point according to the weighted Euclidean distance between each cluster distribution point and each cluster center point, so as to obtain the cluster set with the preset quantity;
Averaging the cluster distribution points in each cluster set and the cluster center points of the cluster set to obtain an average value as a new cluster center point of the cluster set;
when judging that the cluster center point of each new cluster set determined at the time is the same as the cluster center point in the cluster set determined at the last time, determining the cluster set with the preset number obtained at the time as the final cluster set;
calculating the proportion of the number of the standard sample feature vectors contained in each final cluster set to the total number of all the standard sample feature vectors contained in the training data set as the probability of the corresponding final cluster set meeting the preset standard;
taking each final cluster set and the corresponding probability meeting the preset standard as a standard data model;
analyzing the data to be analyzed to construct a feature vector of the data to be analyzed;
calculating the distance between the feature vector of the data to be analyzed and the clustering center point of each final clustering set in the standard data model, and determining the final clustering set corresponding to the clustering center point with the shortest feature vector distance of the data to be analyzed as the final clustering set to which the feature vector of the data to be analyzed belongs;
And taking the probability which corresponds to the final cluster set to which the feature vector of the data to be analyzed belongs and accords with the preset standard as the probability that the data to be analyzed accords with the preset standard.
2. The data feature mining method of claim 1, wherein the collecting data samples meeting a predetermined criterion for the vertical domain and constructing a training data set based on the data samples meeting the predetermined criterion comprises:
collecting a data sample in the vertical field, which accords with a preset standard, and processing the data sample according to a preset screening rule to obtain a standard data sample; the preset screening rule is used for screening out data samples with nonstandard data formats;
analyzing according to the standard data sample to obtain personal information, equipment fingerprints and behavior data contained in the standard data, and setting the personal information, the equipment fingerprints and the behavior data as characteristic indexes of the standard data sample;
and constructing and obtaining the standard sample feature vector as the training data set according to the feature index of the standard data sample.
3. The data feature mining method according to claim 1, wherein before the calculating of the weighted euclidean distance between the cluster distribution point and each of the cluster center points according to the weight of each of the feature indexes in the standard sample feature vector corresponding to the cluster distribution point, the method further comprises:
And determining the weight of the characteristic index in the standard sample characteristic vector according to an expert rule, so that the weight of the characteristic index with high probability, which is determined by the expert rule to be in accordance with the preset standard, in the standard sample characteristic vector is higher than the weight of the characteristic index with low probability, which is determined by the expert rule to be in accordance with the preset standard, in the standard sample characteristic vector.
4. The data feature mining method according to claim 1, wherein the selecting a preset number of the standard sample feature vectors as cluster center points includes:
and analyzing the characteristic indexes included in the standard sample characteristic vectors by using a Gaussian function to obtain probability density distribution samples of the standard sample characteristic vectors with the preset quantity, and setting the standard sample characteristic vector with the highest probability in each probability density distribution sample as the clustering center point.
5. The data feature mining method according to claim 1, characterized in that after said calculating the distance between the feature vector of the data to be analyzed and the cluster center point of each of the final cluster sets in the standard data model, and before said determining that the final cluster set corresponding to the cluster center point having the shortest feature vector distance to the data to be analyzed is the final cluster set to which the feature vector of the data to be analyzed belongs, the method further comprises:
If the shortest distance between the feature vector of the data to be analyzed and the clustering center point of each final clustering set is larger than the maximum distance between the clustering center point in each final clustering set and each clustering distribution point in the final clustering set, determining that the data to be analyzed does not accord with the preset standard;
if the shortest distance is not greater than the greatest distance, executing the step of determining that the final cluster set corresponding to the cluster center point with the shortest feature vector distance of the data to be analyzed is the final cluster set to which the feature vector of the data to be analyzed belongs, and taking the probability of meeting the predetermined standard corresponding to the final cluster set to which the feature vector of the data to be analyzed belongs as the probability of meeting the predetermined standard of the data to be analyzed;
adding the feature vector of the data to be analyzed into a final cluster set to which the feature vector of the data to be analyzed belongs;
and executing the step of averaging the cluster distribution points in each cluster set and the cluster center points of the cluster set so as to update the standard data model.
6. A data feature mining apparatus, comprising:
The training unit is used for collecting data samples meeting preset standards in the vertical field and constructing a training data set based on the data samples meeting the preset standards; the training data set comprises a plurality of standard sample feature vectors, each standard sample feature vector corresponds to one data sample meeting a preset standard, the standard sample feature vector comprises feature indexes extracted from the data samples meeting the preset standard, and the feature indexes comprise personal information, equipment fingerprints and behavior data;
the clustering unit is used for selecting a preset number of standard sample feature vectors to be set as clustering center points; for each cluster center, when the resulting cluster set is not the final cluster set, the following steps are performed each time: respectively setting the standard sample feature vectors except the clustering center point in the training data set as clustering distribution points; for each cluster distribution point, calculating the weighted Euclidean distance between the cluster distribution point and each cluster center point according to the weight of each characteristic index in the standard sample characteristic vector corresponding to the cluster distribution point; clustering each cluster distribution point into a cluster set corresponding to the cluster center point with the shortest weighted Euclidean distance of the cluster distribution point according to the weighted Euclidean distance between each cluster distribution point and each cluster center point, so as to obtain the cluster set with the preset quantity; averaging the cluster distribution points in each cluster set and the cluster center points of the cluster set to obtain an average value as a new cluster center point of the cluster set; when judging that the cluster center point of each new cluster set determined at the time is the same as the cluster center point in the cluster set determined at the last time, determining the cluster set with the preset number obtained at the time as the final cluster set; calculating the proportion of the number of the standard sample feature vectors contained in each final cluster set to the total number of all the standard sample feature vectors contained in the training data set as the probability of the corresponding final cluster set meeting the preset standard; taking each final cluster set and the corresponding probability meeting the preset standard as a standard data model;
The construction unit is used for analyzing the data to be analyzed to construct the feature vector of the data to be analyzed;
the analysis unit is used for calculating the distance between the feature vector of the data to be analyzed and the clustering center point of each final clustering set in the standard data model, and determining the final clustering set corresponding to the clustering center point with the shortest feature vector distance of the data to be analyzed as the final clustering set to which the feature vector of the data to be analyzed belongs; and taking the probability which corresponds to the final cluster set to which the feature vector of the data to be analyzed belongs and accords with the preset standard as the probability that the data to be analyzed accords with the preset standard.
7. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the data feature mining method of any of claims 1-5 when the computer program is executed.
8. A computer-readable storage medium, characterized in that it stores a computer program that causes a computer to execute the data feature mining method according to any one of claims 1 to 5.
CN201910630499.7A 2019-04-19 2019-07-12 Data feature mining method and device, electronic equipment and storage medium Active CN110288468B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019103178422 2019-04-19
CN201910317842 2019-04-19

Publications (2)

Publication Number Publication Date
CN110288468A CN110288468A (en) 2019-09-27
CN110288468B true CN110288468B (en) 2023-06-06

Family

ID=68022554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910630499.7A Active CN110288468B (en) 2019-04-19 2019-07-12 Data feature mining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110288468B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062440B (en) * 2019-12-18 2024-02-02 腾讯科技(深圳)有限公司 Sample selection method, device, equipment and storage medium
CN112348107A (en) * 2020-11-17 2021-02-09 百度(中国)有限公司 Image data cleaning method and apparatus, electronic device, and medium
CN114996360B (en) * 2022-07-20 2022-11-18 江西现代职业技术学院 Data analysis method, system, readable storage medium and computer equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150142808A1 (en) * 2013-11-15 2015-05-21 Futurewei Technologies Inc. System and method for efficiently determining k in data clustering

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004038602A1 (en) * 2002-10-24 2004-05-06 Warner-Lambert Company, Llc Integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications
CN106980623A (en) * 2016-01-18 2017-07-25 华为技术有限公司 A kind of determination method and device of data model
WO2017124713A1 (en) * 2016-01-18 2017-07-27 华为技术有限公司 Data model determination method and apparatus
CN109360105A (en) * 2018-09-18 2019-02-19 平安科技(深圳)有限公司 Product risks method for early warning, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110288468A (en) 2019-09-27

Similar Documents

Publication Publication Date Title
CN110070391B (en) Data processing method and device, computer readable medium and electronic equipment
CN110288468B (en) Data feature mining method and device, electronic equipment and storage medium
CN107071193B (en) Method and device for accessing interactive response system to user
CN107222865A (en) The communication swindle real-time detection method and system recognized based on suspicious actions
CN111475613A (en) Case classification method and device, computer equipment and storage medium
CN110717509B (en) Data sample analysis method and device based on tree splitting algorithm
CN113627566A (en) Early warning method and device for phishing and computer equipment
CN111047429A (en) Probability prediction method and device
CN111898675A (en) Credit wind control model generation method and device, scoring card generation method, machine readable medium and equipment
CN112328909A (en) Information recommendation method and device, computer equipment and medium
CN110991433B (en) Face recognition method, device, equipment and storage medium
CN112417121A (en) Client intention recognition method and device, computer equipment and storage medium
CN113392920B (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN112396079A (en) Number recognition model training method, number recognition method and device
Pratondo et al. Prediction of Operating System Preferences on Mobile Phones Using Machine Learning
CN112884040A (en) Training sample data optimization method and system, storage medium and electronic equipment
CN111144902A (en) Questionnaire data processing method and device, storage medium and electronic equipment
CN110288467B (en) Data mining method and device, electronic equipment and storage medium
CN116167454A (en) Intelligent two-classification model training method and device
CN115689708A (en) Screening method, risk assessment method, device, equipment and medium of training data
CN113011503B (en) Data evidence obtaining method of electronic equipment, storage medium and terminal
CN112214675B (en) Method, device, equipment and computer storage medium for determining user purchasing machine
CN113505293A (en) Information pushing method and device, electronic equipment and storage medium
CN112529699A (en) Construction method, device and equipment of enterprise trust model and readable storage medium
CN117115596B (en) Training method, device, equipment and medium of object action classification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant