CN110288468A - Data feature mining method, apparatus, electronic device and storage medium - Google Patents

Data feature mining method, apparatus, electronic device and storage medium

Info

Publication number
CN110288468A
CN110288468A (application CN201910630499.7A)
Authority
CN
China
Prior art keywords
data
analyzed
cluster
feature vector
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910630499.7A
Other languages
Chinese (zh)
Other versions
CN110288468B (en)
Inventor
叶素兰
李国才
刘卉
王秋施
贾怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of CN110288468A
Application granted granted Critical
Publication of CN110288468B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/03 Credit; Loans; Processing thereof
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Accounting & Taxation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Finance (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Technology Law (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of data analysis and discloses a data feature mining method, apparatus, electronic device and storage medium. The method includes: collecting data samples of a vertical domain that meet a predetermined criterion and building a training data set from them; processing the training data set to obtain a standard data model; analyzing data to be analyzed to construct a feature vector of the data to be analyzed; and inputting the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined criterion. By building standard sample feature vectors and clustering them, a standard data model is obtained, so that when data to be analyzed is received, the user features of the data to be analyzed can be derived through the standard data model and the probability that the data to be analyzed meets the predetermined criterion can be determined accurately.

Description

Data feature mining method, apparatus, electronic device and storage medium
Technical field
The present invention relates to the field of data analysis, and in particular to a data feature mining method, apparatus, electronic device and storage medium.
Background art
In a vertical domain, in order to predict the behavior that a data sample may perform, experienced analysts usually mine sample features based on business experience and build a sample database from those features and historical sample data, so that the possible behavior of a data sample can be predicted. However, if the data to be analyzed appears in the vertical domain for the first time, it has no historical behavior, so its behavior cannot be predicted from the sample database; moreover, the above approach relies on manual analysis, is strongly affected by the limits of human judgment, and its accuracy is low. Traditional data feature mining methods therefore cannot effectively reveal the features of data samples, and their accuracy in predicting the behavior of data samples is low.
Summary of the invention
To solve the problems that traditional data feature mining methods cannot reveal the features of data samples and that their accuracy in predicting the behavior of data samples is low, the present invention provides a data feature mining method, apparatus, electronic device and storage medium.
A first aspect of the embodiments of the present invention discloses a data feature mining method, the method comprising:
collecting data samples of a vertical domain that meet a predetermined criterion, and building a training data set based on the data samples that meet the predetermined criterion, the training data set comprising a plurality of sample feature vectors, each sample feature vector corresponding to one data sample that meets the predetermined criterion;
processing the training data set to obtain a standard data model;
analyzing data to be analyzed to construct a feature vector of the data to be analyzed;
inputting the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined criterion.
A second aspect of the embodiments of the present invention discloses a data feature mining apparatus, the apparatus comprising:
a training unit, configured to collect data samples of a vertical domain that meet a predetermined criterion and build a training data set based on those data samples, the training data set comprising a plurality of sample feature vectors, each sample feature vector corresponding to one data sample that meets the predetermined criterion;
a clustering unit, configured to process the training data set to obtain a standard data model;
a construction unit, configured to analyze data to be analyzed to construct a feature vector of the data to be analyzed;
an analysis unit, configured to input the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined criterion.
A third aspect of the embodiments of the present invention discloses an electronic device, the electronic device comprising:
a processor;
a memory storing computer-readable instructions which, when executed by the processor, implement the data feature mining method disclosed in the first aspect of the embodiments of the present invention.
A fourth aspect of the embodiments of the present invention discloses a computer-readable storage medium storing a computer program, the computer program causing a computer to execute the data feature mining method disclosed in the first aspect of the embodiments of the present invention.
The technical solutions provided by the embodiments of the present invention may have the following beneficial effects:
The data feature mining method provided by the present invention comprises the following steps: collecting standard data sample data of a vertical domain and building a training data set based on the standard data sample data; processing the training data set to obtain a standard data model; analyzing data to be analyzed to construct a feature vector of the data to be analyzed; and inputting the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined criterion.
With this method, standard sample feature vectors are built and processed with the k-means clustering algorithm to obtain a standard data model, so that when data to be analyzed is received, its user features can be obtained through the standard data model and the probability that the data to be analyzed meets the predetermined criterion can be determined accurately.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the present invention.
Brief description of the drawings
The accompanying drawings, which are incorporated into and constitute part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the present invention.
Fig. 1 is a structural schematic diagram of an apparatus disclosed by an embodiment of the present invention;
Fig. 2 is a flow chart of a data feature mining method disclosed by an embodiment of the present invention;
Fig. 3 is a flow chart of another data feature mining method disclosed by an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of a data feature mining apparatus disclosed by an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of another data feature mining apparatus disclosed by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
The implementation environment of the present invention may be an electronic device such as a smartphone, tablet computer or desktop computer. A data sample that meets the predetermined criterion may be a data sample that has been blacklisted in a certain industry, a data sample that exhibits a certain specific behavior, or the like.
Fig. 1 is a structural schematic diagram of a data feature mining apparatus disclosed by an embodiment of the present invention. The data feature mining apparatus 100 may be the electronic device described above. As shown in Fig. 1, the data feature mining apparatus 100 may include one or more of the following components: a processing component 102, a memory 104, a power component 106, a multimedia component 108, an audio component 110, a sensor component 114 and a communication component 116.
The processing component 102 generally controls the overall operation of the data feature mining apparatus 100, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 102 may include one or more processors 118 to execute instructions so as to complete all or part of the steps of the methods described below. In addition, the processing component 102 may include one or more modules to facilitate interaction between the processing component 102 and other components; for example, the processing component 102 may include a multimedia module to facilitate interaction between the multimedia component 108 and the processing component 102.
The memory 104 is configured to store various types of data to support the operation of the data feature mining apparatus 100. Examples of such data include instructions for any application or method operated on the data feature mining apparatus 100. The memory 104 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disc. The memory 104 also stores one or more modules configured to be executed by the one or more processors 118 to complete all or part of the steps of the methods described below.
The power component 106 supplies power to the various components of the data feature mining apparatus 100. The power component 106 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the data feature mining apparatus 100.
The multimedia component 108 includes a screen providing an output interface between the data feature mining apparatus 100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel; the touch sensors can sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe. The screen may also include an organic light-emitting display (OLED).
The audio component 110 is configured to output and/or input audio signals. For example, the audio component 110 includes a microphone (MIC), which is configured to receive external audio signals when the data feature mining apparatus 100 is in an operating mode such as a call mode, recording mode or speech recognition mode. The received audio signals may be further stored in the memory 104 or transmitted via the communication component 116. In some embodiments, the audio component 110 also includes a speaker for outputting audio signals.
The sensor component 114 includes one or more sensors for providing status assessments of various aspects of the data feature mining apparatus 100. For example, the sensor component 114 can detect the open/closed state of the data feature mining apparatus 100 and the relative positioning of its components, and can detect a change in the position of the data feature mining apparatus 100 or one of its components as well as a change in the temperature of the data feature mining apparatus 100. In some embodiments, the sensor component 114 may also include a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 116 is configured to facilitate wired or wireless communication between the data feature mining apparatus 100 and other devices. The data feature mining apparatus 100 can access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity). In the embodiments of the present invention, the communication component 116 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In the embodiments of the present invention, the communication component 116 also includes a near field communication (NFC) module to facilitate short-range communication; for example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth technology and other technologies.
In an exemplary embodiment, the data feature mining apparatus 100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field programmable gate arrays, controllers, microcontrollers, microprocessors or other electronic components, for executing the methods described below.
Referring to Fig. 2, Fig. 2 is a flow diagram of a data feature mining method disclosed by an embodiment of the present invention. As shown in Fig. 2, the data feature mining method may comprise the following steps:
201. Collect data samples of a vertical domain that meet a predetermined criterion, and build a training data set based on the data samples that meet the predetermined criterion.
In the embodiment of the present invention, the training data set includes a plurality of sample feature vectors, each sample feature vector corresponding to one data sample that meets the predetermined criterion.
In the embodiment of the present invention, before the training data set is built, data samples that meet the predetermined criterion need to be collected from a vertical domain. For example, if the vertical domain refers specifically to the credit loan industry, a data sample that meets the predetermined criterion may refer to a blacklisted user of the credit loan industry; in that case each user sample contained in the blacklist is a data sample.
As an optional embodiment, collecting the data samples of the vertical domain that meet the predetermined criterion and building the training data set based on them can be implemented as follows: collect the data samples of the vertical domain that meet the predetermined criterion; process the data samples according to a preset screening rule to obtain standard data samples; analyze each standard data sample to obtain the personal information, device fingerprint and behavioral data contained in the standard data and set them as the feature indices of the standard data; and construct standard sample feature vectors from the feature indices of the standard data to form the training data set. The preset screening rule is used to screen out data samples whose data format is not standard.
Specifically, suppose the vertical domain is set to the credit loan industry and the data samples that meet the predetermined criterion are set to blacklist data samples of the credit loan industry; the collected blacklist data samples may then be user samples blacklisted by the credit loan industry, user samples publicly listed as defaulters subject to enforcement, and so on. The detailed information of the blacklisted user samples is collected first, such as contact information, personal information, identity certificates, proof of work income, bank card statements, proof of intended loan use and information about the credit loan business applied for. The detailed information is then processed according to the preset screening rule: code is written to screen the detailed information against the rule, so that invalid blacklisted user samples whose details are missing or incorrectly formatted are easily screened out, leaving the blacklisted user samples with complete information as standard data together with the corresponding standard data samples. Feature indices are then extracted from the standard data samples: for example, a first feature index is set to personal information such as the user's age, education and occupation; a second feature index is set to device fingerprint information such as the identification code of the user's device, the physical address of the user's device and the Wi-Fi address commonly used by the user's device; a third feature index is set to behavioral information such as the user's transaction/application frequency and geographic movement data. The extracted feature indices together are set as the feature vector of a standard data sample; for instance, the feature vector of standard data A might be extracted as (training, male, worker; 192.168.1.1, 123456789012345, 192.168.1.2; applied for credit loans 3 times per year on average, located in Guangzhou in 2017). The feature vectors of several pieces of standard data are extracted in the same way and packaged together to obtain the training data set.
It can be seen that, by implementing this embodiment of the present invention, invalid data samples can be screened out of messy, scattered data samples to obtain standard data samples, and detailed feature vectors of the standard data samples can be extracted, so that standard data samples and the user features corresponding to the standard data samples can be obtained effectively.
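As a minimal, purely illustrative sketch of the screening and feature-vector construction described above (the field names, the completeness rule and the numeric encodings below are assumptions, not taken from the patent):

```python
# Sketch: screen raw samples against a completeness rule and build
# standard sample feature vectors (personal info, device fingerprint, behavior).
# Field names and encodings are illustrative assumptions.

REQUIRED_FIELDS = ("age", "education", "job", "device_id", "apply_freq")

def screen(raw_samples):
    """Keep only samples whose details are complete (the preset screening rule)."""
    return [s for s in raw_samples
            if all(s.get(f) not in (None, "") for f in REQUIRED_FIELDS)]

def to_feature_vector(sample):
    """Encode one standard data sample as a numeric feature vector."""
    return [
        float(sample["age"]),
        float(sample["education"]),               # education level as an ordinal code
        float(sample["job"]),                     # occupation category code
        float(hash(sample["device_id"]) % 1000),  # crude device-fingerprint encoding
        float(sample["apply_freq"]),              # yearly loan applications on average
    ]

def build_training_set(raw_samples):
    return [to_feature_vector(s) for s in screen(raw_samples)]
```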
202. Process the training data set to obtain a standard data model.
In the embodiment of the present invention, because the amount of data in the training data set is relatively large and the standard sample feature vectors contained in the training data set need to be classified, the k-means clustering algorithm can be used to train the training data set, so that the standard sample feature vectors in the training data set are partitioned into several cluster sets and the standard sample feature vectors within each cluster set all have relatively high similarity.
As an optional embodiment, processing the training data set to obtain the standard data model can be implemented as follows:
select a preset number of standard sample feature vectors and set them as cluster center points;
while the cluster sets obtained are not the final cluster sets, perform the following steps for each cluster center point each time:
set the standard sample feature vectors in the training data set other than the cluster center points as cluster distribution points;
for each cluster distribution point, calculate the weighted Euclidean distance between the cluster distribution point and each cluster center point according to the weight of each feature index in the standard sample feature vector corresponding to the cluster distribution point;
according to the weighted Euclidean distance between each cluster distribution point and each cluster center point, cluster each cluster distribution point into the cluster set corresponding to the cluster center point with the shortest weighted Euclidean distance to that cluster distribution point, thereby obtaining the preset number of cluster sets;
average the cluster distribution points and the cluster center point in each cluster set, and use the average as the new cluster center point of that cluster set;
when it is judged that the new cluster center point of each cluster set determined this time is identical to the cluster center point of the cluster set determined last time, determine that the preset number of cluster sets obtained this time are the final cluster sets;
based on the number of standard sample feature vectors contained in each final cluster set, determine the probability that the data corresponding to each final cluster set meets the predetermined criterion, and use each final cluster set together with its corresponding probability of the standard data meeting the predetermined criterion as the standard data model.
Specifically, the bank may select, based on expert experience, several representative standard sample feature vectors as cluster center points; the data samples corresponding to the standard sample feature vectors of these cluster center points are blacklist data samples with representative user features, i.e. users that are relatively likely to commit fraud when applying for credit services. After the cluster center points are set, the remaining standard sample feature vectors are set as cluster distribution points, and the training data set is processed with the k-means clustering algorithm based on the cluster center points: the weighted Euclidean distance between each cluster distribution point and the cluster center points is computed, and each cluster distribution point is clustered into the cluster set corresponding to the cluster center point with the shortest weighted Euclidean distance to that cluster distribution point, yielding several cluster sets. For example, if the expert sets the standard sample feature vector of the cluster center point of cluster set B to have the feature indices (high-school education, no assets, applies for credit loans more than 3 times per year on average), then every standard sample feature vector in that cluster set needs to have those feature indices, and the differences among the standard sample feature vectors in the cluster set are reflected in other feature indices such as the user's device fingerprint. After several cluster sets are obtained, the standard sample feature vectors in each cluster set are averaged, and the cluster distribution point corresponding to the average of the standard sample feature vectors is set as the new cluster center point of the cluster set; for example, after cluster set B is averaged, the feature indices (high-school education, unemployed, applies for credit loans more than 4 times per year on average) are obtained, and the feature vector corresponding to these feature indices is set as the cluster center point of cluster set B. The averaging is repeated for all cluster sets according to the above steps to obtain new cluster center points, until each new cluster center point determined this time is identical to the cluster center point of the cluster set determined last time, at which point the preset number of cluster sets obtained this time are determined to be the final cluster sets.
It can be seen that, with the expert selecting cluster center points based on business experience and the k-means clustering algorithm being used, the standard sample feature vectors in the training data set can be clustered into several cluster sets, so that the users corresponding to the standard sample feature vectors are classified reasonably according to the feature indices of each cluster center point.
Before the weighted Euclidean distance between each cluster distribution point and the cluster center points is calculated according to the weights of the feature indices in the standard sample feature vector, the weight of each feature index in the standard sample feature vector also needs to be determined according to expert rules, so that a feature index judged by the expert rules to have a high probability of indicating that the predetermined criterion is met carries a higher weight in the standard sample feature vector than a feature index judged to have a low probability of indicating that the predetermined criterion is met. Specifically, when the weighted Euclidean distance between a cluster distribution point and a cluster center point is calculated, the expert rules assign a weight to each feature index: when deciding whether fraud is likely, the user's education level has an obvious correlation with whether the user commits fraud, whereas the user's region information has no obvious correlation with it, so among the user's feature indices the education information carries a higher weight than the region information. The weighted Euclidean distance between each feature vector and the cluster center point of each cluster set is calculated by the following formula:
$D = \sqrt{\omega_1 (x_1 - y_1)^2 + \omega_2 (x_2 - y_2)^2 + \cdots + \omega_n (x_n - y_n)^2}$
where n feature indices are used, $\omega_1, \ldots, \omega_n$ are the weights corresponding to the n feature indices, $x_1, \ldots, x_n$ are the n feature index values of a feature vector, $y_1, \ldots, y_n$ are the n feature index values of the cluster center point, and D is the weighted Euclidean distance.
It can be seen that, by using the above embodiment, the weighted Euclidean distance between each feature vector and the cluster center point of each cluster set can be calculated accurately, so that each feature vector is accurately clustered into a suitable cluster set.
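The following is a minimal sketch of the weighted k-means procedure described in step 202 above (the data layout, the convergence test and the way weights are supplied are assumptions for illustration, not the patent's exact implementation):

```python
import math

def weighted_distance(x, y, w):
    """Weighted Euclidean distance D = sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def weighted_kmeans(vectors, centers, weights, max_iter=100):
    """Assign each standard sample feature vector to its nearest cluster center point,
    recompute each center as the mean of its cluster set, and stop when the centers
    no longer change (the resulting clusters are the 'final cluster sets')."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in vectors:
            distances = [weighted_distance(v, c, weights) for c in centers]
            clusters[distances.index(min(distances))].append(v)
        new_centers = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:   # same centers as the previous round: done
            break
        centers = new_centers
    return centers, clusters
```

Here vectors would be the training data set built in step 201, centers the initial cluster center points (chosen from expert experience, or with a Gaussian function as in the embodiment of Fig. 3), and weights the per-index weights assigned by the expert rules.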
As an optional embodiment, determining, based on the number of standard sample feature vectors contained in each final cluster set, the probability that each final cluster set meets the predetermined criterion can be implemented as follows: calculate the ratio of the number of standard sample feature vectors contained in each final cluster set to the total number of standard sample feature vectors contained in the training data set, and use this ratio as the probability that the final cluster set meets the predetermined criterion, so that the probability of meeting the predetermined criterion corresponding to each final cluster set is obtained. Specifically, if cluster set B contains 100 standard sample feature vectors and the training data set contains 200 standard sample feature vectors in total, the probability formula gives P(B) = 100/200 = 50%, i.e. the fraud probability of the users identified with cluster set B is 50%. It can be seen that, by implementing this embodiment, the probability that the users corresponding to the standard sample feature vectors in a cluster set perform the user behavior corresponding to that cluster set can be obtained conveniently.
It should be understood that the way the probability of a user behavior is calculated differs across vertical domains; for example, the way the insurance industry calculates a user's renewal probability differs significantly from the way the credit loan industry calculates a user's fraud probability. The embodiments of the present invention use the credit loan industry's calculation of the fraud probability only as an example, which does not restrict the calculation methods of other vertical domains.
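As a small illustration of this probability assignment (a sketch that assumes every training vector ends up in exactly one final cluster set):

```python
def cluster_probabilities(clusters, total):
    """Probability of meeting the predetermined criterion for each final cluster set:
    (vectors in the cluster set) / (vectors in the whole training data set)."""
    return [len(cluster) / total for cluster in clusters]

# e.g. final cluster sets of sizes 100, 60 and 40 out of 200 training vectors
# give the probabilities [0.5, 0.3, 0.2], matching P(B) = 100/200 = 50% above.
```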
203. Analyze the user data of the data to be analyzed to construct the feature vector of the data to be analyzed.
In the embodiment of the present invention, after the data to be analyzed is obtained, the feature indices of the data to be analyzed are extracted with reference to the processing of step 201, so as to obtain a feature vector of the data to be analyzed in the standard format that the standard data model can process; then proceed to step 204.
204. Input the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined criterion.
In the embodiment of the present invention, by inputting the feature vector of the data to be analyzed into the standard data model, the feature vector of the data to be analyzed can be clustered into one of the cluster sets of the standard data model; the probability of meeting the predetermined criterion corresponding to that cluster set is then the probability that the data to be analyzed meets the predetermined criterion.
As an optional embodiment, after the distances between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets have been calculated, and before the final cluster set corresponding to the cluster center point with the shortest distance to the feature vector of the data to be analyzed is determined as the final cluster set to which the feature vector of the data to be analyzed belongs, if the shortest distance between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets is greater than the maximum distance, in each final cluster set, between the cluster center point and the cluster distribution points of that final cluster set, it is determined that the data to be analyzed does not meet the predetermined criterion. Specifically, when the shortest distance between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets is greater than the maximum distance between each final cluster set's center point and the cluster distribution points in that final cluster set, the feature vector of the data to be analyzed differs considerably from the standard sample feature vectors of the cluster distribution points and should not be clustered into any cluster set; in that case the user features of the data to be analyzed can be considered not to match the user features corresponding to any of the cluster sets, and it is determined that the feature vector of the data to be analyzed does not meet the predetermined criterion. This decision process therefore filters out the feature vectors of data to be analyzed that do not meet the predetermined criterion instead of simply clustering them.
As another optional embodiment, when the shortest distance between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets is greater than the maximum distance between each final cluster set's center point and the cluster distribution points in that final cluster set, the user data corresponding to the feature vector of the data to be analyzed is pushed to an expert agent terminal so that an expert can assess the data to be analyzed. Manual judgment by an expert avoids misjudgments caused by a class of users whose feature vectors are relatively unusual and cannot form an independent cluster set.
As an optional embodiment, if the shortest distance is not greater than the maximum distance, the feature vector of the data to be analyzed is added to the final cluster set to which the feature vector of the data to be analyzed belongs, and the probability that the data to be analyzed meets the predetermined criterion is set to the probability of meeting the predetermined criterion corresponding to that final cluster set. That is, the step of determining the final cluster set corresponding to the cluster center point with the shortest distance to the feature vector of the data to be analyzed as the final cluster set to which the feature vector of the data to be analyzed belongs is performed, the probability that the users of that final cluster set meet the predetermined criterion is used as the probability that the data to be analyzed meets the predetermined criterion, and the feature vector of the data to be analyzed is added to that final cluster set; then the step of calculating the weighted Euclidean distances between the cluster distribution points and the cluster center points is performed again to update the standard data model. Specifically, when the feature vector of the data to be analyzed can be clustered into a cluster set, the probability that the users of that cluster set meet the predetermined criterion is set as the probability that the data to be analyzed corresponding to the feature vector meets the predetermined criterion. In addition, after the feature vector of the data to be analyzed is added to the cluster set, the average of the cluster set also changes: understandably, as the amount of data to be analyzed grows, the old cluster center point in the cluster set no longer reflects the actual situation of the cluster set, so whenever a feature vector of data to be analyzed is added to a cluster set, the cluster set is averaged again to obtain a new cluster center point, thereby updating the standard data model so that it can continue to handle the continuously changing feature vectors of data to be analyzed. By updating the standard data model in real time, the standard data model does not become disconnected from the feature vectors of the data to be analyzed, which realizes a machine-learning capability.
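A sketch of this classification step, reusing the helpers from the earlier sketches (the reading of the outlier rule and the update strategy below is one plausible interpretation of these embodiments, not the patent's exact implementation):

```python
def classify(vector, centers, clusters, weights, probabilities):
    """Return the probability that the data to be analyzed meets the predetermined
    criterion, or None when it lies outside every final cluster set (in which case
    it is rejected or pushed to an expert agent terminal for manual review)."""
    distances = [weighted_distance(vector, c, weights) for c in centers]
    nearest = distances.index(min(distances))
    # radius of each final cluster set = max distance from its center to its members
    radii = [max((weighted_distance(v, c, weights) for v in cluster), default=0.0)
             for c, cluster in zip(centers, clusters)]
    if min(distances) > max(radii):      # farther away than every cluster's radius
        return None
    clusters[nearest].append(vector)     # add the new feature vector to its cluster set
    # re-average that cluster set so the standard data model stays up to date
    centers[nearest] = [sum(dim) / len(clusters[nearest])
                        for dim in zip(*clusters[nearest])]
    return probabilities[nearest]
```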
It can be seen that, by implementing the method described in Fig. 2, standard data sample data is analyzed and a training data set is built, the algorithm can be used to obtain a standard data model, and once the feature vector of the data to be analyzed is input into the standard data model, the probability that the data to be analyzed meets the predetermined criterion can be known. This improves the efficiency of assessing data to be analyzed and reduces the losses caused by manual assessment errors. If the present invention is implemented in the credit loan industry, the user features of the data to be analyzed can be obtained in time, the probability that the data to be analyzed corresponds to fraudulent behavior can be known accurately, and automated screening of the data to be analyzed is realized.
Referring to Fig. 3, Fig. 3 is a flow diagram of another data feature mining method disclosed by an embodiment of the present invention. As shown in Fig. 3, the data feature mining method may comprise the following steps:
301. Collect data samples of a vertical domain that meet a predetermined criterion, and build a training data set based on the data samples that meet the predetermined criterion.
In the embodiment of the present invention, the training data set includes a plurality of sample feature vectors, each sample feature vector corresponding to one data sample, where a sample feature vector is composed of feature indices of multiple dimensions.
302. Analyze the feature indices contained in the standard sample feature vectors with a Gaussian function to obtain a preset number of probability density distribution samples of the standard sample feature vectors, and set the standard sample feature vector with the highest probability in each probability density distribution sample as a cluster center point.
In the embodiment of the present invention, selecting cluster center points based on expert experience, although somewhat reasonable, relies heavily on the expert's existing experience: when a novel user behavior appears, expert experience cannot recognize the novel user behavior immediately, the selected cluster center points perform poorly, and the analysis results of the standard data model obtained from them are naturally inaccurate as well. The embodiment of the present invention therefore uses a Gaussian function to analyze the feature indices contained in the sample feature vectors of the training data set, instead of selecting cluster center points based on expert experience.
As an optional embodiment, the feature indices contained in the standard sample feature vectors are analyzed with a Gaussian function to obtain a preset number of probability density distribution samples of the standard sample feature vectors, and the standard sample feature vector with the highest probability in each probability density distribution sample is set as a cluster center point. Specifically, analyzing the feature indices contained in the standard sample feature vectors of the training data set with a Gaussian function yields several probability density distribution samples; in each probability density distribution sample, the standard sample feature vector with the highest probability can be regarded as having feature indices similar to those of a large number of standard sample feature vectors, so the standard sample feature vector with the highest probability in each probability density distribution sample is set as the cluster center point of a cluster set, and the number of probability density distribution samples obtained with the Gaussian function is the number of cluster center points. It can be seen that, by using this embodiment of the present invention, good cluster center points can be obtained before the training data set is trained, which avoids the situation in which cluster center points chosen from expert experience cluster the standard data model poorly because of the limits of that experience and the data to be analyzed is then assessed inaccurately.
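The patent does not spell out the exact form of this Gaussian analysis, so the following is only a rough sketch of one plausible reading: fit a Gaussian mixture with the preset number of components over the standard sample feature vectors and take, for each component, the training vector of highest density as an initial cluster center point (the use of scikit-learn here is an assumption, not something named by the patent):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_cluster_centers(vectors, preset_number):
    """Pick initial cluster center points as the densest training vectors of each
    Gaussian component, instead of relying on expert experience."""
    X = np.asarray(vectors, dtype=float)
    gmm = GaussianMixture(n_components=preset_number, random_state=0).fit(X)
    labels = gmm.predict(X)                     # one probability density sample per component
    centers = []
    for k in range(preset_number):
        members = X[labels == k]
        if len(members) == 0:                   # degenerate component: fall back to all vectors
            members = X
        densities = gmm.score_samples(members)  # log-density of each member vector
        centers.append(members[densities.argmax()].tolist())
    return centers
```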
303. Process the training data set to obtain a standard data model.
304. Analyze the data to be analyzed to construct the feature vector of the data to be analyzed.
305. Input the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined criterion.
It can be seen that, by implementing the method described in Fig. 3 and using a Gaussian function to choose cluster center points for the standard data model in a reasonable way, the subjective factors of expert experience can be prevented from affecting the standard data model and making it impossible to assess the data to be analyzed accurately.
Referring to Fig. 4, Fig. 4 is a structural schematic diagram of a data feature mining apparatus disclosed by an embodiment of the present invention. As shown in Fig. 4, the data feature mining apparatus may include: a training unit 401, a clustering unit 402, a construction unit 403 and an analysis unit 404, wherein:
the training unit 401 is configured to collect data samples of a vertical domain that meet a predetermined criterion and build a training data set based on the data samples that meet the predetermined criterion; the training data set includes a plurality of sample feature vectors, each sample feature vector corresponding to one data sample that meets the predetermined criterion;
the clustering unit 402 is configured to process the training data set to obtain a standard data model;
the construction unit 403 is configured to analyze data to be analyzed to construct a feature vector of the data to be analyzed;
the analysis unit 404 is configured to input the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined criterion.
In the embodiment of the present invention, after the training unit 401 builds the training data set, the clustering unit 402 processes the training data set to obtain the standard data model; the construction unit 403 inputs the constructed feature vector of the data to be analyzed into the analysis unit 404 to obtain the probability that the data to be analyzed meets the predetermined criterion.
As an optional embodiment, the training unit 401 collects the data samples of the vertical domain that meet the predetermined criterion and builds the training data set based on them as follows: the training unit 401 collects the data samples of the vertical domain that meet the predetermined criterion, processes the data samples according to a preset screening rule to obtain standard data samples, analyzes each standard data sample to obtain the personal information, device fingerprint and behavioral data contained in the standard data and sets them as the feature indices of the standard data, and constructs standard sample feature vectors from the feature indices of the standard data to form the training data set; the preset screening rule is used to screen out data samples whose data format is not standard.
Specifically, suppose the vertical domain is set to the credit loan industry and the data samples that meet the predetermined criterion are set to blacklist data samples of the credit loan industry; the blacklist data samples collected by the training unit 401 may then be user samples blacklisted by the credit loan industry, user samples publicly listed as defaulters subject to enforcement, and so on. The training unit 401 first collects the detailed information of the blacklisted user samples, such as contact information, personal information, identity certificates, proof of work income, bank card statements, proof of intended loan use and information about the credit loan business applied for, and then processes the detailed information according to the preset screening rule: code is written to screen the detailed information against the rule, so that invalid blacklisted user samples whose details are missing or incorrectly formatted are easily screened out, leaving the blacklisted user samples with complete information as standard data together with the corresponding standard data samples. The training unit 401 extracts the feature indices from the standard data samples: for example, a first feature index is set to personal information such as the user's age, education and occupation; a second feature index is set to device fingerprint information such as the identification code of the user's device, the physical address of the user's device and the Wi-Fi address commonly used by the user's device; a third feature index is set to behavioral information such as the user's transaction/application frequency and geographic movement data. The extracted feature indices together are set as the feature vector of a standard data sample; for instance, the feature vector of standard data A might be extracted as (training, male, worker; 192.168.1.1, 123456789012345, 192.168.1.2; applied for credit loans 3 times per year on average, located in Guangzhou in 2017). The feature vectors of several pieces of standard data are extracted in the same way and packaged together to obtain the training data set.
It can be seen that, by implementing this embodiment of the present invention, the training unit 401 can screen invalid data samples out of messy, scattered data samples to obtain standard data samples and extract detailed feature vectors of the standard data samples, so that standard data samples and the user features corresponding to the standard data samples can be obtained effectively.
As an optional embodiment, the clustering unit 402 processes the training data set to obtain the standard data model as follows:
the clustering unit 402 selects a preset number of standard sample feature vectors and sets them as cluster center points;
while the cluster sets obtained are not the final cluster sets, the following steps are performed for each cluster center point each time:
set the standard sample feature vectors in the training data set other than the cluster center points as cluster distribution points;
for each cluster distribution point, calculate the weighted Euclidean distance between the cluster distribution point and each cluster center point according to the weight of each feature index in the standard sample feature vector corresponding to the cluster distribution point;
according to the weighted Euclidean distance between each cluster distribution point and each cluster center point, cluster each cluster distribution point into the cluster set corresponding to the cluster center point with the shortest weighted Euclidean distance to that cluster distribution point, thereby obtaining the preset number of cluster sets;
average the cluster distribution points and the cluster center point in each cluster set, and use the average as the new cluster center point of that cluster set;
when it is judged that the new cluster center point of each cluster set determined this time is identical to the cluster center point of the cluster set determined last time, determine that the preset number of cluster sets obtained this time are the final cluster sets;
based on the number of standard sample feature vectors contained in each final cluster set, determine the probability that the users corresponding to each final cluster set meet the predetermined criterion, and use each final cluster set together with its corresponding probability of the standard data meeting the predetermined criterion as the standard data model.
Specifically, the clustering unit 402 may select, based on expert experience, several representative standard sample feature vectors as cluster center points; the users corresponding to the standard sample feature vectors of these cluster center points are blacklisted users with representative user features, i.e. users that are relatively likely to commit fraud when applying for credit services. After the cluster center points are set, the clustering unit 402 sets the remaining standard sample feature vectors as cluster distribution points and processes the training data set with the k-means clustering algorithm based on the cluster center points: the weighted Euclidean distance between each cluster distribution point and the cluster center points is computed, and the clustering unit 402 clusters each cluster distribution point into the cluster set corresponding to the cluster center point with the shortest weighted Euclidean distance to that cluster distribution point, yielding several cluster sets. For example, if the expert sets the standard sample feature vector of the cluster center point of cluster set B to have the feature indices (high-school education, no assets, applies for credit loans more than 3 times per year on average), then every standard sample feature vector in that cluster set needs to have those feature indices, and the differences among the standard sample feature vectors in the cluster set are reflected in other feature indices such as the user's device fingerprint. After several cluster sets are obtained, the standard sample feature vectors in each cluster set are averaged, and the cluster distribution point corresponding to the average of the standard sample feature vectors is set as the new cluster center point of the cluster set; for example, after cluster set B is averaged, the feature indices (high-school education, unemployed, applies for credit loans more than 4 times per year on average) are obtained, and the feature vector corresponding to these feature indices is set as the cluster center point of cluster set B. The averaging is repeated for all cluster sets according to the above steps to obtain new cluster center points, until each new cluster center point determined this time is identical to the cluster center point of the cluster set determined last time, at which point the preset number of cluster sets obtained this time are determined to be the final cluster sets.
It can be seen that, with the expert selecting cluster center points based on business experience and the k-means clustering algorithm being used, the clustering unit 402 can cluster the standard sample feature vectors in the training data set into several cluster sets, so that the users corresponding to the standard sample feature vectors are classified reasonably according to the feature indices of each cluster center point.
As an optional embodiment, the clustering unit 402 determines, based on the number of standard sample feature vectors contained in each final cluster set, the probability that each final cluster set meets the predetermined criterion as follows: the clustering unit 402 calculates the ratio of the number of standard sample feature vectors contained in each final cluster set to the total number of standard sample feature vectors contained in the training data set, and uses this ratio as the probability that the final cluster set meets the predetermined criterion, so that the probability of meeting the predetermined criterion corresponding to each final cluster set is obtained. Specifically, if cluster set B contains 100 standard sample feature vectors and the training data set contains 200 standard sample feature vectors in total, the probability formula gives P(B) = 100/200 = 50%, i.e. the fraud probability of the users identified with cluster set B is 50%. It can be seen that, by implementing this embodiment, the clustering unit 402 can conveniently obtain the probability that the users corresponding to the standard sample feature vectors in a cluster set perform the user behavior corresponding to that cluster set.
As an alternative embodiment, cluster cell 402 calculate data to be analyzed feature vector and it is each most Eventually after the distance of the cluster centre point of cluster set, and in the determining feature vector with data to be analyzed of cluster cell 402 Final cluster set belonging to the feature vector of data to be analyzed is combined into apart from the corresponding final cluster set of shortest cluster centre point Before conjunction, if the shortest distance of the cluster centre point of the feature vector of data to be analyzed and each final cluster set is greater than each In final cluster set in cluster centre point and the final cluster set each clustering distribution point maximum distance, then analytical unit 404 determine that data to be analyzed do not meet preassigned.Specifically, in the feature vector of data to be analyzed and each final cluster set The shortest distance of the cluster centre point of conjunction is greater than in each final cluster set in cluster centre point and final cluster set often The feature vector for illustrating data to be analyzed when the maximum distance of a clustering distribution point and the master sample in each clustering distribution point There is larger difference in feature vector, should not be clustered in some cluster set, at this time it is believed that the use of the data to be analyzed Family feature does not meet several clusters and gathers corresponding user characteristics, and analytical unit 404 determines the feature vector of the data to be analyzed Preassigned is not met.As it can be seen that above-mentioned decision process will filter out the feature vector for not meeting the data to be analyzed of preassigned, Rather than simply it is clustered.
As another optional embodiment, when what the feature vector of data to be analyzed and each final cluster were gathered gathers The shortest distance of class central point is greater than cluster centre point in each final cluster set and each cluster in final cluster set When the maximum distance of distributed point, the corresponding user data of the feature vector of the data to be analyzed is pushed to specially by analytical unit 404 Family attends a banquet terminal, identifies for expert to the data to be analyzed.As it can be seen that manually being determined by expert, can avoid because of certain The feature vector of class user is more special, can not form independent cluster set and cause to differentiate wrong situation.
As an alternative embodiment, if the shortest distance is not greater than the maximum distance, the analysis unit 404 executes the steps of determining that the final cluster set corresponding to the cluster center point closest to the feature vector of the data to be analyzed is the final cluster set to which that feature vector belongs, and of taking the probability that the users of that final cluster set meet the predetermined standard as the probability that the data to be analyzed meets the predetermined standard; the analysis unit 404 then adds the feature vector of the data to be analyzed to the final cluster set to which it belongs and executes the step of separately calculating the weighted Euclidean distance between the cluster distribution points and the cluster center point, so as to update the standard data model. Specifically, when the feature vector of the data to be analyzed can be clustered into some cluster set, the analysis unit 404 sets the probability that the users of that cluster set meet the predetermined standard as the probability that the corresponding data to be analyzed meets the predetermined standard. In addition, after the feature vector of the data to be analyzed is added to the cluster set, the mean value of the cluster set changes; understandably, as the amount of analyzed data grows, the existing cluster center point no longer reflects the actual situation of the cluster set. Therefore, whenever the feature vector of data to be analyzed is added to a cluster set, the mean of that cluster set must be recomputed to obtain a new cluster center point, thereby updating the standard data model so that it can continue to handle the continually changing feature vectors of data to be analyzed. It can be seen that, by updating the standard data model in real time, no gap arises between the standard data model and the feature vectors of the data to be analyzed, realizing a machine learning function.
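A rough sketch of the assign-then-update step described above is given below, assuming each feature index carries a weight as in the weighted Euclidean distance used during training. The function names and data layout are illustrative only, and recomputing the center as the simple mean of the cluster's points is one reading of the averaging described in this embodiment.

```python
import math
from typing import List, Sequence

def weighted_distance(a: Sequence[float], b: Sequence[float], w: Sequence[float]) -> float:
    """Weighted Euclidean distance: sqrt(sum_i w_i * (a_i - b_i)^2)."""
    return math.sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))

def assign_and_update(vector: List[float],
                      weights: List[float],
                      centers: List[List[float]],
                      clusters: List[List[List[float]]],
                      cluster_probs: List[float]) -> float:
    """Assign the analyzed vector to its nearest cluster set, return that
    cluster's probability of meeting the predetermined standard, and recompute
    the cluster center as the mean of the cluster's points so the standard
    data model stays up to date."""
    nearest = min(range(len(centers)),
                  key=lambda i: weighted_distance(vector, centers[i], weights))
    clusters[nearest].append(list(vector))
    points = clusters[nearest]
    centers[nearest] = [sum(p[d] for p in points) / len(points)
                        for d in range(len(vector))]
    return cluster_probs[nearest]
```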
As it can be seen that implementing method described in Fig. 4, by 404 analytical standard data template data of analytical unit and instruction is constructed Practice data acquisition system, algorithm can be used to obtain normal data model, in the feature for being analysed to the corresponding data to be analyzed of data After vector inputs normal data model, you can learn that the probability of the data fit preassigned to be analyzed.It improves to be analyzed The efficiency that data are identified reduces loss caused by due to manually identifying mistake.
Referring to Fig. 5, Fig. 5 is a structural schematic diagram of another data feature mining apparatus disclosed by an embodiment of the present invention. The data feature mining apparatus shown in Fig. 5 is obtained by optimizing the data feature mining apparatus shown in Fig. 4. Compared with the apparatus shown in Fig. 4, the data feature mining apparatus shown in Fig. 5 may further include a cluster center unit 405, wherein:
The cluster center unit 405 is configured to analyze the feature indexes included in the standard sample feature vectors using a Gaussian function, obtain a preset number of probability density distribution samples of the standard sample feature vectors, and set, in each probability density distribution sample, the standard sample feature vector with the highest probability as a cluster center point.
In the embodiment of the present invention, the Gaussian function is used to analyze the feature indexes included in the sample feature vectors of the training data set in order to obtain the cluster center points.
As an alternative embodiment, the cluster center unit 405 analyzes the feature indexes included in the standard sample feature vectors using a Gaussian function, obtains a preset number of probability density distribution samples of the standard sample feature vectors, and sets the standard sample feature vector with the highest probability in each probability density distribution sample as a cluster center point. Specifically, by analyzing the feature indexes of the standard sample feature vectors in the training data set with a Gaussian function, the cluster center unit 405 can obtain several probability density distribution samples. In each probability density distribution sample, the standard sample feature vector with the highest probability can be regarded as having feature indexes similar to those of a large number of other standard sample feature vectors; therefore, the highest-probability standard sample feature vector in each probability density distribution sample can be set as the cluster center point of a cluster set, and the number of probability density distribution samples obtained with the Gaussian function equals the number of cluster center points. It can be seen that, by using this embodiment, the cluster center unit 405 can obtain good cluster center points before the training data set is trained, avoiding the situation in which cluster center points selected from limited expert experience lead to poor clustering of the standard data model and misidentification of the data to be analyzed.
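The disclosure does not spell out the exact form of the Gaussian analysis, so the sketch below is only one plausible reading: the standard sample feature vectors are split into a preset number of groups (the probability density distribution samples), a Gaussian kernel density estimate is fitted over the vectors, and the highest-density vector of each group is taken as a cluster center point. The use of scipy.stats.gaussian_kde and the random partitioning into groups are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pick_cluster_centers(samples: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """samples: (n, d) array of standard sample feature vectors, n > d.
    Partition the vectors into k groups (the probability density distribution
    samples), estimate a Gaussian kernel density over all vectors, and return
    the highest-density vector of each group as a cluster center point."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))
    groups = np.array_split(order, k)
    kde = gaussian_kde(samples.T)   # density estimate over the feature space
    density = kde(samples.T)        # density of each standard sample vector
    centers = [samples[g[np.argmax(density[g])]] for g in groups]
    return np.stack(centers)

# e.g. centers = pick_cluster_centers(np.random.rand(200, 5), k=3)
```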
It can be seen that, by implementing the method described in Fig. 5, the Gaussian function is used to choose reasonable cluster center points for the standard data model, which avoids subjective factors of expert experience affecting the standard data model and preventing the data to be analyzed from being accurately identified.
The present invention also provides an electronic device, the electronic device comprising:
a processor; and
a memory having computer-readable instructions stored thereon which, when executed by the processor, implement the data feature mining method described above.
The electronic device may be the data feature mining apparatus 100 shown in Fig. 1.
In an exemplary embodiment, the present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the data feature mining method described above is implemented.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

1. A data feature mining method, characterized by comprising:
collecting, in a vertical field, data samples that meet a predetermined standard, and constructing a training data set based on the data samples that meet the predetermined standard; the training data set includes multiple sample feature vectors, and each sample feature vector corresponds to one data sample that meets the predetermined standard;
processing the training data set to obtain a standard data model;
analyzing data to be analyzed to construct a feature vector of the data to be analyzed;
inputting the feature vector of the data to be analyzed into the standard data model to obtain a probability that the data to be analyzed meets the predetermined standard.
2. The data feature mining method according to claim 1, characterized in that the collecting, in a vertical field, data samples that meet a predetermined standard, and constructing a training data set based on the data samples that meet the predetermined standard, comprises:
collecting, in the vertical field, data samples that meet the predetermined standard, and processing the data samples according to a preset screening rule to obtain standard data samples; the preset screening rule is used to screen out data samples with a non-standard data format;
analyzing the standard data samples to obtain personal information, device fingerprints and behavior data included in the standard data, and setting them as feature indexes of the standard data;
constructing standard sample feature vectors according to the feature indexes of the standard data to serve as the training data set.
3. The data feature mining method according to claim 2, characterized in that the processing the training data set to obtain a standard data model comprises:
selecting a preset number of standard sample feature vectors as cluster center points;
executing the following steps for each cluster center point, each time the obtained cluster sets are not the final cluster sets:
setting the standard sample feature vectors in the training data set other than the cluster center points as cluster distribution points;
for each cluster distribution point, calculating a weighted Euclidean distance between the cluster distribution point and each cluster center point according to the weight of each feature index in the standard sample feature vector corresponding to the cluster distribution point;
according to the weighted Euclidean distance between each cluster distribution point and each cluster center point, clustering each cluster distribution point into the cluster set corresponding to the cluster center point whose weighted Euclidean distance to that cluster distribution point is shortest, so as to obtain the preset number of cluster sets;
in each cluster set, averaging the cluster distribution points and the cluster center point of the cluster set to obtain an average value as a new cluster center point of the cluster set;
when it is judged that each newly determined cluster center point is identical to the corresponding cluster center point determined in the previous cluster sets, determining that the preset number of cluster sets obtained this time are the final cluster sets;
determining, based on the number of standard sample feature vectors included in each final cluster set, the probability that the data corresponding to each final cluster set meets the predetermined standard, and using each final cluster set and the probability that its corresponding standard data meets the predetermined standard as the standard data model;
and the inputting the feature vector of the data to be analyzed into the standard data model to obtain the probability that the data to be analyzed meets the predetermined standard comprises:
calculating a distance between the feature vector of the data to be analyzed and the cluster center point of each final cluster set, and determining that the final cluster set corresponding to the cluster center point closest to the feature vector of the data to be analyzed is the final cluster set to which the feature vector of the data to be analyzed belongs;
using the probability that the standard data corresponding to the final cluster set to which the feature vector of the data to be analyzed belongs meets the predetermined standard as the probability that the data to be analyzed meets the predetermined standard.
4. The data feature mining method according to claim 3, characterized in that before the calculating a weighted Euclidean distance between the cluster distribution point and each cluster center point according to the weight of each feature index in the standard sample feature vector corresponding to the cluster distribution point, the method further comprises:
determining the weight of each feature index in the standard sample feature vector according to an expert rule, so that the weight, in the standard sample feature vector, of a feature index regarded by the expert rule as having a high probability of meeting the predetermined standard is higher than the weight, in the standard sample feature vector, of a feature index regarded by the expert rule as having a low probability of meeting the predetermined standard.
5. The data feature mining method according to claim 4, characterized in that the selecting a preset number of standard sample feature vectors as cluster center points comprises:
analyzing the feature indexes included in the standard sample feature vectors using a Gaussian function to obtain the preset number of probability density distribution samples of the standard sample feature vectors, and setting the standard sample feature vector with the highest probability in each probability density distribution sample as a cluster center point.
6. The data feature mining method according to any one of claims 3 to 5, characterized in that the determining, based on the number of standard sample feature vectors included in each final cluster set, the probability that each final cluster set meets the predetermined standard comprises:
calculating the ratio of the number of standard sample feature vectors included in each final cluster set to the total number of standard sample feature vectors included in the training data set as the probability that the final cluster set meets the predetermined standard, so as to obtain, for each final cluster set, the corresponding probability of meeting the predetermined standard.
7. The data feature mining method according to claim 6, characterized in that after the calculating a distance between the feature vector of the data to be analyzed and the cluster center point of each final cluster set, and before the determining that the final cluster set corresponding to the cluster center point closest to the feature vector of the data to be analyzed is the final cluster set to which the feature vector of the data to be analyzed belongs, the method further comprises:
if the shortest distance between the feature vector of the data to be analyzed and the cluster center points of the final cluster sets is greater than the maximum distance between the cluster center point of each final cluster set and each cluster distribution point in that final cluster set, determining that the data to be analyzed does not meet the predetermined standard;
if the shortest distance is not greater than the maximum distance, executing the steps of determining that the final cluster set corresponding to the cluster center point closest to the feature vector of the data to be analyzed is the final cluster set to which the feature vector of the data to be analyzed belongs, and of using the probability that the final cluster set to which the feature vector of the data to be analyzed belongs meets the predetermined standard as the probability that the data to be analyzed meets the predetermined standard;
adding the feature vector of the data to be analyzed into the final cluster set to which the feature vector of the data to be analyzed belongs;
and executing the step of separately calculating the weighted Euclidean distance between the cluster distribution points and the cluster center points, so as to update the standard data model.
8. A data feature mining apparatus, characterized by comprising:
a training unit, configured to collect, in a vertical field, data samples that meet a predetermined standard, and construct a training data set based on the data samples that meet the predetermined standard; the training data set includes multiple sample feature vectors, and each sample feature vector corresponds to one data sample that meets the predetermined standard;
a clustering unit, configured to process the training data set to obtain a standard data model;
a construction unit, configured to analyze data to be analyzed to construct a feature vector of the data to be analyzed;
an analysis unit, configured to input the feature vector of the data to be analyzed into the standard data model to obtain a probability that the data to be analyzed meets the predetermined standard.
9. An electronic device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the data feature mining method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which causes a computer to execute the data feature mining method according to any one of claims 1 to 7.
CN201910630499.7A 2019-04-19 2019-07-12 Data feature mining method and device, electronic equipment and storage medium Active CN110288468B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019103178422 2019-04-19
CN201910317842 2019-04-19

Publications (2)

Publication Number Publication Date
CN110288468A true CN110288468A (en) 2019-09-27
CN110288468B CN110288468B (en) 2023-06-06

Family

ID=68022554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910630499.7A Active CN110288468B (en) 2019-04-19 2019-07-12 Data feature mining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110288468B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004038602A1 (en) * 2002-10-24 2004-05-06 Warner-Lambert Company, Llc Integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications
US20150142808A1 (en) * 2013-11-15 2015-05-21 Futurewei Technologies Inc. System and method for efficiently determining k in data clustering
CN106980623A (en) * 2016-01-18 2017-07-25 华为技术有限公司 A kind of determination method and device of data model
WO2017124713A1 (en) * 2016-01-18 2017-07-27 华为技术有限公司 Data model determination method and apparatus
CN109360105A (en) * 2018-09-18 2019-02-19 平安科技(深圳)有限公司 Product risks method for early warning, device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062440A (en) * 2019-12-18 2020-04-24 腾讯科技(深圳)有限公司 Sample selection method, device, equipment and storage medium
CN111062440B (en) * 2019-12-18 2024-02-02 腾讯科技(深圳)有限公司 Sample selection method, device, equipment and storage medium
CN112348107A (en) * 2020-11-17 2021-02-09 百度(中国)有限公司 Image data cleaning method and apparatus, electronic device, and medium
CN114996360A (en) * 2022-07-20 2022-09-02 江西现代职业技术学院 Data analysis method, system, readable storage medium and computer equipment

Also Published As

Publication number Publication date
CN110288468B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
US12014253B2 (en) System and method for building predictive model for synthesizing data
WO2019100844A1 (en) Machine learning model training method and device, and electronic device
CN109409677A (en) Enterprise Credit Risk Evaluation method, apparatus, equipment and storage medium
US11501161B2 (en) Method to explain factors influencing AI predictions with deep neural networks
CN110245213A (en) Questionnaire generation method, device, equipment and storage medium
CN108288067A (en) Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN110288468A (en) Data characteristics method for digging, device, electronic equipment and storage medium
CN108009600A (en) Model optimization, quality determining method, device, equipment and storage medium
CN113299346B (en) Classification model training and classifying method and device, computer equipment and storage medium
CN111241992B (en) Face recognition model construction method, recognition method, device, equipment and storage medium
CN103839183A (en) Intelligent credit extension method and intelligent credit extension device
Teng et al. Customer credit scoring based on HMM/GMDH hybrid model
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN112712383A (en) Potential user prediction method, device, equipment and storage medium of application program
CN109948730A (en) A kind of data classification method, device, electronic equipment and storage medium
CN111815169A (en) Business approval parameter configuration method and device
CN110717509A (en) Data sample analysis method and device based on tree splitting algorithm
CN111178656A (en) Credit model training method, credit scoring device and electronic equipment
CN110232154A (en) Products Show method, apparatus and medium based on random forest
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
Pratondo et al. Prediction of Operating System Preferences on Mobile Phones Using Machine Learning
CN108304568A (en) A kind of real estate Expectations big data processing method and system
CN113762579A (en) Model training method and device, computer storage medium and equipment
CN109934352B (en) Automatic evolution method of intelligent model
CN110516713A (en) A kind of target group's recognition methods, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant