CN109615020A

CN109615020A - Feature analysis method, apparatus, device and medium based on machine learning model

Info

Publication number: CN109615020A
Application number: CN201811588694.XA
Authority: CN
Inventors: 谭辉; 李元; 汪亚男; 邱毅
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2019-04-12

Abstract

The invention discloses a feature analysis method, apparatus, device and medium based on a machine learning model. The method comprises: determining a second training sample set based on an acquired target sample and a first training sample set, where the target sample has a preset sample category determined by a classification model and the classification model is trained on the first training sample set; training a feature analysis model according to a preset training rule and the second training sample set; inputting a prediction sample into the classification model to obtain the sample category of the prediction sample; and, when the sample category of the prediction sample is detected to be the same as the preset sample category, inputting the prediction sample into the feature analysis model to obtain a feature analysis result for the prediction sample. When a machine learning model is used for business classification, the invention enables feature analysis of a single sample without changing the algorithm of the classification model.

Description

Feature analysis method, apparatus, device and medium based on machine learning model

Technical field

The present invention relates to the technical field of machine learning, and in particular to a feature analysis method, apparatus, device and medium based on a machine learning model.

Background art

When a machine learning model is used to predict and classify business samples, each business sample has multiple features, and each feature contributes to a different degree to the classification result of the sample. The feature importance of a sample characterizes how important each of the sample's features is to the current determination when the classification model assigns the sample to a given category.

Among current machine learning models, those with relatively simple algorithms, such as decision trees, allow the feature importance of a single sample to be read from the classification result, but their classification performance is poor and they are therefore rarely used. With models that classify better, such as SVM (Support Vector Machine) and neural networks, the user can learn from the model's output only which category a single business sample belongs to, and cannot learn which features of the sample the model mainly relied on to assign it to that category. That is, the user cannot obtain the feature importance of a single sample under the determination result unless the algorithm is opened up and the relevant source code is modified, which requires deep knowledge of the algorithm.

Summary of the invention

The main purpose of the present invention is to provide a feature analysis method, apparatus, device and medium based on a machine learning model, with the aim of achieving feature analysis of a single sample without changing the algorithm of the classification model, so that the user learns not only the classification result of a business sample but also the feature importance of the business sample under that classification result, which helps the user make better business judgments based on the feature importance of the sample.

To achieve the above object, the present invention provides a feature analysis method based on a machine learning model, the method comprising the following steps:

determining a second training sample set based on an acquired target sample and a first training sample set, where the target sample has a preset sample category determined by a classification model, and the classification model is trained on the first training sample set;

training a feature analysis model according to a preset training rule and the second training sample set;

inputting a prediction sample into the classification model to obtain the sample category of the prediction sample;

when the sample category of the prediction sample is detected to be the same as the preset sample category, inputting the prediction sample into the feature analysis model to obtain a feature analysis result for the prediction sample.

Optionally, the second training sample set includes multiple second training samples, and the step of determining the second training sample set based on the acquired target sample and first training sample set includes:

acquiring the target sample, the first training sample set and multiple initial training samples;

multiplying each initial training sample by the standard deviation of the first training sample set, adding the product to the target sample, and taking the sum as a second training sample;

determining the second training sample set based on the multiple second training samples obtained.

Optionally, the step of training a feature analysis model according to the preset training rule and the second training sample set includes:

calculating the Euclidean distance between the target sample and each second training sample;

calculating an input coefficient for the second training sample according to a preset calculation formula and the Euclidean distance corresponding to the second training sample;

obtaining the predicted value of the classification model for the second training sample;

performing ridge regression model training with the multiple second training samples and their corresponding input coefficients and predicted values as input parameters, and taking the training result as the feature analysis model.
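The optional training steps above can be sketched in Python as a distance-weighted ridge regression. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not disclose the "preset calculation formula", so the exponential kernel in `kernel_weight` is assumed; the names `train_feature_analysis_model`, `classifier_predict`, `alpha` and `width` are hypothetical; and the input coefficient is assumed to act as a per-sample weight in the regression. Only the standard library is used so that the sketch runs without dependencies.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kernel_weight(distance, width=1.0):
    # ASSUMPTION: the source only mentions a "preset calculation formula";
    # an exponential kernel is used here purely for illustration.
    return math.exp(-(distance ** 2) / (width ** 2))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_weighted_ridge(samples, weights, predictions, alpha=1.0):
    """Return (intercept, coefficients) minimising
    sum_i w_i * (y_i - b0 - x_i . b)^2 + alpha * ||b||^2."""
    rows = [[1.0] + list(x) for x in samples]  # leading 1.0 is the intercept column
    p = len(rows[0])
    # Normal equations: (X^T W X + alpha * I') beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(p)]
            for i in range(p)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, predictions))
            for i in range(p)]
    for i in range(1, p):  # the intercept (index 0) is not penalised
        xtwx[i][i] += alpha
    beta = solve(xtwx, xtwy)
    return beta[0], beta[1:]


def train_feature_analysis_model(target, second_samples, classifier_predict,
                                 alpha=1.0, width=1.0):
    """Weight each second training sample by its distance to the target,
    query the classification model, and fit the ridge regression."""
    weights = [kernel_weight(euclidean(target, x), width) for x in second_samples]
    predictions = [classifier_predict(x) for x in second_samples]
    return fit_weighted_ridge(second_samples, weights, predictions, alpha)
```

In this sketch the classification model is treated purely as a callable that returns a predicted value, which reflects the stated goal of not modifying its source code. The returned coefficients play the role of the feature analysis model.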

Optionally, before the step of inputting the prediction sample into the feature analysis model to obtain the feature analysis result of the prediction sample when the sample category of the prediction sample is detected to be the same as the preset sample category, the method further includes:

performing accuracy verification on the feature analysis model based on the second training sample set;

judging whether the feature analysis model passes the accuracy verification and, if it passes, proceeding to the step of: when the sample category of the prediction sample is detected to be the same as the preset sample category, inputting the prediction sample into the feature analysis model to obtain the feature analysis result of the prediction sample.

Optionally, the step of performing accuracy verification on the feature analysis model based on the second training sample set includes:

obtaining, from the feature analysis model, several model features that satisfy a preset condition;

training the feature analysis model according to the several model features to obtain multiple first predicted values;

inputting the multiple second training samples contained in the second training sample set into the classification model respectively to obtain multiple second predicted values;

performing accuracy verification on the feature analysis model according to the multiple first predicted values and the multiple second predicted values.
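The final comparison in the verification steps above might look like the following Python sketch. The source does not say how the first and second predicted values are compared, so the coefficient of determination (R²) and the `min_r2` threshold are assumptions chosen for illustration, and the function names are hypothetical. The sketch takes both lists of predicted values as given and does not cover how the first predicted values are produced.

```python
def r_squared(first_predictions, second_predictions):
    """R^2 of the feature analysis model's outputs (first predicted values)
    measured against the classification model's outputs (second predicted values)."""
    n = len(second_predictions)
    mean = sum(second_predictions) / n
    ss_res = sum((s - f) ** 2 for f, s in zip(first_predictions, second_predictions))
    ss_tot = sum((s - mean) ** 2 for s in second_predictions)
    if ss_tot == 0.0:
        # Constant reference values: perfect only if the residual is zero too.
        return 1.0 if ss_res == 0.0 else 0.0
    return 1.0 - ss_res / ss_tot


def passes_accuracy_verification(first_predictions, second_predictions, min_r2=0.8):
    # ASSUMPTION: both the metric and the 0.8 threshold are illustrative only.
    return r_squared(first_predictions, second_predictions) >= min_r2
```

A high score indicates that the simpler feature analysis model reproduces the classification model's behaviour closely on the samples around the target, which is the property the verification step is meant to establish.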

In addition, the present invention also provides a feature analysis apparatus based on a machine learning model, the apparatus including:

an extraction module, configured to determine a second training sample set based on an acquired target sample and a first training sample set, where the target sample has a preset sample category determined by a classification model, and the classification model is trained on the first training sample set;

a training module, configured to train a feature analysis model according to a preset training rule and the second training sample set;

a determination module, configured to input a prediction sample into the classification model to obtain the sample category of the prediction sample;

an analysis module, configured to, when the sample category of the prediction sample is detected to be the same as the preset sample category, input the prediction sample into the feature analysis model to obtain a feature analysis result for the prediction sample.

Optionally, the second training sample set includes multiple second training samples, and the extraction module includes:

a first acquisition unit, configured to acquire the target sample, the first training sample set and multiple initial training samples;

a processing unit, configured to multiply each initial training sample by the standard deviation of the first training sample set, add the product to the target sample, and take the sum as a second training sample;

a determination unit, configured to determine the second training sample set based on the multiple second training samples obtained.

Optionally, the training module includes:

a first computing unit, configured to calculate the Euclidean distance between the target sample and each second training sample;

a second computing unit, configured to calculate an input coefficient for the second training sample according to a preset calculation formula and the Euclidean distance corresponding to the second training sample;

a second acquisition unit, configured to obtain the predicted value of the classification model for the second training sample;

a training unit, configured to perform ridge regression model training with the multiple second training samples and their corresponding input coefficients and predicted values as input parameters, and to take the training result as the feature analysis model.

Optionally, the apparatus further includes:

a verification module, configured to perform accuracy verification on the feature analysis model based on the second training sample set;

a judgment module, configured to judge whether the feature analysis model passes the accuracy verification and, after judging that the feature analysis model has passed the accuracy verification, to send the judgment result "passed" to the analysis module;

the analysis module is further configured to, after receiving the judgment result "passed" sent by the judgment module and when the sample category of the prediction sample is detected to be the same as the preset sample category, input the prediction sample into the feature analysis model to obtain the feature analysis result of the prediction sample.

Optionally, the verification module includes:

an extraction unit, configured to obtain, from the feature analysis model, several model features that satisfy a preset condition;

a third computing unit, configured to train the feature analysis model according to the several model features to obtain multiple first predicted values;

a fourth computing unit, configured to input the multiple second training samples contained in the second training sample set into the classification model respectively to obtain multiple second predicted values;

a verification unit, configured to perform accuracy verification on the feature analysis model according to the multiple first predicted values and the multiple second predicted values.

In addition, the present invention also provides a feature analysis device based on a machine learning model, the device including: a memory, a processor, and a feature analysis program based on a machine learning model that is stored in the memory and executable on the processor. When executed by the processor, the feature analysis program implements the steps of the feature analysis method based on a machine learning model described above.

In addition, the present invention also provides a medium applied to a computer. The medium stores a feature analysis program based on a machine learning model which, when executed by a processor, implements the steps of the feature analysis method based on a machine learning model described above.

The present invention determines a second training sample set based on an acquired target sample and a first training sample set, where the target sample has a preset sample category determined by a classification model and the classification model is trained on the first training sample set; trains a feature analysis model according to a preset training rule and the second training sample set; inputs a prediction sample into the classification model to obtain the sample category of the prediction sample; and, when the sample category of the prediction sample is detected to be the same as the preset sample category, inputs the prediction sample into the feature analysis model to obtain a feature analysis result for the prediction sample. Because the feature analysis model is trained on the second training sample set determined from the target sample and the first training sample set, the important features of a single prediction sample can be analyzed without modifying the source code of the classification model. This solves the prior-art problem that, when a classification model with good classification performance but a complex algorithm is used for sample classification, the user can learn from the model's output only which category a single business sample belongs to and cannot learn which features of the sample the model mainly relied on to assign it to that category, that is, cannot obtain the feature importance of a single sample under the determination result. Without modifying or intruding on the classification model algorithm, the present invention increases the reference value of the model's classification results and helps business personnel make better business judgments and carry out business on the basis of those results.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the hardware operating environment involved in embodiments of the present invention;

Fig. 2 is a schematic flowchart of a first embodiment of the feature analysis method based on a machine learning model of the present invention;

Fig. 3 is a schematic flowchart of a second embodiment of the feature analysis method based on a machine learning model of the present invention;

Fig. 4 is a schematic flowchart of a third embodiment of the feature analysis method based on a machine learning model of the present invention;

Fig. 5 is a schematic diagram of the detailed steps of step S310 in Fig. 4.

The realization of the object, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the embodiments

It should be understood that the specific embodiments described herein are merely intended to explain the present invention and are not intended to limit it.

As shown in Fig. 1, Fig. 1 is a schematic structural diagram of the hardware operating environment involved in embodiments of the present invention.

It should be noted that Fig. 1 may be a schematic structural diagram of the hardware operating environment of a sample feature analysis device. The sample feature analysis device of the embodiments of the present invention may be a terminal device such as a PC or a portable computer.

As shown in Fig. 1, the sample feature analysis device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002. The communication bus 1002 is used for connection and communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art will understand that the structure of the sample feature analysis device shown in Fig. 1 does not constitute a limitation on the device, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.

As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a feature analysis program based on a machine learning model. The operating system is a program that manages and controls the hardware and software resources of the sample feature analysis device and supports the running of the feature analysis program based on a machine learning model and of other software or programs.

In the sample feature analysis device shown in Fig. 1, the user interface 1003 is mainly used for data communication with each terminal; the network interface 1004 is mainly used to connect to a background server and perform data communication with it; and the processor 1001 may be used to call the feature analysis program based on a machine learning model stored in the memory 1005 and execute the following operations:

Further, the processor 1001 may also be used to call the feature analysis program based on a machine learning model stored in the memory 1005 and execute the following steps:

multiplying each initial training sample by the standard deviation of the first training sample set, adding the product to the target sample, and taking the sum as a second training sample;

Based on the above structure, embodiments of the feature analysis method based on a machine learning model are proposed.

Referring to Fig. 2, Fig. 2 is a schematic flowchart of a first embodiment of the feature analysis method based on a machine learning model of the present invention.

Embodiments of the present invention provide embodiments of the feature analysis method based on a machine learning model. It should be noted that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.

The feature analysis method based on a machine learning model of the embodiments of the present invention is applied to a feature analysis device, which may be a terminal device such as a PC or a portable computer; no specific limitation is imposed here.

The feature analysis method based on a machine learning model of this embodiment includes:

Step S100: determining a second training sample set based on an acquired target sample and a first training sample set, where the target sample has a preset sample category determined by a classification model, and the classification model is trained on the first training sample set.

With the rapid development of big data and machine learning, machine learning models are being applied ever more widely to business classification and prediction. When a machine learning model is used to predict and classify business samples, each business sample has multiple features, and each feature contributes to a different degree to the classification result of the sample. The feature importance of a sample characterizes how important each of the sample's features is to the current determination when the classification model assigns the sample to a given category; the greater a feature's contribution to the determination result, the more important that feature is.

At present, commonly used classification models include decision trees, logistic regression, SVM, random forests and neural networks. Many of the more complex machine learning classification models are black boxes: the input is the multiple feature values of a sample and the output is the classification result for the sample, but why those feature values lead to that classification result is unknown. In the present application, a complex classification algorithm can be used and, without changing the source code of the classification model, interpretability analysis of a single sample can be achieved. Here, interpretability means describing why the classification result can be derived from the many feature values and which features influence the classification result most.

The interpretability of current machine learning algorithms falls into two classes: relatively simple algorithms such as decision trees and logistic regression can provide the feature importance of a single sample for direct use, whereas for relatively complex algorithms such as SVM and neural networks the feature importance of a single sample is unavailable. When a financial system uses a machine learning model, three cases arise. With a relatively simple algorithm such as a decision tree, the feature importance of a single sample is available, but the classification performance is poor, so such algorithms are rarely used. With algorithms such as logistic regression, feature importance is available at the model level, that is, when multiple different samples are input the model gives the probability that each sample belongs to a given category, but there is no output of feature importance for a single sample. With a model whose algorithm is complex and whose classification performance is better, explaining a single sample requires opening up the algorithm and modifying the relevant source code, which requires deep knowledge of the algorithm.

Financial systems attach great importance to model interpretability when using machine learning. For example, in the anti-money-laundering field, regulators strictly require every money-laundering case to explain, at the levels of customer, account and transaction, why the customer is suspected of money laundering. In practical anti-money-laundering applications, however, a machine learning model such as a neural network can only determine whether a customer is suspected of money laundering and cannot explain why; such a determination result is of little help to business personnel in analyzing and judging whether the customer is really laundering money.

In this embodiment, a target sample and a first training sample set are first acquired, where the target sample is a sample containing a specific feature (if the method is applied to anti-money-laundering identification, the specific feature is preferably a feature of the money-laundering category). A second training sample set is determined based on the acquired target sample and first training sample set; the target sample has a preset sample category determined by a classification model, and the classification model is trained on the first training sample set. Specifically, the classification model includes but is not limited to a decision tree, logistic regression, SVM or a neural network, and after being classified by the classification model the target sample is determined to belong to a certain sample category of that model. For example, the target sample includes feature information of a certain customer, such as behavioral features, and the classification model is built to classify whether a customer is suspected of money laundering; after being classified by the classification model, the target sample is determined to be suspected of money laundering. It can be understood that, since the classification model is built to classify whether a customer is suspected of money laundering, the first training sample set includes feature information of multiple customer samples, among which there are both bad samples suspected of money laundering and good samples not suspected of money laundering. The target sample and the first training sample set take the customer as the dimension and may have multiple features describing customer behavior, such as the amount transferred in by the customer on the day, the amount transferred out, and the number of transactions occurring in high-risk regions.

In this embodiment, this is equivalent to selecting several customer samples containing customer feature information (namely the above target samples), randomly distributing these customer samples, and then processing the randomly distributed customer samples to change their spatial distribution. The result is a feature set of customer samples that are distributed around the target sample and are suspected of money laundering, which serves as the second training sample set. The feature importance of customers suspected of money laundering is then obtained by analyzing these multiple customer samples.

Step S200: training a feature analysis model according to a preset training rule and the second training sample set.

Importance analysis is performed on the multiple customer features contained in the second training sample set distributed around the target sample. Specifically, a regression algorithm model is chosen, and the data obtained from the second training sample set are substituted into the regression algorithm model for calculation to obtain a linear output expression over the multiple customer features. The features are ranked by importance according to the coefficient of each feature in the resulting expression, which gives the important features in the feature analysis model, that is, the importance ranking of the multiple features of customers under the same sample category. Further, the regression algorithm model may be a ridge regression model or another regression model.
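The ranking of features by their coefficients in the fitted linear expression can be illustrated with a short Python sketch. The source says only that ranking follows the coefficient of each feature; ordering by absolute value is an assumption made here, and the function name and feature names are hypothetical (the feature names paraphrase the customer-behavior examples given in this document).

```python
def rank_features(feature_names, coefficients):
    """Return (name, coefficient) pairs ordered from most to least important.

    ASSUMPTION: importance is taken as the absolute value of the coefficient,
    so that strongly negative contributions also rank high.
    """
    pairs = list(zip(feature_names, coefficients))
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical coefficients from a fitted feature analysis model.
ranking = rank_features(
    ["amount_in", "amount_out", "high_risk_region_count"],
    [0.2, -0.7, 0.5],
)
```

Comparing coefficients in this way is meaningful only when the features are on comparable scales, which is one reason to standardize the inputs before fitting.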

Step S300: inputting a prediction sample into the classification model to obtain the sample category of the prediction sample.

The prediction sample is input into the classification model built from the first training sample set so that the prediction sample is classified and its classification result (that is, its sample category) is obtained.

Step S400: when the sample category of the prediction sample is detected to be the same as the preset sample category, inputting the prediction sample into the feature analysis model to obtain a feature analysis result for the prediction sample.

It is judged whether the classification result of the prediction sample is consistent with the category of the target sample, for example whether the prediction sample is also suspected of money laundering. If the classification model determines that the prediction sample is suspected of money laundering, the current prediction sample is input into the feature analysis model built for customers suspected of money laundering, and the feature importance of the prediction sample is obtained.

Further, when the classification model determines that both the target sample and the prediction sample are suspected of money laundering, the predicted value output by the classification model for the target sample and the one output for the prediction sample may be the same or different. In this embodiment, when the two predicted values differ, then before the prediction sample is input into the feature analysis model to obtain its feature analysis result, as one implementation it may optionally first be judged whether the difference between the two predicted values is less than a preset threshold. If it is, this indicates that, given that the selected target sample and the prediction sample are both suspected of money laundering, the difference between the two samples is within a preset range; the prediction sample is then input into the feature analysis model to obtain its feature analysis result, which improves the accuracy of the feature analysis of the prediction sample. While learning from the machine learning model's result that the business sample is suspected of money laundering, business personnel can also learn the importance of each feature of the business sample, and can combine the importance of each feature of the single sample to judge whether the business sample really involves money laundering. This reduces the workload of business personnel, and when applied to the anti-money-laundering field the invention can also satisfy regulatory requirements.
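Steps S300 and S400, together with the optional threshold check on the two predicted values, can be sketched in Python as follows. This is an illustration under assumptions: the classification model and the feature analysis model are represented by plain callables, and every name (`analyze_prediction_sample`, `classify`, `predict_value`, `explain`, `target_value`, `threshold`) is hypothetical rather than taken from the source.

```python
def analyze_prediction_sample(sample, classify, predict_value, explain,
                              preset_category, target_value, threshold=None):
    """Return (category, feature_analysis_result), with None as the result
    when the sample is not passed on to the feature analysis model."""
    category = classify(sample)                 # step S300
    if category != preset_category:             # step S400: categories must match
        return category, None
    if threshold is not None:
        # Optional check: the classification model's predicted value for this
        # sample must lie within `threshold` of its value for the target sample.
        if abs(predict_value(sample) - target_value) >= threshold:
            return category, None
    return category, explain(sample)
```

Passing the models as callables keeps the classification model a black box in the sketch, in line with the stated aim of leaving its algorithm unchanged.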

The present invention determines a second training sample set based on an acquired target sample and a first training sample set, where the target sample has a preset sample category determined by a classification model and the classification model is trained on the first training sample set; trains a feature analysis model according to a preset training rule and the second training sample set; inputs a prediction sample into the classification model to obtain the sample category of the prediction sample; and, when the sample category of the prediction sample is detected to be the same as the preset sample category, inputs the prediction sample into the feature analysis model to obtain a feature analysis result for the prediction sample. Because the feature analysis model is trained on the second training sample set determined from the target sample and the first training sample set, the important features of a single prediction sample can be analyzed without modifying the source code of the classification model. This solves the prior-art problem that, when a classification model with good classification performance but a complex algorithm is used for sample classification, the user can learn from the model's output only which category a single business sample belongs to and cannot learn which features of the sample the model mainly relied on to assign it to that category, that is, cannot obtain the feature importance of a single sample under the determination result. Without modifying or intruding on the classification model algorithm, the present invention increases the reference value of the model's classification results and helps business personnel make better business judgments and carry out business on the basis of those results.

Further, propose that the present invention is based on the characteristic analysis method second embodiments of machine learning model.

It is that the present invention is based on the signals of the process of the characteristic analysis method second embodiment of machine learning model referring to Fig. 3, Fig. 3 Figure, based on the above-mentioned characteristic analysis method first embodiment based on machine learning model, in the present embodiment, step S100 is based on The target sample and the first training sample set got, the step of determining the second training sample set include:

Step S101: acquire a target sample, a first training sample set and multiple initial training samples, where the target sample has a preset sample class determined by the classification model, and the classification model is trained on the first training sample set.

In this embodiment, multiple different initial training samples are generated randomly such that, after standardization, they have a mean of 0 and a standard deviation of 1.

Step S102: multiply the initial training sample by the standard deviation of the first training sample set, add the target sample to the product, and take the sum as a second training sample;

Each initial training sample is multiplied by the standard deviation of the first training sample set, the product is added to the target sample, and the sum is taken as a second training sample. It should be understood that multiplying the initial training samples by the standard deviation of the first training sample set and then adding the target sample produces a second training sample set distributed around the target sample. Generated this way, the second training sample set better matches the actual circumstances of the target sample, which improves the accuracy of the subsequent feature analysis.

Step S103: determine the second training sample set based on the multiple second training samples obtained.

Multiple different second training samples are generated around the target sample, and together they form the second training sample set.
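Steps S101 to S103 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation. The function name `make_second_training_set`, the `seed` parameter, and the choice of a per-feature standard deviation are assumptions; the text does not say whether the standard deviation is taken per feature or as a single scalar.

```python
import random
import statistics


def make_second_training_set(target, first_training_set, n_samples, seed=0):
    """Generate second training samples around `target` (steps S101-S103).

    Each sample is target + z * sigma, where z is drawn from a standard
    normal distribution (mean 0, standard deviation 1) and sigma is the
    standard deviation of the first training sample set.
    """
    rng = random.Random(seed)
    n_features = len(target)
    # ASSUMPTION: sigma is computed per feature over the first training set.
    sigma = [
        statistics.pstdev([row[j] for row in first_training_set])
        for j in range(n_features)
    ]
    samples = []
    for _ in range(n_samples):
        # Initial training sample: standard normal noise (S101).
        z = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        # Scale by sigma and shift to the target sample (S102).
        samples.append([target[j] + z[j] * sigma[j] for j in range(n_features)])
    return samples  # the second training sample set (S103)
```

A feature that is constant in the first training sample set has a standard deviation of 0, so under this sketch every generated sample keeps the target's value for that feature.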

As one implementation, the target sample takes the customer as its dimension, and the sample includes multiple features describing customer behavior, such as the amount transferred in on the day, the amount transferred out, and the number of transactions occurring in high-risk areas. After the target sample is classified by the classification model and the customer is judged to be suspected of money laundering, the multiple second training samples in the second training sample set are customer samples that are distributed around the target sample and are all suspected of money laundering, and they contain customer feature information such as the behavioral features of the corresponding customers.

Further, step S200 of training the feature analysis model according to the preset training rules and the second training sample set includes:

Step S201: calculate the Euclidean distance between the target sample and each second training sample;

Using the Euclidean distance formula, the distance between each second training sample in the second training sample set and the target sample is calculated, and the Euclidean distance between the i-th second training sample and the target sample is denoted D_i.

Step S202: calculate the input coefficient of the second training sample according to a preset calculation formula and the Euclidean distance corresponding to the second training sample;

In this embodiment, the input coefficient of each second training sample is calculated according to the preset calculation formula and the Euclidean distance corresponding to that sample; that is, it is computed from the standard deviation in the preset formula and the sample's Euclidean distance. The preset formula yields the input coefficient (i.e., weight) W_i of the i-th second training sample, where D_i is the Euclidean distance between the i-th second training sample and the target sample, σ is the standard deviation of the first training sample set, and e is the base of the natural logarithm, an irrational number approximately equal to 2.718. Here i is a positive integer ranging from 1 to the number of second training samples in the second training sample set.
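Steps S201 and S202 can be illustrated with a short sketch. The preset weight formula itself is not reproduced in this text, so the kernel below, W_i = e^(-D_i^2 / σ^2), is an assumption chosen only because it uses the same three quantities the text names (D_i, σ and e); the actual formula may differ. The function names and the use of a single scalar `sigma` are likewise hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance D between two samples of equal length (S201)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def input_coefficients(target, second_training_set, sigma):
    """Return the input coefficient (weight) W_i of each second training sample.

    ASSUMPTION: W_i = exp(-D_i**2 / sigma**2), a Gaussian-style kernel.
    The patent's preset formula is not available in this text.
    """
    weights = []
    for sample in second_training_set:
        d = euclidean_distance(target, sample)
        weights.append(math.exp(-(d ** 2) / (sigma ** 2)))
    return weights
```

Under this assumed kernel, a sample identical to the target receives weight 1, and the weight decays toward 0 as the sample moves away from the target, so samples near the target dominate the later regression.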

Step S203: obtain the predicted value of the classification model for each second training sample;

The classification model trained on the first training sample set is used to predict each second training sample in the second training sample set, giving a predicted value for each second training sample. The classification model includes, but is not limited to, a decision tree, logistic regression, an SVM, or a neural network.

Step S204: use the multiple second training samples, their corresponding input coefficients and their predicted values as input parameters to train a ridge regression model, and take the training result as the feature analysis model.

The multiple second training samples, their corresponding input coefficients and their predicted values are substituted as variables into the calculation formula to train the ridge regression model, where n is the number of second training samples in the second training sample set, i ranges from 1 to n, α is an arbitrary constant whose value is set according to the actual situation, β_i is the input coefficient W_i of the i-th second training sample calculated in the step above, y is the predicted value obtained from the classification model for the second training sample, and x_i is the i-th second training sample.

The second training sample set contains multiple features of multiple customers suspected of money laundering, denoted feature 1, feature 2, feature 3, ..., feature 10, feature 11, and so on. The output of the ridge regression model is coef1 * feature 1 + coef2 * feature 2 + coef3 * feature 3 + ... + coef10 * feature 10 + ... + b, where each coef value is the coefficient of the corresponding feature and b is the bias. Sorting feature 1, feature 2, feature 3, ..., feature 10, feature 11 and so on by coef value gives the importance ranking of the model's features, that is, the relative importance of each of these features.
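The weighted ridge regression of step S204 and the coefficient-based ranking can be sketched in plain Python. This is an illustration under stated assumptions, not the patent's formula: it assumes the objective is the sum over i of W_i * (y_i - (coef · x_i + b))^2 plus α * ||coef||^2, leaves the intercept b unpenalized, and ranks features by the absolute value of their coefficient (the text says only that features are sorted by coef value). All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_weighted_ridge(samples, predictions, weights, alpha=1.0):
    """Fit coef and b from the weighted normal equations (assumed objective)."""
    n_features = len(samples[0])
    rows = [list(s) + [1.0] for s in samples]  # trailing 1.0 models the bias b
    dim = n_features + 1
    a = [[0.0] * dim for _ in range(dim)]
    rhs = [0.0] * dim
    for row, y, w in zip(rows, predictions, weights):
        for p in range(dim):
            rhs[p] += w * row[p] * y
            for q in range(dim):
                a[p][q] += w * row[p] * row[q]
    for p in range(n_features):
        a[p][p] += alpha  # L2 penalty on the coefficients only, not on b
    solution = solve_linear_system(a, rhs)
    return solution[:n_features], solution[n_features]


def rank_features(coefs):
    """Feature indices ordered by |coef|, largest first (ASSUMPTION)."""
    return sorted(range(len(coefs)), key=lambda j: abs(coefs[j]), reverse=True)
```

In practice a library implementation of ridge regression that accepts per-sample weights would normally replace the hand-written solver; the plain-Python version is used here only to keep the sketch self-contained.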

The prediction sample is input to the classification model to obtain its sample class. When the sample class of the prediction sample is detected to be the same as the preset sample class, that is, when both are suspected of money laundering, the prediction sample is input to the feature analysis model to obtain its feature analysis result, namely the importance of the multiple features of the prediction sample.
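The gating logic described above (classify first, analyze only when the class matches the preset class) can be sketched as follows. The `classify` callable, the class labels, and the representation of the feature analysis model as a list of coefficients are all hypothetical; the classification model is treated as a black box, in keeping with the goal of not modifying it.

```python
def analyze_if_preset_class(sample, classify, coefs, preset_class):
    """Return (sample_class, ranking); ranking is None if the class differs.

    `classify` is any black-box function mapping a sample to a class label.
    `coefs` are the coefficients of an already trained feature analysis model.
    """
    sample_class = classify(sample)
    if sample_class != preset_class:
        # The feature analysis model was built for the preset class only.
        return sample_class, None
    # ASSUMPTION: importance is ranked by absolute coefficient value.
    ranking = sorted(range(len(coefs)), key=lambda j: abs(coefs[j]), reverse=True)
    return sample_class, ranking


# Usage with a toy stand-in for the classification model.
def toy_classifier(sample):
    return "suspicious" if sample[0] > 10 else "normal"


result = analyze_if_preset_class([20, 1, 1], toy_classifier, [0.2, -1.5, 0.7], "suspicious")
```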

It can be understood that, in the anti-money-laundering field, the features of the classification model's training samples and prediction samples are all extracted according to rules, for example: the customer's transaction amount reaches the large-amount threshold, many transactions take place in drug-related areas, or the customer's occupation is unemployed. With the method of this embodiment, the business side is told not only that a single prediction sample is suspected of money laundering but also the feature importance of that sample. The feature analysis method based on a machine learning model of the present invention is applicable to many different prediction models. Business staff judge whether a customer is genuinely suspicious from the model's prediction and the feature importance of the single sample, which reduces their workload and makes business judgments more reliable. When a financial system uses a machine learning model for business classification, model interpretability is highly valued. In the anti-money-laundering field, for example, regulation imposes strict requirements on every money laundering case: why a customer is suspected of money laundering must be explained at the customer, account and transaction levels. The method of the present invention provides feature importance analysis for a single sample and thereby meets regulatory requirements.

Further, a third embodiment of the feature analysis method based on a machine learning model of the present invention is proposed.

Referring to Fig. 4, Fig. 4 is a schematic flowchart of the third embodiment of the feature analysis method based on a machine learning model of the present invention. Based on the second embodiment described above, in this embodiment, before step S400 of inputting the prediction sample to the feature analysis model to obtain the feature analysis result of the prediction sample when the sample class of the prediction sample is detected to be the same as the preset sample class, the method further includes:

Step S310: verify the accuracy of the feature analysis model based on the second training sample set;

Step S320: judge whether the feature analysis model passes accuracy verification and, if so, proceed to step S400: when the sample class of the prediction sample is detected to be the same as the preset sample class, input the prediction sample to the feature analysis model to obtain the feature analysis result of the prediction sample.

As one implementation, referring to Fig. 5, Fig. 5 is a schematic diagram of the refined steps of step S310 in this embodiment. Specifically, step S310 of verifying the accuracy of the feature analysis model based on the second training sample set may include the following refined steps:

Step S311: obtain, from the feature analysis model, several model features that meet a preset condition;

Step S312: train the feature analysis model on the several model features to obtain multiple first predicted values;

Step S313: input the multiple second training samples in the second training sample set to the classification model one by one to obtain multiple second predicted values;

Step S314: verify the accuracy of the feature analysis model according to the multiple first predicted values and the multiple second predicted values.

After the features of the feature analysis model are obtained from the second training sample set, the several highest-priority features are extracted in order of feature importance, and the feature analysis model is trained on these features to obtain multiple first predicted values ypred. The multiple second training samples in the second training sample set are each input to the classification model trained on the first training sample set, giving multiple second predicted values ytrue corresponding to the second training samples, and the mean ytrue.mean() of these second predicted values is calculated. The fit value R is then computed from ypred, ytrue and ytrue.mean() as R = 1 - u/v, where u = Σ_{i=1}^{K} (ytrue_i - ypred_i)^2, v = Σ_{i=1}^{K} (ytrue_i - ytrue.mean())^2, and K is the number of second training samples in the second training sample set. When R is higher than the set threshold, the feature analysis model is judged to have passed accuracy verification. It can be understood that the threshold can be set by the user, and the higher the threshold, the more accurate the results of the feature analysis model. After the feature analysis model passes accuracy verification, sample feature analysis can be performed on a prediction sample having the preset sample class to obtain its feature analysis result, which improves the accuracy of single-sample feature analysis. Once the important features are obtained, the user analyzes and judges whether the customer is suspicious on the basis of those features, which reduces the manual customer analysis workload in the anti-money-laundering field and improves the accuracy of the analysis.
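The accuracy check of steps S312 to S314 can be sketched as below. The definitions of u and v are assumed to be those of the standard coefficient of determination (residual sum of squares and total sum of squares), which is what the names ytrue, ypred and ytrue.mean() in the text suggest. The function names and the default threshold of 0.8 are hypothetical; the text leaves the threshold to the user.

```python
def fit_value(y_true, y_pred):
    """Fit value R = 1 - u/v.

    ASSUMPTION: u is the residual sum of squares and v is the total sum of
    squares about the mean of y_true (coefficient of determination).
    """
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - u / v


def passes_accuracy_verification(y_true, y_pred, threshold=0.8):
    """True if R is higher than the user-set threshold (S314)."""
    return fit_value(y_true, y_pred) > threshold
```

Here `y_true` holds the second predicted values from the classification model and `y_pred` holds the first predicted values from the feature analysis model. Note that v is zero when the classification model returns the same value for every second training sample, in which case this sketch raises a division error rather than returning a fit value.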

In addition, an embodiment of the present invention further provides a feature analysis device based on a machine learning model, the feature analysis device based on a machine learning model including:

Preferably, the second training sample set includes multiple second training samples, and the extraction module includes:

Preferably, the training module includes:

Preferably, the device further includes:

Preferably, the verification module includes:

When run, the modules of the feature analysis device based on a machine learning model proposed in this embodiment implement the steps of the feature analysis method based on a machine learning model described above, which are not repeated here.

In addition, an embodiment of the present invention further provides a medium applied to a computer, that is, a computer-readable storage medium. A feature analysis program based on a machine learning model is stored on the medium, and when the program is executed by a processor it implements the steps of the feature analysis method based on a machine learning model described above.

For the method implemented when the feature analysis program based on a machine learning model is executed on the processor, refer to the embodiments of the feature analysis method based on a machine learning model of the present invention, which are not repeated here.

It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

The serial number of the above embodiments of the invention is only for description, does not represent the advantages or disadvantages of the embodiments.

From the description of the embodiments above, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims

1. A feature analysis method based on a machine learning model, characterized in that the feature analysis method based on a machine learning model includes the following steps:

determining a second training sample set based on an acquired target sample and a first training sample set, where the target sample has a preset sample class determined by a classification model, and the classification model is trained on the first training sample set;

when the sample class of the prediction sample is detected to be the same as the preset sample class, inputting the prediction sample to the feature analysis model to obtain the feature analysis result of the prediction sample.

2. The feature analysis method based on a machine learning model according to claim 1, characterized in that the second training sample set includes multiple second training samples, and the step of determining the second training sample set based on the acquired target sample and the first training sample set includes:

multiplying the initial training sample by the standard deviation of the first training sample set, adding the target sample to the product, and taking the sum as a second training sample;

3. The feature analysis method based on a machine learning model according to claim 2, characterized in that the step of training the feature analysis model according to the preset training rules and the second training sample set includes:

calculating the input coefficient of the second training sample according to a preset calculation formula and the Euclidean distance corresponding to the second training sample;

using the multiple second training samples, their corresponding input coefficients and their predicted values as input parameters to train a ridge regression model, and taking the training result as the feature analysis model.

4. The feature analysis method based on a machine learning model according to any one of claims 1-3, characterized in that, before the step of inputting the prediction sample to the feature analysis model to obtain the feature analysis result of the prediction sample when the sample class of the prediction sample is detected to be the same as the preset sample class, the method further includes:

judging whether the feature analysis model passes accuracy verification and, if so, proceeding to the step of: when the sample class of the prediction sample is detected to be the same as the preset sample class, inputting the prediction sample to the feature analysis model to obtain the feature analysis result of the prediction sample.

5. The feature analysis method based on a machine learning model according to claim 4, characterized in that the step of verifying the accuracy of the feature analysis model based on the second training sample set includes:

inputting the multiple second training samples in the second training sample set to the classification model one by one to obtain multiple second predicted values;

verifying the accuracy of the feature analysis model according to the multiple first predicted values and the multiple second predicted values.

6. A feature analysis device based on a machine learning model, characterized in that the feature analysis device based on a machine learning model includes:

an extraction module, configured to determine a second training sample set based on an acquired target sample and a first training sample set, where the target sample has a preset sample class determined by a classification model, and the classification model is trained on the first training sample set;

a training module, configured to train a feature analysis model according to preset training rules and the second training sample set;

an analysis module, configured to, when the sample class of the prediction sample is detected to be the same as the preset sample class, input the prediction sample to the feature analysis model to obtain the feature analysis result of the prediction sample.

7. The feature analysis device based on a machine learning model according to claim 6, characterized in that the second training sample set includes multiple second training samples, and the extraction module includes:

a processing unit, configured to multiply the initial training sample by the standard deviation of the first training sample set, add the target sample to the product, and take the sum as a second training sample;

8. The feature analysis device based on a machine learning model according to claim 7, characterized in that the training module includes:

a second computing unit, configured to calculate the input coefficient of the second training sample according to a preset calculation formula and the Euclidean distance corresponding to the second training sample;

a training unit, configured to use the multiple second training samples, their corresponding input coefficients and their predicted values as input parameters to train a ridge regression model, and to take the training result as the feature analysis model.

9. The feature analysis device based on a machine learning model according to any one of claims 6-8, characterized in that the device further includes:

a judgment module, configured to judge whether the feature analysis model passes accuracy verification and, after judging that the feature analysis model has passed accuracy verification, to send the judgment result "passed" to the analysis module;

the analysis module being further configured to, after receiving the judgment result "passed" sent by the judgment module, and when the sample class of the prediction sample is detected to be the same as the preset sample class, input the prediction sample to the feature analysis model to obtain the feature analysis result of the prediction sample.

10. The feature analysis device based on a machine learning model according to claim 9, characterized in that the verification module includes:

a third computing unit, configured to train the feature analysis model on the several model features to obtain multiple first predicted values;

a fourth computing unit, configured to input the multiple second training samples in the second training sample set to the classification model one by one to obtain multiple second predicted values;

a verification unit, configured to verify the accuracy of the feature analysis model according to the multiple first predicted values and the multiple second predicted values.

11. A feature analysis apparatus based on a machine learning model, characterized in that the apparatus includes a memory, a processor, and a feature analysis program based on a machine learning model that is stored on the memory and executable on the processor, where the feature analysis program based on a machine learning model, when executed by the processor, implements the steps of the feature analysis method based on a machine learning model according to any one of claims 1 to 5.

12. A medium, characterized in that it is applied to a computer, a feature analysis program based on a machine learning model is stored on the medium, and the feature analysis program based on a machine learning model, when executed by a processor, implements the steps of the feature analysis method based on a machine learning model according to any one of claims 1 to 5.