CN114861800A - Model training method, probability determination method, device, equipment, medium and product - Google Patents


Publication number
CN114861800A
Authority
CN
China
Prior art keywords
sample
data
column
matrix
data samples
Prior art date
Legal status
Granted
Application number
CN202210518775.2A
Other languages
Chinese (zh)
Other versions
CN114861800B (en)
Inventor
刘钱
张建
Current Assignee
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN202210518775.2A
Publication of CN114861800A
Application granted
Publication of CN114861800B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The embodiment of the invention relates to the technical field of intelligent finance, and in particular to a model training method, a probability determination method, a device, equipment, a medium and a product. The method comprises the following steps: obtaining a target sample set, wherein the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index; inputting a first data sample in the target sample set into a pre-established random forest model to obtain an enterprise abnormal prediction probability; and training the parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample. According to the technical scheme, historical source data with a low contribution rate can be filtered based on the sample evaluation index, so that overfitting is effectively avoided, the retained source data contribute high value, and the training efficiency and accuracy of the model are improved.

Description

Model training method, probability determination method, device, equipment, medium and product
Technical Field
The embodiment of the invention relates to the technical field of intelligent finance, and in particular to a model training method, a probability determination method, a device, equipment, a medium and a product.
Background
Actual business scenarios include many enterprise anomaly scenarios. For these scenarios, the anomaly probability of an enterprise is mostly determined through a neural network model in the prior art. However, because abnormal enterprises account for only a very small proportion, the samples are imbalanced, and training a model with imbalanced samples reduces the accuracy of the model. Moreover, because the source data involved in some services keeps growing, overfitting easily occurs when the source data are processed, which reduces the efficiency of model training.
Disclosure of Invention
Embodiments of the present invention provide a model training method, a probability determination method, an apparatus, a device, a medium, and a product, so as to filter historical source data with a low contribution rate based on a sample evaluation index, thereby effectively avoiding overfitting, enabling the retained source data to contribute high value, and improving the training efficiency and accuracy of the model.
According to an aspect of the present invention, there is provided a model training method, including:
obtaining a target sample set, wherein the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and training parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
Further, obtaining a target sample set includes:
acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
Further, obtaining a correlation value of each column of the first data samples in the target sample matrix includes:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
Further, obtaining an analysis of variance value of each column of the first data sample in the target sample matrix includes:
acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
Further, screening the target sample matrix according to the sample evaluation index value corresponding to the first data sample of each column to obtain a target sample set, including:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
Further, deleting at least one column of first data samples with the sample evaluation index value smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set, including:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
Further, performing PCA dimension reduction on the first sample matrix based on singular value decomposition to obtain a target sample set, including:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Further, the step of performing de-centering on the first sample matrix to obtain a second sample matrix includes:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
Further, performing PCA dimension reduction on the second sample matrix based on singular value decomposition to obtain a target sample set, including:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
Further, obtaining a fourth sample matrix includes:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
According to another aspect of the present invention, there is provided an enterprise anomaly probability determining method, including:
acquiring source data corresponding to an enterprise to be identified;
inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; wherein the target model is trained according to the model training method of any one of claims 1-10.
According to another aspect of the present invention, there is provided a model training apparatus including: the system comprises a sample set acquisition module, a sample set selection module and a sample set selection module, wherein the sample set acquisition module is used for acquiring a target sample set, and the target sample set is obtained by screening historical source data of a sample enterprise based on sample evaluation indexes;
the enterprise abnormal prediction probability determining module is used for inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and the training module is used for training the parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
Further, the sample set obtaining module is specifically configured to:
acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
Further, the sample set obtaining module is specifically configured to:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
Further, the sample set obtaining module is specifically configured to:
acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
Further, the sample set obtaining module is specifically configured to:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
Further, the sample set obtaining module is specifically configured to:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
Further, the sample set obtaining module is specifically configured to:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Further, the sample set obtaining module is specifically configured to:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
Further, the sample set obtaining module is specifically configured to:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
Further, the sample set obtaining module is specifically configured to:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
According to another aspect of the present invention, there is provided an enterprise anomaly probability determining apparatus, including:
the source data acquisition module is used for acquiring source data corresponding to the enterprise to be identified;
the enterprise abnormal probability determining module is used for inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; the target model is obtained by training according to the model training method of any embodiment of the invention.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the model training method of any one of the embodiments of the present invention or to enable the at least one processor to perform the enterprise anomaly probability determination method of any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the model training method according to any one of the embodiments of the present invention, or the enterprise anomaly probability determination method according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer program product, which when executed by a processor implements the model training method according to any one of the embodiments of the present invention, or which when executed by a processor implements the enterprise anomaly probability determination method according to any one of the embodiments of the present invention.
According to the method and the device, the historical source data of the sample enterprise are screened based on the sample evaluation index to obtain the target sample set, a first data sample in the target sample set is input into a pre-established random forest model to obtain the enterprise abnormal prediction probability, and the parameters of the random forest model are then trained according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample. In this way, historical source data with a low contribution rate can be filtered based on the sample evaluation index, so that overfitting is effectively avoided, the retained data contribute high value, and the training efficiency and accuracy of the model are improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a model training method in an embodiment of the invention;
FIG. 2 is a graph of cumulative variance contribution ratio for each column of the first data sample in a fifth sample matrix in an embodiment of the present invention;
FIG. 3 is a graph of AUC iterations during model training in an embodiment of the present invention;
FIG. 4 is a ROC graph in an embodiment of the present invention;
FIG. 5 is a flowchart of a method for determining enterprise anomaly probability in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an enterprise anomaly probability determining apparatus in an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The acquisition, storage and/or processing of the data involved in the technical solution of the present application comply with the relevant provisions of national laws and regulations.
Example one
Fig. 1 is a flowchart of a model training method provided in an embodiment of the present invention, where the present embodiment is applicable to a case of model training, and the method may be executed by a model training apparatus in an embodiment of the present invention, and the model training apparatus may be implemented in a software and/or hardware manner, as shown in fig. 1, the method specifically includes the following steps:
s110, a target sample set is obtained, wherein the target sample set is obtained after the historical source data of the sample enterprise are screened based on the sample evaluation indexes.
The historical source data is enterprise data authorized to be disclosed by the sample enterprise, and may include, for example: basic information of a sample enterprise.
The target sample set may be obtained in the following manner: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample (it should be noted that the data volume of the historical source data involved in the actual service is very large, and therefore, in general, the historical source data includes a plurality of first data samples); randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
The target sample set may be obtained by: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample; randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data; determining a relevance value of each column of the first data samples according to the rank of each column of the first data samples, the rank of the label information and the total number of the first data samples in the target sample matrix; acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples; determining an analysis of variance value of each column of first data samples according to a ratio of the inter-group mean square of each column of first data samples to the intra-group mean square of each column of first data samples; acquiring a feature importance value of each column of first data samples in the target sample matrix; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
The target sample set may be obtained by: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample; randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
The target sample set may be obtained by: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample; randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix; and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
The target sample set may be obtained by: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample; randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix; obtaining the tolerance of each column of first data samples in the second sample matrix; deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix; obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix; determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix; obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix; reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix; and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
And S120, inputting the first data sample in the target sample set into a pre-established random forest model to obtain the enterprise abnormal prediction probability.
Wherein the first data sample is any data sample in the target sample set.
Wherein the random forest comprises at least one decision tree. A decision tree is a tree-like structure in which each internal node represents a test on an attribute, each branch represents a test output, and each leaf node represents a category.
Specifically, the first data sample in the target sample set is input into a pre-established random forest model to obtain the enterprise abnormal prediction probability, for example, the first data sample of the enterprise R is input into the pre-established random forest model to obtain the enterprise abnormal prediction probability P corresponding to the enterprise R.
S130, training parameters of the random forest model according to an objective function formed based on the enterprise abnormal prediction probability and label information corresponding to the first data sample.
Wherein the parameters of the random forest model may include: the number of decision trees, the number of features randomly sampled at each node, the minimum number of samples allowed at a leaf node, the maximum number of leaf nodes allowed, and the like, which are not limited in the embodiments of the present invention.
The label information may indicate that the enterprise corresponding to the first data sample is an abnormal enterprise, or it may indicate that the enterprise corresponding to the first data sample is a normal enterprise.
Specifically, after the parameters of the random forest model are trained according to the objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample, the above steps are executed in a loop to iteratively train the random forest model and obtain the target model.
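By way of illustration only, the following sketch shows how a random forest with parameters of the kinds listed above could be configured, fitted on a target sample set, and used to obtain an abnormal prediction probability. The library choice (scikit-learn) and all hyperparameter values are assumptions for the sketch, not values prescribed by this embodiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))      # stand-in for the target sample set
y_train = rng.integers(0, 2, size=200)    # stand-in label information (0: normal, 1: abnormal)

model = RandomForestClassifier(
    n_estimators=100,       # number of decision trees (assumed value)
    max_features="sqrt",    # features randomly sampled at each node split (assumed)
    min_samples_leaf=5,     # minimum number of samples allowed at a leaf node (assumed)
    max_leaf_nodes=64,      # maximum number of leaf nodes allowed (assumed)
    random_state=0,
)
model.fit(X_train, y_train)
# abnormal prediction probability for one first data sample
p_abnormal = model.predict_proba(X_train[:1])[0, 1]
```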
Optionally, obtaining a target sample set includes:
acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
Specifically, in order to solve the problem of sample imbalance, the embodiment of the present invention reconstructs the historical source data by using random undersampling, a resampling method, that is, a certain proportion of the majority-class samples are removed.
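A minimal sketch of random undersampling of the majority class is given below, assuming binary label information (1 for abnormal, 0 for normal); the 1:1 sampling ratio is an illustrative assumption, not a ratio fixed by this embodiment.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Remove a proportion of majority-class samples so that
    len(kept majority) == ratio * len(minority)."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # abnormal enterprises (few)
    majority = np.flatnonzero(y == 0)   # normal enterprises (many)
    keep = rng.choice(majority, size=int(ratio * len(minority)), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]
```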
Wherein the correlation value may be the Spearman correlation coefficient, also commonly referred to as the Spearman rank correlation coefficient. The "rank" is understood as the order obtained from the sorted position of the first data samples in each column of the target sample matrix. The relevance value is calculated based on the following formula:

$\rho_i = 1 - \dfrac{6\sum_{j=1}^{n} d_j^2}{n(n^2 - 1)}$, where $d_j = rg(x_{ij}) - rg(Y_j)$.

First, the $i$-th column of first data samples $x_i$ and the label information $Y$ corresponding to the first data samples are sorted, the sorted positions are recorded, and $rg(x_{ij})$ and $rg(Y_j)$, called ranks, are obtained from the sorted positions; the rank difference is the $d_j$ in the above formula. $n$ is the number of first data samples in the target sample matrix and $m$ is the number of columns of the target sample matrix. The target sample matrix may be written as $D = \{x_1, x_2, \ldots, x_m\}$ with $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, $j = 1, 2, \ldots, n$, where $D$ is the target sample matrix and $i = 1, 2, \ldots, m$.
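The per-column computation can be sketched as follows, using the rank-difference formula above; the simple ranking used here ignores tie correction, and the variable names are illustrative.

```python
import numpy as np

def spearman_column(x_col, y):
    """Spearman rank correlation between one feature column and the labels:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d = rank(x) - rank(y)."""
    n = len(x_col)
    rank_x = np.argsort(np.argsort(x_col)) + 1   # simple ranking (ties not corrected in this sketch)
    rank_y = np.argsort(np.argsort(y)) + 1
    d = rank_x - rank_y
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```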
The feature importance value of each column of first data samples may be obtained as follows: a tree model is constructed in advance according to each column of first data samples in the target sample matrix and the label information corresponding to each column of first data samples, and the feature importance value of each column of first data samples is measured by the average value of the importance of that column of first data samples over the individual trees.
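As one possible reading of the above, the sketch below builds a forest of trees and averages each column's importance over the individual trees; using scikit-learn's impurity-based importances is an assumption about the tree model, not a requirement of this embodiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def feature_importance_values(X, y, seed=0):
    """Importance of each column, averaged over the single trees of a forest."""
    forest = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X, y)
    per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
    return per_tree.mean(axis=0)   # mean importance of each column across the trees
```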
The analysis of variance value of each column of the first data samples may be obtained by: acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples; and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
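A short sketch of the analysis of variance value as the ratio of the inter-group mean square to the intra-group mean square, with the groups defined by the label values; the grouping and names are illustrative.

```python
import numpy as np

def anova_value(x_col, y):
    """F value: inter-group mean square / intra-group mean square,
    where the groups are defined by the label (e.g. normal vs. abnormal)."""
    groups = [x_col[y == g] for g in np.unique(y)]
    grand_mean = x_col.mean()
    k, n = len(groups), len(x_col)
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return ms_between / ms_within
```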
Specifically, the sample evaluation index value corresponding to each column of first data samples may be determined from the correlation value, the feature importance value and the analysis of variance value of each column of first data samples as follows: a correlation value vector is determined according to the correlation value of each column of first data samples, a feature importance vector is determined according to the feature importance value of each column of first data samples, and an analysis of variance value vector is determined according to the analysis of variance value of each column of first data samples. Relativity conversion is performed on the correlation value vector to obtain a converted correlation value vector, on the feature importance vector to obtain a converted feature importance vector, and on the analysis of variance value vector to obtain a converted analysis of variance value vector. A weight is set for the feature importance value, for the correlation value and for the analysis of variance value; the sample evaluation index value corresponding to each column of first data samples is determined according to these weights and the converted correlation value vector, the converted feature importance vector and the converted analysis of variance value vector, and a sample evaluation index vector is thereby obtained.
In a specific example, a correlation value vector $R = (\rho_1, \rho_2, \ldots, \rho_m)$ is determined based on the correlation values of the first data samples of each column in the target sample matrix, wherein $\rho_1$ is the correlation value of the 1st column of first data samples $x_1$, $\rho_2$ is the correlation value of the 2nd column of first data samples $x_2$, and $\rho_m$ is the correlation value of the $m$-th column of first data samples $x_m$. A feature importance vector $V = (v_1, v_2, \ldots, v_m)$ is determined according to the feature importance value of the first data samples of each column in the target sample matrix, wherein $v_1$ is the feature importance value of the 1st column of first data samples $x_1$, $v_2$ is the feature importance value of the 2nd column of first data samples $x_2$, and $v_m$ is the feature importance value of the $m$-th column of first data samples $x_m$. An analysis of variance value vector $F = (f_1, f_2, \ldots, f_m)$ is determined based on the analysis of variance values of the first data samples of each column in the target sample matrix, wherein $f_1$ is the analysis of variance value of the 1st column of first data samples $x_1$, $f_2$ is the analysis of variance value of the 2nd column of first data samples $x_2$, and $f_m$ is the analysis of variance value of the $m$-th column of first data samples $x_m$. Relativity conversion is performed on the correlation value vector $R$ to obtain the vector $R' = R / \max(R)$, where $\max(R)$ is the maximum correlation value in the correlation value vector. Relativity conversion is performed on the feature importance vector $V$ to obtain the vector $V' = V / \max(V)$, where $\max(V)$ is the maximum feature importance value in the feature importance vector. Relativity conversion is performed on the analysis of variance value vector $F$ to obtain the vector $F' = F / \max(F)$, where $\max(F)$ is the largest analysis of variance value in the analysis of variance value vector. The sample evaluation index value corresponding to the $i$-th column of first data samples $x_i$ is defined as $s_i = a\,\rho_i' + b\,v_i' + c\,f_i'$, wherein $\rho_i'$ is the converted correlation value of the $i$-th column of first data samples $x_i$, $v_i'$ is the converted feature importance value of the $i$-th column of first data samples $x_i$, $f_i'$ is the converted analysis of variance value of the $i$-th column of first data samples $x_i$, $a$ is the weight corresponding to the correlation value, $b$ is the weight corresponding to the feature importance value, and $c$ is the weight corresponding to the analysis of variance value. The sample evaluation index vector $S = (s_1, s_2, \ldots, s_m)$ is thereby obtained.
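A compact sketch of the combination step in the example above, dividing each vector by its maximum value and weighting with a, b and c; the particular weight values shown are assumptions for illustration only.

```python
import numpy as np

def sample_evaluation_index(R, V, F, a=0.4, b=0.4, c=0.2):
    """s_i = a * R'_i + b * V'_i + c * F'_i with R' = R / max(R), etc.
    The weights a, b, c here are assumed values."""
    R_p, V_p, F_p = R / R.max(), V / V.max(), F / F.max()
    return a * R_p + b * V_p + c * F_p   # sample evaluation index vector
```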
Optionally, obtaining a correlation value of each column of the first data sample in the target sample matrix includes:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
Specifically, the manner of determining the correlation value of each column of the first data samples according to the rank of each column of the first data samples, the rank of the label information, and the total number of the first data samples in the target sample matrix may be: and obtaining a difference value between the rank of each column of the first data sample and the rank of the label information of each column of the first data sample, and determining a correlation value of each column of the first data sample according to the difference value between the rank of each column of the first data sample and the rank of the label information of each column of the first data sample and the total number of the first data samples in the target sample matrix.
In a specific example, the correlation value of the $i$-th column of first data samples $x_i$ is calculated based on the following formula:

$\rho_i = 1 - \dfrac{6\sum_{j=1}^{n} d_j^2}{n(n^2 - 1)}$, where $d_j = rg(x_{ij}) - rg(Y_j)$.

First, the $i$-th column of first data samples $x_i$ and the label information $Y$ corresponding to the first data samples are sorted, the sorted positions are recorded, and $rg(x_{ij})$ and $rg(Y_j)$, called ranks, are obtained from the sorted positions; the rank difference is the $d_j$ in the above formula. $n$ is the number of first data samples in the target sample matrix and $m$ is the number of columns of the target sample matrix. The target sample matrix may be written as $D = \{x_1, x_2, \ldots, x_m\}$ with $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, $j = 1, 2, \ldots, n$, where $D$ is the target sample matrix and $i = 1, 2, \ldots, m$.
Optionally, obtaining an analysis of variance value of each column of the first data sample in the target sample matrix includes:
acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
Specifically, the analysis of variance value of each column of first data samples may be determined as the ratio of the inter-group mean square of that column of first data samples to the intra-group mean square of that column of first data samples. For example, the analysis of variance value of the $i$-th column of first data samples $x_i$ may be determined based on the following formula:

$f_i = \dfrac{MS_{between,i}}{MS_{within,i}}$

wherein $MS_{between,i}$ is the inter-group mean square of the $i$-th column of first data samples $x_i$, and $MS_{within,i}$ is the intra-group mean square of the $i$-th column of first data samples $x_i$.
Optionally, the screening the target sample matrix according to the sample evaluation index value corresponding to the first data sample in each column to obtain a target sample set, including:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
The average sample evaluation index value is the ratio of the sum of the sample evaluation index values corresponding to each column of first data samples in the target sample matrix to the total number of columns of first data samples in the target sample matrix.
In one specific example, for the sample evaluation index vector $S = (s_1, s_2, \ldots, s_m)$, the average value $\bar{s}$ of $S$ is obtained; if $s_i$ is less than $\bar{s}$, the $i$-th column of first data samples $x_i$ is deleted from the target sample matrix, and a target sample set is generated according to the first data samples in the deleted target sample matrix and the label information corresponding to the first data samples.
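A sketch of this screening step, dropping the columns whose sample evaluation index value falls below the mean; variable names are illustrative.

```python
import numpy as np

def screen_columns(D, s):
    """Keep only the columns of D whose sample evaluation index s_i
    is not smaller than the average sample evaluation index value."""
    keep = s >= s.mean()
    return D[:, keep]
```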
Optionally, deleting at least one column of first data samples with sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set, where the method includes:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
Specifically, the method for performing PCA dimension reduction on the first sample matrix based on singular value decomposition to obtain the target sample set may be: performing decentralization on the first sample matrix to obtain a second sample matrix; and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Specifically, at least one column of first data samples with sample evaluation index values smaller than the average sample evaluation index value is deleted from the target sample matrix to obtain the first sample matrix. For example, the average value $\bar{s}$ of the sample evaluation index vector $S$ may be obtained; if $s_i$ is less than $\bar{s}$, the $i$-th column of first data samples $x_i$ is deleted from the target sample matrix, and the remaining columns form the first sample matrix $D = \{x_1, x_2, \ldots, x_w\}$, where $w \le m$.
Optionally, performing PCA dimension reduction on the first sample matrix based on singular value decomposition to obtain a target sample set, including:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Specifically, the first sample matrix may be de-centered to obtain the second sample matrix by replacing each first data sample in the first sample matrix with the ratio of that first data sample to the mean value of the column to which it belongs. For example, the mean value $\bar{x}_i$ of the $i$-th column of first data samples in the first sample matrix may be obtained, each element of $x_i$ is replaced by its ratio to $\bar{x}_i$, and the second sample matrix $D' = \{x_1', x_2', \ldots, x_w'\}$ is obtained. PCA (principal component analysis) dimensionality reduction is then performed on the second sample matrix $D'$ based on singular value decomposition to obtain the target sample set.
Optionally, the step of performing de-centering on the first sample matrix to obtain a second sample matrix includes:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
Specifically, replacing each first data sample in the first sample matrix with the ratio of that first data sample to the mean value of the column to which it belongs may be performed as follows: the mean value of each column of first data samples in the first sample matrix is obtained in advance, for example the mean value $\bar{x}_i$ of the $i$-th column; each element of $x_i$ in the first sample matrix is then replaced by its ratio to $\bar{x}_i$, and the second sample matrix $D' = \{x_1', x_2', \ldots, x_w'\}$ is obtained.
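A sketch of the de-centering described above, which replaces each value with its ratio to the mean of its column; note that this ratio form follows the text here and differs from the usual subtract-the-mean centering.

```python
import numpy as np

def decenter_by_ratio(D1):
    """Second sample matrix: each element divided by the mean of its column."""
    col_means = D1.mean(axis=0)
    return D1 / col_means
```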
Optionally, performing PCA dimension reduction on the second sample matrix based on singular value decomposition to obtain a target sample set, including:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
The tolerance is the reciprocal of the variance inflation factor (VIF), which is used to detect multicollinearity among the independent variables in a regression model.
For example, suppose there are 5 columns of first samples in total: $x_1, x_2, x_3, x_4, x_5$. Taking $x_1$ as the observed value and the remaining $x_2, x_3, x_4, x_5$ as independent variables, the linear regression $x_1 = c_{1,0} + c_{1,2}x_2 + c_{1,3}x_3 + c_{1,4}x_4 + c_{1,5}x_5 + e_1$ is fitted, and the goodness-of-fit statistic $S_1^2$ of the above regression is obtained. The variance inflation factor of $x_1$ is then $VIF_1 = \dfrac{1}{1 - S_1^2}$. Accordingly, the variance inflation factor of $x_i$ is $VIF_i = \dfrac{1}{1 - S_i^2}$, wherein $S_i^2$ is the goodness-of-fit of the regression that takes $x_i$ as the observed value and the remaining columns as independent variables. The closer $S_i^2$ is to 1, the better the fit. A $VIF_i$ equal to 1 indicates no collinearity; the closer $VIF_i$ is to 1, the smaller the multicollinearity of $x_i$; the larger $VIF_i$, the greater the multicollinearity between $x_i$ and the first samples of the other columns.
Specifically, the variance of the data represents the fluctuation information of the data. If the variance of the data is 0, the data is completely unchanged and has no research value. In general, PCA is used for dimensionality reduction, which causes information loss; the aim is to preserve as much information of the data as possible while reducing its dimensionality. The total variance of the data is equal to the sum of all eigenvalues of the covariance matrix, and the variance of the $i$-th principal component is equal to the $i$-th eigenvalue of the covariance matrix. The eigenvalues are sorted from large to small and then summed from front to back to obtain the cumulative variance. The magnitude of the cumulative variance contribution rate indicates the proportion of the information of the original data carried by all principal components currently selected.
The variance contribution rate corresponding to each column of first data samples in the fifth sample matrix may be obtained as follows: first obtain the variance corresponding to each column of first data samples in the fifth sample matrix, and then determine the variance contribution rate corresponding to each column according to those variances.
For example, as shown in fig. 2, fig. 2 shows the cumulative variance contribution rate corresponding to each column of first data samples in the fifth sample matrix; the abscissa of fig. 2 represents the number of columns and the ordinate represents the cumulative variance contribution rate.
Specifically, the manner of deleting at least one column of first data samples whose tolerance is smaller than the column tolerance mean value in the second sample matrix to obtain the third sample matrix may be: performing tolerance analysis on each column of first data samples in the second sample matrix to obtain the tolerance T_i of each column, where T_i has a value range of [0, 1]; the closer T_i is to 0, the higher the correlation between variable i and the other independent variables. The columns satisfying T_i ≥ T_mean are therefore retained and the third sample matrix is obtained, where T_mean is the mean value of the tolerances of all columns of first data samples in the second sample matrix.
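A minimal sketch of the retention rule T_i >= T_mean, assuming the tolerances have already been computed as above; the function and variable names are illustrative only.

```python
import numpy as np

def filter_by_tolerance(second_sample_matrix: np.ndarray, tolerances: np.ndarray) -> np.ndarray:
    """Retain the columns satisfying T_i >= T_mean; the remaining columns are deleted."""
    keep = tolerances >= tolerances.mean()
    return second_sample_matrix[:, keep]

# Example: columns with below-average tolerance (high collinearity) are dropped.
D_prime = np.arange(20, dtype=float).reshape(4, 5)
T = np.array([0.9, 0.2, 0.8, 0.1, 0.7])        # tolerances of the five columns
print(filter_by_tolerance(D_prime, T).shape)    # (4, 3): three columns are retained
```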
Specifically, the fifth sample matrix A is determined based on the following formula: A = (D″)^T · D‴, namely the product of the transpose of the third sample matrix D″ and the fourth sample matrix D‴.
Specifically, if the sum of the variance contribution rates of the first data samples accumulated over a preset number of columns in the fifth sample matrix is greater than the set threshold, the target sample set is generated according to those columns of first data samples. For example, if the sum of the variance contribution rates of the first data samples in the first column through the Lth column is greater than the set threshold, while the sum of the variance contribution rates of the first data samples in the first column through the (L-1)th column is smaller than the set threshold, the target sample set is generated according to the first data samples in the first column through the Lth column of the fifth sample matrix.
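The column selection based on the cumulative variance contribution rate could, under these assumptions, look roughly as follows; the NumPy-based sketch and the 0.95 threshold are illustrative choices, not values taken from the patent.

```python
import numpy as np

def select_by_cumulative_contribution(matrix: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Sort the columns by variance contribution rate and keep the first N columns whose
    cumulative contribution rate first reaches the threshold."""
    variances = matrix.var(axis=0)                        # variance of each column
    contribution = variances / variances.sum()            # variance contribution rate per column
    order = np.argsort(contribution)[::-1]                # descending contribution rate
    reordered = matrix[:, order]                          # reordered ("sixth") sample matrix
    cumulative = np.cumsum(contribution[order])           # cumulative variance contribution rate
    n = int(np.searchsorted(cumulative, threshold)) + 1   # first N columns reaching the threshold
    return reordered[:, :n]

# Example: keep just enough columns to carry 95% of the variance information.
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
print(select_by_cumulative_contribution(A, 0.95).shape)
```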
Optionally, obtaining a fourth sample matrix includes:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
Specifically, the manner of determining, as the fourth sample matrix, the matrix obtained by multiplying each first data sample in the third sample matrix by the eigenvalue of that first data sample in the covariance matrix of the third sample matrix may be: for the third sample matrix D″ = (x_1, x_2, …, x_p), p ≤ w ≤ m, with the eigenvalues of the covariance matrix of the third sample matrix K = (λ_1, λ_2, …, λ_p), the fourth sample matrix is D‴ = (x_1·λ_1, x_2·λ_2, …, x_p·λ_p).
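A rough Python sketch of forming the fourth and fifth sample matrices from the stated definitions; note that pairing the covariance eigenvalues with the columns in the order returned by the decomposition is an assumption of this sketch rather than a detail given in the original text.

```python
import numpy as np

def fourth_and_fifth_matrices(third_sample_matrix: np.ndarray):
    """Scale each column of the third sample matrix by an eigenvalue of its covariance
    matrix (obtained through singular value decomposition), then left-multiply by the
    transpose of the third sample matrix."""
    cov = np.cov(third_sample_matrix, rowvar=False)   # covariance matrix of the columns
    # For a symmetric positive semi-definite matrix the singular values equal the eigenvalues.
    _, eigenvalues, _ = np.linalg.svd(cov)
    fourth = third_sample_matrix * eigenvalues        # D''' = (x1*l1, x2*l2, ..., xp*lp)
    fifth = third_sample_matrix.T @ fourth            # A = (D'')^T * D'''
    return fourth, fifth

rng = np.random.default_rng(2)
D2 = rng.normal(size=(50, 4))                         # a stand-in third sample matrix
D3, A = fourth_and_fifth_matrices(D2)
print(D3.shape, A.shape)                              # (50, 4) (4, 4)
```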
In the embodiment of the invention, the reconstructed sample data set is used to train and test a random forest (RF) model. The AUC over the iterations of the model training process is shown in fig. 3, where the abscissa represents the iteration round and the ordinate represents the AUC.
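For readers who want to reproduce an AUC-versus-iteration curve of the kind shown in fig. 3, a stand-in experiment with scikit-learn might look as follows; the synthetic data, the tree counts and the library choice are assumptions, and this is not the training procedure of the embodiment itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the reconstructed sample data set and its labels.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Track AUC as the forest grows, analogous to an AUC-per-iteration curve.
for n_trees in (10, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    print(f"{n_trees} trees -> AUC {auc:.3f}")
```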
Given a threshold, the TPR (coverage rate) and FPR (disturbance rate) can be calculated from the confusion matrix: TPR = TP/(TP + FN) and FPR = FP/(FP + TN), where TP is the number of true positives, FN the number of false negatives, FP the number of false positives, and TN the number of true negatives. By setting different thresholds, a series of TPR and FPR values is obtained, from which the ROC curve shown in fig. 4 can be drawn (the solid line is the ROC curve and the dotted line is the reference curve); the abscissa in fig. 4 represents FPR and the ordinate represents TPR.
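The threshold sweep that produces the (FPR, TPR) points of the ROC curve can be sketched directly from the confusion-matrix definitions above; the labels and scores below are synthetic and purely illustrative.

```python
import numpy as np

def tpr_fpr(y_true: np.ndarray, scores: np.ndarray, threshold: float):
    """Coverage rate TPR = TP/(TP + FN) and disturbance rate FPR = FP/(FP + TN)
    for a single classification threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Sweeping the threshold yields the series of (FPR, TPR) points of the ROC curve.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=500)
scores = np.clip(0.4 * y + rng.normal(0.3, 0.2, size=500), 0.0, 1.0)
for t in (0.2, 0.4, 0.6, 0.8):
    tpr, fpr = tpr_fpr(y, scores, t)
    print(f"threshold {t:.1f}: TPR {tpr:.2f}, FPR {fpr:.2f}")
```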
The scoring index in the embodiment of the invention first calculates three coverage rates, TPR1, TPR2 and TPR3 (their defining formulas are given in an accompanying figure of the application), and the final error is tpe = 0.4 × TPR1 + 0.3 × TPR2 + 0.3 × TPR3.
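The weighted combination itself is a one-liner; the three coverage-rate values in the example below are made up solely to show the arithmetic.

```python
def tpe_score(tpr1: float, tpr2: float, tpr3: float) -> float:
    """Weighted combination of the three coverage rates used as the scoring index."""
    return 0.4 * tpr1 + 0.3 * tpr2 + 0.3 * tpr3

print(f"{tpe_score(0.70, 0.62, 0.61):.3f}")  # 0.649 with these illustrative inputs
```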
Model verification is then performed by prediction on the test set, and the classification metrics AUC (AUC = 0.81) and tpe (tpe = 0.65) are obtained.
According to the technical scheme of the embodiment of the invention, the historical source data of the sample enterprise is screened based on the sample evaluation index to obtain the target sample set; a first data sample in the target sample set is input into the pre-established random forest model to obtain the enterprise abnormal prediction probability; and the parameters of the random forest model are then trained according to the objective function formed based on the enterprise abnormal prediction probability and the label information corresponding to the first data sample. Historical source data with a low contribution rate can thus be filtered out based on the sample evaluation index, so that the retained data contributes high value while overfitting is effectively avoided, and the training efficiency and accuracy of the model are improved.
Example two
Fig. 5 is a flowchart of an enterprise abnormal probability determination method according to an embodiment of the present invention. This embodiment is applicable to determining the abnormal probability of an enterprise, and the method may be executed by an enterprise abnormal probability determination apparatus according to an embodiment of the present invention, which may be implemented in software and/or hardware. As shown in fig. 5, the method specifically includes the following steps:
and S210, acquiring source data corresponding to the enterprise to be identified.
The source data corresponding to the to-be-identified enterprise is enterprise data authorized to be disclosed by the to-be-identified enterprise, and may include, for example: basic information of the enterprise to be identified.
The source data corresponding to the enterprise to be identified may be acquired from a database; alternatively, if an application instruction sent by a target enterprise is detected, the target enterprise is determined as the enterprise to be identified, and the source data corresponding to the target enterprise is acquired.
The source data may be type data such as enterprise basic information, enterprise legal person information, enterprise credit investigation, enterprise legal person credit investigation, and the like, and the embodiment of the present invention does not limit specific contents of the source data.
The source data corresponding to the enterprise to be identified is the source data disclosed by the enterprise to be identified, or the source data disclosed by the enterprise to be identified is acquired after the authorization of an enterprise legal person is obtained.
And S220, inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified.
The training method of the target model is as described in the above embodiments, which is not described herein again.
Specifically, the source data corresponding to the enterprise to be identified is input into the target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified, for example, the source data corresponding to the enterprise K to be identified is input into the target model to obtain the enterprise abnormal probability corresponding to the enterprise K to be identified, which is 30%.
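A hedged sketch of this inference step, with a stand-in random forest and random source data in place of the real target model and the real enterprise features; scikit-learn and every name here are assumptions of the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins for the trained target model and for one row of source data.
rng = np.random.default_rng(4)
X_train = rng.normal(size=(100, 6))
y_train = np.tile([0, 1], 50)                     # label information: 0 = normal, 1 = abnormal
target_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

source_data_K = rng.normal(size=(1, 6))           # source data of the enterprise K to be identified
anomaly_probability = target_model.predict_proba(source_data_K)[0, 1]
print(f"enterprise anomaly probability for K: {anomaly_probability:.0%}")
```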
According to the technical scheme of the embodiment of the invention, the source data corresponding to the enterprise to be identified is acquired and input into the target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified. Because the target model is obtained by training after the historical source data with a low contribution rate has been filtered out based on the sample evaluation index, the retained historical source data contributes high value while overfitting is effectively avoided, and the efficiency and accuracy of obtaining the enterprise abnormal probability corresponding to the enterprise to be identified are improved.
EXAMPLE III
Fig. 6 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention. The present embodiment may be applied to the case of model training, the apparatus may be implemented in a software and/or hardware manner, and the model training apparatus may be integrated in any device providing a model training function, as shown in fig. 6, where the model training apparatus specifically includes: a sample set acquisition module 310, an enterprise anomaly prediction probability determination module 320, and a training module 330.
The sample set acquisition module is used for acquiring a target sample set, where the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
the enterprise abnormal prediction probability determining module is used for inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and the training module is used for training the parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
Optionally, the sample set obtaining module is specifically configured to:
obtaining historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
Optionally, the sample set obtaining module is specifically configured to:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
Optionally, the sample set obtaining module is specifically configured to:
acquiring the inter-group mean square of each column of first data samples and the intra-group mean square of each column of first data samples in the target sample matrix;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
Optionally, the sample set obtaining module is specifically configured to:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
Optionally, the sample set obtaining module is specifically configured to:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
Optionally, the sample set obtaining module is specifically configured to:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Optionally, the sample set obtaining module is specifically configured to:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
Optionally, the sample set obtaining module is specifically configured to:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
Optionally, the sample set obtaining module is specifically configured to:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
The apparatus can execute the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
According to the technical scheme of the embodiment of the invention, the historical source data of the sample enterprise is screened based on the sample evaluation index to obtain the target sample set; a first data sample in the target sample set is input into the pre-established random forest model to obtain the enterprise abnormal prediction probability; and the parameters of the random forest model are then trained according to the objective function formed based on the enterprise abnormal prediction probability and the label information corresponding to the first data sample. Historical source data with a low contribution rate can thus be filtered out based on the sample evaluation index, so that the retained data contributes high value while overfitting is effectively avoided, and the training efficiency and accuracy of the model are improved.
Example four
Fig. 7 is a schematic structural diagram of an enterprise anomaly probability determining apparatus according to an embodiment of the present invention. The present embodiment may be applied to the case of determining the enterprise abnormal probability, where the apparatus may be implemented in a software and/or hardware manner, and the apparatus for determining the enterprise abnormal probability may be integrated in any device providing the function of determining the enterprise abnormal probability, as shown in fig. 7, where the apparatus for determining the enterprise abnormal probability specifically includes: a source data acquisition module 410 and an enterprise anomaly probability determination module 420.
The source data acquisition module is used for acquiring source data corresponding to the enterprise to be identified;
the enterprise abnormal probability determining module is used for inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; the target model is obtained by training according to the model training method in the embodiment.
According to the technical scheme of the embodiment of the invention, the source data corresponding to the enterprise to be identified is acquired and input into the target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified. Because the target model is obtained by training after the historical source data with a low contribution rate has been filtered out based on the sample evaluation index, the retained historical source data contributes high value while overfitting is effectively avoided, and the efficiency and accuracy of obtaining the enterprise abnormal probability corresponding to the enterprise to be identified are improved.
EXAMPLE five
FIG. 8 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM)12, a Random Access Memory (RAM)13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM)12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the model training method:
obtaining a target sample set, wherein the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and training parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
Or, for example, the enterprise anomaly probability determination method:
acquiring source data corresponding to an enterprise to be identified;
inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; wherein the target model is trained according to the model training method of any one of claims 1-10.
In some embodiments, the model training method, or alternatively, the enterprise anomaly probability determination method, may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When loaded into RAM 13 and executed by processor 11, the computer program may perform one or more steps of the model training method described above, or the enterprise anomaly probability determination method. Alternatively, in other embodiments, processor 11 may be configured in any other suitable manner (e.g., by way of firmware) to perform a model training method, or an enterprise anomaly probability determination method.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
An embodiment of the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the model training method according to any embodiment of the present invention, or an enterprise anomaly probability determining method.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A method of model training, comprising:
obtaining a target sample set, wherein the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and training parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
2. The method of claim 1, wherein obtaining a target sample set comprises:
acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
3. The method of claim 2, wherein obtaining a correlation value for each column of first data samples in the target sample matrix comprises:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
4. The method of claim 2, wherein obtaining analysis of variance values for each column of first data samples in the target sample matrix comprises:
acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
5. The method according to claim 2, wherein the step of screening the target sample matrix according to the sample evaluation index value corresponding to the first data sample of each column to obtain a target sample set comprises:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
6. The method of claim 5, wherein deleting at least one column of first data samples having a sample evaluation index value less than an average sample evaluation index value from the target sample matrix to obtain a target sample set comprises:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
7. The method of claim 6, wherein performing PCA dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set comprises:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
8. The method of claim 7, wherein de-centering the first sample matrix to obtain a second sample matrix comprises:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
9. The method of claim 8, wherein performing PCA dimension reduction on the second sample matrix based on singular value decomposition to obtain a target sample set comprises:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
10. The method of claim 9, wherein obtaining a fourth matrix of samples comprises:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
11. An enterprise anomaly probability determination method is characterized by comprising the following steps:
acquiring source data corresponding to an enterprise to be identified;
inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; wherein the target model is trained according to the model training method of any one of claims 1-10.
12. A model training apparatus, comprising:
the apparatus comprises a sample set acquisition module, an enterprise abnormal prediction probability determining module and a training module, wherein the sample set acquisition module is used for acquiring a target sample set, and the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
the enterprise abnormal prediction probability determining module is used for inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and the training module is used for training the parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
13. An apparatus for determining enterprise anomaly probability, comprising:
the source data acquisition module is used for acquiring source data corresponding to the enterprise to be identified;
the enterprise abnormal probability determining module is used for inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; wherein the target model is trained according to the model training method of any one of claims 1-10.
14. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model training method of any one of claims 1-10 or to enable the at least one processor to perform the enterprise anomaly probability determining method of claim 11.
15. A computer-readable storage medium storing computer instructions for causing a processor to implement the model training method of any one of claims 1-10 or the enterprise anomaly probability determination method of claim 11 when executed.
16. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, implements the model training method according to any one of claims 1-10, or which, when being executed by a processor, implements the enterprise anomaly probability determination method according to claim 11.
CN202210518775.2A 2022-05-12 Model training method, probability determining device, model training equipment, model training medium and model training product Active CN114861800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210518775.2A CN114861800B (en) 2022-05-12 Model training method, probability determining device, model training equipment, model training medium and model training product


Publications (2)

Publication Number Publication Date
CN114861800A true CN114861800A (en) 2022-08-05
CN114861800B CN114861800B (en) 2024-07-26




Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241418A (en) * 2018-08-22 2019-01-18 中国平安人寿保险股份有限公司 Abnormal user recognition methods and device, equipment, medium based on random forest
CN114219562A (en) * 2021-12-13 2022-03-22 香港中文大学(深圳) Model training method, enterprise credit evaluation method and device, equipment and medium
CN114298176A (en) * 2021-12-16 2022-04-08 重庆大学 Method, device, medium and electronic equipment for detecting fraudulent user

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596336A (en) * 2023-05-16 2023-08-15 合肥联宝信息技术有限公司 State evaluation method and device of electronic equipment, electronic equipment and storage medium
CN116596336B (en) * 2023-05-16 2023-10-31 合肥联宝信息技术有限公司 State evaluation method and device of electronic equipment, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant