CN114861800A - Model training method, probability determination method, device, equipment, medium and product - Google Patents


Publication number
CN114861800A
Authority
CN
China
Prior art keywords
sample
data
column
matrix
data samples
Prior art date
Legal status
Granted
Application number
CN202210518775.2A
Other languages
Chinese (zh)
Other versions
CN114861800B (en)
Inventor
刘钱
张建
Current Assignee
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN202210518775.2A
Publication of CN114861800A
Application granted
Publication of CN114861800B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The embodiment of the invention relates to the technical field of intelligent finance, and in particular to a model training method, a probability determination method, a device, equipment, a medium and a product. The method comprises the following steps: obtaining a target sample set, wherein the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index; inputting a first data sample in the target sample set into a pre-established random forest model to obtain an enterprise abnormal prediction probability; and training the parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample. According to the technical scheme, historical source data with a low contribution rate can be filtered based on the sample evaluation index, so that overfitting is effectively avoided, the retained source data contribute high value, and the training efficiency and accuracy of the model are improved.

Description

Model training method, probability determination method, device, equipment, medium and product
Technical Field
The embodiment of the invention relates to the technical field of intelligent finance, and in particular to a model training method, a probability determination method, a device, equipment, a medium and a product.
Background
Actual business scenarios include many enterprise anomaly scenarios. For these scenarios, the anomaly probability of an enterprise is mostly determined through a neural network model in the prior art. However, because abnormal enterprises account for only a very small proportion, the samples are imbalanced, and training a model with imbalanced samples reduces the accuracy of the model. Moreover, because the source data involved in some services keeps growing, overfitting easily occurs when the source data are processed, which reduces the efficiency of model training.
Disclosure of Invention
Embodiments of the present invention provide a model training method, a probability determination method, an apparatus, a device, a medium, and a product, so as to filter historical source data with a low contribution rate based on a sample evaluation index, thereby effectively avoiding overfitting, enabling the retained source data to contribute high value, and improving the training efficiency and accuracy of the model.
According to an aspect of the present invention, there is provided a model training method, including:
obtaining a target sample set, wherein the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and training parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
Further, obtaining a target sample set includes:
acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
Further, obtaining a correlation value of each column of the first data samples in the target sample matrix includes:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
Further, obtaining an analysis of variance value of each column of the first data sample in the target sample matrix includes:
acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
Further, screening the target sample matrix according to the sample evaluation index value corresponding to the first data sample of each column to obtain a target sample set, including:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
Further, deleting at least one column of first data samples with the sample evaluation index value smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set, including:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
Further, performing PCA dimension reduction on the first sample matrix based on singular value decomposition to obtain a target sample set, including:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Further, the step of performing de-centering on the first sample matrix to obtain a second sample matrix includes:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
Further, performing PCA dimension reduction on the second sample matrix based on singular value decomposition to obtain a target sample set, including:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
Further, obtaining a fourth sample matrix includes:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
According to another aspect of the present invention, there is provided an enterprise anomaly probability determining method, including:
acquiring source data corresponding to an enterprise to be identified;
inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; wherein the target model is trained according to the model training method of any one of claims 1-10.
According to another aspect of the present invention, there is provided a model training apparatus including: the system comprises a sample set acquisition module, a sample set selection module and a sample set selection module, wherein the sample set acquisition module is used for acquiring a target sample set, and the target sample set is obtained by screening historical source data of a sample enterprise based on sample evaluation indexes;
the enterprise abnormal prediction probability determining module is used for inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and the training module is used for training the parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
Further, the sample set obtaining module is specifically configured to:
acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
Further, the sample set obtaining module is specifically configured to:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
Further, the sample set obtaining module is specifically configured to:
acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
Further, the sample set obtaining module is specifically configured to:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
Further, the sample set obtaining module is specifically configured to:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
Further, the sample set obtaining module is specifically configured to:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Further, the sample set obtaining module is specifically configured to:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
Further, the sample set obtaining module is specifically configured to:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
Further, the sample set obtaining module is specifically configured to:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
According to another aspect of the present invention, there is provided an enterprise anomaly probability determining apparatus, including:
the source data acquisition module is used for acquiring source data corresponding to the enterprise to be identified;
the enterprise abnormal probability determining module is used for inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; the target model is obtained by training according to the model training method of any embodiment of the invention.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the model training method of any one of the embodiments of the present invention or to enable the at least one processor to perform the enterprise anomaly probability determination method of any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the model training method according to any one of the embodiments of the present invention, or the enterprise anomaly probability determination method according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer program product, which when executed by a processor implements the model training method according to any one of the embodiments of the present invention, or which when executed by a processor implements the enterprise anomaly probability determination method according to any one of the embodiments of the present invention.
According to the method and the device, the historical source data of the sample enterprise are screened based on the sample evaluation index to obtain the target sample set, a first data sample in the target sample set is input into a pre-established random forest model to obtain the enterprise abnormal prediction probability, and the parameters of the random forest model are then trained according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample. In this way, historical source data with a low contribution rate can be filtered based on the sample evaluation index, so that overfitting is effectively avoided, the retained data contribute high value, and the training efficiency and accuracy of the model are improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a model training method in an embodiment of the invention;
FIG. 2 is a graph of cumulative variance contribution ratio for each column of the first data sample in a fifth sample matrix in an embodiment of the present invention;
FIG. 3 is a graph of AUC iterations during model training in an embodiment of the present invention;
FIG. 4 is a ROC graph in an embodiment of the present invention;
FIG. 5 is a flowchart of a method for determining enterprise anomaly probability in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an enterprise anomaly probability determining apparatus in an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The acquisition, storage and/or processing of the data involved in the technical solution of the present application comply with the relevant provisions of national laws and regulations.
Example one
Fig. 1 is a flowchart of a model training method provided in an embodiment of the present invention, where the present embodiment is applicable to a case of model training, and the method may be executed by a model training apparatus in an embodiment of the present invention, and the model training apparatus may be implemented in a software and/or hardware manner, as shown in fig. 1, the method specifically includes the following steps:
s110, a target sample set is obtained, wherein the target sample set is obtained after the historical source data of the sample enterprise are screened based on the sample evaluation indexes.
The historical source data is enterprise data authorized to be disclosed by the sample enterprise, and may include, for example: basic information of a sample enterprise.
The target sample set may be obtained in the following manner: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample (it should be noted that the data volume of the historical source data involved in the actual service is very large, and therefore, in general, the historical source data includes a plurality of first data samples); randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
The target sample set may be obtained by: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample; randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data; determining a relevance value of each column of the first data samples according to the rank of each column of the first data samples, the rank of the label information and the total number of the first data samples in the target sample matrix; acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples; determining an analysis of variance value of each column of first data samples according to a ratio of the inter-group mean square of each column of first data samples to the intra-group mean square of each column of first data samples; acquiring a feature importance value of each column of first data samples in the target sample matrix; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
The target sample set may be obtained by: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample; randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
The target sample set may be obtained by: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample; randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix; and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
The target sample set may be obtained by: acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample; randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples; obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples; determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples; deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix; obtaining the tolerance of each column of first data samples in the second sample matrix; deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix; obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix; determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix; obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix; reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix; and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
And S120, inputting the first data sample in the target sample set into a pre-established random forest model to obtain the enterprise abnormal prediction probability.
Wherein the first data sample is any data sample in the target sample set.
Wherein the random forest comprises at least one decision tree. A decision tree is a tree-like structure in which each internal node represents a test on an attribute, each branch represents a test output, and each leaf node represents a category.
Specifically, the first data sample in the target sample set is input into a pre-established random forest model to obtain the enterprise abnormal prediction probability, for example, the first data sample of the enterprise R is input into the pre-established random forest model to obtain the enterprise abnormal prediction probability P corresponding to the enterprise R.
S130, training parameters of the random forest model according to an objective function formed based on the enterprise abnormal prediction probability and label information corresponding to the first data sample.
Wherein the parameters of the random forest model may include: the number of decision trees, the number of features randomly sampled at each node, the minimum number of samples allowed at a leaf node, the maximum number of leaf nodes allowed, and the like, which are not limited in the embodiments of the present invention.
The label information may indicate that the enterprise corresponding to the first data sample is an abnormal enterprise, or it may indicate that the enterprise corresponding to the first data sample is a normal enterprise.
Specifically, after the parameters of the random forest model are trained according to the objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample, the above steps are executed in a loop to iteratively train the random forest model and obtain the target model.
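By way of illustration only, the following sketch shows how a random forest with parameters of the kinds listed above could be configured, fitted on a target sample set, and used to obtain an abnormal prediction probability. The library choice (scikit-learn) and all hyperparameter values are assumptions for the sketch, not values prescribed by this embodiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))      # stand-in for the target sample set
y_train = rng.integers(0, 2, size=200)    # stand-in label information (0: normal, 1: abnormal)

model = RandomForestClassifier(
    n_estimators=100,       # number of decision trees (assumed value)
    max_features="sqrt",    # features randomly sampled at each node split (assumed)
    min_samples_leaf=5,     # minimum number of samples allowed at a leaf node (assumed)
    max_leaf_nodes=64,      # maximum number of leaf nodes allowed (assumed)
    random_state=0,
)
model.fit(X_train, y_train)
# abnormal prediction probability for one first data sample
p_abnormal = model.predict_proba(X_train[:1])[0, 1]
```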
Optionally, obtaining a target sample set includes:
acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
Specifically, in order to solve the problem of sample imbalance, the embodiment of the present invention reconstructs the historical source data by using random undersampling, a resampling method, that is, a certain proportion of the majority-class samples are removed.
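A minimal sketch of random undersampling of the majority class is given below, assuming binary label information (1 for abnormal, 0 for normal); the 1:1 sampling ratio is an illustrative assumption, not a ratio fixed by this embodiment.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Remove a proportion of majority-class samples so that
    len(kept majority) == ratio * len(minority)."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # abnormal enterprises (few)
    majority = np.flatnonzero(y == 0)   # normal enterprises (many)
    keep = rng.choice(majority, size=int(ratio * len(minority)), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]
```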
Wherein the correlation value may be the Spearman correlation coefficient, also commonly referred to as the Spearman rank correlation coefficient. The "rank" is understood as the order obtained from the sorted position of the first data samples in each column of the target sample matrix. The relevance value is calculated based on the following formula:

$\rho_i = 1 - \dfrac{6\sum_{j=1}^{n} d_j^2}{n(n^2 - 1)}$, where $d_j = rg(x_{ij}) - rg(Y_j)$.

First, the $i$-th column of first data samples $x_i$ and the label information $Y$ corresponding to the first data samples are sorted, the sorted positions are recorded, and $rg(x_{ij})$ and $rg(Y_j)$, called ranks, are obtained from the sorted positions; the rank difference is the $d_j$ in the above formula. $n$ is the number of first data samples in the target sample matrix and $m$ is the number of columns of the target sample matrix. The target sample matrix may be written as $D = \{x_1, x_2, \ldots, x_m\}$ with $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, $j = 1, 2, \ldots, n$, where $D$ is the target sample matrix and $i = 1, 2, \ldots, m$.
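The per-column computation can be sketched as follows, using the rank-difference formula above; the simple ranking used here ignores tie correction, and the variable names are illustrative.

```python
import numpy as np

def spearman_column(x_col, y):
    """Spearman rank correlation between one feature column and the labels:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d = rank(x) - rank(y)."""
    n = len(x_col)
    rank_x = np.argsort(np.argsort(x_col)) + 1   # simple ranking (ties not corrected in this sketch)
    rank_y = np.argsort(np.argsort(y)) + 1
    d = rank_x - rank_y
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```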
The feature importance value of each column of first data samples may be obtained as follows: a tree model is constructed in advance according to each column of first data samples in the target sample matrix and the label information corresponding to each column of first data samples, and the feature importance value of each column of first data samples is measured by the average value of the importance of that column of first data samples over the individual trees.
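As one possible reading of the above, the sketch below builds a forest of trees and averages each column's importance over the individual trees; using scikit-learn's impurity-based importances is an assumption about the tree model, not a requirement of this embodiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def feature_importance_values(X, y, seed=0):
    """Importance of each column, averaged over the single trees of a forest."""
    forest = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X, y)
    per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
    return per_tree.mean(axis=0)   # mean importance of each column across the trees
```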
The analysis of variance value of each column of the first data samples may be obtained by: acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples; and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
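A short sketch of the analysis of variance value as the ratio of the inter-group mean square to the intra-group mean square, with the groups defined by the label values; the grouping and names are illustrative.

```python
import numpy as np

def anova_value(x_col, y):
    """F value: inter-group mean square / intra-group mean square,
    where the groups are defined by the label (e.g. normal vs. abnormal)."""
    groups = [x_col[y == g] for g in np.unique(y)]
    grand_mean = x_col.mean()
    k, n = len(groups), len(x_col)
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return ms_between / ms_within
```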
Specifically, the sample evaluation index value corresponding to each column of first data samples may be determined from the correlation value, the feature importance value and the analysis of variance value of each column of first data samples as follows: a correlation value vector is determined according to the correlation value of each column of first data samples, a feature importance vector is determined according to the feature importance value of each column of first data samples, and an analysis of variance value vector is determined according to the analysis of variance value of each column of first data samples. Relativity conversion is performed on the correlation value vector to obtain a converted correlation value vector, on the feature importance vector to obtain a converted feature importance vector, and on the analysis of variance value vector to obtain a converted analysis of variance value vector. A weight is set for the feature importance value, for the correlation value and for the analysis of variance value; the sample evaluation index value corresponding to each column of first data samples is determined according to these weights and the converted correlation value vector, the converted feature importance vector and the converted analysis of variance value vector, and a sample evaluation index vector is thereby obtained.
In a specific example, a correlation value vector $R = (\rho_1, \rho_2, \ldots, \rho_m)$ is determined based on the correlation values of the first data samples of each column in the target sample matrix, wherein $\rho_1$ is the correlation value of the 1st column of first data samples $x_1$, $\rho_2$ is the correlation value of the 2nd column of first data samples $x_2$, and $\rho_m$ is the correlation value of the $m$-th column of first data samples $x_m$. A feature importance vector $V = (v_1, v_2, \ldots, v_m)$ is determined according to the feature importance value of the first data samples of each column in the target sample matrix, wherein $v_1$ is the feature importance value of the 1st column of first data samples $x_1$, $v_2$ is the feature importance value of the 2nd column of first data samples $x_2$, and $v_m$ is the feature importance value of the $m$-th column of first data samples $x_m$. An analysis of variance value vector $F = (f_1, f_2, \ldots, f_m)$ is determined based on the analysis of variance values of the first data samples of each column in the target sample matrix, wherein $f_1$ is the analysis of variance value of the 1st column of first data samples $x_1$, $f_2$ is the analysis of variance value of the 2nd column of first data samples $x_2$, and $f_m$ is the analysis of variance value of the $m$-th column of first data samples $x_m$. Relativity conversion is performed on the correlation value vector $R$ to obtain the vector $R' = R / \max(R)$, where $\max(R)$ is the maximum correlation value in the correlation value vector. Relativity conversion is performed on the feature importance vector $V$ to obtain the vector $V' = V / \max(V)$, where $\max(V)$ is the maximum feature importance value in the feature importance vector. Relativity conversion is performed on the analysis of variance value vector $F$ to obtain the vector $F' = F / \max(F)$, where $\max(F)$ is the largest analysis of variance value in the analysis of variance value vector. The sample evaluation index value corresponding to the $i$-th column of first data samples $x_i$ is defined as $s_i = a\,\rho_i' + b\,v_i' + c\,f_i'$, wherein $\rho_i'$ is the converted correlation value of the $i$-th column of first data samples $x_i$, $v_i'$ is the converted feature importance value of the $i$-th column of first data samples $x_i$, $f_i'$ is the converted analysis of variance value of the $i$-th column of first data samples $x_i$, $a$ is the weight corresponding to the correlation value, $b$ is the weight corresponding to the feature importance value, and $c$ is the weight corresponding to the analysis of variance value. The sample evaluation index vector $S = (s_1, s_2, \ldots, s_m)$ is thereby obtained.
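A compact sketch of the combination step in the example above, dividing each vector by its maximum value and weighting with a, b and c; the particular weight values shown are assumptions for illustration only.

```python
import numpy as np

def sample_evaluation_index(R, V, F, a=0.4, b=0.4, c=0.2):
    """s_i = a * R'_i + b * V'_i + c * F'_i with R' = R / max(R), etc.
    The weights a, b, c here are assumed values."""
    R_p, V_p, F_p = R / R.max(), V / V.max(), F / F.max()
    return a * R_p + b * V_p + c * F_p   # sample evaluation index vector
```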
Optionally, obtaining a correlation value of each column of the first data sample in the target sample matrix includes:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
Specifically, the manner of determining the correlation value of each column of the first data samples according to the rank of each column of the first data samples, the rank of the label information, and the total number of the first data samples in the target sample matrix may be: and obtaining a difference value between the rank of each column of the first data sample and the rank of the label information of each column of the first data sample, and determining a correlation value of each column of the first data sample according to the difference value between the rank of each column of the first data sample and the rank of the label information of each column of the first data sample and the total number of the first data samples in the target sample matrix.
In a specific example, the correlation value of the $i$-th column of first data samples $x_i$ is calculated based on the following formula:

$\rho_i = 1 - \dfrac{6\sum_{j=1}^{n} d_j^2}{n(n^2 - 1)}$, where $d_j = rg(x_{ij}) - rg(Y_j)$.

First, the $i$-th column of first data samples $x_i$ and the label information $Y$ corresponding to the first data samples are sorted, the sorted positions are recorded, and $rg(x_{ij})$ and $rg(Y_j)$, called ranks, are obtained from the sorted positions; the rank difference is the $d_j$ in the above formula. $n$ is the number of first data samples in the target sample matrix and $m$ is the number of columns of the target sample matrix. The target sample matrix may be written as $D = \{x_1, x_2, \ldots, x_m\}$ with $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, $j = 1, 2, \ldots, n$, where $D$ is the target sample matrix and $i = 1, 2, \ldots, m$.
Optionally, obtaining an analysis of variance value of each column of the first data sample in the target sample matrix includes:
acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
Specifically, the analysis of variance value of each column of first data samples may be determined as the ratio of the inter-group mean square of that column of first data samples to the intra-group mean square of that column of first data samples. For example, the analysis of variance value of the $i$-th column of first data samples $x_i$ may be determined based on the following formula:

$f_i = \dfrac{MS_{between,i}}{MS_{within,i}}$

wherein $MS_{between,i}$ is the inter-group mean square of the $i$-th column of first data samples $x_i$, and $MS_{within,i}$ is the intra-group mean square of the $i$-th column of first data samples $x_i$.
Optionally, the screening the target sample matrix according to the sample evaluation index value corresponding to the first data sample in each column to obtain a target sample set, including:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
The average sample evaluation index value is the ratio of the sum of the sample evaluation index values corresponding to each column of first data samples in the target sample matrix to the total number of columns of first data samples in the target sample matrix.
In one specific example, for the sample evaluation index vector $S = (s_1, s_2, \ldots, s_m)$, the average value $\bar{s}$ of $S$ is obtained; if $s_i$ is less than $\bar{s}$, the $i$-th column of first data samples $x_i$ is deleted from the target sample matrix, and a target sample set is generated according to the first data samples in the deleted target sample matrix and the label information corresponding to the first data samples.
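A sketch of this screening step, dropping the columns whose sample evaluation index value falls below the mean; variable names are illustrative.

```python
import numpy as np

def screen_columns(D, s):
    """Keep only the columns of D whose sample evaluation index s_i
    is not smaller than the average sample evaluation index value."""
    keep = s >= s.mean()
    return D[:, keep]
```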
Optionally, deleting at least one column of first data samples with sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set, where the method includes:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
Specifically, the method for performing PCA dimension reduction on the first sample matrix based on singular value decomposition to obtain the target sample set may be: performing decentralization on the first sample matrix to obtain a second sample matrix; and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Specifically, at least one column of first data samples with sample evaluation index values smaller than the average sample evaluation index value is deleted from the target sample matrix to obtain the first sample matrix. For example, the average value $\bar{s}$ of the sample evaluation index vector $S$ may be obtained; if $s_i$ is less than $\bar{s}$, the $i$-th column of first data samples $x_i$ is deleted from the target sample matrix, and the remaining columns form the first sample matrix $D = \{x_1, x_2, \ldots, x_w\}$, where $w \le m$.
Optionally, performing PCA dimension reduction on the first sample matrix based on singular value decomposition to obtain a target sample set, including:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Specifically, the first sample matrix may be de-centered to obtain the second sample matrix by replacing each first data sample in the first sample matrix with the ratio of that first data sample to the mean value of the column to which it belongs. For example, the mean value $\bar{x}_i$ of the $i$-th column of first data samples in the first sample matrix may be obtained, each element of $x_i$ is replaced by its ratio to $\bar{x}_i$, and the second sample matrix $D' = \{x_1', x_2', \ldots, x_w'\}$ is obtained. PCA (principal component analysis) dimensionality reduction is then performed on the second sample matrix $D'$ based on singular value decomposition to obtain the target sample set.
Optionally, the step of performing de-centering on the first sample matrix to obtain a second sample matrix includes:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
Specifically, replacing each first data sample in the first sample matrix with the ratio of that first data sample to the mean value of the column to which it belongs may be performed as follows: the mean value of each column of first data samples in the first sample matrix is obtained in advance, for example the mean value $\bar{x}_i$ of the $i$-th column; each element of $x_i$ in the first sample matrix is then replaced by its ratio to $\bar{x}_i$, and the second sample matrix $D' = \{x_1', x_2', \ldots, x_w'\}$ is obtained.
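A sketch of the de-centering described above, which replaces each value with its ratio to the mean of its column; note that this ratio form follows the text here and differs from the usual subtract-the-mean centering.

```python
import numpy as np

def decenter_by_ratio(D1):
    """Second sample matrix: each element divided by the mean of its column."""
    col_means = D1.mean(axis=0)
    return D1 / col_means
```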
Optionally, performing PCA dimension reduction on the second sample matrix based on singular value decomposition to obtain a target sample set, including:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
The tolerance is the reciprocal of the variance inflation factor (VIF), which is used to detect multicollinearity among the independent variables in a regression model.
For example, suppose there are 5 columns of first samples in total: $x_1, x_2, x_3, x_4, x_5$. Taking $x_1$ as the observed value and the remaining $x_2, x_3, x_4, x_5$ as independent variables, the linear regression $x_1 = c_{1,0} + c_{1,2}x_2 + c_{1,3}x_3 + c_{1,4}x_4 + c_{1,5}x_5 + e_1$ is fitted, and the goodness-of-fit statistic $S_1^2$ of the above regression is obtained. The variance inflation factor of $x_1$ is then $VIF_1 = \dfrac{1}{1 - S_1^2}$. Accordingly, the variance inflation factor of $x_i$ is $VIF_i = \dfrac{1}{1 - S_i^2}$, wherein $S_i^2$ is the goodness-of-fit of the regression that takes $x_i$ as the observed value and the remaining columns as independent variables. The closer $S_i^2$ is to 1, the better the fit. A $VIF_i$ equal to 1 indicates no collinearity; the closer $VIF_i$ is to 1, the smaller the multicollinearity of $x_i$; the larger $VIF_i$, the greater the multicollinearity between $x_i$ and the first samples of the other columns.
Specifically, the variance of the data represents the fluctuation information of the data. If the variance of the data is 0, the data is completely unchanged and has no research value. In general, PCA is used for dimensionality reduction, which causes information loss; the aim is to preserve as much information of the data as possible while reducing its dimensionality. The total variance of the data is equal to the sum of all eigenvalues of the covariance matrix, and the variance of the $i$-th principal component is equal to the $i$-th eigenvalue of the covariance matrix. The eigenvalues are sorted from large to small and then summed from front to back to obtain the cumulative variance. The magnitude of the cumulative variance contribution rate indicates the proportion of the information of the original data carried by all principal components currently selected.
The variance contribution rate corresponding to each column of first data samples in the fifth sample matrix may be obtained as follows: first obtain the variance corresponding to each column of first data samples in the fifth sample matrix, and then determine the variance contribution rate corresponding to each column according to those variances.
For example, as shown in fig. 2, fig. 2 shows the cumulative variance contribution rate corresponding to each column of first data samples in the fifth sample matrix; the abscissa of fig. 2 represents the number of columns and the ordinate represents the cumulative variance contribution rate.
Specifically, the manner of deleting at least one column of first data samples whose tolerance is smaller than the column tolerance mean value in the second sample matrix to obtain the third sample matrix may be: performing tolerance analysis on each column of first data samples in the second sample matrix to obtain the tolerance T_i of each column, where T_i has a value range of [0, 1]; the closer T_i is to 0, the higher the correlation between variable i and the other independent variables. The columns satisfying T_i ≥ T_mean are therefore retained and the third sample matrix is obtained, where T_mean is the mean value of the tolerances of all columns of first data samples in the second sample matrix.
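A minimal sketch of the retention rule T_i >= T_mean, assuming the tolerances have already been computed as above; the function and variable names are illustrative only.

```python
import numpy as np

def filter_by_tolerance(second_sample_matrix: np.ndarray, tolerances: np.ndarray) -> np.ndarray:
    """Retain the columns satisfying T_i >= T_mean; the remaining columns are deleted."""
    keep = tolerances >= tolerances.mean()
    return second_sample_matrix[:, keep]

# Example: columns with below-average tolerance (high collinearity) are dropped.
D_prime = np.arange(20, dtype=float).reshape(4, 5)
T = np.array([0.9, 0.2, 0.8, 0.1, 0.7])        # tolerances of the five columns
print(filter_by_tolerance(D_prime, T).shape)    # (4, 3): three columns are retained
```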
Specifically, the fifth sample matrix A is determined based on the following formula: A = (D″)^T · D‴, namely the product of the transpose of the third sample matrix D″ and the fourth sample matrix D‴.
Specifically, if the sum of the variance contribution rates of the first data samples accumulated over a preset number of columns in the fifth sample matrix is greater than the set threshold, the target sample set is generated according to those columns of first data samples. For example, if the sum of the variance contribution rates of the first data samples in the first column through the Lth column is greater than the set threshold, while the sum of the variance contribution rates of the first data samples in the first column through the (L-1)th column is smaller than the set threshold, the target sample set is generated according to the first data samples in the first column through the Lth column of the fifth sample matrix.
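The column selection based on the cumulative variance contribution rate could, under these assumptions, look roughly as follows; the NumPy-based sketch and the 0.95 threshold are illustrative choices, not values taken from the patent.

```python
import numpy as np

def select_by_cumulative_contribution(matrix: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Sort the columns by variance contribution rate and keep the first N columns whose
    cumulative contribution rate first reaches the threshold."""
    variances = matrix.var(axis=0)                        # variance of each column
    contribution = variances / variances.sum()            # variance contribution rate per column
    order = np.argsort(contribution)[::-1]                # descending contribution rate
    reordered = matrix[:, order]                          # reordered ("sixth") sample matrix
    cumulative = np.cumsum(contribution[order])           # cumulative variance contribution rate
    n = int(np.searchsorted(cumulative, threshold)) + 1   # first N columns reaching the threshold
    return reordered[:, :n]

# Example: keep just enough columns to carry 95% of the variance information.
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
print(select_by_cumulative_contribution(A, 0.95).shape)
```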
Optionally, obtaining a fourth sample matrix includes:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
Specifically, the manner of determining, as the fourth sample matrix, the matrix obtained by multiplying each first data sample in the third sample matrix by the eigenvalue of that first data sample in the covariance matrix of the third sample matrix may be: for the third sample matrix D″ = (x_1, x_2, …, x_p), p ≤ w ≤ m, with the eigenvalues of the covariance matrix of the third sample matrix K = (λ_1, λ_2, …, λ_p), the fourth sample matrix is D‴ = (x_1·λ_1, x_2·λ_2, …, x_p·λ_p).
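A rough Python sketch of forming the fourth and fifth sample matrices from the stated definitions; note that pairing the covariance eigenvalues with the columns in the order returned by the decomposition is an assumption of this sketch rather than a detail given in the original text.

```python
import numpy as np

def fourth_and_fifth_matrices(third_sample_matrix: np.ndarray):
    """Scale each column of the third sample matrix by an eigenvalue of its covariance
    matrix (obtained through singular value decomposition), then left-multiply by the
    transpose of the third sample matrix."""
    cov = np.cov(third_sample_matrix, rowvar=False)   # covariance matrix of the columns
    # For a symmetric positive semi-definite matrix the singular values equal the eigenvalues.
    _, eigenvalues, _ = np.linalg.svd(cov)
    fourth = third_sample_matrix * eigenvalues        # D''' = (x1*l1, x2*l2, ..., xp*lp)
    fifth = third_sample_matrix.T @ fourth            # A = (D'')^T * D'''
    return fourth, fifth

rng = np.random.default_rng(2)
D2 = rng.normal(size=(50, 4))                         # a stand-in third sample matrix
D3, A = fourth_and_fifth_matrices(D2)
print(D3.shape, A.shape)                              # (50, 4) (4, 4)
```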
In the embodiment of the invention, the reconstructed sample data set is used to train and test a random forest (RF) model. The AUC over the iterations of the model training process is shown in fig. 3, where the abscissa represents the iteration round and the ordinate represents the AUC.
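For readers who want to reproduce an AUC-versus-iteration curve of the kind shown in fig. 3, a stand-in experiment with scikit-learn might look as follows; the synthetic data, the tree counts and the library choice are assumptions, and this is not the training procedure of the embodiment itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the reconstructed sample data set and its labels.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Track AUC as the forest grows, analogous to an AUC-per-iteration curve.
for n_trees in (10, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    print(f"{n_trees} trees -> AUC {auc:.3f}")
```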
Given a threshold, the TPR (coverage rate) and FPR (disturbance rate) can be calculated from the confusion matrix: TPR = TP/(TP + FN) and FPR = FP/(FP + TN), where TP is the number of true positives, FN the number of false negatives, FP the number of false positives, and TN the number of true negatives. By setting different thresholds, a series of TPR and FPR values is obtained, from which the ROC curve shown in fig. 4 can be drawn (the solid line is the ROC curve and the dotted line is the reference curve); the abscissa in fig. 4 represents FPR and the ordinate represents TPR.
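The threshold sweep that produces the (FPR, TPR) points of the ROC curve can be sketched directly from the confusion-matrix definitions above; the labels and scores below are synthetic and purely illustrative.

```python
import numpy as np

def tpr_fpr(y_true: np.ndarray, scores: np.ndarray, threshold: float):
    """Coverage rate TPR = TP/(TP + FN) and disturbance rate FPR = FP/(FP + TN)
    for a single classification threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Sweeping the threshold yields the series of (FPR, TPR) points of the ROC curve.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=500)
scores = np.clip(0.4 * y + rng.normal(0.3, 0.2, size=500), 0.0, 1.0)
for t in (0.2, 0.4, 0.6, 0.8):
    tpr, fpr = tpr_fpr(y, scores, t)
    print(f"threshold {t:.1f}: TPR {tpr:.2f}, FPR {fpr:.2f}")
```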
The scoring index in the embodiment of the invention first calculates three coverage rates, TPR1, TPR2 and TPR3 (their defining formulas are given in an accompanying figure of the application), and the final error is tpe = 0.4 × TPR1 + 0.3 × TPR2 + 0.3 × TPR3.
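The weighted combination itself is a one-liner; the three coverage-rate values in the example below are made up solely to show the arithmetic.

```python
def tpe_score(tpr1: float, tpr2: float, tpr3: float) -> float:
    """Weighted combination of the three coverage rates used as the scoring index."""
    return 0.4 * tpr1 + 0.3 * tpr2 + 0.3 * tpr3

print(f"{tpe_score(0.70, 0.62, 0.61):.3f}")  # 0.649 with these illustrative inputs
```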
Model verification is then performed by prediction on the test set, and the classification metrics AUC (AUC = 0.81) and tpe (tpe = 0.65) are obtained.
According to the technical scheme of the embodiment of the invention, the historical source data of the sample enterprise is screened based on the sample evaluation index to obtain the target sample set; a first data sample in the target sample set is input into the pre-established random forest model to obtain the enterprise abnormal prediction probability; and the parameters of the random forest model are then trained according to the objective function formed based on the enterprise abnormal prediction probability and the label information corresponding to the first data sample. Historical source data with a low contribution rate can thus be filtered out based on the sample evaluation index, so that the retained data contributes high value while overfitting is effectively avoided, and the training efficiency and accuracy of the model are improved.
Example two
Fig. 5 is a flowchart of an enterprise abnormal probability determination method according to an embodiment of the present invention. This embodiment is applicable to determining the abnormal probability of an enterprise, and the method may be executed by an enterprise abnormal probability determination apparatus according to an embodiment of the present invention, which may be implemented in software and/or hardware. As shown in fig. 5, the method specifically includes the following steps:
and S210, acquiring source data corresponding to the enterprise to be identified.
The source data corresponding to the to-be-identified enterprise is enterprise data authorized to be disclosed by the to-be-identified enterprise, and may include, for example: basic information of the enterprise to be identified.
The source data corresponding to the enterprise to be identified may be acquired from a database; alternatively, if an application instruction sent by a target enterprise is detected, the target enterprise is determined as the enterprise to be identified, and the source data corresponding to the target enterprise is acquired.
The source data may be type data such as enterprise basic information, enterprise legal person information, enterprise credit investigation, enterprise legal person credit investigation, and the like, and the embodiment of the present invention does not limit specific contents of the source data.
The source data corresponding to the enterprise to be identified is the source data disclosed by the enterprise to be identified, or the source data disclosed by the enterprise to be identified is acquired after the authorization of an enterprise legal person is obtained.
And S220, inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified.
The training method of the target model is as described in the above embodiments, which is not described herein again.
Specifically, the source data corresponding to the enterprise to be identified is input into the target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified, for example, the source data corresponding to the enterprise K to be identified is input into the target model to obtain the enterprise abnormal probability corresponding to the enterprise K to be identified, which is 30%.
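A hedged sketch of this inference step, with a stand-in random forest and random source data in place of the real target model and the real enterprise features; scikit-learn and every name here are assumptions of the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins for the trained target model and for one row of source data.
rng = np.random.default_rng(4)
X_train = rng.normal(size=(100, 6))
y_train = np.tile([0, 1], 50)                     # label information: 0 = normal, 1 = abnormal
target_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

source_data_K = rng.normal(size=(1, 6))           # source data of the enterprise K to be identified
anomaly_probability = target_model.predict_proba(source_data_K)[0, 1]
print(f"enterprise anomaly probability for K: {anomaly_probability:.0%}")
```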
According to the technical scheme of the embodiment of the invention, the source data corresponding to the enterprise to be identified is acquired and input into the target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified. Because the target model is obtained by training after the historical source data with a low contribution rate has been filtered out based on the sample evaluation index, the retained historical source data contributes high value while overfitting is effectively avoided, and the efficiency and accuracy of obtaining the enterprise abnormal probability corresponding to the enterprise to be identified are improved.
EXAMPLE III
Fig. 6 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention. The present embodiment may be applied to the case of model training, the apparatus may be implemented in a software and/or hardware manner, and the model training apparatus may be integrated in any device providing a model training function, as shown in fig. 6, where the model training apparatus specifically includes: a sample set acquisition module 310, an enterprise anomaly prediction probability determination module 320, and a training module 330.
The sample set acquisition module is used for acquiring a target sample set, where the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
the enterprise abnormal prediction probability determining module is used for inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and the training module is used for training the parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
Optionally, the sample set obtaining module is specifically configured to:
obtaining historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
Optionally, the sample set obtaining module is specifically configured to:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
Optionally, the sample set obtaining module is specifically configured to:
acquiring the inter-group mean square of each column of first data samples and the intra-group mean square of each column of first data samples in the target sample matrix;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
Optionally, the sample set obtaining module is specifically configured to:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
Optionally, the sample set obtaining module is specifically configured to:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
Optionally, the sample set obtaining module is specifically configured to:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
Optionally, the sample set obtaining module is specifically configured to:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
Optionally, the sample set obtaining module is specifically configured to:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
Optionally, the sample set obtaining module is specifically configured to:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
The apparatus can execute the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
According to the technical scheme of the embodiment of the invention, the historical source data of the sample enterprise is screened based on the sample evaluation index to obtain the target sample set; a first data sample in the target sample set is input into the pre-established random forest model to obtain the enterprise abnormal prediction probability; and the parameters of the random forest model are then trained according to the objective function formed based on the enterprise abnormal prediction probability and the label information corresponding to the first data sample. Historical source data with a low contribution rate can thus be filtered out based on the sample evaluation index, so that the retained data contributes high value while overfitting is effectively avoided, and the training efficiency and accuracy of the model are improved.
Example four
Fig. 7 is a schematic structural diagram of an enterprise anomaly probability determining apparatus according to an embodiment of the present invention. The present embodiment may be applied to the case of determining the enterprise abnormal probability, where the apparatus may be implemented in a software and/or hardware manner, and the apparatus for determining the enterprise abnormal probability may be integrated in any device providing the function of determining the enterprise abnormal probability, as shown in fig. 7, where the apparatus for determining the enterprise abnormal probability specifically includes: a source data acquisition module 410 and an enterprise anomaly probability determination module 420.
The source data acquisition module is used for acquiring source data corresponding to the enterprise to be identified;
the enterprise abnormal probability determining module is used for inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; the target model is obtained by training according to the model training method in the embodiment.
According to the technical scheme of the embodiment of the invention, the source data corresponding to the enterprise to be identified is acquired and input into the target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified. Because the target model is obtained by training after the historical source data with a low contribution rate has been filtered out based on the sample evaluation index, the retained historical source data contributes high value while overfitting is effectively avoided, and the efficiency and accuracy of obtaining the enterprise abnormal probability corresponding to the enterprise to be identified are improved.
EXAMPLE five
FIG. 8 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM)12, a Random Access Memory (RAM)13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM)12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the model training method:
obtaining a target sample set, wherein the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and training parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
Or, for example, the enterprise anomaly probability determination method:
acquiring source data corresponding to an enterprise to be identified;
inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; wherein the target model is trained according to the model training method of any one of claims 1-10.
In some embodiments, the model training method, or alternatively, the enterprise anomaly probability determination method, may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When loaded into RAM 13 and executed by processor 11, the computer program may perform one or more steps of the model training method described above, or the enterprise anomaly probability determination method. Alternatively, in other embodiments, processor 11 may be configured in any other suitable manner (e.g., by way of firmware) to perform a model training method, or an enterprise anomaly probability determination method.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
An embodiment of the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the model training method according to any embodiment of the present invention, or an enterprise anomaly probability determining method.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A method of model training, comprising:
obtaining a target sample set, wherein the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and training parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
2. The method of claim 1, wherein obtaining a target sample set comprises:
acquiring historical source data, wherein the historical source data comprises: at least one first data sample and label information corresponding to the at least one first data sample;
randomly undersampling the historical source data to obtain a target sample matrix, wherein the target sample matrix comprises: at least one column of first data samples;
obtaining a correlation value of each column of first data samples in the target sample matrix, a feature importance value of each column of first data samples and an analysis of variance value of each column of first data samples;
determining a sample evaluation index value corresponding to each column of first data samples according to the relevance value of each column of first data samples, the feature importance value of each column of first data samples and the analysis of variance value of each column of first data samples;
and screening the target sample matrix according to the sample evaluation index value corresponding to each column of first data samples to obtain a target sample set.
3. The method of claim 2, wherein obtaining a correlation value for each column of first data samples in the target sample matrix comprises:
acquiring the rank of each column of first data samples in the target sample matrix, the rank of label information and the total number of first data samples in historical source data;
and determining a relevance value of the first data sample of each column according to the rank of the first data sample of each column, the rank of the label information and the total number of the first data samples in the target sample matrix.
4. The method of claim 2, wherein obtaining analysis of variance values for each column of first data samples in the target sample matrix comprises:
acquiring the inter-group mean square of each column of first data samples in the target sample matrix and the intra-group mean square of each column of first data samples;
and determining the analysis of variance value of each column of the first data samples according to the ratio of the mean square between the groups of the first data samples of each column to the mean square in the groups of the first data samples of each column.
5. The method according to claim 2, wherein the step of screening the target sample matrix according to the sample evaluation index value corresponding to the first data sample of each column to obtain a target sample set comprises:
and deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a target sample set.
6. The method of claim 5, wherein deleting at least one column of first data samples having a sample evaluation index value less than an average sample evaluation index value from the target sample matrix to obtain a target sample set comprises:
deleting at least one column of first data samples with the sample evaluation index values smaller than the average sample evaluation index value from the target sample matrix to obtain a first sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set.
7. The method of claim 6, wherein performing PCA dimensionality reduction on the first sample matrix based on singular value decomposition to obtain a target sample set comprises:
performing decentralization on the first sample matrix to obtain a second sample matrix;
and carrying out PCA (principal component analysis) dimensionality reduction on the second sample matrix based on singular value decomposition to obtain a target sample set.
8. The method of claim 7, wherein de-centering the first sample matrix to obtain a second sample matrix comprises:
and replacing each first data sample in the first sample matrix with a ratio of the first data sample to a mean value of a column to which the first data sample belongs to obtain a second sample matrix.
9. The method of claim 8, wherein performing PCA dimension reduction on the second sample matrix based on singular value decomposition to obtain a target sample set comprises:
obtaining the tolerance of each column of first data samples in the second sample matrix;
deleting at least one column of first data samples with the tolerance smaller than the column tolerance mean value in the second sample matrix to obtain a third sample matrix;
obtaining a fourth sample matrix, wherein the fourth sample matrix is a feature vector matrix corresponding to the third sample matrix;
determining a product of the transpose of the third sample matrix and the fourth sample matrix as a fifth sample matrix;
obtaining a variance contribution rate corresponding to each column of first data samples in the fifth sample matrix;
reordering the fifth sample matrix according to the sequence of variance contribution rate from large to small to obtain a sixth sample matrix;
and if the cumulative variance contribution rate corresponding to the Nth column of first data samples in the sixth sample matrix is greater than or equal to a set threshold, and the cumulative variance contribution rate corresponding to the (N-1)th column of first data samples in the sixth sample matrix is smaller than the set threshold, generating a target sample set according to the first N columns of first data samples in the sixth sample matrix, where N is a positive integer greater than or equal to 1.
10. The method of claim 9, wherein obtaining a fourth matrix of samples comprises:
acquiring a covariance matrix of the third sample matrix;
obtaining the eigenvalue of each first data sample in the covariance matrix of the third sample matrix through singular value decomposition;
and determining a matrix obtained by multiplying the eigenvalue of each first data sample in the covariance matrix of the third sample matrix by each first data sample in the third sample matrix as a fourth sample matrix.
11. An enterprise anomaly probability determination method is characterized by comprising the following steps:
acquiring source data corresponding to an enterprise to be identified;
inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; wherein the target model is trained according to the model training method of any one of claims 1-10.
12. A model training apparatus, comprising:
the apparatus comprises a sample set acquisition module, an enterprise abnormal prediction probability determining module and a training module, wherein the sample set acquisition module is used for acquiring a target sample set, and the target sample set is obtained by screening historical source data of a sample enterprise based on a sample evaluation index;
the enterprise abnormal prediction probability determining module is used for inputting a first data sample in the target sample set into a pre-established random forest model to obtain enterprise abnormal prediction probability;
and the training module is used for training the parameters of the random forest model according to an objective function formed on the basis of the enterprise abnormal prediction probability and the label information corresponding to the first data sample.
13. An apparatus for determining enterprise anomaly probability, comprising:
the source data acquisition module is used for acquiring source data corresponding to the enterprise to be identified;
the enterprise abnormal probability determining module is used for inputting the source data corresponding to the enterprise to be identified into a target model to obtain the enterprise abnormal probability corresponding to the enterprise to be identified; wherein the target model is trained according to the model training method of any one of claims 1-10.
14. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model training method of any one of claims 1-10 or to enable the at least one processor to perform the enterprise anomaly probability determining method of claim 11.
15. A computer-readable storage medium storing computer instructions for causing a processor to implement the model training method of any one of claims 1-10 or the enterprise anomaly probability determination method of claim 11 when executed.
16. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, implements the model training method according to any one of claims 1-10, or which, when being executed by a processor, implements the enterprise anomaly probability determination method according to claim 11.
CN202210518775.2A 2022-05-12 Model training method, probability determining device, model training equipment, model training medium and model training product Active CN114861800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210518775.2A CN114861800B (en) 2022-05-12 Model training method, probability determining device, model training equipment, model training medium and model training product


Publications (2)

Publication Number Publication Date
CN114861800A true CN114861800A (en) 2022-08-05
CN114861800B CN114861800B (en) 2024-07-26




Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241418A (en) * 2018-08-22 2019-01-18 中国平安人寿保险股份有限公司 Abnormal user recognition methods and device, equipment, medium based on random forest
CN114219562A (en) * 2021-12-13 2022-03-22 香港中文大学(深圳) Model training method, enterprise credit evaluation method and device, equipment and medium
CN114298176A (en) * 2021-12-16 2022-04-08 重庆大学 Method, device, medium and electronic equipment for detecting fraudulent user

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596336A (en) * 2023-05-16 2023-08-15 合肥联宝信息技术有限公司 State evaluation method and device of electronic equipment, electronic equipment and storage medium
CN116596336B (en) * 2023-05-16 2023-10-31 合肥联宝信息技术有限公司 State evaluation method and device of electronic equipment, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant