CN112329943A - Combined index selection method and device, computer equipment and medium - Google Patents

Combined index selection method and device, computer equipment and medium

Info

Publication number
CN112329943A
Authority
CN
China
Prior art keywords
data
index
training
module
data table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011232705.8A
Other languages
Chinese (zh)
Inventor
陈远波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202011232705.8A
Publication of CN112329943A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of computers and discloses a method, a device, computer equipment and a medium for selecting a combined index. The method comprises: obtaining sample data of each period; calculating the stability index corresponding to each index in the sample data, and taking each index whose stability index is smaller than a preset stability threshold as a stable index; classifying the stable indexes into modules according to a preset classification mode, and generating a training data table corresponding to each module according to the classification result; inputting the data in each training data table into a prediction model for model training, and determining the AUC corresponding to each training data table according to the training result; and selecting the modules corresponding to the training data tables in descending order of AUC.

Description

Combined index selection method and device, computer equipment and medium
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for screening a combined index, computer equipment and a medium.
Background
With the development of artificial intelligence technology, it is now common to build models by machine learning from the features in existing data. In machine learning research, model building follows a constructive feedback principle: a model is constructed, feedback is obtained from evaluation indexes, and the model is improved until the desired accuracy is reached. Evaluation indexes describe the performance of the model, and an important property of an evaluation index is its ability to distinguish between model results. Different kinds of indexes can be used to evaluate a model, and the choice depends on the type of model and on how the model is to be applied.
At present, the influence of a single index on a model is generally evaluated with the IV (information value), i.e. the amount of information the index carries. Selecting the variables that enter a model is a relatively complex process in which many factors must be considered, such as the predictive power of each variable, the correlation among variables, and the interpretability of the variables in business terms. In some complex prediction models, however, many important indexes are involved and different index combinations have a large influence on model evaluation. In such cases each index combination is currently verified one by one, which makes screening inefficient, so a method for efficiently screening combined indexes for model prediction is urgently needed.
Disclosure of Invention
The embodiment of the invention provides a combined index screening method, a combined index screening device, computer equipment and a storage medium, so as to improve the combined index screening efficiency.
In order to solve the above technical problem, an embodiment of the present application provides a method for selecting a combination index, including:
acquiring sample data of each period, calculating a stability index corresponding to each index in the sample data, and taking each index whose stability index is smaller than a preset stability threshold as a stable index;
module classification is carried out on the stable indexes according to a preset classification mode, and a training data table corresponding to each module is generated according to a classification result;
respectively inputting the data in each training data table into a prediction model for model training, and determining the AUC corresponding to each training data table according to the obtained training result;
and selecting the modules corresponding to the training data table according to the sequence of the AUC from large to small.
Optionally, the calculating a stability index corresponding to each index in the sample data, and taking each index whose stability index is smaller than a preset stability threshold as a stable index, includes:
acquiring first data and second data of sample data of any two continuous periods, and calculating a stability index PSI of the second data relative to the first data;
and acquiring the index of which the stability index PSI is smaller than a preset stability threshold value from the second data, taking the index as a stable index, and adding the stable index into a stable index set.
Optionally, the calculating the stability indicator PSI of the second data relative to the first data includes:
performing binning processing on the first data and the second data respectively to obtain binned first data and binned second data, taking each bin of the binned first data as a reference bin, and taking each bin of the binned second data as an incremental bin;
calculating, for the binned first data, the proportion of the sample data corresponding to the index that falls in each reference bin as a first proportion, and calculating, for the binned second data, the proportion of the sample data corresponding to the index that falls in each incremental bin as a second proportion;
and for each second proportion, calculating the difference between the second proportion and its corresponding first proportion, and taking the absolute value of the difference as the stability index PSI of the second data corresponding to the second proportion.
Optionally, the generating a training data table corresponding to each module according to the classification result includes:
associating a Hive table with the data table corresponding to the sample data of each period, wherein the Hive table is a data table contained in a local database;
and extracting model-input feature field data from the data table corresponding to the sample data of each period by means of an association query, extracting label data from the Hive table, and generating the training data table based on the model-input feature field data and the label data.
Optionally, the separately inputting the data in each training data table into a prediction model for model training includes:
dynamically generating a configuration file based on the attribute data of the training data table;
starting a preset evaluation script;
and sequentially reading the data of the training data table corresponding to each module by adopting the preset evaluation script according to the configuration file, and inputting the read data into a prediction model for model training.
Optionally, the prediction model is a LightGBM decision tree model, and determining an AUC corresponding to each training data table according to the obtained training result includes:
obtaining a predicted value in a training result of the LightGBM decision tree model, and performing prediction scoring on a module corresponding to the training data table according to the predicted value to obtain a prediction score of the module corresponding to the training data table;
and determining the AUC of the corresponding module of the training data table according to the prediction score and the label data in the training data table.
In order to solve the above technical problem, an embodiment of the present application further provides a device for selecting a combination index, including:
the acquisition module is used for acquiring sample data of each period, calculating a stability index corresponding to each index in the sample data, and taking each index whose stability index is smaller than a preset stability threshold as a stable index;
the classification module is used for performing module classification on the stable indexes according to a preset classification mode and generating a training data table corresponding to each module according to a classification result;
the training module is used for inputting the data in each training data table into a prediction model for model training and determining the AUC corresponding to each training data table according to the obtained training result;
and the selection module is used for selecting the modules corresponding to the training data table according to the sequence of the AUC from large to small.
Optionally, the acquisition module includes:
the stability index calculation unit is used for acquiring first data and second data of sample data of any two continuous periods and calculating a stability index PSI of the second data relative to the first data;
and the stable index set determining unit is used for acquiring the index of which the stability index PSI is smaller than a preset stable threshold value from the second data, taking the index as a stable index, and adding the stable index into the stable index set.
Optionally, the stability index calculation unit includes:
the binning subunit is used for performing binning processing on the first data and the second data respectively to obtain binned first data and binned second data, taking each bin of the binned first data as a reference bin, and taking each bin of the binned second data as an incremental bin;
the proportion calculation subunit is used for calculating, for the binned first data, the proportion of the sample data corresponding to the index that falls in each reference bin as a first proportion, and calculating, for the binned second data, the proportion of the sample data corresponding to the index that falls in each incremental bin as a second proportion;
and the stability index determining subunit is used for calculating, for each second proportion, the difference between the second proportion and its corresponding first proportion, and taking the absolute value of the difference as the stability index PSI of the second data corresponding to the second proportion.
Optionally, the classification module comprises:
the association unit is used for associating the Hive table with the data table corresponding to the sample data of each period, wherein the Hive table is a data table contained in the local database;
and the query unit is used for extracting model-input feature field data from the data table corresponding to the sample data of each period by means of an association query, extracting label data from the Hive table, and generating the training data table based on the model-input feature field data and the label data.
Optionally, the training module comprises:
the configuration file generating unit is used for dynamically generating a configuration file based on the attribute data of the training data table;
the script starting unit is used for starting a preset evaluation script;
and the data acquisition and transmission unit is used for sequentially reading the data of the training data table corresponding to each module by adopting the preset evaluation script according to the configuration file, and inputting the read data into the prediction model for model training.
Optionally, the prediction model is a LightGBM decision tree model, and the training module further includes:
the prediction score determining unit is used for obtaining a prediction value in a training result of the LightGBM decision tree model, and performing prediction scoring on a module corresponding to the training data table according to the prediction value to obtain a prediction score of the module corresponding to the training data table;
and the AUC calculating unit is used for determining the AUC of the corresponding module of the training data table according to the predicted score and the label data in the training data table.
In order to solve the technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for selecting a combination index described above when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the selection method for a combination index are implemented.
According to the method, the device, the computer equipment and the storage medium for selecting combined indexes provided by the embodiments of the invention, sample data of each period is acquired, the stability index corresponding to each index in the sample data is calculated, each index whose stability index is smaller than a preset stability threshold is taken as a stable index, and the stable indexes are classified into modules according to a preset classification mode. This controls the quality of the indexes that enter a combination and prevents a large number of low-quality indexes from affecting the screening of combined indexes, which benefits combined index selection. Meanwhile, a training data table corresponding to each module is generated according to the classification result, the data in each training data table is input into a prediction model for model training, the AUC corresponding to each training data table is determined according to the training result, and the modules corresponding to the training data tables are selected in descending order of AUC. The influence of each index combination on the model is thereby quantified, which improves the efficiency of selecting combined indexes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for selecting a combination index of the present application;
FIG. 3 is a schematic structural diagram of an embodiment of a selection device for combined indexes according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, as shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the selection method for the combination index provided in the embodiment of the present application is executed by a server, and accordingly, the selection device for the combination index is disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.
Referring to fig. 2, fig. 2 shows a method for selecting a combination index according to an embodiment of the present invention, which is described by taking the application of the method to the server in fig. 1 as an example, and is detailed as follows:
S201: Acquire sample data of each period, calculate the stability index corresponding to each index in the sample data, and take each index whose stability index is smaller than a preset stability threshold as a stable index.
Specifically, in consideration of the influence of time sequence on the prediction model, this embodiment divides the sample data according to a preset period, attaches a period label to the sample data, and selects stable indexes by evaluating the stability of each index in the sample data across periods.
The preset stability threshold takes a value in the range (0, 1) and can be set according to actual needs, which is not limited here. As a preferred mode, the preset stability threshold is set to 0.25 in this embodiment.
A prediction model refers to a quantitative relationship between objects, described by a machine learning model and used for prediction. It reveals the internal regularity of the objects to a certain extent and serves as the direct basis for calculating predicted values, so it has a great influence on prediction accuracy; the training data therefore needs to be screened to improve the prediction effect of the model.
For example, in a specific embodiment, each month is taken as one period, that is, the data set of each month is taken as one sample data, and the stable indexes are determined by calculating the stability index of each index in the sample data.
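The monthly periodization in this example could be sketched as follows. This is only an illustration under stated assumptions: the record layout, the `date` field name and the helper `group_by_month` are hypothetical and do not come from the application.

```python
from collections import defaultdict
from datetime import date


def group_by_month(records):
    """Group records into monthly periods keyed by a 'YYYY-MM' period label.

    Each record is assumed (hypothetically) to be a dict whose `date`
    field holds a datetime.date.
    """
    periods = defaultdict(list)
    for record in records:
        label = record["date"].strftime("%Y-%m")  # period label
        periods[label].append(record)
    # Return the periods in chronological order.
    return dict(sorted(periods.items()))


records = [
    {"date": date(2020, 2, 3), "age": 41},
    {"date": date(2020, 1, 15), "age": 35},
    {"date": date(2020, 1, 28), "age": 52},
]
monthly = group_by_month(records)
```

Consecutive entries of `monthly` can then play the roles of the first data and the second data when the stability of an index is evaluated.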
The calculation method of the stability index may specifically refer to the description of the subsequent embodiments, and is not repeated here to avoid repetition.
S202: Classify the stable indexes into modules according to a preset classification mode, and generate a training data table corresponding to each module according to the classification result.
Specifically, a plurality of modules are preset according to actual service requirements, and the stable indexes obtained in step S201 are classified into the modules so that each index falls into at least one module. The sample data corresponding to the indexes is then screened out of the sample data, an association between each module and the sample data corresponding to its indexes is established, and a training data table is generated.
It should be noted that the influence of a single index on a model is generally evaluated with the IV (information value), i.e. the amount of information the index carries. Selecting the variables that enter a model is a relatively complex process in which many factors must be considered, such as the predictive power of each variable, the correlation among variables, and the interpretability of the variables in business terms. The most important and most direct of these is the predictive power of the variables. "Predictive power" is a general, subjective and non-quantitative notion, so specific quantitative indicators are needed to measure the predictive power of each independent variable (index), and which variables (indexes) enter the model is decided according to the magnitude of these indicators. In some complex prediction models, however, many key indexes are involved. To make better use of these indexes to improve the predictive power of the model, this embodiment adopts a modular approach: each index is put into one or more modules, and the module that contributes most to the training of the prediction model is then selected by evaluating the influence of each module on model prediction. This yields the best index combination and helps improve the effect of model training.
The preset classification mode may be set according to actual service requirements, for example, index classification according to services, classification according to index attributes, and the like, and is not limited herein.
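A minimal sketch of such a module classification is given below. The module names, the rule table `module_rules` and the function `classify_into_modules` are hypothetical; the application only requires that each stable index be placed into at least one preset module.

```python
def classify_into_modules(stable_indexes, module_rules):
    """Assign each stable index to every module whose rule lists it.

    `module_rules` maps a module name to the set of index names that the
    (hypothetical) preset classification mode places in that module.
    """
    modules = {name: [] for name in module_rules}
    for index in sorted(stable_indexes):
        for module_name, members in module_rules.items():
            if index in members:
                modules[module_name].append(index)
    return modules


module_rules = {
    "demographics": {"age", "region"},
    "behaviour": {"claim_count", "age"},
}
modules = classify_into_modules({"age", "claim_count"}, module_rules)
```

Here `age` appears in both modules, which matches the statement that an index may fall into more than one module, while `region` is absent because it is not in the stable index set.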
S203: Input the data in each training data table into a prediction model for model training, and determine the AUC corresponding to each training data table according to the training result.
Specifically, after the training data table corresponding to the module is obtained, for each module, the data in the training data table corresponding to the module is input into the prediction model, the data in the training data table is trained through the prediction model to obtain a training result, and the AUC of the module is calculated according to the training result, so that the influence degree of the index combination contained in the module on the model training is determined.
Here AUC (Area Under the Curve) is defined as the area under the ROC curve. The AUC value is often used as the evaluation standard of a model because in many cases the ROC curve alone cannot clearly show which classifier performs better, whereas the AUC is a single number: the classifier with the larger AUC performs better. The AUC is thus a performance index for measuring the quality of a learner. By definition, the AUC can be obtained by summing the areas of the sections under the ROC curve.
The ROC curve, in full the receiver operating characteristic curve, is drawn from a series of different binary classification cut-offs (boundary values or decision thresholds), with the true positive rate (sensitivity) as the ordinate and the false positive rate (1 - specificity) as the abscissa.
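The summation of areas under the ROC curve can be illustrated with a short sketch. It is not the implementation used in the application; the function name `roc_auc` and the sample values are illustrative, and the trapezoidal rule is one common way of summing the sections.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, summed section by section (trapezoidal rule).

    `labels` holds the 0/1 ground truth and `scores` the predicted scores.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Sort by descending score: lowering the threshold admits more samples.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    auc = 0.0
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(ranked):
        # Treat all samples sharing one score as a single threshold step.
        score = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / positives, fp / negatives
        # Area of the section between the previous and current ROC points.
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
    return auc


example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Each pass of the outer loop corresponds to one decision threshold, that is, one point on the ROC curve with the false positive rate as abscissa and the true positive rate as ordinate.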
The specific process of determining the AUC corresponding to each training data table according to the obtained training result may refer to the description of the subsequent embodiments, and is not repeated here to avoid repetition.
S204: Select the modules corresponding to the training data tables in descending order of AUC.
Specifically, the AUC of a model that performs no worse than random guessing lies between 0.5 and 1, and the larger the AUC, the greater the influence of the module on the prediction model, that is, the more important the module. Therefore, in descending order of AUC, the modules corresponding to the first preset number of training data tables are selected as the finally determined modules.
The preset number may be set according to an actual requirement, for example, if three candidate index combinations are actually required to be obtained, the preset number is 3.
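The selection in step S204 amounts to a sort followed by a cut-off, as the sketch below shows. The module names and AUC values are invented for illustration, and `select_top_modules` is a hypothetical helper.

```python
def select_top_modules(auc_by_module, preset_number):
    """Return the names of the `preset_number` modules with the largest AUC."""
    ranked = sorted(auc_by_module.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:preset_number]]


# Illustrative AUC values only; real values come from step S203.
auc_by_module = {
    "demographics": 0.71,
    "behaviour": 0.83,
    "finance": 0.78,
    "channel": 0.64,
}
top_modules = select_top_modules(auc_by_module, 3)
```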
In this embodiment, sample data of each period is acquired, the stability index corresponding to each index in the sample data is calculated, each index whose stability index is smaller than a preset stability threshold is taken as a stable index, and the stable indexes are classified into modules according to a preset classification mode. This controls the quality of the indexes that enter a combination and prevents a large number of low-quality indexes from affecting the screening of combined indexes, which benefits combined index selection. Meanwhile, a training data table corresponding to each module is generated according to the classification result, the data in each training data table is input into a prediction model for model training, the AUC corresponding to each training data table is determined according to the training result, and the modules corresponding to the training data tables are selected in descending order of AUC. The influence of each index combination on the model is thereby quantified, which improves the efficiency of selecting combined indexes.
In some optional implementations of this embodiment, in step S201, calculating the stability index corresponding to each index in the sample data, and taking each index whose stability index is smaller than a preset stability threshold as a stable index, includes:
acquiring first data and second data of sample data of any two continuous periods, and calculating a stability index PSI of the second data relative to the first data;
and acquiring the index of which the stability index PSI is smaller than a preset stability threshold value from the second data as a stable index, and adding the stable index into the stable index set.
Specifically, the sample data of any two consecutive periods is obtained and recorded as first data and second data respectively, where the first data is the sample data of the earlier period and the second data is the sample data of the later period. The fluctuation of the second data relative to the first data is determined by calculating the stability index PSI of the second data relative to the first data. Each index is then screened, and every index whose stability index PSI is smaller than the preset stability threshold is taken as a stable index and added to the stable index set.
In this embodiment, the second data is used as the predicted value and the first data as the actual value in order to determine the fluctuation of each index.
In this embodiment, stable indexes are screened out by calculating stability, which improves the quality of the indexes used for combination, avoids redundant data processing caused by low-quality indexes, reduces the number of low-quality index combinations, and improves the efficiency of screening combined indexes.
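The threshold screening described above can be sketched as follows, assuming the PSI of each index between two consecutive periods has already been computed. The index names and PSI values are made up for illustration, and `select_stable_indexes` is a hypothetical helper; only the threshold of 0.25 is taken from this embodiment.

```python
STABILITY_THRESHOLD = 0.25  # preferred value stated in this embodiment


def select_stable_indexes(psi_by_index, threshold=STABILITY_THRESHOLD):
    """Return the set of index names whose PSI is below `threshold`."""
    return {name for name, psi in psi_by_index.items() if psi < threshold}


# Illustrative PSI values of the second data relative to the first data.
psi_by_index = {"age": 0.03, "income": 0.31, "claim_count": 0.12}
stable_set = select_stable_indexes(psi_by_index)
```

The comparison is strict, so an index whose PSI equals the threshold is not treated as stable, in line with the wording "smaller than a preset stability threshold".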
In some optional implementations of this embodiment, calculating the stability indicator PSI of the second data relative to the first data includes:
performing binning processing on the first data and the second data respectively to obtain binned first data and binned second data, taking each bin of the binned first data as a reference bin, and taking each bin of the binned second data as an incremental bin;
calculating, for the binned first data, the proportion of the sample data corresponding to the index that falls in each reference bin as a first proportion, and calculating, for the binned second data, the proportion of the sample data corresponding to the index that falls in each incremental bin as a second proportion;
and for each second proportion, calculating the difference between the second proportion and its corresponding first proportion, and taking the absolute value of the difference as the stability index PSI of the second data corresponding to the second proportion.
Specifically, because a large amount of data participates in model training, this embodiment bins the first data and the second data so that the stability index can be calculated quickly. Each bin of the binned first data is used as a reference bin, and each bin of the binned second data is used as an incremental bin. The proportion of the first data falling in each reference bin is calculated as the first proportion, and the proportion of the second data falling in each incremental bin is calculated as the second proportion. The proportions of each reference bin and its corresponding incremental bin are then compared to obtain the fluctuation value of that bin, and stability is evaluated according to the fluctuation value.
In this embodiment, binning makes it possible to calculate the stability index PSI of the second data relative to the first data quickly, which improves the efficiency of determining the stability index PSI.
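The binning steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of this application: the equal-width binning, the shared bin edges spanning both periods, the default of 10 bins, and the function names `equal_width_edges`, `bin_proportions`, and `per_bin_psi` are all hypothetical choices, since the text does not specify how bins are formed.

```python
from bisect import bisect_right


def equal_width_edges(lo, hi, n_bins):
    """Return n_bins + 1 equally spaced bin edges from lo to hi (assumed scheme)."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_proportions(values, edges):
    """Return the share of `values` falling into each bin defined by `edges`."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right, so clamp the index.
        idx = min(bisect_right(edges, v) - 1, n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def per_bin_psi(first_data, second_data, n_bins=10):
    """Absolute difference between second and first proportion, per bin.

    first_data  -> reference bins   (first proportion)
    second_data -> incremental bins (second proportion)
    """
    lo = min(min(first_data), min(second_data))
    hi = max(max(first_data), max(second_data))
    edges = equal_width_edges(lo, hi, n_bins)
    first_props = bin_proportions(first_data, edges)
    second_props = bin_proportions(second_data, edges)
    return [abs(s - f) for f, s in zip(first_props, second_props)]


# Two periods of one index, split into two bins.
shifts = per_bin_psi([0, 0, 1, 1], [0, 1, 1, 1], n_bins=2)
```

The sketch follows the absolute-difference definition given in the text. The population stability index as commonly used elsewhere instead sums (second proportion − first proportion) × ln(second proportion / first proportion) over the bins; that variant is not what the steps above describe. How the per-bin values are compared with the preset stability threshold is not stated in this passage, so the sketch returns them unaggregated.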
In some optional implementation manners of this embodiment, in step S202, according to the classification result, generating a training data table corresponding to each module includes:
associating the Hive table with the data table corresponding to the sample data of each period, wherein the Hive table is a data table contained in the local database;
and extracting the model-input feature field data from the data table corresponding to the sample data of each period by means of an association query, extracting the label data from the Hive table, and generating a training data table based on the model-input feature field data and the label data.
Specifically, the Hive table is a data table in the local database. The Hive table and the data table corresponding to the sample data of each period are joined through a primary key in an association query to obtain the data table corresponding to the sample data of each period. The model-input feature field data is extracted from that data table, the label data is extracted from the Hive table, and the training data table is generated based on the model-input feature field data and the label data.
The label data refers to the classification label field corresponding to the classes predicted by the prediction model. The model-input feature field data refers to the fields corresponding to the indexes in the stable index set.
In this embodiment, data is acquired quickly by means of an association query and stored in the training data table, so that data can be extracted quickly during subsequent model training. This increases training speed and shortens the time needed to calculate the AUC corresponding to each module, thereby improving the efficiency of selecting combined features (modules).
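The primary-key join described above might look roughly like the following. This is only a sketch: Python's built-in `sqlite3` stands in for Hive so that the example is self-contained, and the table names `label_table` and `period_samples`, the column names `id`, `age`, `income`, `noise`, and `label`, and the function `build_training_table` are all hypothetical.

```python
import sqlite3


def build_training_table(conn, feature_fields):
    """Join the label table with one period's sample table on the primary key.

    feature_fields: names of the feature columns to keep for one module.
    They are assumed to come from trusted configuration, not user input.
    """
    cols = ", ".join(f"s.{name}" for name in feature_fields)
    query = (
        f"SELECT s.id, {cols}, h.label "
        "FROM period_samples AS s "
        "JOIN label_table AS h ON s.id = h.id "
        "ORDER BY s.id"
    )
    return conn.execute(query).fetchall()


# In-memory stand-in for the local database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE label_table (id INTEGER PRIMARY KEY, label INTEGER);
    CREATE TABLE period_samples (
        id INTEGER PRIMARY KEY, age REAL, income REAL, noise REAL
    );
    INSERT INTO label_table VALUES (1, 0), (2, 1), (3, 0);
    INSERT INTO period_samples VALUES
        (1, 30, 5.0, 0.1), (2, 45, 7.5, 0.9), (4, 52, 6.1, 0.4);
    """
)

# Keep only the fields of one module; rows without a label match are dropped.
rows = build_training_table(conn, ["age", "income"])
```

An inner join is used here as an assumption, so samples without label data (id 4) and labels without samples (id 3) do not reach the training data table; the text does not say which join type is intended.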
In some optional implementation manners of this embodiment, in step S203, inputting the data in each training data table into the prediction model for model training includes:
dynamically generating a configuration file based on the attribute data of the training data table;
starting a preset evaluation script;
and sequentially reading data of the training data table corresponding to each module by adopting a preset evaluation script according to the configuration file, and inputting the read data into the prediction model for model training.
Specifically, a configuration file is dynamically generated according to attribute data of the training data table, a preset evaluation script is started, configuration contents in the configuration file are read through the preset evaluation script, data reading of the training data table corresponding to each module is sequentially carried out according to the configuration contents, and the read data are input into the prediction model for model training.
Wherein the configuration file comprises: training data table path and file name, preset period, index data path, preset model storage path, label data of training model and the like.
Wherein, the attribute data of the training data table includes but is not limited to: training fields, field names, file names of the data tables, paths and label data of the data tables, and the like.
The evaluation script refers to a script file for starting a training task.
In this embodiment, dynamically generating the configuration file enables fast data reading and writing and improves the efficiency of feeding data into the model.
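As an illustration of how a configuration file could be generated from the attribute data of a training data table and then read back by an evaluation script, consider the sketch below. The JSON format, every key name, the example attribute values, and the functions `generate_config` and `load_config` are assumptions made for this example; the text lists what the configuration file contains but not its format.

```python
import json
import os
import tempfile


def generate_config(table_attrs, config_path):
    """Write a configuration file derived from one training table's attributes."""
    config = {
        "train_table_path": table_attrs["path"],
        "train_table_file": table_attrs["file_name"],
        "period": table_attrs["period"],
        "feature_fields": table_attrs["fields"],
        "label_field": table_attrs["label"],
        "model_save_path": table_attrs["model_save_path"],
    }
    with open(config_path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
    return config


def load_config(config_path):
    """Read the configuration back, as an evaluation script would at start-up."""
    with open(config_path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical attribute data for the training table of one module.
attrs = {
    "path": "/data/train",
    "file_name": "module_a.csv",
    "period": "2020-10",
    "fields": ["age", "income"],
    "label": "label",
    "model_save_path": "/data/models/module_a",
}

config_path = os.path.join(tempfile.mkdtemp(), "module_a.json")
written = generate_config(attrs, config_path)
loaded = load_config(config_path)
```

Generating one such file per module would let the same evaluation script iterate over the modules in sequence without code changes, which appears to be the intent of the steps above.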
In some optional implementation manners of this embodiment, in step S203, the prediction model is a LightGBM decision tree model, and determining, according to the obtained training result, an AUC corresponding to each training data table includes:
obtaining a predicted value in a training result of the LightGBM decision tree model, and performing prediction scoring on a module corresponding to a training data table according to the predicted value to obtain a prediction score of the module corresponding to the training data table;
and determining the AUC of the corresponding module of the training data table according to the prediction score and the label data in the training data table.
Specifically, the training result of the LightGBM decision tree model is a predicted value, the prediction value is used to predict and score the module corresponding to the training data table, so as to obtain the prediction score of the module corresponding to the training data table, and the prediction score is combined with the label data in the training data table to calculate the AUC of the module corresponding to the training data table.
In this embodiment, the LightGBM model is selected as the prediction model mainly because it can improve the training efficiency of the prediction model, which in turn improves the efficiency of determining the AUC corresponding to each module.
Specifically, the predicted value may be used directly as the prediction score, or it may be normalized to a preset interval by a normalization method, or it may be scaled proportionally by multiplying it by a set coefficient, and so on, which is not limited herein.
In the embodiment, the AUC corresponding to each module is quickly calculated through the predicted value in the training result, and the influence of each module on the model is evaluated by adopting the AUC, so that the modules can be quickly screened.
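The step from prediction scores and label data to an AUC, followed by ordering the modules from the largest AUC to the smallest, can be sketched as follows. The sketch does not call LightGBM: the score lists below are made-up stand-ins for the model's predicted values, and the names `auc`, `rank_modules`, `module_a`, and `module_b` are hypothetical. The AUC is computed as the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def auc(labels, scores):
    """Area under the ROC curve from binary labels (0/1) and prediction scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count correctly ordered pairs; a tie contributes one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_modules(module_results):
    """Return (module name, AUC) pairs sorted from the largest AUC to the smallest.

    module_results maps a module name to a (labels, scores) pair.
    """
    aucs = {
        name: auc(labels, scores)
        for name, (labels, scores) in module_results.items()
    }
    return sorted(aucs.items(), key=lambda item: item[1], reverse=True)


# Made-up label data and prediction scores for two modules.
results = {
    "module_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "module_b": ([0, 0, 1, 1], [0.2, 0.3, 0.6, 0.9]),
}
ranked = rank_modules(results)
```

The pairwise count is quadratic in the number of samples, which is acceptable for illustration; for large training data tables a rank-based computation or a library routine would be the more practical choice.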
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 3 is a schematic block diagram of a selection apparatus for a combined index, which corresponds to the selection method for a combined index according to the foregoing embodiment. As shown in fig. 3, the selection device for the combined index includes an acquisition module 31, a classification module 32, a training module 33, and a selection module 34. The functional modules are explained in detail as follows:
the acquisition module 31 is configured to obtain sample data of each period, calculate a stability index corresponding to each index in the sample data, and take each index whose stability index is smaller than a preset stability threshold as a stable index;
the classification module 32 is configured to perform module classification on the stable indexes according to a preset classification mode and generate a training data table corresponding to each module according to the classification result;
the training module 33 is configured to input the data in each training data table into the prediction model for model training, and determine an AUC corresponding to each training data table according to an obtained training result;
and the selection module 34 is used for selecting the modules corresponding to the training data table according to the sequence of AUC from large to small.
Optionally, the acquisition module 31 includes:
the stability index calculation unit is used for acquiring first data and second data from the sample data of any two consecutive periods and calculating a stability index PSI of the second data relative to the first data;
and the stable index set determining unit is used for acquiring, from the second data, each index whose stability index PSI is smaller than the preset stability threshold, taking that index as a stable index, and adding the stable index to the stable index set.
Optionally, the stability index calculation unit includes:
the binning subunit is used for performing binning processing on the first data and the second data respectively to obtain binned first data and binned second data, taking each bin of the binned first data as a reference bin, and taking each bin of the binned second data as an incremental bin;
the proportion calculation subunit is used for calculating, for each reference bin, the proportion of the binned first data made up by the sample data corresponding to the index in that reference bin, as a first proportion, and calculating, for each incremental bin, the proportion of the binned second data made up by the sample data corresponding to the index in that incremental bin, as a second proportion;
and the stability index determining subunit is used for calculating, for each second proportion, the difference between the second proportion and its corresponding first proportion, and taking the absolute value of the difference as the stability index PSI of the second data corresponding to that second proportion.
Optionally, the classification module 32 comprises:
the association unit is used for associating the hive table with a data table corresponding to the sample data of each period, wherein the hive table is a data table contained in the local database;
and the query unit is used for extracting the model-input feature field data from the data table corresponding to the sample data of each period by means of an association query, extracting the label data from the Hive table, and generating a training data table based on the model-input feature field data and the label data.
Optionally, the training module 33 comprises:
the configuration file generating unit is used for dynamically generating a configuration file based on the attribute data of the training data table;
the script starting unit is used for starting a preset evaluation script;
and the data acquisition and transmission unit is used for sequentially reading the data of the training data table corresponding to each module by adopting a preset evaluation script according to the configuration file and inputting the read data into the prediction model for model training.
Optionally, the prediction model is a LightGBM decision tree model, and the training module 33 further includes:
the prediction score determining unit is used for obtaining a prediction value in a training result of the LightGBM decision tree model, and performing prediction scoring on a module corresponding to the training data table according to the prediction value to obtain a prediction score of the module corresponding to the training data table;
and the AUC calculating unit is used for determining the AUC of the module corresponding to the training data table according to the prediction score and the label data in the training data table.
For specific limitations of the selection device for the combination index, reference may be made to the above limitations of the selection method for the combination index, and details thereof are not described here. All or part of each module in the selection device of the combination index can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the memory 41, the processor 42, and the network interface 43 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as program codes for controlling electronic files. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, such as program code for executing control of an electronic file.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing a computer program, where the computer program is executable by at least one processor to cause the at least one processor to execute the steps of the method for selecting a combined index as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application and do not limit its scope. This application can be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their features replaced with equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are within the protection scope of the present application.

Claims (10)

1. A selection method of a combined index is characterized by comprising the following steps:
acquiring sample data of each period, calculating a stability index corresponding to each index in the sample data, and taking each index whose stability index is smaller than a preset stability threshold as a stable index;
module classification is carried out on the stable indexes according to a preset classification mode, and a training data table corresponding to each module is generated according to a classification result;
respectively inputting the data in each training data table into a prediction model for model training, and determining the AUC corresponding to each training data table according to the obtained training result;
and selecting the modules corresponding to the training data table according to the sequence of the AUC from large to small.
2. The method for selecting the combined index according to claim 1, wherein the calculating a stability index corresponding to each index in the sample data, and taking each index whose stability index is smaller than a preset stability threshold as a stable index includes:
acquiring first data and second data of sample data of any two continuous periods, and calculating a stability index PSI of the second data relative to the first data;
and acquiring the index of which the stability index PSI is smaller than a preset stability threshold value from the second data, taking the index as a stable index, and adding the stable index into a stable index set.
3. The method for selecting the combined index according to claim 2, wherein the calculating the stability index PSI of the second data relative to the first data includes:
performing binning processing on the first data and the second data respectively to obtain binned first data and binned second data, taking each bin of the binned first data as a reference bin, and taking each bin of the binned second data as an incremental bin;
calculating, for each reference bin, the proportion of the binned first data made up by the sample data corresponding to the index in that reference bin, as a first proportion, and calculating, for each incremental bin, the proportion of the binned second data made up by the sample data corresponding to the index in that incremental bin, as a second proportion;
and, for each second proportion, calculating the difference between the second proportion and its corresponding first proportion, and taking the absolute value of the difference as the stability index PSI of the second data corresponding to that second proportion.
4. The method for selecting the combination index according to claim 1, wherein the generating a training data table corresponding to each module according to the classification result includes:
associating a hive table with a data table corresponding to the sample data of each period, wherein the hive table is a data table contained in a local database;
and extracting the model-input feature field data from the data table corresponding to the sample data of each period by means of an association query, extracting the label data from the hive table, and generating the training data table based on the model-input feature field data and the label data.
5. The method for selecting the combined index according to claim 1, wherein the separately inputting the data in each of the training data tables into a prediction model for model training comprises:
dynamically generating a configuration file based on the attribute data of the training data table;
starting a preset evaluation script;
and sequentially reading the data of the training data table corresponding to each module by adopting the preset evaluation script according to the configuration file, and inputting the read data into a prediction model for model training.
6. The method for selecting the combined index according to claim 4, wherein the prediction model is a LightGBM decision tree model, and the determining the AUC corresponding to each training data table according to the obtained training result includes:
obtaining a predicted value in a training result of the LightGBM decision tree model, and performing prediction scoring on a module corresponding to the training data table according to the predicted value to obtain a prediction score of the module corresponding to the training data table;
and determining the AUC of the corresponding module of the training data table according to the prediction score and the label data in the training data table.
7. A screening apparatus for a combination of indicators, comprising:
the acquisition module is used for acquiring sample data of each period, calculating a stability index corresponding to each index in the sample data, and taking each index whose stability index is smaller than a preset stability threshold as a stable index;
the classification module is used for performing module classification on the stable indexes according to a preset classification mode and generating a training data table corresponding to each module according to a classification result;
the training module is used for inputting the data in each training data table into a prediction model for model training and determining the AUC corresponding to each training data table according to the obtained training result;
and the selection module is used for selecting the modules corresponding to the training data table according to the sequence of the AUC from large to small.
8. The selection apparatus for selecting a combined index according to claim 7, wherein the obtaining module includes:
the stability index calculation unit is used for acquiring first data and second data of sample data of any two continuous periods and calculating a stability index PSI of the second data relative to the first data;
and the stable index set determining unit is used for acquiring, from the second data, each index whose stability index PSI is smaller than a preset stability threshold, taking that index as a stable index, and adding the stable index to the stable index set.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method for selecting a combination index according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements a method for selecting a combination index according to any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011232705.8A CN112329943A (en) 2020-11-06 2020-11-06 Combined index selection method and device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN112329943A true CN112329943A (en) 2021-02-05

Family

ID=74316892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011232705.8A Pending CN112329943A (en) 2020-11-06 2020-11-06 Combined index selection method and device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN112329943A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883073A (en) * 2021-03-22 2021-06-01 北京同邦卓益科技有限公司 Data screening method, device, equipment, readable storage medium and product
CN112883073B (en) * 2021-03-22 2024-04-05 北京同邦卓益科技有限公司 Data screening method, device, equipment, readable storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination