CN116542511A - Wind control model creation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116542511A
Authority
CN
China
Prior art keywords
wind control
data
model
control model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210117865.0A
Other languages
Chinese (zh)
Inventor
冯宏轩
鲁溪
陈�光
赵子渌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bairong Yunchuang Technology Co ltd
Original Assignee
Bairong Yunchuang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bairong Yunchuang Technology Co ltd filed Critical Bairong Yunchuang Technology Co ltd
Priority to CN202210117865.0A priority Critical patent/CN116542511A/en
Publication of CN116542511A publication Critical patent/CN116542511A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • Stored Programmes (AREA)

Abstract

The application discloses a wind control model creation method, comprising the following steps: acquiring raw data associated with a business subject to wind control management, the raw data including a plurality of feature values corresponding to a plurality of features, together with their wind control tags; performing data processing on the raw data to generate sample data; and, with one or more wind control model algorithm types and/or model hyperparameter search settings preset, training on the sample data against a predetermined first model evaluation index to generate a wind control model for wind control management of the business, wherein the wind control model is of an optimal algorithm type determined based on the first model evaluation index and/or has optimal hyperparameter values determined based on the first model evaluation index.

Description

Wind control model creation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer-based financial technology (Fintech), and in particular to a wind control model creation method and apparatus, and a related electronic device and storage medium.
Background
With the development and maturation of big data and machine learning technology, and with improvements in computing power and algorithms, intelligent wind control has gradually replaced traditional wind control. Machine learning, with its higher accuracy and feature-recognition capability, has gradually replaced traditional data analysis methods and has become the mainstream mode of wind control management and data mining for internet financial institutions. However, the quality of a machine learning model depends largely on the quality of the data, the choice of features, and the parameters of the model itself, so applying machine learning to data mining requires the user both to understand the data and to have deep knowledge of the model, which invisibly raises the threshold and cost of applying machine learning to data analysis. Moreover, if customized machine learning models must be built for different businesses, different scenarios, and different customer groups, the modeling difficulty increases further.
The description of the background art is only for the purpose of facilitating an understanding of the relevant art and is not to be taken as an admission of prior art.
Disclosure of Invention
Accordingly, embodiments of the invention aim to provide a wind control model creation method and apparatus, an electronic device, and a storage medium that can automatically build and optimize a model, significantly improving modeling efficiency and reducing modeling complexity.
In a first aspect, a wind control model creation method is provided, including: acquiring raw data associated with a business subject to wind control management, the raw data including a plurality of feature values corresponding to a plurality of features, together with their wind control tags; performing data processing on the raw data to generate sample data; and, with one or more wind control model algorithm types and/or model hyperparameter search settings preset, training on the sample data against a predetermined first model evaluation index to generate a wind control model for wind control management of the business, wherein the wind control model is of an optimal algorithm type determined based on the first model evaluation index and/or has optimal hyperparameter values determined based on the first model evaluation index.
In a second aspect, a wind control model creation apparatus is provided, including: an acquisition module configured to acquire raw data associated with a business subject to wind control management, the raw data including a plurality of feature values corresponding to a plurality of features, together with their wind control tags; a generation module configured to perform data processing on the raw data to generate sample data; and a modeling module configured to, with one or more wind control model algorithm types and/or model hyperparameter search settings preset, train on the sample data against a predetermined first model evaluation index to generate a wind control model for wind control management of the business, wherein the wind control model is of an optimal algorithm type determined based on the first model evaluation index and/or has optimal hyperparameter values determined based on the first model evaluation index.
In a third aspect, an electronic device is provided, including a processor and a memory storing a computer program, the processor being configured to perform the method of any of the embodiments when the computer program is run.
In a fourth aspect, a storage medium is provided, storing a computer program configured to, when executed, perform the method of any of the embodiments.
Embodiments of the invention provide an improved processing scheme: acquiring raw data associated with a business subject to wind control management, the raw data including a plurality of feature values corresponding to a plurality of features, together with their wind control tags; performing data processing on the raw data to generate sample data; and, with one or more wind control model algorithm types and/or model hyperparameter search settings preset, training on the sample data against a predetermined first model evaluation index to generate a wind control model for wind control management of the business, wherein the wind control model is of an optimal algorithm type and/or has optimal hyperparameter values, each determined based on the first model evaluation index. Thus, whereas the traditional application of machine learning to data mining requires the user both to understand the data and to understand the model deeply, the scheme of the embodiments can automatically generate and optimize the model, thereby reducing modeling difficulty, improving modeling efficiency, improving model interpretability, and giving non-expert users the ability to model.
Optional features and other effects of embodiments of the invention are described in part below, and in part will be apparent from reading the disclosure herein.
Drawings
Embodiments of the present invention will be described in detail with reference to the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements, and wherein:
FIG. 1 illustrates an exemplary schematic diagram of a wind control model creation environment in accordance with an embodiment of the present invention;
FIG. 2 illustrates an exemplary flow chart of a method of creating a wind control model according to an embodiment of the invention;
FIG. 3 shows an exemplary schematic diagram of a data processing process according to an embodiment of the invention;
FIG. 4 shows an exemplary schematic diagram of a data experiment process according to an embodiment of the invention;
FIG. 5 illustrates an exemplary schematic diagram of an implementation of a machine learning tool in accordance with an embodiment of the present invention;
FIG. 6 illustrates an exemplary schematic diagram of a wind control model creation framework in accordance with an embodiment of the present invention;
FIG. 7 illustrates an exemplary schematic diagram of a wind control model creation system in accordance with an embodiment of the present invention;
FIG. 8 shows an exemplary schematic of a conventional scoring card modeling process;
FIG. 9 illustrates an exemplary flow chart of a scoring card modeling process according to an embodiment of the invention;
FIG. 10 shows a schematic structural diagram of a wind control model creation apparatus according to an embodiment of the present invention; and
FIG. 11 shows an exemplary structural diagram of an electronic device capable of implementing a method according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
The term "comprising" and its variations as used herein are open-ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first", "second", and the like may refer to different objects or to the same object. Other explicit and implicit definitions may also appear below.
The embodiment of the invention provides a method and a device for creating a wind control model, and related electronic equipment and a storage medium. The method of creating a wind control model may be implemented by means of one or more computers, such as terminals. In some embodiments, the wind control model creation means may be implemented in software, hardware or a combination of software and hardware.
As described above, since the quality of a machine learning model depends largely on the quality of the data, the selection of features, and the parameters of the model itself, applying machine learning to data mining requires the user both to understand the data and to have deep knowledge of the model. This leads to a number of problems, such as a high modeling technology threshold, low modeling efficiency, low model interpretability, and difficulty for non-expert users to model.
In particular, regarding the high modeling technology threshold: modeling personnel are required to have mathematical statistics, machine learning, and computer programming skills, as well as practical modeling experience. Wind control modeling work is usually something only financial-technology consulting firms and large banks can devote sufficient development resources to; most small and medium-sized banks have no dedicated data analysis and modeling teams.
Regarding low modeling efficiency: the development and processing flow of a wind control model is usually exceptionally complex, with a long development cycle, heavy manpower investment, and a huge amount of engineering. Manual modeling requires a great deal of time for data preprocessing, model selection, variable selection, parameter tuning, model evaluation, and other steps, whereas business applications generally expect rapid development, iteration, and optimization, with efficient support for business requirements.
Regarding low model interpretability: a financial institution's wind control models are often required to have some interpretability, so most wind control models are scoring card models based on logistic regression. The modeling process of a scoring card model is more complex than that of a general machine learning model, which further increases the modeling difficulty and process complexity. Machine learning models with better performance than the scoring card model are generally "black box" models whose internal mechanisms and operating principles are difficult to understand.
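The interpretability of a logistic regression scoring card comes from mapping the model's log-odds onto a point scale. As a minimal sketch, the industry-standard "points to double the odds" (PDO) scaling is shown below; the base score, base odds, and PDO values are illustrative assumptions, not figures from this application.

```python
import math

def scorecard_points(log_odds, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map logistic-regression log-odds (of being 'good') onto scorecard points.

    base_score points correspond to good:bad odds of base_odds to 1,
    and every additional pdo points double those odds.
    """
    factor = pdo / math.log(2)                       # points per unit of log-odds
    offset = base_score - factor * math.log(base_odds)
    return offset + factor * log_odds

p_base = scorecard_points(math.log(50))     # odds 50:1  -> the base score
p_double = scorecard_points(math.log(100))  # odds 100:1 -> base score + PDO
```

Because each feature's contribution to the log-odds is a simple weighted sum, each feature's point contribution can be reported separately, which is exactly what makes the scoring card easier to explain than a "black box" model.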
Regarding the difficulty for non-expert users to model: business departments, technical departments, management departments, and others all need to conduct data analysis and data modeling for different businesses, scenarios, and customer groups, rapidly develop customized feature models, and design personalized internet products. However, personnel in many departments often lack the specialized modeling capabilities and experience.
To at least partially address one or more of the above-mentioned problems, as well as other potential problems, example embodiments of the present disclosure propose a solution for wind control model creation. In this solution, sample data associated with a business subject to wind control management is acquired; a model for wind control management of the business is generated based on the sample data; and the model is deployed so that it can perform distributed computation for wind control management of the business.
Thus, whereas the traditional application of machine learning to data mining requires the user both to understand the data and to understand the model deeply, the scheme of the embodiments can automatically generate and optimize the model, thereby reducing modeling difficulty, improving modeling efficiency, improving model interpretability, and giving non-expert users the ability to model.
In particular, regarding reduced modeling difficulty: the automated modeling platform can effectively help small and medium-sized banks develop wind control models for various products and processes, accelerate the construction of a big-data wind control system, and foster independent, autonomous digital wind control management capability.
Regarding improved modeling efficiency: a good automated modeling platform can greatly shorten a model development process that would otherwise take weeks or even months. Subsequent model monitoring and tuning can greatly reduce the pressure on, and time required of, modeling staff, who can then devote the time saved to studying wind control strategies and model algorithm updates, and put more effort into research work.
Regarding improved model interpretability: automated modeling platforms can typically generate interpretable logistic regression scoring card models quickly. Some automated machine learning platforms can also explore and analyze the interpretability of machine learning models, and can further extract model results to generate interpretable combinations of wind control rules, expanding the choice of model algorithms.
Regarding giving non-expert users modeling capability: after the automated modeling platform is introduced, business departments, technical departments, management departments, and others can perform data analysis and data modeling for different businesses, scenarios, and customer groups based on the platform, rapidly develop customized feature models, and design personalized internet products. To a certain extent, this improves each department's data analysis and data modeling capability and accelerates digital transformation and development.
FIG. 1 illustrates a schematic diagram of a wind control model creation environment 100, according to an embodiment of the invention. It should be understood that the structure and function of the wind control model creation environment 100 as shown in FIG. 1 is for illustrative purposes only and does not imply any limitation on the scope of the present disclosure. Embodiments of the present disclosure may be embodied in different structures and/or functions.
As shown in FIG. 1, the wind control model creation environment 100 may include a computing device 110. The computing device 110 may be any suitable electronic device, such as, but not limited to, a mobile phone, tablet, notebook, desktop computer, server, mainframe, or wearable device. Sample data 120 associated with a business subject to wind control management may be input to the computing device 110.
The businesses requiring wind control management may include various types of business, such as auto finance, consumer finance, cash installment, credit cards, and credit card balance compensation. In addition, a business may be in any of multiple business phases, such as marketing, anti-fraud, anti-money-laundering, pre-loan application, in-loan behavior, and post-loan collection.
The sample data 120 may include features and feature values. For example, assuming the business subject to wind control management is a credit card business, the sample data 120 may include features such as "name", "telephone number", "age", "gender", "income", "occupation", and "occupation development expectation", with corresponding feature values such as "Zhang San", "88888888", "33", "male", "100,000", "engineer", and "stable". It should be understood that the features and feature values above are merely examples. In fact, the sample data 120 may include any suitable features, and a large number of feature values for those features, depending on the needs of the business.
The computing device 110 may generate, based on the sample data 120, a model 130 (hereinafter also referred to as a "wind control model") for wind control management of the business. In some embodiments, generation of the model 130 may use an automated machine learning tool for automated tuning, to intelligently generate an optimal model. For example, the automated machine learning tool may be any suitable tool, such as NNI (Neural Network Intelligence), *** Cloud AutoML, or EasyDL. Taking NNI as an example, NNI is a toolkit that can effectively help users design and tune the neural network architecture of machine learning models, the parameters of complex systems (such as hyperparameters), and so on.
Traditionally, in machine learning modeling, the most time- and labor-consuming work, apart from preparing the data, is trying various combinations of hyperparameters to find the best-performing model. Even experienced algorithm engineers and data scientists sometimes find it difficult to grasp the underlying rules, and can only find a good hyperparameter combination through repeated attempts; beginners need even more time and effort. With an automated machine learning tool, however, an optimal model can be generated automatically through intelligent parameter tuning, significantly improving modeling efficiency and saving human resources.
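The core idea behind such tools can be sketched with a plain random search over a hyperparameter space. The sketch below is standard-library Python only and does not use NNI's actual API; the parameter names, ranges, and the synthetic objective standing in for the "first model evaluation index" are all illustrative assumptions.

```python
import random

# Hypothetical search space for a boosted-tree-style wind control model;
# parameter names and ranges are illustrative assumptions.
SEARCH_SPACE = {
    "learning_rate": (0.01, 0.3),   # continuous
    "max_depth": (2, 8),            # integer
    "n_estimators": (50, 500),      # integer
}

def sample_params(rng):
    """Draw one hyperparameter combination from the search space."""
    return {
        "learning_rate": rng.uniform(*SEARCH_SPACE["learning_rate"]),
        "max_depth": rng.randint(*SEARCH_SPACE["max_depth"]),
        "n_estimators": rng.randint(*SEARCH_SPACE["n_estimators"]),
    }

def evaluate(params):
    """Stand-in for training a model and computing the first model
    evaluation index (e.g. KS or AUC on a validation split)."""
    return (1.0
            - abs(params["learning_rate"] - 0.1)
            - 0.01 * abs(params["max_depth"] - 4)
            - 0.0001 * abs(params["n_estimators"] - 200))

def random_search(n_trials=50, seed=0):
    """Keep the hyperparameter combination with the best evaluation index."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search()
```

Real AutoML tools replace the random sampler with smarter strategies (e.g. Bayesian optimization or evolutionary search) and run trials in parallel, but the select-evaluate-keep-the-best loop is the same.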
The computing device 110 may deploy the model 130 so that the model 130 can perform distributed computation for wind control management of the business. For example, a large amount of user data may be input to the model 130, and the model 130 may predict risk ratings for these users for use in wind control management of related services, such as a credit card service.
In some embodiments, the model 130 may be deployed using container technology. Container technology can deploy experimental tasks and execute multiple experimental tasks in parallel, achieving distributed task processing with high availability, reliability, and scalability. For example, any suitable container technology may be used, such as Docker or CoreOS. Using container technology overcomes the traditional drawbacks that computation cannot be distributed across different servers and depends on the performance of an individual server.
In this way, the scheme of the embodiments can automatically generate and optimize the model, thereby reducing modeling difficulty, improving modeling efficiency, improving model interpretability, and giving non-expert users the ability to model.
FIG. 2 shows an exemplary flow chart of a wind control model creation method according to an embodiment of the invention. The actions involved in method 200 are described below in connection with the wind control model creation environment 100 of FIG. 1. Method 200 may also include additional actions not shown and/or may omit actions that are shown; the scope of the present disclosure is not limited in this respect.
As shown in fig. 2, a method 200 of creating a wind control model according to an embodiment of the present invention may include steps 210 to 230. In an embodiment of the invention, the wind control model creation method 200 may be implemented by the computing device 110.
Step 210: raw data associated with a business to be managed by wind control is acquired. The raw data includes a plurality of feature values corresponding to a plurality of features and respective wind control tags.
In some embodiments, the computing device 110 may obtain raw data associated with a business subject to wind control management. The raw data may be acquired in any suitable manner. For example, the computing device 110 may obtain the raw data from a file uploaded by a user, or from a specified database, such as a user-selected database.
The raw data may include a plurality of feature values corresponding to a plurality of features, together with their wind control tags. For example, the features may be "name", "telephone number", "age", "gender", "income", "occupation", and the like, with feature values such as "Zhang San", "888888888", "33", "male", "100,000", and "engineer". The wind control tag may be, for example, a "risk rating" with a value of "good" or "bad". For example, a sample may be labeled good or bad based on whether the customer has any record of being overdue by more than 15 days within the first three repayment periods. The wind control tag is the feature that the user designates as the dependent variable, also referred to as the Y tag, while the other features serve as independent variables. In some embodiments, a wind control tag value may be defined for each record of the raw data.
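The overdue-based labeling rule described above can be sketched as follows; the exact thresholds (15 days, three periods) follow the example in the text, while the function name and input format are illustrative assumptions.

```python
def wind_control_label(overdue_days, periods=3, threshold_days=15):
    """Assign the Y tag: 'bad' if any of the first `periods` repayment
    periods was overdue by more than `threshold_days` days, else 'good'.
    An illustrative encoding of the labeling rule described in the text."""
    return "bad" if any(d > threshold_days for d in overdue_days[:periods]) else "good"

label_good = wind_control_label([0, 3, 0, 40])  # the 40-day overdue falls in period 4
label_bad = wind_control_label([0, 20, 0])      # 20 > 15 in period 2
```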
In some embodiments, the computing device 110 may also automatically identify the data type of each feature of the raw data, e.g., continuous or discrete. On this basis, the computing device 110 may generate an analysis report of the raw data, and may provide the report to the user for preview. Table 1 shows an example of an analysis report.
Table 1: analysis report
The process of acquiring raw data and generating an analysis report described above may also be referred to as a "data management" process.
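Automatic identification of a feature's data type, mentioned above, can be done with a simple heuristic: numeric columns with many distinct values are treated as continuous, everything else as discrete. The sketch below is an illustrative assumption; the cardinality cutoff is not a value from this document.

```python
def infer_feature_type(values, max_discrete_cardinality=10):
    """Heuristically classify a feature column as 'continuous' or 'discrete'.

    Numeric columns with many distinct values are treated as continuous;
    strings and low-cardinality numerics as discrete. The cardinality
    cutoff is an illustrative assumption.
    """
    non_null = [v for v in values if v is not None]
    if non_null and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                        for v in non_null):
        if len(set(non_null)) > max_discrete_cardinality:
            return "continuous"
    return "discrete"

t_income = infer_feature_type([1000.0 + 37.5 * i for i in range(50)])  # continuous
t_gender = infer_feature_type(["male", "female", None, "male"])        # discrete
```

The inferred types can then populate the per-feature rows of the analysis report (Table 1) and, later, determine which derivation operations are admissible for each feature.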
Step 220: perform data processing on the raw data to generate sample data.
In some embodiments, the computing device 110 may further process the raw data to generate sample data; this is also referred to as the "data processing" process. FIG. 3 illustrates an exemplary schematic diagram of the sub-processes/flows of a data processing process 300 according to an embodiment of the present invention. Those skilled in the art will appreciate that, unless clearly contradicted by the description of the embodiments, the sub-processes/flows may be performed in parallel, in combination, and/or sequentially, and in any feasible order other than the order shown in FIG. 3.
Step 310: data is selected.
In some embodiments, the computing device 110 may select data. In particular, the computing device 110 may select a first data set from the raw data based on a set of specified features. The user may set a threshold to exclude data associated with invalid features, and may select specified features so that only the data associated with those features is kept. For example, the user may select the "name" feature, whereby the data associated with the "name" feature is selected into the first data set.
Step 320: data sampling.
In some embodiments, the computing device 110 may sample the data. In particular, the computing device 110 may select a second data set from the first data set based on a specified sampling rule. For example, the user may select the Y tag and a sampling method and set a random seed to define the sampling rule. The sampling rule may employ any suitable sampling scheme, such as random sampling, stratified sampling, or oversampling.
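Of the schemes named above, stratified sampling by the Y tag is easy to sketch: sample within each good/bad stratum so the class mix is preserved, with a fixed seed for reproducibility. The row format and function name below are illustrative assumptions.

```python
import random

def stratified_sample(rows, y_tag, rate, seed=42):
    """Draw `rate` of the rows within each Y-tag stratum, so the good/bad
    mix of the sampled set matches the input set. A simplified sketch;
    real sampling rules may also over-sample the minority class."""
    rng = random.Random(seed)                 # fixed seed -> reproducible sample
    strata = {}
    for row in rows:
        strata.setdefault(row[y_tag], []).append(row)
    sampled = []
    for group in strata.values():
        k = max(1, round(len(group) * rate))  # per-stratum sample size
        sampled.extend(rng.sample(group, k))
    return sampled

rows = ([{"risk": "good"} for _ in range(90)]
        + [{"risk": "bad"} for _ in range(10)])
subset = stratified_sample(rows, "risk", rate=0.5)
```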
Step 330: feature derivation.
In some embodiments, the computing device 110 may perform feature derivation. In particular, the computing device 110 may generate derived features from the features of the second data set based on specified feature derivation rules, obtaining a third data set that includes the feature values of the derived features.
For example, the derived feature "occupation development expectation" may be derived from the features "occupation" and "age"; it may, to some extent, indicate repayment capability or risk.
In some embodiments, a plurality of features of the second data set may be selected for feature derivation; for example, the features "occupation" and "age". In addition, one or more pieces of derivation logic may be provided; for example, the user may select the features used for derivation and input information about the derived feature.
In addition, derivation code may be provided. The derivation code may be written in any suitable language, such as R, Python, or Java, together with their function libraries. For example, the computing device 110 may also maintain a function table for writing derivation code, describing the functions available to derivation methods, such as int(x) for converting a value to an integer and max(...) for taking a maximum.
In addition, the one or more pieces of derivation logic may be validated against preset derivation criteria, retaining only the derivation logic that meets the criteria. For example, the computing device 110 may verify the derivation code to ensure its correctness. Derived features and their feature values can then be produced from the features and feature values of the second data set using derivation logic that meets the criteria. For example, the derived feature "occupation development expectation" with the value "stable" can be derived from the features "occupation" and "age" with the values "bank staff" and "33".
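A piece of derivation logic like the "occupation development expectation" example can be sketched as a small function over the input features. The rule table below is a hypothetical illustration, not the application's actual derivation logic; only the input values ("bank staff", 33) and the output ("stable") follow the text's example.

```python
# Hypothetical derivation logic: map ("occupation", "age") to a coarse
# "occupation development expectation". The rule table is an illustrative
# assumption.
STABLE_OCCUPATIONS = {"bank staff", "engineer", "civil servant"}

def derive_career_expectation(occupation, age):
    """Derive one feature value from two input features of the second data set."""
    if occupation in STABLE_OCCUPATIONS and 25 <= age <= 55:
        return "stable"
    return "uncertain"

expectation = derive_career_expectation("bank staff", 33)  # the text's example
```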
In a preferred embodiment, the derivation criteria may include a first criterion on data types (a data-type constraint), a second criterion on feature coverage (a coverage constraint), and a third criterion on semantics.
In a preferred embodiment, screening based on the first, data-type criterion may include:
determining the data type of each feature used for derivation and the data type of the derived feature;
obtaining the possible derivation modes of the data type of each feature used for derivation and of the data type of the derived feature, e.g., from the possible derivation modes (combination, transformation, or calculation) of all data types;
judging whether the configured derivation logic conforms to a possible derivation mode for those data types; and
screening out the derivation logic that does not conform to any possible derivation mode.
The derivation logic that conforms to a possible derivation mode may then be screened further against the second, coverage criterion.
In a specific example of the present invention, the data type may include one or more of a numeric type, a category type, a time type, and a combination type.
In a specific example of the invention, the possible derivation modes (combining, transforming, or calculating) of all data types may include derivation modes for single features or multi-feature combinations. For example, for numeric features used for derivation, such as amounts or quantities, possible derivation modes may include the four arithmetic operations, statistics, and/or conversion to a category type. For category types, possible derivation modes may include ordering based on ordered variables, counting based on nominal variables, or conversion to dummy variables. For time types, possible derivation modes may be based on continuous or discrete values, such as deriving a duration from continuous values. For combined types, a crossing operation may be performed on certain types, such as numeric or time-type features.
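A sketch of this first, data-type screening criterion, assuming a simple mapping from data types to their admissible derivation modes (all type and mode names are illustrative):

```python
# Data-type screening sketch: each data type maps to the derivation modes
# it admits; a proposed derivation is kept only if its mode is allowed.
ALLOWED_MODES = {
    "numeric":  {"arithmetic", "statistics", "to_category"},
    "category": {"order", "count", "dummy"},
    "time":     {"duration", "discretize"},
    "combined": {"cross"},
}

def conforms(feature_type: str, derivation_mode: str) -> bool:
    # Unknown types admit no derivation modes.
    return derivation_mode in ALLOWED_MODES.get(feature_type, set())
```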
In a preferred embodiment, screening based on the second criterion may be based on the data coverage of the features used for derivation and/or the data coverage of the derived features, and may in particular comprise:
acquiring a first data coverage, in the second sample data, of the feature values corresponding to a feature used for derivation, and screening out the derivation logic related to that feature if the first data coverage is smaller than a preset first coverage threshold; and/or
deriving feature values of the derived features from the second data set based on the one or more pieces of derivation logic, determining a second data coverage of the feature values of the derived features, and screening out the derivation logic related to a derived feature if the second data coverage is smaller than a preset second coverage threshold.
For example, if the coverage in the applicable customer group of the feature values corresponding to a feature used for derivation in the second sample set is less than 5%, that feature may be excluded from feature derivation (i.e., the derivation logic associated with it is screened out). Alternatively, if the coverage in the applicable customer group of the feature values of a derived feature obtained from the second sample set is less than 3%, that derived feature may not be used (i.e., the derivation logic associated with it is screened out).
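The coverage screening above can be sketched as follows, using the 5% threshold from the example (the helper names are illustrative):

```python
# Coverage-based screening sketch: a feature with too many missing values
# in the sample is excluded from derivation.
def coverage(values):
    # Fraction of records whose value is usable (non-None).
    return sum(v is not None for v in values) / len(values)

def keep_for_derivation(values, threshold=0.05):
    # Keep the feature (and its derivation logic) only if coverage
    # reaches the preset threshold.
    return coverage(values) >= threshold

sample = ["bank staff", "teacher"] + [None] * 18  # 10% coverage
```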
In some embodiments, the second criterion and the third criterion may be applied in parallel, as alternatives, or one after the other.
In a preferred embodiment, the screening based on the third criterion may comprise:
acquiring the semantic meaning of a derived feature based on the features used for derivation and the corresponding derivation logic, which may be implemented, for example, using various semantic analysis algorithms and may optionally be adjusted manually by a user; and screening the related derivation logic based on the semantic meaning of the derived feature and the third criterion. For example, this screening may include judging whether the semantics of the derived feature meet the third criterion and screening out the related derivation logic that does not.
In some embodiments, the third criterion may include a business relevance sub-criterion, which may implement screening in association with the second criterion. Here, the screening step may further include: if the first/second data coverage is less than the first/second coverage threshold, but the derived feature is determined, based on its semantic meaning, to be above a preset business relevance sub-criterion, the related derivation logic is preserved (not screened out). For example, even if the coverage in the applicable customer group is less than 5% or 3% as described above, the related derivation logic may still be retained if the business interpretability or business value of the derived feature exceeds a certain standard.
Furthermore, an interpretable description of the derived features may also be generated, e.g., based on the same or similar means as the semantic acquisition described above, to be provided to the user in a report as an aid to understanding.
The feature derivation and optional interpretability of embodiments of the present invention are highly beneficial to wind control model creation. Illustratively, when a wind control model is to be built for a new, second service and/or scenario, a user often has only (raw) data originating from an original, first service and/or scenario. Although such data and its features have a certain relevance to the new, second service/scenario for which the wind control model is intended, they are often unsuitable for building that wind control model, or perform poorly; likewise, the persons building the wind control model may lack sufficient expertise for the new service, even though they may be capable of building a wind control model for the original, first service and/or scenario. In a further preferred embodiment, the derivation criteria may be set based on one or more, preferably all, of the following factors: the new, second service, the type of the one or more wind control model algorithms described further below, and the model evaluation indexes described further below.
Step 340: transcoding.
In some embodiments, for ease of processing, computing device 110 may perform transcoding, such as transcoding character-type features into numeric-type features. In particular, computing device 110 may convert the third data set into transcoded data based on a specified transcoding manner. Transcoding may take any suitable form, such as one-hot transcoding, WOE transcoding, ordered-variable transcoding, and the like. Computing device 110 may transcode all of the data in bulk. Alternatively, computing device 110 may also transcode individual data separately, so that different data may be transcoded in different ways.
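As a minimal sketch of one transcoding form mentioned above, one-hot transcoding of a character-type feature might look like this (pure Python, names illustrative):

```python
# One-hot transcoding sketch: each category value becomes an indicator column.
def one_hot(values):
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories
```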
Step 350: missing value processing.
In some embodiments, the computing device 110 may perform missing value processing. In particular, computing device 110 may populate missing values in the transcoded data to generate the padded data. Computing device 110 may be populated in any suitable manner, such as populating a median, a fixed value, and the like. In the case of filling in the fixed value, the fixed value may be specified by the user.
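A sketch of median filling, one of the strategies named above (a fixed-value fill would simply substitute a user-specified constant for the median):

```python
import statistics

# Missing-value filling sketch: None entries are replaced with the median
# of the observed values.
def fill_median(values):
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]
```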
Step 360: outlier processing.
In some embodiments, computing device 110 may perform outlier processing. In particular, computing device 110 may remove outliers from the padded data to generate normal data. Computing device 110 may perform outlier processing in any suitable manner, such as isolation forest detection, Z-score detection, custom screening, and the like. The user can process the fields and values detected as outliers in batches, or process individual outliers separately, so that different data may use different outlier processing manners. In addition, in custom screening, the outlier range can be customized according to a threshold value.
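Z-score detection, one of the outlier methods named above, can be sketched as follows; the threshold z_max is the user-adjustable part and its value here is illustrative:

```python
import statistics

# Z-score outlier removal sketch: values more than z_max standard
# deviations from the mean are treated as outliers and dropped.
def remove_outliers(values, z_max=3.0):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)  # no spread, nothing to flag
    return [v for v in values if abs(v - mean) / std <= z_max]
```

Note that with a single extreme value in a small sample, the outlier itself inflates the standard deviation, so a stricter threshold (e.g., z_max of 2) may be needed.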
Step 370: de-duplication.
In some embodiments, computing device 110 may perform de-duplication processing. In particular, computing device 110 may remove duplicate values from the normal data to generate the sample data. For example, a user may select one or more fields requiring de-duplication to achieve single-field or multi-field de-duplication. When single-field de-duplication is performed, data with the same value in that field can be removed. When multi-field de-duplication is performed, data whose combination of values across those fields is identical can be removed.
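Single-field and multi-field de-duplication as described above can be sketched with one helper (field names illustrative):

```python
# De-duplication sketch: keeps the first record for each distinct value
# (single field) or value combination (multiple fields).
def deduplicate(records, fields):
    seen, result = set(), []
    for rec in records:
        key = tuple(rec[f] for f in fields)
        if key not in seen:
            seen.add(key)
            result.append(rec)
    return result
```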
Thus, after data processing of the raw data as described in steps 310-370, the computing device 110 may obtain the sample data 120.
Step 230: one or more wind control model algorithm types and/or model super-parameter search setting values are preset, and training is carried out by utilizing sample data based on a preset first model evaluation index to generate a wind control model for wind control management of the service, wherein the wind control model is an optimal algorithm type determined based on the first model evaluation index and/or has an optimal super-parameter value determined based on the first model evaluation index.
Parameters of the model can be divided into two categories: parameters and superparameters. The parameters are parameter data obtained by training and learning the model. Often the superparameter needs to be set empirically to improve the model training effect. For example, the hyper-parameters may be the number of hidden layers of the model, the number of neurons per hidden layer, what activation functions and learning algorithms are employed, learning rates and regularization coefficients, etc., and the hyper-parameters specific to the wind control model. Further, the first model evaluation index may include, for example, KS, gini, ROC, AUC, PSI and the like. Specifically, for example, the features may be input into a wind control model that learns parameters under the guidance of initial hyper-parameters, and evaluates whether the setting of hyper-parameters is appropriate by model evaluation indicators, and if not, continues to adjust.
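As one concrete example of such an evaluation index, the KS statistic mentioned above measures the maximum gap between the cumulative score distributions of risky (label 1) and non-risky (label 0) samples; higher KS means better separation. A simple sketch:

```python
# KS statistic sketch: maximum gap between the cumulative score
# distributions of bad (label 1) and good (label 0) samples.
def ks_statistic(scores, labels):
    pairs = sorted(zip(scores, labels))  # walk samples in score order
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    cum_bad = cum_good = 0
    best = 0.0
    for _, y in pairs:
        if y == 1:
            cum_bad += 1
        else:
            cum_good += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
    return best
```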
Still further, presetting the one or more wind control model algorithm types and/or model hyper-parameter search settings includes:
selecting one or more preset wind control model algorithm types from a plurality of candidate wind control model algorithm types; and setting the model hyper-parameter search settings.
Optionally, the candidate wind control model algorithm types include at least two of logistic regression, extreme gradient boosting (XGBoost), LightGBM, gradient boosting (GBDT), naive Bayes, decision trees, and random forests.
In some embodiments, the model hyper-parameter search settings include at least one of: the hyper-parameter search method, the number of training runs, the training duration, and the hyper-parameter search range.
Optionally, the hyper-parameter search method comprises at least one of a tree-structured estimation method, a grid search method, a random search method, a simulated annealing method, a naive evolution method, a batch optimization method, and a black-box optimization method.
In this case, candidate optimal hyper-parameters may be searched within the hyper-parameter search range using the hyper-parameter search method, and a wind control model based on the preset one or more wind control model algorithm types may be trained with the candidate optimal hyper-parameters. If the number of training runs or the training duration of the wind control model exceeds the preset limits, training can be stopped. In this way, the training of the wind control model may be controlled. Correspondingly, the optimal algorithm type and optimal hyper-parameter values obtained through training can be determined based on the preset first model evaluation index, and an optimal wind control model determined accordingly for wind control management of the service.
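A sketch of this budget-controlled search: candidates are drawn from the search range (random search here, one of the listed methods), the loop stops once the preset number of trials or the time budget is exceeded, and the best result under the evaluation index is kept. train_and_score is a hypothetical stand-in for training a wind control model and scoring it.

```python
import random
import time

# Budget-controlled random hyper-parameter search sketch.
def search(train_and_score, space, max_trials=20, max_seconds=60.0, seed=0):
    rng = random.Random(seed)
    start, best_params, best_score = time.time(), None, float("-inf")
    for _ in range(max_trials):          # training-count budget
        if time.time() - start > max_seconds:
            break                        # training-duration budget exceeded
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = train_and_score(params)  # stand-in for train + evaluate
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```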
In some embodiments, the process of generating the model may utilize a feature screening method, a model training method, a hyper-parameter search method, and/or a binning method. Correspondingly, embodiments of the invention may also include corresponding feature screening and binning functionality. The process of generating a model is also referred to as a "model experiment", and the sub-processes/flows of the "model experiment" will be described below with reference to fig. 4. Those skilled in the art will appreciate that the sub-processes/flows may be performed in parallel/in combination and/or sequentially, and if sequentially, in any feasible order other than the order shown in fig. 4, unless clearly contradicted by the teachings of the present embodiments. In particular, the sub-processes/flows of the "model experiment" shown in fig. 4 are described on a functional basis and may be performed as described above or below in connection with the preferred embodiments of the present invention.
In some embodiments, the process of generating a model may be implemented using an automated machine learning tool (e.g., NNI). As described above, such a tool is capable of automated tuning to intelligently generate an optimal model, thereby remarkably improving modeling efficiency and saving human resources. The execution of the machine learning tool will be described below with reference to fig. 5.
Fig. 4 shows an exemplary schematic diagram of a sub-process/flow of a data experiment 400 according to an embodiment of the invention.
Step 410: feature screening.
In some embodiments, computing device 110 may employ any suitable method to screen features. For example, screening feature methods may include L1 regularization, random forest algorithms, index judgment, and the like.
In some embodiments of the invention, feature screening may be based on the preset one or more wind control model algorithm types. For example, in some embodiments, a first feature screening may be employed for a first preset wind control model algorithm type and a second feature screening (or no feature screening) may be employed for a second preset wind control model algorithm type.
In an embodiment of the present invention, the feature screening of sub-step 410 may be combined with binning as described further below.
Step 420: model training.
In some embodiments, computing device 110 may employ any suitable method for model training. For example, model training methods may include logistic regression, extreme gradient boosting (XGBoost), LightGBM, gradient boosting decision trees (GBDT), naive Bayes, decision trees, random forest algorithms, and the like.
The model training of sub-step 420 may be combined with other sub-steps, in particular sub-step 430, in various possible forms. In the model training of this sub-step 420, for example for an initial model based on a given wind control model algorithm type and a set of candidate optimal hyper-parameters, the initial model may be iteratively trained using at least part of the sample data (optionally feature-screened and/or binned) until the iteration exits, e.g., when the loss converges or the loss value is less than a predetermined threshold; the iterative method may include gradient descent.
Step 430: hyper-parameter search (determination).
In some embodiments, computing device 110 may employ any suitable method to conduct the hyper-parameter search, which may then be used to form an initial model for model training as described in sub-step 420. For example, hyper-parameter search methods may include the tree-structured Parzen estimator method, grid search, random search, simulated annealing, naive evolution, batch optimization, black-box optimization, and the like.
Taking grid search as an example, grid search tries every possibility by cyclically traversing all candidate hyper-parameter values, and the best-performing hyper-parameters are the final result. Suppose there are two classes of hyper-parameters, each with 3 values to explore; taking their Cartesian product yields 9 hyper-parameter combinations. Grid search trains the model with each combination and selects the best hyper-parameters on the validation set. This approach often lays out tables according to the different categories and loops through searches within the tables.
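The 3-by-3 example above can be sketched directly: the Cartesian product of the candidate values yields 9 combinations, each is scored, and the best one kept. The scoring function below is a hypothetical stand-in for validation-set performance, and the hyper-parameter names are illustrative.

```python
from itertools import product

# Grid search sketch: enumerate the Cartesian product of candidate values
# and keep the best-scoring combination.
def grid_search(score_fn, grid):
    names = sorted(grid)
    combos = [dict(zip(names, values))
              for values in product(*(grid[n] for n in names))]
    best = max(combos, key=score_fn)
    return best, combos

grid = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]}
```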
With respect to the hyper-parameter search (determination) function, the (truly) optimal hyper-parameter values may be obtained from the candidate optimal hyper-parameter values based on the first model evaluation index. In this sub-step 430, the generation of candidate hyper-parameter values by the search and the determination of the truly optimal values are described together. It is contemplated that they may instead be performed in separate steps, as described in some embodiments of the invention, and still fall within the scope of the invention.
Step 440: binning.
In some embodiments, the binning operation may be set for certain algorithm types. Computing device 110 may perform binning using any suitable method. For example, binning methods may include decision tree optimal binning, chi-square binning, and the like. In addition, computing device 110 may also support manual adjustment of binning. In some embodiments, automatic and/or manual binning may be selectively performed based on the preset one or more wind control model algorithm types. For example, in some embodiments, for a logistic regression algorithm (type), some or all manual binning settings may be provided on top of automatic binning, or provided directly.
In some embodiments of the present invention, the binning method and/or binning threshold may be chosen based on the preset one or more wind control model algorithm types. For example, in some embodiments, a first binning method and/or threshold may be employed for a first preset wind control model algorithm type, and a second binning method and/or threshold (or no binning) may be employed for a second preset wind control model algorithm type. Accordingly, in embodiments of the present invention, sample data binned with a particular method and/or threshold may be used, or not used, respectively, in training the wind control model.
For example only, in some embodiments, one or more binning variables and their thresholds may be preset and the sample data binned based on these binning variables and their thresholds. Thus, the wind control model can be generated based on the binned sample data. For example, customers aged 0-30, 30-60, and over 60 can be respectively binned into three groups based on age. In this way, the same type of objects (e.g., clients) may be grouped together, thereby reducing noise data in the sample data.
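The age example above, as a minimal binning sketch:

```python
# Binning sketch: group customers by age using the thresholds from the
# example above (0-30, 30-60, over 60).
def bin_age(age):
    if age <= 30:
        return "0-30"
    if age <= 60:
        return "30-60"
    return "60+"
```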
In an embodiment of the present invention, the feature screening of sub-step 410 may be combined with the binning of sub-step 440, which is within the scope of the present invention. For example, feature-screened and binned sample data may be used for a first preset wind control model algorithm, while feature-screened but un-binned sample data may be used for a second preset wind control model algorithm.
As previously mentioned, the model training in sub-step 420 may be combined with other sub-steps, in particular sub-step 430, in various possible forms. Furthermore, the generation of candidate hyper-parameter values and the determination of truly optimal hyper-parameter values by hyper-parameter search may be combined or implemented in separate steps. Thus, a number of different embodiments are possible.
In one particular embodiment, the generation of candidate hyper-parameter values by the hyper-parameter search may be independent of model training. Accordingly, this embodiment may comprise:
searching for multiple sets of candidate optimal hyper-parameters within the hyper-parameter search range using the hyper-parameter search method, wherein the number of sets may optionally correspond to the number of training runs;
within a predetermined number of training runs or training duration, cyclically executing the following steps: for each set of candidate optimal hyper-parameters, obtaining an initial model from that set and the corresponding wind control model algorithm, and iteratively training the initial model using the sample data until the iteration completes (the iteration exit condition may include an exit condition described elsewhere herein);
after the cycle is completed, determining an optimal wind control model, based on the predetermined first model evaluation index, from the plurality of trained wind control models obtained by the cycle; for example, the algorithm type and hyper-parameters are determined based on the evaluation index, and optionally the parameters of the model are determined.
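The structure of this first embodiment (candidate generation up front, training inside the cycle, selection only after the cycle) can be sketched as follows; make_model, train, and evaluate are hypothetical stand-ins for the algorithm-specific pieces:

```python
# First-embodiment sketch: train one model per candidate hyper-parameter
# set, then select the best by the first model evaluation index.
def run_experiment(candidates, make_model, train, evaluate):
    trained = []
    for params in candidates:           # cycle over candidate sets
        model = make_model(params)      # initial model from algorithm + params
        trained.append(train(model))    # iterate until the exit condition
    return max(trained, key=evaluate)   # select after the cycle completes
```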
As an alternative to this first embodiment, the determination of the optimal model may be incorporated into the cyclically executed steps. For example, this second alternative embodiment may include:
searching for multiple sets of candidate optimal hyper-parameters within the hyper-parameter search range using the hyper-parameter search method, wherein the number of sets may optionally correspond to the number of training runs;
within a predetermined number of training runs or training duration, cyclically executing the following steps: for each set of candidate optimal hyper-parameters, obtaining an initial model from that set and the corresponding wind control model algorithm, and iteratively training the initial model using the sample data until the iteration completes (the iteration exit condition may include an exit condition described elsewhere herein); and, based on the predetermined first model evaluation index, comparing the wind control model that completed its iteration in this cycle with the existing optimal wind control model to determine the current optimal wind control model (it will be understood that the current optimal wind control model can serve as the existing optimal wind control model in the next cycle).
In this particular embodiment, the number of training runs and the training duration may be alternative triggers, i.e., training terminates when either of the two conditions is met.
In other words, in this second alternative embodiment, an optimal wind control model (as well as an optimal algorithm type and optimal hyper-parameter values) may be obtained each cycle.
In this alternative embodiment, in the first cycle, an existing optimal wind control model may already exist, or it may be empty (i.e., the first cycle directly takes the wind control model that completed its iteration as the current optimal wind control model).
As an alternative to the first two embodiments, the generation of candidate hyper-parameter values by the hyper-parameter search may be integrated with model training, while the determination of the truly optimal hyper-parameter values remains independent of model training. This third alternative embodiment may include:
within a predetermined number of training runs or training duration, cyclically executing the following steps: searching for one set of candidate optimal hyper-parameters within the hyper-parameter search range using the hyper-parameter search method; and, for that set of candidate optimal hyper-parameters, obtaining an initial model from the set and the corresponding wind control model algorithm, and iteratively training the initial model using the sample data until the iteration completes (the iteration exit condition may include an exit condition described elsewhere herein);
after the cycle is completed, determining an optimal wind control model, based on the predetermined first model evaluation index, from the plurality of trained wind control models obtained by the cycle; for example, the algorithm type and hyper-parameters are determined based on the evaluation index, and optionally the parameters of the model are determined.
In other words, in this third alternative embodiment, a set of hyper-parameters may be obtained per cycle, which may be beneficial in that the hyper-parameters may be obtained based on a particular wind control model algorithm type.
As an alternative to the first three embodiments, the hyper-parameter search to generate candidate hyper-parameter values and the determination of truly optimal hyper-parameter values may be integrated with model training. This fourth alternative embodiment may include:
within a predetermined number of training runs or training duration, cyclically executing the following steps: searching for one set of candidate optimal hyper-parameters within the hyper-parameter search range using the hyper-parameter search method; for that set of candidate optimal hyper-parameters, obtaining an initial model from the set and the corresponding wind control model algorithm, and iteratively training the initial model using the sample data until the iteration completes (the iteration exit condition may include an exit condition described elsewhere herein); and, based on the preset first model evaluation index, comparing the wind control model that completed its iteration in this cycle with the existing optimal wind control model to determine the current optimal wind control model. It will be understood that, in the next cycle, the current optimal wind control model may be taken as the existing optimal wind control model; reference may be made to the second embodiment described above.
Of the four embodiments described above, the first has particular benefits: it can effectively improve model training and hyper-parameter search efficiency, its architecture is more modular, and users can adapt schemes according to embodiments of the invention as needed to optimize model training and hyper-parameter search separately.
In other embodiments, feature screening and/or binning may be combined with each of the four embodiments described above, and in particular may be integrated into, or kept independent of, the cycles described above, to arrive at new embodiments. For example, in a further embodiment that incorporates the binning operation into the first specific embodiment, the cyclically executed steps may further comprise: determining the corresponding wind control model algorithm type based on the current set of candidate optimal hyper-parameters; judging whether to trigger the binning operation based on that algorithm type (e.g., a logistic regression algorithm); if the binning operation is triggered, binning the sample data based on the preset binning method and binning threshold, and iteratively training the initial model using the binned sample data until the iteration completes; and if the binning operation is not triggered, iteratively training the initial model using the un-binned sample data until the iteration completes.
Fig. 5 illustrates an exemplary schematic diagram of an implementation 500 of a machine learning tool according to an embodiment of the present invention. The machine learning tool execution 500 shown in fig. 5 is exemplified by NNI execution. It should be appreciated that process 500 may also optionally include additional acts not shown and/or may omit acts shown, the scope of the present disclosure being not limited in this respect.
Step 510: the search space is customized. For example, computing device 110 may customize the search space of the hyper-parameters to search the space for the hyper-parameters.
Optionally step 520: an automatic machine learning tool is started. For example, the computing device 110 may launch an NNI.
Step 530: generating super parameters. For example, the computing device 110 may intelligently tune the model using NNIs to determine hyper-parameters for the model.
Step 540: performing a trial. For example, computing device 110 may run trials on the model generated by NNI, i.e., the model training described previously, e.g., inputting test data into the model to evaluate its accuracy.
Step 550: and (5) evaluating the result. For example, the computing device 110 may compare the differences between the output of the model and the actual results to assess its accuracy.
Step 560: and analyzing the result. For example, computing device 110 may analyze the results to determine whether to deploy the model, or to continue optimizing the model and methods of optimizing the model. Here, an analysis report may be generated.
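For reference, NNI's documentation describes the custom search space of step 510 as a JSON object whose entries carry a "_type" and "_value"; the sketch below follows that convention, though the exact schema and supported types may vary by NNI version, and the hyper-parameter names and ranges are illustrative assumptions.

```python
import json

# Illustrative NNI-style search space: names and ranges are assumptions.
search_space = {
    "learning_rate": {"_type": "loguniform", "_value": [1e-4, 1e-1]},
    "max_depth":     {"_type": "choice",     "_value": [3, 5, 7]},
}
serialized = json.dumps(search_space)  # NNI reads this from a JSON file
```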
Furthermore, in some embodiments, after the model is trained, the model may also be deployed so that it can perform distributed computation for wind control management of the business. In some embodiments, the model 130 may be deployed in a distributed fashion using containers. Container technology can deploy experimental tasks and execute multiple experimental tasks in parallel, achieving distributed processing of tasks with high availability, reliability, and scalability. For example, the container technology may include any suitable container technology such as Docker or CoreOS. By utilizing container technology, the traditional drawbacks of being unable to distribute computation across different servers and of depending on the performance of an individual server can be overcome.
Further, in some embodiments, computing device 110 may also test the model 130 prior to deploying it. For example, the wind control model may be tested using test data based on a predetermined second model evaluation index. The manner of acquiring the test data and its features and feature values is similar to that of the sample data, so a description thereof is omitted here. In some embodiments, the second model evaluation index optionally includes the KS, Gini, ROC, AUC, PSI, or the like of the test data.
In addition, a wind control model evaluation report may also be generated to clearly provide the user with an interpretable specification of the wind control model, and so forth.
In this way, the scheme of the embodiments of the invention can automatically generate and optimize the model, thereby reducing modeling difficulty, improving modeling efficiency, improving model interpretability, and giving non-expert users modeling capability.
FIG. 6 illustrates an exemplary schematic diagram of a wind control model creation framework 600 according to an embodiment of the invention. As shown in fig. 6, the wind control model creation framework 600 includes an application layer, a functional layer, and a support layer. The application layer may indicate the application scenarios and application stages of the wind control model creation method according to an embodiment of the present invention. Specifically, the application layer covers multiple classes of internet finance scenarios and multiple business stages. For example, the internet finance scenarios may include automotive finance, consumer finance, cash installment, credit card compensation, and the like. Further, the business stages may include marketing, anti-fraud, pre-loan application, mid-loan behavior, post-loan rewards, and the like.
The functional layer may indicate the functions for implementing the wind control model creation method according to an embodiment of the present invention. For example, these functions may include sample management, data processing, automatic training, model release, and system management. In particular, sample management may include sample uploading, sample analysis, Y-label definition, and the like. Data processing may include sampling, feature derivation, transcoding, missing value processing, outlier processing, de-duplication, and the like. Automatic training may include logistic regression, boosting tree algorithms, gradient boosting, naive Bayes, decision trees, random forests, and the like. Model release may include model generation, model deployment, model testing, and model online. Further, system management may include operation logs, organization management, user management, rights management, and the like.
The support layer may indicate the supporting capabilities provided by the wind control model creation method according to an embodiment of the present invention. For example, these support capabilities may include incorporating up-to-date algorithms, integrating full-flow modeling experience, intelligent tuning, model verification, and automatic report generation, among others.
FIG. 7 illustrates an exemplary schematic diagram of a wind control model creation system 700 according to an embodiment of the invention. It should be appreciated that the wind control model creation system 700 is merely an example system implementing a wind control model creation method according to an embodiment of the invention, and that any suitable system implementation may implement a wind control model creation method according to an embodiment of the invention.
As shown in FIG. 7, the wind control model creation system 700 may include a load balancing layer, a micro-service layer, a presentation layer, an application layer, and a persistence layer. As an example, the load balancing layer may be implemented with Nginx and Keepalived. The micro-service layer may be implemented with Spring Cloud and similar technologies. The presentation layer may be implemented with HTML, CSS3, ECharts, Vue.js, and similar technologies. The application layer may be implemented with technologies such as Spring Boot, Spring Security for rights management, RESTful service interfaces, Python, and NNI. The persistence layer may be implemented with technologies such as the data persistence framework MyBatis (using a MySQL database) and the cache database Redis.
Hereinafter, a conventional wind control model creation method will be compared with a wind control model creation method according to an embodiment of the present invention, taking a typical scorecard modeling process as an example, wherein FIG. 8 shows an exemplary schematic diagram of a conventional scorecard modeling process 800, and FIG. 9 shows an exemplary flowchart of a scorecard modeling process 900 according to an embodiment of the present invention. A scorecard is a method for evaluating whether a customer of a business poses a risk. Assuming that a risk-free customer is assigned the value 0 and a risky customer is assigned the value 1, evaluating a customer amounts to predicting the probability p that the customer poses a risk.
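For intuition only (not part of the claimed method), the relationship between the predicted probability p and a scorecard score can be sketched as follows; the calibration constants (a base score of 600 at 50:1 good/bad odds, 20 points to double the odds) are assumed values chosen purely for illustration:

```python
import math

def probability_of_risk(log_odds: float) -> float:
    """Logistic link: convert a model's log-odds output to the probability p
    that the customer is risky (label 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def scorecard_score(p: float, base_score: float = 600.0,
                    base_odds: float = 50.0, pdo: float = 20.0) -> float:
    """Map the probability p to a scorecard score (higher = safer).

    base_score, base_odds, and pdo ("points to double the odds") are
    assumed calibration constants, not values prescribed by this document.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    odds_good = (1.0 - p) / p  # odds that the customer is risk-free
    return offset + factor * math.log(odds_good)
```

With these assumed constants, a customer whose good/bad odds are exactly 50:1 (p = 1/51) scores 600, and doubling the good/bad odds adds 20 points.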
The conventional scorecard modeling process 800 illustrated in FIG. 8 includes steps 810-890, wherein:
Step 810: data acquisition. For example, the user may obtain raw data.
Step 820: data exploration and data description. For example, the user may determine the valid data fields from the raw data, and so on.
Step 830: data integration. For example, the user may clean and convert the data.
Step 840: feature selection. For example, the user manually selects the features to be used for modeling.
Step 850: model development. For example, the user may manually develop a model based on experience, including selecting the model type and manually adjusting the hyper-parameters of the model.
Step 860: scorecard creation and scaling. For example, the user may create a scorecard and compute its scale.
Step 870: scorecard implementation. For example, the user may implement the scorecard.
Step 880: model evaluation. For example, the user may evaluate the model to determine its accuracy. If the accuracy of the model is low, the process returns to data acquisition.
Step 890: monitoring and reporting. For example, the user may monitor the accuracy of the model during its operation. If the model runs with low accuracy, the process returns to data acquisition or data integration.
The wind control model creation process 900 illustrated in FIG. 9, according to an embodiment of the present invention, includes steps 910-980, wherein:
Step 910: select/upload sample data. For example, sample data may be selected and uploaded using the data management module. In addition, as described above, a data source (such as a database) may also be selected using the data management module, and the computing device 110 may obtain the sample data from the data source.
Step 920: a viewable data analysis report is generated. For example, the data processing module may automatically generate a user viewable data analysis report based on the sample to provide the user with an analysis of the sample data.
Step 930: select/configure data processing schemes. For example, a data processing scheme may be selected/configured using a data experiment module. The data processing scheme may include selecting data, data sampling, feature derivation, transcoding, missing value processing, outlier processing, deduplication, and the like. Since the data processing procedure has been described above with reference to fig. 3, a detailed description thereof is omitted herein.
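A minimal sketch of the "select data" and "data sampling" parts of such a scheme is shown below; the function names and the rebalancing rule (keep all risky samples, downsample the non-risky majority) are illustrative assumptions, not the prescribed implementation:

```python
import random

def select_features(rows, features):
    """'Select data': keep only the specified feature columns plus the label."""
    return [{k: row[k] for k in features + ["label"]} for row in rows]

def downsample_majority(rows, keep_ratio, seed=42):
    """'Data sampling': keep every risky sample (label 1) and a random
    fraction of the non-risky majority to rebalance the classes."""
    rng = random.Random(seed)
    risky = [r for r in rows if r["label"] == 1]
    safe = [r for r in rows if r["label"] == 0]
    kept = rng.sample(safe, int(len(safe) * keep_ratio))
    return risky + kept
```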
Step 940: set feature screening condition thresholds. For example, thresholds may be set using the data experiment module to exclude invalid features.
Step 950: automatically screen algorithms and hyper-parameters. For example, modeling algorithms and hyper-parameters may be automatically screened using the data experiment module.
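As one hedged illustration of step 950, an exhaustive grid search over candidate (algorithm, hyper-parameter) pairs could look like the following; the candidate-dictionary layout and the higher-is-better evaluation index are assumptions made for this sketch:

```python
from itertools import product

def grid_search(candidates, evaluate):
    """Exhaustively score every (algorithm, hyper-parameter) combination and
    return (best_score, best_algorithm, best_params); the evaluation index
    is assumed to be higher-is-better."""
    best = None
    for algo, grid in candidates.items():
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            params = dict(zip(names, values))
            score = evaluate(algo, params)
            if best is None or score > best[0]:
                best = (score, algo, params)
    return best
```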
Step 960: generate the model and a scorecard report. For example, the model and the scorecard report may be automatically generated using the data experiment module.
Step 970: generate scripts/one-click deployment. For example, deployment scripts can be automatically generated using the data experiment module, thereby simplifying the model deployment process.
Step 980: model deployment management/model monitoring. For example, the model may be deployed and managed using the data management module, and its operation monitored.
The data processing module and the data verification module are reusable, so that the modeling efficiency is further improved.
In an exemplary embodiment as shown in FIG. 10, a wind control model creation apparatus 1000 is also provided. The wind control model creation apparatus 1000 may include: an acquisition module 1010 configured to acquire raw data associated with a business to be wind control managed, the raw data including a plurality of feature values corresponding to a plurality of features and respective wind control labels; a generating module 1020 configured to perform data processing on the raw data to generate sample data; and a modeling module 1030 configured to preset one or more wind control model algorithm types and/or model hyper-parameter search settings, and to train with the sample data, based on a predetermined first model evaluation index, to generate a wind control model for wind control management of the business, wherein the wind control model is of an optimal algorithm type determined based on the first model evaluation index and/or has optimal hyper-parameter values determined based on the first model evaluation index.
In some embodiments, the wind control model creation apparatus further comprises:
a testing module configured to test the wind control model with test data based on a preset second model evaluation index.
In some embodiments, the wind control model creation apparatus further comprises:
a deployment module configured to deploy the wind control model for wind control management of the business.
In some embodiments, the generating module comprises:
a first selection module configured to select a first data set from the raw data based on a set of specified features;
a second selection module configured to select a second data set from the first data set based on a specified sampling rule;
a feature derivation module configured to generate derived features from features of the second data set based on specified feature derivation rules to obtain a third data set comprising feature values of the derived features; and
a sample data generation module configured to generate the sample data based on the third data set.
In some embodiments, the feature derivation module comprises:
a feature selection module configured to select a plurality of features of the second dataset for feature derivation;
a derivative logic setting module configured to set one or more derivative logics;
a screening module configured to verify the one or more derived logics based on a preset derived standard to screen derived logics conforming to the derived standard;
a deriving module configured to derive the derived features and their feature values from the plurality of features of the second data set and their feature values, based on the derivation logic conforming to the derivation standard.
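The derivation-logic screening described above might be sketched as follows; the coverage-based derivation standard used here (a derivation is kept only if it is computable on a minimum fraction of rows) is one assumed example of such a standard, not the one the apparatus prescribes:

```python
def screen_derivations(rows, derivations, min_coverage=0.9):
    """Apply each candidate derivation logic to every row and keep only the
    logics that satisfy the derivation standard. The standard assumed here:
    the derived value must be computable on at least min_coverage of rows."""
    accepted = {}
    for name, logic in derivations.items():
        values, ok = [], 0
        for row in rows:
            try:
                values.append(logic(row))
                ok += 1
            except Exception:
                values.append(None)  # derivation failed on this row
        if ok / len(rows) >= min_coverage:
            accepted[name] = values
    return accepted
```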
In some embodiments, the sample data generation module comprises:
a transcoding module configured to convert the third data set into transcoded data based on a specified transcoding manner;
a filling module configured to fill missing values in the transcoded data to generate filled data;
an outlier processing module configured to remove outliers in the filled data to generate normal data; and
a deduplication module configured to remove duplicate values in the normal data to generate the sample data.
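A toy end-to-end sketch of the transcoding, missing-value filling, outlier-removal, and deduplication chain, using plain Python structures (all function names and the 'grade'/'amount' fields are illustrative assumptions):

```python
def transcode(rows, mapping):
    """Transcoding: map the categorical 'grade' field to an integer code."""
    return [dict(row, grade=mapping[row["grade"]]) for row in rows]

def fill_missing(rows, key, default):
    """Missing-value processing: replace None with a default value."""
    return [dict(row, **{key: default if row[key] is None else row[key]})
            for row in rows]

def drop_outliers(rows, key, lo, hi):
    """Outlier processing: drop rows whose value falls outside [lo, hi]."""
    return [row for row in rows if lo <= row[key] <= hi]

def deduplicate(rows):
    """Deduplication: keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        signature = tuple(sorted(row.items()))
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique
```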
In some embodiments, the model hyper-parameter search settings comprise at least one of:
a hyper-parameter search method,
a number of training runs,
a training duration, and
a hyper-parameter search range.
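These four settings could be collected in a configuration object such as the following; the key names and values are illustrative assumptions, not a fixed schema of the described system:

```python
# Illustrative hyper-parameter search settings (assumed key names):
search_settings = {
    "search_method": "random",        # e.g. grid / random / annealing
    "max_training_runs": 50,          # budget on the number of trials
    "max_training_seconds": 3600,     # wall-clock training budget
    "search_range": {                 # per-hyper-parameter ranges
        "learning_rate": (1e-4, 1e-1),
        "max_depth": (2, 10),
    },
}
```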
In some embodiments, the modeling module includes:
a hyper-parameter search module configured to search for candidate optimal hyper-parameters within the hyper-parameter search range using the hyper-parameter search method;
a training module configured to train the wind control model with the candidate optimal hyper-parameters;
a stopping module configured to stop training the wind control model in response to the number of training runs or the training duration exceeding the preset number of training runs or the preset training duration.
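A hedged sketch of the search, training, and stopping behavior together: random search that halts when either the trial budget or the wall-clock budget is exhausted (the function and parameter names are assumptions made for illustration):

```python
import random
import time

def search_with_budget(search_range, evaluate, max_trials, max_seconds, seed=0):
    """Random hyper-parameter search that stops as soon as either the trial
    budget (max_trials) or the wall-clock budget (max_seconds) is exhausted.
    Returns (best_score, best_params), or None if no trial ran."""
    rng = random.Random(seed)
    deadline = time.monotonic() + max_seconds
    best = None
    for _ in range(max_trials):
        if time.monotonic() >= deadline:
            break  # training-duration budget exhausted
        params = {name: rng.uniform(lo, hi)
                  for name, (lo, hi) in search_range.items()}
        score = evaluate(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best
```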
In some embodiments, the modeling module includes:
a preset module configured to preset one or more binning variables and thresholds thereof;
a binning module configured to bin the sample data based on the one or more binning variables and their thresholds; and
a model generation module configured to generate the wind control model based on the binned sample data.
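Binning with preset variables and thresholds can be sketched with sorted cut-points: each raw value is replaced by the index of the bin it falls into (illustrative code, not the claimed implementation):

```python
import bisect

def bin_value(value, thresholds):
    """Return the index of the bin that value falls into, given sorted
    threshold cut-points; n cut-points define n + 1 bins."""
    return bisect.bisect_right(thresholds, value)

def bin_samples(rows, binning):
    """binning maps a variable name to its sorted threshold list; each raw
    value of a binned variable is replaced by its bin index."""
    return [{key: (bin_value(val, binning[key]) if key in binning else val)
             for key, val in row.items()} for row in rows]
```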
In some embodiments, the acquisition module includes at least one of:
an uploading module configured to obtain the raw data from a file uploaded by a user; and
a database acquisition module configured to obtain the raw data from a specified database.
In some embodiments, the acquisition module further comprises:
a definition module configured to define the wind control label value corresponding to each feature value of the raw data.
In some embodiments, the deployment module comprises:
a distributed deployment module configured to deploy the wind control model in a distributed manner using a container.
In an embodiment of the present invention, there is provided an electronic apparatus including: a processor and a memory storing a computer program, the processor being configured to implement any of the methods according to embodiments of the invention when the computer program is run. In addition, a processing apparatus implementing the embodiment of the present invention may also be provided.
FIG. 11 shows a schematic diagram of an electronic device 1100 that may implement embodiments of the present invention; in some embodiments, more or fewer components than shown may be included. In some embodiments, the invention may be implemented with a single electronic device or multiple electronic devices. In some embodiments, it may be implemented with cloud or distributed electronic devices.
As shown in FIG. 11, the electronic device 1100 includes a central processing unit (CPU) 1101 that can perform various appropriate operations and processes according to programs and/or data stored in a read-only memory (ROM) 1102 or loaded from a storage portion 1108 into a random access memory (RAM) 1103. The CPU 1101 may be a multi-core processor or may include a plurality of processors. In some embodiments, the CPU 1101 may comprise a general-purpose host processor and one or more special coprocessors, such as a graphics processing unit (GPU), a neural network processor (NPU), or a digital signal processor (DSP). The RAM 1103 also stores various programs and data necessary for the operation of the electronic device 1100. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
The above-described processor is used in combination with the memory to execute a program stored in the memory, which, when executed by a computer, can realize the steps or functions of the model generation method and the identification method described in the above embodiments.
The following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage portion 1108 including a hard disk or the like; and a communication portion 1109 including a network interface card such as a LAN card or a modem. The communication portion 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted in the drive 1110 as needed, so that a computer program read therefrom is installed into the storage portion 1108 as needed. Only some of the components are schematically illustrated in FIG. 11; this does not mean that the electronic device 1100 includes only the components shown in FIG. 11.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer or its associated components. The computer may be, for example, a mobile terminal, a smart phone, a personal computer, a laptop computer, a car-mounted human-computer interaction device, a personal digital assistant, a media player, a navigation device, a game console, a tablet, a wearable device, a smart television, an internet of things system, a smart home, an industrial computer, a server, or a combination thereof.
In preferred embodiments, the training system and method may be implemented, in part or entirely, on a cloud-based machine learning platform or on a self-built machine learning system, such as a GPU array.
In a preferred embodiment, the generating apparatus and method may be implemented or realized on a server, e.g., a cloud or distributed server. In a preferred embodiment, data or content may also be pushed or sent to a terminal by the server based on the generation result.
In an embodiment of the present invention, there is provided a storage medium storing a computer program configured to, when executed, perform a method of any of the embodiments of the present invention.
Storage media in embodiments of the invention include permanent and non-permanent, removable and non-removable media that may be used to implement information storage by any method or technology. Examples of storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
In the embodiments described above and/or shown in the figures, a fraud recognition model generation method and a fraud recognition method and related electronic devices and storage media are presented.
However, under the teachings of the present disclosure, embodiments of the present invention may also be applied to a wider variety of scenarios, particularly to application scenarios that involve a two-class (binary) assessment or identification, such as, but not limited to, other wind control or financial scenarios, e.g., lending assessment, and non-financial scenarios, e.g., business development success assessment, spam assessment, commodity or advertisement recommendation effectiveness, or user preference assessment. Thus, in some embodiments of the present invention, a classification model generation method and apparatus, a classification evaluation method, and related electronic devices and storage media are also presented; in particular, a wind control model generation method and apparatus, a risk evaluation method, and related electronic devices and storage media are presented, which may include the corresponding features described in the related embodiments concerning fraud identification (anti-fraud).
For example, the evaluation or recognition model generation method may include:
a sample set is obtained that contains a plurality of sample data, each sample data including a plurality of variable values corresponding to a plurality of initial variables and a respective label.
binning the plurality of sample data of the sample set, and selecting a plurality of bin variables and their thresholds from the plurality of initial variables according to the binning result, wherein at least some of the bin variables have multiple thresholds and serve as multi-threshold variables, while the remaining bin variables have a single threshold and serve as first single-threshold variables;
processing the multi-threshold variables to generate a plurality of second single-threshold variables, each having a single threshold;
mapping the first single-threshold variables and the second single-threshold variables into initial rules; and
processing the initial rules to generate a final rule set comprising a plurality of final rules.
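The multi-threshold-to-single-threshold step and the mapping to initial rules might be sketched as follows; the naming convention for split variables and the "value > threshold" rule form are assumptions made for illustration:

```python
def split_multi_threshold(bin_variables):
    """Turn each multi-threshold bin variable into several second
    single-threshold variables; single-threshold variables pass through."""
    singles = {}
    for name, thresholds in bin_variables.items():
        if len(thresholds) == 1:
            singles[name] = thresholds[0]
        else:
            for i, threshold in enumerate(thresholds):
                singles[f"{name}_{i}"] = threshold
    return singles

def to_initial_rules(singles):
    """Map every single-threshold variable to an initial rule of the assumed
    form (variable, '>', threshold)."""
    return [(name, ">", threshold) for name, threshold in sorted(singles.items())]
```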
Methods, programs, systems, apparatus, etc. in accordance with embodiments of the invention may be implemented or realized on single or multiple networked computers, or in distributed computing environments. In such a distributed computing environment, tasks may be performed by remote processing devices linked through a communications network.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Thus, it will be apparent to those skilled in the art that the functional modules/units or controllers and associated method steps set forth in the above embodiments may be implemented in software, hardware, and a combination of software/hardware.
The acts of the methods, procedures, or steps described in accordance with the embodiments of the present invention do not have to be performed in a specific order and still achieve desirable results unless explicitly stated. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Various embodiments of the invention are described herein, but for brevity the description of each embodiment is not exhaustive, and identical or similar features or parts may be omitted between embodiments. Herein, "one embodiment," "some embodiments," "example," "specific example," or "some examples" means that the description applies to at least one embodiment or example according to the present invention, but not necessarily all. These terms do not necessarily refer to the same embodiment or example. Those skilled in the art may combine the features of the different embodiments or examples described in this specification without contradiction.
The exemplary systems and methods of the present invention have been particularly shown and described with reference to the foregoing embodiments, which are merely examples of the best modes for carrying out the systems and methods. It will be appreciated by those skilled in the art that various changes may be made to the embodiments of the systems and methods described herein in practicing the systems and/or methods without departing from the spirit and scope of the invention as defined in the following claims.

Claims (14)

1. A method for creating a wind control model, comprising:
acquiring original data associated with a business to be subjected to wind control management, wherein the original data comprises a plurality of characteristic values corresponding to a plurality of characteristics and respective wind control labels;
performing data processing on the original data to generate sample data;
presetting one or more wind control model algorithm types and/or model hyper-parameter search settings, and training with the sample data based on a preset first model evaluation index to generate a wind control model for wind control management of the business, wherein the wind control model is of an optimal algorithm type determined based on the first model evaluation index and/or has optimal hyper-parameter values determined based on the first model evaluation index.
2. The method for creating a wind control model according to claim 1, further comprising:
and testing the wind control model by using test data based on a preset second model evaluation index.
3. The method for creating a wind control model according to claim 1, further comprising:
deploying the wind control model in a distributed manner using containers, and performing wind control management on the business.
4. The method for creating a wind control model according to claim 1, wherein the performing data processing on the raw data to generate sample data includes:
selecting a first data set from the raw data based on a set of specified features;
selecting a second data set from the first data set based on a specified sampling rule;
generating derived features from features of the second data set based on specified feature derivation rules to obtain a third data set comprising feature values of the derived features; and
the sample data is generated based on the third data set.
5. The method of claim 4, wherein generating derived features from features of the second data set based on specified feature derivation rules to obtain a third data set comprising feature values of the derived features, comprises:
selecting a plurality of features of the second dataset for feature derivation;
setting one or more derivation logics;
verifying the one or more derivation logics based on a preset derivation standard to screen out derivation logics conforming to the derivation standard;
deriving the derived features and their feature values from the plurality of features of the second data set and their feature values, based on the derivation logics conforming to the derivation standard.
6. The method of claim 4 or 5, wherein generating the sample data based on the third data set comprises:
converting the third data set into transcoded data based on a specified transcoding manner;
filling missing values in the transcoded data to generate filled data;
removing outliers in the filled data to generate normal data; and
removing duplicate values in the normal data to generate the sample data.
7. The method according to claim 1, wherein the presetting of one or more wind control model algorithm types and/or model hyper-parameter search settings comprises:
selecting the preset one or more wind control model algorithm types from a plurality of candidate wind control model algorithm types, wherein the candidate wind control model algorithm types comprise at least two of logistic regression, extreme gradient boosting, boosting machines, gradient boosting, naive Bayes, decision trees, and random forests;
setting the model hyper-parameter search settings, wherein the model hyper-parameter search settings comprise at least one of a hyper-parameter search method, a number of training runs, a training duration, and a hyper-parameter search range, and optionally the hyper-parameter search method comprises at least one of a tree-structured estimation method, a grid search method, a random search method, a simulated annealing method, a naive evolution method, a batch optimization method, and a black-box optimization method.
8. The method of claim 7, wherein training with the sample data to generate a wind control model for wind control management of the business comprises:
searching for candidate optimal hyper-parameters within the hyper-parameter search range using the hyper-parameter search method;
training a wind control model of the preset one or more wind control model algorithm types with the sample data, based on the candidate optimal hyper-parameters;
stopping training in response to the number of training runs or the training duration exceeding the preset number of training runs or the preset training duration; and
determining, based on the preset first model evaluation index, the optimal algorithm type and optimal hyper-parameter values obtained through training, thereby determining an optimal wind control model for wind control management of the business.
9. The method of claim 7 or 8, wherein training with the sample data to generate the wind control model further comprises:
presetting one or more binning variables and their thresholds;
binning the sample data based on the one or more binning variables and their thresholds;
and training the wind control model based on the binned sample data.
10. The method of claim 7 or 8, wherein training with the sample data to generate the wind control model further comprises:
screening the features of the sample data based on a preset feature screening method, wherein the feature screening method optionally comprises at least one of L1 regularization, a random forest algorithm, and an index judgment method;
and training the wind control model based on the sample data after feature screening.
11. The method of claim 1, wherein the obtaining of raw data associated with the business to be wind control managed includes at least one of:
acquiring the original data from a file uploaded by a user; and
the raw data is obtained from a specified database.
12. A wind control model creation apparatus, characterized by comprising:
an acquisition module configured to acquire raw data associated with a service to be managed by wind control, the raw data including a plurality of feature values corresponding to a plurality of features and respective wind control tags;
a generating module configured to perform data processing on the raw data to generate sample data; and
a modeling module configured to preset one or more wind control model algorithm types and/or model hyper-parameter search settings, and to train with the sample data based on a preset first model evaluation index to generate a wind control model for wind control management of the business, wherein the wind control model is of an optimal algorithm type determined based on the first model evaluation index and/or has optimal hyper-parameter values determined based on the first model evaluation index.
13. An electronic device, comprising: a processor and a memory storing a computer program, the processor being configured to perform the method of any one of claims 1 to 11 when the computer program is run.
14. A storage medium storing a computer program configured to perform the method of any one of claims 1 to 11 when run.
CN202210117865.0A 2022-02-08 2022-02-08 Wind control model creation method and device, electronic equipment and storage medium Pending CN116542511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210117865.0A CN116542511A (en) 2022-02-08 2022-02-08 Wind control model creation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210117865.0A CN116542511A (en) 2022-02-08 2022-02-08 Wind control model creation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116542511A true CN116542511A (en) 2023-08-04

Family

ID=87453020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210117865.0A Pending CN116542511A (en) 2022-02-08 2022-02-08 Wind control model creation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116542511A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062587A (en) * 2017-12-15 2018-05-22 清华大学 The hyper parameter automatic optimization method and system of a kind of unsupervised machine learning
CN109635953A (en) * 2018-11-06 2019-04-16 阿里巴巴集团控股有限公司 A kind of feature deriving method, device and electronic equipment
CN110334814A (en) * 2019-07-01 2019-10-15 阿里巴巴集团控股有限公司 For constructing the method and system of risk control model
CN110866819A (en) * 2019-10-18 2020-03-06 华融融通(北京)科技有限公司 Automatic credit scoring card generation method based on meta-learning
CN113344700A (en) * 2021-07-27 2021-09-03 上海华瑞银行股份有限公司 Wind control model construction method and device based on multi-objective optimization and electronic equipment
CN113870005A (en) * 2021-09-17 2021-12-31 百融至信(北京)征信有限公司 Method and device for determining hyper-parameters


Similar Documents

Publication Publication Date Title
US10311368B2 (en) Analytic system for graphical interpretability of and improvement of machine learning models
US20180240041A1 (en) Distributed hyperparameter tuning system for machine learning
CN113935434A (en) Data analysis processing system and automatic modeling method
US20210303970A1 (en) Processing data using multiple neural networks
CN106095942B (en) Strong variable extracting method and device
US20140358828A1 (en) Machine learning generated action plan
CN112270547A (en) Financial risk assessment method and device based on feature construction and electronic equipment
US11443207B2 (en) Aggregated feature importance for finding influential business metrics
CN113344700B (en) Multi-objective optimization-based wind control model construction method and device and electronic equipment
US10963802B1 (en) Distributed decision variable tuning system for machine learning
CN111738331A (en) User classification method and device, computer-readable storage medium and electronic device
CN112288455B (en) Label generation method and device, computer readable storage medium and electronic equipment
Shukla et al. Comparative analysis of ml algorithms & stream lit web application
CN111583017A (en) Risk strategy generation method and device based on guest group positioning and electronic equipment
CN111199469A (en) User payment model generation method and device and electronic equipment
CN112328869A (en) User loan willingness prediction method and device and computer system
CN111582315A (en) Sample data processing method and device and electronic equipment
JP7479251B2 (en) Computer system and information processing method
CN113379124A (en) Personnel stability prediction method and device based on prediction model
Purushu et al. Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
US20210356920A1 (en) Information processing apparatus, information processing method, and program
CN116911994A (en) External trade risk early warning system
CN111582313A (en) Sample data generation method and device and electronic equipment
CN111160733A (en) Risk control method and device based on biased sample and electronic equipment
Han et al. Interestingness classification of association rules for master data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination