CN111931848B - Data feature extraction method and device, computer equipment and storage medium - Google Patents

Data feature extraction method and device, computer equipment and storage medium

Info

Publication number
CN111931848B
CN111931848B CN202010797913.6A
Authority
CN
China
Prior art keywords
feature
screening
subset
features
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010797913.6A
Other languages
Chinese (zh)
Other versions
CN111931848A (en)
Inventor
张巧丽
林荣吉
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202010797913.6A
Publication of CN111931848A
Application granted
Publication of CN111931848B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application belong to the field of artificial intelligence and relate to a data feature extraction method comprising: obtaining raw data and extracting features from it to form a feature set; performing filter-based feature screening on the feature set to obtain a stable feature subset; and inputting the stable feature subset into an embedded feature screening model, outputting the information gain contributed by each feature during model training, sorting the features of the stable feature subset by their information gain, and screening them according to the sorted result to obtain a target feature subset. The application also provides a data feature extraction device, a computer device, and a storage medium. In addition, the application relates to blockchain technology: the feature values of the target feature subset may be stored in a blockchain. By applying filter-based screening first and embedded screening second, the application can obtain a feature subset with high stability and low information redundancy.

Description

Data feature extraction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for extracting features of data, a computer device, and a storage medium.
Background
As data volumes grow, more high-value information needs to be mined to improve the performance of machine learning models. This, however, expands the feature dimensionality and increases information redundancy among features: features carrying duplicated information cannot improve model performance, yet they increase training complexity, reduce model iteration efficiency, and may even introduce noise that degrades the model. Feature selection is therefore required.
Existing screening schemes based on distribution stability, missing rate, and predictive power cannot de-duplicate the information carried by modeling features; and de-duplication based on correlation coefficients can only handle pairwise redundancy between two features, not redundancy among multiple features.
Disclosure of Invention
The embodiments of the present application aim to provide a data feature extraction method, device, computer device, and storage medium, to solve the problems in the prior art that information shared among multiple features cannot be effectively de-duplicated and that a feature set with high stability and low information redundancy cannot be obtained.
In order to solve the above technical problems, an embodiment of the present application provides a feature extraction method of data, which adopts the following technical scheme:
A method for feature extraction of data, comprising the steps of:
acquiring original data, and extracting features according to the original data to form a feature set;
Filtering type feature screening is carried out on the feature set to obtain a stable feature subset;
And inputting the stable feature subset into an embedded feature screening model, outputting information gain provided by each feature in the stable feature subset in the model training process, sorting the information gain corresponding to each feature in the stable feature subset, and screening the features in the stable feature subset according to the sorting result to obtain a target feature subset.
In order to solve the above technical problems, the embodiment of the present application further provides a data feature extraction device, which adopts the following technical scheme:
A feature extraction apparatus of data, comprising:
the feature extraction module is used for acquiring original data and extracting features according to the original data to form a feature set;
the first screening module is used for carrying out filtering type feature screening on the feature set to obtain a stable feature subset;
and the second screening module is used for inputting the stable feature subset into an embedded feature screening model, outputting information gain provided by each feature in the stable feature subset in the model training process, sequencing the information gain corresponding to each feature in the stable feature subset, and screening the features in the stable feature subset according to the sequencing result to obtain a target feature subset.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
A computer device, comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the data feature extraction method described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the data feature extraction method described above.
Compared with the prior art, the data feature extraction method, the device, the computer equipment and the storage medium provided by the embodiment of the application have the following main beneficial effects:
By performing filter-based screening first and then embedded screening for feature extraction, the embodiments of the present application take into account both the stability of feature distributions and the redundancy of information among features, so that a feature subset with high stability and low information redundancy can be obtained. When building and training a model for scenarios with a long data time span, large differences in feature distributions, high input feature dimensionality, and high information redundancy, the present scheme can reduce the input feature dimensionality and improve both model efficiency and model accuracy.
Drawings
In order to illustrate the solution of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings in the following description correspond to some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method of feature extraction of data in accordance with the present application;
FIG. 3 is a flow chart of one embodiment of step S202 of FIG. 2;
FIG. 4 is a schematic diagram of the structure of one embodiment of a feature extraction device of data according to the application;
FIG. 5 is a schematic diagram illustrating the structure of one embodiment of the first screening module 402 shown in FIG. 4;
FIG. 6 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to limit the application; the terms "comprising" and "having" and any variations thereof in the description, the claims, and the above description of the drawings are intended to cover a non-exclusive inclusion. The terms "first", "second", and the like in the description, the claims, or the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the method for extracting the features of the data provided in the embodiment of the present application is generally executed by a server, and accordingly, the device for extracting the features of the data is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a feature extraction method of data according to the present application is shown. The characteristic extraction method of the data comprises the following steps:
S201, acquiring original data, and extracting features according to the original data to form a feature set;
S202, filtering type feature screening is carried out on the feature set to obtain a stable feature subset;
S203, inputting the stable feature subset into an embedded feature screening model, outputting information gain provided by each feature in the stable feature subset in the model training process, sorting the information gain corresponding to each feature in the stable feature subset, and screening the features in the stable feature subset according to the sorting result to obtain a target feature subset.
The above steps are explained below.
For step S201, specifically, data mining is performed on the raw data during big-data analysis and modeling, and features usable as model inputs are extracted to form a feature set that serves as the training or validation set of the model. Taking feature extraction for a retention prediction model of insurance-industry practitioners as an example, each practitioner is one sample; the raw data are practitioner data comprising the practitioner's basic information (such as gender, age, education, and work experience), onboarding information (such as training data and usage data of company software), and work performance information (such as signed historical policy data). Each item of the raw data can be extracted as one feature of the sample, finally forming the original feature set.
In this embodiment, the electronic device (for example, the server shown in fig. 1) on which the data feature extraction method runs may receive the data feature extraction request sent by the terminal device over a wired or wireless connection, thereby triggering step S201. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, ZigBee, UWB (ultra wideband), and other now-known or later-developed wireless connections.
For step S202, in this embodiment, when stability-based feature filtering is performed, each feature is filtered independently against the stability screening conditions, without considering correlations between features. Features satisfying the stability screening conditions are retained; a feature that fails them has poor stability, so removing such features yields the stable feature subset.
In some embodiments, as shown in fig. 3, the step of filtering the feature set to obtain the stable feature subset specifically includes:
S301, acquiring feature evaluation dimensions, and calculating evaluation values of all features in the feature evaluation dimensions in the feature set;
S302, acquiring a threshold value for each feature evaluation dimension, and comparing the evaluation value of each feature in the feature set with the threshold value of the corresponding feature evaluation dimension;
And S303, sequentially screening and filtering all the features in the feature set according to the comparison result, and outputting the rest features to obtain a stable feature subset.
Specifically, the feature evaluation dimensions in the embodiment of the present application may include the missing rate, PSI, IV, and the like of each feature. The missing rate is the proportion of samples in which the feature's value is null; PSI (Population Stability Index) is used to evaluate the stability of a feature; IV (Information Value) is used to evaluate the contribution of a feature to the model. A threshold is preset for each feature evaluation dimension to form its screening condition, which is satisfied either above or below the threshold depending on the dimension: the missing-rate condition is satisfied when the missing rate does not exceed its threshold, the PSI condition is satisfied when the PSI does not exceed its threshold, and the IV condition is satisfied when the IV exceeds its threshold.
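To make the two statistics concrete, a minimal Python sketch of PSI and IV over pre-binned counts is given below. It is an illustration, not code from the patent; the example bin counts, the smoothing constant `eps`, and the use of natural logarithms are assumptions made for the sketch:

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between an expected (baseline) and an
    actual distribution, given per-bin sample counts."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # eps avoids log(0) on empty bins
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

def iv(good_counts, bad_counts, eps=1e-6):
    """Information Value of a binned feature, from per-bin counts of the
    positive ("good") and negative ("bad") classes."""
    g_total = sum(good_counts)
    b_total = sum(bad_counts)
    value = 0.0
    for g, b in zip(good_counts, bad_counts):
        g_pct = max(g / g_total, eps)
        b_pct = max(b / b_total, eps)
        value += (g_pct - b_pct) * math.log(g_pct / b_pct)  # (good% - bad%) * WOE
    return value

# Identical distributions give PSI = 0; a shifted distribution gives PSI > 0.
print(round(psi([50, 30, 20], [50, 30, 20]), 6))  # 0.0
print(psi([50, 30, 20], [20, 30, 50]) > 0.25)     # True: unstable
```

A PSI below roughly 0.25 is conventionally read as stable, which is why the thresholds in the conditions below sit around that value.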
The embodiment of the present application filters sequentially over multiple feature evaluation dimensions: for example, all features are first filtered by missing rate, judging one by one whether each feature satisfies the screening condition; the features remaining after that pass are then filtered by PSI; and so on, until filtering has been completed over all feature evaluation dimensions. One or more feature evaluation dimensions can be selected for filtering as needed, giving high flexibility and high screening efficiency.
Because the raw data cover a long time span, the feature evaluation dimensions need to be evaluated over multiple time partitions. In some embodiments, the step in S301 of obtaining feature evaluation dimensions and calculating the evaluation value of each feature in the feature set in each dimension may include:
dividing the time range covered by the raw data to obtain multiple time partitions, and deriving time-partition-based feature evaluation dimensions; and calculating, for each feature in the feature set, the evaluation values in each time-partition-based dimension, obtaining multiple evaluation values of the same feature evaluation dimension over different time partitions.
The time-partition-based feature evaluation dimensions can specifically be divided into the per-partition missing rate, the month-by-month PSI, the month-by-month IV, and the overall IV. For example, under a user retention prediction model, the raw data use six months of practitioner data as samples; the time range is divided into six time partitions, each spanning one month. The per-partition missing rate is the proportion of null feature values among each month's samples; the month-by-month PSI is the PSI of the feature's distribution over each month's sample set; the month-by-month IV is the IV over each month's samples; and the overall IV is the IV over the entire six months of samples. The overall IV differs from the monthly IV: the overall IV evaluates a feature's overall predictive power, while the monthly IV evaluates its single-month predictive power, and the stability of each feature's predictive power is judged through the coefficient of variation of its monthly IVs.
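As a small illustration of one time-partition-based evaluation value, the sketch below computes the per-month missing rate of a feature. The record layout (a "month" key on each sample dict) is an assumption made for the example, not part of the patent:

```python
from collections import defaultdict

def monthly_missing_rate(samples, feature):
    """Per-time-partition missing rate: for each month, the fraction of
    samples whose value for `feature` is None."""
    by_month = defaultdict(list)
    for s in samples:
        # the "month" key per record is assumed for this illustration
        by_month[s["month"]].append(s.get(feature))
    return {m: sum(v is None for v in vals) / len(vals)
            for m, vals in sorted(by_month.items())}

samples = [
    {"month": "2020-01", "age": 30}, {"month": "2020-01", "age": None},
    {"month": "2020-02", "age": 25}, {"month": "2020-02", "age": 28},
]
print(monthly_missing_rate(samples, "age"))
# {'2020-01': 0.5, '2020-02': 0.0}
```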
In a specific embodiment, the time partition-based screening conditions of the present application are listed below:
(1) The missing rate of each time partition is not greater than a missing-rate threshold (e.g., 0.999);
(2) The IV for each time partition, as well as the overall IV, is greater than an IV threshold (e.g., 0.001);
(3) The month-by-month PSI for each time zone is not greater than a first PSI threshold (e.g., 0.25).
A feature is filtered out when any one of the above screening conditions (1) to (3) is not satisfied.
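Applying screening conditions (1)–(3) to one feature's per-partition statistics might be sketched as follows. The summary-dict layout and the example numbers are assumptions made for illustration; the default thresholds follow the examples given above:

```python
def passes_partition_screen(stats, miss_thr=0.999, iv_thr=0.001, psi_thr=0.25):
    """Return True only if every time partition satisfies the missing-rate,
    IV, and PSI conditions and the overall IV exceeds its threshold.
    `stats` is an assumed per-feature summary dict, not a patent structure."""
    if any(m > miss_thr for m in stats["monthly_missing"]):  # condition (1)
        return False
    if any(v <= iv_thr for v in stats["monthly_iv"]):        # condition (2)
        return False
    if stats["overall_iv"] <= iv_thr:                        # condition (2)
        return False
    if any(p > psi_thr for p in stats["monthly_psi"]):       # condition (3)
        return False
    return True

stable = passes_partition_screen({
    "monthly_missing": [0.1, 0.2], "monthly_iv": [0.05, 0.04],
    "overall_iv": 0.06, "monthly_psi": [0.10, 0.08]})
print(stable)  # True
```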
Further, in step S302, corresponding to the time-partition-based feature evaluation dimensions of step S301, the evaluation value of each feature is compared with the threshold of the corresponding dimension partition by partition, each time partition yielding one comparison result; in step S303, the features in the feature set are filtered in turn according to these partition-based comparison results, and the remaining features are output to obtain the stable feature subset.
By performing feature screening per time partition, the application can, for data with a large time span, more accurately screen out the features with high stability and low redundancy.
In the embodiment of the present application, IV and PSI may use the same calculation formulas whether or not time partitioning is applied; only the sample set changes, being restricted to each time partition before the calculation is performed.
In some embodiments, the method further comprises: calculating fluctuation parameters of the evaluation values of each feature evaluation dimension of the same feature across multiple time partitions, and filtering the features in the feature set according to the fluctuation parameters. Specifically, in the embodiment of the present application, the fluctuation parameters of the evaluation values over the multiple time partitions form screening conditions, which may include the following:
(1) The coefficient of variation (standard deviation / mean) of a feature's per-partition IV is not greater than a coefficient-of-variation threshold (e.g., 1); if it exceeds the threshold, the IV values are unstable and the feature's predictive power is weak, so the feature should be removed from the model;
(2) The PSI of the samples in the time partition closest to the current time node, relative to the other time partitions, is not greater than a second PSI threshold (e.g., 0.15), which may be the same as, or slightly less than, the first PSI threshold described above.
By forming screening conditions from fluctuation parameters to screen features in data with a long time span, the application can evaluate the stability of each feature more accurately, which also helps reduce information redundancy.
In some embodiments, before obtaining the feature evaluation dimensions and calculating the evaluation value of each feature in each dimension, the method comprises: obtaining pre-screening conditions, pre-screening the features in the feature set accordingly, and directly removing the features in the feature set that do not satisfy the pre-screening conditions. Specifically, for example, discrete features whose number of distinct values exceeds a threshold (such as 50), and features with no missing values and only a single unique value, are regarded as invalid features and removed directly without any screening based on the feature evaluation dimensions; this reduces the time consumed by the subsequent screening steps and improves feature extraction efficiency.
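The pre-screening step can be sketched as below. Treating "discrete" as string-valued and using 50 as the cardinality threshold follow the example in the text, while the data layout (a dict of value lists) is an assumption for the sketch:

```python
def pre_screen(feature_values):
    """Pre-screening: drop high-cardinality discrete features and constant
    features (no missing values, a single unique value). The data layout
    and the string test for "discrete" are illustrative assumptions."""
    kept = {}
    for name, values in feature_values.items():
        distinct = set(values)
        non_null = [v for v in values if v is not None]
        if len(distinct) > 50 and all(isinstance(v, str) for v in non_null):
            continue  # high-cardinality discrete feature: invalid
        if None not in distinct and len(distinct) == 1:
            continue  # constant feature with no missing values: invalid
        kept[name] = values
    return kept

features = {
    "age": [30, 25, None, 41],
    "company": ["PA"] * 4,             # constant, never missing -> removed
    "id": [str(i) for i in range(60)]  # 60 distinct strings -> removed
}
print(sorted(pre_screen(features)))  # ['age']
```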
In this embodiment, the thresholds for all features in the same feature evaluation dimension are identical, determined from empirical values combined with experimental tuning: several groups (for example, 3 to 4) of different threshold parameters around the empirical value are tried in univariate or cross experiments, and the optimal parameters are chosen by model performance. Optimizing the thresholds in this way further improves the accuracy of data feature extraction.
For step S203, in this embodiment, the embedded feature screening model is mainly a LightGBM (Light Gradient Boosting Machine) tree model. LightGBM is a lightweight framework implementing the GBDT (Gradient Boosting Decision Tree) algorithm and supports efficient parallel training. A LightGBM tree model has intrinsic feature screening capability during training, and feature screening can be achieved by adjusting its regularization parameters; limited by the model structure, however, LightGBM by itself can only screen out features whose information gain is 0. Therefore, in the embodiment of the present application, the information gain of each feature output by the LightGBM tree model is sorted by magnitude, and screening is performed on the sorted result. Based on this, the step of screening the features in the stable feature subset according to the sorted result to obtain the target feature subset may specifically include: selecting the features with the highest information gains in the stable feature subset such that the ratio of the sum of the selected features' information gains to the sum of all features' information gains exceeds a preset gain-ratio threshold, and outputting the selected features as the target feature subset. By selecting the top-ranked features, not only are features with an information gain of 0 screened out, but features with small information gains are screened out as well, improving model efficiency and model accuracy.
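The top-gain selection rule of step S203 can be sketched as follows. The feature names and gain values are illustrative numbers; in a real pipeline the gains could come from a trained LightGBM model's gain-type feature importances, which is an assumption about the surrounding code rather than part of the patent text:

```python
def select_by_cumulative_gain(gains, gain_ratio=0.95):
    """Keep the top-ranked features whose cumulative information gain first
    exceeds `gain_ratio` of the total gain; features with zero or very small
    gain fall below the cut and are screened out."""
    total = sum(gains.values())
    selected, running = [], 0.0
    for name, g in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        running += g
        if running / total > gain_ratio:
            break
    return selected

# Illustrative gains, e.g. lgb_model.feature_importance(importance_type="gain")
gains = {"age": 50.0, "tenure": 30.0, "training": 18.0, "noise": 2.0, "dup": 0.0}
print(select_by_cumulative_gain(gains))  # ['age', 'tenure', 'training']
```

Note that both the zero-gain feature and the small-gain feature are dropped, matching the stated advantage over relying on LightGBM's built-in screening alone.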
Specifically, the preset gain-ratio threshold in the embodiment of the present application may be set to 95%. In practice, the appropriate threshold varies somewhat across data scenarios and needs to be adjusted to fit the data characteristics; for example, when the feature dimensionality is low, raising the threshold may be considered.
By performing embedded screening after filter-based screening for feature extraction, the embodiment of the present application takes into account feature distribution stability and predictive power as well as the information redundancy among features, so a feature subset with high stability and low information redundancy can be obtained. When building and training a model for scenarios with a long data time span, large differences in feature distributions, high input feature dimensionality, and high information redundancy, the present scheme can reduce the input feature dimensionality and improve model efficiency and accuracy. Placing the filter-based screening of step S202 before the embedded screening of step S203 is more conducive to de-duplicating feature information, because first applying optimal-subset feature screening on the full feature set and only then applying the filter conditions would risk losing information.
The method provided by the embodiment of the present application can be applied to scenarios where a long time span between the training set and the prediction set causes distribution differences and high information redundancy, resulting in low model training and prediction efficiency. For example, in a user (such as an insurance-industry practitioner) retention prediction scenario, the target feature subset obtained in step S203 is directly input into a retention prediction model for training; at prediction time, the input data undergo feature extraction by the above method, which can be regarded as finding an optimal subset of features and thus de-duplicating the feature variables, and the extracted target feature subset is input into the trained retention prediction model to obtain the prediction result. When the time span between the retention prediction training set and prediction set is three months, many features differ considerably in distribution between the two sets, the model input dimensionality is high, and information redundancy is large; in this case, using the embodiment of the present application, the input feature dimensionality can be reduced to 30% of its pre-screening size, model efficiency improves roughly threefold, and model accuracy can also improve.
It should be emphasized that, to further ensure the privacy and security of the information, the feature values of each target feature in the target feature subset may also be stored in a node of a blockchain after the step of obtaining the target feature subset.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain, essentially a decentralized database, is a chain of data blocks generated in association using cryptographic methods; each block contains information from a batch of network transactions, used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented through computer readable instructions stored in a computer readable storage medium; when executed, the instructions may carry out the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 4, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a data feature extraction apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 4, the data feature extraction device according to the present embodiment includes: a feature extraction module 401, a first screening module 402, and a second screening module 403. Wherein:
The feature extraction module 401 is configured to obtain original data and extract features from the original data to form a feature set; the first screening module 402 is configured to perform filter-based feature screening on the feature set to obtain a stable feature subset; the second screening module 403 is configured to input the stable feature subset into an embedded feature screening model, output the information gain provided by each feature in the stable feature subset during model training, sort the information gains corresponding to the features in the stable feature subset, and screen the features in the stable feature subset according to the sorting result to obtain a target feature subset.
Specifically, in big-data analysis and modeling, the feature extraction module 401, first screening module 402, and second screening module 403 provided by the application can perform data mining on the original data and extract features usable as model input to form a feature set, which serves as the model's training set or verification set. For example, in the insurance-industry practitioner retention prediction model, the feature extraction module 401 extracts from the original data features including the practitioner's basic information (such as gender, age, education, and experience), onboarding-process information (such as training data and usage data of the company's application software), and work performance information (such as historical signed policy data) to form an original feature set. Stability feature screening and filtering is then performed by the first screening module 402: each feature is screened separately against the stability filtering conditions, irrespective of correlations between features, and the features meeting those conditions are retained to obtain a stable feature subset.
In some embodiments, as shown in fig. 5, the first filtering module 402 includes a calculating unit 4021, a comparing unit 4022, and a filtering unit 4023, wherein:
The calculating unit 4021 is configured to obtain feature evaluation dimensions, and calculate evaluation values of each feature in each feature evaluation dimension in the feature set; the comparing unit 4022 is configured to obtain a threshold value of each feature evaluation dimension, and compare an evaluation value of each feature in the feature set with a threshold value of a corresponding feature evaluation dimension; the filtering unit 4023 is configured to sequentially screen and filter each feature in the feature set according to the comparison result, and output the remaining features to obtain a stable feature subset.
Specifically, the feature evaluation dimensions in the embodiment of the present application may include each feature's missing rate, PSI (Population Stability Index), IV (Information Value), and the like. A threshold is preset for each feature evaluation dimension to form that dimension's screening condition, which is met when the evaluation value is above or below the threshold as appropriate. For example, the screening condition of the missing-rate dimension is met when the missing rate does not exceed its threshold, the screening condition of the PSI dimension is met when the PSI does not exceed its threshold, and the screening condition of the IV dimension is met when the IV exceeds its threshold.
In the embodiment of the application, the filtering unit 4023 performs filtering and screening sequentially over a plurality of feature evaluation dimensions: for example, it first filters all features on the missing-rate dimension, judging one by one whether each feature meets the screening condition, then performs secondary filtering on the features remaining from the previous round based on PSI, and so on until all feature evaluation dimensions have been applied. In the application, one or more feature evaluation dimensions can be selected for filtering and screening as needed, giving high flexibility and high screening efficiency.
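As an illustrative sketch of this dimension-by-dimension filtering (the function name, the statistics dictionary, and the example thresholds are assumptions for demonstration, not the patent's implementation), the sequential screening could look like:

```python
def sequential_screen(feature_stats, conditions):
    """feature_stats: {feature_name: {dimension: evaluation_value}}.
    conditions: ordered list of (dimension, predicate); each round keeps
    only the features whose value satisfies that dimension's predicate."""
    remaining = list(feature_stats)
    for dim, ok in conditions:
        remaining = [f for f in remaining if ok(feature_stats[f][dim])]
    return remaining

stats = {
    "age":    {"missing_rate": 0.01, "psi": 0.05, "iv": 0.20},
    "gender": {"missing_rate": 0.00, "psi": 0.40, "iv": 0.08},  # unstable PSI
    "tenure": {"missing_rate": 1.00, "psi": 0.02, "iv": 0.30},  # almost all missing
}
conditions = [
    ("missing_rate", lambda v: v <= 0.999),  # missing-rate condition
    ("psi",          lambda v: v <= 0.25),   # distribution stability
    ("iv",           lambda v: v > 0.001),   # predictive power
]
print(sequential_screen(stats, conditions))  # ['age']
```

Because each round only sees the survivors of the previous round, dimensions can be added or removed freely, matching the flexibility noted above.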
Because the original data spans a long time range, the above feature evaluation dimensions need to be evaluated over a plurality of time partitions. In some embodiments, when obtaining the feature evaluation dimensions and calculating the evaluation value of each feature in each dimension, the computing unit 4021 is specifically configured to divide the time range covered by the original data to obtain a plurality of time partitions and obtain time-partition-based feature evaluation dimensions; and to calculate, for each feature in the feature set, the evaluation values in the time-partition-based feature evaluation dimensions, obtaining a plurality of evaluation values of the same feature evaluation dimension in different time partitions.
The time-partition-based feature evaluation dimensions can specifically be divided into the time-partition-based missing rate, the month-by-month PSI, the month-by-month IV, and the overall IV. For example, under a user retention prediction model, the original data uses 6 months of practitioner data as samples, and the time range is divided into six time partitions with a span of 1 month each. The time-partition-based missing rate is the proportion of empty values of a feature in each month's samples; the month-by-month PSI is the PSI of the feature's distribution computed on each month's sample set; the month-by-month IV is the IV value computed on each month's samples; and the overall IV is the IV value over the whole 6-month sample. The overall IV differs from the monthly IV: the overall IV evaluates a feature's overall predictive capability, the monthly IV evaluates its single-month predictive capability, and the stability of each feature's predictive capability is judged through the coefficient of variation of the monthly IV values.
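A minimal pandas sketch of one such per-partition evaluation, the time-partition-based missing rate, might look like this (the data frame, column names, and month granularity are hypothetical):

```python
import pandas as pd

# Hypothetical sample data: 'month' is the time-partition key,
# 'feature_x' is one candidate feature (None marks a missing value).
df = pd.DataFrame({
    "month":     ["2020-01", "2020-01", "2020-02", "2020-02"],
    "feature_x": [1.0, None, None, None],
})

# Missing rate of the feature within each time partition:
# the share of empty values among that month's samples.
missing_by_month = df.groupby("month")["feature_x"].apply(
    lambda s: s.isna().mean()
)
print(missing_by_month.to_dict())  # {'2020-01': 0.5, '2020-02': 1.0}
```

The month-by-month PSI and IV follow the same pattern: group the samples by partition, then apply the corresponding formula to each group.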
In a specific embodiment, the time partition-based screening conditions of the present application are listed below:
(1) The missing rate of each time partition is not greater than a missing-rate threshold (e.g., 0.999);
(2) The IV for each time partition, as well as the overall IV, is greater than an IV threshold (e.g., 0.001);
(3) The month-by-month PSI for each time partition is not greater than a first PSI threshold (e.g., 0.25).
When any one of the screening conditions (1) to (3) above is not satisfied, the feature will be filtered out by the filtering unit 4023.
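As an illustrative sketch (the function name, data layout, and default thresholds mirror the example values in conditions (1) to (3) above but are otherwise assumptions), the per-feature check could be written as:

```python
def passes_partition_conditions(partition_stats, overall_iv,
                                miss_thr=0.999, iv_thr=0.001, psi_thr=0.25):
    """partition_stats: one dict per time partition for a single feature,
    each with keys 'missing_rate', 'iv', and 'psi' (month-by-month PSI).
    Returns True only when conditions (1) to (3) all hold."""
    if overall_iv <= iv_thr:               # overall IV must also exceed the IV threshold
        return False
    for p in partition_stats:
        if p["missing_rate"] > miss_thr:   # condition (1)
            return False
        if p["iv"] <= iv_thr:              # condition (2)
            return False
        if p["psi"] > psi_thr:             # condition (3)
            return False
    return True

stable = passes_partition_conditions(
    [{"missing_rate": 0.10, "iv": 0.05, "psi": 0.10}] * 6, overall_iv=0.08)
drifted = passes_partition_conditions(
    [{"missing_rate": 0.10, "iv": 0.05, "psi": 0.40}] * 6, overall_iv=0.08)
print(stable, drifted)  # True False
```

A feature failing any single partition's condition is filtered out, matching the all-or-nothing behavior described for the filtering unit 4023.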
Further, the computing unit 4021 obtains the evaluation values of the time-partition-based feature evaluation dimensions, the comparing unit 4022 compares the evaluation value of each feature in the feature set with the threshold of the corresponding feature evaluation dimension, and the filtering unit 4023 sequentially screens and filters each feature in the feature set based on the time-partition-based comparison results, outputting the remaining features to obtain the stable feature subset.
By performing feature screening in a time-partitioned manner, the application can more accurately screen out features with high stability and low redundancy for data with a large time span.
In the embodiment of the present application, IV and PSI may use the same calculation formulas whether or not time partitioning is performed; after time partitioning, only the sample set changes, and each partition's evaluation values are computed on that partition's samples.
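For concreteness, the standard PSI and IV formulas over binned distributions can be sketched as follows (these are the widely used definitions, not formulas stated in the patent; the small epsilon guards against empty bins and is an implementation assumption):

```python
import numpy as np

def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index between two binned distributions
    (each array of per-bin shares sums to 1):
    PSI = sum((actual - expected) * ln(actual / expected))."""
    e = np.clip(np.asarray(expected_pct, dtype=float), eps, None)
    a = np.clip(np.asarray(actual_pct, dtype=float), eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

def iv(good_pct, bad_pct, eps=1e-6):
    """Information Value from per-bin shares of goods and bads:
    IV = sum((good% - bad%) * ln(good% / bad%))."""
    g = np.clip(np.asarray(good_pct, dtype=float), eps, None)
    b = np.clip(np.asarray(bad_pct, dtype=float), eps, None)
    return float(np.sum((g - b) * np.log(g / b)))

print(round(psi([0.5, 0.5], [0.5, 0.5]), 6))  # 0.0 for identical distributions
```

Whether applied to a single month's samples or to the whole sample, only the input distributions change; the formulas stay the same, as noted above.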
In some embodiments, the calculating unit 4021 is further configured to calculate a fluctuation parameter of the evaluation values of each feature evaluation dimension of the same feature across the plurality of time partitions, and the filtering unit 4023 is further configured to filter and screen the features in the feature set according to the fluctuation parameter. Specifically, in the embodiment of the present application, the fluctuation parameters of the evaluation values across the multiple time partitions form screening conditions, which may include the following:
(1) The fluctuation coefficient (standard deviation/mean) of the IV across the time partitions is not more than a fluctuation coefficient threshold (such as 1); if the fluctuation coefficient exceeds the threshold, the IV value is unstable, the feature's predictive capability is weak, and the feature should be removed from the model;
(2) The PSI of the sample in the time partition closest to the current time node, relative to each other time partition, is not greater than a second PSI threshold (e.g., 0.15); this threshold may be the same as, or slightly less than, the first PSI threshold described above.
The application adopts fluctuation parameters to form screening conditions for data features with a long time span, which evaluates the stability of each feature more accurately and also helps reduce information redundancy.
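The fluctuation-coefficient check in condition (1) is the coefficient of variation of the monthly IVs. A sketch, with the function name and example IV sequences being illustrative assumptions:

```python
import statistics

def iv_is_stable(monthly_ivs, cv_threshold=1.0):
    """Keep a feature only if the coefficient of variation
    (standard deviation / mean) of its month-by-month IV values
    does not exceed the threshold (1.0 in the example above)."""
    mean = statistics.mean(monthly_ivs)
    if mean == 0:
        return False  # no predictive power at all
    cv = statistics.pstdev(monthly_ivs) / mean
    return cv <= cv_threshold

print(iv_is_stable([0.20, 0.22, 0.19, 0.21]))       # True: steady IV
print(iv_is_stable([0.001, 0.90, 0.001, 0.001]))    # False: IV fluctuates wildly
```

`pstdev` (population standard deviation) is used here; whether the patent intends the population or sample standard deviation is not specified, so that choice is an assumption.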
In some embodiments, the first screening module 402 further includes a pre-screening unit configured to obtain a pre-screening condition before the calculating unit 4021 obtains the feature evaluation dimensions and calculates the evaluation values, to pre-screen the features in the feature set according to the pre-screening condition, and to directly remove the features that do not meet it. Specifically, for example, discrete features whose number of distinct values exceeds a threshold (such as 50), and constant features with a unique value and no missing entries, are regarded as invalid features and removed directly without screening judgment based on the feature evaluation dimensions; this reduces the time consumed by the subsequent screening steps and improves feature extraction efficiency.
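A minimal sketch of this pre-screening step (the function name, data layout, and the interpretation of "discrete" as string-valued are assumptions; the threshold of 50 follows the example above):

```python
def pre_screen(columns):
    """columns: {feature_name: list of raw values (None = missing)}.
    Removes (a) discrete features whose number of distinct values
    exceeds 50 and (b) constant features with a single unique value
    and no missing entries, before any dimension-based screening."""
    kept = {}
    for name, values in columns.items():
        non_missing = [v for v in values if v is not None]
        distinct = set(non_missing)
        too_many_levels = (all(isinstance(v, str) for v in non_missing)
                           and len(distinct) > 50)
        constant = (len(distinct) == 1
                    and len(non_missing) == len(values))
        if not (too_many_levels or constant):
            kept[name] = values
    return kept

cols = {
    "agent_id": [f"id{i}" for i in range(60)],  # 60 levels: removed
    "flag":     [1] * 60,                       # constant: removed
    "age":      [30, 41, None] * 20,            # kept
}
print(sorted(pre_screen(cols)))  # ['age']
```

Dropping such invalid features first keeps the more expensive per-partition PSI/IV computations off columns that could never survive screening.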
In this embodiment, the thresholds of all features in the same feature evaluation dimension are consistent, and they are determined from empirical values combined with experimental tuning: several groups (for example, 3 to 4 groups) of different threshold parameters around the empirical value are taken for univariate or crossover experiments, and the optimal parameters are determined according to the model effect. Optimizing the thresholds can further improve the accuracy of data feature extraction.
In this embodiment, the embedded feature screening model adopted by the second screening module 403 is mainly a LightGBM (Light Gradient Boosting Machine) tree model. LightGBM tree models have the capability of screening features during training, and feature screening can be adjusted through the regularization parameters, but this is limited by the model structure: LightGBM tree-model screening can only screen out features whose information gain is 0. Therefore, in the embodiment of the application, on the basis of screening out features with an information gain of 0, the second screening module 403 sorts the information gains of the features output by the LightGBM tree model by magnitude and screens based on the sorting result. On this basis, when screening the features in the stable feature subset according to the sorting result, the second screening module 403 is specifically configured to: select the features whose information gains rank highest in the stable feature subset, such that the proportion of the sum of the selected features' information gains to the sum of the information gains of all features in the stable feature subset exceeds a preset gain parameter threshold, and output the selected features to obtain the target feature subset. By selecting the features whose information gains rank highest, not only can features with an information gain of 0 be screened out, but features with small information gains can also be screened out, improving both model efficiency and model precision.
Specifically, the preset gain parameter threshold in the embodiment of the present application may be set at 95%. In practical application, the appropriate threshold floats somewhat across different data scenes and needs to be adjusted to fit the data characteristics; for example, when the feature dimension is low, raising the preset gain parameter threshold may be considered.
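The gain-based selection step can be sketched as follows. In practice the per-feature gains might come from a trained LightGBM booster, e.g. `Booster.feature_importance(importance_type='gain')`, but here they are passed in as a plain dictionary so the selection logic stands alone; the function name is an assumption, and the 95% default mirrors the preset gain parameter threshold discussed above:

```python
def select_by_cumulative_gain(gains, threshold=0.95):
    """gains: {feature_name: information gain reported by the embedded
    model}. Keep the highest-gain features until the selected share of
    the total gain exceeds `threshold`; zero-gain and other low-gain
    features fall out automatically."""
    total = sum(gains.values())
    selected, running = [], 0.0
    for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        running += gain
        if running > threshold * total:
            break
    return selected

gains = {"a": 70.0, "b": 25.0, "c": 4.0, "d": 1.0, "e": 0.0}
print(select_by_cumulative_gain(gains))  # ['a', 'b', 'c']
```

In the example, features "d" (small gain) and "e" (zero gain) are screened out once the cumulative share of "a", "b", and "c" exceeds 95% of the total gain.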
According to the embodiment of the application, feature extraction is performed through embedded screening after filter-based screening, so that feature distribution stability and predictive capability are considered together with the information redundancy among features, and a feature subset with high stability and low information redundancy can be obtained. When building and training a model for a scene with a long data time span, large feature distribution differences, high model-input feature dimension, and large information redundancy, the scheme of the application can reduce the model-input feature dimension and improve model operation efficiency and model precision. The filter-based screening performed by the first screening module 402 in the embodiment of the application is positioned before the embedded feature screening performed by the second screening module 403, which better facilitates information de-duplication of the features, avoids loss of information quantity, and yields a target feature subset with high stability and low information redundancy.
The device provided by the embodiment of the application applies to scenarios in which a long time span between the training set and the prediction set causes differences in data distribution and large information redundancy, and therefore low model training and prediction efficiency.
In order to solve the above technical problems, the embodiment of the application also provides a computer device. Referring specifically to fig. 6, fig. 6 is a basic structural block diagram of a computer device according to the present embodiment. The computer device 6 includes a memory 61, a processor 62, and a network interface 63 communicatively connected to each other through a system bus. Computer readable instructions are stored in the memory 61, and the processor 62, when executing the computer readable instructions, implements the steps of the data feature extraction method described in the above method embodiment, with the advantages corresponding to that method, which are not expanded here.
It is noted that only a computer device 6 having a memory 61, a processor 62, and a network interface 63 is shown in the figure, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
In the present embodiment, the memory 61 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device. In this embodiment, the memory 61 is typically used to store the operating system and various types of application software installed on the computer device 6, such as the computer readable instructions corresponding to the data feature extraction method described above. Further, the memory 61 may be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute computer readable instructions stored in the memory 61 or process data, for example, execute computer readable instructions corresponding to a feature extraction method of the data.
The network interface 63 may comprise a wireless network interface or a wired network interface, which network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the feature extraction method of data as described above, and has advantages corresponding to the feature extraction method of data as described above, which are not expanded herein.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application; the preferred embodiments are shown in the drawings, which do not limit the scope of the patent claims. This application may be embodied in many different forms; these embodiments are provided so that this disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their elements. All equivalent structures made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.

Claims (8)

1. A method for extracting features of data, comprising the steps of:
Acquiring original data, and extracting features according to the original data to form a feature set, wherein the original data is practitioner data, and the feature set comprises basic information of the practitioner, information in a flow of the practitioner and staff work performance information;
Filtering type feature screening is carried out on the feature set to obtain a stable feature subset;
Inputting the stable feature subset into an embedded feature screening model, outputting information gain provided by each feature in the stable feature subset in the model training process, sorting the information gain corresponding to each feature in the stable feature subset, and screening the features in the stable feature subset according to the sorting result to obtain a target feature subset;
the step of filtering the feature set to obtain a stable feature subset specifically comprises the following steps:
Acquiring feature evaluation dimensions, and calculating evaluation values of all features in all feature evaluation dimensions in the feature set;
acquiring a threshold value of each feature evaluation dimension, and comparing the evaluation value of each feature in the feature set with the threshold value of the corresponding feature evaluation dimension;
Sequentially screening and filtering each feature in the feature set according to the comparison result, and outputting the rest features to obtain a stable feature subset;
the step of obtaining feature evaluation dimensions and calculating the evaluation value of each feature in each feature evaluation dimension in the feature set comprises the following steps:
Dividing the time range covered by the practitioner data to obtain a plurality of time partitions, and obtaining characteristic evaluation dimensions based on the time partitions;
And calculating the evaluation values of the characteristics in the characteristic evaluation dimension based on the time partition in the characteristic set to obtain a plurality of evaluation values of the same characteristic evaluation dimension in different time partitions.
2. The method for extracting features from data according to claim 1, wherein the step of screening the features in the stable feature subset according to the sorting result to obtain the target feature subset comprises:
And selecting the features with the information gains ranked at the front in the stable feature subset, enabling the proportion of the sum of the information gains of the selected features to the sum of the information gains of all the features of the stable feature subset to exceed a preset gain parameter threshold, and outputting the selected features to obtain a target feature subset.
3. The method of feature extraction of data according to claim 1, characterized in that the method further comprises:
And calculating fluctuation parameters of evaluation values of each feature evaluation dimension of the same feature in a plurality of time partitions, and filtering and screening the features in the feature set according to the fluctuation parameters.
4. The method according to claim 1, wherein before the feature evaluation dimension is obtained and the evaluation value of each feature in each feature evaluation dimension in the feature set is calculated, the method comprises:
and acquiring a pre-screening condition, pre-screening the features in the feature set according to the pre-screening condition, and directly removing the features which do not meet the pre-screening condition in the feature set.
5. The feature extraction method of data according to claim 1 or 2, characterized in that after the step of obtaining a target feature subset, the method further comprises:
And storing the characteristic value of each target characteristic in the target characteristic subset into a block chain.
6. A data feature extraction device, characterized in that the data feature extraction device performs the steps of the data feature extraction method according to any one of claims 1 to 5, the data feature extraction device comprising:
the feature extraction module is used for acquiring original data and extracting features according to the original data to form a feature set;
the first screening module is used for carrying out filtering type feature screening on the feature set to obtain a stable feature subset;
and the second screening module is used for inputting the stable feature subset into an embedded feature screening model, outputting information gain provided by each feature in the stable feature subset in the model training process, sequencing the information gain corresponding to each feature in the stable feature subset, and screening the features in the stable feature subset according to the sequencing result to obtain a target feature subset.
7. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the feature extraction method of data as claimed in any one of claims 1 to 5.
8. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the feature extraction method of data according to any of claims 1 to 5.
CN202010797913.6A 2020-08-10 2020-08-10 Data feature extraction method and device, computer equipment and storage medium Active CN111931848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010797913.6A CN111931848B (en) 2020-08-10 2020-08-10 Data feature extraction method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111931848A (en) 2020-11-13
CN111931848B (en) 2024-06-14




Similar Documents

Publication Publication Date Title
CN112148987B (en) Message pushing method based on target object activity and related equipment
CN111931848B (en) Data feature extraction method and device, computer equipment and storage medium
CN112035549B (en) Data mining method, device, computer equipment and storage medium
CN112508118B (en) Target object behavior prediction method aiming at data offset and related equipment thereof
CN112766649B (en) Target object evaluation method based on multi-scoring card fusion and related equipment thereof
CN112328909B (en) Information recommendation method and device, computer equipment and medium
CN112990583B (en) Method and equipment for determining model entering characteristics of data prediction model
CN112036483B (en) AutoML-based object prediction classification method, device, computer equipment and storage medium
CN112085087A (en) Method and device for generating business rules, computer equipment and storage medium
CN112199374B (en) Data feature mining method for data missing and related equipment thereof
CN111752958A (en) Intelligent associated label method, device, computer equipment and storage medium
CN116777646A (en) Artificial intelligence-based risk identification method, apparatus, device and storage medium
CN116843395A (en) Alarm classification method, device, equipment and storage medium of service system
CN116012019A (en) Financial wind control management system based on big data analysis
CN112069807A (en) Text data theme extraction method and device, computer equipment and storage medium
CN114840692B (en) Image library construction method, image retrieval method, image library construction device and related equipment
CN115941712B (en) Method and device for processing report data, computer equipment and storage medium
CN117078406A (en) Customer loss early warning method and device, computer equipment and storage medium
CN116757192A (en) Word recognition method, device, computer equipment and storage medium
CN117786390A (en) Feature data arrangement method to be maintained and related equipment thereof
CN117874518A (en) Insurance fraud prediction method, device, equipment and medium based on artificial intelligence
CN116307742A (en) Risk identification method, device and equipment for subdivision guest group and storage medium
CN117407750A (en) Metadata-based data quality monitoring method, device, equipment and storage medium
CN116777641A (en) Model construction method, device, computer equipment and storage medium
CN117114894A (en) Method, device and equipment for predicting claim result and storage medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant