WO2019129060A1 - Method and system for automatically generating features of machine learning samples - Google Patents

Method and system for automatically generating features of machine learning samples

Info

Publication number
WO2019129060A1
WO2019129060A1 (PCT/CN2018/123910; CN 2018123910 W)
Authority
WO
WIPO (PCT)
Prior art keywords
features
feature
machine learning
target value
unit
Prior art date
Application number
PCT/CN2018/123910
Other languages
English (en)
French (fr)
Inventor
杨强
戴文渊
陈雨强
孙迪
杨慧斌
刘守湘
Original Assignee
第四范式(北京)技术有限公司
Priority date
Filing date
Publication date
Application filed by 第四范式(北京)技术有限公司
Publication of WO2019129060A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155: Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • the present disclosure relates generally to the field of artificial intelligence and, more particularly, to a method and system for automatically generating features of machine learning samples.
  • the basic process of training a machine learning model mainly includes: training the model, in which a model is learned, according to a preset machine learning algorithm (e.g., a logistic regression algorithm, a decision tree algorithm, a neural network algorithm, etc.), based on machine learning samples obtained through feature engineering.
  • Each data record in the data table may include a plurality of attribute information items (i.e., fields), and the features may indicate the results of various kinds of processing (or operations) on the fields, such as each field itself or a combination of fields, so as to better reflect the data distribution, the intrinsic relationships between the fields, and their latent meaning. Therefore, the quality of the feature engineering directly determines the accuracy achievable on the machine learning problem, and thus affects the quality of the model.
  • the machine learning model training process can be completed using a graphical-interface-based interaction mode, without requiring the user to write program code himself or herself.
  • In the feature engineering stage, however, the feature generation scheme is often entered into the platform system manually. That is to say, the user needs to preset the features of the machine learning samples.
  • On the one hand, the user needs to have a deep understanding of the business scenario, that is, the user sets the features based on business experience; on the other hand, the amount of data used in the machine learning process is generally large, and the user sometimes cannot analyze the data comprehensively, which may result in some invalid features being set.
  • In addition, the user needs to make constant attempts when facing big data, and such work takes a long time when large data volumes and high-dimensional features are involved. In this case, the user is not only required to have a deep understanding of the business scenario; the user's workload is also increased, and the efficiency of machine learning is reduced.
  • An exemplary embodiment of the present disclosure provides a method and system for automatically generating features of machine learning samples, to solve the problem in the prior art that such features cannot be generated easily.
  • a method of automatically generating features of a machine learning sample, comprising: (A) acquiring a user-specified data table, wherein one row of the data table corresponds to one data record and one column of the data table corresponds to one field; (B) declaring a feature type corresponding to each non-target value field in the data table, wherein the feature types include discrete features and/or continuous features; (C) processing each non-target value field into a unit feature according to the declared feature type; (D) performing feature combination based on the generated unit features to generate combined features; and (E) obtaining the features of the machine learning samples based on the generated unit features and combined features.
  • a system for automatically generating features of machine learning samples, comprising: a data table acquiring device for acquiring a data table specified by a user, wherein one row of the data table corresponds to one data record and one column of the data table corresponds to one field; a declaring device for declaring a feature type corresponding to each non-target value field in the data table, wherein the feature types include discrete features and/or continuous features; a unit feature generating device configured to process each non-target value field into a unit feature according to the declared feature type; a combined feature generating device for performing feature combination based on the generated unit features to generate combined features; and a feature acquiring device for obtaining the features of the machine learning samples based on the generated unit features and combined features.
  • a computer readable medium for automatically generating features of a machine learning sample, wherein a computer program for performing the above-described method of automatically generating features of a machine learning sample is recorded on the computer readable medium.
  • a computing device for automatically generating features of a machine learning sample, comprising a storage component and a processor, wherein the storage component stores a set of computer executable instructions that, when executed by the processor, perform the method of automatically generating features of machine learning samples as described above.
  • FIG. 1 illustrates a flowchart of a method of automatically generating features of a machine learning sample, according to an exemplary embodiment of the present disclosure
  • FIG. 2 illustrates an example of specifying a feature type corresponding to a non-target value field by a user, according to an exemplary embodiment of the present disclosure
  • FIG. 3 illustrates a flowchart of a method of automatically generating features of a machine learning sample, in accordance with another exemplary embodiment of the present disclosure
  • FIG. 4 illustrates a flowchart of a method of automatically generating features of a machine learning sample, in accordance with another exemplary embodiment of the present disclosure
  • FIG. 5 illustrates a flowchart of a method of automatically generating features of a machine learning sample, in accordance with another exemplary embodiment of the present disclosure
  • FIG. 6 illustrates an example of a DAG diagram for training a machine learning model, according to an exemplary embodiment of the present disclosure
  • FIG. 7 illustrates a block diagram of a system that automatically generates features of machine learning samples, in accordance with an exemplary embodiment of the present disclosure.
  • machine learning is an inevitable outcome of the development of artificial intelligence research to a certain stage. It is dedicated to improving the performance of the system itself through computational means and experience.
  • experience usually exists in the form of “data.”
  • Machine learning algorithms can generate a “model” from data; that is, empirical data can be provided to a machine learning algorithm, which learns a model based on that empirical data. Faced with a new situation, the model provides a corresponding judgment, that is, a prediction result. Whether training a machine learning model or using a trained machine learning model for prediction, the data needs to be transformed into machine learning samples that include various features.
  • Machine learning may be implemented in the form of “supervised learning,” “unsupervised learning,” or “semi-supervised learning.” It should be noted that the exemplary embodiments of the present disclosure do not place specific limitations on the machine learning algorithm used. In addition, it should be noted that in the process of training and applying the model, other means such as statistical algorithms can also be incorporated.
  • The term “and/or” appearing in the present disclosure covers three parallel cases.
  • “including A and/or B” means the following three parallel cases: (1) includes A; (2) includes B; and (3) includes A and B.
  • Performing step one and/or step two indicates the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
  • FIG. 1 illustrates a flow chart of a method of automatically generating features of a machine learning sample, in accordance with an exemplary embodiment of the present disclosure.
  • the method may be performed by a computer program or by a dedicated system or computing device that automatically generates features of the machine learning sample.
  • the method can be performed automatically by initiating an operator corresponding to the automatic feature generation step.
  • the operator corresponds to a node in a directed acyclic graph (DAG graph) corresponding to a machine learning flow.
  • the DAG graph corresponding to the machine learning flow may include a feature generation node, and the operation of the feature generation node will be executed automatically when the entire DAG graph is run.
  • In step S101, a data table specified by the user is acquired.
  • one row of the data table corresponds to one data record
  • one column of the data table corresponds to one field.
  • each data record in the data table has a field value corresponding to each field.
  • each data record can be viewed as a description of an event or object, corresponding to an example or sample; each field can be used to describe the performance or nature of the event or object in one aspect (e.g., name, age, occupation, etc.).
  • a graphical interface for specifying a data table can be provided to the user and the data table specified by the user can be determined based on input operations performed by the user on the graphical interface.
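  • To make the data-table convention concrete, the following is a minimal Python sketch; the field names and values (including the "label" target field) are illustrative only, not taken from the patent. Each record is one row, each key is one field, and the non-target value fields are all fields minus the user-specified target value field.

```python
# Hypothetical data table: each dict is one data record (one row),
# each key is one field (one column). Field names are illustrative.
data_table = [
    {"age": 25, "occupation": "teacher",  "city": "Beijing",  "label": 0},
    {"age": 40, "occupation": "engineer", "city": "Shanghai", "label": 1},
    {"age": 33, "occupation": "teacher",  "city": "Tianjin",  "label": 0},
    {"age": 58, "occupation": "clerk",    "city": "Beijing",  "label": 1},
]

# Non-target value fields: all fields minus the user-specified target field.
target_field = "label"
non_target_fields = [f for f in data_table[0] if f != target_field]
print(non_target_fields)  # ['age', 'occupation', 'city']
```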
  • In step S102, the feature types corresponding to the respective non-target value fields in the data table are declared, wherein the feature types include discrete features and/or continuous features.
  • the target value field is the field corresponding to the mark (i.e., label) to be estimated using machine learning techniques; in the case of supervised learning, this field corresponds to the prediction target. A non-target value field is any field in the data table other than the target value field.
  • a non-target value field can be obtained by removing the user-specified target value field from all fields in the data table.
  • a graphical interface for specifying a target value field may be provided to the user, and the target value field specified by the user may be determined based on the input operations performed by the user on the graphical interface. Further, as an example, the operator may provide an exception reminder prompting the user to specify a target value field if the operator is launched without a target value field having been specified.
  • the target value field may or may not be included in the data table.
  • a continuous feature is a feature that is the opposite of a discrete feature (e.g., a category feature), and its value can be a value with a certain continuity, such as an age, an amount, and the like.
  • The value of a discrete feature does not have continuity. For example, it may be an unordered category feature, such as “from Beijing,” “from Shanghai,” “from Tianjin,” “gender is male,” “gender is female,” and the like.
  • all non-target value fields may be declared as discrete features, either automatically or according to a user's indication, or each non-target value field may be declared as a discrete feature or a continuous feature corresponding to its field value data type.
  • the field value data type of a field can be continuous (e.g., a numeric type such as integer int) or discrete (e.g., a textual type such as string).
  • the step of declaring each non-target value field as a discrete feature or a continuous feature corresponding to its field value data type may include: declaring each non-target value field in the data table whose field value data type is discrete as a discrete feature, and declaring each non-target value field in the data table whose field value data type is continuous as a continuous feature.
  • a graphical interface for specifying the feature type corresponding to the non-target value fields may be provided to the user, and, according to the input operations performed by the user on the graphical interface, either all non-target value fields are declared as discrete features, or each non-target value field is declared as a discrete feature or a continuous feature corresponding to its field value data type.
  • the graphical interface for specifying the feature type corresponding to the non-target value fields may display a radio button “all discrete” and a radio button “discrete + continuous” (only one of the two buttons may be selected). In response to the user selecting the radio button “all discrete,” all non-target value fields in the data table may be declared as discrete features; in response to the user selecting the radio button “discrete + continuous,” each non-target value field is declared as the corresponding discrete feature or continuous feature according to its data type.
  • the data type of a field can be determined automatically from the characteristics of its field values, and the field is then declared as a discrete feature or a continuous feature according to whether that data type is discrete or continuous.
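  • A small sketch of this declaration step, assuming a simple rule (a numeric field value means continuous, anything else means discrete) and hypothetical field names; the two modes loosely mirror the "all discrete" and "discrete + continuous" options described above:

```python
def declare_feature_types(records, target_field, mode="discrete+continuous"):
    """Declare each non-target value field as 'discrete' or 'continuous'.

    mode="all_discrete" mirrors the "all discrete" option; the default
    mirrors "discrete + continuous", inferring the type from the field
    value data type (numeric -> continuous, textual -> discrete).
    This is a sketch, not the patent's exact rule set.
    """
    declarations = {}
    for field, value in records[0].items():
        if field == target_field:
            continue  # the target value field is never declared as a feature
        if mode == "all_discrete":
            declarations[field] = "discrete"
        else:
            declarations[field] = (
                "continuous" if isinstance(value, (int, float)) else "discrete"
            )
    return declarations

records = [{"age": 25, "city": "Beijing", "label": 0}]
print(declare_feature_types(records, "label"))
# {'age': 'continuous', 'city': 'discrete'}
```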
  • a control for specifying a target value field may also be displayed in the graphical interface, and the user may specify a target value field by operating the control.
  • the left side of the graphical interface may also display the field name and field value data type of each field in the data table.
  • In step S103, each non-target value field is processed into a unit feature according to the declared feature type.
  • That is, each non-target value field is separately processed into a unit feature according to its declared feature type.
  • each non-target value field whose field value data type is continuous and which is declared as a discrete feature may be discretized to obtain a unit feature.
  • a unit feature herein means that the feature corresponds to a single field, which itself may have one or more dimensions depending on the definition of the value.
  • one or more binning operations may be performed for each non-target value field whose field value data type is continuous and which is declared as a discrete feature, to obtain one or more corresponding binned features, and the resulting binned features, taken as a whole, are used as the unit feature.
  • the binning operation refers to a specific manner of discretizing a continuous field, that is, dividing the value range of the continuous field into a plurality of intervals (i.e., a plurality of bins) and determining the corresponding binned feature value based on the divided bins.
  • Binning operations can be roughly divided into supervised binning and unsupervised binning, and each of the two types includes some specific binning modes.
  • For example, supervised binning can include minimum-entropy binning, minimum-description-length binning, and the like.
  • Unsupervised binning can include equal-width binning, equal-depth binning, binning based on k-means clustering, and the like. For each binning mode, corresponding binning parameters can be set, such as width, depth, and so on.
  • for a binning operation performed on a non-target value field whose field value data type is continuous and which is declared as a discrete feature, neither the kind of binning mode nor the binning parameters are limited, and the specific representation of the correspondingly generated binned features is likewise not limited.
  • the various binning operations performed on a non-target value field whose field value data type is continuous and which is declared as a discrete feature may differ in binning mode and/or binning parameters.
  • the plurality of binning operations may be binning operations of the same type but with different operational parameters (e.g., depth, width, etc.), or binning operations of different types.
  • each binning operation yields one binned feature, and together these constitute a bin-group feature. The bin-group feature can reflect the results of the different binning operations, thereby improving the effectiveness of the machine learning material and providing a good foundation for the training/prediction of the machine learning model.
  • At least one binning operation may be performed for each non-target value field whose field value data type is continuous and which is declared as a discrete feature, to obtain at least one corresponding binned feature; a feature corresponding to the field is obtained by taking each binned feature as a constituent element, and that feature is used as the unit feature.
  • performing the binning operations discretizes a non-target value field whose field value data type is continuous and which is declared as a discrete feature into the corresponding specific bins, yielding a plurality of binned features. Each dimension may indicate, as a discrete value (e.g., “0” or “1”), whether the value falls into the bin, or may indicate a specific continuous value (e.g., the actual feature value of the continuous feature or a normalized value thereof, or the average value, median value, boundary value, etc. of the continuous feature within the bin).
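  • Two of the unsupervised binning (bucketing) modes mentioned above, equal-width and equal-depth, can be sketched as follows; the field values and bin counts are illustrative, and together the two binned features form one bin-group unit feature for the field:

```python
def equal_width_bin(values, n_bins):
    """Unsupervised equal-width binning: split [min, max] into n_bins equal
    intervals and map each value to its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_depth_bin(values, n_bins):
    """Unsupervised equal-depth (quantile) binning: each bin receives
    roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins

ages = [25, 40, 33, 58]  # illustrative continuous field values
# Two binning operations with different modes together constitute one
# bin-group unit feature for the "age" field.
bin_group = {
    "age_eqwidth_2": equal_width_bin(ages, 2),
    "age_eqdepth_2": equal_depth_bin(ages, 2),
}
```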
  • In step S104, feature combination is performed based on the generated unit features to generate combined features.
  • various combinations of the generated unit features may be formed to obtain candidate combined features, or various combinations of the unit features having higher feature importance among all the generated unit features may be formed to obtain candidate combined features;
  • the combined features can be selected from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
  • a machine learning model corresponding to each candidate combined feature can be trained; since the effect of the corresponding machine learning model can reflect the feature importance (e.g., predictive power) of the candidate combined feature, the combined features can be selected from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate.
  • the specified model evaluation metric may be used to evaluate the effect of the machine learning model corresponding to each candidate combined feature.
  • model evaluation metrics may be specified automatically or according to user instructions.
  • the model evaluation metric may be the AUC (Area Under the ROC (Receiver Operating Characteristic) Curve), the MAE (Mean Absolute Error), the log loss function (logloss), or the like.
  • the unit features, among all the generated unit features, whose feature importance satisfies a first preset condition may be combined in various ways to obtain candidate combined features.
  • for example, the unit features whose feature importance is within a first preset threshold range may be combined in various ways to obtain candidate combined features; alternatively, all unit features may be sorted by feature importance from high to low, and the top first-predetermined-number of unit features combined in various ways to obtain candidate combined features.
  • the feature importance of the unit feature can be determined by measuring the effect of the machine learning model corresponding to the feature, and the better the effect of the corresponding machine learning model, the higher the feature importance of the unit feature.
  • the evaluation value, with respect to the model evaluation metric, of the machine learning model corresponding to a unit feature can be used to measure the feature importance of that unit feature.
  • the model evaluation indicator may be specified automatically or according to an instruction of the user.
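  • A minimal sketch of forming and screening candidate combined features; the importance scores, the threshold (a stand-in for the "first preset condition"), and the `evaluate` function (which in practice would be the effect of a model trained on the candidate) are all illustrative assumptions:

```python
from itertools import combinations

def candidate_combinations(unit_features, importance, threshold):
    """Pairwise-cross the unit features whose importance clears the
    threshold to form candidate combined features (a sketch of one
    possible combination scheme)."""
    kept = [f for f in unit_features if importance[f] >= threshold]
    return list(combinations(kept, 2))

def select_combined(candidates, evaluate):
    """Keep the candidates whose corresponding model effect (e.g. the AUC
    of a small model trained on that candidate alone) beats a 0.5 baseline."""
    return [c for c in candidates if evaluate(c) > 0.5]

units = ["age_bins", "occupation", "city"]          # illustrative names
importance = {"age_bins": 0.72, "occupation": 0.64, "city": 0.51}
candidates = candidate_combinations(units, importance, threshold=0.6)
print(candidates)  # [('age_bins', 'occupation')]
```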
  • In step S105, the features of the machine learning samples are obtained based on the generated unit features and combined features.
  • all of the generated unit features and all of the combined features may be used as the features of the machine learning samples.
  • alternatively, the features having higher feature importance may be used as the features of the machine learning samples.
  • the features whose feature importance satisfies a second preset condition may be used as the features of the machine learning sample; for example, the features whose feature importance is within a second preset threshold range may be used, or all features may be sorted by feature importance from high to low and the top second-predetermined-number of features used as the features of the machine learning sample.
  • the unit features having higher feature importance among all the generated unit features, together with all the generated combined features, may be used as the features of the machine learning sample.
  • all of the combined features, together with the unit features whose feature importance satisfies a third preset condition, may be used as the features of the machine learning sample. For example, all of the combined features together with the unit features whose feature importance is within a third preset threshold range may be used as the features of the machine learning sample; or all unit features are sorted by feature importance from high to low, and the top third-predetermined-number of unit features, together with all the combined features, are used as the features of the machine learning sample.
  • all of the generated unit features, together with the combined features having higher feature importance among the generated combined features, may be used as the features of the machine learning sample. That is, all of the unit features, together with the combined features whose feature importance satisfies a fourth preset condition, may be used as the features of the machine learning sample; for example, all unit features together with the combined features whose feature importance is within a fourth preset threshold range may be used, or all combined features are sorted by feature importance from high to low and the top fourth-predetermined-number of combined features, together with all the unit features, are used as the features of the machine learning sample.
  • the method of automatically generating features of a machine learning sample may further include, after step S105, displaying the obtained features of the machine learning sample to the user. Further, the feature importance of each feature can also be displayed to the user.
  • the method of automatically generating features of a machine learning sample may further include directly applying the obtained features of the machine learning sample to a subsequent machine learning step after step S105.
  • for example, the model can be learned directly based on the features of the obtained machine learning samples.
  • FIG. 3 illustrates a flow chart of a method of automatically generating features of a machine learning sample, in accordance with another exemplary embodiment of the present disclosure.
  • In step S201, a data table specified by the user is acquired.
  • In step S202, the feature types corresponding to the respective non-target value fields in the data table are declared.
  • In step S203, each non-target value field is processed into a unit feature according to the declared feature type.
  • In step S204, various combinations of the generated unit features are formed to obtain candidate combined features, and the combined features are selected from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
  • In step S205, all of the generated unit features and all of the combined features are taken as the features of the machine learning sample.
  • FIG. 4 illustrates a flow chart of a method of automatically generating features of a machine learning sample, in accordance with another exemplary embodiment of the present disclosure.
  • In step S301, a data table specified by the user is acquired.
  • In step S302, the feature types corresponding to the respective non-target value fields in the data table are declared.
  • In step S303, each non-target value field is processed into a unit feature according to the declared feature type.
  • In step S304, the unit features having higher feature importance among all the generated unit features are combined in various ways to obtain candidate combined features, and the combined features are selected from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
  • In step S305, the unit features having higher feature importance among all the generated unit features, together with all the generated combined features, are taken as the features of the machine learning sample.
  • the feature importance of a feature may be measured by the evaluation value, with respect to the model evaluation metric AUC, of the machine learning model corresponding to that feature.
  • for example, in step S304, the unit features, among all the generated unit features, whose corresponding AUC value is greater than 0.5 and less than 1 may be combined in various ways to obtain candidate combined features; and, in step S305, the unit features among all the generated unit features whose corresponding AUC value is greater than 0.5 and less than 1, together with all the generated combined features, may be used as the features of the machine learning sample.
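  • For reference, AUC can be computed from labels and scores via the rank-sum identity, after which the "greater than 0.5 and less than 1" filter reduces to a simple comparison; the per-feature AUC numbers below are illustrative, not from the patent:

```python
def auc_from_scores(labels, scores):
    """AUC via the rank-sum identity: the probability that a randomly
    chosen positive sample outscores a randomly chosen negative one
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative per-feature evaluation values: keep only the features whose
# corresponding model AUC is greater than 0.5 and less than 1.
feature_auc = {"age_bins": 0.81, "city": 0.50, "leaky_id": 1.0}
selected = [f for f, auc in feature_auc.items() if 0.5 < auc < 1.0]
print(selected)  # ['age_bins']
```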
  • FIG. 5 illustrates a flow chart of a method of automatically generating features of a machine learning sample, in accordance with another exemplary embodiment of the present disclosure.
  • In step S401, a data table specified by the user is acquired.
  • In step S402, the feature types corresponding to the respective non-target value fields in the data table are declared.
  • In step S403, each non-target value field is processed into a unit feature according to the declared feature type.
  • In step S404, various combinations of the generated unit features are formed to obtain candidate combined features, and the combined features are selected from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
  • In step S405, among all the generated unit features and all the combined features, the features having higher feature importance are taken as the features of the machine learning sample.
  • the feature importance of a feature may be measured by the evaluation value, with respect to the model evaluation metric AUC, of the machine learning model corresponding to that feature.
  • for example, among all the generated unit features and all the generated combined features, the features whose corresponding AUC value is greater than 0.5 and less than 1 may be used as the features of the machine learning sample.
  • the machine learning process may be performed in the form of a directed acyclic graph (DAG graph) that may encompass all or part of the steps for performing machine learning model training, testing, or prediction.
  • for machine learning model training, a DAG graph including at least one of the following steps may be established: a historical data import step, a data splitting step, a feature generation step, a logistic regression step, and a model prediction step. That is, each of the above steps can serve as a node in the DAG graph.
  • FIG. 6 illustrates an example of a DAG diagram for training a machine learning model, according to an exemplary embodiment of the present disclosure.
  • In the first step, a data import node is established.
  • the data import node may be set in response to a user operation so as to obtain a banking data table named “bank” (i.e., the data table is imported into the machine learning platform), wherein the data table may include multiple historical data records.
  • In the second step, a data splitting node is established and the data import node is connected to it, so as to split the imported data table into a training set and a validation set; the data records in the training set are converted into machine learning samples for learning the model, and the data records in the validation set are converted into test samples for verifying the effect of the learned model.
  • the data splitting node may be set in response to a user operation so as to split the imported data table into a training set and a validation set in a set manner.
  • In the third step, two feature generation nodes are established and the data splitting node is connected to each of them, so as to perform feature generation separately on the training set and the validation set output by the data splitting node; for example, by default the left output of the data splitting node is the training set and the right output is the validation set.
  • the feature generation node may be set in response to a user operation, for example, a target value field, a feature type corresponding to the non-target value field, a metric of feature importance, and the like may be specified.
  • In the fourth step, a machine learning algorithm (for example, logistic regression) node (that is, a model training node) is established, and the left feature generation node is connected to the logistic regression node, so as to train a machine learning model based on the machine learning samples using a logistic regression algorithm.
  • the logistic regression node can be set in response to user operations to train the machine learning model in accordance with the set logistic regression algorithm.
  • In the fifth step, a model prediction node is established, and the logistic regression node and the right feature generation node are connected to the model prediction node, so as to verify the effect of the trained machine learning model based on the test samples.
  • the model prediction node can be set in response to user operations to verify the effects of the machine learning model in accordance with the set verification mode.
  • after the DAG graph is built, the entire DAG graph can be run according to the user's instructions.
  • the method of automatically generating the features of the machine learning samples of the above-described exemplary embodiments may be automatically performed upon execution of the feature generation node.
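  • The five-step DAG above can be sketched as a dependency graph executed in topological order; the node names mirror the steps described, and the traversal is a generic sketch rather than the platform's actual scheduler:

```python
# Dependency graph for the example DAG: each node lists the nodes whose
# output it consumes. Node names mirror the five steps above.
dag = {
    "data_import":   [],
    "data_split":    ["data_import"],
    "feature_gen_L": ["data_split"],   # left output: training set
    "feature_gen_R": ["data_split"],   # right output: validation set
    "logistic_reg":  ["feature_gen_L"],
    "model_predict": ["logistic_reg", "feature_gen_R"],
}

def topological_order(graph):
    """Return an execution order in which every node runs after its inputs."""
    order, done = [], set()
    def visit(node):
        if node in done:
            return
        for dep in graph[node]:
            visit(dep)
        done.add(node)
        order.append(node)
    for node in graph:
        visit(node)
    return order

run_order = topological_order(dag)
print(run_order)
```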
  • FIG. 7 illustrates a block diagram of a system that automatically generates features of machine learning samples, in accordance with an exemplary embodiment of the present disclosure.
  • a system for automatically generating features of a machine learning sample includes: a data table acquiring device 10, a declaring device 20, a unit feature generating device 30, a combined feature generating device 40, and a feature acquiring device 50.
  • the data table obtaining apparatus 10 is configured to acquire a data table specified by the user, wherein one row of the data table corresponds to one data record, and one column of the data table corresponds to one field.
  • the declaring device 20 is configured to declare a feature type corresponding to each non-target value field in the data table, wherein the feature type includes a discrete feature and/or a continuous feature.
  • a non-target value field can be obtained by removing a user-specified target value field from all fields in the data table.
  • the declaring device 20 may, automatically or according to a user's indication, declare all non-target value fields as discrete features, or declare each non-target value field as a discrete or continuous feature corresponding to its field value data type.
  • the unit feature generating means 30 is for processing each non-target value field into a unit feature according to the declared feature type.
  • for each non-target value field whose field value data type is continuous and which is declared as a discrete feature, the unit feature generation device 30 may perform one or more binning operations to obtain one or more corresponding binned features, and take the resulting binned features as a whole as one unit feature.
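As a minimal sketch of this step (the helper names `bin_by_boundaries` and `unit_feature_for_field` are hypothetical, and boundary-based binning stands in for whatever binning modes an implementation supports), several binning operations on one continuous field can be grouped into a single unit feature:

```python
import bisect

def bin_by_boundaries(values, boundaries):
    # One binning operation: map each value to the index of the interval
    # delimited by the given (sorted) boundary points.
    return [bisect.bisect_right(boundaries, v) for v in values]

def unit_feature_for_field(values, binning_ops):
    # Perform several binning operations on one continuous field; the resulting
    # group of binned features, taken as a whole, is this field's unit feature.
    binned = [bin_by_boundaries(values, b) for b in binning_ops]
    return list(zip(*binned))  # one tuple of bin indices per data record

ages = [18, 25, 34, 61]
unit_feature = unit_feature_for_field(ages, [[30], [20, 40, 60]])
```

Here each data record ends up with a tuple of bin indices, one index per binning operation, mirroring the "binned features taken as a whole" formulation above.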
  • the combined feature generation device 40 is configured to perform feature combination based on the generated unit features to generate a combined feature.
  • the combined feature generating device 40 may include a candidate combined feature acquiring unit (not shown) and a combined feature screening unit (not shown).
  • the candidate combined feature acquiring unit is configured to perform various combinations on all the generated unit features to obtain candidate combined features, or to perform various combinations on those unit features with higher feature importance among all the generated unit features to obtain candidate combined features.
  • the combined feature screening unit is configured to filter the combined features from the candidate combined features by measuring the effects of the machine learning model corresponding to each of the candidate combined features.
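A minimal sketch of this screening step follows. It is an assumption-laden illustration: a majority-vote rule per combined value stands in for training a real machine learning model per candidate, and all names are hypothetical.

```python
from itertools import combinations

def candidate_combinations(unit_features):
    # Pairwise combinations of unit feature columns as candidate combined features.
    names = list(unit_features)
    return [(a, b) for a, b in combinations(names, 2)]

def effect_of(combo, unit_features, labels):
    # Stand-in for "effect of the machine learning model corresponding to the
    # candidate combined feature": here, training-set accuracy of a rule that
    # predicts the majority label within each combined feature value.
    values = list(zip(unit_features[combo[0]], unit_features[combo[1]]))
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    majority = {v: max(set(ys), key=ys.count) for v, ys in by_value.items()}
    return sum(majority[v] == y for v, y in zip(values, labels)) / len(labels)

def screen_combined_features(unit_features, labels, threshold=0.8):
    # Keep only the candidates whose corresponding model effect passes the bar.
    return [c for c in candidate_combinations(unit_features)
            if effect_of(c, unit_features, labels) >= threshold]

unit_features = {"city": [0, 0, 1, 1], "gender": [0, 1, 0, 1], "noise": [0, 0, 0, 0]}
labels = [0, 1, 1, 0]
kept = screen_combined_features(unit_features, labels, threshold=1.0)
```

In this toy data the label is an XOR of `city` and `gender`, so only their combination attains a perfect effect and survives the screening.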
  • Feature acquisition device 50 is operative to derive features of the machine learning samples based on the generated unit features and combined features.
  • feature acquisition device 50 may treat all of the generated unit features and all of the combined features as features of a machine learning sample.
  • the feature acquisition device 50 may use, as features of the machine learning sample, the features with higher feature importance among all the generated unit features and all the generated combined features.
  • the feature acquisition device 50 may use, as features of the machine learning sample, the unit features with higher feature importance among all the generated unit features, together with all the generated combined features.
  • the feature acquisition device 50 may use, as features of the machine learning sample, the combined features with higher feature importance among all the generated combined features, together with all the generated unit features.
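The importance-based alternatives above can be sketched generically. The helper below is hypothetical; the importance scores are assumed to come from per-feature model evaluation as described earlier (for example, the AUC of a model trained on that feature alone).

```python
def select_features(importances, top_n=None, min_importance=None):
    # importances: mapping from feature name to its feature-importance score.
    # Rank from high to low, then optionally keep only scores above a threshold
    # and/or only the top N features.
    ranked = sorted(importances, key=importances.get, reverse=True)
    if min_importance is not None:
        ranked = [f for f in ranked if importances[f] >= min_importance]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked

importances = {"age_bin": 0.74, "city": 0.61, "age_bin*city": 0.82, "noise": 0.50}
top2 = select_features(importances, top_n=2)
above = select_features(importances, min_importance=0.6)
```

The same helper covers both the "predetermined number" and the "threshold range" variants described above.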
  • a system for automatically generating features of a machine learning sample may further include: a display device (not shown) for displaying to a user the features of the machine learning sample obtained by the feature acquisition device 50. Further, as an example, the display device can also display the feature importance of each feature to the user.
  • a system for automatically generating features of a machine learning sample may further include: an application device (not shown) for directly applying the features of the machine learning sample obtained by the feature acquisition device 50 to subsequent machine learning steps.
  • a system that automatically generates features of machine learning samples may automatically perform operations by initiating an operator corresponding to the automatic feature generation step.
  • the operator may correspond to a node in a directed acyclic graph corresponding to a machine learning flow.
  • a system for automatically generating features of a machine learning sample may further include: a reminding device (not shown) for providing an exception reminder when the operator is started without the user having specified a target value field.
  • the devices included in the system for automatically generating features of machine learning samples may each be configured as software, hardware, firmware, or any combination thereof that performs a specific function.
  • these devices may correspond to dedicated integrated circuits, to pure software code, or to modules in which software and hardware are combined.
  • one or more of the functions implemented by these devices can also be performed collectively by components in a physical entity device (eg, a processor, a client or a server, etc.).
  • a method of automatically generating features of machine learning samples may be implemented by a program recorded on a computer-readable storage medium; for example, according to an exemplary embodiment of the present disclosure, there may be provided a computer-readable storage medium storing instructions, wherein, when the instructions are executed by at least one computing device, the at least one computing device is caused to perform: acquiring a user-specified data table, wherein a row of the data table corresponds to a data record and a column of the data table corresponds to a field; declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features and/or continuous features; processing each non-target value field into a unit feature according to the declared feature type; performing feature combination based on the generated unit features to generate combined features; and deriving the features of the machine learning sample based on the generated unit features and combined features.
  • when the instructions are executed by at least one computing device, the at least one computing device is also caused to perform the method of automatically generating features of machine learning samples referred to in any of the embodiments above.
  • the computer program in the computer-readable storage medium described above can be executed in an environment deployed in computer equipment such as a processor, a client, a host, a proxy device, a server, etc.; for example, it may be executed by at least one computing device located in a stand-alone environment or a distributed cluster environment, where, by way of example, the computing device may be a computer, a processor, a computing unit (or module), a client, a host, a proxy device, a server, and the like.
  • the computer program can also be used to perform additional steps in addition to the above steps, or to perform more specific processing when performing the above steps; the contents of these additional steps and further processing have been described with reference to FIGS. 1 through 6 and, in order to avoid repetition, will not be described again here.
  • a system for automatically generating features of machine learning samples may rely entirely on the running of a computer program to implement the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the entire system is called through a specialized software package (for example, a lib library) to implement the corresponding functions.
  • the respective devices included in the system for automatically generating the features of the machine learning samples may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof.
  • the program code or code segments for performing the corresponding operations may be stored in a computer-readable storage medium, so that the processor can perform the corresponding operations by reading and running the corresponding program code or code segments.
  • a system may be provided that includes at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps for automatically generating features of machine learning samples: acquiring a user-specified data table, wherein one row of the data table corresponds to one data record and one column of the data table corresponds to one field; declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features, or continuous features, or both discrete and continuous features; processing each non-target value field into a unit feature according to the declared feature type; performing feature combination based on the generated unit features to generate combined features; and deriving the features of the machine learning sample based on the generated unit features and combined features.
  • the system may constitute a stand-alone computing environment or a distributed computing environment, and includes at least one computing device and at least one storage device.
  • the computing device may be a general-purpose or dedicated computer, a processor, etc.; it may be a unit that performs processing purely in software, or an entity combining hardware and software. That is, the computing device can be implemented as a computer, a processor, a computing unit (or module), a client, a host, a proxy device, a server, and the like.
  • the storage device can be a physical storage device or a logically partitioned storage unit that can be operatively coupled to the computing device or can communicate with each other, for example, through an I/O port, a network connection, or the like.
  • an exemplary embodiment of the present disclosure can also be implemented as a computing device including a storage component and a processor, the storage component having a set of computer-executable instructions stored therein; when the set of computer-executable instructions is executed by the processor, a method of automatically generating features of machine learning samples is performed.
  • the computing device can be deployed in a server or client, or can be deployed on a node device in a distributed network environment.
  • the computing device can be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the set of instructions described above.
  • the computing device does not have to be a single computing device, but can be any collection of devices or circuits capable of executing the above described instructions (or sets of instructions), either alone or in combination.
  • the computing device can also be part of an integrated control system or system manager, or can be configured as a portable electronic device interfaced locally or remotely (eg, via wireless transmission).
  • the processor can include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor.
  • the processor may also include, by way of example and not limitation, an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
  • Some of the operations described in the method of automatically generating features of machine learning samples according to an exemplary embodiment of the present disclosure may be implemented in software, some may be implemented in hardware, and these operations may also be implemented by a combination of hardware and software.
  • the processor can execute instructions or code stored in one of the storage components, wherein the storage component can also store data.
  • the instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
  • the storage component can be integrated with the processor, for example, by arranging the RAM or flash memory within an integrated circuit microprocessor or the like.
  • the storage components can include separate devices such as external disk drives, storage arrays, or other storage devices that can be used with any database system.
  • the storage component and processor may be operatively coupled or may be in communication with one another, such as through an I/O port, a network connection, etc., such that the processor can read the file stored in the storage component.
  • the computing device can also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device can be connected to each other via a bus and/or a network.
  • a computing device for automatically generating features of machine learning samples may include a storage component and a processor, wherein the storage component stores a set of computer-executable instructions, and when the set of computer-executable instructions is executed by the processor, the following steps are performed: acquiring a data table specified by the user, wherein one row of the data table corresponds to one data record and one column of the data table corresponds to one field; declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features and/or continuous features; processing each non-target value field into a unit feature according to the declared feature type; performing feature combination based on the generated unit features to generate combined features; and deriving the features of the machine learning sample based on the generated unit features and combined features.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a method and system for automatically generating features of machine learning samples. The method includes: acquiring a data table specified by a user, wherein one row of the data table corresponds to one data record and one column of the data table corresponds to one field; declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features, or continuous features, or both discrete and continuous features; processing each non-target value field into a unit feature according to the declared feature type; performing feature combination based on the generated unit features to generate combined features; and deriving the features of the machine learning sample based on the generated unit features and combined features.

Description

Method and System for Automatically Generating Features of Machine Learning Samples
Technical Field
The present disclosure relates generally to the field of artificial intelligence and, more particularly, to a method and system for automatically generating features of machine learning samples.
Background
With the emergence of massive data, people tend to use machine learning techniques to mine value from the data.
The basic process of training a machine learning model mainly includes:
1. importing a data set (for example, a data table) containing historical data records;
2. completing feature engineering, in which various kinds of processing are applied to the attribute information of the data records in the data set so as to obtain individual features, and a feature vector composed of these features can serve as a machine learning sample;
3. training the model, in which a model is learned based on the machine learning samples obtained through feature engineering, according to a configured machine learning algorithm (for example, a logistic regression algorithm, a decision tree algorithm, a neural network algorithm, etc.).
In the above process, the processing that produces the features is very important, as it affects the quality of the model. Each data record in the data table may include multiple pieces of attribute information (i.e., fields), and a feature may indicate the result of various kinds of field processing (or operations), such as a field itself or a combination of fields, so as to better reflect the data distribution as well as the intrinsic associations and latent meanings among fields. The quality of feature engineering therefore directly determines how accurately the machine learning problem is characterized, and in turn affects the quality of the model.
In the existing machine learning process, the features of machine learning samples are generally preset by the user and input into the machine learning platform system. That is, the user is required to set the features of the machine learning samples in advance. On the one hand, this requires the user to have a deep understanding of the business scenario, i.e., the user sets the features based on business experience; on the other hand, the amount of data used in machine learning is generally large, and the user sometimes cannot analyze the data comprehensively, which leads to setting some ineffective features. To improve the effect of the features of the machine learning samples, the user then has to keep experimenting, and when facing large data volumes and high-dimensional features, such work takes a long time. In this situation, the user is not only required to have a deep understanding of the business scenario, which increases the user's workload, but the efficiency of machine learning is also reduced.
Summary
An exemplary embodiment of the present disclosure provides a method and system for automatically generating features of machine learning samples, so as to solve the problem in the prior art that features of machine learning samples cannot be generated conveniently.
According to an exemplary embodiment of the present disclosure, there is provided a method of automatically generating features of machine learning samples, including: (A) acquiring a data table specified by a user, wherein one row of the data table corresponds to one data record and one column of the data table corresponds to one field; (B) declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features and/or continuous features; (C) processing each non-target value field into a unit feature according to the declared feature type; (D) performing feature combination based on the generated unit features to generate combined features; and (E) deriving the features of the machine learning sample based on the generated unit features and combined features.
According to another exemplary embodiment of the present disclosure, there is provided a system for automatically generating features of machine learning samples, including: a data table acquiring device for acquiring a data table specified by a user, wherein one row of the data table corresponds to one data record and one column corresponds to one field; a declaring device for declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features and/or continuous features; a unit feature generating device for processing each non-target value field into a unit feature according to the declared feature type; a combined feature generating device for performing feature combination based on the generated unit features to generate combined features; and a feature acquiring device for deriving the features of the machine learning sample based on the generated unit features and combined features.
According to another exemplary embodiment of the present disclosure, there is provided a computer-readable medium for automatically generating features of machine learning samples, wherein a computer program for performing the method of automatically generating features of machine learning samples as described above is recorded on the computer-readable medium.
According to another exemplary embodiment of the present disclosure, there is provided a computing device for automatically generating features of machine learning samples, including a storage component and a processor, wherein a set of computer-executable instructions is stored in the storage component, and when the set of computer-executable instructions is executed by the processor, the method of automatically generating features of machine learning samples as described above is performed.
In the method and system for automatically generating features of machine learning samples according to the exemplary embodiments of the present disclosure, the features of machine learning samples can be generated automatically based on a data table, which lowers the threshold for using feature engineering, improves its usability, and increases the efficiency of feature engineering.
Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows, and will in part be clear from the description, or may be learned by practice of the general inventive concept.
Brief Description of the Drawings
The above and other objects and features of the exemplary embodiments of the present disclosure will become clearer from the following description taken in conjunction with the accompanying drawings that exemplarily show the embodiments, in which:
FIG. 1 shows a flowchart of a method of automatically generating features of machine learning samples according to an exemplary embodiment of the present disclosure;
FIG. 2 shows an example in which a user specifies the feature types corresponding to non-target value fields according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a flowchart of a method of automatically generating features of machine learning samples according to another exemplary embodiment of the present disclosure;
FIG. 4 shows a flowchart of a method of automatically generating features of machine learning samples according to another exemplary embodiment of the present disclosure;
FIG. 5 shows a flowchart of a method of automatically generating features of machine learning samples according to another exemplary embodiment of the present disclosure;
FIG. 6 shows an example of a DAG graph for training a machine learning model according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a block diagram of a system for automatically generating features of machine learning samples according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like parts throughout. The embodiments are described below with reference to the drawings in order to explain the present disclosure.
Here, machine learning is an inevitable product of the development of artificial intelligence research to a certain stage, devoted to improving the performance of a system itself by computational means and through the use of experience. In a computer system, "experience" usually exists in the form of "data", and a "model" can be produced from data by a machine learning algorithm; that is, by providing empirical data to a machine learning algorithm, a model can be produced based on this empirical data, and when facing a new situation the model provides a corresponding judgment, i.e., a prediction result. Whether a machine learning model is being trained or a trained machine learning model is being used for prediction, the data needs to be converted into machine learning samples that include various features. Machine learning may be implemented in the form of "supervised learning", "unsupervised learning" or "semi-supervised learning"; it should be noted that the exemplary embodiments of the present disclosure impose no particular restriction on the specific machine learning algorithm. It should also be noted that other means, such as statistical algorithms, may be combined in the process of training and applying the model.
It should be noted here that "and/or" as used in the present disclosure covers three parallel cases. For example, "including A and/or B" means the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing step one and/or step two" means the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
FIG. 1 shows a flowchart of a method of automatically generating features of machine learning samples according to an exemplary embodiment of the present disclosure. Here, as an example, the method may be executed by a computer program, or by a dedicated system or computing device for automatically generating features of machine learning samples.
As an example, the method may be executed automatically by starting an operator corresponding to an automatic feature generation step. In other words, when the operator corresponding to the automatic feature generation step is started, the method is executed automatically. Further, as an example, the operator corresponds to a node in a directed acyclic graph (DAG graph) corresponding to a machine learning flow. For example, the DAG graph corresponding to the machine learning flow may include a feature generation node, and when the entire DAG graph is run and execution reaches the feature generation node, the method is executed automatically. A DAG graph for training a machine learning model according to an exemplary embodiment of the present disclosure is described in detail below in conjunction with FIG. 6.
Referring to FIG. 1, in step S101, a data table specified by a user is acquired. Here, one row of the data table corresponds to one data record and one column corresponds to one field. In other words, each data record in the data table has field values corresponding to the respective fields. As an example, each data record may be regarded as a description of an event or object, corresponding to an example or instance, and each field may describe the behavior or property of the event or object in one aspect (for example, name, age, occupation, etc.).
As an example, a graphical interface for specifying the data table may be provided to the user, and the data table specified by the user may be determined according to the input operation performed by the user on the graphical interface.
In step S102, the feature type corresponding to each non-target value field in the data table is declared, wherein the feature type includes discrete features and/or continuous features.
Here, the target value field is the field corresponding to the label to be predicted using machine learning techniques, i.e., the field corresponding to the prediction target in the case of supervised learning, while the non-target value fields are the fields in the data table other than the target value field.
In the case of supervised learning, as an example, the non-target value fields may be obtained by removing the user-specified target value field from all the fields in the data table. As an example, a graphical interface for specifying the target value field may be provided to the user, and the target value field specified by the user may be determined according to the input operation performed by the user on the graphical interface. Further, as an example, when the operator is started without the user having specified a target value field, an exception reminder may be provided to remind the user to specify a target value field.
In addition, it should be understood that the data table may or may not include a target value field.
A continuous feature is a kind of feature opposed to a discrete feature (for example, a categorical feature); its values may be numerical values with a certain continuity, for example, age, amount, etc. In contrast, as an example, the values of a discrete feature have no continuity; for example, it may be an unordered categorical feature such as "from Beijing", "from Shanghai" or "from Tianjin", "gender is male", "gender is female", etc.
As an example, all non-target value fields may be declared as discrete features, automatically or according to a user's indication; alternatively, each non-target value field may be declared as a discrete feature or a continuous feature corresponding to its field value data type.
As an example, the field value data type of a field may be continuous (for example, a numerical type such as integer int) or discrete (for example, a text type such as string). As an example, declaring each non-target value field as a discrete or continuous feature corresponding to its field value data type may include: declaring the non-target value fields in the data table whose field value data type is discrete as discrete features, and declaring the non-target value fields whose field value data type is continuous as continuous features.
As an example, a graphical interface for specifying the feature types corresponding to the non-target value fields may be provided to the user, and according to the input operation performed by the user on the graphical interface, all non-target value fields may be declared as discrete features, or each non-target value field may be declared as a discrete or continuous feature corresponding to its field value data type.
An example in which a user specifies the feature types corresponding to non-target value fields through a graphical interface according to an exemplary embodiment of the present disclosure is described below in conjunction with FIG. 2. As shown in FIG. 2, the graphical interface for specifying the feature types corresponding to non-target value fields may display a radio button "all discrete" and a radio button "discrete + continuous" (only one of the two buttons can be selected). In response to the user's selection of the radio button "all discrete", all non-target value fields in the data table may be declared as discrete features; in response to the user's selection of the radio button "discrete + continuous", each non-target value field may be declared as a corresponding discrete or continuous feature according to its data type. Here, the data type of a field may be determined automatically according to the characteristics of its field values, and the field is then declared as a discrete or continuous feature according to whether the data type is discrete or continuous. In addition, the graphical interface may also display a control for specifying the target value field, and the user may specify the target value field by operating this control. Moreover, the left side of the graphical interface may display the field name and field value data type of each field in the data table.
Referring back to FIG. 1, in step S103, each non-target value field is processed into a unit feature according to the declared feature type. In other words, each non-target value field is separately processed into one unit feature according to the declared feature type.
As an example, discretization may be performed on each non-target value field whose field value data type is continuous and which is declared as a discrete feature, so as to obtain a unit feature.
It should be understood that a unit feature here means a feature corresponding to a single field, which may itself have one or more dimensions depending on how its values are defined. Optionally, for each non-target value field whose field value data type is continuous and which is declared as a discrete feature, one or more binning operations may be performed to obtain one or more corresponding binned features, and the resulting binned features, taken as a whole, serve as one unit feature.
Here, a binning operation is a specific way of discretizing a continuous field, i.e., dividing the value range of the continuous field into multiple intervals (i.e., multiple bins) and determining the corresponding binned feature values based on the divided bins. Binning operations can be roughly divided into supervised binning and unsupervised binning, each of which includes some specific binning modes; for example, supervised binning may include minimum-entropy binning, minimum-description-length binning, etc., while unsupervised binning may include equal-width binning, equal-depth binning, binning based on k-means clustering, etc. Under each binning mode, corresponding binning parameters, such as width and depth, can be set.
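Two of the unsupervised binning modes named above, equal-width and equal-depth binning, can be sketched as follows. This is an illustrative implementation, not the embodiment's exact procedure.

```python
def equal_width_bins(values, n_bins):
    # Partition the value range [min, max] into n_bins intervals of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant field
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_depth_bins(values, n_bins):
    # Partition by rank so each bin receives (roughly) the same number of records.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins

amounts = [1, 2, 3, 4, 100, 200]
width_feature = equal_width_bins(amounts, 2)   # skewed data crowds into bin 0
depth_feature = equal_depth_bins(amounts, 2)   # each bin gets three records
```

On skewed data such as monetary amounts, the two modes produce visibly different bin assignments, which is one reason an implementation may run several binning operations per field.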
It should be noted that, according to the exemplary embodiments of the present disclosure, the binning operation performed on a non-target value field whose field value data type is continuous and which is declared as a discrete feature is not restricted in the kind of binning mode or in the parameters of the binning operation, nor is the specific representation of the resulting binned features restricted.
As an example, the multiple binning operations performed on a non-target value field whose field value data type is continuous and which is declared as a discrete feature may differ in binning mode and/or binning parameters. For example, the multiple binning operations may be binning operations of the same kind but with different operation parameters (for example, depth, width, etc.), or binning operations of different kinds. Accordingly, each binning operation yields one binned feature, and these binned features together form a binned-group feature; the binned-group feature can reflect different binning operations, thereby improving the effectiveness of the machine learning material and providing a better basis for the training/prediction of the machine learning model.
That is to say, according to the exemplary embodiments of the present disclosure, at least one binning operation may be performed on each non-target value field whose field value data type is continuous and which is declared as a discrete feature to obtain at least one corresponding binned feature; the feature corresponding to the field is obtained with each binned feature as a constituent element, and this feature serves as the unit feature. Here, it should be understood that performing the binning operation causes such a field to be discretely placed into corresponding specific bins. In the multiple binned features after conversion, each dimension may indicate either whether a discrete value of the continuous feature has been assigned to the bin (for example, "0" or "1") or a specific continuous value (for example, the actual feature value of the continuous feature or its normalized value, or the average value, median value or boundary value of the continuous features in the bin, etc.). Accordingly, when the discrete values (for example, for classification problems) or continuous values (for example, for regression problems) of the respective dimensions are applied in machine learning, combinations between discrete values (for example, Cartesian products, etc.) or combinations between continuous values (for example, arithmetic operation combinations, etc.) may be performed.
In step S104, feature combination is performed based on the generated unit features so as to generate combined features.
As an example, various combinations may be performed on all the generated unit features to obtain candidate combined features, or various combinations may be performed on those unit features with higher feature importance among all the generated unit features to obtain candidate combined features; then, the combined features may be screened out from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature. Specifically, a machine learning model corresponding to each candidate combined feature may be trained; since the effect of the corresponding machine learning model can reflect the feature importance (for example, predictive power) of the candidate combined feature, the combined features can be screened out from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature. For example, the better the effect of the machine learning model, the more likely the corresponding candidate combined feature is to be selected as a combined feature. As an example, a specified model evaluation metric may be used to evaluate the effect of the machine learning model corresponding to each candidate combined feature, and the model evaluation metric may be specified automatically or according to a user's indication.
As an example, the model evaluation metric may be AUC (Area Under the ROC (Receiver Operating Characteristic) Curve), MAE (Mean Absolute Error), the logarithmic loss function (logloss), or the like.
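For concreteness, the three model evaluation metrics named above can be computed as in the self-contained sketch below; a production system would more likely use library implementations.

```python
import math

def auc(labels, scores):
    # Area under the ROC curve via the rank-statistic (Mann-Whitney) form:
    # the probability that a random positive is scored above a random negative.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mae(targets, predictions):
    # Mean absolute error.
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)

def logloss(labels, probabilities, eps=1e-15):
    # Logarithmic loss for binary labels, with probabilities clipped away from 0/1.
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
```

An AUC of 0.5 corresponds to random scoring and 1.0 to perfect ranking, which is why AUC values are a natural yardstick for feature importance in the screening described above.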
As an example, various combinations may be performed on those unit features, among all the unit features, whose feature importance satisfies a first preset condition to obtain candidate combined features. For example, various combinations may be performed on the unit features whose feature importance lies within a first preset threshold range; alternatively, all unit features may be sorted from high to low by feature importance, and various combinations performed on the top first predetermined number of unit features to obtain candidate combined features.
As an example, the feature importance of a unit feature may be determined by measuring the effect of the machine learning model corresponding to the feature: the better the effect of the corresponding machine learning model, the higher the feature importance of the unit feature. For example, the evaluation value of the machine learning model corresponding to the feature with respect to a model evaluation metric may be used to measure the feature importance of the unit feature. Here, as an example, the model evaluation metric may be specified automatically or according to a user's indication.
In step S105, the features of the machine learning sample are derived based on the generated unit features and combined features.
As an example, all the generated unit features and all the combined features may be taken as the features of the machine learning sample.
As another example, among all the generated unit features and all the combined features, the features with higher feature importance may be taken as the features of the machine learning sample. As an example, among all the unit features and all the combined features, the features whose feature importance satisfies a second preset condition may be taken as the features of the machine learning sample; for example, the features whose feature importance lies within a second preset threshold range may be taken, or all unit features and all combined features may be jointly sorted from high to low by feature importance and the top second predetermined number of features taken as the features of the machine learning sample.
As another example, the unit features with higher feature importance among all the generated unit features, together with all the generated combined features, may be taken as the features of the machine learning sample. As an example, all the combined features together with the unit features whose feature importance satisfies a third preset condition may be taken; for example, all the combined features together with the unit features whose feature importance lies within a third preset threshold range may be taken, or all unit features may be sorted from high to low by feature importance and the top third predetermined number of unit features, together with all the combined features, taken as the features of the machine learning sample.
As another example, all the generated unit features, together with the combined features with higher feature importance among all the generated combined features, may be taken as the features of the machine learning sample. As an example, all the unit features together with the combined features whose feature importance satisfies a fourth preset condition may be taken; for example, all the unit features together with the combined features whose feature importance lies within a fourth preset threshold range may be taken, or all combined features may be sorted from high to low by feature importance and the top fourth predetermined number of combined features, together with all the unit features, taken as the features of the machine learning sample.
In addition, as an example, the method of automatically generating features of machine learning samples according to the exemplary embodiment of the present disclosure may further include: after step S105, displaying the obtained features of the machine learning sample to the user. Further, the feature importance of each feature may also be displayed to the user.
As an example, the method may further include: after step S105, directly applying the obtained features of the machine learning sample to subsequent machine learning steps. For example, a model may be learned directly based on the obtained features of the machine learning sample.
FIG. 3 shows a flowchart of a method of automatically generating features of machine learning samples according to another exemplary embodiment of the present disclosure.
Referring to FIG. 3, in step S201, a data table specified by a user is acquired.
In step S202, the feature type corresponding to each non-target value field in the data table is declared.
In step S203, each non-target value field is processed into a unit feature according to the declared feature type.
In step S204, various combinations are performed on all the generated unit features to obtain candidate combined features, and the combined features are screened out from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
In step S205, all the generated unit features and all the combined features are taken as the features of the machine learning sample.
FIG. 4 shows a flowchart of a method of automatically generating features of machine learning samples according to another exemplary embodiment of the present disclosure.
Referring to FIG. 4, in step S301, a data table specified by a user is acquired.
In step S302, the feature type corresponding to each non-target value field in the data table is declared.
In step S303, each non-target value field is processed into a unit feature according to the declared feature type.
In step S304, various combinations are performed on those unit features with higher feature importance among all the generated unit features to obtain candidate combined features, and the combined features are screened out from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
In step S305, the unit features with higher feature importance among all the generated unit features, together with all the generated combined features, are taken as the features of the machine learning sample.
As an example, the evaluation value of the machine learning model corresponding to a feature with respect to the model evaluation metric AUC may be used to measure the feature importance of the feature. In step S304, various combinations may be performed on those unit features, among all the generated unit features, whose corresponding AUC value is greater than 0.5 and less than 1 to obtain candidate combined features; and in step S305, the unit features whose corresponding AUC value is greater than 0.5 and less than 1 among all the generated unit features, together with all the generated combined features, may be taken as the features of the machine learning sample.
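The AUC-based screening rule in this example (keep the features whose AUC is greater than 0.5 and less than 1) can be sketched as follows; the feature names are hypothetical, and the comment's reading of the two bounds is an interpretation rather than something the embodiment states.

```python
def filter_by_auc(feature_aucs, low=0.5, high=1.0):
    # Keep the features whose single-feature model AUC lies strictly inside
    # (low, high). One plausible reading of the bounds: AUC <= 0.5 means no
    # better than random, and AUC == 1 suggests a degenerate or leaky feature.
    return [f for f, a in feature_aucs.items() if low < a < high]

feature_aucs = {"age_bin": 0.71, "noise": 0.50, "leaky_id": 1.0, "city": 0.66}
kept = filter_by_auc(feature_aucs)
```

Only `age_bin` and `city` pass: `noise` sits exactly at 0.5 and `leaky_id` exactly at 1.0, so both are excluded by the strict inequalities.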
FIG. 5 shows a flowchart of a method of automatically generating features of machine learning samples according to another exemplary embodiment of the present disclosure.
Referring to FIG. 5, in step S401, a data table specified by a user is acquired.
In step S402, the feature type corresponding to each non-target value field in the data table is declared.
In step S403, each non-target value field is processed into a unit feature according to the declared feature type.
In step S404, various combinations are performed on all the generated unit features to obtain candidate combined features, and the combined features are screened out from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
In step S405, among all the generated unit features and all the combined features, the features with higher feature importance are taken as the features of the machine learning sample.
As an example, the evaluation value of the machine learning model corresponding to a feature with respect to the model evaluation metric AUC may be used to measure the feature importance of the feature; in step S405, among all the generated unit features and all the combined features, the features whose corresponding AUC value is greater than 0.5 and less than 1 may be taken as the features of the machine learning sample.
Some exemplary methods of automatically generating features of machine learning samples have been listed above; however, those skilled in the art should understand that the exemplary embodiments of the present disclosure are not limited to these methods, and any appropriate way of generating or screening features (unit features, candidate combined features or combined features) may be adopted.
According to an exemplary embodiment of the present disclosure, the machine learning flow may be executed in the form of a directed acyclic graph, and the machine learning flow may cover all or part of the steps for machine learning model training, testing or prediction. For example, a DAG graph including at least one of the following steps may be established for machine learning model training: a historical data import step, a data splitting step, a feature generation step, a logistic regression step and a model prediction step. That is, each of the above steps may be executed as a node in the DAG graph.
FIG. 6 shows an example of a DAG graph for training a machine learning model according to an exemplary embodiment of the present disclosure.
Referring to FIG. 6, step 1: establish a data import node. As an example, the data import node may be set in response to a user operation so as to acquire a banking data table named "bank" (i.e., import the data table into the machine learning platform), wherein the data table may contain multiple historical data records.
Step 2: establish a data splitting node, and connect the data import node to the data splitting node so as to split the imported data table into a training set and a verification set, wherein the data records in the training set are converted into machine learning samples to learn a model, and the data records in the verification set are converted into test samples to verify the effect of the learned model. The data splitting node may be set in response to a user operation so as to split the imported data table into a training set and a verification set in the configured manner.
Step 3: establish two feature generation nodes, and connect the data splitting node to each of the two feature generation nodes, so that feature generation is performed separately on the training set and the verification set output by the data splitting node; for example, by default the left output of the data splitting node is the training set and the right output is the verification set. It should be understood that feature generation for the machine learning samples and for the test samples is carried out in a corresponding, consistent way. The feature generation nodes may be set in response to a user operation; for example, the target value field, the feature types corresponding to the non-target value fields, the metric for measuring feature importance, etc. may be specified.
Step 4: establish an algorithm (for example, logistic regression) node (that is, a model training node), and connect the left feature generation node to the logistic regression node so as to train a machine learning model based on the machine learning samples using the logistic regression algorithm. The logistic regression node may be set in response to a user operation so as to train the machine learning model according to the configured logistic regression algorithm.
Step 5: establish a model prediction node, and connect the logistic regression node and the right feature generation node to the model prediction node so as to verify the effect of the trained machine learning model based on the test samples. The model prediction node may be set in response to a user operation so as to verify the effect of the machine learning model according to the configured verification mode.
After the DAG graph including the above steps is established, the entire DAG graph may be run according to the user's instruction. When execution reaches the feature generation node, the method of automatically generating features of machine learning samples of the above exemplary embodiments may be executed automatically.
FIG. 7 shows a block diagram of a system for automatically generating features of machine learning samples according to an exemplary embodiment of the present disclosure. As shown in FIG. 7, the system includes: a data table acquiring device 10, a declaring device 20, a unit feature generating device 30, a combined feature generating device 40, and a feature acquiring device 50.
Specifically, the data table acquiring device 10 is configured to acquire a data table specified by a user, wherein one row of the data table corresponds to one data record and one column corresponds to one field.
The declaring device 20 is configured to declare the feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features and/or continuous features.
As an example, the non-target value fields may be obtained by removing the user-specified target value field from all the fields in the data table.
As an example, the declaring device 20 may, automatically or according to a user's indication, declare all non-target value fields as discrete features, or declare each non-target value field as a discrete or continuous feature corresponding to its field value data type.
The unit feature generating device 30 is configured to process each non-target value field into a unit feature according to the declared feature type.
As an example, for each non-target value field whose field value data type is continuous and which is declared as a discrete feature, the unit feature generating device 30 may perform one or more binning operations to obtain one or more corresponding binned features, and take the resulting binned features as a whole as one unit feature.
The combined feature generating device 40 is configured to perform feature combination based on the generated unit features so as to generate combined features.
As an example, the combined feature generating device 40 may include a candidate combined feature acquiring unit (not shown) and a combined feature screening unit (not shown).
The candidate combined feature acquiring unit is configured to perform various combinations on all the generated unit features to obtain candidate combined features, or to perform various combinations on those unit features with higher feature importance among all the generated unit features to obtain candidate combined features.
The combined feature screening unit is configured to screen out the combined features from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
The feature acquiring device 50 is configured to derive the features of the machine learning sample based on the generated unit features and combined features.
As an example, the feature acquiring device 50 may take all the generated unit features and all the combined features as the features of the machine learning sample.
As another example, the feature acquiring device 50 may take, among all the generated unit features and all the combined features, the features with higher feature importance as the features of the machine learning sample.
As another example, the feature acquiring device 50 may take the unit features with higher feature importance among all the generated unit features, together with all the generated combined features, as the features of the machine learning sample.
As another example, the feature acquiring device 50 may take the combined features with higher feature importance among all the generated combined features, together with all the generated unit features, as the features of the machine learning sample.
As an example, the system for automatically generating features of machine learning samples according to the exemplary embodiment of the present disclosure may further include: a display device (not shown) for displaying to the user the features of the machine learning sample obtained by the feature acquiring device 50. Further, as an example, the display device may also display the feature importance of each feature to the user.
As an example, the system may further include: an application device (not shown) for directly applying the features of the machine learning sample obtained by the feature acquiring device 50 to subsequent machine learning steps.
As an example, the system for automatically generating features of machine learning samples according to the exemplary embodiment of the present disclosure may be made to execute operations automatically by starting an operator corresponding to the automatic feature generation step.
As an example, the operator may correspond to a node in a directed acyclic graph corresponding to a machine learning flow.
In addition, as an example, the system may further include: a reminding device (not shown) for providing an exception reminder when the operator is started without the user having specified a target value field.
It should be understood that the specific implementation of the system for automatically generating features of machine learning samples according to the exemplary embodiment of the present disclosure may be realized with reference to the related specific implementations described in conjunction with FIGS. 1 to 6, and will not be repeated here.
The devices included in the system according to the exemplary embodiment of the present disclosure may each be configured as software, hardware, firmware, or any combination thereof that performs a specific function. For example, these devices may correspond to dedicated integrated circuits, to pure software code, or to modules combining software and hardware. In addition, one or more functions implemented by these devices may also be performed collectively by components in a physical entity device (for example, a processor, a client, a server, etc.).
It should be understood that the method of automatically generating features of machine learning samples according to the exemplary embodiment of the present disclosure may be implemented by a program recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, there may be provided a computer-readable storage medium storing instructions, wherein, when the instructions are executed by at least one computing device, the at least one computing device is caused to perform: acquiring a data table specified by a user, wherein one row of the data table corresponds to one data record and one column corresponds to one field; declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features and/or continuous features; processing each non-target value field into a unit feature according to the declared feature type; performing feature combination based on the generated unit features to generate combined features; and deriving the features of the machine learning sample based on the generated unit features and combined features.
In addition, when the instructions are executed by at least one computing device, the at least one computing device is also caused to perform the method of automatically generating features of machine learning samples referred to in any of the preceding embodiments.
The computer program in the above computer-readable storage medium may be run in an environment deployed in computer equipment such as a processor, a client, a host, a proxy device or a server; for example, it may be run by at least one computing device located in a stand-alone environment or a distributed cluster environment. As examples, the computing device here may be a computer, a processor, a computing unit (or module), a client, a host, a proxy device, a server, etc. It should be noted that the computer program may also be used to perform additional steps in addition to the above steps, or to perform more specific processing when performing the above steps; the contents of these additional steps and further processing have been described with reference to FIGS. 1 to 6 and, in order to avoid repetition, will not be described again here.
It should be noted that the system for automatically generating features of machine learning samples according to the exemplary embodiment of the present disclosure may rely entirely on the running of a computer program to realize the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the entire system is invoked through a specialized software package (for example, a lib library) to realize the corresponding functions.
On the other hand, the devices included in the system according to the exemplary embodiment of the present disclosure may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable storage medium, so that a processor can perform the corresponding operations by reading and running the corresponding program code or code segments.
For example, according to an exemplary embodiment of the present disclosure, there may be provided a system including at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps for automatically generating features of machine learning samples: acquiring a data table specified by a user, wherein one row of the data table corresponds to one data record and one column corresponds to one field; declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features, or continuous features, or both discrete and continuous features; processing each non-target value field into a unit feature according to the declared feature type; performing feature combination based on the generated unit features to generate combined features; and deriving the features of the machine learning sample based on the generated unit features and combined features.
Here, the system may constitute a stand-alone computing environment or a distributed computing environment, and includes at least one computing device and at least one storage device. As an example, the computing device may be a general-purpose or dedicated computer, a processor, etc.; it may be a unit that performs processing purely in software, or an entity combining hardware and software. That is, the computing device may be implemented as a computer, a processor, a computing unit (or module), a client, a host, a proxy device, a server, etc. In addition, the storage device may be a physical storage device or a logically divided storage unit, which may be operatively coupled with the computing device or may communicate with it, for example, through an I/O port, a network connection, etc.
In addition, for example, an exemplary embodiment of the present disclosure may also be implemented as a computing device including a storage component and a processor, wherein a set of computer-executable instructions is stored in the storage component, and when the set of computer-executable instructions is executed by the processor, the method of automatically generating features of machine learning samples is performed.
Specifically, the computing device may be deployed in a server or a client, or on a node device in a distributed network environment. In addition, the computing device may be a PC computer, a tablet device, a personal digital assistant, a smartphone, a web application, or another device capable of executing the above instruction set.
Here, the computing device need not be a single computing device; it may also be any assembly of devices or circuits capable of executing the above instructions (or instruction set) alone or jointly. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device interconnected locally or remotely (for example, via wireless transmission) through an interface.
In the computing device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, etc.
Some of the operations described in the method of automatically generating features of machine learning samples according to the exemplary embodiment of the present disclosure may be implemented in software, some in hardware, and these operations may also be implemented by a combination of software and hardware.
The processor may run instructions or code stored in one of the storage components, wherein the storage component may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transmission protocol.
The storage component may be integrated with the processor, for example, with RAM or flash memory arranged within an integrated circuit microprocessor or the like. In addition, the storage component may include an independent device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, for example, through an I/O port, a network connection, etc., so that the processor can read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, a touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.
The operations involved in the method of automatically generating features of machine learning samples according to the exemplary embodiment of the present disclosure may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operated according to non-exact boundaries.
According to an exemplary embodiment of the present disclosure, a computing device for automatically generating features of machine learning samples may include a storage component and a processor, wherein a set of computer-executable instructions is stored in the storage component, and when the set of computer-executable instructions is executed by the processor, the following steps are performed: acquiring a data table specified by a user, wherein one row of the data table corresponds to one data record and one column corresponds to one field; declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features and/or continuous features; processing each non-target value field into a unit feature according to the declared feature type; performing feature combination based on the generated unit features to generate combined features; and deriving the features of the machine learning sample based on the generated unit features and combined features.
The exemplary embodiments of the present disclosure have been described above. It should be understood that the above description is merely exemplary and not exhaustive, and the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the scope of the claims.

Claims (26)

  1. A method of automatically generating features of machine learning samples by at least one computing device, comprising:
    acquiring a data table specified by a user, wherein one row of the data table corresponds to one data record and one column of the data table corresponds to one field;
    declaring a feature type corresponding to each non-target value field in the data table, wherein the feature type includes discrete features, or continuous features, or both discrete and continuous features;
    processing each non-target value field into a unit feature according to the declared feature type;
    performing feature combination based on the generated unit features to generate combined features; and
    deriving the features of the machine learning sample based on the generated unit features and combined features.
  2. The method of claim 1, wherein the method is executed automatically by starting an operator corresponding to an automatic feature generation step.
  3. The method of claim 2, wherein the operator corresponds to a node in a directed acyclic graph corresponding to a machine learning flow.
  4. The method of claim 3, wherein the non-target value fields are obtained by removing a user-specified target value field from all the fields in the data table.
  5. The method of claim 4, wherein an exception reminder is provided when the operator is started without the user having specified a target value field.
  6. The method of any one of claims 1-5, wherein declaring the feature type corresponding to each non-target value field in the data table comprises:
    declaring, automatically or according to a user's indication, all non-target value fields as discrete features, or declaring each non-target value field as a discrete or continuous feature corresponding to its field value data type.
  7. The method of any one of claims 1-5, wherein performing feature combination based on the generated unit features to generate combined features comprises:
    performing various combinations on all the generated unit features to obtain candidate combined features, or performing various combinations on those unit features with higher feature importance among all the generated unit features to obtain candidate combined features;
    screening out the combined features from the candidate combined features by measuring the effect of the machine learning model corresponding to each candidate combined feature.
  8. The method of any one of claims 1-5, wherein deriving the features of the machine learning sample based on the generated unit features and combined features comprises:
    taking all the generated unit features and all the combined features as the features of the machine learning sample;
    or taking, among all the generated unit features and all the combined features, the features with higher feature importance as the features of the machine learning sample;
    or taking the unit features with higher feature importance among all the generated unit features, together with all the generated combined features, as the features of the machine learning sample;
    or taking the combined features with higher feature importance among all the generated combined features, together with all the generated unit features, as the features of the machine learning sample.
  9. The method of any one of claims 1-5, further comprising:
    displaying the obtained features of the machine learning sample to the user.
  10. The method of claim 9, wherein, when displaying the obtained features of the machine learning sample to the user, the feature importance of each feature is also displayed to the user.
  11. The method of any one of claims 1-5, further comprising:
    directly applying the obtained features of the machine learning sample to subsequent machine learning steps.
  12. The method of claim 6, wherein processing each non-target value field into a unit feature according to the declared feature type comprises:
    for each non-target value field whose field value data type is continuous and which is declared as a discrete feature, performing one or more binning operations to obtain one or more corresponding binned features, and taking the resulting binned features as a whole as one unit feature.
  13. 一种包括至少一个计算装置和至少一个存储指令的存储装置的***,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行用于自动生成机器学习样本的特征的以下步骤:
    获取用户指定的数据表,其中,数据表的一行对应一条数据记录,数据表的一列对应一个字段;
    声明数据表中的各个非目标值字段所对应的特征类型;其中,所述特征类型包括离散特征,或包括连续特征,或包括离散特征和连续特征;
    按照声明的特征类型将各个非目标值字段处理为单位特征;
    基于生成的单位特征来进行特征组合,以生成组合特征;以及
    基于生成的单位特征和组合特征来得到机器学习样本的特征。
  14. 根据权利要求13所述的***,其中,通过启动与自动特征生成步骤相应的算子来使所述***自动执行操作。
  15. 根据权利要求14所述的***,其中,所述算子对应于与机器学习流程相应的有向无环图中的节点。
  16. 根据权利要求15所述的***,其中,非目标值字段通过以下方式来获取:从数据表中的所有字段中去除用户指定的目标值字段。
  17. 如权利要求16所述的***,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置还执行以下步骤:
    在所述算子在用户未指定目标值字段的情况下被启动时,提供异常提醒。
  18. 根据权利要求13-17中任一项所述的***,其中,所述声明数据表中的各个非目标值字段所对应的特征类型的步骤包括:
    自动或根据用户的指示,将所有非目标值字段声明为离散特征,或者,将各个非目标值字段声明为与其字段值数据类型相应的离散特征或连续特征。
  19. 根据权利要求13-17中任一项所述的***,其中,所述基于生成的单位特征来进行特征组合,以生成组合特征的步骤包括:
    对生成的全部单位特征进行各种组合来获取候选组合特征,或者,对生成的全部单位特征之中特征重要性较高的单位特征进行各种组合来获取候选组合特征;
    通过衡量与每个候选组合特征相应的机器学习模型的效果来从候选组合特征中筛选出组合特征。
  20. 根据权利要求13-17中任一项所述的***,其中,所述基于生成的单位特征和组合特征来得到机器学习样本的特征的步骤包括:
    将生成的全部单位特征和全部组合特征作为机器学习样本的特征;
    或者,将生成的全部单位特征和全部组合特征之中,特征重要性较高的特征作为机器学习样本的特征;
    或者,将生成的全部单位特征之中特征重要性较高的单位特征和生成的全部组合特征,作为机器学习样本的特征;
    或者,将生成的全部组合特征之中特征重要性较高的组合特征和生成的全部单位特征,作为机器学习样本的特征。
  21. 根据权利要求13-17中任一项所述的***,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置还执行以下步骤:
    向用户显示得到的机器学习样本的特征。
  22. 根据权利要求21所述的***,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置还执行以下步骤:
    在向用户显示得到的机器学习样本的特征时,还向用户显示每个特征的特征重要性。
  23. 根据权利要求13-17中任一项所述的***,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置还执行以下步骤:直接将得到的机器学习样本的特征应用于后续的机器学习步骤。
24. The system according to claim 18, wherein the step of processing each non-target-value field into a unit feature according to the declared feature type comprises:
    for each non-target-value field whose field value data type is continuous and which is declared as a discrete feature, performing one or more binning operations to obtain one or more corresponding binned features, and taking the obtained binned features as a whole as one unit feature.
25. A computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the method for automatically generating features of a machine learning sample according to any one of claims 1 to 12.
26. A system for automatically generating features of a machine learning sample, comprising:
    a data table acquisition device configured to acquire a data table specified by a user, wherein one row of the data table corresponds to one data record and one column of the data table corresponds to one field;
    a declaration device configured to declare the feature type corresponding to each non-target-value field in the data table, wherein the feature type comprises a discrete feature, or a continuous feature, or a discrete feature and a continuous feature;
    a unit feature generation device configured to process each non-target-value field into a unit feature according to the declared feature type;
    a combined feature generation device configured to perform feature combination based on the generated unit features to generate combined features; and
    a feature acquisition device configured to obtain features of the machine learning sample based on the generated unit features and combined features.
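Taken together, the devices of claim 26 describe a pipeline from a user-specified data table to sample features. The sketch below strings those steps together under simplifying assumptions (fixed-width binning for numeric fields declared discrete, plain pairwise combination of unit features); it is an illustration of the flow, not the claimed system.

```python
# Illustrative end-to-end pipeline sketch (assumed names and strategies).

def generate_sample_features(table, target_field, declared_types, bin_width=10):
    """table: list of row dicts; declared_types: field -> 'discrete' | 'continuous'."""
    # non-target-value fields: all fields minus the user-specified target field
    fields = [f for f in table[0] if f != target_field]
    samples = []
    for row in table:
        # unit features, produced per the declared feature type
        unit = {}
        for f in fields:
            v = row[f]
            if declared_types[f] == "discrete" and isinstance(v, (int, float)):
                unit[f] = int(v // bin_width)  # binned unit feature
            else:
                unit[f] = v                    # already-discrete value kept as-is
        # combined features via pairwise combination of unit features
        names = sorted(unit)
        comb = {f"{a}&{b}": (unit[a], unit[b])
                for i, a in enumerate(names) for b in names[i + 1:]}
        samples.append({**unit, **comb})
    return samples
```

A row `{"age": 23, "city": "SH", "label": 1}` with `label` as the target field thus yields the unit features `age=2`, `city="SH"` and the combined feature `age&city=(2, "SH")`.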
PCT/CN2018/123910 2017-12-27 2018-12-26 Method and system for automatically generating features of machine learning samples WO2019129060A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711445538.3 2017-12-27
CN201711445538.3A CN108090516A (zh) 2017-12-27 2017-12-27 Method and system for automatically generating features of machine learning samples

Publications (1)

Publication Number Publication Date
WO2019129060A1 (zh)

Family

ID=62179713

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/123910 WO2019129060A1 (zh) 2017-12-27 2018-12-26 Method and system for automatically generating features of machine learning samples

Country Status (2)

Country Link
CN (1) CN108090516A (zh)
WO (1) WO2019129060A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347320A (zh) * 2020-11-05 2021-02-09 杭州数梦工场科技有限公司 Method and device for recommending associated fields of data table fields
CN112613983A (zh) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Feature screening method and device in a machine modeling process, and electronic device
US11062792B2 (en) 2017-07-18 2021-07-13 Analytics For Life Inc. Discovering genomes to use in machine learning techniques
US11139048B2 (en) 2017-07-18 2021-10-05 Analytics For Life Inc. Discovering novel features to use in machine learning techniques, such as machine learning techniques for diagnosing medical conditions

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090516A (zh) * 2017-12-27 2018-05-29 第四范式(北京)技术有限公司 Method and system for automatically generating features of machine learning samples
CN109408592B (zh) * 2018-10-12 2021-09-24 北京聚云位智信息科技有限公司 Feature engineering knowledge base for AI in a decision-oriented distributed database system and implementation method thereof
CN109634961B (zh) * 2018-12-05 2021-06-04 杭州大拿科技股份有限公司 Test paper sample generation method, device, electronic device and storage medium
CN109739855B (zh) * 2018-12-28 2022-03-01 第四范式(北京)技术有限公司 Method and system for implementing data table splicing and automatically training machine learning models
CN109697066B (zh) * 2018-12-28 2021-02-05 第四范式(北京)技术有限公司 Method and system for implementing data table splicing and automatically training machine learning models
CN110297833A (zh) * 2019-07-05 2019-10-01 税安科技(杭州)有限公司 Business report error correction method
CN112184279A (zh) * 2019-07-05 2021-01-05 上海哔哩哔哩科技有限公司 Fast AUC metric calculation method, device and computer equipment
CN110443864B (zh) * 2019-07-24 2021-03-02 北京大学 Automatic artistic font generation method based on single-stage few-shot learning
CN110457329B (zh) * 2019-08-16 2022-05-06 第四范式(北京)技术有限公司 Method and device for implementing personalized recommendation
CN110851500B (zh) * 2019-11-07 2022-10-28 北京集奥聚合科技有限公司 Method for generating expert feature dimensions required for machine learning modeling
CN111832740A (zh) * 2019-12-30 2020-10-27 上海氪信信息技术有限公司 Method for deriving machine learning features from structured data in real time
CN111325578B (zh) * 2020-02-20 2023-10-31 深圳市腾讯计算机系统有限公司 Sample determination method and device for a prediction model, medium and equipment
CN114443639A (zh) * 2020-11-02 2022-05-06 第四范式(北京)技术有限公司 Method and system for processing data tables and automatically training machine learning models
CN112380205B (zh) * 2020-11-17 2024-04-02 北京融七牛信息技术有限公司 Automatic feature generation method and system with a distributed architecture
CN112434032B (zh) * 2020-11-17 2024-04-05 北京融七牛信息技术有限公司 Automatic feature generation system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677353A (zh) * 2016-01-08 2016-06-15 北京物思创想科技有限公司 Feature extraction method, machine learning method and devices thereof
CN107316082A (zh) * 2017-06-15 2017-11-03 第四范式(北京)技术有限公司 Method and system for determining feature importance of machine learning samples
CN107392319A (zh) * 2017-07-20 2017-11-24 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples
CN107451266A (zh) * 2017-07-31 2017-12-08 北京京东尚科信息技术有限公司 Method and device for processing data
CN108090516A (zh) * 2017-12-27 2018-05-29 第四范式(北京)技术有限公司 Method and system for automatically generating features of machine learning samples


Also Published As

Publication number Publication date
CN108090516A (zh) 2018-05-29

Similar Documents

Publication Publication Date Title
WO2019129060A1 (zh) Method and system for automatically generating features of machine learning samples
CN111797998B (zh) Method and system for generating combined features of machine learning samples
US20220414544A1 (en) Parallel Development and Deployment for Machine Learning Models
CN111652380A (zh) Method and system for tuning algorithm parameters of a machine learning algorithm
US20190213115A1 (en) Utilizing artificial intelligence to test cloud applications
US11416768B2 (en) Feature processing method and feature processing system for machine learning
US9454454B2 (en) Memory leak analysis by usage trends correlation
WO2021084286A1 (en) Root cause analysis in multivariate unsupervised anomaly detection
CN108228861B (zh) Method and system for performing feature engineering for machine learning
WO2019015631A1 (zh) Method and system for generating combined features of machine learning samples
US9276821B2 (en) Graphical representation of classification of workloads
JP2017508210A5 (zh)
CN113822440A (zh) Method and system for determining feature importance of machine learning samples
CN108008942B (zh) Method and system for processing data records
US10740361B2 (en) Clustering and analysis of commands in user interfaces
WO2022089652A1 (zh) Method and system for processing data tables and automatically training machine learning models
US11631205B2 (en) Generating a data visualization graph utilizing modularity-based manifold tearing
CN108898229B (zh) Method and system for constructing a machine learning modeling process
US20220076157A1 (en) Data analysis system using artificial intelligence
Mostaeen et al. Clonecognition: machine learning based code clone validation tool
CN110895718A (zh) Method and system for training a machine learning model
US20240086165A1 (en) Systems and methods for building and deploying machine learning applications
KR20210143460A (ko) Feature recommendation device and feature recommendation method thereof
WO2023066237A9 (en) Artificial intelligence model learning introspection
US20220404515A1 (en) Systems and methods for mapping seismic data to reservoir properties for reservoir modeling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 18894239
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 18894239
    Country of ref document: EP
    Kind code of ref document: A1