CN111435463B - Data processing method, related equipment and system - Google Patents

Data processing method, related equipment and system

Info

Publication number
CN111435463B
Authority
CN
China
Prior art keywords
data set
data
feature
candidate
candidate data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910028386.XA
Other languages
Chinese (zh)
Other versions
CN111435463A (en)
Inventor
权涛 (Quan Tao)
缪丹丹 (Miao Dandan)
孙伟健 (Sun Weijian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910028386.XA
Publication of CN111435463A
Application granted
Publication of CN111435463B
Legal status: Active (current)
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the application disclose a data processing method and a related device and system. The method relates to the field of artificial intelligence, in particular to automatic feature engineering, and comprises the following steps: the execution device performs multi-order feature transformation on a plurality of data features in an acquired first group of data sets and selects an optimal data set from the data sets obtained by the feature transformations. When the nth-order feature transformation is performed, feature transformation is carried out on each data set in the nth group of data sets to obtain a plurality of candidate data sets; a first evaluation value is calculated for each of the plurality of candidate data sets; and the n+1st group of data sets entering the next-order feature transformation is then determined based on the first evaluation value of each candidate data set, the number of data sets in the n+1st group being smaller than the number of the candidate data sets.

Description

Data processing method, related equipment and system
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data processing method, and related devices and systems.
Background
With the advent of Industry 4.0, traditional industries are gradually moving toward digital services. However, some traditional industries lack accumulated technology in big data processing, cloud computing, artificial intelligence (AI) and the like, and lack the ability to apply AI technology to their transformation. Cloud computing has become an important service platform for the digital economy, and providing automatic machine learning services on the cloud has become a core competitive strength of cloud platforms.
Feature engineering is an important link of automatic machine learning: the original data set is transformed through feature transformations to obtain a plurality of candidate data sets, and these candidate data sets are evaluated to obtain an optimal data set. The optimal data set contains data features usable for machine learning; such features describe the characteristics of the original data set comprehensively and from multiple angles, and a model built with these features can show good performance.
At present, obtaining high-order features through iterative feature transformation is the main means by which automatic feature engineering obtains a plurality of candidate data sets. However, when there are many feature transformation operations, the number of candidate data sets obtained by transformation grows exponentially, and each transformed data set needs a performance evaluation, so determining the optimal data set takes a long time and the automation efficiency of feature engineering is low.
Disclosure of Invention
The embodiments of the application provide a data processing method and related device and system, which overcome the prior-art defect that the candidate data sets obtained by transformation grow exponentially when there are many feature transformation operations, and which improve the automation efficiency of feature engineering.
In a first aspect, an embodiment of the application provides a data processing method, applicable to an execution device, including: the execution device obtains a first group of data sets, the first group of data sets including a plurality of data features; it performs multi-order feature transformation on the plurality of data features in the first group of data sets and then determines a target data set from a first set, where the first set includes the data sets obtained by each order of feature transformation in the multi-order feature transformation process. The nth-order feature transformation in the multi-order feature transformation is implemented as follows: feature transformation is performed on each data set in an nth group of data sets to obtain a plurality of candidate data sets, where the nth group of data sets is the set of data sets obtained by performing n-1 orders of feature transformation on the first data set and n is an integer greater than 1; a first evaluation value of each of the plurality of candidate data sets is calculated, the first evaluation value being used to evaluate the accuracy of a model obtained by training with the candidate data set; and an n+1st group of data sets, which enters the feature transformation of the next order, is determined based on the first evaluation value of each candidate data set, the number of data sets in the n+1st group being smaller than the number of the candidate data sets.
The first data set may be an original data set submitted or sent to the execution device by a user, or the data of the original data set after preprocessing. The first group of data sets comprises a plurality of samples. The target data set is the optimal data set determined in feature engineering, and a model obtained by training with the optimal data set performs better.
The "multi-order feature transformation" refers to feature transformation performed multiple times by taking a data set obtained by the feature transformation as a basis of the next feature transformation.
It should be understood that after the execution device obtains the target data set, it may further obtain the target feature transformation algorithm by which the target data set was obtained. The execution device may also train a newly built machine learning model with the target data set to obtain a target machine learning model, and then send the target machine learning model and the target feature transformation algorithm to the device on the user side through the communication interface of the execution device.
It should be further understood that the execution device may be a terminal device, a server, or a device capable of performing data calculation, such as a virtual machine, which is not limited thereto.
According to the method, only part of the candidate data sets obtained by the nth-order feature transformation are selected as the n+1st group of data sets for the next-order feature transformation, which avoids exponential growth in the number of data sets, improves the data processing speed and thus improves the automation efficiency of feature engineering.
As a possible implementation, the first candidate data set is any one of the plurality of candidate data sets, and the first evaluation value of the first candidate data set may be calculated as follows: the execution device calculates the meta-features of the first candidate data set, the meta-features representing attributes of the first candidate data set; the meta-features are input into a first machine learning model to predict a second evaluation value of the first candidate data set, the second evaluation value being used to evaluate the accuracy of a model obtained by training with the first candidate data set; and the first evaluation value of the first candidate data set is determined based on its second evaluation value.
It should be understood that, since the first machine learning model is obtained by taking meta-features of the data set as training data, and since the meta-features are attributes describing the data set and are independent of physical meaning of the data features in the data set and values of the data features, the first machine learning model can be obtained by offline training and is suitable for evaluating all the data sets.
In the prior art, evaluating candidate data sets requires training and testing a model for each candidate data set, and this online training consumes a large amount of time. In the method, the first machine learning model is trained offline and can directly predict the evaluation value of a data set from its meta-features; the candidate data sets are then screened based on the first evaluation value, and only a small number of candidate data sets enter the next-order feature transformation, which accelerates the feature transformation process so that the target data set can be obtained quickly.
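A minimal sketch of this prediction step, assuming a scikit-learn-style regressor as the first machine learning model and a hypothetical `meta_feature_fn` that computes the meta-feature vector of a data set:

```python
import numpy as np

def first_evaluation(candidate_df, meta_feature_fn, first_model):
    # compute meta-features of the candidate data set (attributes of the set itself)
    meta = np.asarray(meta_feature_fn(candidate_df), dtype=float).reshape(1, -1)
    # the offline-trained first machine learning model predicts the second
    # evaluation value directly from the meta-features, with no online training
    second_eval = float(first_model.predict(meta)[0])
    # in the simplest implementation the first evaluation value equals it
    return second_eval
```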
As a possible implementation manner, the first candidate data set includes a plurality of data features and a tag, and the meta-feature calculation method of the first candidate data set may be: the execution device calculates first information according to a first candidate data set, wherein the first information can comprise at least one of data similarity and distribution similarity of every two data features in a plurality of data features of the first candidate data set, data similarity and distribution similarity of each data feature in the plurality of data features of the first candidate data set and a label, data distribution information of each data feature in the plurality of data features of the first candidate data set, data distribution information of the label and the like; further, a meta-feature of the first candidate data set is calculated from the first information.
Optionally, the meta-features of the first candidate data set may include: at least one of a basic feature of the first candidate data set, a feature of a continuous data feature of a plurality of data features of the first candidate data set, a feature of a discrete data feature of a plurality of data features of the first candidate data set, a feature of a tag, a feature of data similarity, a feature of distribution information of the data features, and the like.
Optionally, the first data feature and the second data feature are any two of the plurality of data features of the first candidate data set, and their data similarity may be calculated as follows: the execution device calculates the mutual information of the first data feature and the second data feature from the data of the two features in the first candidate data set, and then determines their data similarity from the mutual information. For example, the data similarity of the first data feature and the second data feature may be their mutual information.
Mutual information (MI) is an information measure in information theory: it can be seen as the amount of information one random variable contains about another random variable, or as the reduction in uncertainty of one random variable due to knowledge of another. Mutual information therefore describes the data similarity between data features: when the correlation between two data features is strong, the corresponding mutual information value is large, and vice versa.
Further, the mutual information of the first data feature and the tag can be calculated, thereby obtaining the data similarity of the first data feature and the tag.
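As a non-limiting sketch, mutual-information-based data similarity could be computed as follows; the equal-width binning step for continuous columns is an illustrative assumption, not specified here:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def data_similarity(x, y, bins=10):
    """Mutual information of two columns, used as their data similarity.
    Continuous columns are discretized by equal-width binning first."""
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    y_binned = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
    return mutual_info_score(x_binned, y_binned)
```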
Optionally, the first data feature and the second data feature are any two of the plurality of data features of the first candidate data set, and their distribution similarity may be calculated as follows: the execution device calculates the chi-square value of the first data feature and the second data feature through a chi-square test, or calculates the t statistic of the two features through a t-test, from the data of the two features; the chi-square value or t statistic is their distribution similarity.
Further, the chi-square value or t statistic of the first data feature and the tag can be calculated, so that the distribution similarity of the first data feature and the tag is obtained.
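A minimal sketch of the chi-square/t-statistic computation using SciPy; treating the choice of test as a `discrete` flag is an illustrative assumption:

```python
import pandas as pd
from scipy import stats

def distribution_similarity(x, y, discrete=False):
    """Chi-square value (discrete features) or t statistic (continuous
    features) of two columns, used as their distribution similarity."""
    if discrete:
        table = pd.crosstab(pd.Series(x), pd.Series(y))  # contingency table
        chi2, _, _, _ = stats.chi2_contingency(table)
        return chi2
    t_stat, _ = stats.ttest_ind(x, y)
    return t_stat
```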
Optionally, the first data feature is any one of a plurality of data features of the first candidate data set, and the calculating method of the distribution information of the first data feature may be: the execution device may calculate the skewness and kurtosis of the first data feature from the data of the first data feature, the distribution information of the first data feature including the skewness and kurtosis.
Further, skewness and kurtosis of the labels may also be calculated. Wherein the skewness (skewness) is the degree of asymmetry or skew of the data distribution, and is a measure of the direction and degree of skew of the data distribution; kurtosis (kurtosis) refers to the degree of concentration of the data and the degree of steepness (or flatness) of the distribution curve.
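For illustration, skewness and kurtosis of one feature column (or of the label column) can be obtained directly from SciPy:

```python
from scipy.stats import kurtosis, skew

def distribution_info(column):
    """Skewness (direction/degree of asymmetry) and kurtosis (steepness of
    the distribution curve) of one data feature or of the label column."""
    return {"skewness": float(skew(column)), "kurtosis": float(kurtosis(column))}
```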
As a possible implementation, in a first implementation of determining the first evaluation value of the first candidate data set from its second evaluation value, the first evaluation value of the first candidate data set is simply its second evaluation value.
As a possible implementation, the first candidate data set is obtained from a first data set by a first feature transformation, the first data set being one data set in the nth group of data sets. In a second implementation of determining the first evaluation value of the first candidate data set from its second evaluation value, the first evaluation value of the first candidate data set may be the sum of a first data item and a second data item, where the first data item is positively correlated with the second evaluation value of the first candidate data set and the second data item is determined by the number of historical gains of the first feature transformation.
It will be appreciated that a gain of the first feature transformation occurs when, within the first n groups of data sets, the first evaluation value of a data set after the first feature transformation is greater than the first evaluation value of the data set before the transformation.
In the above method, the first evaluation value of the first candidate data set is adjusted jointly by the second evaluation value of the first candidate data set and the number of historical gains of the first feature transformation that produced it; taking the historical gain count of the feature transformation into account prevents the transformation from falling into a local optimum.
As a possible implementation, a first implementation of determining the n+1st group of data sets from the first evaluation values of the plurality of candidate data sets may be: the execution device selects, as the n+1st group of data sets, the candidate data sets whose first evaluation value is greater than a first threshold.
As a possible implementation, a second implementation of determining the n+1st group of data sets from the first evaluation values of the plurality of candidate data sets may be: the execution device ranks the first evaluation values of the plurality of candidate data sets from large to small and selects the candidate data sets corresponding to the first m values in this ranking as the n+1st group of data sets, where m is a positive integer.
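The first and second implementations can be sketched together as follows; the helper name and its interface are hypothetical:

```python
def select_next_group(candidates, first_evals, m=None, threshold=None):
    """First implementation: keep candidates whose first evaluation value
    exceeds `threshold`; second implementation: keep the top-m candidates."""
    scored = sorted(zip(first_evals, range(len(candidates))), reverse=True)
    if threshold is not None:
        return [candidates[i] for s, i in scored if s > threshold]
    return [candidates[i] for _, i in scored[:m]]
```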
As a possible implementation, a third implementation of determining the n+1st group of data sets from the first evaluation values of the plurality of candidate data sets may be: the execution device selects the candidate data sets whose first evaluation value satisfies a first condition; model training and testing are then performed on each of these candidate data sets to obtain a third evaluation value for each of them; and the candidate data sets whose third evaluation value satisfies a second condition are selected from among them as the n+1st group of data sets.
In the third implementation, the number of candidate data sets is first reduced by screening on the first evaluation value; the remaining candidates are then evaluated more accurately, and the candidates are screened again based on the accurate evaluation values. This further reduces branches, lowers the complexity of feature transformation and improves data processing efficiency.
Alternatively, the candidate data sets whose first evaluation value satisfies the first condition may be the candidate data sets whose first evaluation value is greater than a second threshold; or the candidate data sets corresponding to the first g first evaluation values when the first evaluation values of the plurality of candidate data sets are ranked from large to small, where g is a positive integer.
Optionally, the second candidate data set is any one of candidate data sets satisfying the first condition, the second candidate data set includes a training data set and a test data set, wherein any one of the training data set and the test data set includes a plurality of data features and a tag; the calculation method of the third evaluation value of the second candidate data set may be: the execution device trains a second machine learning model according to the training data set; inputting a plurality of data features of each sample in the test data set into the second machine learning model to obtain a prediction label of each sample in the test data set; further, a third evaluation value of the second candidate data set is calculated from the labels of each sample in the test data set and the predicted labels.
It should be appreciated that the third evaluation value may be an F1 score, a mean average precision (MAP), an area under the ROC curve (AUC), a mean square error (MSE), a root mean square error (RMSE), a recall, an accuracy, etc., which is not limited here.
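A minimal sketch of computing a third evaluation value by actual training and testing; the choice of a random forest as the second machine learning model and of macro-F1 as the metric are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def third_evaluation(train_X, train_y, test_X, test_y):
    """Train a second machine learning model on the training data set and
    score its predicted labels against the test data set's true labels."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train_X, train_y)
    predicted = model.predict(test_X)
    return f1_score(test_y, predicted, average="macro")
```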
As a possible implementation, before inputting the meta-feature into the first machine learning model, predicting the second evaluation value of the first candidate data set, the method may further comprise: the execution device acquires a plurality of first samples, any one of which includes meta-characteristics of the third data set and evaluation values of the third data set; the first machine learning model is trained from a plurality of first samples.
The meta-feature calculation method may be referred to the related description in the first aspect, and the embodiments of the present application are not repeated.
It should be appreciated that the first machine learning model may be used to predict an evaluation value of an input dataset based on meta-features of the dataset, the evaluation value being the second evaluation value in the first aspect described above.
The first machine learning model training method may be executed by the training device, or the execution device may be the same device as the training device, which is not limited thereto.
According to the method, the first machine learning model trained offline can be applied to all data sets: the second evaluation value of a candidate data set can be predicted from its meta-features, the candidate data sets are then screened based on the second evaluation value, and inferior candidate data sets are removed, which limits the growth in the number of data sets and improves data processing efficiency.
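As a non-limiting sketch of the offline training step, assuming each first sample is a dict with hypothetical keys `meta_features` and `evaluation`, and assuming a gradient-boosting regressor (the embodiments do not fix the model family):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def train_first_model(first_samples):
    """Offline training from first samples, each holding the meta-features of
    a data set and the evaluation value measured for that data set."""
    X = np.array([s["meta_features"] for s in first_samples], dtype=float)
    y = np.array([s["evaluation"] for s in first_samples], dtype=float)
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X, y)
    return model   # usable as `first_model` in the prediction sketch above
```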
As a possible implementation, before the execution device performs feature transformation on each data set in the nth group of data sets to obtain a plurality of candidate data sets, it may further select, for a data set in the nth group, the feature transformations to apply to that data set. Specifically: the execution device may input the meta-features of a third data set into a third machine learning model and predict fourth evaluation values corresponding to B feature transformations, where the fourth evaluation value corresponding to a second feature transformation is used to evaluate the accuracy of a model obtained by training with the candidate data set that the third data set yields under the second feature transformation, the third data set is any data set in the nth group of data sets, the second feature transformation is any one of the B feature transformations, and B is a positive integer; the feature transformations whose fourth evaluation value satisfies a fourth condition are selected from the B feature transformations as A feature transformations, where A is a positive integer not greater than B. In this case, one implementation of performing feature transformation on each data set in the nth group to obtain a plurality of candidate data sets may be: the execution device performs the A feature transformations on the third data set to obtain A candidate data sets.
It should be understood that, since the third machine learning model is obtained by taking meta-features of the data set as training data, and since the meta-features are attributes describing the data set, and are irrelevant to the physical meaning of the data features in the data set and the value of the data features, the third machine learning model can be obtained by offline training and is suitable for evaluating all the data sets.
According to the method, before the data sets in the nth group are transformed, the fourth evaluation values corresponding to the feature transformations are estimated by the offline-trained third machine learning model, the feature transformations predicted to benefit a data set most are screened out based on these values, and only the screened feature transformations are applied to the data set. This reduces unnecessary feature transformations and first-evaluation-value computations, and this pre-pruning before transformation accelerates data processing.
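A minimal sketch of this pre-pruning step, assuming the third machine learning model is a multi-output regressor returning one predicted fourth evaluation value per feature transformation (B outputs); the names and the gain threshold are hypothetical:

```python
import numpy as np

def prune_transforms(dataset_meta, third_model, transforms, min_gain=0.0):
    """Predict one fourth evaluation value per candidate feature transformation
    (B outputs) and keep only the A transformations predicted to help."""
    preds = np.asarray(third_model.predict(
        np.asarray(dataset_meta, dtype=float).reshape(1, -1)))[0]
    return [t for t, p in zip(transforms, preds) if p > min_gain]
```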
As a possible implementation, before the meta-features of the third data set are input into the third machine learning model to predict the fourth evaluation values corresponding to the B feature transformations, the method further includes: training the third machine learning model. The training can be realized in the following two ways:
The first implementation:
The execution device obtains a plurality of second samples, any one of which comprises the meta-features of a fourth data set and the difference between the evaluation value of the data set obtained after the fourth data set undergoes a second feature transformation and the evaluation value of the fourth data set itself, the second feature transformation being any one of the B feature transformations; the third machine learning model is trained from the plurality of second samples.
In this case, the A feature transformations may specifically be the feature transformations among the B feature transformations whose predicted fourth evaluation value is greater than 0.
The second implementation:
The execution device obtains a plurality of third samples, any one of which comprises the meta-features of a fourth data set and the evaluation value of the data set obtained after the fourth data set undergoes a second feature transformation; the third machine learning model is trained from the plurality of third samples.
In this case, the A feature transformations may specifically be the feature transformations among the B feature transformations whose predicted fourth evaluation value is greater than the first evaluation value of the data set.
According to the method, the third machine learning model trained offline can be applied to all data sets, and the quality of the candidate data sets obtained by feature transformation can be predicted from the meta-features of the data sets, so that feature transformations producing inferior data sets are avoided, the growth in the number of data sets is limited, and data processing efficiency is improved.
In a second aspect, embodiments of the present application provide a data processing system, the system may include:
A first acquisition unit for acquiring a first set of data sets, the first set of data sets comprising a plurality of data features;
a transformation unit for performing a multi-order feature transformation on a plurality of data features in the first set of data sets;
a first selection unit, configured to determine a target data set from a first set, where the first set includes a data set obtained by each order of feature transformation in the multi-order feature transformation process;
wherein, the transformation unit is specifically configured to: performing feature transformation on each data set in an nth group of data sets to obtain a plurality of candidate data sets, wherein the nth group of data sets are data sets obtained by performing n-1 order feature transformation on the first data set, and n is an integer greater than 1;
the system further comprises:
A first evaluation unit configured to calculate a first evaluation value of each of the plurality of candidate data sets, the first evaluation value being used to evaluate accuracy of a model obtained by training the candidate data sets;
And the first screening unit is used for determining an n+1st group of data sets according to the first evaluation value of each candidate data set in the plurality of candidate data sets, and the number of the data sets in the n+1st group of data sets is smaller than the number of the plurality of candidate data sets.
It should be noted that the system may further include other functional units for implementing the data processing method according to the first aspect; reference may be made to the related description of the data processing method in the first aspect, which is not repeated here.
It should be understood that each functional unit in the above system may be deployed on one or more computing devices capable of data computation, such as an execution device; the execution device may be one or more servers, one or more computers, etc., which is not limited here.
In a third aspect, an embodiment of the present application further provides an execution device, where the execution device may include a processor and a memory, where the memory is configured to store data and program codes, and the processor is configured to call the data and the program codes in the memory to execute:
Obtaining a first set of data sets, the first set of data sets comprising a plurality of data features;
performing a multi-order feature transformation on the plurality of data features in the first set of data sets;
determining a target data set from a first set, wherein the first set comprises a data set obtained by each order of feature transformation in the process of the multi-order feature transformation;
wherein said performing a multi-order feature transformation on a plurality of data features in said first set of data sets comprises:
Respectively carrying out feature transformation on the data features in each data set in an nth group of data sets to obtain a plurality of candidate data sets, wherein the nth group of data sets are data sets obtained by carrying out n-1 order feature transformation on the first data set, and n is an integer greater than 1;
Calculating a first evaluation value for each of the plurality of candidate data sets; the first evaluation value is used for evaluating the accuracy of a model obtained through training of the candidate data set;
and determining an n+1st group of data sets according to the first evaluation value of each candidate data set in the plurality of candidate data sets, wherein the number of the data sets in the n+1st group of data sets is smaller than the number of the plurality of candidate data sets.
It should be noted that the processor may also execute the data processing method according to the first aspect; reference may be made to the related description of the data processing method in the first aspect, which is not repeated here.
In one implementation of the embodiment of the application, the processor may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), an artificial intelligence processor, one or more integrated circuits, or the like.
In another implementation of the embodiment of the application, the execution device may further include an artificial intelligence processor, which may be a neural network processing unit (NPU), a tensor processing unit (TPU) or a graphics processing unit (GPU), any of which is suitable for large-scale exclusive-or operation processing. The artificial intelligence processor may be mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it.
It should be appreciated that the computing device or executing device described above may be one or more servers, one or more computers, etc., as this is not limiting.
In a fourth aspect, embodiments of the application also provide a computer storage medium storing computer software instructions which, when executed by a computer, cause the computer to perform any of the data processing methods described in the first aspect.
In a fifth aspect, embodiments of the present application also provide a computer program comprising computer software instructions which, when executed by a computer, cause the computer to perform any of the data processing methods as described in the first aspect.
In a sixth aspect, an embodiment of the present application further provides a training method of a machine learning model, which is applicable to a training device, where the method includes: the training equipment acquires a plurality of first samples, wherein any one of the plurality of first samples comprises meta-characteristics of a second data set and evaluation values of the second data set; a first machine learning model is trained from the plurality of first samples.
Optionally, the method for calculating the meta-feature is the same as the method for calculating the meta-feature of the first candidate data set in the first aspect, which may be referred to as related description in the first aspect, and embodiments of the present application are not repeated.
It should be noted that the first machine learning model obtained by training is used to process the meta-features of a data set input to the model to obtain a second evaluation value, and the second evaluation value is used to evaluate the accuracy of a model obtained by training with that data set.
According to the method, the first machine learning model obtained by training is applicable to all data sets: the evaluation value of a data set can be predicted from its meta-features and used to evaluate the data set, so that training and testing on every data set whose evaluation value is needed are avoided, and the evaluation efficiency of data sets is improved.
In a seventh aspect, an embodiment of the application further provides a training method of a machine learning model, applicable to a training device, the method including: the training device obtains a plurality of second samples, any one of which comprises the meta-features of a fourth data set and the difference between the evaluation value of the data set obtained after the fourth data set undergoes a second feature transformation and the evaluation value of the fourth data set itself, the second feature transformation being any one of the B feature transformations; the third machine learning model is trained from the plurality of second samples.
Optionally, a method for calculating the meta-feature of the fourth dataset is the same as the method for calculating the meta-feature of the first candidate dataset in the first aspect, which may be referred to the related description in the first aspect, and embodiments of the present application are not repeated.
The third machine learning model obtained by training is used for processing meta-features of the dataset input to the model to obtain fourth evaluation values corresponding to the B kinds of feature transformations one by one, wherein the fourth evaluation values are used for evaluating accuracy of the candidate dataset obtained by the dataset through the feature transformations corresponding to the fourth evaluation values.
According to the method, the third machine learning model obtained by training is applicable to all data sets: based on the meta-features of a data set, it can predict whether the evaluation value of the candidate data set generated by a feature transformation will show a gain. The feature transformations suitable for the data set (namely, those whose corresponding candidate data set has a gained evaluation value) can thus be predicted before transformation, which avoids unnecessary feature transformations and improves data processing efficiency.
In an eighth aspect, an embodiment of the application further provides a training method of a machine learning model, applicable to a training device, the method including: the training device obtains a plurality of third samples, any one of which comprises the meta-features of a fourth data set and the evaluation value of the data set obtained after the fourth data set undergoes a second feature transformation; the third machine learning model is trained from the plurality of third samples.
Optionally, a method for calculating the meta-feature of the fourth dataset is the same as the method for calculating the meta-feature of the first candidate dataset in the first aspect, which may be referred to the related description in the first aspect, and embodiments of the present application are not repeated.
According to the method, the third machine learning model obtained by training is applicable to all data sets: based on the meta-features of a data set, it can predict the evaluation value of the candidate data set generated by a feature transformation, so that the feature transformations suitable for the data set can be predicted before transformation, which avoids unnecessary feature transformations and improves data processing efficiency.
The training device according to the sixth aspect, the seventh aspect, or the eighth aspect may be one or more servers, one or more computers, or the like, which is not limited thereto.
In a ninth aspect, an embodiment of the present application further provides a training device, where the training device may include a processor and a memory, where the memory is configured to store data and program codes, and the processor is configured to invoke the data and the program codes in the memory, and perform the training method of the machine learning model according to the sixth aspect.
In a tenth aspect, an embodiment of the application further provides a training device, which may include a processor and a memory, where the memory is configured to store data and program code, and the processor is configured to invoke the data and the program code in the memory to perform the training method of the machine learning model according to the seventh aspect or the eighth aspect.
The processor in the ninth or tenth aspect may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), an artificial intelligence processor, one or more integrated circuits, or the like.
In another implementation of the embodiment of the application, the training device in the seventh aspect or the eighth aspect may further include an artificial intelligence processor, which may be a neural network processing unit (NPU), a tensor processing unit (TPU) or a graphics processing unit (GPU), any of which is suitable for large-scale exclusive-or operation processing. The artificial intelligence processor may be mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it.
In an eleventh aspect, embodiments of the application also provide a computer storage medium storing computer software instructions which, when executed by a computer, cause the computer to perform the training method of any one of the machine learning models according to the sixth aspect.
In a twelfth aspect, embodiments of the present application also provide a computer program comprising computer software instructions which, when executed by a computer, cause the computer to perform the training method of any one of the machine learning models according to the sixth aspect.
In a thirteenth aspect, embodiments of the application also provide a computer storage medium storing computer software instructions which, when executed by a computer, cause the computer to perform the training method of any one of the machine learning models according to the seventh or eighth aspects.
In a fourteenth aspect, embodiments of the present application also provide a computer program comprising computer software instructions which, when executed by a computer, cause the computer to perform the training method of any one of the machine learning models of the seventh or eighth aspects.
In a fifteenth aspect, an embodiment of the present application further provides a chip, where the chip includes a processor and a data interface, and the processor reads an instruction stored on a memory through the data interface, and executes the data processing method in the first aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, where the instructions, when executed, are configured to perform the method of any one of the data processing method in the first aspect, the training method of the machine learning model in the sixth aspect, the seventh aspect, or the eighth aspect.
In a sixteenth aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes the data processing system or the execution device in any one of the second aspect and the third aspect.
Drawings
To describe the embodiments of the application or the technical solutions in the background art more clearly, the drawings required for the embodiments of the application or the background art are briefly introduced below.
FIG. 1 is a schematic block diagram of a system in an embodiment of the application;
FIG. 2 is a schematic block diagram of another system in accordance with an embodiment of the present application;
FIG. 3 is an interface diagram of a graphical user interface according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for computing meta-features of a dataset according to an embodiment of the present application;
FIG. 5 is a flow chart of a data processing method according to an embodiment of the application;
FIG. 6A is a schematic flow chart of an nth order feature transformation and selection in accordance with an embodiment of the present application;
FIG. 6B is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 6C is a schematic illustration of a feature transformation and screening in an embodiment of the application;
FIG. 7 is a schematic block diagram of a data processing system in accordance with an embodiment of the present application;
FIG. 8 is a schematic block diagram of an execution device in an embodiment of the present application;
FIG. 9 is a schematic block diagram of a training apparatus in an embodiment of the present application;
FIG. 10 is a schematic block diagram of a training apparatus in an embodiment of the present application;
Fig. 11 is a schematic block diagram of a chip in an embodiment of the application.
Detailed Description
The concepts involved in the present application are first introduced.
In the embodiments of the application, a machine learning model, also simply called a model (such as the first machine learning model, the second machine learning model or the third machine learning model), can receive input data and generate a prediction output from the received input data and the current model parameters. The machine learning model may be a regression model, an artificial neural network (ANN), a deep neural network (DNN), a support vector machine (SVM) or another machine learning model.
In the embodiment of the application, an "original data set" is an original data set submitted or sent by a user to a cloud platform or very careful equipment. The original data set is used for training the established machine learning model to obtain the machine learning model capable of realizing a certain function. The data in the original dataset may be structured data, for example, represented by a "table". The original dataset includes M samples, each of which may include a plurality of data features and tags.
In the embodiments of the application, the first group of data sets is obtained by preprocessing the original data set; the first group of data sets may include M samples, any one of which includes N1 data features and a label. The preprocessing may include one or more of data cleaning, formatting, feature digitizing and the like. For example, "male" and "female" in the data set need to be encoded, such as by one-hot encoding or mean encoding, so that the data feature is described by a vector. It will be appreciated that when a data set is feature-transformed, only the data features in the samples are transformed; the labels in the data set are not transformed, and the feature transformation may produce new data features. That is, the number of data features of each sample in the feature-transformed data set or candidate data set, and what those data features refer to, may change.
The "data set", the candidate data set of each group obtained by processing all comprise M samples, and the samples in different data sets or candidate data sets can comprise different data features, different numbers of data features and the like. It should be noted that the labels corresponding to the respective samples are unchanged. That is, a new data set is obtained by transforming the features of the samples in the new data set, the features of the samples in the new data set are transformed, and higher-order data features appear in proportion, but the labels corresponding to the samples are unchanged.
In the embodiments of the application, the data sets can have a hierarchical relationship. The relationship between data sets can be described by groups, each group containing one or more data sets; the relationship among multiple groups of data sets can also be described by a tree structure (also called a search tree). Multiple feature transformations are performed on the 1st group of data sets (the layer-1 node, also called the root node) to obtain a plurality of candidate data sets, and several candidate data sets with better evaluation values are selected from them as the 2nd group of data sets (the data sets corresponding to the layer-2 nodes); then multiple feature transformations are performed on each data set in the 2nd group to obtain a plurality of candidate data sets, and part of the candidate data sets with better evaluation values are selected as the 3rd group of data sets (the data sets corresponding to the layer-3 nodes), and so on. It can be seen that the layer-2 nodes are child nodes of the layer-1 node, and similarly the layer-3 nodes are child nodes of the layer-2 nodes. In addition, the 1st group includes one data set, which may be obtained by preliminary processing of the original data set; the preliminary processing may include one or more of encoding (for example, one-hot encoding or mean encoding), normalization operations and the like, which is not limited here.
In the embodiment of the application, "pruning" refers to reducing the number of data sets in the data structure of each order of feature transformation through screening, so that unnecessary feature transformation is avoided, and what is more, certain branches in the search tree are pruned.
A data set or candidate data set can be divided into a training data set and a test data set. The training data set is used to train a model; the trained model then predicts on the test data set, and the prediction results are compared with the true results of the test data. The result of this comparison is called an evaluation value, also called the performance obtained by the model on the data set. It should be appreciated that this evaluation process evaluates a data set based on a model actually trained from it, so the resulting evaluation value is highly reliable.
Data may be classified into continuous data and discrete data, among others; according to the measurement scale, data may be classified into scale data, ordinal data and nominal data. According to the data type, data features can be divided into continuous data features and discrete data features, and the execution device can screen out the feature transformation algorithms suitable for a data feature according to its type.
For example, a "cost" is a continuous data feature that can include normalization, log, evolution, squaring, etc. for a transformation corresponding to the "cost" feature; for another example, a "gender" is a discrete data characteristic, and the transformation corresponding to the data characteristic "gender" may include a coding operation such as one-hot, meanencoder, and a frequency operation (Freg).
In the embodiments of the application, feature transformation refers to processing feature data through a feature transformation algorithm to obtain new features or higher-order features. The transformation operation may be performed on a single feature or on a plurality of features, which is not limited here.
Feature transformations may include transformations performed on one data feature (also referred to as single-feature transformations), feature transformations performed on two data features (also referred to as binary transformations) and transformations performed on more than two data features (also referred to as multivariate transformations). For single-feature transformations, the feature transformation algorithms for continuous data features may include one or more of normalization operations, nonlinear operations, discretization operations and the like. Normalization may include min-max normalization, 0-1 normalization, linear-function normalization, dispersion normalization and the like; nonlinear operations may include one or more of taking a logarithm (log), squaring (square), taking a square root (sqrt), the sigmoid function, the hyperbolic tangent (tanh) and the like; discretization operations may include one or more of equal-width or equal-frequency discretization, supervised discretization based on the minimum description length principle, and rounding operations (e.g., round functions). The feature transformation operations for discrete data features may include Frequency, i.e., counting the number of samples in which a data feature takes a particular value. Binary or multivariate transformations may include one or more of basic mathematical operations (e.g., addition, subtraction, multiplication, division), aggregation operations (group by) and time aggregation (group by time) operations on a plurality of data features.
It should be noted that, the foregoing description is only illustrative of some feature transformation, and embodiments of the present application may also include other feature transformation methods, which are not limited to this embodiment.
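For orientation only, a few of the transformations listed above can be expressed with pandas/NumPy as follows; the toy data frame and column names are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cost": [3.5, 12.0, 7.2, 9.1], "gender": ["m", "f", "f", "m"]})

# single-feature transformations on a continuous feature
df["cost_log"] = np.log1p(df["cost"])                                   # logarithm
df["cost_sqrt"] = np.sqrt(df["cost"])                                   # square root
df["cost_minmax"] = (df["cost"] - df["cost"].min()) / (df["cost"].max() - df["cost"].min())
df["cost_bin"] = pd.cut(df["cost"], bins=2, labels=False)               # equal-width discretization

# transformations on a discrete feature
df["gender_freq"] = df["gender"].map(df["gender"].value_counts())       # frequency
df = pd.concat([df, pd.get_dummies(df["gender"], prefix="gender")], axis=1)  # one-hot

# a binary transformation combining two features
df["cost_times_freq"] = df["cost"] * df["gender_freq"]
```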
In the embodiments of the application, multi-order feature transformation refers to performing feature transformation multiple times, with the data set obtained by one feature transformation serving as the basis of the next. That is, the first group of data sets undergoes the first-order feature transformation to obtain the second group of data sets, the second group undergoes the second-order feature transformation to obtain the third group, and so on, until a stop condition is satisfied and no further feature transformation is performed. The feature transformation algorithms used in each order of feature transformation may be the same or different.
In the embodiment of the present application, "evaluation value" (first evaluation value, second evaluation value, third evaluation value, fourth evaluation value, etc.) is used to evaluate the merits of a data set or candidate data set, and is generally used to describe the performance (accuracy, generalization ability, etc.) of a model obtained by training the data set.
In the embodiments of the application, data features describe the samples in a data set or candidate data set, while meta-features describe the data set or candidate data set itself. "Meta-features" describe the general attributes of a data set or candidate data set and can characterize its complexity.
For example, the data set includes a plurality of samples, each sample includes data features such as "age", "academic", "graduation institution", "sex", "birth date", "occupation", "working year", and the like, and the label corresponding to the sample is "payroll". It can be seen that the user aims to derive a machine learning model that can predict payroll through dataset training. The meta-features of the dataset may include the number of samples, the number of data features, the data similarity of each data feature to the tag, the distribution information of the values of each data feature, the information entropy of the tag, and the like.
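As a non-limiting sketch, a handful of such meta-features can be computed with pandas; the exact meta-feature set used in the embodiments is richer than this:

```python
import pandas as pd
from scipy.stats import entropy

def basic_meta_features(df, label_col):
    """A few of the meta-features named above: sample count, feature count,
    information entropy of the tag, and average distribution statistics."""
    features = df.drop(columns=[label_col])
    label_dist = df[label_col].value_counts(normalize=True)
    return {
        "n_samples": len(df),
        "n_features": features.shape[1],
        "label_entropy": float(entropy(label_dist)),
        "mean_skewness": float(features.skew(numeric_only=True).mean()),
        "mean_kurtosis": float(features.kurtosis(numeric_only=True).mean()),
    }
```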
A system architecture according to an embodiment of the present application is described below in conjunction with fig. 1, where the system 10 may include a training device 110, an execution device 120, a client device 130, a terminal device 140, a data storage system 150, and so on. Wherein:
The data storage system 150 may store a plurality of sample data for training the first machine learning model and the third machine learning model; the training device 110 is configured to execute the program code of a model training method to train the machine learning models; the execution device 120 is configured to execute the program code of a data processing method on a data set, which involves candidate data sets generated by feature transformation of the data set, a second machine learning model trained from a candidate data set, and so on.
The training device 110 may acquire sample data in the data storage system 150 to train the first machine learning model and the third machine learning model, and a specific training method may refer to the following description related to the embodiment of the training method of the first machine learning model or the embodiment of the training method of the third machine learning model, which is not repeated in the embodiments of the present application. The training device 110 transmits the trained first and third machine learning models to the execution device 120.
The first machine learning model obtained by training is used to process the meta-features of a data set input to the model to obtain a second evaluation value, and the second evaluation value is used to evaluate the accuracy of a model obtained by training with that data set. The third machine learning model obtained by training is used to process the meta-features of a data set input to the model to obtain fourth evaluation values in one-to-one correspondence with the B feature transformations, and each fourth evaluation value is used to evaluate the accuracy of a model obtained by training with the candidate data set produced by the corresponding feature transformation.
Because the first machine learning model and the third machine learning model are trained with meta-features of data sets, and because meta-features describe the attributes of a data set independently of the physical meaning and values of its data features, the first machine learning model and the third machine learning model are applicable to evaluating any data set.
In one case, the client may specify the data input to the execution device 120 (e.g., the original data set in an embodiment of the application), for example by operating in an interface provided by the I/O interface of the execution device 120. In another case, the client device 130 may automatically input data to the I/O interface and obtain the result; if automatic input requires the user's authorization, the client may set the corresponding permissions in the client device 130. The client device 130 requests the execution device 120 to apply the automatic machine learning service to the original data set to obtain the machine learning model required by the user (also referred to as the target machine learning model in the embodiment of the present application). The client may view the results output by the execution device 120 on the client device 130, presented as a display, a sound, an action, or the like. The client may input data, such as the original data set, to the execution device 120 via the client device 130. The client device 130 may also act as a data collection terminal and store the collected data sets in the data storage system 150.
The execution device 120 is implemented by one or more servers, optionally cooperating with other computing devices such as data storage, routers, and load balancers; it may be deployed at one physical site or distributed across multiple physical sites. The execution device 120 may implement the data processing method of the embodiment of the application using the data of the data storage system 150 or by invoking the program code in the data storage system 150. Specifically, the execution device 120 performs data preprocessing on the received original data set (for example, the original data set sent by the client device 130) to obtain a first set of data sets, and then obtains an optimal data set (also referred to as a target data set in the embodiment of the application) and the corresponding feature transformation algorithm (also referred to as a target feature transformation algorithm) through multi-order feature transformation and selection. Further, the execution device can train the established machine learning model with the optimal data set to obtain the target machine learning model. The first machine learning model and the third machine learning model can be trained offline and used during the multi-order feature transformation to accelerate feature transformation and selection.
Before performing feature transformation on a data set, the execution device 120 may input the meta-features of the data set to the third machine learning model to obtain fourth evaluation values in one-to-one correspondence with the B feature transformations, and screen out the feature transformations with larger fourth evaluation values based on these values, so that only the screened feature transformations are performed on the data set instead of all B feature transformations.
After performing feature transformation on the data set to obtain a plurality of candidate data sets, the execution device 120 may input the meta-features of the candidate data sets into the first machine learning model to obtain a second evaluation value for each candidate data set, and filter the candidate data sets based on the second evaluation values to further reduce their number. A candidate data set may be divided into a training data set and a test data set; the execution device 120 may train the second machine learning model with the training data set and then test and evaluate it with the test data set, obtaining a third evaluation value that evaluates the accuracy of the second machine learning model trained from the candidate data set. Since the third evaluation value is obtained by evaluating an actual model trained from the candidate data set, it evaluates the candidate data set more accurately. The execution device 120 may further screen the remaining candidate data sets based on the third evaluation values, keeping only a small number of candidate data sets for the next-order feature transformation, which greatly reduces the number of data sets and improves feature transformation efficiency. For specific implementations, see the description in the data processing method embodiments of the present application, which is not repeated here.
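The two-stage screening can be summarized with the following sketch; all callables are hypothetical placeholders for the trained first machine learning model, the meta-feature calculation, and the train/test evaluation described above, and the keep counts stand in for screening thresholds that the embodiment does not fix:

```python
def screen_candidates(candidates, meta_of, predict_second_eval, third_eval,
                      keep_stage1, keep_stage2):
    # Stage 1: cheap screening -- a second evaluation value predicted from
    # meta-features by the offline-trained first machine learning model.
    by_second = sorted(candidates,
                       key=lambda c: predict_second_eval(meta_of(c)),
                       reverse=True)
    survivors = by_second[:keep_stage1]

    # Stage 2: accurate but costly -- split each survivor into training and
    # test data, train a second machine learning model, and score it on the
    # test data to obtain the third evaluation value.
    by_third = sorted(survivors, key=third_eval, reverse=True)
    return by_third[:keep_stage2]   # enters the next-order feature transformation
```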
Further, after the multi-order feature transformation the execution device 120 obtains a plurality of data sets; a target data set among them (also referred to as the optimal data set in the embodiment of the present application) and the target feature transformation algorithm by which the original data set is transformed into the target data set may then be determined according to the third evaluation values of the plurality of data sets, and model training is performed with the target data set to obtain the target machine learning model required by the first user.
Still further, the execution device 120 may also send the target feature transformation algorithm and the target machine learning model to the client device 130.
The user may operate the respective terminal device 140 to interact with the execution device 120 or the client device 130 via a communication network of any communication mechanism/communication standard to use the target feature transformation algorithm and the target machine learning model for predictive services. The communication network may be a wide area network, a local area network, a point-to-point connection, or the like, or any combination thereof.
For example, the original dataset is as shown in table 1:
TABLE 1
The trained target machine learning model has the capability of predicting payroll. The terminal device 140 sends a first request to the execution device 120, the request carrying information of a first object, where the information of the first object includes gender, education, date of birth, occupation, and working years. The execution device 120 performs feature transformation on the information of the first object through the target feature transformation algorithm and inputs the transformed data to the target machine learning model, obtaining the predicted payroll of the first object. The execution device 120 may send the predicted payroll to the terminal device 140.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship among devices, apparatuses, modules, etc. shown in fig. 1 is not limited in any way, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 120, and in other cases, the data storage system 150 may be disposed in the execution device 120.
It should be further noted that, in the embodiment of the present application, the training device 110 and the executing device 120 may be the same device, or different devices. The training device 110 and/or the executing device 120 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR, a vehicle-mounted terminal, etc., or may be a server or a virtual machine, etc., or may be a distributed computer system formed by one or more servers and/or computers, etc., which is not limited by the embodiment of the present application. The client device 130 may be a server, a computer, a terminal device, or the like. The terminal device 140 may include a smart phone, a tablet computer, a personal computer, a desktop computer, an On Board Unit (OBU), a virtual reality device, an artificial intelligent device (e.g., a robot, etc.), or an intelligent wearable device, etc., which are not limited by the embodiments of the present application.
The following describes an application scenario according to an embodiment of the present application in conjunction with the system shown in fig. 2 and the graphical user interface shown in fig. 3. The cloud system may include a cloud platform 210 and cloud hosts; the cloud platform may create virtual machines on a cloud host, and a virtual machine running on the cloud host occupies its computing resources, such as the central processor (central processing unit, CPU), neural network processor (network processing unit, NPU), and/or memory of the cloud host.
In an embodiment of the present application, the cloud platform 210 may implement the functions of the execution device 120 and/or the training device 110 in fig. 1.
Cloud platform 210 provides automatic machine learning services to users and provides a graphical user interface through which the client device 220 may interact with cloud platform 210. Fig. 3 shows a graphical user interface provided by cloud platform 210 to the user; this graphical user interface 300 may be displayed on the client device 220 to provide the automatic machine learning service. The graphical user interface may include at least one control, and in response to a detected user operation on a control, the user interface associated with that control is displayed.
The client device 220 may display an import data window to read the original dataset in response to a user operation on a first control (e.g., icon 301 labeled "read data" in fig. 3), which may include files and/or folders stored within the client device 220, and upload the first file to the cloud platform 210 in response to a user operation entered by the user on the first file (e.g., a file containing the original dataset or a file containing the first set of data).
The client device 220 displays a data modification window for modifying the imported original data set in response to the user's selection of a second control (e.g., icon 302 labeled "modify raw data" in fig. 3). The data modification window may include a plurality of modification operations for each data feature and/or label in the original data set. It should be appreciated that the cloud platform 210 may modify the original data set automatically, and the user may also autonomously select a preprocessing mode to modify the original data set into a satisfactory data set. The modification operations may include conversion of value types, specifying label data, etc., and embodiments of the present application are not limited thereto.
In response to a user operation input for a third control (e.g., icon 303 labeled "automatic modeling" in fig. 3), the client device 220 sends a modeling instruction to the cloud platform 210. The instruction instructs the cloud platform 210 to process the modified original data set through the data processing method provided by the embodiment of the present application, that is, to perform feature preprocessing and multi-order feature transformation and selection to obtain an optimal data set (also referred to as a target data set in the embodiment of the present application). Specifically, cloud platform 210 may include a receiving module 211, a preprocessing module 212, a feature transformation module 213, a data set determination module 214, a training module 215, a sending module 216, and the like. The cloud platform 210 may receive the original data set sent by the client device 220 through the receiving module 211; perform data preprocessing on the original data set through the preprocessing module 212 to obtain a first set of data sets (i.e., the root-node data set); perform multi-order feature transformation on the first set of data sets through the feature transformation module 213 to obtain a plurality of data sets and an evaluation value of each (e.g., the first, second and/or third evaluation values in the embodiment of the present application); and find, through the data set determination module 214, the optimal data set and the target feature transformation algorithm that produces it according to the evaluation values of the plurality of data sets. The training module 215 determines the hyperparameters of the machine learning model based on the optimal data set, establishes the machine learning model, and trains it with the optimal data set to obtain the target machine learning model with the function the user requires.
The client device 220 sends an instruction to the cloud platform 210 indicating that the target feature transformation algorithm and the target machine learning model are to be saved, in response to a user operation input for the fourth control (e.g., icon 304 labeled "save model" in fig. 3). In response to the instruction, the cloud platform 210 may save the target machine learning model and the target feature transformation algorithm, or may send them to the client device 220 via the sending module 216.
The graphical user interface may further include a fifth control; in response to a detected user operation input on the fifth control (e.g., icon 305 labeled "split data" in fig. 3), the data uploaded by the user is divided into training data and test data, where the training data is the original data set or the first set of data sets, and the test data is used for evaluating the target machine learning model.
In response to a user operation input for the sixth control (e.g., icon 306 labeled "model application" in fig. 3), the client device 220 sends an instruction to the cloud platform 210 indicating that prediction is to be performed with the model. After receiving the instruction, the cloud platform 210 performs feature transformation on the test data through the target feature transformation algorithm and inputs the transformed data into the target machine learning model to obtain a prediction result for each test sample; optionally, the cloud platform 210 may send the prediction results to the client device 220.
In response to a user operation input for the seventh control (e.g., icon 307 labeled "model evaluate" in fig. 3), the client device 220 sends an instruction to the cloud platform 210 indicating that the model is to be evaluated. After receiving the instruction, the cloud platform 210 compares the prediction results with the real results (i.e., the labels in the test samples) to obtain an evaluation value of the prediction accuracy of the target machine learning model. The cloud platform 210 may also send the evaluation value to the client device 220, which may display it.
In response to a user operation input for the eighth control (e.g., icon 308 labeled "save data to data set" in fig. 3), the client device 220 sends an instruction to the cloud platform 210 indicating that the prediction results are to be saved. In another implementation, the cloud platform 210 may also store other data, for example, the meta-features of the data sets or candidate data sets obtained during the multi-order feature transformation and the evaluation values (e.g., third evaluation values) corresponding to those meta-features; the meta-features and third evaluation values are described in the following embodiments of the meta-feature calculation method and the data processing method and are not detailed here.
It should be noted that, fig. 2 is only an exemplary process for illustrating how to implement the human-computer interaction, and in practical application, other forms of graphical user interfaces may be further included, and the human-computer interaction process may also include other implementation manners, which are not limited herein. It should be further noted that the client device 220 may be the client device 130 in fig. 1 described above. Cloud platform 210 may be execution device 120 of fig. 1 described above.
It should be appreciated that preprocessing of the data may convert the received raw data set to a format required by the automated machine learning service, for example, converting the data to wide-table data, i.e., each row representing one sample, each column representing one data feature, including a tag column.
It should also be appreciated that the performance of the trained machine learning model (e.g., prediction accuracy, generalization ability, etc.) depends on the data set and algorithm used to train it. The application aims to obtain, from the original data set sent by the user, a target data set that can be used for determining hyperparameters or training the machine learning model the user requires.
Specific application scenarios are exemplarily described below.
First application scenario:
Mobile communication operators wish to convert more prepaid subscribers to postpaid subscribers and therefore to identify potential postpaid subscribers among the prepaid subscribers. The mobile communication operator may upload sample data (an original data set) including the information of a plurality of prepaid subscribers to the cloud platform 210 of the automatic machine learning service; the information of one subscriber constitutes one sample and may include data features such as the subscriber's age, package in use, average monthly telephone charge, average monthly data traffic, and SIM card usage duration, with the subscriber type after a preset duration (subscriber types including prepaid and postpaid) designated as the label.
The cloud platform 210 may apply the data processing method provided by the embodiment of the present application to the data features in this original data set to obtain the target data set for this scenario and the corresponding target feature transformation algorithm, then determine hyperparameters from the target data set, establish a machine learning model, and train it with the data features of the target data set as input under the supervision of the subscriber type, finally obtaining the target machine learning model. With the target feature transformation algorithm and the target machine learning model, whether a subscriber is a potential postpaid subscriber can be predicted from the data features of a known prepaid subscriber.
The second application scenario:
The communication carrier wants to predict the user's package L months in the future. The package usage information of the user at a first time may be used as features and the package used by the user at a second time as the label, to construct wide-table data as training data for a machine learning model that enables package recommendation by the service operation center (SOC). The second time is L months after the first time; the first time and the second time may be in units of months, and L is a positive integer.
The training data (i.e., the original data set in the embodiment of the present application) may include a plurality of data features: user identification (ID), whether the user uses a fixed-mobile converged package at the first time, the user's cumulative online duration up to the first time, the total billed amount for the month of the first time, the accumulated data traffic for the month of the first time, a flag for exceeding the package allowance in four consecutive months, the contract duration, the local outgoing voice-call duration for the month of the first time, and the like.
Through the data processing method provided by the embodiment of the application, the target data set and the target feature transformation algorithm can be obtained, and the target data set can be used to train a machine learning model to obtain the target machine learning model. With the target feature transformation algorithm and the trained target machine learning model, the user's future package can be predicted from the user's current package usage, and the predicted package can then be recommended to the user to realize network optimization.
Third application scenario:
The communication operator hopes to identify the OTT (over-the-top) service type of network behavior. The characteristics of the data stream can be used as data features and the OTT service type as the label, to construct wide-table data as training data for a machine learning model capable of OTT service identification. The OTT service classes may include video services, web browsing services, voice call services, video call services, music download services, and the like.
The training data (i.e., the original data set in the embodiment of the present application) may include a plurality of data features: at least one of a distribution of a number of stream packets, a distribution of a size of a stream packet, a distribution of an interval of a stream packet, a distribution of a number of upstream stream packets, a distribution of a size of an upstream stream packet, a distribution of an interval of an upstream stream packet, a distribution of a number of downstream stream packets, a distribution of a size of a downstream stream packet, a distribution of an interval of a downstream stream packet, and the like. It should be understood that the data of the first duration (such as 20 seconds, 30 seconds, etc.) is one data stream, and in the embodiment of the present application, one data stream is taken as a unit, and one data stream includes a plurality of data packets, and the plurality of data packets may be divided into an uplink data packet and a downlink data packet. It should also be appreciated that the distribution of the number of stream packets may be an average, standard deviation, variance, etc. of the number of packets in a data stream over a period of time (which may be 1 second, 0.5 second, 0.01 second, etc. of time less than the first period of time). Similarly, the size distribution of the stream data packet may be the average value, standard deviation, variance, etc. of the size of the stream data packet in a period of time; the interval distribution of the stream packets may be the average value, standard deviation, variance, etc. of the intervals of the adjacent packets.
Through the data processing method provided by the embodiment of the application, the target data set and the target feature transformation algorithm can be obtained from the original data set, and the target data set can be used to train a machine learning model to obtain the target machine learning model. With the target feature transformation algorithm and the trained target machine learning model, the OTT service class of a user's current data stream can be predicted from the features of the data stream.
Fourth application scenario:
Communication operators wish to predict cell traffic for network planning and optimization. For each cell, the traffic of the base station may be counted; the traffic in a plurality of consecutive time periods of the cell is used as data features, and the traffic in a subsequent time period is used as the label, to construct wide-table data as training data for a machine learning model that predicts cell traffic in a future time period.
The training data (i.e., the original data set in the embodiment of the present application) may include a plurality of data features: the traffic of the first cell in the first time period, the traffic of the first cell in the second time period, …, the traffic of the first cell in the Nth time period, with the label being the traffic of the first cell in the (N+K)th time period, where N and K are positive integers. The time periods have equal durations, such as days or months. For example, the training data may include the traffic of the first cell in each of the first month through the sixth month, with the label being the traffic of the first cell in the seventh month. That is, the machine learning model obtained from such training data can predict the traffic of the next month from the traffic of the first six months of the first cell.
Through the data processing method provided by the embodiment of the application, the target data set and the target feature transformation algorithm can be obtained, and the target data set can be used to train a machine learning model to obtain the target machine learning model. With the target feature transformation algorithm and the trained target machine learning model, the traffic of the first cell in a future time period can be predicted from its traffic in a plurality of consecutive past time periods, so that the communication operator can plan traffic and optimize the network for the first cell in advance according to the prediction.
It should be understood that the embodiment of the present application is only illustrated by taking the first cell as an example, where the first cell may be any one of the cells that need to perform traffic prediction, and it is understood that different cells correspond to different target feature transformation algorithms and target machine learning models.
Fifth application scenario:
The communications carrier wishes to predict whether a user will go off-network in the future (i.e., no longer use its communication network services). The network usage information of the user at a first time can be used as features, and whether the user is off-network at a second time as the label, to construct wide-table data as training data for a machine learning model with which the service operation center (SOC) identifies potential off-network users. The second time is L months after the first time; the first time and the second time may be in units of months, and L is a positive integer.
The training data (i.e., the original data set in the embodiment of the present application) may include a plurality of data features: user identification (ID), whether the user uses a fixed-mobile converged package at the first time, the user's cumulative online duration up to the first time, the total billed amount for the month of the first time, the accumulated data traffic for the month of the first time, a flag for exceeding the package allowance in four consecutive months, the contract duration, the local outgoing voice-call duration for the month of the first time, and the like.
Through the data processing method provided by the embodiment of the application, the target data set and the target feature transformation algorithm can be obtained from the original data set, and the target data set can be used to train a machine learning model to obtain the target machine learning model. With the target feature transformation algorithm and the trained target machine learning model, whether the user will go off-network in the future can be predicted from the user's current network usage.
Because the data sets provided by users and received by the cloud platform vary widely, and the candidate data sets obtained by transforming the same data set differ from one another, evaluating candidate data sets in the prior art requires online training and testing of a model for each candidate data set, which is time-consuming and makes automatic feature engineering inefficient.
In order to avoid or reduce online training and testing for candidate data sets, embodiments of the present application provide a method for evaluating candidate data sets through the meta-features of data sets. A meta-feature is independent of the specific data of a candidate data set; it describes the attributes of the data set or candidate data set, can characterize its complexity, and is one of the main factors for accelerating feature transformation and selection and improving feature transformation efficiency. The following describes, with reference to fig. 4, a method for calculating the meta-features of a data set according to an embodiment of the present application; the method may be executed by the execution device and may include some or all of the following steps:
S42: according to the data set, first information is calculated, the data set comprises M samples, each sample in the M samples comprises N data features and a label, the first information comprises data similarity and distribution similarity of every two data features in the N data features, the data similarity and the distribution similarity of every data feature in the N data features and the label, the data distribution information of every data feature in the N data features and the data distribution information of the label are at least one, and M, N is a positive integer.
It should be appreciated that N differs between data sets or candidate data sets; for example, N is N2 when computing the meta-features of the first candidate data set, and N is N1 when computing the meta-features of the root-node data set.
The quantities included in the first information are described below in turn.
(I) Data similarity:
The first data characteristic and the second data characteristic are any two data characteristics in the N data characteristics in the data set. Taking a first data feature and a second data feature as examples, the method for calculating the data similarity of any two data features in the N data features is based on the data collection of the first data feature and the data collection of the second data feature in the data set. In one specific implementation of the application, the data similarity of the first data feature and the second data feature may be represented by mutual information (mutual information, MI) of the first data feature and the second data feature.
Mutual information is an information measure in an information theory that can be seen as the amount of information contained in one random variable about another random variable, or as the uncertainty that one random variable has been reduced by knowing another random variable. Therefore, the mutual information can describe the data similarity between the data features, and when the correlation between the data features is strong, the corresponding mutual information value is larger, and vice versa. Therefore, the data similarity of the two data features can better reflect redundancy between the data features, and the data similarity of the data features and the tag can reflect the information size provided by the features to the tag.
Wherein the mutual information I(X; Y) of the first data feature and the second data feature may be expressed as:

I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}        (1)

In formula (1), X is the set of values of the first data feature in the dataset; Y is the set of values of the second data feature in the dataset; p(x) is the probability that the first data feature takes the value x, i.e., the ratio of the number of samples in which the first data feature equals x to the total number of samples M; p(y) is the probability that the second data feature takes the value y, i.e., the ratio of the number of samples in which the second data feature equals y to the total number of samples M; p(x, y) is the probability that the first data feature equals x and the second data feature equals y, i.e., the ratio of the number of samples satisfying both conditions to the total number of samples M. From a mathematical perspective, p(x, y) is the joint probability distribution of X and Y, and p(x) and p(y) are the marginal probability distributions of X and Y, respectively.
Similarly, if the second data feature is replaced by a tag, Y is a set of values of the tag in the data set, and mutual information of the data feature and the tag can be calculated.
It can be seen that the mutual information of any two data features in the N data features and the mutual information of the N data features and the labels can be calculated.
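For illustration, a minimal sketch of formula (1) for discrete-valued columns, assuming the probabilities are estimated by simple counting as described above:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information of two discrete columns, per formula (1)."""
    M = len(x)
    count_x = Counter(x)                 # value -> number of samples
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))        # joint counts
    mi = 0.0
    for (xv, yv), n in count_xy.items():
        p_xy = n / M
        mi += p_xy * np.log(p_xy / ((count_x[xv] / M) * (count_y[yv] / M)))
    return mi

# Replacing the second column with the label gives the feature-label MI.
print(mutual_information([0, 0, 1, 1], ["low", "low", "high", "high"]))
```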
It should be appreciated that the data similarity of the present application may also include other implementations, such as, without limitation, pearson product moment correlation coefficient (Pearson correlation coefficient), maximum information coefficient (maximal information coefficient, MIC), spearman rank correlation coefficient (Spearman correlation), typical correlation analysis (canonical correlation analysis, CCA), rank correlation coefficient (coefficient of rank correlation), and the like.
(II) distribution similarity:
Taking the first data feature and the second data feature as an example, the distribution similarity of any two of the N data features is calculated from the set of values of the first data feature and the set of values of the second data feature in the dataset. In a specific implementation of the application, the distribution similarity of the first data feature and the second data feature may be represented by their chi-square value and/or t statistic. For convenience of description, in the embodiment of the present application, the distribution similarity obtained through the Chi-square test is referred to as the first distribution similarity, and the distribution similarity obtained through the T test is referred to as the second distribution similarity. The first information may include the first distribution similarity (the chi-square value) and/or the second distribution similarity (the t statistic) of the first data feature and the second data feature.
It should be noted that the T-test is performed only between continuous data features or between continuous data features and tags representing regression problems. The chi-square test may be performed between two discrete data features or between a discrete data feature and a label representing a classification problem, or may be performed after discretizing a continuous data feature and/or a label representing a regression problem.
The chi-square test or T test measures the degree of deviation between the first data feature and the second data feature across samples. The degree of deviation determines the magnitude of the chi-square value: the larger the chi-square value, the less consistent the data distributions of the two features; conversely, the smaller the chi-square value, the smaller the deviation and the more consistent the two distributions. Thus, comparing the distributions of two features with the chi-square test can reveal redundancy in the data. Likewise, the more similar a feature's distribution is to the target's, the better that feature can distinguish the targets.
The first distribution similarity χ² of the first data feature and the second data feature is:

\chi^2 = \sum_{k=1}^{K} \frac{(X_k - Y_k)^2}{Y_k}        (2)

In formula (2), X_k is the frequency (also called probability) with which the first data feature in the dataset takes level k, Y_k is the frequency with which the second data feature takes level k, and K is the number of value bins (i.e., the number of levels into which the values of the first or second data feature are divided); k and K are positive integers with 1 ≤ k ≤ K.
For example, if the difference between the maximum and minimum of the values of the first data feature in the dataset is U, then level k corresponds to the interval [X_min + (k−1)·U/K, X_min + k·U/K], and X_k is the ratio of the number of samples whose first-data-feature value falls in this interval to the total number of samples M, where X_min is the minimum of the values of the first data feature in the dataset.
Similarly, replacing the second data feature with a tag may calculate the chi-square value (first distribution similarity) of the data feature and the tag.
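A sketch of formula (2), assuming both features are continuous and binned into K equal-width levels over the range of the first feature as in the example above; skipping empty levels of the second feature is an added assumption to avoid division by zero:

```python
import numpy as np

def chi_square_similarity(x, y, K=10):
    """First distribution similarity per formula (2), comparing per-level
    frequencies of two feature columns."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(x.min(), x.max(), K + 1)   # level boundaries
    X_k = np.histogram(x, bins=edges)[0] / len(x)  # frequency of level k in x
    Y_k = np.histogram(y, bins=edges)[0] / len(y)
    mask = Y_k > 0                                 # skip empty levels (assumption)
    return float(np.sum((X_k[mask] - Y_k[mask]) ** 2 / Y_k[mask]))

rng = np.random.default_rng(0)
print(chi_square_similarity(rng.normal(size=500), rng.normal(size=500)))
```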
The second distribution similarity t of the first data feature and the second data feature is:

t = \frac{\bar{d} - \mu_0}{s_d / \sqrt{M}}, \qquad \bar{d} = \frac{1}{M} \sum_{i=1}^{M} d_i        (3)

In formula (3), d_i = |x_i − y_i| and μ₀ is the T test parameter; \bar{d} and s_d are the mean and standard deviation of the d_i; i is the sample index in the dataset, 1 ≤ i ≤ M; x_i is the value of the first data feature in sample i; y_i is the value of the second data feature in sample i; M is the number of samples in the dataset, and M is a positive integer.
Similarly, the t statistics (second distribution similarity) of the data features and the tags can be calculated by replacing the second data features with tags.
It can be seen that the chi-square value of any two discrete data features of the N data features and of each discrete data feature with the label can be calculated, and the t statistic of any two continuous data features of the N data features and of each continuous data feature with the label can be calculated.
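A sketch of formula (3); the paired form with the sample standard deviation of the differences is the standard T test and is assumed here:

```python
import numpy as np

def t_statistic(x, y, mu0=0.0):
    """Second distribution similarity per formula (3): a paired t statistic
    over the per-sample differences d_i = |x_i - y_i|."""
    d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    M = len(d)
    d_bar = d.mean()
    s_d = d.std(ddof=1)          # sample standard deviation of the differences
    return float((d_bar - mu0) / (s_d / np.sqrt(M)))

print(t_statistic([1.0, 2.0, 3.0, 4.0], [1.1, 2.3, 2.8, 4.4]))
```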
It should be understood that the distribution similarity may also include other implementations in the present application, such as KL divergence (Kullback-Leibler divergence, KLD), braman divergence (Bregman divergence), maximum mean difference (maximum MEAN DISCREPANCY, MMD), copula & tail-dependence based on Copula function, and the like, which are not limited thereto.
(III) distribution information:
In classification or regression problems, the more concentrated the distribution of data features, the smaller the corresponding discrimination; conversely, the flatter the distribution of data features, the greater the likelihood of distinguishing between different categories. The distribution information of the data features can be represented by adopting two indexes of skewness and kurtosis.
Skewness is the degree of asymmetry or skew of a data distribution and measures the direction and degree of skew of the statistical distribution; skewed distributions fall into two types, left-skewed (negative skew) and right-skewed (positive skew). In general, skewness is defined as the third-order standardized moment of the sample. Under this definition, a distribution may be normal (skewness = 0), right-skewed (also called positively skewed, skewness > 0), or left-skewed (also called negatively skewed, skewness < 0).
Kurtosis refers to the degree of concentration of the data and the steepness (or flatness) of the distribution curve. Kurtosis is generally measured against the normal distribution curve and classified as normal kurtosis, peaked distribution, or flat-topped distribution. The steepness of the distribution curve is directly related to the even-order central moments; to eliminate the influence of dimension, kurtosis is measured by the relative number obtained by dividing the fourth-order central moment by the fourth power of the standard deviation. For a normal distribution curve, the ratio of the fourth-order central moment m4 to the fourth power of the standard deviation equals 3.
Taking the first data feature as an example, the method for calculating the distribution information of any one of the N data features is described. The skewness γ₁ of the first data feature in the dataset is:

\gamma_1 = \frac{1}{M} \sum_{i=1}^{M} \left( \frac{x_i - \mu}{\sigma} \right)^3

The kurtosis γ₂ of the first data feature in the dataset is:

\gamma_2 = \frac{1}{M} \sum_{i=1}^{M} \left( \frac{x_i - \mu}{\sigma} \right)^4

Wherein i is the index of samples in the dataset, 1 ≤ i ≤ M, and M is the number of samples; μ is the mean of the values of the first data feature in the dataset, σ is their standard deviation, and x_i is the value of the first data feature in sample i.
It should be understood that the distribution information in the present application may also be represented by one or more of mean, variance, coefficient of variation (coefficient of variation, CV), mutation point location, information entropy, kunit coefficient, etc., which is not limited thereto.
Similarly, replacing the first data feature with the label, the skewness and kurtosis of the label can be calculated.
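A sketch of the two statistics above; the population standard deviation is used here, and whether the embodiment uses the sample or population form is not specified:

```python
import numpy as np

def skewness(x):
    """Third standardized central moment (gamma_1 above)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return float(np.mean(((x - mu) / sigma) ** 3))

def kurtosis(x):
    """Fourth central moment over sigma^4 (gamma_2 above); 3.0 for a normal curve."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return float(np.mean(((x - mu) / sigma) ** 4))

rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)
print(skewness(sample), kurtosis(sample))   # approximately 0 and 3
```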
In summary, the mutual information of every two of the N data features, the mutual information between each of the N data features and the label, the chi-square value of every two discrete data features and of each discrete data feature with the label, the t statistic of every two continuous data features and of each continuous data feature with the label, the distribution information of each of the N data features, and the distribution information of the label together form the first information (also referred to as the data entropy matrix in the embodiment of the application).
S44: a meta-feature of the dataset is calculated from the first information.
The meta-features of the dataset may include at least one of the basic features of the dataset, the features of the continuous data features, the features of the discrete data features, the features of the label, the features of data similarity, the features of distribution similarity, the features of the distribution information of the data features, and the like. Based on the obtained first information (the data entropy matrix), further computation through statistics, association analysis, data-complexity calculation, and the like finally yields the characterization features of the dataset, i.e., its meta-features.
The basic features of the dataset describe its basic condition and may include at least one of the total number of samples, the total number of data features, the total number of label categories, the ratio of the total number of data features to the total number of samples, and the like.

The features of the continuous data features are extracted from the data of the continuous data features and describe the attributes of the set of continuous data features; they may include at least one of the total number of continuous data features, the ratio of that total to the total number of data features, and the like.

The features of the discrete data features are extracted from the data of the discrete data features and describe the attributes of the set of discrete data features; they may include at least one of the total number of discrete data features, the ratio of that total to the total number of data features, and the like.

The features of the label are extracted from the data of the label and describe its attributes; they may include at least one of the information entropy of the label, the Gini coefficient of the label, the average sample ratio of the label categories, the kurtosis of the label, the skewness of the label, and the like.

The features of data similarity are extracted from the data similarity between data features and/or between data features and the label; they may include at least one of the maximum, mean, standard deviation, and the like of the data similarities between the label and the data features.

The features of distribution similarity are extracted from the distribution similarity between data features and/or between data features and the label; they may include at least one of the maximum, mean, standard deviation, and the like of the distribution similarities between the label and the data features.

The features of distribution information are extracted from the distribution information (such as kurtosis and skewness) of the data features; they may include at least one of the maximum kurtosis, minimum kurtosis, mean kurtosis, maximum skewness, minimum skewness, mean skewness, and the like.
Wherein the information entropy of the tag represents the average amount of information in the tag.
For example, the information entropy of the label is calculated by:

H = - \sum_{i} P(z_i) \log_b P(z_i)

where i is the index of the label category, P(z_i) is the probability that the label category is z_i among the M samples, and b is the base of the logarithm, typically 10 or the natural constant e.
For another example, the average sample ratio of label categories: labels can be divided into categories; for example, when the label is gender, the model solves the problem of predicting "male" versus "female", the label has two categories ("male" and "female"), and the average sample ratio of the label categories is 0.5.
For another example, the mean of the data similarities of pairs of data features is calculated as follows: sum the data similarities of every two of the N data features and divide by the number of similarities.
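Minimal sketches of the label entropy and of the pairwise-similarity aggregation just described; the `similarity` callable is a placeholder, e.g., the mutual information function sketched earlier:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def label_entropy(labels, b=np.e):
    """Information entropy of the label over M samples, base b."""
    M = len(labels)
    return -sum((n / M) * np.log(n / M) / np.log(b)
                for n in Counter(labels).values())

def mean_pairwise_similarity(columns, similarity):
    """Mean of the data similarities over every pair of the N data features."""
    pairs = list(combinations(columns, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

print(label_entropy(["male", "female", "female", "male"]))   # log 2
```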
It should be appreciated that meta-characteristics may also include data items that include other attributes describing the data set, and embodiments of the present application are not limited.
In the embodiment of the present application, the meta-feature corresponding to any one data set in any one data set, the meta-feature corresponding to the candidate data set obtained by the data set through feature transformation, the meta-feature included in the first sample for training the first machine learning model, and the meta-feature included in the second sample for training the third machine learning model may all be calculated by the above-mentioned meta-feature calculation method.
The following describes a training method of the first machine learning model according to an embodiment of the present application. It should be understood that the first machine learning model is used to predict the second evaluation value of a candidate data set and may be trained offline. The training method may specifically include: the training device acquires a plurality of first samples, any one of which includes the meta-features of a second data set and an evaluation value of that second data set (called the third evaluation value in the embodiment of the application; it may also be called the true or trusted evaluation value, as its reliability is high); further, with the meta-features of the first samples as input, the first machine learning model is trained by supervising the evaluation values.
Wherein the second data set is a public data set including a plurality of data features and labels. It should be understood that the second data sets corresponding to the meta-features in different first samples are different, specifically in the number and meaning of the data features and in the labels.
Through the meta-feature calculation method above, the meta-features of the second data set can be calculated, and an evaluation value (e.g., the AUC, area under the ROC curve) is calculated for the machine learning model trained on the second data set. The meta-features of a plurality of second data sets and the evaluation value corresponding to each set of meta-features constitute the plurality of first samples for training the first machine learning model.
It can be seen that the trained first machine learning model may predict the evaluation value of a dataset based on its meta-features (referred to as the second evaluation value in the embodiment of the present application, also called the estimated evaluation value; it is an estimate and has relatively low accuracy).
It should be appreciated that meta-features are independent of the specific data of a data set and represent its attributes, so the first machine learning model trained on meta-features can be applied to all data sets, i.e., any data set sent by a user and any data set or candidate data set generated at each transformation order; after the meta-features of a data set are calculated, its second evaluation value can be estimated by the first machine learning model. The second evaluation value may reflect the accuracy, generalization ability, etc. of a model trained with that data set.
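A hedged sketch of this offline training; the choice of GradientBoostingRegressor and the synthetic stand-in data are assumptions, since the embodiment does not fix the model family of the first machine learning model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each first sample pairs the meta-feature vector of a public dataset with the
# measured evaluation value (e.g., AUC) of a model trained on that dataset.
rng = np.random.default_rng(0)
meta_vectors = rng.random((200, 12))        # 200 public datasets, 12 meta-features
measured_auc = rng.uniform(0.5, 1.0, 200)   # their true (third) evaluation values

first_model = GradientBoostingRegressor().fit(meta_vectors, measured_auc)

# At transformation time, a candidate's second evaluation value costs one
# prediction -- no model is trained on the candidate data set itself.
second_eval = float(first_model.predict(meta_vectors[:1])[0])
```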
The following describes a training method of a third machine learning model according to an embodiment of the present application, where it should be understood that the third machine learning model is used to predict a fourth evaluation value of a data set after feature transformation, and the third machine learning model may be trained offline, and the training method may specifically include the following two implementation manners:
Implementation mode (1):
The training device may acquire a plurality of second samples, any one of which includes the meta-features of a fourth data set and the difference between the evaluation value of the data set obtained by applying a second feature transformation to the fourth data set and the evaluation value of the fourth data set, where the second feature transformation is any one of the B feature transformations; further, with the meta-features of the second samples as input, the third machine learning model is trained by supervising the differences.
It can be seen that the third machine learning model obtained in implementation (1) may predict, based on the meta-features of a data set, the gain in evaluation value after each of the B feature transformations, and thus predict, before a feature transformation is performed, whether it will improve the evaluation value of the data set.
Implementation mode (2):
The training device may acquire a plurality of third samples, any one of which includes the meta-features of a fourth data set and the evaluation value of the data set obtained by applying a second feature transformation to the fourth data set, where the second feature transformation is any one of the B feature transformations; further, with the meta-features of the third samples as input, the third machine learning model is trained by supervising the evaluation values.
It can be seen that the third machine learning model obtained in implementation (2) may predict, based on meta-features, the evaluation value of the data set after each of the B feature transformations (also referred to as the fourth evaluation value in the embodiment of the invention), and thus predict the evaluation value of the transformed data set before the feature transformation is performed.
Wherein the fourth dataset is a public dataset including a plurality of data features and labels. It should be understood that the fourth data sets corresponding to the meta-features in different second samples are different, specifically in the number and meaning of the data features and in the labels.
Through the meta-feature calculation method above, the meta-features of the fourth data set can be calculated; an evaluation value AUC1 is calculated for the machine learning model trained on the fourth data set, and an evaluation value AUC2 is calculated for the machine learning model trained on the candidate data set obtained by feature transformation of the fourth data set. The meta-features of a plurality of fourth data sets and the difference AUC2 − AUC1 corresponding to each set of meta-features form the plurality of second samples for the third machine learning model of implementation (1). The meta-features of a plurality of fourth data sets and the evaluation value AUC2 corresponding to each set of meta-features constitute the plurality of third samples for the third machine learning model of implementation (2).
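A hedged sketch of implementation (1); realizing the third machine learning model as B separate regressors (rather than a single model with B outputs, which the embodiment equally allows) is an assumption made for brevity, and the data are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Implementation (1) realized as one regressor per feature transformation,
# each trained to predict the AUC gain (AUC2 - AUC1) from meta-features.
rng = np.random.default_rng(1)
B = 5                                          # number of feature transformations
meta_vectors = rng.random((200, 12))           # meta-features of fourth data sets
auc_gain = rng.normal(0.0, 0.05, (200, B))     # AUC2 - AUC1 per transformation

third_model = [GradientBoostingRegressor().fit(meta_vectors, auc_gain[:, t])
               for t in range(B)]

# Before transforming a new data set, keep only transformations whose
# predicted gain is positive.
gains = np.array([m.predict(meta_vectors[:1])[0] for m in third_model])
kept_transformations = np.flatnonzero(gains > 0)
```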
It should be understood that the third machine learning model may further include other training methods, which are not described in detail in the embodiments of the present application.
It should be further understood that meta-features are independent of the specific data of a data set and represent its attributes, so the third machine learning model trained on meta-features can be applied to all data sets, i.e., any data set sent by a user and any data set generated at each transformation order; after the meta-features of a data set are calculated, the third machine learning model can estimate, for the B feature transformations, the fourth evaluation values of the B resulting candidate data sets. The fourth evaluation value may indicate the accuracy or generalization ability of a model trained with the corresponding candidate data set, or whether the transformation yields a gain in generalization ability.
The following describes a data processing method according to an embodiment of the present application in conjunction with the flowchart of the data processing method shown in fig. 5, where the method may be executed by the execution device 120 in fig. 1, the cloud platform 210 in fig. 2, or a processor in the execution device. The method may include some or all of the following steps.
S52: a first set of data sets is acquired.
Wherein the first set of data sets includes one data set, which is the data set corresponding to the root node of the tree structure. The first set of data sets may include M samples, any one of the M samples including N1 data features and a label, with M and N1 being positive integers.
The first set of data sets may be data of the original data set sent by the user device to the execution device (cloud platform) after the data preprocessing. Wherein the preprocessing of the original data set may include one or more of data cleansing, sampling, formatting, and feature digitizing, among others.
S54: a multi-order feature transformation is performed on N1 data features in the first set of data sets.
It should be appreciated that when a data set is feature-transformed, only the data features are transformed; the labels are not. The nth order feature transformation is one pass of the multi-order feature transformation; for its details, see the description of the nth order feature transformation given with figs. 6A, 6B and 6C below, which is not repeated here.
The execution device may set a stop condition for the feature transformation; once the stop condition is satisfied, the execution device stops the feature transformation and executes step S56. In one implementation, the execution device may set a maximum order for the feature transformation, e.g., order 8, and stop after the 8th order feature transformation is performed. In another implementation, the execution device may determine whether the feature transformation of the current order yields a gain: it checks whether the average of the first evaluation values of the data sets obtained by the current-order feature transformation is larger than the average of the first evaluation values of the data sets obtained by the previous-order feature transformation. If so, the current order produced a gain and the next-order feature transformation may proceed; if not, the execution device may stop the feature transformation.
It should be understood that embodiments of the present application may also include other stop conditions, which are not limited in this regard.
For example, one stop condition may be that the average of the first evaluation values of the nth group of data sets is smaller than the average of the first evaluation values of the data sets obtained by the previous-order feature transformation.
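Putting the stop conditions together, the multi-order loop might be organized as in the following sketch, where expand_one_order is a hypothetical helper standing in for one order of transformation (steps S541 to S544 below) and returns the next group of data sets with their first evaluation values:

```python
def multi_order_transform(first_group, expand_one_order, max_order: int = 8):
    """Run feature transformation order by order until a stop condition holds:
    either the maximum order is reached, or the average first evaluation value
    no longer improves over the previous order."""
    group, prev_avg = first_group, float("-inf")
    produced = []                                   # the "first set" used in S56
    for order in range(1, max_order + 1):           # stop condition 1: max order
        group, scores = expand_one_order(group)
        produced.extend(group)
        avg = sum(scores) / len(scores)
        if avg <= prev_avg:                         # stop condition 2: no gain
            break
        prev_avg = avg
    return produced
```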
S56: a target data set is determined from a first set, the first set including a data set obtained from each order feature transformation in the process of the multi-order feature transformation.
In one implementation of the embodiment of the present application, a data set corresponding to a first evaluation value that is the largest in a first set may be determined as a target data set, where the first set includes a data set obtained by each order of feature transformation in a multi-order feature transformation process.
In another implementation of the embodiment of the present application, a data set corresponding to the largest third evaluation value in the first set may be determined as the target data set.
The target data set is the optimal data set determined by the execution device from the original data sent by the user, i.e., the result of feature engineering's selection over the transformed data features. The target data set can be used for model building and training to obtain the model required by the user.
The model may be built and trained by using a model building and training method in the prior art, and embodiments of the present application are not limited thereto.
The embodiment of the application takes the nth order feature transformation as an example to describe the process of the nth order feature transformation, wherein n is a positive integer. The specific implementation of the feature transformation of the nth order is described in conjunction with the flow diagram of the feature transformation of the nth order shown in fig. 6A, the schematic explanatory diagram of the feature transformation process of the nth order shown in fig. 6B, and the tree structure shown in fig. 6C, and includes some or all of the following steps:
S541: for each dataset D i in the nth set of datasets, inputting the meta-features of dataset D i to a third machine learning model, predicting to obtain fourth evaluation values corresponding to the B feature transforms respectively, and selecting the feature transform corresponding to the fourth evaluation value meeting the fourth condition from the B feature transforms to be A i feature transforms.
The fourth evaluation value corresponding to a first feature transformation is used to evaluate the accuracy of a model trained on the candidate data set obtained by applying the first feature transformation to data set D_i, where the first feature transformation is any one of the B feature transformations and B is a positive integer; the nth group of data sets is the set of data sets obtained by the (n-1)th order feature transformation, i is the index of a data set in the nth group, and i is a positive integer. It should be understood that the meta-features of data set D_i can be obtained by the meta-feature calculation method mentioned above, and the third machine learning model is obtained by the third machine learning model training method; for specific implementations, see the related descriptions in those embodiments, which are not repeated here.
In fig. 6C, the nth group of data sets is illustrated as including two data sets (i.e., D_1 and D_2).
For the third machine learning model obtained in implementation (1), one specific implementation of the selection in S541 may be: the execution device selects, from the B feature transformations, the feature transformations whose fourth evaluation values are greater than 0 as the A_i feature transformations; that is, it keeps the feature transformations predicted to produce a gain and discards those predicted not to.
For the third machine learning model obtained in implementation (2), one specific implementation of the selection in S541 may be: the execution device selects, from the B feature transformations, the feature transformations whose fourth evaluation values are greater than the first evaluation value of the data set as the A_i feature transformations. Another specific implementation may be: select, from the B feature transformations, the feature transformations whose fourth evaluation values are greater than a preset threshold as the A_i feature transformations, or select the feature transformations corresponding to the top A_i fourth evaluation values, ranked from large to small.
It should be understood that the selection in S541 may also be implemented in other ways, which are not described here. It should also be appreciated that the kinds and number of feature transformations selected for different data sets in the nth group of data sets may differ.
Through steps S541 and S542, before a data set is feature-transformed, the fourth evaluation value of each candidate data set that a feature transformation would generate is estimated by the offline-trained third machine learning model, the feature transformations are screened based on these fourth evaluation values, and only the screened feature transformations are applied to the data set. This pre-pruning reduces the number of feature transformations performed and of first evaluation values calculated, thereby improving evaluation efficiency. It corresponds to the branches removed at ① in fig. 6C.
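A minimal sketch of this pre-pruning, assuming the third machine learning model is a regressor over rows of (meta-features, transformation index) exposing a scikit-learn style predict(), and using the gain-greater-than-zero rule of implementation (1):

```python
import numpy as np

def prune_transforms(meta_features: np.ndarray, third_model, transforms: list) -> list:
    """S541 sketch: predict a fourth evaluation value for every one of the B
    feature transformations and keep only those predicted to produce a gain."""
    rows = np.array([np.append(meta_features, j) for j in range(len(transforms))])
    fourth_evals = third_model.predict(rows)       # one value per transformation
    return [t for t, v in zip(transforms, fourth_evals) if v > 0.0]  # fourth condition
```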
S542: and (3) respectively carrying out feature transformation of A i types on each data set D i in the nth group of data sets to obtain a plurality of candidate data sets.
It should be understood that, for the algorithms of the feature transformations, reference may be made to the related descriptions in the above embodiments, which are not repeated here. It should also be appreciated that the execution device may identify the type of each data feature in the data set and, based on that type, determine which feature transformations can be applied to the data feature.
It should also be appreciated that step S541 is not a necessary step of the embodiment of the present application. In another embodiment, A_i may be a fixed value; for example, A_i may equal B, i.e., no pre-pruning is performed before the feature transformation, and all of the B provided feature transformations applicable to data set D_i are performed on it.
In the embodiment of the present application, the candidate data set D_{i,j} is the candidate data set obtained by applying feature transformation T_j to data set D_i in the nth group of data sets, where j is the index of the feature transformation among the A_i feature transformations, 1 ≤ j ≤ A_i, and j is a positive integer.
As shown in fig. 6C, the candidate data sets obtained by performing A_1 feature transformations (A_1 is 5 in fig. 6C) on data set D_1 are D_{1,1}, D_{1,2}, D_{1,3}, D_{1,4}, D_{1,5}, and the candidate data sets obtained by performing A_2 feature transformations (A_2 is 5 in fig. 6C) on data set D_2 are D_{2,1}, D_{2,2}, D_{2,3}, D_{2,4}, D_{2,5}.
S543: a first evaluation value of each of the plurality of candidate data sets is calculated.
Wherein the first evaluation value of the dataset is used to evaluate the accuracy of the trained model of the dataset.
The embodiment of the present application takes the candidate data set D_{i,j} (corresponding to the first candidate data set in the embodiment of the present application) as an example; the first evaluation value of every candidate data set can be calculated in the same way. The calculation method of the first evaluation value of candidate data set D_{i,j} is as follows:
A first calculation method of a first evaluation value:
Step S5431: a meta-feature of candidate data set D i,j is calculated from candidate data set D i,j, the meta-feature being used to represent attributes of candidate data set D i,j.
For the method of calculating the meta-features of candidate data set D_{i,j}, see the related description in the meta-feature calculation embodiment, which is not repeated here. It should be noted that, after n orders of feature transformation, both the number and the meaning of the data features in candidate data set D_{i,j} may differ from those of the data features in the first group of data sets; the label, however, is never transformed during the n orders of feature transformation, so each data set in the (n+1)th group and the candidate data sets obtained by transforming it all contain the same label data.
Step S5432: the meta-features are input to the first machine learning model to predict a second evaluation value of the candidate data set D i,j.
The first machine learning model is an offline-trained machine learning model whose input is meta-features and whose output is a second evaluation value. The second evaluation value of candidate data set D_{i,j} is used to indicate the performance, such as accuracy, of a model trained on candidate data set D_{i,j}, where accuracy is the degree to which the model trained on candidate data set D_{i,j} predicts input data correctly.
It should be understood that the second evaluation value is an estimated evaluation value; its accuracy is lower than that of the evaluation value (the third evaluation value) obtained by actually training a model on the candidate data set and testing it.
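For illustration, and assuming the first machine learning model exposes a scikit-learn style predict() while reusing the hypothetical compute_meta_features helper from the earlier sketch, steps S5431 and S5432 reduce to:

```python
import numpy as np

def second_evaluation(candidate_X: np.ndarray, candidate_y: np.ndarray, first_model) -> float:
    """S5431-S5432 sketch: derive meta-features from the candidate data set and
    let the offline-trained first machine learning model estimate the accuracy
    of a model that would be trained on it, without actually training one."""
    meta = compute_meta_features(candidate_X, candidate_y)
    return float(first_model.predict(meta.reshape(1, -1))[0])
```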
Step S5433: a first evaluation value of candidate data set D i,j is determined based on the second evaluation value of candidate data set D i,j.
In a specific implementation of the embodiment of the present application, the first evaluation value used for screening the candidate data sets may be the second evaluation value, that is, the execution device may directly perform screening according to the second evaluation value of each candidate data set.
In another specific implementation of the embodiment of the present application, the first evaluation value used for candidate data set screening may be calculated from the second evaluation value. Optionally, the first evaluation value of candidate data set D_{i,j} may be obtained by combining a first data item and a second data item, for example, as the sum of the first data item and the second data item, where the first data item is positively correlated with the second evaluation value of candidate data set D_{i,j}, and the second data item is determined by the historical gain count of feature transformation T_j.
The historical gain count is the number of first data sets in the first n groups of data sets, where a first data set is a data set obtained by applying feature transformation T_j to a second data set, the second data set is a data set in the first n groups, and the second evaluation value of the second data set is smaller than that of the first data set. In other words, it counts how many times applying T_j has produced a data set whose second evaluation value exceeds that of its parent.
It should be appreciated that if a data set in the first n groups, or a current candidate data set, was derived using feature transformation T_j and its second evaluation value is greater than the second evaluation value of its parent-node data set, the feature transformation T_j is considered to have produced a gain.
For example, the first evaluation value of candidate data set D_{i,j} may be given by:

P'(D_{i,j}) = P(D_{i,j}) + N(T_j) / N'(T_j)    (6)

In formula (6), P'(D_{i,j}) is the first evaluation value of candidate data set D_{i,j} and P(D_{i,j}) is its second evaluation value; N(T_j) is the number of data sets among the first n groups that were obtained using feature transformation T_j and whose second evaluation values produced a gain; N'(T_j) is the number of data sets, among the first n groups of data sets together with the candidate data sets of the nth group, that were obtained using feature transformation T_j and for which a second evaluation value was generated.
Compared with screening candidate data sets by the second evaluation value alone, the above calculation adjusts the second evaluation value using the historical gain count of the feature transformation that generated the candidate data set, which can prevent the feature transformation search from falling into a local optimum.
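Read this way, the adjustment can be computed as in the following minimal sketch; the additive form follows the "sum of a first data item and a second data item" described above, and the unweighted gain-rate term is an assumption, since the exact scaling of formula (6) is not reproduced here:

```python
def first_evaluation(second_eval: float, gain_count: int, use_count: int) -> float:
    """Sketch of formula (6): the second evaluation value P(D_ij) plus the
    historical gain rate N(T_j) / N'(T_j) of the transformation T_j that
    produced the candidate. The unweighted sum is an assumption."""
    gain_rate = gain_count / use_count if use_count else 0.0
    return second_eval + gain_rate
```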
S544: an n+1th set of data sets is determined based on the first evaluation value of each of the plurality of candidate data sets. Wherein the number of data sets in the n+1 group of data sets is less than the number of candidate data sets.
Screening the (n+1)th group of data sets (which may also be referred to as the (n+1)th level nodes) out of the plurality of candidate data sets may be done in the following three ways:
first implementation:
The execution device may select, as the (n+1)th group of data sets, the candidate data sets whose first evaluation values are larger than a first threshold. The first threshold may be a fixed value, or may be obtained by statistically analyzing the first evaluation values of the plurality of candidate data sets so as to suit the nth order feature transformation; for example, the first threshold may be the average of the first evaluation values of the plurality of candidate data sets.
The second implementation mode:
The execution device may sort the first evaluation values of the plurality of candidate data sets from large to small and select, as the (n+1)th group of data sets, the candidate data sets corresponding to the first m first evaluation values in this order, m being a positive integer.
Third implementation mode:
S5441: the execution device selects a candidate data set for which a first evaluation value satisfies a first condition among the plurality of candidate data sets.
The screening of candidate data sets satisfying the first condition (i.e., the first screening process, based on the first evaluation value) may be implemented as follows: the execution device selects the candidate data sets whose first evaluation values are greater than a second threshold; or the execution device sorts the first evaluation values of the plurality of candidate data sets from large to small and selects the candidate data sets corresponding to the first g first evaluation values, g being a positive integer. Similar to the first threshold in the first implementation, the second threshold may be a fixed value or may be obtained by statistically analyzing the first evaluation values of the plurality of candidate data sets.
In fig. 6B, the candidate data sets satisfying the first condition are denoted as F candidate data sets (candidate data set 1, candidate data set 2, ..., candidate data set f, ..., candidate data set F), where f is the index of a candidate data set among those satisfying the first condition, f is not greater than F, and f and F are positive integers.
The above screening process corresponds to the branches removed at ② in fig. 6C.
S5442: and respectively training and testing the model for each candidate data set in the candidate data sets meeting the first condition to obtain third evaluation values respectively corresponding to each candidate data set in the candidate data sets meeting the first condition.
The third evaluation value is obtained by actually training and testing a model with the candidate data set, so its reliability is higher; the screening based on the third evaluation value is therefore more dependable.
Take the second candidate data set as an example, the second candidate data set being any one of the candidate data sets satisfying the first condition. The second candidate data set comprises a training data set and a test data set, and any sample in the training data set or the test data set comprises N3 data features and a label (also referred to as the true label), N3 being a positive integer.
The third evaluation value of the second candidate data set may be calculated as follows: the execution device trains a second machine learning model on the training data set; inputs the N3 data features of each sample in the test data set into the second machine learning model to obtain the predicted label of each sample in the test data set; and calculates the third evaluation value of the second candidate data set from the true label and predicted label of each sample.
The third evaluation value of the second candidate data set is obtained by statistically analyzing the differences between the true labels and the predicted labels of the samples in the test data set.
The third evaluation value may be expressed by one or more indexes, such as, but not limited to, the F1 score, mean average precision (MAP), AUC (area under the ROC curve), mean squared error (MSE), root mean squared error (RMSE), recall, and precision.
In another implementation of the embodiment of the present application, the second candidate data set may be divided into multiple shares (e.g., 4 shares), three serving as training data sets and one as the test data set. Three machine learning models are trained on the three training data sets respectively and each is tested on the test data, giving an evaluation value per model; the third evaluation value of the second candidate data set is then the average of the three models' evaluation values.
It should be appreciated that the second machine learning models trained on different candidate data sets differ, as do the third evaluation values obtained by testing them. In fig. 6B, the second machine learning model trained on the training data set of candidate data set f is denoted second machine learning model f.
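A minimal sketch of S5442 for one candidate data set, using a single train/test split and AUC as the index (the multi-share variant described above would average several such scores):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def third_evaluation(X, y) -> float:
    """Train a second machine learning model on the training share of the
    candidate data set, predict labels on the test share, and score the
    difference between true and predicted labels (here via AUC)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```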
S5443: and selecting a candidate data set with a third evaluation value meeting the second condition from the candidate data sets meeting the first condition as an n+1th group data set.
The screening of candidate data sets satisfying the second condition from those satisfying the first condition may be implemented as follows: the execution device selects the candidate data sets whose third evaluation values are greater than a third threshold; or the execution device sorts the third evaluation values from large to small and selects the candidate data sets corresponding to the first h third evaluation values. Similar to the first threshold in the first implementation, the third threshold may be a fixed value or may be obtained by statistically analyzing the third evaluation values of the candidate data sets, where h is a positive integer and h < g.
It should be appreciated that the second screening process based on the third evaluation value described above corresponds to the branch removed ③ in fig. 6C.
In another implementation of the embodiment of the present application, step S5441 may be omitted; in step S5442 the third evaluation value is calculated for all of the plurality of candidate data sets, and the (n+1)th group of data sets (which may also be referred to as the (n+1)th level nodes of the tree structure) is then screened out in step S5443.
It will be appreciated that where a candidate data set satisfies the second condition, that candidate data set also satisfies the first condition.
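Putting S5441 to S5443 together, the third implementation might look like the sketch below, where each candidate is assumed to expose its feature matrix X and label vector y, and third_evaluation is the sketch above:

```python
def select_next_group(candidates: list, first_evals: list, g: int, h: int) -> list:
    """Two-stage screening: keep the top-g candidates by first evaluation value
    (S5441), compute third evaluation values only for those (S5442), then keep
    the top-h (h < g) as the (n+1)th group of data sets (S5443)."""
    by_first = sorted(zip(candidates, first_evals), key=lambda p: p[1], reverse=True)[:g]
    scored = [(ds, third_evaluation(ds.X, ds.y)) for ds, _ in by_first]
    scored.sort(key=lambda p: p[1], reverse=True)
    return [ds for ds, _ in scored[:h]]
```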
Referring to the data processing system shown in fig. 7, the data processing system may be deployed in an execution device, which may consist of one or more servers, computers, and the like. The system 700 may include the following units:
a first acquisition unit 701 for acquiring a first set of data sets, the first set of data sets comprising a plurality of data features;
A transformation unit 702, configured to perform a multi-order feature transformation on a plurality of data features in the first set of data sets;
A first selection unit 704, configured to determine a target data set from a first set, where the first set includes a data set obtained by each order feature transformation in the multi-order feature transformation process;
wherein, the transforming unit 702 is specifically configured to: respectively carrying out feature transformation on each data set in an nth group of data sets to obtain a plurality of candidate data sets, wherein the nth group of data sets are sets of data sets obtained by the n-1 th order feature transformation, and n is an integer greater than 1;
The system 700 further comprises:
A first evaluation unit 703 for: respectively calculating a first evaluation value of each candidate data set in the plurality of candidate data sets, wherein the first evaluation value is used for evaluating the accuracy of a model obtained through the training of the candidate data sets;
A first filtering unit 705, configured to determine an n+1st group of data sets according to a first evaluation value of each candidate data set in the plurality of candidate data sets, where the number of data sets in the n+1st group of data sets is smaller than the number of the plurality of candidate data sets.
As a possible implementation, the first candidate data set is any one of the plurality of candidate data sets;
The system further comprises a meta-feature calculation unit 706 for: calculating meta-features of the first candidate data set according to the first candidate data set, wherein the meta-features are used for representing attributes of the first candidate data set;
The first evaluation unit 703 is specifically configured to: inputting the meta-feature into a first machine learning model to predict a second evaluation value of the first candidate data set, wherein the second evaluation value of the first candidate data set is used for evaluating the accuracy of the model obtained by training the first candidate data set; and determining a first evaluation value of the first candidate data set according to a second evaluation value of the first candidate data set.
As a possible implementation manner, the first candidate data set includes a plurality of data features and a tag, and the meta-feature calculating unit 706 is specifically configured to:
calculating first information according to the first candidate data set, the first information comprising at least one of: the data similarity and distribution similarity of every two data features among the plurality of data features of the first candidate data set; the data similarity and distribution similarity between each of the plurality of data features and the label; and the data distribution information of each of the plurality of data features and the data distribution information of the label;
And calculating the meta-characteristic of the first candidate data set according to the first information.
As a possible implementation manner, the meta features of the first candidate data set include: at least one of a basic feature of the first candidate data set, a feature of a continuous data feature of a plurality of data features of the first candidate data set, a feature of a discrete data feature of a plurality of data features of the first candidate data set, a feature of the tag, a feature of data similarity, a feature of distribution similarity, and a feature of distribution information of the data feature.
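For illustration, the first information and a fixed-length meta-feature vector could be computed as below; Pearson correlation and the Kolmogorov-Smirnov statistic are assumed stand-ins for the data similarity and distribution similarity measures, which the application does not fix here:

```python
import numpy as np
from scipy.stats import ks_2samp, pearsonr

def first_information(X: np.ndarray, y: np.ndarray):
    """Compute the 'first information': pairwise feature similarity,
    feature-label similarity, feature-label distribution similarity, and
    per-feature distribution statistics. Assumes at least two features."""
    n_feat = X.shape[1]
    feat_sim = [abs(pearsonr(X[:, a], X[:, b])[0])
                for a in range(n_feat) for b in range(a + 1, n_feat)]
    label_sim = [abs(pearsonr(X[:, a], y)[0]) for a in range(n_feat)]
    dist_sim = [ks_2samp(X[:, a], y).statistic for a in range(n_feat)]
    dist_info = [(X[:, a].mean(), X[:, a].std(), float(np.median(X[:, a])))
                 for a in range(n_feat)]
    return feat_sim, label_sim, dist_sim, dist_info

def meta_feature_vector(info) -> np.ndarray:
    """Aggregate the variable-length first information into a fixed-length
    meta-feature vector (mean and standard deviation of each component)."""
    feat_sim, label_sim, dist_sim, dist_info = info
    parts = [feat_sim, label_sim, dist_sim, [v for t in dist_info for v in t]]
    return np.array([s for p in parts for s in (np.mean(p), np.std(p))])
```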
As a possible implementation manner, the first candidate data set is a first data set obtained through a first feature transformation, the first data set is one data set in the nth group of data sets, and the first evaluation value of the first candidate data set is the sum of a first data item and a second data item; wherein the first data item is positively correlated with a second evaluation value of the first candidate data set, the second data item being determined by a historical gain number of the first feature transformation.
As a possible implementation manner, the first screening unit 705 is further configured to: selecting a candidate data set of which a first evaluation value satisfies a first condition among the plurality of candidate data sets;
The system further comprises a second evaluation unit 707 for: respectively training and testing a model for each candidate data set in the candidate data sets meeting the first condition to obtain third evaluation values respectively corresponding to each candidate data set in the candidate data sets meeting the first condition;
The first screening unit 705 is further configured to: and selecting the candidate data set with the third evaluation value meeting the second condition from the candidate data sets meeting the first condition as the n+1th group data set.
As a possible implementation manner, the second candidate data set is any one of the candidate data sets meeting the first condition; the second candidate data set comprises a training data set and a test data set, and any sample in the training data set or the test data set comprises a plurality of data features and a label. The second evaluation unit 707 is specifically configured to:
Training a second machine learning model according to the training dataset;
Inputting a plurality of data features of each sample in the test data set into the second machine learning model to obtain a predictive label of each sample in the test data set;
And calculating a third evaluation value of the second candidate data set according to the label of each sample in the test data set and the predicted label.
As a possible implementation, the system 700 further includes:
a second obtaining unit 708 configured to obtain a plurality of first samples, any one of the plurality of first samples including meta features of a second data set and an evaluation value of the second data set;
A first training unit 709 for training the first machine learning model according to the plurality of first samples.
As a possible implementation manner, the system further comprises:
A third evaluation unit 710, configured to: before the transformation unit 702 performs feature transformation on each data set in the nth group of data sets to obtain a plurality of candidate data sets, input the meta-features of a third data set into the third machine learning model and predict a fourth evaluation value, where the fourth evaluation value is used to evaluate the accuracy of a model trained on the candidate data set obtained by applying a second feature transformation to the third data set, the third data set is any data set in the nth group of data sets, the second feature transformation is any one of B feature transformations, and B is a positive integer;
a second filtering unit 711, configured to select, from the B feature transformations, the feature transformations corresponding to fourth evaluation values satisfying the fourth condition as the A feature transformations, A being a positive integer not greater than B;
the transformation unit 702 is specifically configured to: perform the A feature transformations on the third data set to obtain A candidate data sets.
As a possible implementation, the system 700 further includes:
a second obtaining unit 712, configured to obtain a plurality of second samples, where any one of the plurality of second samples includes a meta feature of a fourth data set and a difference between an evaluation value of the data set after a second feature transformation of the fourth data set and an evaluation value of the fourth data set, and the second feature transformation is any one of the B feature transformations;
A second training unit 713 for training the third machine learning model based on the plurality of second samples.
Note that the first acquisition unit 701, the transformation unit 702, the first evaluation unit 703, the first selection unit 704, the first screening unit 705, the meta-feature calculation unit 706, the second evaluation unit 707, the third evaluation unit 710, and the second screening unit 711 may be provided on the execution device side. The second obtaining unit 708, the first training unit 709, the second obtaining unit 712, and the second training unit 713 may be provided on the training device side.
It should be further noted that, each apparatus in the above system may further include other units, and specific implementations of each device and unit may refer to related descriptions in the above method embodiments, which are not repeated herein.
As shown in fig. 8, the execution device 800 may include a processor 801, a memory 802, a communication bus 803, and a communication interface 804, where the processor 801 is connected to the memory 802 and the communication interface 804 via the communication bus 803.
The processor 801 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor 801 may be any conventional processor.
The processor 801 may also be an integrated circuit chip with signal processing capability. In implementation, the steps of the data processing method of the present application may be completed by integrated logic circuits of hardware in the processor 801 or by instructions in the form of software. The methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, or a register. The storage medium is located in the memory 802; the processor 801 reads the information in the memory 802 and, in combination with its hardware, implements the functions required of the preprocessing module 212, the feature transformation module 213, the data set determining module 214, and the training module 215 included in the cloud platform 210 of the embodiment of the present application, or executes the data processing method of the method embodiments of the present application.
The memory 802 may be a read-only memory (ROM), a random access memory (RAM), or another memory. In the embodiment of the present application, the memory 802 is used to store data and various software programs, such as the original data set and each group of data sets in the embodiment of the present application, and a program implementing the data processing method of the embodiment of the present application.
The communication interface 804 uses a transceiver apparatus, such as, but not limited to, a transceiver, to enable communication between the execution device 800 and other devices or communication networks. For example, the original data set, the first group of data sets, and the like may be acquired via the communication interface 804 to enable information interaction with a training device, a client device, a user device, or a terminal device.
Optionally, the execution device may further include an artificial intelligence processor 805, which may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale exclusive-or operation processing. The artificial intelligence processor 805 may be mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The artificial intelligence processor 805 can implement one or more of the operations involved in the data processing method above. For example, taking the NPU as an example, the core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract the matrix data in the memory 802 and perform multiply-add operations.
The processor 801 is configured to call the data and the program code in the memory, and execute:
Obtaining a first set of data sets, the first set of data sets comprising a plurality of data features;
performing a multi-order feature transformation on the plurality of data features in the first set of data sets;
determining a target data set from a first set, wherein the first set comprises a data set obtained by each order of feature transformation in the process of the multi-order feature transformation;
wherein said performing a multi-order feature transformation on a plurality of data features in said first set of data sets comprises:
Respectively carrying out feature transformation on the data features in each data set in an nth group of data sets to obtain a plurality of candidate data sets, wherein the nth group of data sets are data sets obtained by carrying out n-1 order feature transformation on the first data set, and n is an integer greater than 1;
Calculating a first evaluation value for each of the plurality of candidate data sets; the first evaluation value is used for evaluating the accuracy of a model obtained through training of the candidate data set;
and determining an n+1st group of data sets according to the first evaluation value of each candidate data set in the plurality of candidate data sets, wherein the number of the data sets in the n+1st group of data sets is smaller than the number of the plurality of candidate data sets.
The first set of data sets may be obtained by receiving, through the communication interface 804, an original data set sent by the client device, and further preprocessing the original data set to obtain the first set of data sets.
After the execution device 800 obtains the target data set, the target feature transformation algorithm used to transform the first group of data sets into the target data set may be obtained, and a target machine learning model may be obtained by building a machine learning model and training it on the target data set. Further, the execution device 800 sends the target feature transformation algorithm and the target machine learning model to the client device via the communication interface 804.
As a possible implementation manner, the first candidate data set is any one of the plurality of candidate data sets, and the processor 801 performs the calculating of the first evaluation value of each of the plurality of candidate data sets, respectively, including performing:
calculating meta-features of the first candidate data set according to the first candidate data set, wherein the meta-features are used for representing attributes of the first candidate data set;
Inputting the meta-feature into a first machine learning model to predict a second evaluation value of the first candidate data set, wherein the second evaluation value of the first candidate data set is used for evaluating the accuracy of the model obtained by training the first candidate data set;
A first evaluation value of the first candidate data set is determined from a second evaluation value of the first candidate data set.
As a possible implementation manner, the first candidate data set includes a plurality of data features and a tag, and the calculating the meta feature of the first candidate data set according to the first candidate data set specifically includes:
Calculating first information according to the first candidate data set, wherein the first information comprises data similarity and distribution similarity of every two data features in a plurality of data features of the first candidate data set, data similarity and distribution similarity of each data feature in the plurality of data features of the first candidate data set and a label, and at least one of data distribution information of each data feature in the plurality of data features of the first candidate data set and data distribution information of the label;
And calculating the meta-characteristic of the first candidate data set according to the first information.
As a possible implementation manner, the meta features of the first candidate data set include: at least one of a basic feature of the first candidate data set, a feature of a continuous data feature of a plurality of data features of the first candidate data set, a feature of a discrete data feature of a plurality of data features of the first candidate data set, a feature of the tag, a feature of data similarity, a feature of distribution similarity, and a feature of distribution information of the data feature.
As a possible implementation manner, the first candidate data set is obtained by a first feature transformation for a first data set, where the first data set is one data set in the nth group of data sets, and the processor 801 performs the determining the first evaluation value of the first candidate data set according to the second evaluation value of the first candidate data set, and specifically includes performing:
the first evaluation value of the first candidate data set is the sum of the first data item and the second data item; wherein the first data item is positively correlated with a second evaluation value of the first candidate data set, the second data item being determined by a historical gain number of the first feature transformation.
As a possible implementation manner, the processor 801 performs the determining the n+1st group of data sets according to the first evaluation values in the plurality of candidate data sets, specifically includes performing:
selecting a candidate data set of which a first evaluation value satisfies a first condition among the plurality of candidate data sets;
Respectively training and testing a model for each candidate data set in the candidate data sets meeting the first condition to obtain third evaluation values respectively corresponding to each candidate data set in the candidate data sets meeting the first condition;
and selecting the candidate data set with the third evaluation value meeting the second condition from the candidate data sets meeting the first condition as the n+1th group data set.
As a possible implementation manner, the second candidate data set is any one candidate data set in candidate data sets meeting the first condition, the second candidate data set comprises a training data set and a test data set, and any one sample in the training data set and the test data set comprises a plurality of data features and a label; the processor 801 performs the training and testing of the model on each candidate data set in the candidate data sets satisfying the first condition, to obtain a third evaluation value corresponding to each candidate data set in the candidate data sets satisfying the first condition, including performing:
Training a second machine learning model according to the training dataset;
Inputting a plurality of data features of each sample in the test data set into the second machine learning model to obtain a predictive label of each sample in the test data set;
And calculating a third evaluation value of the second candidate data set according to the label of each sample in the test data set and the predicted label.
As a possible implementation manner, the processor 801 is further configured to perform steps further including:
acquiring a plurality of first samples, wherein any one of the plurality of first samples comprises meta features of a second data set and evaluation values of the second data set;
the first machine learning model is trained from the plurality of first samples.
As a possible implementation manner, before performing feature transformation on each data set in the nth group of data sets to obtain the plurality of candidate data sets, the processor 801 is further configured to perform:
inputting meta features of a third data set into a third machine learning model, and predicting to obtain a fourth evaluation value, wherein the fourth evaluation value is used for evaluating the accuracy of a model obtained by training a candidate data set obtained by carrying out second feature transformation on the third data set, the third data set is any one data set in the nth group of data sets, the second feature transformation is any one feature transformation in B feature transformations, and B is a positive integer;
selecting a feature transformation corresponding to a fourth evaluation value meeting a fourth condition from the B feature transformations to be the A feature transformations, wherein A is a positive integer not greater than B;
the feature transformation is performed on each data set in the nth group of data sets to obtain a plurality of candidate data sets, including: and carrying out A feature transformation on the third data set to obtain A candidate data sets.
As a possible implementation, the processor 801 is further configured to perform:
Obtaining a plurality of second samples, wherein any one of the plurality of second samples comprises meta-features of a fourth data set and differences between evaluation values of the data set after second feature transformation of the fourth data set and evaluation values of the fourth data set, and the second feature transformation is any one of the B feature transformations;
training the third machine learning model based on the plurality of second samples.
It should be understood that the implementation of each device may also correspondingly refer to the corresponding description in the above method embodiment, and the embodiment of the present application is not repeated.
It is appreciated that the various units in the data processing system 700 may correspond to the processor 801.
As shown in fig. 9, a training device provided in an embodiment of the present application may include a processor 901, a memory 902, a communication bus 903, and a communication interface 904, where the processor 901 is connected to the memory 902 and the communication interface 904 via the communication bus 903.
The processor 901 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), a neural-network processing unit (NPU), or one or more integrated circuits, and is configured to execute a related program to perform the training method of the first machine learning model of the method embodiments of the present application.
The processor 901 may also be an integrated circuit chip with signal processing capability. In implementation, the steps of the training method of the first machine learning model of the present application may be completed by integrated logic circuits of hardware in the processor 901 or by instructions in the form of software. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, or a register. The storage medium is located in the memory 902; the processor 901 reads the information in the memory 902 and, in combination with its hardware, performs the training method of the first machine learning model of the method embodiments of the present application.
The memory 902 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 902 may store programs and data, such as the plurality of first samples in the embodiment of the present application and a program implementing the training method of the first machine learning model. When the program stored in the memory 902 is executed by the processor 901, the processor 901 and the communication interface 904 are configured to perform the steps of the training method of the first machine learning model of the embodiment of the present application.
The communication interface 904 enables communication between the training device 900 and other devices or communication networks using a transceiver means such as, but not limited to, a transceiver. For example, a plurality of first samples may be acquired through the communication interface 904 to enable information interaction with an execution device, a client device, a user device, or a terminal device, etc.
Optionally, the training device may further include an artificial intelligence processor 905, which may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale exclusive-or operation processing. The artificial intelligence processor 905 may be mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The artificial intelligence processor 905 can implement one or more of the operations involved in the training method of the first machine learning model above. For example, taking the NPU as an example, the core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract the matrix data in the memory 902 and perform multiply-add operations.
The processor 901 is configured to call the data and the program code in the memory, and execute:
acquiring a plurality of first samples, wherein any one of the plurality of first samples comprises meta features of a second data set and evaluation values of the second data set;
the first machine learning model is trained from the plurality of first samples.
Optionally, the method for calculating the meta-feature is the same as the method for calculating the meta-feature of the first candidate data set in the first aspect, which may be referred to as related description in the first aspect, and embodiments of the present application are not repeated.
It should be understood that the implementation of each device may also correspond to the corresponding description in the training method embodiment with reference to the first machine learning model, which is not repeated in the embodiments of the present application.
As shown in fig. 10, an embodiment of the present application provides a training device, which may include a processor 1001, a memory 1002, a communication bus 1003, and a communication interface 1004, where the processor 1001 connects the memory 1002 and the communication interface 1003 through the communication bus.
The processor 1001 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute a related program to implement the functions required by the units in the training apparatus of the third machine learning model of the present application or to perform the training method of the third machine learning model of the present application.
The processor 1001 may also be an integrated circuit chip with signal processing capability. In implementation, the steps of the training method of the third machine learning model of the present application may be completed by integrated logic circuits of hardware in the processor 1001 or by instructions in the form of software. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1002; the processor 1001 reads the information in the memory 1002 and, in combination with its hardware, performs the training method of the third machine learning model of the method embodiments of the present application.
The memory 1002 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1002 may store programs and data, such as the plurality of second samples or third samples in the embodiment of the present application and a program implementing the training method of the third machine learning model. When the program stored in the memory 1002 is executed by the processor 1001, the processor 1001 and the communication interface 1004 are configured to perform the steps of the training method of the third machine learning model of the embodiment of the present application.
The plurality of second samples or third samples may be received through the communication interface 1004 to enable information interaction with an execution device, a client device, a user device, a terminal device, or the like.
Optionally, the training device may further include an artificial intelligence processor 1005, which may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale exclusive-or operation processing. The artificial intelligence processor 1005 may be mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The artificial intelligence processor 1005 can implement one or more of the operations involved in the training method of the third machine learning model above. For example, taking the NPU as an example, the core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract the matrix data in the memory 1002 and perform multiply-add operations.
The processor 1001 is configured to call data and program codes in the memory, and execute:
obtaining a plurality of second samples, wherein any one of the plurality of second samples comprises meta-features of a fourth data set and differences between evaluation values of the data set after second feature transformation of the fourth data set and evaluation values of the fourth data set, and the second feature transformation is any one of the B feature transformations; training the third machine learning model based on the plurality of second samples.
Or performs:
obtaining a plurality of third samples, where any one of the plurality of third samples comprises meta-features of a fourth data set and the fourth evaluation value of the data set obtained after the fourth data set undergoes the second feature transformation; and training the third machine learning model according to the plurality of third samples.
Optionally, the method for calculating the meta-features of the fourth data set is the same as the method for calculating the meta-features of the first candidate data set in the first aspect; reference may be made to the related description in the first aspect, and details are not repeated in the embodiments of the present application.
It should be understood that, for the implementation of each device, reference may also be made to the corresponding description in the embodiment of the training method of the third machine learning model, and details are not repeated in the embodiments of the present application.
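As an illustration of the training procedure described above, the following is a minimal Python sketch of building the second samples and fitting a model; the scikit-learn regressor, the compute_meta_features and evaluate callables, and the encoding of the tried transformation as an extra input feature are all assumptions made for the example, not the patented implementation:

    from sklearn.ensemble import GradientBoostingRegressor

    def build_second_samples(fourth_data_sets, transforms, compute_meta_features, evaluate):
        # Each second sample pairs the meta-features of a fourth data set (plus an
        # identifier of the transformation tried) with the gain in evaluation value
        # that the transformation produced on that data set.
        X, y = [], []
        for ds in fourth_data_sets:
            base_score = evaluate(ds)  # evaluation value of the fourth data set
            for t_id, t in enumerate(transforms):  # the B feature transformations
                X.append(compute_meta_features(ds) + [t_id])
                y.append(evaluate(t(ds)) - base_score)  # difference of evaluation values
        return X, y

    def train_third_model(X, y):
        # A gradient-boosting regressor standing in for the third machine learning model.
        model = GradientBoostingRegressor()
        model.fit(X, y)
        return model

For the third-sample variant, y would hold the evaluation value of the transformed data set itself rather than the difference.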
The following describes a chip hardware structure provided by the embodiment of the application.
Fig. 11 shows a chip hardware structure provided by an embodiment of the present application, where the chip includes an artificial intelligence processor 110. The chip may be provided in the execution device 120 shown in fig. 1 or the execution device 700 shown in fig. 7, to perform part or all of the data processing work of the execution device. The chip may also be provided in the training device 110 shown in fig. 1, the execution device 800 shown in fig. 8, or the training devices 900 and 1000 shown in figs. 9 and 10, to complete the training work of the training device and output the first machine learning model or the third machine learning model.
The artificial intelligence processor 110 may be an NPU, a TPU, a GPU, or any other processor suitable for large-scale exclusive-or operation processing. Taking an NPU as an example: the NPU may be mounted, as a coprocessor, to a host CPU (Host CPU), which assigns tasks to it. The core part of the NPU is an arithmetic circuit 1103, and a controller 1104 controls the arithmetic circuit 1103 to extract matrix data from memory and perform multiply-add operations.
In some implementations, the arithmetic circuit 1103 includes a plurality of processing units (PEs) inside. In some implementations, the arithmetic circuit 1103 is a two-dimensional systolic array. The arithmetic circuit 1103 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1103 is a general purpose matrix processor.
For example, assume that there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 1103 fetches the weight data of the matrix B from the weight memory 1102 and buffers it on each PE in the arithmetic circuit 1103. The arithmetic circuit 1103 fetches the input data of the matrix A from the input memory 1101, performs a matrix operation based on the input data of the matrix A and the weight data of the matrix B, and saves the obtained partial or final result of the matrix in an accumulator (accumulator) 1108.
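As a functional illustration only, the following Python sketch reproduces the multiply-add behavior described above (C = A × B with per-element accumulation); it models the role of the accumulator 1108 but not the systolic dataflow or the on-chip memories:

    def matrix_multiply_add(A, B):
        # C[i][j] = sum over k of A[i][k] * B[k][j]
        rows, inner, cols = len(A), len(B), len(B[0])
        C = [[0.0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                acc = 0.0  # plays the role of the accumulator: partial sums build up here
                for k in range(inner):
                    acc += A[i][k] * B[k][j]  # one multiply-add, as performed by a PE
                C[i][j] = acc  # final result written out
        return C

    # Example: a 2x3 input matrix and a 3x2 weight matrix yield a 2x2 output.
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(matrix_multiply_add(A, B))  # [[4.0, 5.0], [10.0, 11.0]]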
The unified memory 1106 is used for storing input data and output data. The weight data is transferred directly to the weight memory 1102 through a direct memory access controller (Direct Memory Access Controller, DMAC) 1105. The input data is also carried into the unified memory 1106 through the DMAC.
A bus interface unit (Bus Interface Unit, BIU) 1110 is used for interaction between the DMAC and the instruction fetch memory (Instruction Fetch Buffer) 1109. The bus interface unit 1110 is further used by the instruction fetch memory 1109 to obtain instructions from an external memory, and is also used by the direct memory access controller 1105 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data from the external memory (for example, DDR) to the unified memory 1106, to transfer weight data to the weight memory 1102, or to transfer input data to the input memory 1101.
The vector calculation unit 1107 further processes the output of the arithmetic circuit 1103 as needed, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, or magnitude comparison. The vector calculation unit 1107 is mainly used for the computation of non-convolutional layers or fully connected layers (Fully Connected layers, FC) in a neural network, and can specifically handle pooling (pooling), normalization, and the like. For example, the vector calculation unit 1107 may apply a nonlinear function to the output of the arithmetic circuit 1103, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 1107 generates normalized values, combined values, or both.
In some implementations, the vector calculation unit 1107 stores the processed vectors in the unified memory 1106. In some implementations, the vectors processed by the vector calculation unit 1107 can be used as activation inputs to the arithmetic circuit 1103, for example for use in subsequent layers in a neural network.
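For illustration, a minimal Python sketch of the kind of post-processing attributed to the vector calculation unit 1107; the choice of ReLU activation, a pooling window of 2, and L2 normalization are assumptions made for the example:

    import math

    def relu(values):
        # Nonlinear activation applied to a vector of accumulated values
        return [max(0.0, v) for v in values]

    def max_pool(values, window=2):
        # Simple 1-D max pooling over non-overlapping windows
        return [max(values[i:i + window]) for i in range(0, len(values), window)]

    def normalize(values):
        # L2 normalization of the vector
        norm = math.sqrt(sum(v * v for v in values)) or 1.0
        return [v / norm for v in values]

    acc_output = [4.0, -5.0, 10.0, 11.0]  # e.g., values drained from the accumulator
    print(max_pool(relu(acc_output)))     # [4.0, 11.0]
    print(normalize([3.0, 4.0]))          # [0.6, 0.8]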
An instruction fetch memory (instruction fetch buffer) 1109 is connected to the controller 1104 and is used for storing instructions used by the controller 1104.
The unified memory 1106, the input memory 1101, the weight memory 1102, and the instruction fetch memory 1109 are all on-chip memories. The external memory is independent of the NPU hardware architecture.
When the first machine learning model, the second machine learning model, or the third machine learning model is a neural network, the operations of the layers in the neural network may be performed by the operation circuit 1103 or the vector calculation unit 1107.
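Building on the two sketches above (it reuses matrix_multiply_add and relu), the following purely illustrative snippet shows how one fully connected layer might be split between the arithmetic circuit 1103 (matrix multiply-add) and the vector calculation unit 1107 (activation):

    def fully_connected_layer(inputs, weights):
        # Arithmetic circuit part: matrix multiply-add producing accumulated values
        pre_activation = matrix_multiply_add(inputs, weights)
        # Vector calculation unit part: nonlinear activation applied row by row
        return [relu(row) for row in pre_activation]

    hidden = fully_connected_layer([[1.0, -2.0]], [[0.5, -0.5], [-1.0, 1.0]])
    print(hidden)  # [[2.5, 0.0]]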
It should be noted that, although the execution device 800 shown in fig. 8 and the training devices 900 and 1000 shown in figs. 9 and 10 show only a memory, a processor, and a communication interface, those skilled in the art will understand that, in a specific implementation, the execution device 800 and the training devices 900 and 1000 also include other components necessary for normal operation. Likewise, those skilled in the art will appreciate that the execution device 800 and the training devices 900 and 1000 may also include hardware components that perform other additional functions, as needed. Furthermore, those skilled in the art will appreciate that the execution device 800 and the training devices 900 and 1000 may include only the components necessary to implement the embodiments of the present application, rather than all of the components shown in fig. 8, 9, or 10; for example, the communication interfaces and communication buses are not indispensable components of the devices shown in fig. 8, 9, or 10, and these devices may not include communication interfaces and/or communication buses.
It will be appreciated by those of ordinary skill in the art that the various exemplary elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Those of skill in the art will appreciate that the functions described in connection with the various illustrative logical blocks, modules, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware, software, firmware, or any combination thereof. If implemented in software, the functions described by the various illustrative logical blocks, modules, and steps may be stored on a computer readable medium or transmitted as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media corresponding to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (e.g., according to a communication protocol). In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part thereof contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
The foregoing is merely a specific implementation of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (22)

1. A method of data processing, comprising:
an execution device receives a request sent from a client device, the request including an original dataset, the request for requesting the execution device to generate a machine learning model based on the original dataset;
the execution device preprocesses the original data set to obtain a first group of data sets, wherein the first group of data sets comprises a plurality of data features; the plurality of data features comprise information of a user, package usage information of the user at a first time, features of OTT service data flows, traffic of a cell in a plurality of consecutive time periods, or network usage information of the user at the first time;
the execution device performs a multi-order feature transformation on the plurality of data features in the first set of data sets;
The execution device determines a target data set from a first set, wherein the first set comprises a data set obtained by each step of feature transformation in the multi-step feature transformation process;
the execution device trains a machine learning model based on the target data set to obtain a target machine learning model;
the execution device sends the target machine learning model to the client device;
wherein said performing a multi-order feature transformation on a plurality of data features in said first set of data sets comprises:
the execution device respectively performs feature transformation on the data features in each data set in an nth group of data sets to obtain a plurality of candidate data sets, wherein the nth group of data sets is obtained by performing n-1 orders of feature transformation on the first group of data sets, and n is an integer greater than 1;
The execution device calculates a first evaluation value for each of the plurality of candidate data sets; the first evaluation value is used for evaluating the accuracy of a model obtained through training of the candidate data set;
the execution device determines an n+1st group of data sets according to a first evaluation value of each candidate data set in the plurality of candidate data sets, wherein the number of data sets in the n+1st group of data sets is smaller than the number of the plurality of candidate data sets.
2. The method of claim 1, wherein the first candidate data set is any one of the plurality of candidate data sets, and wherein the calculating the first evaluation value for each of the plurality of candidate data sets comprises:
The execution device calculates a meta-feature of the first candidate data set, the meta-feature being used to represent an attribute of the first candidate data set;
The execution device inputs the meta-feature into a first machine learning model to predict a second evaluation value of the first candidate data set, wherein the second evaluation value of the first candidate data set is used for evaluating the accuracy of the model obtained by training the first candidate data set;
the execution device determines a first evaluation value of the first candidate data set based on a second evaluation value of the first candidate data set.
3. The method according to claim 2, wherein the first candidate data set comprises a plurality of data features and a label, and wherein the calculating the meta-features of the first candidate data set according to the first candidate data set comprises:
the execution device calculates first information according to the first candidate data set, wherein the first information comprises at least one of: data similarity and distribution similarity of every two data features in the plurality of data features of the first candidate data set; data similarity and distribution similarity between each data feature in the plurality of data features of the first candidate data set and the label; and data distribution information of each data feature in the plurality of data features of the first candidate data set and data distribution information of the label;
The execution device calculates meta-features of the first candidate data set from the first information.
4. The method according to claim 3, wherein the meta-features of the first candidate data set comprise at least one of: a basic feature of the first candidate data set, a feature of continuous data features in the plurality of data features of the first candidate data set, a feature of discrete data features in the plurality of data features of the first candidate data set, a feature of the label, a feature of the data similarity, a feature of the distribution similarity, and a feature of the distribution information of the data features.
5. The method according to any one of claims 2-4, wherein the first candidate data set is a data set obtained by performing a first feature transformation on a first data set, the first data set is one data set in the nth group of data sets, and the determining the first evaluation value of the first candidate data set according to the second evaluation value of the first candidate data set specifically comprises:
the first evaluation value of the first candidate data set is a sum of a first data item and a second data item; wherein the first data item is positively correlated with the second evaluation value of the first candidate data set, and the second data item is determined by a historical gain number of the first feature transformation.
6. The method according to any one of claims 1-4, wherein the determining an n+1st group of data sets according to the first evaluation value of each candidate data set in the plurality of candidate data sets specifically comprises:
the execution device selects, from the plurality of candidate data sets, candidate data sets whose first evaluation values satisfy a first condition;
the execution device respectively performs model training and testing on each candidate data set in the candidate data sets meeting the first condition to obtain third evaluation values respectively corresponding to each candidate data set in the candidate data sets meeting the first condition;
the execution device selects, from the candidate data sets satisfying the first condition, candidate data sets whose third evaluation values satisfy a second condition as the n+1st group of data sets.
7. The method of claim 6, wherein the second candidate data set is any one of the candidate data sets that satisfy the first condition, the second candidate data set comprises a training data set and a test data set, and any sample in the training data set and the test data set comprises a plurality of data features and a label; and the respectively performing model training and testing on each candidate data set in the candidate data sets meeting the first condition to obtain third evaluation values respectively corresponding to each candidate data set in the candidate data sets meeting the first condition comprises:
the execution device trains a second machine learning model according to the training data set;
the execution device inputs a plurality of data features of each sample in the test data set to the second machine learning model to obtain a prediction label of each sample in the test data set;
The execution device calculates a third evaluation value of the second candidate data set according to the label of each sample in the test data set and the predicted label.
8. The method according to any one of claims 2-4, further comprising:
the execution device acquires a plurality of first samples, wherein any one of the plurality of first samples comprises meta-features of a second data set and an evaluation value of the second data set;
the execution device trains the first machine learning model according to the plurality of first samples.
9. The method according to any one of claims 1-4 and 7, wherein before the respectively performing feature transformation on each data set in the nth group of data sets to obtain a plurality of candidate data sets, the method further comprises:
the execution device inputs meta-features of a third data set into a third machine learning model to predict a fourth evaluation value, wherein the fourth evaluation value is used for evaluating the accuracy of a model obtained by training a candidate data set obtained by performing a second feature transformation on the third data set, the third data set is any one data set in the nth group of data sets, the second feature transformation is any one of B feature transformations, and B is a positive integer;
the execution device selects, from the B feature transformations, A feature transformations corresponding to fourth evaluation values meeting a fourth condition, wherein A is a positive integer not greater than B;
the respectively performing feature transformation on each data set in the nth group of data sets to obtain a plurality of candidate data sets comprises: performing the A feature transformations on the third data set to obtain A candidate data sets.
10. The method according to claim 9, wherein the method further comprises:
the execution device acquires a plurality of second samples, wherein any one of the plurality of second samples comprises meta-features of a fourth data set and a difference between an evaluation value of the data set obtained after a second feature transformation of the fourth data set and an evaluation value of the fourth data set, and the second feature transformation is any one of the B feature transformations;
The execution device trains the third machine learning model according to the plurality of second samples.
11. A data processing system, comprising:
a first acquisition unit, configured to: receive a request sent from a client device, wherein the request includes an original data set and is used to request an execution device to generate a machine learning model based on the original data set; and preprocess the original data set to obtain a first group of data sets, wherein the first group of data sets comprises a plurality of data features, the plurality of data features comprising information of a user, package usage information of the user at a first time, features of OTT service data flows, traffic of a cell in a plurality of consecutive time periods, or network usage information of the user at the first time;
a transformation unit for performing a multi-order feature transformation on a plurality of data features in the first set of data sets;
a first selection unit, configured to determine a target data set from a first set, where the first set includes a data set obtained by each order of feature transformation in the multi-order feature transformation process;
a processing unit, configured to train a machine learning model based on the target data set to obtain a target machine learning model;
A transmitting unit configured to transmit the target machine learning model to the client device;
wherein the transformation unit is specifically configured to: respectively perform feature transformation on each data set in an nth group of data sets to obtain a plurality of candidate data sets, wherein the nth group of data sets is obtained by performing n-1 orders of feature transformation on the first group of data sets, and n is an integer greater than 1;
the system further comprises:
A first evaluation unit configured to calculate a first evaluation value of each of the plurality of candidate data sets; the first evaluation value is used for evaluating the accuracy of a model obtained through training of the candidate data set;
and a first screening unit, configured to determine an n+1st group of data sets according to the first evaluation value of each candidate data set in the plurality of candidate data sets, wherein the number of data sets in the n+1st group of data sets is smaller than the number of the plurality of candidate data sets.
12. The system of claim 11, wherein the first candidate data set is any one of the plurality of candidate data sets;
the system further comprises a meta-feature calculation unit for: calculating meta-features of the first candidate data set according to the first candidate data set, wherein the meta-features are used for representing attributes of the first candidate data set;
The first evaluation unit is specifically configured to: inputting the meta-feature into a first machine learning model to predict a second evaluation value of the first candidate data set, wherein the second evaluation value of the first candidate data set is used for evaluating the accuracy of the model obtained by training the first candidate data set; and determining a first evaluation value of the first candidate data set according to a second evaluation value of the first candidate data set.
13. The system according to claim 12, wherein the first candidate data set comprises a plurality of data features and a label, and the meta-feature calculation unit is specifically configured to:
calculate first information according to the first candidate data set, wherein the first information comprises at least one of: data similarity and distribution similarity of every two data features in the plurality of data features of the first candidate data set; data similarity and distribution similarity between each data feature in the plurality of data features of the first candidate data set and the label; and data distribution information of each data feature in the plurality of data features of the first candidate data set and data distribution information of the label;
and calculate the meta-features of the first candidate data set according to the first information.
14. The system of claim 13, wherein the meta-features of the first candidate data set comprise at least one of: a basic feature of the first candidate data set, a feature of continuous data features in the plurality of data features of the first candidate data set, a feature of discrete data features in the plurality of data features of the first candidate data set, a feature of the label, a feature of the data similarity, a feature of the distribution similarity, and a feature of the distribution information of the data features.
15. The system according to any one of claims 12-14, wherein the first candidate data set is a data set obtained by performing a first feature transformation on a first data set, the first data set is one data set in the nth group of data sets, and the first evaluation value of the first candidate data set is a sum of a first data item and a second data item; wherein the first data item is positively correlated with the second evaluation value of the first candidate data set, and the second data item is determined by a historical gain number of the first feature transformation.
16. The system of any one of claims 11-14, wherein,
the first screening unit is further configured to: select, from the plurality of candidate data sets, candidate data sets whose first evaluation values satisfy a first condition;
the system further comprises a second evaluation unit, configured to: respectively perform model training and testing on each candidate data set in the candidate data sets meeting the first condition to obtain third evaluation values respectively corresponding to each candidate data set in the candidate data sets meeting the first condition;
the first screening unit is further configured to: select, from the candidate data sets meeting the first condition, candidate data sets whose third evaluation values meet the second condition as the n+1st group of data sets.
17. The system of claim 16, wherein the second candidate data set is any one of the candidate data sets that satisfy the first condition, the second candidate data set comprises a training data set and a test data set, and any sample in the training data set and the test data set comprises a plurality of data features and a label; the second evaluation unit is specifically configured to:
train a second machine learning model according to the training data set;
input a plurality of data features of each sample in the test data set into the second machine learning model to obtain a prediction label of each sample in the test data set;
and calculate a third evaluation value of the second candidate data set according to the label of each sample in the test data set and the prediction label.
18. The system according to any one of claims 12-14, wherein the system further comprises:
a second acquisition unit, configured to acquire a plurality of first samples, any one of the plurality of first samples comprising meta-features of a second data set and an evaluation value of the second data set;
and the first training unit is used for training the first machine learning model according to the first samples.
19. The system according to any one of claims 11-14 and 17, wherein the system further comprises:
a third evaluation unit, configured to: before the transformation unit respectively performs feature transformation on each data set in the nth group of data sets to obtain a plurality of candidate data sets, input meta-features of a third data set into a third machine learning model to predict a fourth evaluation value, wherein the fourth evaluation value is used for evaluating the accuracy of a model obtained by training a candidate data set obtained by performing a second feature transformation on the third data set, the third data set is any one data set in the nth group of data sets, the second feature transformation is any one of B feature transformations, and B is a positive integer;
a second screening unit, configured to select, from the B feature transformations, A feature transformations corresponding to fourth evaluation values meeting a fourth condition, wherein A is a positive integer not greater than B;
the transformation unit is specifically configured to: perform the A feature transformations on the third data set to obtain A candidate data sets.
20. The system of claim 19, wherein the system further comprises:
a third obtaining unit, configured to obtain a plurality of second samples, wherein any one of the plurality of second samples comprises meta-features of a fourth data set and a difference between an evaluation value of the data set obtained after a second feature transformation of the fourth data set and an evaluation value of the fourth data set, and the second feature transformation is any one of the B feature transformations;
and a second training unit for training the third machine learning model according to the second samples.
21. An execution device, comprising: a processor and a memory for storing computer program code, the processor being adapted to invoke the computer program code to perform the data processing method according to any of claims 1-10.
22. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein computer program code which, when run on a processor, causes the processor to perform the data processing method according to any of claims 1-10.
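Read together, the steps of claim 1 describe a pruned, order-by-order search over transformed data sets. The following Python sketch illustrates that loop; the beam width keep, the stand-in transformations, and the evaluation function are illustrative assumptions rather than claimed details:

    def multi_order_feature_search(first_group, transforms, evaluate, keep=3, max_order=4):
        # first_group: the first group of data sets obtained by preprocessing
        # transforms:  feature transformations (each maps a data set to a data set)
        # evaluate:    returns a first evaluation value approximating the accuracy
        #              of a model trained on the given data set
        first_set = list(first_group)   # collects the data sets of every order
        current = list(first_group)     # the nth group of data sets
        for _ in range(max_order - 1):  # one iteration per additional order
            candidates = [t(ds) for ds in current for t in transforms]
            # The (n+1)-th group keeps fewer data sets than there are candidates.
            current = sorted(candidates, key=evaluate, reverse=True)[:keep]
            first_set.extend(current)
        # The target data set is determined from all data sets produced so far.
        return max(first_set, key=evaluate)

Pruning each group to keep data sets bounds the otherwise exponential growth of candidate data sets across orders; a target machine learning model would then be trained on the returned data set and sent to the client device.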
CN201910028386.XA 2019-01-11 2019-01-11 Data processing method, related equipment and system Active CN111435463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910028386.XA CN111435463B (en) 2019-01-11 2019-01-11 Data processing method, related equipment and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910028386.XA CN111435463B (en) 2019-01-11 2019-01-11 Data processing method, related equipment and system

Publications (2)

Publication Number Publication Date
CN111435463A CN111435463A (en) 2020-07-21
CN111435463B true CN111435463B (en) 2024-07-05

Family

ID=71580423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910028386.XA Active CN111435463B (en) 2019-01-11 2019-01-11 Data processing method, related equipment and system

Country Status (1)

Country Link
CN (1) CN111435463B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738356A (en) * 2020-07-23 2020-10-02 平安国际智慧城市科技股份有限公司 Object feature generation method, device, equipment and storage medium for specific data
CN112053558A (en) * 2020-08-25 2020-12-08 青岛海信网络科技股份有限公司 Traffic jam state identification method, device and equipment
CN114338416B (en) * 2020-09-29 2023-04-07 ***通信有限公司研究院 Space-time multi-index prediction method and device and storage medium
CN112200667B (en) * 2020-11-30 2021-02-05 上海冰鉴信息科技有限公司 Data processing method and device and computer equipment
CN112668723B (en) * 2020-12-29 2024-01-02 杭州海康威视数字技术股份有限公司 Machine learning method and system
CN113792952A (en) * 2021-02-23 2021-12-14 北京沃东天骏信息技术有限公司 Method and apparatus for generating a model
CN113449958B (en) * 2021-05-09 2022-05-10 武汉兴得科技有限公司 Intelligent epidemic prevention operation and maintenance management method and system
CN115730640A (en) * 2021-08-31 2023-03-03 华为技术有限公司 Data processing method, device and system
CN114490697B (en) * 2022-03-28 2022-09-06 山东国赢大数据产业有限公司 Data cooperative processing method and device based on block chain
CN114818516B (en) * 2022-06-27 2022-09-20 中国石油大学(华东) Intelligent prediction method for corrosion form profile of shaft

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103730131A (en) * 2012-10-12 2014-04-16 华为技术有限公司 Voice quality evaluation method and device
CN106485259A (en) * 2015-08-26 2017-03-08 华东师范大学 A kind of image classification method based on high constraint high dispersive principal component analysiss network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6277818B2 (en) * 2014-03-26 2018-02-14 日本電気株式会社 Machine learning apparatus, machine learning method, and program
WO2016004073A1 (en) * 2014-06-30 2016-01-07 Amazon Technologies, Inc. Machine learning service
CN108090570A (en) * 2017-12-20 2018-05-29 第四范式(北京)技术有限公司 For selecting the method and system of the feature of machine learning sample

Also Published As

Publication number Publication date
CN111435463A (en) 2020-07-21

Similar Documents

Publication Publication Date Title
CN111435463B (en) Data processing method, related equipment and system
CN110806954B (en) Method, device, equipment and storage medium for evaluating cloud host resources
WO2021068513A1 (en) Abnormal object recognition method and apparatus, medium, and electronic device
CN107220217A (en) Characteristic coefficient training method and device that logic-based is returned
CN110852881B (en) Risk account identification method and device, electronic equipment and medium
CN110264270B (en) Behavior prediction method, behavior prediction device, behavior prediction equipment and storage medium
CN114066073A (en) Power grid load prediction method
CN116760772B (en) Control system and method for converging flow divider
CN105472631A (en) Service data quantity and/or resource data quantity prediction method and prediction system
CN113037877A (en) Optimization method for time-space data and resource scheduling under cloud edge architecture
CN113610240A (en) Method and system for performing predictions using nested machine learning models
CN112766402A (en) Algorithm selection method and device and electronic equipment
CN115562940A (en) Load energy consumption monitoring method and device, medium and electronic equipment
CN115983497A (en) Time sequence data prediction method and device, computer equipment and storage medium
CN116684330A (en) Traffic prediction method, device, equipment and storage medium based on artificial intelligence
CN113344257B (en) Prediction method for layer analysis response time in homeland space cloud platform
CN116841753B (en) Stream processing and batch processing switching method and switching device
Almomani et al. Selecting a good stochastic system for the large number of alternatives
CN113723712B (en) Wind power prediction method, system, equipment and medium
CN111654853B (en) Data analysis method based on user information
CN116861226A (en) Data processing method and related device
CN113760407A (en) Information processing method, device, equipment and storage medium
CN112906723A (en) Feature selection method and device
Singh et al. A feature extraction and time warping based neural expansion architecture for cloud resource usage forecasting
CN112836770B (en) KPI (kernel performance indicator) anomaly positioning analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant