CN111783999A - Data processing method and device

Info

Publication number: CN111783999A
Application number: CN202010637851.2A
Authority: CN
Inventors: 徐兵; 罗刚; 傅雨梅
Original assignee: Beijing Zhiyin Intelligent Technology Co ltd
Current assignee: Beijing Zhiyin Intelligent Technology Co ltd
Priority date: 2020-07-01
Filing date: 2020-07-01
Publication date: 2020-10-16

Abstract

The invention provides a data processing method and a data processing device, relates to the technical field of big data, and aims to obtain a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables; respectively counting data corresponding to a plurality of characteristic variables in a target data sample to obtain a data index of each characteristic variable in the plurality of characteristic variables; and analyzing the data indexes of the characteristic variables to determine the characteristic variables related to the target variables in the characteristic variables. The method analyzes the data samples, and explores the characteristic variables related to the target variables, so that the characteristic variables useful for the target variables are selected when the prediction model is trained, and the prediction capability of the model is improved.

Description

Data processing method and device

Technical Field

The present invention relates to the field of big data technologies, and in particular, to a data processing method and apparatus.

Background

With the rapid development of computer science and the arrival of the big data era, big data analysis has become increasingly important across industries. In the information age, enterprises need to mine the information contained in data effectively and apply it in practice in a timely manner, so big data processing is a key requirement for every enterprise.

Currently, machine learning models are typically used to process data, and they need to be trained on data samples before use. Because the raw, unprocessed data of the data samples is fed directly into the model for training, the trained model has weak prediction capability.

Disclosure of Invention

The invention aims to provide a data processing method and a data processing device to alleviate the technical problem that a trained model has weak prediction capability because the raw, unprocessed data of the data samples is fed directly into the model for training.

In a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:

acquiring a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables;

respectively counting data corresponding to the characteristic variables in the target data sample to obtain a data index of each characteristic variable in the characteristic variables;

and analyzing the data indexes of the characteristic variables to determine the characteristic variables related to the target variable in the characteristic variables.

In an alternative embodiment, the characteristic variables include numerical variables and classification variables, and the step of analyzing the data indexes of the plurality of characteristic variables and determining the characteristic variable related to the target variable in the plurality of characteristic variables includes:

analyzing the data indexes of the numerical variables, and determining the relation between a single numerical variable and the target variable;

and analyzing the data indexes of the classification variables, and determining the relation between a single classification variable and the target variable.

In an optional embodiment, the step of analyzing the data indexes of the plurality of characteristic variables and determining a characteristic variable related to the target variable in the plurality of characteristic variables includes:

determining an important characteristic variable in the characteristic variables according to a data index of the characteristic variables on the basis of an importance evaluation index, and taking the important characteristic variable as a characteristic variable related to the target variable;

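wherein the importance evaluation index comprises one or more of information gain, information gain ratio, Gini impurity, proportion of feature weight in classification, chi-square test, and fast correlation-based filter (FCBF) feature selection.

In an alternative embodiment, the step of analyzing the data indexes of the plurality of characteristic variables and determining the characteristic variable related to the target variable in the plurality of characteristic variables includes: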

performing combined analysis on data indexes of every two characteristic variables in the plurality of characteristic variables to obtain target combined characteristic variables;

and determining a combined feature variable related to the target variable from the target combined feature variables.

In an alternative embodiment, the method further comprises:

and respectively carrying out abnormal value detection on the statistical indexes of the characteristic variables to obtain abnormal data.

In an alternative embodiment, the method further comprises:

and calculating linear correlation relations among the characteristic variables according to the data indexes of the characteristic variables to obtain the target characteristic variables with the linear correlation relations.

In an alternative embodiment, the method further comprises:

and carrying out dimensionality reduction analysis on the characteristic variables by adopting principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).

In a second aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus includes:

the acquisition module is used for acquiring a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables;

the statistical module is used for respectively carrying out statistics on data corresponding to the characteristic variables in the target data sample to obtain a data index of each characteristic variable in the characteristic variables;

and the characteristic analysis module is used for analyzing the data indexes of the characteristic variables and determining the characteristic variables related to the target variables in the characteristic variables.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions executable by the processor, and the processor executes the machine-executable instructions to implement the method described in any one of the foregoing embodiments.

In a fourth aspect, embodiments of the invention provide a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement a method as in any one of the preceding embodiments.

According to the data processing method and device provided by the embodiment of the invention, data corresponding to a plurality of characteristic variables in a target data sample are counted to obtain a data index of each characteristic variable in the plurality of characteristic variables, and then the data indexes of the plurality of characteristic variables are analyzed to determine the characteristic variable related to the target variable in the plurality of characteristic variables. By analyzing the data samples, the characteristic variables related to the target variables are explored, so that the characteristic variables useful for the target variables are selected when the prediction model is trained, and the prediction capability of the model is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;

FIG. 2 is a flow chart of another data processing method according to an embodiment of the present invention;

FIG. 3 is a diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Currently, machine learning models are typically used to process data, and they need to be trained on data samples before use. Because the raw, unprocessed data of the data samples is fed directly into the model for training, the trained model has weak prediction capability. Based on this, the data processing method and the data processing device provided by the embodiment of the invention analyze the data sample and explore the characteristic variables related to the target variable, so that the characteristic variables useful for the target variable are selected when the prediction model is trained, thereby improving the prediction capability of the model.

Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

Fig. 1 shows a flowchart of a data processing method according to an embodiment of the present invention. As shown in fig. 1, an embodiment of the present invention provides a data processing method, including the following steps:

step S101, obtaining a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables;

specifically, the prediction model is used for predicting a target variable according to a plurality of characteristic variables in data to be predicted. The target data sample may include data for multiple dimensions of a large number of objects, each dimension of data may be viewed as a characteristic variable, and each object corresponds to a category of the target variable.

For example, taking the passengers of the Titanic as objects, the data of the multiple dimensions corresponding to each object includes name, gender, age, identity ID, and cabin class, that is, a plurality of characteristic variables, and the corresponding target variable is whether the passenger survived or did not survive.

Step S102, respectively counting data corresponding to a plurality of characteristic variables in a target data sample to obtain a data index of each characteristic variable in the plurality of characteristic variables;

In this step, statistics may be computed over the data that different objects have for the same characteristic variable, such as name, gender, age, identity ID, and cabin class. Specifically, each characteristic variable has a data name, a data type, a variable type, and a variable role, and for each characteristic variable the following information can be counted: the number of missing values, the missing value proportion, the number of unique values, the unique value proportion, the mode, the mode percentage, the skewness, the minimum, the lower quartile, the median, the average, the upper quartile, the maximum, and the standard deviation. Skewness measures the degree of asymmetry of the data's distribution density curve relative to the average value; it is 0 for a perfectly symmetric distribution.

The data names can include name, gender, age, identity ID, cabin class and the like; data types may include integers, strings, floating point numbers, and the like; variable types can include numerical variables and classification variables, for example, name and cabin class are classification variables and age is a numerical variable; variable roles may include metadata, target, feature, and the like.

By counting the data indexes of the characteristic variables, the distribution of the data can be displayed, which provides a basis for data preprocessing and data selection for the prediction model. This embodiment can also visually display the data indexes of the characteristic variables and the target variable.
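By way of illustration only, the per-variable statistics of step S102 might be computed as in the following Python sketch. The function name `describe_feature`, the use of `None` to mark a missing value, the choice to express proportions relative to the total number of objects, and the sample data are all assumptions made for this example; the embodiment does not prescribe them.

```python
import statistics
from collections import Counter


def describe_feature(values):
    """Summarise one feature column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    n_total = len(values)
    summary = {
        "missing_count": n_total - len(present),
        "missing_ratio": (n_total - len(present)) / n_total,
        "unique_count": len(set(present)),
        "unique_ratio": len(set(present)) / n_total,
    }
    if not present:
        return summary
    mode_value, mode_count = Counter(present).most_common(1)[0]
    summary["mode"] = mode_value
    summary["mode_ratio"] = mode_count / n_total
    # The remaining indexes only make sense for numerical variables.
    if all(isinstance(v, (int, float)) for v in present) and len(present) >= 2:
        mean = statistics.fmean(present)
        std = statistics.pstdev(present)
        q1, median, q3 = statistics.quantiles(present, n=4)
        # Skewness as the third standardised moment: 0 for symmetric data.
        skew = 0.0
        if std > 0:
            skew = sum((v - mean) ** 3 for v in present) / len(present) / std ** 3
        summary.update(
            minimum=min(present), lower_quartile=q1, median=median, mean=mean,
            upper_quartile=q3, maximum=max(present), std=std, skewness=skew,
        )
    return summary


# Hypothetical sample: ages with one missing value, cabin classes as strings.
ages = [22, 38, 26, 35, None, 54, 2, 27]
age_summary = describe_feature(ages)
cabin_summary = describe_feature(["3", "1", "3", "1", "3", "1", "3", "3"])
print(age_summary)
print(cabin_summary)
```

For a classification variable such as the cabin class, only the counting indexes (missing values, unique values, mode) are produced, while a numerical variable such as age additionally receives the quartiles, average, standard deviation, and skewness.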

Step S103, analyzing the data indexes of the characteristic variables, and determining the characteristic variables related to the target variables in the characteristic variables.

In this step, the characteristic variables having a correlation with the target variable can be determined by counting the data indexes of the plurality of characteristic variables of a large number of objects and combining them with the target variable corresponding to each object. For example, by analyzing the data index of the characteristic variable of age, it may be found that objects aged 10 to 20 account for a high proportion of all survivors, while objects aged 50 to 60 account for a low proportion of all survivors, so that age can be regarded as a characteristic variable having a correlation with the target variable.

In the embodiment, the data samples are analyzed, and the characteristic variables related to the target variables are explored, so that the characteristic variables useful for the target variables are selected when the prediction model is trained, and the prediction capability of the model is improved.

In some embodiments, as can be seen from the above, the characteristic variables may include numerical variables and classification variables, as shown in fig. 2, and step S103 may be implemented by:

step S1031, analyzing the data indexes of the numerical variables, and determining the relationship between a single numerical variable and a target variable;

Specifically, the data indexes of the numerical variables may be displayed in the form of a box plot, so as to obtain the relationship between a single numerical variable and the target variable.

Step S1032, the data indexes of the classification variables are analyzed, and the relation between a single classification variable and the target variable is determined.

Specifically, for a classification variable, such as the cabin class, the data index of the classification variable can be shown in the form of a list, so as to obtain the relationship between a single classification variable and the target variable.
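Steps S1031 and S1032 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `five_number_by_target` and `rate_by_category`, the sample values, and the use of the per-class five-number summary (the numbers a box plot draws) are choices made for this example rather than details given by the embodiment.

```python
import statistics
from collections import defaultdict


def five_number_by_target(values, targets):
    """Box-plot statistics of one numerical variable, split by target class."""
    groups = defaultdict(list)
    for value, target in zip(values, targets):
        if value is not None:
            groups[target].append(value)
    result = {}
    for target, group in groups.items():
        # Each class is assumed to contain at least two values.
        q1, median, q3 = statistics.quantiles(group, n=4, method="inclusive")
        result[target] = (min(group), q1, median, q3, max(group))
    return result


def rate_by_category(categories, targets, positive=1):
    """Share of positive targets within each level of a classification variable."""
    counts = defaultdict(lambda: [0, 0])  # level -> [positives, total]
    for level, target in zip(categories, targets):
        counts[level][0] += target == positive
        counts[level][1] += 1
    return {level: pos / total for level, (pos, total) in counts.items()}


# Hypothetical sample data: 1 means survived, 0 means did not survive.
survived = [0, 1, 1, 1, 0, 0, 1, 0]
age = [22, 38, 26, 35, 35, 54, 2, 27]
cabin_class = [3, 1, 3, 1, 3, 1, 3, 3]

age_boxes = five_number_by_target(age, survived)
class_rates = rate_by_category(cabin_class, survived)
print(age_boxes)
print(class_rates)
```

A clear difference between the per-class summaries of a numerical variable, or between the per-level rates of a classification variable, suggests that the variable is related to the target variable.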

In addition, for the numerical characteristic variable, if the skewness of the data is too large, the prediction capability of the prediction model is affected, so that the skewness of the numerical characteristic variable can be displayed in this embodiment, and particularly, the data distribution can be visually displayed in the form of a statistical histogram.

In some embodiments, step S103 may also be implemented by:

determining an important characteristic variable in the plurality of characteristic variables according to the data index of the plurality of characteristic variables based on the importance evaluation index, and taking the important characteristic variable as a characteristic variable related to the target variable;

wherein the importance evaluation index comprises one or more of information gain, information gain ratio, Gini impurity, proportion of feature weight in classification, chi-square test, and fast correlation-based filter (FCBF) feature selection.

Information gain represents the degree to which the current characteristic variable reduces the uncertainty of the classification; the information gain ratio has the same goal as information gain but corrects the bias of information gain toward characteristic variables with many distinct values; Gini impurity represents the uncertainty of the set after splitting; the chi-square test indicates whether two classification variables are independent of each other; fast correlation-based filter feature selection measures feature correlation using symmetrical uncertainty instead of information gain. By selecting the important characteristic variables in this way, a certain number of relatively useless characteristic variables can be removed.
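Several of these importance evaluation indexes can be written down directly from their standard definitions. The following Python sketch is illustrative only: the function names and the sample data are assumptions, the chi-square test and feature-weight index are omitted, and the way the embodiment thresholds or ranks the scores is not specified in the text.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_by(feature, labels):
    """Group the target labels by the level of a classification feature."""
    groups = defaultdict(list)
    for level, label in zip(feature, labels):
        groups[level].append(label)
    return list(groups.values())


def information_gain(feature, labels):
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in split_by(feature, labels))
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    split_info = entropy(feature)  # penalises features with many levels
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


def symmetrical_uncertainty(feature, labels):
    denom = entropy(feature) + entropy(labels)
    return 2.0 * information_gain(feature, labels) / denom if denom > 0 else 0.0


# Hypothetical data: gender separates the classes; the ID is unique per object.
survived = [1, 1, 1, 0, 0, 0, 1, 0]
gender = ["f", "f", "f", "m", "m", "m", "f", "m"]
identity_id = list(range(8))

for name, feature in (("gender", gender), ("identity_id", identity_id)):
    print(name, information_gain(feature, survived), gain_ratio(feature, survived))
```

In this sample the identity ID obtains the same information gain as gender merely because every value is unique, while its information gain ratio is three times lower, which illustrates the correction described above.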

In some embodiments, abnormal value detection may be performed on the statistical indicators of the plurality of characteristic variables, respectively, to obtain abnormal data. The abnormal value detection process specifically comprises the following steps:

firstly, filtering infinite values and null values;

calculating the upper quartile q1 and the lower quartile q2 of each characteristic variable;

calculating the difference iqr between q1 and q2;

calculating the upper boundary of the outliers as q1 + 3 × iqr, and the lower boundary of the outliers as q2 - 3 × iqr;

regarding a value which is not between the upper boundary and the lower boundary as an abnormal value, namely abnormal data;

and displaying the names of the characteristic variables corresponding to the abnormal values, the distribution thumbnails of the characteristic variables, the number of the abnormal values, the percentage of the abnormal values, the upper boundary, the lower boundary, the abnormal types of the abnormal values and other information.
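The abnormal value detection steps above can be sketched as follows. The function name `detect_outliers`, the returned dictionary layout, the quartile interpolation method, and the sample fares are assumptions for illustration; the naming of q1 as the upper quartile and q2 as the lower quartile follows the text.

```python
import math
import statistics


def detect_outliers(values, k=3.0):
    """Flag values outside [q2 - k * iqr, q1 + k * iqr]."""
    # Filter infinite values and null values (NaN is dropped as well).
    clean = [v for v in values if v is not None and math.isfinite(v)]
    # q1 is the upper quartile and q2 the lower quartile, as in the text.
    q2, _, q1 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q1 - q2
    upper, lower = q1 + k * iqr, q2 - k * iqr
    outliers = [v for v in clean if v < lower or v > upper]
    return {
        "upper": upper,
        "lower": lower,
        "outliers": outliers,
        "count": len(outliers),
        "ratio": len(outliers) / len(clean),
    }


# Hypothetical fares containing a null value, an infinite value and one extreme value.
fares = [7.25, 8.05, 7.9, 13.0, 26.0, 512.33, None, float("inf"), 10.5, 8.0]
report = detect_outliers(fares)
print(report)
```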

In some embodiments, step S103 may also be implemented by:

step 1) carrying out combined analysis on data indexes of every two characteristic variables in a plurality of characteristic variables to obtain target combined characteristic variables;

and 2) determining characteristic variables related to the target variables from the target combination characteristic variables.

The combined analysis combines two characteristic variables into one, for example, by combining two characteristic variables into one new characteristic variable through cross multiplication. In particular, a certain number of relatively useless characteristic variables can be removed through the combined analysis. The combined analysis may use different combinations of numerical characteristic variables and classification-type characteristic variables, such as combined analysis among numerical characteristic variables, combined analysis among classification-type characteristic variables, and combined analysis between numerical characteristic variables and classification-type characteristic variables.

In the combined analysis among numerical characteristic variables, the result can be displayed as a scatter plot in which the value of the target variable is represented as 0 or 1. If, in the scatter plot formed by a pairwise combination, a straight line can separate the two classes of the target variable clearly and with a low error rate, the combination of those two characteristic variables is considered effective.

In the combined analysis among classification-type characteristic variables, the result can be displayed as a FactorPlot. The target variable (a classification variable in a classification model) is always placed on the vertical axis, and the horizontal axis shows the classification-type characteristic variables of each pairwise combination, so that the target variable is analyzed from multiple dimensions and angles.

In the combined analysis between numerical characteristic variables and classification-type characteristic variables, the result can be displayed as a violin plot, which combines the features of a box plot and a kernel density plot: the curved outlines on both sides show the distribution of the data, and the contents of a box plot are shown in the middle. The classification-type characteristic variable can be placed on the horizontal axis and the numerical characteristic variable on the vertical axis, with the two values (0 or 1) of the target variable shown on the left and right sides of the box plot respectively, so that the combination of the numerical characteristic variable and the classification-type characteristic variable is analyzed comprehensively.

Illustratively, when two characteristic variables of gender and age are analyzed in combination, if a female between 10 and 20 years of age accounts for a higher proportion of all survivors and a male between 50 and 60 years of age accounts for a lower proportion of all survivors, the combined characteristic variable pair has a correlation with the target variable.

It should be noted that if the number of numerical characteristic variables is too large, for example, more than 25, the combination analysis may not be performed.
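A minimal Python sketch of the pairwise combination is given below. It assumes element-wise multiplication as the cross-multiplication rule for numerical variables and a per-group positive rate as the statistic for two classification-type variables; the function names, the constant `MAX_NUMERIC_FEATURES`, and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

MAX_NUMERIC_FEATURES = 25  # example threshold mentioned in the text


def pairwise_products(columns):
    """Combine every two numerical variables into one by element-wise product."""
    if len(columns) > MAX_NUMERIC_FEATURES:
        return {}  # too many variables: the combined analysis is skipped
    return {
        f"{a}*{b}": [x * y for x, y in zip(columns[a], columns[b])]
        for a, b in combinations(columns, 2)
    }


def combined_rate(feature_a, feature_b, targets, positive=1):
    """Positive-target rate for each joint level of two classification variables."""
    counts = defaultdict(lambda: [0, 0])  # (level_a, level_b) -> [positives, total]
    for a, b, target in zip(feature_a, feature_b, targets):
        counts[(a, b)][0] += target == positive
        counts[(a, b)][1] += 1
    return {key: pos / total for key, (pos, total) in counts.items()}


# Hypothetical sample data.
numeric = {"age": [2, 3], "fare": [10, 20], "siblings": [1, 0]}
products = pairwise_products(numeric)

gender = ["f", "f", "m", "m", "f", "m"]
age_band = ["10-20", "10-20", "50-60", "50-60", "50-60", "10-20"]
survived = [1, 1, 0, 0, 1, 0]
rates = combined_rate(gender, age_band, survived)
print(products)
print(rates)
```

Joint levels whose rate differs strongly from the overall rate, such as the gender and age band combinations in this sample, indicate a combined characteristic variable that may be related to the target variable.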

In some embodiments, since there may be a correlation between the characteristic variables, the present embodiment may further calculate a linear correlation between the characteristic variables according to data indexes of a plurality of characteristic variables, so as to obtain a target characteristic variable having the linear correlation.

In this embodiment, a linear correlation between numerical characteristic variables may be calculated according to data indexes of a plurality of characteristic variables, and then displayed in a matrix diagram, for example, a dark color may be used to indicate that the correlation is strong, and a light color may be used to indicate that the correlation is weak.
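As an illustrative sketch, such a correlation matrix could be built as follows, assuming the Pearson correlation coefficient as the measure of linear correlation. The function names and the sample columns are assumptions, and the drawing of the matrix diagram itself is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict holding the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical numerical characteristic variables.
columns = {
    "age": [1.0, 2.0, 3.0, 4.0],
    "fare": [2.0, 4.0, 6.0, 8.0],
    "deck": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, row)
```

Pairs whose coefficient is close to 1 or -1 are the target characteristic variables having a linear correlation; in a matrix diagram they would receive the darkest cells.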

Specifically, the correlation between characteristic variables may be calculated using the Pearson correlation coefficient, as shown in formula (1):

$$\rho_{XY} = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^{2}}} \tag{1}$$

where X and Y are two characteristic variables; $\rho_{XY}$ represents the degree of linear correlation between X and Y, takes values in the range [-1, 1], and provides a simple and fast measure of the linear correlation between variables; $X_i$ is a sample of X; $Y_i$ is a sample of Y; n is the number of samples; $\bar{X}$ is the average value of the samples $X_i$; and $\bar{Y}$ is the average value of the samples $Y_i$.

In some embodiments, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) may also be used to perform dimensionality reduction analysis on the plurality of characteristic variables.

Specifically, machine learning is characterized by processing multidimensional big data, and different characteristic variables contribute to the prediction of the target variable to different degrees, so dimensionality reduction analysis is performed on the multidimensional characteristic variables using t-SNE and PCA. t-SNE is a non-linear dimensionality reduction algorithm for exploring high-dimensional data, whose goal is to map multi-dimensional data into two or three dimensions suitable for human observation. PCA is a linear dimensionality reduction algorithm for high-dimensional data, which synthesizes correlated high-dimensional variables into linearly uncorrelated low-dimensional variables. In this embodiment, dimensionality reduction may be performed on the plurality of characteristic variables and correlated characteristic variables may be combined, which not only retains information to the maximum extent but also enhances the interpretability of the prediction model, because in practice it is difficult to give a reasonable, easily understood explanation of a very high-dimensional model.

In practical application, the dimension reduction analysis process comprises the following steps:

a) Missing values are checked and, if any exist, filled using a default strategy.

b) The string attributes are checked and if so, encoded using a default policy.

c) The dimensionality reduction is carried out by using two modes of PCA and t-SNE, and the dimensionality reduction is displayed by drawing.
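Steps a) to c) can be sketched in Python for the PCA branch as follows. The text does not state the default strategies, so this example assumes mean filling for numerical columns, most-frequent-level filling and integer encoding for string columns, and a PCA computed by power iteration on the covariance matrix; all function names and sample values are hypothetical. t-SNE is an iterative optimization that is normally taken from a library, so it is not reproduced here, and the plotting step is omitted.

```python
import math
import random
from collections import Counter


def fill_and_encode(column):
    """Steps a) and b): fill missing values, then encode strings as numbers."""
    present = [v for v in column if v is not None]  # assumed non-empty
    if all(isinstance(v, str) for v in present):
        fill = Counter(present).most_common(1)[0][0]  # most frequent level
        codes = {level: i for i, level in enumerate(sorted(set(present)))}
        return [float(codes[fill if v is None else v]) for v in column]
    fill = sum(present) / len(present)  # column mean
    return [float(fill if v is None else v) for v in column]


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_eigenpair(cov, iterations=500, seed=0):
    """Largest eigenvalue and eigenvector of a covariance matrix (power iteration)."""
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in cov]
    start_norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / start_norm for v in vec]
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = matvec(cov, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
        eigenvalue = norm
    return eigenvalue, vec


def pca(rows, n_components=2):
    """Step c): project centred rows onto the leading principal components."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    components = []
    for _ in range(n_components):
        eigenvalue, vec = top_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the found component before searching for the next.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
    return [matvec(components, c) for c in centered]


# Hypothetical sample with one missing age and one missing port of embarkation.
columns = {
    "age": [22.0, None, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.1],
    "embarked": ["S", "C", None, "S"],
}
encoded = {name: fill_and_encode(col) for name, col in columns.items()}
rows = [list(values) for values in zip(*encoded.values())]
projected = pca(rows, n_components=2)
print(projected)
```

The columns are not standardized in this sketch, so a variable with a large numeric range, such as the fare here, dominates the first component; scaling the columns beforehand is a common design choice.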

In addition, the embodiment can also analyze the target variable, count the number of samples corresponding to each category in the target variable, and perform visual display. When the number of samples between the plurality of categories is largely different, a prompt may be given so that the relevant adjustment is made according to the prompt. The specific process for analyzing the target variable comprises the following steps:

1) The categories of the target variable and the number of samples in each category are counted, and the categories are arranged in ascending order of sample number.

2) The difference between the number of samples in the first category and the number in the last category is judged. The classification is determined to be unbalanced when the number of samples in the first category × 2 < the number of samples in the last category.

3) Showing the quantity histogram, the category, the sample number, the proportion and the like of each category.
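The target variable analysis above can be sketched as follows. The function name `analyse_target`, the `factor` parameter, and the report layout are assumptions for illustration; the histogram display is omitted.

```python
from collections import Counter


def analyse_target(targets, factor=2):
    """Count samples per category and flag an unbalanced classification."""
    # Step 1: categories sorted in ascending order of sample number.
    counts = sorted(Counter(targets).items(), key=lambda item: item[1])
    total = len(targets)
    smallest, largest = counts[0][1], counts[-1][1]
    report = [
        {"category": category, "samples": k, "ratio": k / total}
        for category, k in counts
    ]
    # Step 2: unbalanced when the first count times 2 is below the last count.
    return {"imbalanced": smallest * factor < largest, "categories": report}


skewed = analyse_target([0] * 7 + [1] * 3)
even = analyse_target([0] * 6 + [1] * 4)
print(skewed)
print(even)
```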

In the embodiment, data are analyzed and probed from various aspects and dimensions, and deep analysis is performed on the data, so that the aim of comprehensively understanding the data is fulfilled.

As shown in fig. 3, an embodiment of the present invention provides a data processing apparatus, including:

an obtaining module 31, configured to obtain a target data sample for training a prediction model, where the target data sample includes a target variable and a plurality of feature variables;

the statistical module 32 is configured to perform statistics on data corresponding to a plurality of characteristic variables in the target data sample, respectively, to obtain a data index of each characteristic variable in the plurality of characteristic variables;

the characteristic analysis module 33 is configured to analyze the data indexes of the plurality of characteristic variables and determine a characteristic variable related to the target variable in the plurality of characteristic variables.

In a possible embodiment, the feature variables include a numerical variable and a classification variable, and the feature analysis module 33 includes:

the numerical variable analysis unit is used for analyzing data indexes of the numerical variables and determining the relation between a single numerical variable and a target variable;

and the classification type variable analysis unit is used for analyzing the data indexes of the classification type variables and determining the relation between the single classification type variable and the target variable.

In a possible embodiment, the feature analysis module 33 is further configured to:

determining an important characteristic variable in the plurality of characteristic variables according to the data indexes of the plurality of characteristic variables on the basis of the importance evaluation indexes, and taking the important characteristic variable as a characteristic variable related to the target variable.

In a possible embodiment, the device further comprises:

and the detection module is used for respectively carrying out abnormal value detection on the statistical indexes of the characteristic variables to obtain abnormal data.

In a possible embodiment, the feature analysis module 33 comprises:

the combined analysis unit is used for performing combined analysis on the data indexes of every two characteristic variables in the plurality of characteristic variables to obtain a target combined characteristic variable;

and the determining unit is used for determining the characteristic variable related to the target variable from the target combination characteristic variables.

In a possible embodiment, the device further comprises:

and the calculation module is used for calculating the linear correlation among the characteristic variables according to the data indexes of the characteristic variables to obtain the target characteristic variables with the linear correlation.

In a possible embodiment, the device further comprises:

and the dimensionality reduction module is used for carrying out dimensionality reduction analysis on the plurality of characteristic variables by adopting principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).

The data processing apparatus provided in the embodiment of the present invention may be specific hardware on a device, or software or firmware installed on a device. The apparatus provided by the embodiment of the present invention has the same implementation principle and technical effect as the foregoing method embodiments; for brevity, where the apparatus embodiments do not mention a detail, reference may be made to the corresponding contents in the method embodiments.

Referring to fig. 4, an embodiment of the present invention further provides an electronic device 400, including: a processor 401, a memory 402, a bus 403 and a communication interface 404, wherein the processor 401, the communication interface 404 and the memory 402 are connected through the bus 403; the memory 402 is used to store programs; the processor 401 is configured to call a program stored in the memory 402 through the bus 403 to execute the data processing method of the above-described embodiment.

The Memory 402 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 404 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.

Bus 403 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 4, but that does not indicate only one bus or one type of bus.

The memory 402 is used for storing a program, the processor 401 executes the program after receiving an execution instruction, and the method executed by the apparatus defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 401, or implemented by the processor 401.

The processor 401 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 401 or by instructions in the form of software. The processor 401 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 402, and the processor 401 reads the information in the memory 402 and completes the steps of the method in combination with its hardware.

Embodiments of the present invention also provide a machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the data processing method described above.
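As an illustration only, the following minimal Python sketch shows what such stored instructions might look like for the three steps summarized in the abstract: obtaining a sample with a target variable and several characteristic variables, computing a data index for each characteristic variable, and analyzing the indexes to find the variables related to the target. The choice of the Pearson correlation coefficient as the data index, the threshold value, and the names `pearson` and `select_related_features` are assumptions made for this sketch and are not taken from the embodiments.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (assumed convention).
    """
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)


def select_related_features(sample, target, threshold=0.5):
    """Return the names of characteristic variables related to the target.

    sample maps variable names to columns of values; the data index used
    here (absolute Pearson correlation) and the threshold are hypothetical.
    """
    y = sample[target]
    # Step 2: compute one data index per characteristic variable.
    indexes = {
        name: pearson(values, y)
        for name, values in sample.items()
        if name != target
    }
    # Step 3: keep the variables whose index meets the threshold.
    return sorted(name for name, r in indexes.items() if abs(r) >= threshold)


# Step 1: a toy target data sample (fabricated purely for illustration).
sample = {
    "age": [20, 30, 40, 50],
    "noise": [1, -1, -1, 1],
    "target": [1, 2, 3, 4],
}
related = select_related_features(sample, "target")
```

In this toy sample, `age` varies linearly with `target` while `noise` is uncorrelated with it, so only `age` is retained. A real implementation would likely treat numerical and classification variables differently, as claim 2 suggests.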

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in an actual implementation; as a further example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some communication interfaces, and may be electrical, mechanical, or in another form.

The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of data processing, the method comprising:

2. The method according to claim 1, wherein the characteristic variables include numerical variables and classification variables, and the step of analyzing the data indexes of the plurality of characteristic variables to determine the characteristic variable related to the target variable among the plurality of characteristic variables comprises:

3. The method of claim 1, wherein the step of analyzing the data indexes of the plurality of characteristic variables to determine the characteristic variable related to the target variable among the plurality of characteristic variables comprises:

4. The method according to claim 1 or 2, wherein the step of analyzing the data indexes of the plurality of characteristic variables to determine the characteristic variable related to the target variable in the plurality of characteristic variables comprises:

5. The method of claim 1, further comprising:

6. The method of claim 1, further comprising:

7. The method of claim 1, further comprising:

8. A data processing apparatus, characterized in that the apparatus comprises:

9. An electronic device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor to perform the method of any of claims 1-7.

10. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any of claims 1-7.