CN111783999A - Data processing method and device - Google Patents

Publication number
CN111783999A
Authority
CN
China
Prior art keywords
characteristic
variables
data
characteristic variables
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010637851.2A
Other languages
Chinese (zh)
Inventor
徐兵
罗刚
傅雨梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyin Intelligent Technology Co ltd
Original Assignee
Beijing Zhiyin Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyin Intelligent Technology Co ltd filed Critical Beijing Zhiyin Intelligent Technology Co ltd
Priority to CN202010637851.2A priority Critical patent/CN111783999A/en
Publication of CN111783999A publication Critical patent/CN111783999A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a data processing method and device, relating to the technical field of big data. The method comprises: obtaining a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables; computing statistics on the data corresponding to each of the plurality of characteristic variables in the target data sample to obtain a data index for each characteristic variable; and analyzing the data indexes of the plurality of characteristic variables to determine which characteristic variables are related to the target variable. By analyzing the data sample and exploring the characteristic variables related to the target variable, the method allows characteristic variables useful for predicting the target variable to be selected when the prediction model is trained, thereby improving the predictive capability of the model.

Description

Data processing method and device
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a data processing method and apparatus.
Background
With the rapid development of computer science and big data era, the importance of big data analysis in various industries is more and more prominent. In the information-oriented era, information in data is effectively mined and timely applied in practice, so that the processing of big data is a key requirement of each enterprise.
Currently, machine learning models are typically used to process data, and they must be trained on data samples before use. Because the raw, unprocessed data of the data sample is usually fed directly into the model for training, the resulting model has weak predictive capability.
Disclosure of Invention
The invention aims to provide a data processing method and device to alleviate the technical problem that a model trained directly on the raw data of a data sample has weak predictive capability.
In a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
acquiring a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables;
respectively counting data corresponding to the characteristic variables in the target data sample to obtain a data index of each characteristic variable in the characteristic variables;
and analyzing the data indexes of the characteristic variables to determine the characteristic variables related to the target variable in the characteristic variables.
In an alternative embodiment, the step of analyzing the data indexes of the plurality of characteristic variables and determining the characteristic variable related to the target variable in the plurality of characteristic variables includes:
analyzing the data indexes of the numerical variables, and determining the relation between a single numerical variable and the target variable;
and analyzing the data indexes of the categorical variables, and determining the relationship between a single categorical variable and the target variable.
In an optional embodiment, the step of analyzing the data indexes of the plurality of characteristic variables and determining a characteristic variable related to the target variable in the plurality of characteristic variables includes:
determining an important characteristic variable in the characteristic variables according to a data index of the characteristic variables on the basis of an importance evaluation index, and taking the important characteristic variable as a characteristic variable related to the target variable;
wherein the importance evaluation index comprises one or more of information gain, information gain ratio, Gini impurity, the proportion of feature weight in classification, the chi-square test, and fast correlation-based filter (FCBF) feature selection.
In an optional embodiment, the step of analyzing the data indexes of the plurality of characteristic variables and determining a characteristic variable related to the target variable in the plurality of characteristic variables includes:
performing combined analysis on data indexes of every two characteristic variables in the plurality of characteristic variables to obtain target combined characteristic variables;
and determining a combined feature variable related to the target variable from the target combined feature variables.
In an alternative embodiment, the method further comprises:
and respectively carrying out abnormal value detection on the statistical indexes of the characteristic variables to obtain abnormal data.
In an alternative embodiment, the method further comprises:
and calculating linear correlation relations among the characteristic variables according to the data indexes of the characteristic variables to obtain the target characteristic variables with the linear correlation relations.
In an alternative embodiment, the method further comprises:
and performing dimensionality reduction analysis on the plurality of characteristic variables using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).
In a second aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus includes:
an acquisition module, configured to acquire a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables;
the statistical module is used for respectively carrying out statistics on data corresponding to the characteristic variables in the target data sample to obtain a data index of each characteristic variable in the characteristic variables;
and the characteristic analysis module is used for analyzing the data indexes of the characteristic variables and determining the characteristic variables related to the target variables in the characteristic variables.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions executable by the processor, and the processor executes the machine-executable instructions to implement the method described in any one of the foregoing embodiments.
In a fourth aspect, embodiments of the invention provide a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement a method as in any one of the preceding embodiments.
According to the data processing method and device provided by the embodiment of the invention, data corresponding to a plurality of characteristic variables in a target data sample are counted to obtain a data index of each characteristic variable in the plurality of characteristic variables, and then the data indexes of the plurality of characteristic variables are analyzed to determine the characteristic variable related to the target variable in the plurality of characteristic variables. By analyzing the data samples, the characteristic variables related to the target variables are explored, so that the characteristic variables useful for the target variables are selected when the prediction model is trained, and the prediction capability of the model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another data processing method according to an embodiment of the present invention;
FIG. 3 is a diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Currently, machine learning models are typically used to process data, and they must be trained on data samples before use. Because the raw, unprocessed data of the data sample is usually fed directly into the model for training, the resulting model has weak predictive capability. Based on this, the data processing method and device provided by the embodiments of the invention analyze the data sample and explore the characteristic variables related to the target variable, so that characteristic variables useful for predicting the target variable can be selected when the prediction model is trained, thereby improving the predictive capability of the model.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Fig. 1 shows a flowchart of a data processing method according to an embodiment of the present invention. As shown in fig. 1, an embodiment of the present invention provides a data processing method, including the following steps:
Step S101, obtaining a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables;
Specifically, the prediction model is used to predict the target variable from a plurality of characteristic variables in the data to be predicted. The target data sample may include data in multiple dimensions for a large number of objects; each dimension of data may be regarded as a characteristic variable, and each object corresponds to one category of the target variable.
For example, taking the passengers on the Titanic as objects, the multi-dimensional data corresponding to each object includes name, gender, age, identity ID, and cabin class, which are the characteristic variables, and the corresponding target variable is whether the passenger survived.
Step S102, respectively counting data corresponding to a plurality of characteristic variables in a target data sample to obtain a data index of each characteristic variable in the plurality of characteristic variables;
In this step, statistics may be computed over the data that different objects hold for the same characteristic variable, such as name, gender, age, identity ID, and cabin class. Specifically, each characteristic variable has a data name, a data type, a variable type, and a variable role, and for each characteristic variable the following may be computed: the number of missing values, the proportion of missing values, the number of unique values, the proportion of unique values, the mode, the mode percentage, the skewness, the minimum, the lower quartile, the median, the mean, the upper quartile, the maximum, and the standard deviation. Skewness measures how asymmetric the distribution density curve of the data is about the mean; a perfectly symmetric distribution has a skewness of 0.
The data names may include name, gender, age, identity ID, cabin class, and the like; the data types may include integer, string, floating-point number, and the like; the variable types may include numerical variables and categorical variables, for example, name and cabin class are categorical variables and age is a numerical variable; and the variable roles may include metadata, target, feature, and the like.
Computing the data indexes of the characteristic variables reveals how the data is distributed and provides a basis for preprocessing and selecting the data used by the prediction model. This embodiment can also display the data indexes of the characteristic variables and the target variable visually.
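The per-variable statistics of step S102 can be sketched as follows for a single numerical characteristic variable. This is only an illustration under stated assumptions: the function name `feature_indices`, the use of `None` for missing values, the population form of the standard deviation, the moment-based skewness, and the "inclusive" quartile method are choices made here and are not specified in the source.

```python
import statistics
from collections import Counter


def feature_indices(values):
    """Summarise one numerical feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    # Population standard deviation; the source does not say which form is used.
    std = statistics.pstdev(present)
    # Moment-based skewness: third central moment divided by std cubed.
    # It is 0 for a perfectly symmetric sample.
    skew = sum((v - mean) ** 3 for v in present) / n / std ** 3 if std > 0 else 0.0
    # Quartiles by linear interpolation over the sorted sample (assumed method).
    lower_q, median, upper_q = statistics.quantiles(present, n=4, method="inclusive")
    mode, mode_count = Counter(present).most_common(1)[0]
    return {
        "missing_count": len(values) - n,
        "missing_ratio": (len(values) - n) / len(values),
        "unique_count": len(set(present)),
        "unique_ratio": len(set(present)) / n,
        "mode": mode,
        "mode_percentage": mode_count / n,
        "skewness": skew,
        "min": min(present),
        "lower_quartile": lower_q,
        "median": median,
        "mean": mean,
        "upper_quartile": upper_q,
        "max": max(present),
        "std": std,
    }
```

Calling `feature_indices` on the age column of a sample would return one dictionary per characteristic variable, which could then be tabulated or plotted. The sketch assumes at least two non-missing values, which `statistics.quantiles` requires on older Python versions.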
Step S103, analyzing the data indexes of the characteristic variables, and determining the characteristic variables related to the target variables in the characteristic variables.
In this step, the characteristic variables correlated with the target variable can be determined by computing the data indexes of the plurality of characteristic variables over a large number of objects and combining them with the target variable corresponding to each object. For example, by analyzing the data index of the characteristic variable age, it may be found that subjects aged 10 to 20 account for a high proportion of all survivors, while subjects aged 50 to 60 account for a low proportion, so age can be regarded as a characteristic variable correlated with the target variable.
In the embodiment, the data samples are analyzed, and the characteristic variables related to the target variables are explored, so that the characteristic variables useful for the target variables are selected when the prediction model is trained, and the prediction capability of the model is improved.
In some embodiments, as described above, the characteristic variables may include numerical variables and categorical variables. As shown in fig. 2, step S103 may be implemented by:
Step S1031, analyzing the data indexes of the numerical variables, and determining the relationship between a single numerical variable and the target variable;
Specifically, the data indexes of a numerical variable may be displayed as a box plot to obtain the relationship between the single numerical variable and the target variable.
Step S1032, analyzing the data indexes of the categorical variables, and determining the relationship between a single categorical variable and the target variable.
Specifically, for a categorical variable such as cabin class, the data indexes of the categorical variable may be shown as a list to obtain the relationship between the single categorical variable and the target variable.
In addition, for a numerical characteristic variable, excessive skewness of the data affects the predictive capability of the prediction model, so this embodiment may display the skewness of numerical characteristic variables, in particular by showing the data distribution visually as a statistical histogram.
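The single-variable analysis of steps S1031 and S1032 can be sketched without any plotting: the numbers behind a per-class box plot are a five-number summary for each value of the target variable, and the list view of a categorical variable can be a table of the positive-target rate per category. The function names, the `None` convention for missing values, and the choice of `1` as the positive target value are assumptions made for illustration.

```python
import statistics
from collections import defaultdict


def five_number_by_target(values, targets):
    """Box-plot numbers (min, Q1, median, Q3, max) of a numerical feature per target class."""
    groups = defaultdict(list)
    for v, t in zip(values, targets):
        if v is not None:  # skip missing values (assumed to be None)
            groups[t].append(v)
    summary = {}
    for t, vals in groups.items():
        # Each class needs at least two values for quantiles on older Python versions.
        q1, med, q3 = statistics.quantiles(vals, n=4, method="inclusive")
        summary[t] = (min(vals), q1, med, q3, max(vals))
    return summary


def target_rate_by_category(categories, targets, positive=1):
    """Share of objects with the positive target value within each category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [positives, total]
    for c, t in zip(categories, targets):
        counts[c][1] += 1
        if t == positive:
            counts[c][0] += 1
    return {c: pos / total for c, (pos, total) in counts.items()}
```

If the per-class summaries of a numerical variable differ clearly, or the rates differ clearly across categories, the variable is a candidate for being related to the target variable.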
In some embodiments, step S103 may also be implemented by:
determining an important characteristic variable in the plurality of characteristic variables according to the data index of the plurality of characteristic variables based on the importance evaluation index, and taking the important characteristic variable as a characteristic variable related to the target variable;
wherein the importance evaluation index comprises one or more of information gain, information gain ratio, Gini impurity, the proportion of feature weight in classification, the chi-square test, and fast correlation-based filter (FCBF) feature selection.
Information gain measures how much the current characteristic variable reduces the uncertainty of the classification; the information gain ratio has the same goal as information gain but corrects the bias of information gain toward characteristic variables that take many distinct values; Gini impurity measures the uncertainty of the set after splitting; the chi-square test indicates whether two categorical variables are independent of each other; and fast correlation-based filter feature selection measures feature correlation using symmetrical uncertainty instead of information gain. By selecting the important characteristic variables, a certain number of relatively useless characteristic variables can be removed.
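Three of these indexes can be sketched for a categorical characteristic variable using their standard textbook definitions. The function names are illustrative, and the source does not state which exact formulas or implementation it uses.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_by(feature, target):
    """Group target labels by the value of the feature."""
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    return groups


def information_gain(feature, target):
    """Entropy of the target minus its weighted entropy after splitting on the feature."""
    n = len(target)
    conditional = sum(len(g) / n * entropy(g) for g in split_by(feature, target).values())
    return entropy(target) - conditional


def gain_ratio(feature, target):
    """Information gain divided by the entropy of the feature itself."""
    intrinsic = entropy(feature)
    return information_gain(feature, target) / intrinsic if intrinsic > 0 else 0.0
```

The correction made by the gain ratio shows up with an ID-like variable: a unique ID per object reaches the maximum information gain, but its gain ratio is lower than that of a two-valued variable that separates the classes equally well.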
In some embodiments, abnormal value detection may be performed on the statistical indicators of the plurality of characteristic variables, respectively, to obtain abnormal data. The abnormal value detection process specifically comprises the following steps:
first, filtering out infinite values and null values;
calculating the upper quartile q1 and the lower quartile q2 of each characteristic variable;
calculating the difference iqr between q1 and q2, that is, iqr = q1 - q2;
calculating the upper boundary of the outliers as q1 + 3 × iqr and the lower boundary as q2 - 3 × iqr;
regarding any value that does not lie between the lower boundary and the upper boundary as an abnormal value, that is, abnormal data;
and displaying, for each characteristic variable with abnormal values, its name, a thumbnail of its distribution, the number of abnormal values, the percentage of abnormal values, the upper boundary, the lower boundary, the abnormality type, and other information.
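The abnormal value detection steps above can be sketched as follows. The function name, the treatment of NaN as a null value, and the "inclusive" quartile method are assumptions; the multiplier 3 and the boundary formulas follow the text.

```python
import math
import statistics


def detect_outliers(values, k=3.0):
    """Flag values outside [lower quartile - k*iqr, upper quartile + k*iqr]."""
    # Step 1: filter out null and infinite values (NaN is treated as null here).
    clean = [v for v in values if v is not None and math.isfinite(v)]
    # Step 2: quartiles; the text calls the upper quartile q1 and the lower quartile q2.
    lower_q, _, upper_q = statistics.quantiles(clean, n=4, method="inclusive")
    # Step 3: interquartile range.
    iqr = upper_q - lower_q
    # Step 4: boundaries.
    upper = upper_q + k * iqr
    lower = lower_q - k * iqr
    # Step 5: anything outside the boundaries is abnormal data.
    outliers = [v for v in clean if v < lower or v > upper]
    return {
        "lower_boundary": lower,
        "upper_boundary": upper,
        "outliers": outliers,
        "outlier_count": len(outliers),
        "outlier_percentage": len(outliers) / len(clean),
    }
```

The returned dictionary holds the figures listed in the last step (count, percentage, and boundaries); the distribution thumbnail and abnormality type are display concerns and are left out of this sketch.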
In some embodiments, step S103 may also be implemented by:
Step 1) performing combined analysis on the data indexes of every two characteristic variables in the plurality of characteristic variables to obtain target combined characteristic variables;
Step 2) determining the characteristic variables related to the target variable from the target combined characteristic variables.
Combined analysis merges two characteristic variables into one, for example by cross-multiplying two characteristic variables to form a new characteristic variable. In particular, combined analysis allows a certain number of relatively useless characteristic variables to be removed. Different combinations of numerical and categorical characteristic variables may be analyzed: combinations among numerical characteristic variables, combinations among categorical characteristic variables, and combinations between numerical and categorical characteristic variables.
Combinations among numerical characteristic variables can be displayed as a scatter plot in which the value of the target variable is shown as 0 or 1. If a straight line can divide the scatter plot of a pairwise combination so that the two classes of the target variable are clearly separated with a low error rate, the combination of the two characteristic variables is relatively effective.
Combinations among categorical characteristic variables can be displayed as a FactorPlot. The target variable (the categorical variable being predicted in a classification model) is always placed on the vertical axis, and the horizontal axis shows the combination of the other categorical characteristic variable with the target variable in each pairwise combination, so that the target variable is analyzed from multiple dimensions and angles.
Combinations between numerical and categorical characteristic variables can be displayed as a violin plot, which combines the features of a box plot and a kernel density plot: the curved outlines on both sides show the distribution of the data, and the content of the box plot is shown in the middle. The categorical characteristic variable can be placed on the horizontal axis and the numerical characteristic variable on the vertical axis, with the two values (0 or 1) of the target variable shown on the left and right sides of the box plot, so that the combination of the numerical and categorical characteristic variables is analyzed comprehensively.
Illustratively, when two characteristic variables of gender and age are analyzed in combination, if a female between 10 and 20 years of age accounts for a higher proportion of all survivors and a male between 50 and 60 years of age accounts for a lower proportion of all survivors, the combined characteristic variable pair has a correlation with the target variable.
It should be noted that if the number of numerical characteristic variables is too large, for example, more than 25, the combination analysis may not be performed.
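The pairwise cross-multiplication of numerical characteristic variables, together with the limit of 25 variables, can be sketched as follows. The function name, the column layout (a dictionary from variable name to a list of values), and the naming of the combined variables are assumptions made for illustration.

```python
from itertools import combinations

MAX_NUMERIC_FEATURES = 25  # threshold mentioned in the text


def cross_features(columns):
    """Build one product feature for every pair of numerical columns.

    columns: dict mapping a feature name to a list of numbers (assumed layout).
    Returns an empty dict when there are too many features to combine.
    """
    if len(columns) > MAX_NUMERIC_FEATURES:
        return {}
    combined = {}
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        combined[f"{name_a}*{name_b}"] = [x * y for x, y in zip(xs, ys)]
    return combined
```

Each combined variable can then be scored against the target variable in the same way as a single characteristic variable, and combinations that do not help can be discarded.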
In some embodiments, since there may be a correlation between the characteristic variables, the present embodiment may further calculate a linear correlation between the characteristic variables according to data indexes of a plurality of characteristic variables, so as to obtain a target characteristic variable having the linear correlation.
In this embodiment, a linear correlation between numerical characteristic variables may be calculated according to data indexes of a plurality of characteristic variables, and then displayed in a matrix diagram, for example, a dark color may be used to indicate that the correlation is strong, and a light color may be used to indicate that the correlation is weak.
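As a minimal sketch of such a correlation matrix, the following assumes the Pearson correlation coefficient as the measure of linear correlation and a dictionary from variable name to a list of values as the input layout. The function names are illustrative, and returning NaN for a constant variable is a choice made here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return float("nan")  # undefined for a constant variable
    return cov / (dev_x * dev_y)


def correlation_matrix(columns):
    """Nested dict of pairwise coefficients: matrix[a][b] is the coefficient of a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

Coefficients with a large absolute value would be drawn in a dark colour in the matrix diagram, and pairs above a chosen threshold would be reported as linearly correlated characteristic variables.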
Specifically, the feature correlation may be calculated using the Pearson correlation coefficient, as shown in formula (1):
$$\rho_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{1}$$
where X and Y are two characteristic variables; $\rho_{XY}$ represents the degree of linear correlation between X and Y, takes values in the range [-1, 1], and provides a simple and fast measure of the linear correlation between variables; $X_i$ is a sample of X; $Y_i$ is a sample of Y; n is the number of samples; $\bar{X}$ is the mean of the samples $X_i$; and $\bar{Y}$ is the mean of the samples $Y_i$.
In some embodiments, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) may also be used to perform dimensionality reduction analysis on the plurality of characteristic variables.
Specifically, machine learning is characterized by processing large, multi-dimensional data, and different characteristic variables contribute to the prediction of the target variable to different degrees, so t-SNE and PCA are used to perform dimensionality reduction analysis on the multi-dimensional characteristic variables. t-SNE is a nonlinear dimensionality reduction algorithm for exploring high-dimensional data; its goal is to map multi-dimensional data into two or more dimensions suitable for human observation. PCA is a linear dimensionality reduction algorithm for high-dimensional data that synthesizes correlated high-dimensional variables into linearly uncorrelated low-dimensional variables. In this embodiment, the plurality of characteristic variables may be reduced in dimension and correlated characteristic variables may be merged, which not only retains as much information as possible but also enhances the interpretability of the prediction model, since in practice it is difficult to give a reasonable, plain explanation of an ultra-high-dimensional model.
In practical application, the dimension reduction analysis process comprises the following steps:
a) Check for missing values; if any exist, fill them using a default strategy.
b) Check for string attributes; if any exist, encode them using a default strategy.
c) Perform dimensionality reduction using both PCA and t-SNE, and display the results as plots.
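Steps a) to c) can be sketched with NumPy as follows, covering the PCA half only. The default strategies are not specified in the source, so the ones used here are assumptions: missing numbers are filled with the column mean, missing strings with the most frequent value, and strings are encoded as ordinal codes. The function names are illustrative. t-SNE is not implemented in this sketch; in practice it would typically come from a library such as scikit-learn (`sklearn.manifold.TSNE`).

```python
from collections import Counter

import numpy as np


def prepare_column(values):
    """Fill missing values (None) and encode strings as numbers (assumed defaults)."""
    present = [v for v in values if v is not None]
    if any(isinstance(v, str) for v in present):
        fill = Counter(present).most_common(1)[0][0]  # assumed default: most frequent value
        codes = {c: i for i, c in enumerate(sorted(set(present)))}  # assumed: ordinal codes
        return [float(codes[fill if v is None else v]) for v in values]
    fill = sum(present) / len(present)  # assumed default: column mean
    return [float(fill if v is None else v) for v in values]


def pca_project(columns, n_components=2):
    """Project the prepared data onto its first principal components via SVD."""
    X = np.array([prepare_column(col) for col in columns.values()]).T
    X = X - X.mean(axis=0)  # centre each feature
    _, singular_values, vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ vt[:n_components].T  # coordinates in the reduced space
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, explained[:n_components]
```

The two columns of `scores` can be drawn as a scatter plot coloured by the target variable, and `explained` gives the share of variance retained by each component. Features are only centred here, not scaled; scaling them first is a common alternative when their units differ.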
In addition, the embodiment can also analyze the target variable, count the number of samples corresponding to each category in the target variable, and perform visual display. When the number of samples between the plurality of categories is largely different, a prompt may be given so that the relevant adjustment is made according to the prompt. The specific process for analyzing the target variable comprises the following steps:
1) Count the number of samples in each category of the target variable and sort the categories in ascending order of count.
2) Compare the sample counts of the first and last categories. The classes are judged to be imbalanced when the count of the first category × 2 < the count of the last category.
3) Showing the quantity histogram, the category, the sample number, the proportion and the like of each category.
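The target variable analysis above can be sketched as follows. The function name and the output layout are assumptions; the factor 2 follows the imbalance rule in step 2).

```python
from collections import Counter


def target_balance(targets, ratio=2):
    """Count samples per category and flag imbalance between smallest and largest."""
    # Step 1: count per category and sort in ascending order of count.
    counts = sorted(Counter(targets).items(), key=lambda item: item[1])
    total = len(targets)
    # Step 2: imbalanced when smallest count * ratio < largest count.
    smallest, largest = counts[0][1], counts[-1][1]
    imbalanced = smallest * ratio < largest
    # Step 3: category, sample count, and proportion for display.
    table = [(category, n, n / total) for category, n in counts]
    return table, imbalanced
```

When `imbalanced` is true, a prompt can be shown so that the sample is adjusted, for example by resampling, before the prediction model is trained.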
In the embodiment, data are analyzed and probed from various aspects and dimensions, and deep analysis is performed on the data, so that the aim of comprehensively understanding the data is fulfilled.
As shown in fig. 3, an embodiment of the present invention provides a data processing apparatus, including:
an obtaining module 31, configured to obtain a target data sample for training a prediction model, where the target data sample includes a target variable and a plurality of feature variables;
the statistical module 32 is configured to perform statistics on data corresponding to a plurality of characteristic variables in the target data sample, respectively, to obtain a data index of each characteristic variable in the plurality of characteristic variables;
the characteristic analysis module 33 is configured to analyze the data indexes of the plurality of characteristic variables and determine a characteristic variable related to the target variable in the plurality of characteristic variables.
In a possible embodiment, the characteristic variables include numerical variables and categorical variables, and the feature analysis module 33 includes:
the numerical variable analysis unit, configured to analyze the data indexes of the numerical variables and determine the relationship between a single numerical variable and the target variable;
and the categorical variable analysis unit, configured to analyze the data indexes of the categorical variables and determine the relationship between a single categorical variable and the target variable.
In a possible embodiment, the feature analysis module 33 is further configured to:
determining an important characteristic variable in the plurality of characteristic variables according to the data indexes of the plurality of characteristic variables on the basis of the importance evaluation indexes, and taking the important characteristic variable as a characteristic variable related to the target variable;
wherein the importance evaluation index comprises one or more of information gain, information gain ratio, Gini impurity, the proportion of feature weight in classification, the chi-square test, and fast correlation-based filter (FCBF) feature selection.
In a possible embodiment, the device further comprises:
and the detection module is used for respectively carrying out abnormal value detection on the statistical indexes of the characteristic variables to obtain abnormal data.
In a possible embodiment, the feature analysis module 33 comprises:
the combined analysis unit is used for performing combined analysis on the data indexes of every two characteristic variables in the plurality of characteristic variables to obtain a target combined characteristic variable;
and the determining unit is used for determining the characteristic variable related to the target variable from the target combination characteristic variables.
In a possible embodiment, the device further comprises:
and the calculation module is used for calculating the linear correlation among the characteristic variables according to the data indexes of the characteristic variables to obtain the target characteristic variables with the linear correlation.
In a possible embodiment, the device further comprises:
and the dimensionality reduction module, configured to perform dimensionality reduction analysis on the plurality of characteristic variables using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).
The data processing apparatus provided in the embodiment of the present invention may be specific hardware on a device, or software or firmware installed on a device. The apparatus has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the apparatus embodiments do not mention something, reference may be made to the corresponding content in the method embodiments.
Referring to fig. 4, an embodiment of the present invention further provides an electronic device 400, including: a processor 401, a memory 402, a bus 403 and a communication interface 404, wherein the processor 401, the communication interface 404 and the memory 402 are connected through the bus 403; the memory 402 is used to store programs; the processor 401 is configured to call a program stored in the memory 402 through the bus 403 to execute the data processing method of the above-described embodiment.
The memory 402 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 404 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like.
Bus 403 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 4, but that does not indicate only one bus or one type of bus.
The memory 402 is used to store a program, and the processor 401 executes the program after receiving an execution instruction. The method performed by the apparatus defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to, or implemented by, the processor 401.
The processor 401 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 401 or by instructions in the form of software. The processor 401 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 402, and the processor 401 reads the information in the memory 402 and completes the steps of the above method in combination with its hardware.
Embodiments of the present invention also provide a machine-readable storage medium storing machine-executable instructions, which when invoked and executed by a processor, cause the processor to implement the data processing method as above.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of data processing, the method comprising:
acquiring a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables;
respectively counting data corresponding to the characteristic variables in the target data sample to obtain a data index of each characteristic variable in the characteristic variables;
and analyzing the data indexes of the characteristic variables to determine the characteristic variables related to the target variable in the characteristic variables.
2. The method according to claim 1, wherein the characteristic variables include numerical variables and classification variables, and the step of analyzing the data indexes of the plurality of characteristic variables to determine the characteristic variable related to the target variable among the plurality of characteristic variables comprises:
analyzing the data indexes of the numerical variables, and determining the relation between a single numerical variable and the target variable;
and analyzing the data indexes of the classification variables, and determining the relation between a single classification variable and the target variable.
3. The method of claim 1, wherein the step of analyzing the data indicators of the plurality of characteristic variables to determine the characteristic variable of the plurality of characteristic variables related to the target variable comprises:
determining an important characteristic variable in the characteristic variables according to a data index of the characteristic variables on the basis of an importance evaluation index, and taking the important characteristic variable as a characteristic variable related to the target variable;
wherein the importance evaluation index comprises one or more of: information gain, information gain ratio, Gini impurity, the proportion of feature weight in classification, the chi-square test, and fast correlation-based filter (FCBF) feature selection.
4. The method according to claim 1 or 2, wherein the step of analyzing the data indexes of the plurality of characteristic variables to determine the characteristic variable related to the target variable in the plurality of characteristic variables comprises:
performing combined analysis on data indexes of every two characteristic variables in the plurality of characteristic variables to obtain target combined characteristic variables;
and determining a combined feature variable related to the target variable from the target combined feature variables.
5. The method of claim 1, further comprising:
and respectively carrying out abnormal value detection on the statistical indexes of the characteristic variables to obtain abnormal data.
6. The method of claim 1, further comprising:
and calculating linear correlation relations among the characteristic variables according to the data indexes of the characteristic variables to obtain the target characteristic variables with the linear correlation relations.
7. The method of claim 1, further comprising:
and carrying out dimensionality reduction analysis on the characteristic variables by adopting Principal Component Analysis (PCA) and t-distribution random neighborhood embedding (t-SNE).
8. A data processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a target data sample for training a prediction model, wherein the target data sample comprises a target variable and a plurality of characteristic variables;
a statistical module, configured to respectively perform statistics on the data corresponding to the plurality of characteristic variables in the target data sample, to obtain a data index of each of the plurality of characteristic variables;
and a characteristic analysis module, configured to analyze the data indexes of the plurality of characteristic variables and determine the characteristic variable related to the target variable among the plurality of characteristic variables.
9. An electronic device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor to perform the method of any of claims 1-7.
10. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any of claims 1-7.
CN202010637851.2A 2020-07-01 2020-07-01 Data processing method and device Pending CN111783999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010637851.2A CN111783999A (en) 2020-07-01 2020-07-01 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010637851.2A CN111783999A (en) 2020-07-01 2020-07-01 Data processing method and device

Publications (1)

Publication Number Publication Date
CN111783999A true CN111783999A (en) 2020-10-16

Family

ID=72758757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010637851.2A Pending CN111783999A (en) 2020-07-01 2020-07-01 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111783999A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112511372A (en) * 2020-11-06 2021-03-16 新华三技术有限公司 Anomaly detection method, device and equipment
CN112613755A (en) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Method and device for evaluating enterprise risk by using confidence coefficient and electronic equipment
CN112614203A (en) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Correlation matrix visualization method and device, electronic equipment and storage medium
CN112613983A (en) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Feature screening method and device in machine modeling process and electronic equipment
CN112677438A (en) * 2020-12-08 2021-04-20 上海微亿智造科技有限公司 Segmented construction and screening method and system based on injection molding and machine debugging data characteristics
CN112749623A (en) * 2020-12-08 2021-05-04 上海微亿智造科技有限公司 Processing and feature extraction method and system for high-frequency sensor data of injection molding process
CN112767103A (en) * 2020-12-30 2021-05-07 北京知因智慧科技有限公司 Financial data analysis method and device and electronic equipment
CN113782121A (en) * 2021-08-06 2021-12-10 中国中医科学院中医药信息研究所 Random grouping method, device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106408325A (en) * 2016-08-29 2017-02-15 深圳市爱贝信息技术有限公司 User consumption behavior prediction analysis method based on user payment information and system
CN110570030A (en) * 2019-08-22 2019-12-13 国网山东省电力公司经济技术研究院 Wind power cluster power interval prediction method and system based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
明日科技等: "《Python数据分析从入门到实践》", 30 June 2020, 吉林大学出版社, pages: 183 - 185 *
王宇韬等: "《Python大数据分析与机器学习商业案例实战》", 31 May 2020, 机械工业出版社, pages: 352 - 353 *
纪宇楠: "基于随机森林构建滤泡型甲状腺癌远处转移预测模型", 《中国优秀硕士学位论文全文数据库(医药卫生科技辑)》, no. 1, 15 January 2019 (2019-01-15), pages 3 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112511372B (en) * 2020-11-06 2022-03-01 新华三技术有限公司 Anomaly detection method, device and equipment
CN112511372A (en) * 2020-11-06 2021-03-16 新华三技术有限公司 Anomaly detection method, device and equipment
CN112677438B (en) * 2020-12-08 2021-11-16 上海微亿智造科技有限公司 Segmented construction and screening method and system based on injection molding and machine debugging data characteristics
CN112749623B (en) * 2020-12-08 2023-04-07 上海微亿智造科技有限公司 Processing and feature extraction method and system for high-frequency sensor data of injection molding process
CN112677438A (en) * 2020-12-08 2021-04-20 上海微亿智造科技有限公司 Segmented construction and screening method and system based on injection molding and machine debugging data characteristics
CN112749623A (en) * 2020-12-08 2021-05-04 上海微亿智造科技有限公司 Processing and feature extraction method and system for high-frequency sensor data of injection molding process
CN112613983A (en) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Feature screening method and device in machine modeling process and electronic equipment
CN112614203A (en) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Correlation matrix visualization method and device, electronic equipment and storage medium
CN112613755A (en) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Method and device for evaluating enterprise risk by using confidence coefficient and electronic equipment
CN112614203B (en) * 2020-12-25 2023-07-04 北京知因智慧科技有限公司 Correlation matrix visualization method and device, electronic equipment and storage medium
CN112613983B (en) * 2020-12-25 2023-11-21 北京知因智慧科技有限公司 Feature screening method and device in machine modeling process and electronic equipment
CN112767103A (en) * 2020-12-30 2021-05-07 北京知因智慧科技有限公司 Financial data analysis method and device and electronic equipment
CN113782121A (en) * 2021-08-06 2021-12-10 中国中医科学院中医药信息研究所 Random grouping method, device, computer equipment and storage medium
CN113782121B (en) * 2021-08-06 2024-03-19 中国中医科学院中医药信息研究所 Random grouping method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111783999A (en) Data processing method and device
WO2019214248A1 (en) Risk assessment method and apparatus, terminal device, and storage medium
CN112258093A (en) Risk level data processing method and device, storage medium and electronic equipment
US20220147023A1 (en) Method and device for identifying industry classification of enterprise and particular pollutants of enterprise
US20100274833A1 (en) Monitoring device and a server
CN110544155A (en) User credit score acquisition method, acquisition device, server and storage medium
CN110705718A (en) Model interpretation method and device based on cooperative game and electronic equipment
CN111080117A (en) Method and device for constructing equipment risk label, electronic equipment and storage medium
CN103544299B (en) A kind of construction method of business intelligence cloud computing system
CN113468034A (en) Data quality evaluation method and device, storage medium and electronic equipment
CN112016618A (en) Measurement method for generalization capability of image semantic segmentation model
CN110689230B (en) Regional poverty degree determining method, electronic device and storage medium
Chang et al. Ranking journal quality by harmonic mean of ranks: An application to ISI Statistics & Probability
CN117272145A (en) Health state evaluation method and device of switch machine and electronic equipment
CN117035563B (en) Product quality safety risk monitoring method, device, monitoring system and medium
WO2019218482A1 (en) Big data-based population screening method and apparatus, terminal device and readable storage medium
CN110059749B (en) Method and device for screening important features and electronic equipment
CN112632000A (en) Log file clustering method and device, electronic equipment and readable storage medium
CN116705310A (en) Data set construction method, device, equipment and medium for perioperative risk assessment
US20210209503A1 (en) Method and electronic device for selecting influence indicators by using automatic mechanism
CN116778210A (en) Teaching image evaluation system and teaching image evaluation method
CN112749742A (en) Source risk score quantification method and device and electronic equipment
CN111061942A (en) Search ranking monitoring method and system
CN111259209B (en) User intention prediction method based on artificial intelligence, electronic device and storage medium
CN114765624B (en) Information recommendation method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination