CN112070239B

CN112070239B - Analysis method, system, medium, and device based on user data modeling

Info

Publication number: CN112070239B
Application number: CN202011250084.6A
Authority: CN
Inventors: 薛颜波; 蔡俊杰
Original assignee: Shanghai Synyi Medical Technology Co ltd
Current assignee: Shanghai Synyi Medical Technology Co ltd
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2021-07-09
Anticipated expiration: 2040-11-11
Also published as: CN112070239A

Abstract

The invention provides an analysis method, system, medium and device based on user data modeling. The analysis method comprises the following steps: performing feature analysis on the user data to generate a feature analysis result; performing a stability test on the feature analysis result over time to detect abnormal data, and judging from the abnormal data detection result whether the feature analysis result is reliable; after preprocessing, screening the features required for modeling the user data in combination with the preprocessed data; modeling the user data with the screened features to generate a user data model; and performing model analysis on the user data model to obtain a reliability analysis result of the user data model. During modeling, the invention provides the calculation and analysis steps that a reliable machine learning model requires, so that the process is fully automated, and it gives business personnel enough analysis information to diagnose whether the model training process has problems.

Description

Analysis method, system, medium, and device based on user data modeling

Technical Field

The invention belongs to the technical field of machine learning model data analysis, relates to an analysis method for user data modeling, and particularly relates to an analysis method, system, medium and device based on user data modeling.

Background

At present, some automatic machine learning experiment systems are available. With them, the whole modeling process can be completed merely by configuring parameters, without much manual involvement or an understanding of the underlying principles of machine learning, which lowers the professional threshold of machine learning modeling. For example, the machine learning automation service provided by Alibaba Cloud automates steps such as data preprocessing, algorithm modeling (including automatic parameter tuning), and model evaluation.

However, current AutoML (automatic machine learning) tools still have many shortcomings in the steps needed to produce a reliable machine learning model, and the existing calculation and analysis tools are not sufficient to produce one. Common problems with existing AutoML systems are as follows. (1) The process is not automated enough. For example, only model training is automated, while feature screening for modeling is not. (2) Business personnel cannot obtain enough information to identify whether the modeling dataset is abnormal or to judge whether the basis on which the model makes predictions is reasonable, so the model produced is unreliable. (3) A step that simulates actual operation is missing, so performance in the real online environment is difficult to evaluate. Sometimes the way training samples are constructed is inconsistent with the way samples are constructed when the model is applied online. It is therefore important to simulate online use after model training is complete, collect model performance data, and evaluate model performance. Otherwise, it easily happens that performance is evaluated as good after training but drops sharply once the model goes online. It follows that AutoML systems based on the prior art are not sufficient to support the automatic acquisition of reliable models.

Therefore, how to provide an analysis method, system, medium and device based on user data modeling to solve the defects that the prior art cannot automatically generate a reliable machine learning model and provide sufficient analysis information for business personnel, and the like, is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, it is an object of the present invention to provide an analysis method, system, medium and apparatus based on user data modeling, which is used to solve the problem that the prior art cannot automatically generate a reliable machine learning model and provide sufficient analysis information to business personnel.

To achieve the above and other related objects, one aspect of the present invention provides an analysis method based on user data modeling, including: performing feature analysis on the user data to generate a feature analysis result; performing a stability test on the feature analysis result over time to detect abnormal data, and judging from the abnormal data detection result whether the feature analysis result is reliable; if so, executing the next step, and if not, returning to the previous step; after preprocessing the feature analysis result, screening the features required for modeling the user data in combination with the preprocessed data; modeling the user data with the screened features to generate a user data model; and performing model analysis on the user data model to obtain a reliability analysis result of the user data model, wherein the reliability analysis result is used at least to present to business personnel the rationality of the judgment basis of the user data model and the degree to which each feature used in user data modeling influences the prediction result.

In an embodiment of the present invention, the step of performing feature analysis on the user data to generate a feature analysis result includes: performing occurrence-frequency indicator analysis on the user data to generate an occurrence-frequency analysis result; performing numerical indicator analysis on the user data to generate a numerical analysis result; and performing logical (Boolean) indicator analysis on the user data to generate a logical analysis result.

In an embodiment of the present invention, the step of preprocessing the feature analysis result includes: performing scaling mapping and principal component analysis on the numerical feature data in the feature analysis result; and performing one-hot encoding and independent component analysis on the categorical data in the feature analysis result.

In an embodiment of the present invention, the step of screening, after preprocessing the feature analysis result, the features required for modeling the user data in combination with the preprocessed data includes: screening the features required for modeling the user data by the technical indicators of missing rate, information value (IV), and weight-of-evidence (WOE) correlation, so as to automatically select the feature group that helps the user data modeling most.

In an embodiment of the present invention, the reliability analysis result includes a model effect analysis result, a model-entry feature analysis result, and a model interpretability analysis result. Model effect analysis is performed on the user data model to generate the model effect analysis result; the model effect analysis includes calculating the accuracy, precision, recall, combined precision-recall value, and area under the curve of the user data model.

In an embodiment of the present invention, model-entry feature analysis is performed on the user data model to generate the model-entry feature analysis result. The model-entry feature analysis provides an analysis of the relationship between the features entering the user data model and the prediction target, and identifies whether this relationship is stable over time. The model-entry features are the features that, after feature screening, are finally identified as important to the user data model and are adopted by it.

In an embodiment of the present invention, model interpretability analysis is performed on the user data model to generate a model interpretability analysis result; the model interpretability analysis includes analyzing the prediction results for representative samples in the user data model and the judgment basis of the user data model.

Another aspect of the present invention provides an analysis system based on user data modeling, including: a data analysis module for performing feature analysis on the user data to generate a feature analysis result; an anomaly detection module for performing a stability test on the feature analysis result over time to detect abnormal data, and for judging from the abnormal data detection result whether the feature analysis result is reliable, wherein if so, a feature screening module is called, and if not, the data analysis module is called; the feature screening module, for screening, after the feature analysis result is preprocessed, the features required for modeling the user data in combination with the preprocessed data; a model generation module for modeling the user data with the screened features to generate a user data model; and a reliability analysis module for performing model analysis on the user data model to obtain a reliability analysis result of the user data model, the reliability analysis result being used at least to present to business personnel the rationality of the judgment basis of the user data model and the degree to which each feature used in user data modeling influences the prediction result.

A further aspect of the invention provides a medium on which a computer program is stored which, when being executed by a processor, carries out the method of analysis based on user data modeling.

A final aspect of the invention provides an apparatus comprising: a processor and a memory; the memory is configured to store a computer program and the processor is configured to execute the computer program stored by the memory to cause the apparatus to perform the analysis method based on user data modeling.

As described above, the analysis method, system, medium, and apparatus based on user data modeling according to the present invention have the following advantageous effects:

For the machine learning application process, the invention gives business personnel sufficient information to judge whether the data used for model training are abnormal and whether the basis on which the model makes its judgments is reasonable, and it integrates targeted analysis and detection techniques into the automated machine learning process. It helps business personnel understand the reliability of the model and avoids the one-sidedness of measuring a model only by accuracy-type indicators such as those of model effect analysis. This is of great significance in medicine, finance and other scenarios with high requirements on model reliability, where it can prevent the negative consequences of mistakenly applying an unreliable model.

Drawings

FIG. 1 is a schematic flow chart diagram illustrating an analysis method based on user data modeling in accordance with an embodiment of the present invention.

FIG. 2 is a flow chart illustrating a feature analysis of an embodiment of the analysis method based on user data modeling.

FIG. 3 is a schematic diagram illustrating a missing rate analysis of an embodiment of the analysis method based on user data modeling.

FIG. 4 is a schematic diagram of a reliability analysis of the analysis method based on user data modeling according to an embodiment of the invention.

FIG. 5 is a flow chart of an automated machine learning process of an embodiment of the analysis method based on user data modeling of the present invention.

FIG. 6 is a diagram illustrating the model interpretability analysis result of the analysis method according to the present invention.

FIG. 7 is a schematic diagram of an embodiment of an analysis system based on user data modeling.

FIG. 8 is a schematic diagram illustrating the structural connection of an analysis device based on user data modeling according to an embodiment of the present invention.

Description of the element reference numerals

7-analysis system based on user data modeling;

71-data analysis module;

72-anomaly detection module;

73-feature screening module;

74-model generation module;

75-reliability analysis module;

8-device;

81-processor;

82-memory;

83-communication interface;

84-system bus.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

The analysis method based on user data modeling provides reliable calculation and analysis links required by a machine learning model in a modeling process so as to realize complete automation, provides sufficient analysis information for business personnel, and helps the business personnel to diagnose whether a problem exists in a model training process.

The principle and implementation of the analysis method, system, medium and apparatus based on user data modeling according to the present embodiment will be described in detail below with reference to fig. 1 to 8, so that those skilled in the art can understand the analysis method, system, medium and apparatus based on user data modeling without creative work.

Please refer to fig. 1, which is a schematic flow chart of an analysis method based on user data modeling according to an embodiment of the present invention. The analysis method is applied to scenarios in medicine or finance in which user-related models are generated. As shown in fig. 1, the analysis method based on user data modeling specifically includes the following steps.

S11, performing feature analysis on the user data to generate a feature analysis result.

Specifically, this step establishes the basic condition of each feature in the modeling dataset, so that business personnel can judge whether potential data quality problems exist. For example, the modeling dataset may be a dataset of patients with a certain disease in the medical field, or a dataset of business-related users in the financial field.

Please refer to fig. 2, which is a flowchart illustrating a feature analysis of the analysis method based on user data modeling according to an embodiment of the present invention. As shown in fig. 2, S11 includes the following steps.

S111, performing occurrence-frequency indicator analysis on the user data to generate an occurrence-frequency analysis result.

Specifically, the occurrence-frequency indicator analysis computes common indicators of the user data, such as the missing rate (the proportion of samples in which the feature is missing), the number of distinct values after deduplication, the most frequent value and its share, and the second most frequent value and its share.

S112, performing numerical indicator analysis on the user data to generate a numerical analysis result.

Specifically, the numerical indicator analysis computes numerical indicators of the user data, such as the mean, median, minimum, maximum, 25% quantile, 75% quantile, skewness, and kurtosis.

S113, performing logical indicator analysis on the user data to generate a logical analysis result.

Specifically, the logical indicator analysis computes logical indicators of the user data, such as the true-value rate (number of true values / total number of samples), the false-value rate (number of false values / total number of samples), the non-empty true-value rate (number of true values / number of non-empty samples), and the non-empty false-value rate (number of false values / number of non-empty samples).
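The three kinds of indicators in S111 to S113 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the embodiment: the function names are hypothetical, `None` is assumed to mark a missing value, the value shares are taken over all samples, and skewness and kurtosis are computed from population moments.

```python
from collections import Counter
import statistics


def frequency_profile(values):
    """Occurrence-frequency indicators for one feature (None = missing)."""
    total = len(values)
    present = [v for v in values if v is not None]
    profile = {
        "missing_rate": (total - len(present)) / total,
        "distinct_count": len(set(present)),
    }
    top_two = Counter(present).most_common(2)
    for label, (value, count) in zip(("top", "second"), top_two):
        profile[f"{label}_value"] = value
        profile[f"{label}_share"] = count / total  # share over all samples
    return profile


def numeric_profile(values):
    """Numerical indicators for one feature; needs at least two non-missing values."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    q25, _, q75 = statistics.quantiles(present, n=4)
    # Population central moments, used for skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in present) / len(present)
    m3 = sum((v - mean) ** 3 for v in present) / len(present)
    m4 = sum((v - mean) ** 4 for v in present) / len(present)
    return {
        "mean": mean,
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "q25": q25,
        "q75": q75,
        "skewness": m3 / m2 ** 1.5 if m2 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 else 0.0,
    }


def logical_profile(values):
    """Logical (Boolean) indicators for one feature (None = missing)."""
    total = len(values)
    present = [v for v in values if v is not None]
    true_count = sum(1 for v in present if v)
    false_count = len(present) - true_count
    return {
        "true_rate": true_count / total,
        "false_rate": false_count / total,
        "non_empty_true_rate": true_count / len(present),
        "non_empty_false_rate": false_count / len(present),
    }
```

For example, `logical_profile([True, False, True, None])` gives a true-value rate of 0.5 over all samples and a non-empty true-value rate of 2/3.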

S12, performing a stability test on the feature analysis result over time to detect abnormal data, and judging from the abnormal data detection result whether the feature analysis result is reliable; if so, executing the next step, and if not, returning to the previous step.

Specifically, abnormal data detection includes, but is not limited to, detection of single-feature (numerical) outliers and a single-feature stability test over time.

Detection of single-feature (numerical) outliers means identifying abnormal values of a single feature using the median plus or minus 3 standard deviations: values more than 3 standard deviations above or below the median are outliers.
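The outlier rule can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, `None` is assumed to mark a missing value, and the population standard deviation is assumed because the text does not say which variant is used.

```python
import statistics


def find_outliers(values, k=3.0):
    """Return values lying more than k standard deviations from the median."""
    present = [v for v in values if v is not None]
    center = statistics.median(present)
    spread = statistics.pstdev(present)  # assumption: population standard deviation
    low, high = center - k * spread, center + k * spread
    return [v for v in present if v < low or v > high]
```

With 20 readings of 10 and one reading of 1000, only the reading of 1000 falls outside the band around the median and is reported.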

The single-feature stability test over time is based on the fact that statistical indicators such as the missing rate and the median should remain stable over time as long as the business scenario does not change abruptly. If such an indicator changes abruptly in some time period, this usually indicates an abnormality in the feature data.

Please refer to fig. 3, which is a schematic diagram of a missing rate analysis of the analysis method based on user data modeling according to an embodiment of the present invention. As shown in fig. 3, the missing rate rises suddenly in months 7 and 8, which probably indicates an abnormality in the data system; modeling should proceed only after this has been investigated and analyzed. The degree of abnormality can be measured with a statistical hypothesis test.
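One possible hypothesis test for the missing-rate case is a two-sided two-proportion z-test that compares each period with all other periods pooled. The text does not name a specific test, so the choice of test, the function names, the significance level `alpha`, and the monthly counts in the usage note are all assumptions made for illustration.

```python
import math


def missing_rate_shift(missing_a, total_a, missing_b, total_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a = missing_a / total_a
    p_b = missing_b / total_b
    pooled = (missing_a + missing_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value of a standard normal statistic.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def flag_unstable_periods(period_counts, alpha=0.01):
    """period_counts maps period -> (missing, total); returns the flagged periods."""
    flagged = []
    for period, (missing, total) in period_counts.items():
        rest_missing = sum(m for p, (m, _) in period_counts.items() if p != period)
        rest_total = sum(t for p, (_, t) in period_counts.items() if p != period)
        _, p_value = missing_rate_shift(missing, total, rest_missing, rest_total)
        if p_value < alpha:
            flagged.append(period)
    return flagged
```

With twelve months of 100 samples each, 5 missing values in every month except 40 in months 7 and 8, only months 7 and 8 are flagged at `alpha=0.01`.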

Through this step, business personnel obtain enough information to identify whether the modeling dataset is abnormal. For example, in a disease prediction scenario, if the features for routine examination items at a certain hospital show a large number of missing values in the dataset, the construction of the dataset may be problematic, and modeling should be carried out only after the data have been repaired. Business personnel can judge whether the modeling feature data are problematic only after seeing a full analysis of the features concerned.

S13, preprocessing the feature analysis result, and screening the features required for modeling the user data in combination with the preprocessed data.

In this embodiment, the feature analysis result is preprocessed as follows: on the one hand, scaling mapping and principal component analysis are performed on the numerical feature data in the feature analysis result; on the other hand, one-hot encoding and independent component analysis are performed on the categorical data in the feature analysis result. Scaling mapping is needed because large differences between the value ranges of different features interfere with the gradient calculation during model training and make training difficult, so numerical features generally need to be scaled. For example, all values are linearly mapped so that the value range is unified to -1 to 1.
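The scaling and encoding parts of the preprocessing can be illustrated with a short sketch. The function names are hypothetical, a constant column is mapped to 0.0 by assumption, and principal component analysis and independent component analysis are left out for brevity.

```python
def scale_to_unit_range(values):
    """Linearly map numerical values onto the range [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant feature maps to 0
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


def one_hot(values):
    """One-hot encode categorical values; returns (categories, encoded rows)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows
```

For example, `scale_to_unit_range([0, 5, 10])` returns `[-1.0, 0.0, 1.0]`, and `one_hot(["b", "a", "b"])` returns the categories `["a", "b"]` with rows `[[0, 1], [1, 0], [0, 1]]`.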

In this embodiment, the features required for modeling the user data are screened by the technical indicators of missing rate, IV (Information Value), and WOE (Weight of Evidence) correlation, so as to automatically select the feature group that helps the user data modeling most. It should be noted that screening by the combination of missing rate, information value and WOE correlation is only one specific embodiment of the present invention. Feature screening by any one or more of these indicators, or by any other method that yields a feature subset that is as small as possible, does not significantly reduce classification accuracy, does not affect the class distribution, and is stable and adaptable, also falls within the protection scope of the present invention.

Specifically, the missing rate is calculated as: missing rate = number of samples with a missing value / total number of samples in the user data.

The IV value is calculated as follows. If the variable is numerical, a single-feature decision tree algorithm is used to group it automatically, converting it into a discrete variable. For discrete variables, the IV value is calculated as:

$$IV = \sum_{i} IV_i \tag{1}$$

$$IV_i = \left(p_{y_i} - p_{n_i}\right) \cdot WOE_i \tag{2}$$

The WOE used for the WOE correlation is calculated as:

$$WOE_i = \ln\frac{p_{y_i}}{p_{n_i}} = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T} \tag{3}$$

The variables in formulas (1) to (3) have the following meanings: n is the number of groups of the variable; i takes a value between 0 and n and denotes the i-th group;

$p_{y_i}$ is the proportion of responsive users in the group to all responsive users in the sample;

$p_{n_i}$ is the proportion of non-responsive users in the group to all non-responsive users in the sample;

$\#y_i$ is the number of responsive users in the group;

$\#y_T$ is the number of all responsive users in the sample;

$\#n_i$ is the number of non-responsive users in the group;

$\#n_T$ is the number of all non-responsive users in the sample.

The Pearson correlation is then calculated pairwise on the $WOE_i$ values, which gives the WOE correlation.
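The calculation can be sketched as follows, using the standard definitions WOE_i = ln(py_i / pn_i) and IV = sum over groups of (py_i - pn_i) * WOE_i. The function names are hypothetical, and every group is assumed to contain at least one responsive and one non-responsive user so that the logarithm is defined. Treating the WOE correlation of two features as the Pearson correlation of their WOE-encoded columns (each sample's value replaced by the WOE of its group) is an interpretation made here, not something the text states explicitly.

```python
import math


def woe_iv(groups):
    """groups: list of (responsive, non_responsive) counts per group.

    Returns (list of WOE_i, IV). Assumes no count is zero.
    """
    y_total = sum(y for y, _ in groups)
    n_total = sum(n for _, n in groups)
    woes, iv = [], 0.0
    for y, n in groups:
        py = y / y_total  # share of all responsive users that fall in this group
        pn = n / n_total  # share of all non-responsive users that fall in this group
        woe = math.log(py / pn)
        woes.append(woe)
        iv += (py - pn) * woe
    return woes, iv


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

For two groups with counts `(10, 40)` and `(40, 10)`, the two WOE values are `-ln 4` and `ln 4`, and the IV is `1.2 * ln 4`.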

After the missing rate, IV value and WOE correlation are obtained, the feature group meeting the following conditions is screened out: (1) missing rate < T1; (2) IV value > T2; (3) among the remaining features, if the WOE correlation between features V1 and V2 is > T3, the one of V1 and V2 with the lower IV is excluded.

In one embodiment, the conditions are applied as follows. (1) With T1 = 15%, a missing rate of 11% for feature 1 and a missing rate of 25% for feature 2, feature 1 is retained and feature 2 is removed. (2) With T2 = 0.7, an IV value of 0.2 for feature 3 and an IV value of 0.9 for feature 4, feature 4 is retained and feature 3 is removed. (3) With T3 = 1.2 and a WOE correlation value of 1.34 between feature 3 and feature 4, the two features are highly correlated, so only one of them is retained; since the IV value of feature 3 is 0.2 and that of feature 4 is 0.9, feature 4, which has the higher IV value, is retained for subsequent data processing and analysis.
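The three screening conditions can be sketched as a single function. The function name, the input layout, and the greedy pass in descending IV order (one way of always dropping the lower-IV feature of a correlated pair) are assumptions made for illustration.

```python
def screen_features(stats, woe_corr, t1, t2, t3):
    """Apply the missing-rate, IV and WOE-correlation conditions.

    stats:    {feature: {"missing_rate": float, "iv": float}}
    woe_corr: {(feature_a, feature_b): correlation}; unlisted pairs count as 0.
    """
    def corr(a, b):
        return woe_corr.get((a, b), woe_corr.get((b, a), 0.0))

    # Conditions (1) and (2): missing rate below T1 and IV above T2.
    candidates = [
        name for name, s in stats.items()
        if s["missing_rate"] < t1 and s["iv"] > t2
    ]
    # Condition (3): visit features from highest to lowest IV, so that in any
    # pair whose correlation exceeds T3 the lower-IV feature is the one dropped.
    candidates.sort(key=lambda name: stats[name]["iv"], reverse=True)
    selected = []
    for name in candidates:
        if all(corr(name, kept) <= t3 for kept in selected):
            selected.append(name)
    return selected
```

In a hypothetical run with `t1=0.15`, `t2=0.7` and `t3=0.8`, a feature with a 25% missing rate and a feature with an IV of 0.2 are removed by the first two conditions, and of two surviving features with a WOE correlation of 0.9 only the one with the higher IV is kept.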

For the three preset thresholds T1, T2 and T3, the system supports automatic optimization algorithms, such as genetic algorithms and simulated annealing, to select suitable values or value combinations automatically. During automatic selection, all value combinations of T1, T2 and T3 can be enumerated and the model effect under each combination estimated; alternatively, the optimization and search direction for T1, T2 and T3 can be determined from the combination used in the previous round of feature screening and the model effect it produced.
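The exhaustive variant of this search can be sketched as a grid search. The function name and the `evaluate` callable are hypothetical; `evaluate` is assumed to screen the features with the given thresholds, train a model, and return a score where higher is better. Genetic algorithms and simulated annealing are not shown.

```python
from itertools import product


def search_thresholds(t1_grid, t2_grid, t3_grid, evaluate):
    """Try every (T1, T2, T3) combination and return (best_score, best_thresholds)."""
    best_score, best_thresholds = None, None
    for t1, t2, t3 in product(t1_grid, t2_grid, t3_grid):
        score = evaluate(t1, t2, t3)  # hypothetical: screen, train, score
        if best_score is None or score > best_score:
            best_score, best_thresholds = score, (t1, t2, t3)
    return best_score, best_thresholds
```

The cost grows with the product of the grid sizes, which is one reason a guided search that reuses the results of the previous round can be preferable when the grids are fine.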

S14, modeling the user data with the screened features to generate a user data model.

Specifically, experiments with multiple machine learning algorithms are completed automatically in batches. The machine learning algorithms include, but are not limited to, XGBoost (a distributed gradient boosting library), random forest, deep learning, logistic regression, decision tree, and support vector machine (SVM).

Further, the system automatically configures the algorithms to be run from among the multiple machine learning algorithms, runs the modeling programs one by one, and performs the machine learning modeling calculations.
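The batch experiment loop can be sketched with a registry of training functions that are run one by one and ranked by validation accuracy. Everything here is a hypothetical illustration: the two toy trainers (a majority-class baseline and a one-feature threshold rule) merely stand in for the algorithms listed above, and rows are assumed to be `(feature_value, label)` pairs with labels 0 or 1.

```python
def train_majority(rows):
    """Baseline: always predict the most frequent training label."""
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def train_stump(rows):
    """One-feature threshold rule with the best training accuracy."""
    best_cut, best_acc = None, -1.0
    for cut in sorted({x for x, _ in rows}):
        acc = sum((x >= cut) == bool(y) for x, y in rows) / len(rows)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return lambda x: int(x >= best_cut)


def run_experiments(trainers, train_rows, valid_rows):
    """Train every registered algorithm and rank them by validation accuracy."""
    results = {}
    for name, trainer in trainers.items():
        predict = trainer(train_rows)
        correct = sum(predict(x) == y for x, y in valid_rows)
        results[name] = correct / len(valid_rows)
    return sorted(results.items(), key=lambda item: item[1], reverse=True)
```

Adding another algorithm to the batch then only requires registering one more entry in the `trainers` dictionary.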

This step overcomes the prior-art defect that a step simulating actual operation is missing, which makes performance in the real online environment difficult to evaluate. Sometimes the way training samples are constructed is inconsistent with the way samples are constructed when the model is applied online. It is therefore important to simulate online use after model training is complete, collect model performance data, and evaluate model performance. Otherwise, it easily happens that performance is evaluated as good after training but drops sharply once the model goes online.

S15, performing model analysis on the user data model to obtain a reliability analysis result of the user data model, wherein the reliability analysis result is used at least to present to business personnel the rationality of the judgment basis of the user data model and the degree to which each feature used in user data modeling influences the prediction result.

Please refer to fig. 4, which is a schematic diagram of a reliability analysis of the analysis method based on user data modeling according to an embodiment of the present invention. As shown in fig. 4, the reliability analysis result includes a model effect analysis result, a model-entry feature analysis result, and a model interpretability analysis result.

(1) Performing model effect analysis on the user data model to generate a model effect analysis result.

The model effect analysis includes calculating the accuracy, precision, recall, combined precision-recall value (F1 value), area under the ROC curve, and area under the PR curve of the user data model. Calculating these indicators per group is also supported.
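Most of these indicators can be computed directly from the labels, the predicted labels and the predicted scores. The sketch below is illustrative only: the function names are hypothetical, labels are assumed to be 0 or 1, and the area under the ROC curve is computed through its pairwise-ranking interpretation (the probability that a random positive sample scores higher than a random negative one, with ties counted as one half). The area under the PR curve is omitted.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve; needs at least one positive and one negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The per-group calculation mentioned above amounts to calling the same functions on each subgroup of samples separately.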

(2) Performing model-entry feature analysis on the user data model to generate a model-entry feature analysis result.

The model-entry feature analysis provides an analysis of the relationship between the features entering the user data model and the prediction target, and identifies whether this relationship is stable over time. The model-entry features are the features that, after feature screening, are finally identified as important to the user data model and are adopted by it.

Specifically, for each feature and its relationship to the prediction target, the importance of the feature to the model can be measured by the following indicators: (1) the calculated IV (Information Value); (2) the feature importance calculated by models such as XGBoost and random forest; (3) the AUC (area under the ROC curve, where ROC stands for receiver operating characteristic) of a single-variable decision tree model built on the feature.

Through this step, business personnel obtain enough information to judge whether the basis on which the model makes predictions is reasonable, and thus whether the output model is reliable. For example, in a disease prediction model, the important features identified by the model are heart-rate-based features. Further analysis, however, finds that the model actually uses only whether the heart rate features are missing, not the values of the heart rate features themselves. This suggests that what the model captures may be the hospital's medical actions rather than the relationship between objective physiological indicators and the prediction target, which may deviate from the original modeling goal. Business personnel cannot identify such problems unless the relevant analysis information is provided to them.

(3) Performing model interpretability analysis on the user data model to generate a model interpretability analysis result.

The model interpretability analysis includes analyzing the prediction results for representative samples in the user data model and the judgment basis of the user data model.

Specifically, the prediction results and model judgment basis for representative samples are analyzed with techniques such as LIME (Local Interpretable Model-Agnostic Explanations) and SHAP (SHapley Additive exPlanations), so that business personnel can judge whether the model's judgment basis is reasonable. For example, the SHAP values of the SHAP technique are used to explain the feature importance of an XGBoost model and to analyze the model's prediction for a single sample, indicating which features are positive factors for the prediction result and which are negative factors.
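The idea of splitting one prediction into positive and negative per-feature contributions can be illustrated with exact Shapley values computed against a baseline input. This is a hedged sketch, not the SHAP library or its tree-specific algorithm: the function name is hypothetical, `predict` is any callable that takes a list of feature values, "absent" features are assumed to be replaced by their baseline values, and enumerating all feature orderings is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, sample, baseline):
    """Exact Shapley attribution of predict(sample) - predict(baseline).

    Features are switched from their baseline value to their sample value in
    every possible order, and each feature's marginal effect is averaged.
    """
    n = len(sample)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = sample[i]
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [total / len(orders) for total in totals]
```

A positive value marks a feature that pushes this sample's prediction up relative to the baseline and a negative value one that pushes it down; the values always sum to the difference between the two predictions.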

Please refer to fig. 5, which is a flowchart of the automated machine learning process of the analysis method based on user data modeling according to an embodiment of the present invention. As shown in fig. 5, in one embodiment, the analysis performed when modeling a thrombotic disease result prompting model is taken as an example. First, exploratory analysis of the modeling data is performed on the patient cases and test data. Then abnormal data detection is performed on the patient cases and test data; if D-dimer information is missing from a large number of patient records, this suggests that the construction of the dataset may be problematic. If the detected abnormal data are numerical, values outside the preset range are regarded as abnormal. Business personnel then need to use the analysis data to judge whether the abnormal value is caused by the patient's disease or is unreasonable. If it is caused by the disease, the data can be used for subsequent processing; if it is unreasonable, the modeling data must be re-analyzed or re-acquired so that subsequent processing is not affected by unreasonable abnormal data. Next, once abnormal data detection has confirmed that the patient cases and test data are reasonable and can be used normally, the data are preprocessed so that the value ranges of the different features are balanced. Automatic feature screening is then performed on the preprocessed patient data, which include sex, age, living habits, vascular test indicators and so on. For example, patients with thrombotic disease may also have undergone other tests, such as skin allergy tests and vision tests, and these features can be filtered out during automatic feature screening. Finally, batch machine learning modeling experiments are performed on the screened features, and model effect analysis, model interpretability analysis and model-entry feature analysis are performed on the generated models.

Please refer to fig. 6, which is a diagram illustrating the model interpretability analysis result of the analysis method according to an embodiment of the present invention. As shown in FIG. 6, the mass concentration of D-dimer is important for diagnosis, evaluation of therapeutic effect and prognosis of thrombotic diseases.

As a specific embodiment, the model interpretability analysis is implemented using the Local Interpretable Model-agnostic Explanations (LIME) technique. The result prompt model in fig. 6 shows the predicted value of the life analysis; the life analysis value is positive, and for a patient with a larger life analysis value, age and D-dimer information are identified as stronger disease risk factors, while the maximum value of C-reactive protein within one week is considered a weaker risk factor.

Furthermore, an integrated disease prediction model solution can be formed by combining a data set construction method, a data set analysis method, the analysis method based on user data modeling, a method of evaluating and optimizing a trained model by constructing a simulated data set, a unified management method for trained models, and a monitoring method that checks whether the prediction effect and the data source are abnormal once a trained model has gone online and is applied in an actual scenario, so that the production efficiency and quality of disease prediction models are improved. The integrated modeling system based on machine learning allows the parameters and results of every link of a project to be recorded and stored centrally, without omission or loss. This avoids the defects of mutually isolated tools: a lack of unified process management, errors that easily arise when linking steps, and difficulty in completely recording the operating parameters used along the way. This matters especially for medical data: because hospital medical data is sensitive, it cannot be copied out of the hospital for links such as data analysis and modeling. All tasks therefore require an integrated system to be developed in advance and deployed inside the hospital; once it is deployed on a hospital server, the user data modeling and the machine-learning-based integrated modeling system formed from it can be completed automatically from start to finish. This avoids piecing the work together from several existing software tools, which is complex to operate and cannot coordinate the different links of the system.

The protection scope of the analysis method based on user data modeling according to the present invention is not limited to the execution sequence of the steps listed in this embodiment, and all the solutions implemented by adding, subtracting, and replacing the steps in the prior art according to the principles of the present invention are included in the protection scope of the present invention.

The present embodiment provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the analysis method based on user data modeling.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned computer-readable storage media comprise: various computer storage media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The analysis system based on user data modeling provided by the present embodiment will be described in detail with reference to the drawings. It should be noted that the division of the following system into modules is only a logical division; in an actual implementation the modules may be wholly or partially integrated into one physical entity, or may be physically separate. The modules may all be implemented as software called by a processing element, or all be implemented in hardware, or some modules may be implemented as software called by a processing element and the others in hardware. For example, a module may be a separately arranged processing element, or may be integrated into a chip of the system described below. A module may also be stored in the memory of the following system in the form of program code, with a processing element of the system calling and executing the function of that module. The other modules are implemented similarly. All or part of the modules can be integrated together or implemented independently. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, the steps of the above method or the following modules may be completed by integrated logic circuits of hardware in a processor element or by instructions in the form of software.

The following modules may be one or more integrated circuits configured to implement the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), one or more Field Programmable Gate Arrays (FPGAs), and the like. When some of the following modules are implemented in the form of a program code called by a Processing element, the Processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or other processor capable of calling the program code. These modules may be integrated together and implemented in the form of a System-on-a-chip (SOC).

Please refer to fig. 7, which is a schematic structural diagram of an analysis system based on user data modeling according to an embodiment of the present invention. As shown in fig. 7, the analysis system 7 based on user data modeling includes: data analysis module 71, anomaly detection module 72, feature screening module 73, model generation module 74, and reliability analysis module 75.

The data analysis module 71 is configured to perform feature analysis on the user data to generate a feature analysis result.

In this embodiment, the data analysis module 71 is specifically configured to perform appearance frequency index analysis on the user data to generate an appearance frequency analysis result; performing numerical index analysis on the user data to generate a numerical analysis result; and carrying out logic type index analysis on the user data to generate a logic type analysis result.
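As a rough illustration of the three kinds of feature analysis performed by the data analysis module 71, the sketch below computes occurrence counts for a categorical feature, summary statistics for a numerical feature, and a true rate for a logic (boolean) feature. The specific statistics chosen, the function names, and the sample values are assumptions for illustration; this embodiment does not enumerate the indices.

```python
from collections import Counter
from statistics import mean, median


def frequency_analysis(values):
    """Occurrence count of each distinct non-missing value."""
    return Counter(v for v in values if v is not None)


def numeric_analysis(values):
    """Basic summary statistics over the non-missing numeric values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


def logical_analysis(values):
    """Share of True among the non-missing boolean values."""
    present = [v for v in values if v is not None]
    return {"true_rate": sum(present) / len(present)}


# Hypothetical user data columns.
sex = ["F", "M", "M", None, "F", "M"]
age = [63, 71, 58, 49]
smoker = [True, False, True, True]

freq_result = frequency_analysis(sex)
num_result = numeric_analysis(age)
logic_result = logical_analysis(smoker)
print(freq_result, num_result, logic_result)
```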

The anomaly detection module 72 is configured to perform a stability check on the feature analysis result over time to detect abnormal data, and determine whether the feature analysis result is reliable according to the abnormal data detection result.

If so, the feature screening module 73 is called to preprocess the feature analysis result and then screen, in combination with the preprocessed data, the features required for user data modeling; if not, the data analysis module 71 is called.
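One simple way to picture a stability check over time is to compute a feature statistic per time period and flag periods that drift too far from a baseline period. The sketch below uses the per-period mean, the first period as baseline, and a relative tolerance; all three choices, together with the field names and sample values, are assumptions made for illustration, since this embodiment does not fix a particular stability statistic.

```python
from collections import defaultdict
from statistics import mean


def period_means(records, feature):
    """Mean of the feature per time period, ordered by period label."""
    by_period = defaultdict(list)
    for r in records:
        if r.get(feature) is not None:
            by_period[r["period"]].append(r[feature])
    return {p: mean(v) for p, v in sorted(by_period.items())}


def unstable_periods(means, tolerance):
    """Periods whose mean deviates from the first period by more than
    `tolerance` times the baseline value (hypothetical rule)."""
    periods = list(means)
    baseline = means[periods[0]]
    return [
        p
        for p in periods[1:]
        if abs(means[p] - baseline) > tolerance * abs(baseline)
    ]


# Hypothetical records tagged with the month in which they were collected.
records = [
    {"period": "2020-01", "d_dimer": 0.4},
    {"period": "2020-01", "d_dimer": 0.6},
    {"period": "2020-02", "d_dimer": 0.5},
    {"period": "2020-02", "d_dimer": 0.55},
    {"period": "2020-03", "d_dimer": 1.9},
    {"period": "2020-03", "d_dimer": 2.1},
]

means = period_means(records, "d_dimer")
drifted = unstable_periods(means, tolerance=0.5)
reliable = not drifted  # if False, control returns to the data analysis step
print(means, drifted, reliable)
```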

In this embodiment, the feature screening module 73 is specifically configured to screen features required when the user data is modeled according to technical indicators of a missing rate, an information value, and an evidence weight correlation, so as to automatically select a feature group set that is most helpful for modeling the user data.
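The screening by missing rate, information value and evidence weight correlation can be sketched as three successive filters. The sketch below is a minimal illustration under several assumptions that are not stated in this embodiment: features are already binned, a small smoothing constant avoids division by zero, evidence weight correlation is taken as the Pearson correlation of the evidence-weight-encoded columns, and of two highly correlated features the one with the higher information value is kept. All names, thresholds and data are hypothetical.

```python
import math
from collections import defaultdict


def missing_rate(values):
    return sum(v is None for v in values) / len(values)


def woe_iv(bins, labels, eps=0.5):
    """Return ({bin: weight of evidence}, information value) for a binned
    feature and binary labels (1 = event). eps is a smoothing assumption."""
    pos, neg = defaultdict(float), defaultdict(float)
    for b, y in zip(bins, labels):
        (pos if y == 1 else neg)[b] += 1.0
    keys = set(pos) | set(neg)
    total_pos = sum(pos.values()) + eps * len(keys)
    total_neg = sum(neg.values()) + eps * len(keys)
    woe, iv = {}, 0.0
    for k in keys:
        p = (pos[k] + eps) / total_pos
        n = (neg[k] + eps) / total_neg
        woe[k] = math.log(p / n)
        iv += (p - n) * woe[k]
    return woe, iv


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def screen(features, labels, max_missing, min_iv, max_corr):
    # First screening: keep features whose missing rate is below the threshold.
    kept = {n: v for n, v in features.items() if missing_rate(v) < max_missing}
    # Second screening: keep features whose information value exceeds the threshold.
    stats = {}
    for name, values in kept.items():
        woe, iv = woe_iv(values, labels)
        if iv > min_iv:
            stats[name] = (iv, [woe[b] for b in values])
    # Third screening: visit features from highest to lowest information value
    # and drop any feature too correlated with one already selected.
    selected = []
    for name in sorted(stats, key=lambda k: -stats[k][0]):
        if all(abs(pearson(stats[name][1], stats[s][1])) <= max_corr
               for s in selected):
            selected.append(name)
    return selected


labels = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical outcome labels
features = {
    "d_dimer": ["high", "high", "high", "low", "low", "low", "high", "low"],
    "fibrinogen": ["high", "high", "low", "low", "low", "low", "high", "low"],
    "age_band": ["old", "young", "old", "young", "old", "young", "old", "old"],
    "vision_test": ["ok", "bad", "ok", "bad", "ok", "bad", "bad", "ok"],
    "skin_allergy": [None, None, None, "neg", None, None, None, "pos"],
}
selected = screen(features, labels, max_missing=0.5, min_iv=0.02, max_corr=0.7)
print(selected)
```

In this toy data the skin allergy feature is removed for its missing rate, the vision test for its information value, and the hypothetical fibrinogen feature because it is strongly correlated with D-dimer while carrying less information.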

The model generation module 74 is configured to perform user data modeling by using the filtered features, and generate a user data model.

The reliability analysis module 75 is configured to perform model analysis on the user data model to obtain a reliability analysis result of the user data model, where the reliability analysis result is used to at least present, to business personnel, the rationality of the judgment basis of the user data model and the degree of influence of each feature used in user data modeling on a prediction result.

In this embodiment, the reliability analysis result includes a model effect analysis result, a model-entering feature analysis result, and a model interpretability analysis result. The reliability analysis module 75 is specifically configured to perform model effect analysis on the user data model to generate a model effect analysis result; to perform model-entering feature analysis on the user data model to generate a model-entering feature analysis result; and to perform model interpretability analysis on the user data model to generate a model interpretability analysis result.
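For the model effect analysis, typical indices for a binary classifier can be computed as in the sketch below. It assumes a 0.5 decision threshold, uses the F1 score as one possible combined measure of precision and recall, and computes the area under the ROC curve by pairwise comparison of scores; these choices and the sample scores are illustrative assumptions rather than details fixed by this embodiment.

```python
def effect_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall, F1 and ROC AUC for binary labels."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # AUC as the probability that a random positive outranks a random
    # negative; tied scores count as half a win.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}


# Hypothetical true outcomes and predicted risk scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
metrics = effect_metrics(y_true, scores)
print(metrics)
```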

The analysis system based on user data modeling according to the present invention can implement the analysis method based on user data modeling according to the present invention, but the implementation apparatus of the analysis method based on user data modeling according to the present invention includes, but is not limited to, the structure of the analysis system based on user data modeling listed in this embodiment, and all structural modifications and substitutions in the prior art made according to the principle of the present invention are included in the protection scope of the present invention.

Please refer to fig. 8, which is a schematic structural connection diagram of an analysis device based on user data modeling according to an embodiment of the present invention. As shown in fig. 8, the present embodiment provides a device 8, the device 8 including: a processor 81, a memory 82, a communication interface 83, and/or a system bus 84. The memory 82 and the communication interface 83 are connected to the processor 81 via the system bus 84 and communicate with each other; the memory 82 is used for storing a computer program, the communication interface 83 is used for communicating with other devices, and the processor 81 is used for running the computer program to enable the device 8 to execute the steps of the analysis method based on user data modeling.

The system bus 84 mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. The communication interface 83 is used to realize communication between the database access device and other devices (such as a client, a read-write library, and a read-only library). The memory 82 may include a Random Access Memory (RAM), and may further include a non-volatile memory (non-volatile memory), such as at least one disk memory.

The processor 81 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In summary, the analysis method, system, medium and device based on user data modeling according to the present invention provide business personnel with sufficient information during the machine learning application process: they determine whether the data used for model training is abnormal, on that basis provide the information needed to judge the model, and integrate targeted analysis and detection techniques into the automated machine learning process. This helps business personnel understand the reliability of the model and avoids the one-sidedness of measuring the model only through accuracy-type indices such as those of model effect analysis. The invention is of great significance in medical, financial and other scenarios with high requirements on model reliability, and can avoid the negative consequences of mistakenly applying an unreliable model. The invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. An analysis method based on user data modeling is characterized in that the analysis method based on user data modeling comprises the following steps:

carrying out characteristic analysis on the user data to generate a characteristic analysis result;

performing stability inspection on the characteristic analysis result along with time change to detect abnormal data, and judging whether the characteristic analysis result is reliable according to the abnormal data detection result; if yes, executing the next step, otherwise, returning to the previous step;

after preprocessing the characteristic analysis result, screening characteristics required by the user data during modeling by combining preprocessed data; screening the characteristics required by the user data during modeling through technical indexes of the missing rate, the information value and the evidence weight correlation so as to automatically select a characteristic group set which has the maximum help for the user data modeling; comparing the missing rate with a first threshold, taking the features of which the missing rate is smaller than the first threshold as a first screening result, comparing the information value with a second threshold, taking the features of which the information value is larger than the second threshold in the first screening result as a second screening result, comparing the evidence weight correlation with a third threshold, and making a selection between any two features of which the evidence weight correlation is larger than the third threshold in the second screening result so as to retain the feature of which the missing rate is minimum and/or the information value is maximum; the first threshold, the second threshold and the third threshold are used as a value combination, and the optimal values of the first threshold, the second threshold and the third threshold in the value combination are determined in an exhaustive mode or an iterative mode of a model prediction process, so that the model prediction effect is optimal;

modeling user data by using the screened features to generate a user data model;

performing model analysis on the user data model to obtain a reliability analysis result of the user data model, wherein the reliability analysis result is used for at least presenting the reasonability of a judgment basis of the user data model and the influence degree of each characteristic used in user data modeling on a prediction result to business personnel; the model analysis comprises the step of carrying out model entering characteristic analysis on the user data model to generate a model entering characteristic analysis result; the model entering characteristic analysis provides a relational analysis between the characteristics entering the user data model and the predicted targets, and identifies whether this relationship is stable over time; the relational analysis is to detect the rationality of the attribute values of the model entering characteristics applied to the model prediction process.

2. The analysis method based on user data modeling according to claim 1, wherein the step of performing feature analysis on the user data to generate a feature analysis result comprises:

performing appearance frequency index analysis on the user data to generate an appearance frequency analysis result;

performing numerical index analysis on the user data to generate a numerical analysis result;

and carrying out logic type index analysis on the user data to generate a logic type analysis result.

3. The analytical method based on user data modeling according to claim 1, wherein the step of preprocessing the feature analysis results comprises:

carrying out scaling mapping and principal component analysis on numerical characteristic data in the characteristic analysis result;

and carrying out one-hot encoding and independent component analysis on the classified data in the feature analysis result.

4. The analytical method based on user data modeling according to claim 1, wherein the reliability analysis result further includes a model effect analysis result and a model interpretability analysis result;

performing model effect analysis on the user data model to generate a model effect analysis result;

and the model effect analysis comprises calculating the accuracy, the precision, the recall rate, a combined analysis value of the precision and the recall rate, and the area under the curve of the user data model.

5. The analytical method based on user data modeling according to claim 1, wherein:

the model entering characteristics in the model entering characteristic analysis refer to the characteristics which, after feature screening, are finally identified as important to the user data model and are adopted by the user data model.

6. The analytical method based on user data modeling according to claim 4, wherein:

performing model interpretability analysis on the user data model to generate a model interpretability analysis result.

7. An analytics system based on user data modeling, comprising:

the data analysis module is used for carrying out characteristic analysis on the user data to generate a characteristic analysis result;

the abnormal detection module is used for carrying out stability detection on the characteristic analysis result along with time change so as to detect abnormal data and judging whether the characteristic analysis result is reliable or not according to the abnormal data detection result;

if so, calling a feature screening module for screening features required by user data modeling in combination with preprocessed data after preprocessing the feature analysis result, and screening the features required by the user data modeling through technical indexes of missing rate, information value and evidence weight correlation so as to automatically select a feature group set which has the greatest help for the user data modeling; comparing the missing rate with a first threshold, taking the features of which the missing rate is smaller than the first threshold as a first screening result, comparing the information value with a second threshold, taking the features of which the information value is larger than the second threshold in the first screening result as a second screening result, comparing the evidence weight correlation with a third threshold, and making a selection between any two features of which the evidence weight correlation is larger than the third threshold in the second screening result so as to retain the feature of which the missing rate is minimum and/or the information value is maximum; the first threshold, the second threshold and the third threshold are used as a value combination, and the optimal values of the first threshold, the second threshold and the third threshold in the value combination are determined in an exhaustive mode or an iterative mode of a model prediction process, so that the model prediction effect is optimal; if not, calling the data analysis module;

the model generation module is used for modeling the user data by utilizing the screened features to generate a user data model;

the reliability analysis module is used for carrying out model analysis on the user data model to obtain a reliability analysis result of the user data model, and the reliability analysis result is used for at least presenting the rationality of the judgment basis of the user data model and the influence degree of each characteristic used in user data modeling on a prediction result to business personnel; the model analysis comprises the step of carrying out model entering characteristic analysis on the user data model to generate a model entering characteristic analysis result; the model entering characteristic analysis provides a relational analysis between the characteristics entering the user data model and the predicted targets, and identifies whether this relationship is stable over time; the relational analysis is to detect the rationality of the attribute values of the model entering characteristics applied to the model prediction process.

8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the analysis method based on user data modeling according to any one of claims 1 to 6.

9. An analysis device based on user data modeling, comprising: a processor and a memory;

the memory is for storing a computer program and the processor is for executing the computer program stored by the memory to cause the user data modeling based analysis apparatus to perform the user data modeling based analysis method of any of claims 1 to 6.