CN106777891B

CN106777891B - Data feature selection and prediction method and device

Info

Publication number: CN106777891B
Application number: CN201611043691.9A
Authority: CN
Inventors: 吴书; 王亮; 谭铁牛
Original assignee: Institute of Automation, Chinese Academy of Sciences
Current assignee: Institute of Automation, Chinese Academy of Sciences
Priority date: 2016-11-21
Filing date: 2016-11-21
Publication date: 2019-06-07
Anticipated expiration: 2036-11-21
Also published as: CN106777891A

Abstract

The invention discloses a data feature selection and prediction method and device. The method comprises: step S1, collecting user information and the corresponding blood pressure observation data to form a data set, and removing outliers from the data set; step S2, extracting user features from the user information in the data set; step S3, extracting blood pressure features from the blood pressure observation data in the data set; step S4, normalizing the extracted user features and blood pressure features, using the results as training samples to form a training set, and inputting the training samples in the training set into a support vector machine model and/or a gradient boosting decision tree model to train a prediction model. The invention uses medical knowledge to guide data cleaning and feature engineering, which effectively improves the accuracy of the model.

Description

A data feature selection and prediction method and device

Technical field

The present invention relates to the fields of machine learning and pattern recognition, mainly to feature selection methods in machine learning, and in particular to a method and device that perform data feature selection and prediction by combining gradient boosting decision tree and support vector machine models.

Background technique

With the development of computer technology, computers can now process many different kinds of data and help people complete tasks more efficiently. In the field of artificial intelligence in particular, machine learning is a core technology that has been widely applied to many specific problems. The support vector machine (SVM) is one of the classical machine learning models; it is efficient and also achieves good prediction results. The gradient boosting decision tree (GBDT) is a machine learning method that has become very popular in industry in recent years; it is derived from the classical decision tree model.

Mobile healthcare has become a global market focus in recent years. Cross-domain integration is its essential characteristic, and the prediction and application of big data in this field have especially bright prospects.

Summary of the invention

Based on the above, the present invention develops a screening model for users' blood pressure data sequences. It aims to provide individual users with optimized strategies and intuitive quantitative guidance, to assist in achieving the most effective intervention measures, and to provide users with a personalized feature selection service.

According to one aspect of the present invention, a data feature selection and prediction method is provided, the method comprising the steps of:

Step S1: collecting user information and the corresponding blood pressure observation data to form a data set, and removing outliers from the data set;

Step S2: extracting user features from the user information in the data set;

Step S3: extracting blood pressure features from the blood pressure observation data in the data set;

Step S4: normalizing the extracted user features and blood pressure features, using the results as training samples to form a training set, and inputting the training samples in the training set into a support vector machine model and/or a gradient boosting decision tree model to train a prediction model.

The user features include the user's age, gender, and body mass index; the blood pressure features include systolic (high) pressure, diastolic (low) pressure, heart rate, and medication status.

The extraction of blood pressure features in step S3 includes extracting the blood pressure features under different prediction tasks; the different prediction tasks include long-period, short-period, coarse-grained, and fine-grained prediction tasks.

Inputting the training samples in the training set into the support vector machine model and/or the gradient boosting decision tree model in step S4 to train the prediction model comprises:

Extracting from the training set, for the same user, the user features, the single-month averages of the blood pressure features, the half-month averages of the blood pressure features, and the average of the blood pressure features within a first predetermined acquisition time, and inputting them into the support vector machine model, wherein the support vector machine model is a regression model whose kernel function is a linear kernel;

Comparing the output of the support vector machine model with the blood pressure features of the same user in the training set within a second predetermined acquisition time, and updating the parameters of the support vector machine model accordingly, the second predetermined acquisition time being later than the first predetermined acquisition time;

Iterating the above steps until the parameters of the support vector machine model converge, to obtain a first prediction model.

Extracting from the training set, for the same user, the user features, the single-month averages of the blood pressure features, the half-month averages of the blood pressure features, and the average of the blood pressure features within a third predetermined acquisition time, and inputting them into the gradient boosting decision tree model, wherein the loss function of the gradient boosting decision tree model is the least-squares loss;

Comparing the output of the gradient boosting decision tree model with the blood pressure features of the same user in the training set within a fourth predetermined acquisition time, and updating the parameters of the gradient boosting decision tree model accordingly, the fourth predetermined acquisition time being later than the third predetermined acquisition time;

Iterating the above steps until the parameters of the gradient boosting decision tree model converge, to obtain a second prediction model.

Extracting from the training set, for the same user, the user features, the single-month averages of the blood pressure features, the half-month averages of the blood pressure features, and the average of the blood pressure features within the first predetermined acquisition time, and inputting them into both the support vector machine model and the gradient boosting decision tree model, wherein the support vector machine model is a regression model whose kernel function is a linear kernel, and the loss function of the gradient boosting decision tree model is the least-squares loss;

Comparing the outputs of the support vector machine model and the gradient boosting decision tree model, respectively, with the blood pressure features of the same user in the training set within the second predetermined acquisition time, and updating the parameters of the support vector machine model and the gradient boosting decision tree model accordingly, the second predetermined acquisition time being later than the first predetermined acquisition time;

Iterating the above steps until the parameters of the support vector machine model and the gradient boosting decision tree model converge, to obtain a first prediction model.

Removing outliers from the data set in step S1 comprises:

Removing the user information, and the corresponding blood pressure data, of users whose age is not within a predetermined age range;

Removing the user information, and the corresponding blood pressure data, of users whose height is not within a predetermined height range;

Removing the user information, and the corresponding blood pressure data, of users whose weight is not within a predetermined weight range;

Removing the user information, and the corresponding blood pressure data, of users whose blood pressure value is not within a predetermined blood pressure range;

Removing the user information, and the corresponding blood pressure data, of users whose heart rate is not within a predetermined heart rate range.

According to a second aspect of the present invention, a data feature selection and prediction device is provided, comprising:

An acquisition module, configured to collect user information and the corresponding blood pressure observation data to form a data set, and to remove outliers from the data set;

A user feature extraction module, configured to extract user features from the user information in the data set;

A blood pressure feature extraction module, configured to extract blood pressure features from the blood pressure observation data in the data set;

A training module, configured to normalize the extracted user features and blood pressure features, use the results as training samples to form a training set, and input the training samples in the training set into a support vector machine model and/or a gradient boosting decision tree model to train a prediction model.

The blood pressure feature extraction module includes:

A blood pressure feature extraction submodule, configured to extract the blood pressure features under different prediction tasks; the different prediction tasks include long-period, short-period, coarse-grained, and fine-grained prediction tasks.

The present invention uses medical knowledge to guide data cleaning and feature engineering, which effectively improves the accuracy of the model.

Brief description of the drawings

Fig. 1 is a flowchart of the data feature selection and prediction method proposed by the present invention.

Specific embodiment

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to specific embodiments and the accompanying drawing.

As shown in Fig. 1, the present invention proposes a data feature selection and prediction method, the method comprising the following steps:

In one embodiment, the user features include the user's age, gender, and body mass index; the blood pressure features include systolic (high) pressure, diastolic (low) pressure, and heart rate.

The extraction of blood pressure features in step S3 includes extracting the blood pressure features under different prediction tasks; the different prediction tasks include long-period, short-period, coarse-grained, and fine-grained prediction tasks.

In one embodiment, the present invention can train the SVM model and the GBDT model at the same time and use both models together to predict the user's blood pressure; in another embodiment, the SVM model or the GBDT model can be trained on its own, and the trained SVM model or GBDT model is then used for prediction.

In one embodiment, inputting the training samples in the training set into the support vector machine model and/or the gradient boosting decision tree model in step S4 to train the prediction model comprises:

In another embodiment, inputting the training samples in the training set into the support vector machine model and/or the gradient boosting decision tree model in step S4 to train the prediction model comprises:

In other embodiments, inputting the training samples in the training set into the support vector machine model and/or the gradient boosting decision tree model in step S4 to train the prediction model comprises:

In one embodiment, removing outliers from the data set in step S1 comprises:

The technical solution of the present invention is described in detail below through a specific embodiment.

In one embodiment, the present invention proposes a data feature selection and prediction method comprising:

Step 101: collect users' personal information data and blood pressure observation data, and import them into a database. The personal information data include the user's age, gender, height, weight, body mass index (BMI), measurement time, and so on; the blood pressure observation data include systolic (high) pressure, diastolic (low) pressure, heart rate, medication status, measurement month, and so on. The data are then cleaned: outliers (that is, abnormal personal information data and blood pressure observation data) are deleted according to relevant medical knowledge, so that the data set becomes target data usable for training machine learning models.

The specific screening rules for outliers are as follows. Remove personal information data in which the age is not within the predetermined age range, for example users older than 110 or younger than 10. Remove data in which the height is not within the predetermined height range, for example heights below 120 cm or above 200 cm. Remove data in which the weight is not within the predetermined weight range, for example weights below 20 kg or above 130 kg. Remove data in which the blood pressure is not within the predetermined blood pressure range, for example observations whose diastolic (low) pressure is more than 40 below or above the user's historical average blood pressure measurement, and observations whose systolic (high) pressure is more than 40 below or above the user's historical average blood pressure measurement. Remove observations in which the heart rate is 0.
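As an illustration only, the screening rules above can be sketched in Python as follows. The thresholds are the example values given in the text; the record field names (`age`, `height_cm`, `weight_kg`, `systolic`, `diastolic`, `heart_rate`) are hypothetical, and computing the historical average over all of a user's readings is an assumption, since the text does not say how that average is obtained.

```python
from statistics import mean

# Example thresholds from the text; field names below are hypothetical.
AGE_RANGE = (10, 110)        # years
HEIGHT_RANGE = (120, 200)    # centimetres
WEIGHT_RANGE = (20, 130)     # kilograms
MAX_BP_DEVIATION = 40        # allowed distance from the user's historical mean


def user_is_plausible(user):
    """Check age, height, and weight against the predetermined ranges."""
    return (AGE_RANGE[0] <= user["age"] <= AGE_RANGE[1]
            and HEIGHT_RANGE[0] <= user["height_cm"] <= HEIGHT_RANGE[1]
            and WEIGHT_RANGE[0] <= user["weight_kg"] <= WEIGHT_RANGE[1])


def clean_readings(readings):
    """Drop readings with heart rate 0 or pressure far from the user's mean.

    Assumption: the historical mean is taken over all readings passed in.
    """
    if not readings:
        return []
    sys_mean = mean(r["systolic"] for r in readings)
    dia_mean = mean(r["diastolic"] for r in readings)
    return [r for r in readings
            if r["heart_rate"] != 0
            and abs(r["systolic"] - sys_mean) <= MAX_BP_DEVIATION
            and abs(r["diastolic"] - dia_mean) <= MAX_BP_DEVIATION]
```

A user who fails `user_is_plausible` would be dropped together with all of that user's blood pressure data, while `clean_readings` filters individual observations.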

Step 102: select the user features from the database, namely age, gender, and body mass index. According to authoritative medical information, blood pressure tends to be higher the older the user is; men's blood pressure is generally slightly higher than women's; and the higher the body mass index (BMI, which approximately indicates a heavier build), the higher the blood pressure. The extracted features are the age and gender in the personal information data (0 denotes female and 1 denotes male), with height and weight converted into BMI (weight divided by the square of height).
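A minimal sketch of this user feature extraction is given below. The function name and the `"F"`/`"M"` gender labels are hypothetical; BMI is computed in the usual units of kilograms per square metre, which the text does not state explicitly.

```python
def user_features(age, gender, height_cm, weight_kg):
    """Return [age, gender_code, bmi] for one user.

    gender is "F" or "M" (hypothetical labels); it is encoded as 0 for
    female and 1 for male, as described in the text.
    """
    gender_code = {"F": 0, "M": 1}[gender]
    height_m = height_cm / 100.0
    bmi = weight_kg / height_m ** 2   # weight divided by the square of height
    return [age, gender_code, bmi]
```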

Step S3: select blood pressure features from the database, including the blood pressure features under different prediction tasks. The different prediction tasks are tasks of different precision: long-period, short-period, coarse-grained, and fine-grained prediction. The blood pressure features selected under each prediction task include systolic (high) pressure, diastolic (low) pressure, heart rate, and medication status. The blood pressure observation data include the user's systolic pressure, diastolic pressure, heart rate, medication status, and measurement month. This step further introduces the different prediction tasks. For example, long-period and short-period prediction use the user's blood pressure data from 6 or 3 consecutive months, respectively, as the feature input; if a month has no measurement, a missing-value placeholder is used instead. Coarse-grained prediction uses the user's average blood pressure measurement over 2 or 3 months as the feature input, and fine-grained prediction uses the average over one month or half a month.
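The period and granularity choices described in this step might be sketched as follows. The sketch assumes the data have already been reduced to one average value per month, keyed by an integer month index; the placeholder value `MISSING` and the function names are hypothetical, and averaging monthly means (rather than raw measurements) for the coarse-grained feature is a simplification.

```python
MISSING = -1.0  # hypothetical placeholder for a month without measurements


def period_features(monthly_mean, last_month, n_months):
    """Monthly means for the n_months ending at last_month, oldest first.

    n_months=6 corresponds to the long period, n_months=3 to the short one.
    Months without a measurement are filled with the placeholder.
    """
    months = range(last_month - n_months + 1, last_month + 1)
    return [monthly_mean.get(m, MISSING) for m in months]


def coarse_feature(monthly_mean, last_month, n_months):
    """Coarse-grained input: average of the available monthly means."""
    values = [v for v in period_features(monthly_mean, last_month, n_months)
              if v != MISSING]
    return sum(values) / len(values) if values else MISSING
```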

Step 103: normalize the feature data (the measured systolic (high) pressure, diastolic (low) pressure, and heart rate, together with the user's BMI, age, gender, and so on, that is, the feature data within a predetermined time obtained from the training data) and the target data (the blood pressure values for a period later than that predetermined time, also obtained from the training data), so that the data range lies between 0 and 1. The normalization formula is:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is a value of a feature, $x_{\min}$ is the smallest value of that feature in the database, and $x_{\max}$ is the largest. Month information is one-hot encoded: the integer month is expanded into a code of 0s and 1s in which the position of the 1 in the sequence expresses the encoded value, so that all 12 months are treated equally.
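The normalization to the range 0 to 1 and the one-hot encoding of the month described in this step can be sketched as below. The function names are hypothetical, and returning 0.0 for a constant feature is an assumption made only to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale a list of feature values into [0, 1] using its min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant feature: assumption, map to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_month(month):
    """Encode a month number 1..12 as a 12-element vector with a single 1."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return [1 if i == month else 0 for i in range(1, 13)]
```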

Step 104: use a support vector machine (SVM) and a gradient boosting decision tree (GBDT) to perform regression learning on the processed feature data (user features and blood pressure measurement features) and target data, building a model that predicts the user's future blood pressure. The user features, the blood pressure measurement features, and the month information corresponding to each blood pressure measurement feature are normalized as training data and fed into the SVM and GBDT models until the model parameters converge; the parameters obtained at that point make the model optimal with respect to the training data. Experiments show that, for the SVM, choosing a regression model with a linear kernel gives the best results. For the gradient boosting decision tree model, the loss function is the least-squares error, and the predicted labels are output with the predict function.

To verify the effect of the invention, it is further illustrated below with experimental results on real data. The specific steps are as follows:

Step 201: because a single blood pressure measurement cannot accurately describe a user's blood pressure condition, the average blood pressure over one month is computed for each user and collected into the data set.

Step 202: first convert the raw data in the data set into features suitable for model training, then select the users who have observation data in six consecutive months, which guarantees the continuity of each user's measurements and improves prediction accuracy. For example, data from users with observation records in seven consecutive months (month N-5 to month N+1) are used for training (for example, users who appear in both August and September), with the last month, N+1, as the training target; users with observation records in the seven consecutive months from N-4 to N+2 are used for testing (for example, users who appear in both September and October), with the last month, N+2, as the test target.
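The selection of users with uninterrupted records can be sketched as follows. The data layout (a mapping from user to the months in which that user has observations) and the function names are hypothetical, and months are treated as plain consecutive integers without year boundaries.

```python
def users_with_full_window(monthly_by_user, first_month, last_month):
    """Users that have an observation in every month of the window."""
    window = range(first_month, last_month + 1)
    return sorted(user for user, months in monthly_by_user.items()
                  if all(m in months for m in window))


def split_windows(monthly_by_user, n):
    """Training users for months N-5..N+1 and test users for N-4..N+2."""
    train = users_with_full_window(monthly_by_user, n - 5, n + 1)
    test = users_with_full_window(monthly_by_user, n - 4, n + 2)
    return train, test
```

With N = 8, for instance, the training window covers months 3 to 9 (target month 9) and the test window covers months 4 to 10 (target month 10).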

Step S3: in the SVM experiment, the training target is the average diastolic (low) pressure of month N+1; the prediction output by the model is compared with the month N+1 data to update the model parameters. Strategies 1) and 2) below are taken as typical cases of the short period and the long period, respectively. The training set feature extraction rules are as follows:

1) Months N-2 to N: the BMI converted from the user's height and weight (weight divided by the square of height), gender, and age; the single-month average systolic (high) pressure, diastolic (low) pressure, heart rate, and medication status for each of months N-2, N-1, and N; the two-week averages of the same quantities within months N-2, N-1, and N; and the three-month averages of the same quantities over months N-2, N-1, and N.

2) Months N-5 to N: the BMI converted from the user's height and weight (weight divided by the square of height), gender, and age; the single-month average systolic (high) pressure, diastolic (low) pressure, heart rate, and medication status for each of months N-5 to N; the two-week averages of the same quantities within months N-5 to N; and the quarterly (three-month) averages of the same quantities.
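For a single quantity such as diastolic pressure, the single-month, two-week, and whole-window averages listed in rule 1) could be assembled roughly as below. The `(month, day, value)` reading layout and the function name are hypothetical, splitting each month at day 15 is an assumption about what "two weeks" means, and the sketch assumes every half-month contains at least one reading.

```python
from statistics import mean


def window_features(readings, months):
    """Averages of one measured quantity over a window of months.

    readings: list of (month, day, value) tuples (hypothetical layout).
    months:   the window, e.g. [N-2, N-1, N].
    Returns single-month means, then half-month means, then the window mean.
    """
    features = []
    for m in months:                                   # single-month averages
        features.append(mean(v for mo, d, v in readings if mo == m))
    for m in months:                                   # half-month averages
        for first_half in (True, False):
            features.append(mean(v for mo, d, v in readings
                                 if mo == m and (d <= 15) == first_half))
    # average over the whole window (three months in rule 1)
    features.append(mean(v for mo, d, v in readings if mo in months))
    return features
```

The same function would be called once per quantity (systolic pressure, diastolic pressure, heart rate, medication status) and the results concatenated with the user features.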

Step S4: the test set extraction rules for the SVM experiment are as follows:

1) Months N-1 to N+1, corresponding to training months N-2 to N: the BMI converted from the user's height and weight (weight divided by the square of height), gender, and age; the single-month average systolic (high) pressure, diastolic (low) pressure, and heart rate for each of months N-1, N, and N+1; the two-week averages of the same quantities within months N-1, N, and N+1; and the three-month averages of the same quantities over months N-1, N, and N+1.

2) Months N-4 to N+1, corresponding to training months N-5 to N: the BMI converted from the user's height and weight (weight divided by the square of height), gender, and age; the single-month average systolic (high) pressure, diastolic (low) pressure, heart rate, and medication status for each of months N-4 to N+1; the two-week averages of the same quantities within months N-4 to N+1; and the quarterly (three-month) averages of the same quantities.

Step S5: input the training set into the LIBSVM model and train until the model converges, optimizing the model parameters. Feeding the test features into the trained model outputs the predictions, which are compared with the test set targets to obtain the mean error of the diastolic (low) pressure regression.

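The SVM model is constructed as follows.

First, the functional margin $\hat{\gamma}$ of a hyperplane $(w, b)$ with respect to the training data set is defined as:

$$\hat{\gamma} = y\,(w^{T}x + b)$$

where $x$ is the feature data and $y$ is the target data.

The objective function of the maximum-margin classifier can therefore be defined as:

$$\max_{w,\,b}\ \tilde{\gamma} \quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge \hat{\gamma},\quad i = 1, \dots, n$$

where $\tilde{\gamma} = \hat{\gamma} / \lVert w \rVert$ is the geometric margin. It is further rewritten as:

$$\max_{w,\,b}\ \frac{1}{\lVert w \rVert} \quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1,\quad i = 1, \dots, n$$

where $n$ is the number of samples, $y_i$ is the target data of the i-th sample, and $x_i$ is the feature data of the i-th sample.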

The objective function and the constraints can then be merged with the Lagrangian method and rewritten as a general convex optimization problem that is easier to solve. Solving this objective yields the optimal regression hyperplane, which is then used for prediction.
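For reference, the standard textbook form of this Lagrangian step for the maximum-margin problem is sketched below. It is not taken from the patent text: it assumes the usual equivalent primal (minimizing $\tfrac{1}{2}\lVert w \rVert^2$ under the margin constraints) and introduces multipliers $\alpha_i$, which the source does not name. The nu-SVR regression variant selected in LIBSVM has an analogous but different objective.

```latex
% Equivalent primal problem (assumed standard form)
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1,\quad i = 1, \dots, n

% Lagrangian with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^{T}x_i + b) - 1 \,\right]

% Stationarity conditions
\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0

% Resulting dual problem
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^{T} x_j
\quad \text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0
```

The dual depends on the data only through the inner products $x_i^{T}x_j$, which is where the choice of kernel function (here the linear kernel) enters.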

LIBSVM must be configured accordingly: the SVM type, the kernel function, and the training settings are selected through command-line options. The -s option sets the SVM type, and 4 (nu-SVR, regression) selects the regression model; the -t option selects the kernel function, and 0 (linear kernel) selects the linear kernel. Experiments show that this setting works best.

LIBSVM stores the model parameters obtained from training, and the svm_predict function can then be used to predict on the test set and evaluate model performance.

Step S6: the GBDT experiment uses the same feature extraction rules as the SVM experiment, repeating steps S3, S4, and S5. The training set features and targets are input into the GBDT model.

GBDT regression is implemented with the GBDT implementation packaged in the open-source machine learning toolkit scikit-learn. The data only need to be read from a file with Python and stored as lists; the data and the labels each correspond to one list, with matching positions corresponding to the same sample.

GBDT model construction:

The core of GBDT is the decision tree. The overall procedure of a regression decision tree is as follows: every node of the tree yields a predicted value, equal to the average of the target values of all samples that belong to that node. The criterion for the best split is minimizing the mean squared error, so the most reliable split is found by minimizing the mean squared error.

The core idea of gradient boosting is to make the decision jointly by iterating over multiple trees. This gives the training method of GBDT: each tree learns the residual of the sum of the conclusions of all previous trees, where the residual is the amount that, when added to the predicted value, gives the true value. In this way GBDT combines the predictions of multiple decision trees and obtains a more accurate prediction.
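The residual-fitting idea can be illustrated with a deliberately small pure-Python sketch. It is not the scikit-learn implementation used in the experiments: it handles a single numeric feature, uses depth-1 trees (stumps), and the function names, the number of trees, and the learning rate are all illustrative assumptions. It also assumes the feature takes at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of one feature under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)      # node prediction = mean target
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbdt(xs, ys, n_trees=100, learning_rate=0.1):
    """Each new stump is fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict_gbdt(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lm if x <= t else rm)
                      for t, lm, rm in stumps)
```

With the least-squares loss, the negative gradient is exactly the residual, which is why fitting each tree to the residuals implements gradient boosting in this case.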

The GradientBoostingRegressor class in scikit-learn is used to train the model, with a decision tree depth of 3 and a learning rate of 0.005; experiments show that this setting works best. The model parameters are stored after training, and calling the predict method with the learned parameters predicts on the test set so that model performance can be evaluated.

Step S7: blood pressure is divided into grades at intervals of 10 to obtain the grading error; the grading scheme is shown in Table 1. The experimental results of SVM and GBDT are shown in Table 2 and Table 3, respectively; the target month of the experiment is October.

Evaluation metrics:

Mean error: the average difference between the predicted value and the true value over all data.

Grading error: the average difference between the predicted grade and the true grade over all data.

Relative accuracy: mean predicted value / mean true value.

Table 1. Diastolic (low) pressure category levels

Diastolic pressure	Category level
< 80	1
80-90	2
90-100	3
100-110	4
> 110	5
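The three evaluation metrics and the grading in Table 1 can be sketched as follows. Two points are assumptions, because the text does not settle them: the differences are taken as absolute values, and a value lying exactly on a boundary (80, 90, 100) is assigned to the higher level. The function names are hypothetical.

```python
def grade(diastolic):
    """Category level from Table 1 (boundary handling is an assumption)."""
    if diastolic < 80:
        return 1
    if diastolic < 90:
        return 2
    if diastolic < 100:
        return 3
    if diastolic <= 110:
        return 4
    return 5


def mean_error(predicted, true):
    """Average absolute difference between predicted and true values."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)


def grading_error(predicted, true):
    """Average absolute difference between predicted and true levels."""
    return sum(abs(grade(p) - grade(t))
               for p, t in zip(predicted, true)) / len(true)


def relative_accuracy(predicted, true):
    """Mean predicted value divided by mean true value."""
    return (sum(predicted) / len(predicted)) / (sum(true) / len(true))
```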

Table 2. Support vector machine (SVM) experimental results

SVM experiment predicting users' average diastolic (low) pressure in October 2015

Table 3. Gradient boosting decision tree (GBDT) experimental results

GBDT experiment predicting users' average diastolic (low) pressure in October 2015

Step S8: compare the experimental results in Tables 2 and 3 with the baseline. The baseline directly uses each user's September diastolic (low) pressure data as the fit for the October value; its results are shown in Table 4.

Table 4. Baseline

Month	Mean error	Mean error rate	Grading error	Number of samples
October	5.27692	0.0638	0.43691	3012
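A baseline of this kind, in which the previous month's value is used directly as the prediction, might be computed as in the sketch below. The dictionary layout (user mapped to that month's average diastolic pressure) and the function name are hypothetical, and taking the absolute difference is an assumption.

```python
def persistence_baseline_error(previous_month, target_month):
    """Mean absolute error when last month's value is used as the prediction.

    Both arguments map a user id to that month's average diastolic pressure.
    Only users present in both months are evaluated.
    """
    users = [u for u in target_month if u in previous_month]
    return sum(abs(previous_month[u] - target_month[u])
               for u in users) / len(users)
```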

The experimental results in the tables show that, compared with the baseline, the mean error of the diastolic (low) pressure improves markedly: the short-period and long-period predictions of the SVM model improve by 10.37% and 11.14%, respectively, and those of the GBDT model improve by 10.75% and 11.45%, respectively. In terms of grading error, compared with the baseline, the short-period and long-period predictions of the SVM model improve by 2.85% and 8.43%, respectively, and those of the GBDT model improve by 8.43% and 10.48%, respectively.

The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data feature selection and prediction method, the method comprising the steps of:

Step S1: collecting user information and the corresponding blood pressure observation data to form a data set, and removing outliers from the data set;

Step S4: normalizing the extracted user features and blood pressure features, using the results as training samples to form a training set, and inputting the training samples in the training set into a gradient boosting decision tree model, specifically comprising:

extracting from the training set, for the same user, the user features, the single-month averages of the blood pressure features, the half-month averages of the blood pressure features, and the average of the blood pressure features within a third predetermined acquisition time, and inputting them into the gradient boosting decision tree model, wherein the loss function of the gradient boosting decision tree model is the least-squares loss;

comparing the output of the gradient boosting decision tree model with the blood pressure features of the same user in the training set within a fourth predetermined acquisition time, and updating the parameters of the gradient boosting decision tree model accordingly, the fourth predetermined acquisition time being later than the third predetermined acquisition time;

2. The method according to claim 1, characterized in that the user features include the user's age, gender, and body mass index, and the blood pressure features include systolic (high) pressure, diastolic (low) pressure, heart rate, and medication status.

3. The method according to claim 2, characterized in that the extraction of blood pressure features in step S3 includes extracting the blood pressure features under different prediction tasks, the different prediction tasks including long-period, short-period, coarse-grained, and fine-grained prediction tasks.

4. The method according to claim 1, characterized in that removing outliers from the data set in step S1 comprises:

5. A data feature selection and prediction method, the method comprising the steps of:

Step S4: normalizing the extracted user features and blood pressure features, using the results as training samples to form a training set, and inputting the training samples in the training set into a support vector machine model and a gradient boosting decision tree model to train a prediction model, specifically comprising:

extracting from the training set, for the same user, the user features, the single-month averages of the blood pressure features, the half-month averages of the blood pressure features, and the average of the blood pressure features within a first predetermined acquisition time, and inputting them into the support vector machine model and the gradient boosting decision tree model, wherein the support vector machine model is a regression model whose kernel function is a linear kernel, and the loss function of the gradient boosting decision tree model is the least-squares loss;

comparing the outputs of the support vector machine model and the gradient boosting decision tree model, respectively, with the blood pressure features of the same user in the training set within a second predetermined acquisition time, and updating the parameters of the support vector machine model and the gradient boosting decision tree model accordingly, the second predetermined acquisition time being later than the first predetermined acquisition time;

iterating the above steps until the parameters of the support vector machine model and the gradient boosting decision tree model converge, to obtain a first prediction model.

6. The method according to claim 5, characterized in that the user features include the user's age, gender, and body mass index, and the blood pressure features include systolic (high) pressure, diastolic (low) pressure, heart rate, and medication status.

7. The method according to claim 6, characterized in that the extraction of blood pressure features in step S3 includes extracting the blood pressure features under different prediction tasks, the different prediction tasks including long-period, short-period, coarse-grained, and fine-grained prediction tasks.

8. The method according to claim 5, characterized in that removing outliers from the data set in step S1 comprises:

9. A data feature selection and prediction device, characterized by comprising:

an acquisition module, configured to collect user information and the corresponding blood pressure observation data to form a data set, and to remove outliers from the data set;

a training module, configured to normalize the extracted user features and blood pressure features, use the results as training samples to form a training set, and input the training samples in the training set into a gradient boosting decision tree model to train a prediction model, specifically comprising:

10. The device according to claim 9, characterized in that the user features include the user's age, gender, and body mass index, and the blood pressure features include systolic (high) pressure, diastolic (low) pressure, and heart rate.

11. The device according to claim 9, characterized in that the blood pressure feature extraction module includes:

a blood pressure feature extraction submodule, configured to extract the blood pressure features under different prediction tasks, the different prediction tasks including long-period, short-period, coarse-grained, and fine-grained prediction tasks.

12. A data feature selection and prediction device, characterized by comprising:

a training module, configured to normalize the extracted user features and blood pressure features, use the results as training samples to form a training set, and input the training samples in the training set into a support vector machine model and a gradient boosting decision tree model to train a prediction model, specifically comprising:

13. The device according to claim 12, characterized in that the user features include the user's age, gender, and body mass index, and the blood pressure features include systolic (high) pressure, diastolic (low) pressure, and heart rate.

14. The device according to claim 12, characterized in that the blood pressure feature extraction module includes: