CN111599477A

CN111599477A - Model construction method and system for predicting diabetes based on eating habits

Info

Publication number: CN111599477A
Application number: CN202010664488.3A
Authority: CN
Inventors: 李平; 杜乐
Original assignee: Wuzheng Intelligent Technology Beijing Co ltd
Current assignee: Wuzheng Intelligent Technology Beijing Co ltd
Priority date: 2020-07-10
Filing date: 2020-07-10
Publication date: 2020-08-28

Abstract

The invention relates to a model construction method for predicting diabetes based on eating habits, comprising the following steps: obtaining a first sample set comprising food material information of the multi-day meals of the samples; extracting a plurality of data from the first sample set and using them as features to form a second sample set; dividing the second sample set into a training set and a verification set, and taking the training set as the input of a decision tree model; and training the decision tree model until the information gain of the features is lower than a threshold value, thereby obtaining the decision tree model. The invention analyzes the main causes of diabetes, evaluates the user's eating behavior, and combines cumulative weighting of the nutrient content of food materials with a decision tree algorithm, so that the user can quickly and conveniently learn their nutrient intake and obtain a predicted diabetes risk index, which improves the user experience.

Description

Model construction method and system for predicting diabetes based on eating habits

Technical Field

The invention relates to the field of medical information processing, and in particular to a model construction method and a system for predicting diabetes based on eating habits.

Background

At present, the diagnostic standard for diabetes is uniformly established by the World Health Organization; it is independent of population, age, and sex, and takes blood sugar (fasting blood sugar, random blood sugar, or a glucose tolerance test) as the only diagnostic criterion. Studies have shown that a high blood glucose value is an important criterion for the diagnosis of diabetes, but by no means the only one. Diabetes is not only a problem of high blood sugar but also a problem of where the blood sugar goes: in the course of metabolism, blood sugar is converted into fat. Therefore, all diabetics need to strictly control both blood sugar and blood fat and actively adopt scientific treatment to avoid harm.

With the continuous improvement of living standards and changes in lifestyle, the prevalence of diabetes is on the rise. Diabetes is a group of metabolic diseases characterized by chronically elevated blood glucose levels. The human Postprandial Glycemic Response (PGR) is influenced by a variety of factors. For a single food, it is significantly correlated with food composition, Glycemic Index (GI), and Glycemic Load (GL) values; for mixed diets, however, there is no significant correlation between food composition and postprandial blood glucose, although there is some correlation with the GL and GI values. In addition, the protein and fat in food also have some effect on blood glucose. In order of the magnitude of their effect on blood glucose: GL > protein > fat.

Disclosure of Invention

Aiming at the technical problems in the prior art, the invention provides a model construction method for predicting diabetes based on eating habits.

The technical scheme for solving the above technical problems is as follows: a model construction method for predicting diabetes based on eating habits comprises the following steps: obtaining a first sample set comprising food material information of the multi-day meals of the samples; extracting a plurality of data from the first sample set and using them as features to form a second sample set; dividing the second sample set into a training set and a verification set, and taking the training set as the input of a decision tree model; and training the decision tree model until the information gain of the features is lower than a threshold value, thereby obtaining the decision tree model.

In some embodiments of the invention, the first sample set further comprises the age, sex, weight, height, past history of diabetes, and history of allergies of the sample population.

In some embodiments of the invention, the nutrient intake condition is calculated from the age, sex, and multi-day meal food material information of each sample in the first sample set, and is used as a feature of the second sample set.

In some embodiments of the invention, the nutrients include protein, fat, and glucose.

In some embodiments of the invention, the features of the second sample set include age, glycemic load, fat, protein, obesity, and genetic history of diabetes.

In some embodiments of the invention, the glycemic load is taken as the root node of the decision tree model.

The invention also provides a system for predicting diabetes based on eating habits, which comprises an acquisition module, a matching module, a calculation module, and a decision tree model. The acquisition module is used for acquiring the user's age, sex, weight, height, past history of diabetes, allergy history, and food material information of multi-day meals; the matching module is used for looking up the recommended daily intake of nutrients and the nutrient content of the food materials according to the sex and age of the user; the calculation module is used for performing a weighted calculation on the nutrient content retrieved by the matching module and comparing the result with the recommended daily intake to obtain the nutrient intake features; the decision tree model predicts the probability of the user suffering from diabetes based on the nutrient intake features.

In some embodiments of the invention, the decision tree model includes a model constructed by the above model construction method for predicting diabetes based on eating habits.

In some embodiments of the invention, the nutrients are glucose, fat, and protein.

In some embodiments of the invention, the decision tree model is optimized using a random forest.

The beneficial effects of the invention are as follows: the main causes of diabetes are analyzed, the user's dietary behavior is monitored and evaluated in real time with low resource consumption, and cumulative weighting of the nutrient content of food materials is combined with the decision tree (ID3) algorithm, so that the user can quickly and conveniently learn their nutrient intake and obtain a predicted diabetes risk index. This speeds up disease risk estimation and improves the user experience.

By recording or acquiring dietary behavior data over multiple days, the daily glycemic load (GL), protein, and fat intake are analyzed and counted, and it is determined whether a balance is maintained between the body's demand and the amount supplied. If this balance is disturbed, the risk of diabetes increases. The diabetes risk index is evaluated from the intake of glycemic load (GL), protein, and fat over a certain period of time.

Drawings

FIG. 1 is a basic flow diagram of a model building method for predicting diabetes based on eating habits in some embodiments of the present invention;

FIG. 2 is a schematic diagram of the architecture of a system for predicting diabetes based on eating habits in some embodiments of the present invention;

FIG. 3 is an example of a portion of samples in a second sample set in some embodiments of the invention;

FIG. 4 is a decision tree model in some embodiments of the invention;

FIGS. 5a and 5b are reference tables of the composition of common foods per 100 grams of food.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

First, some necessary concepts of the present application are explained:

Label: the label is what we want to predict, i.e. the y variable in a simple linear regression. The label may be the future price of wheat, the animal species shown in a picture, the meaning of an audio clip, or anything else. In this application, the label refers to whether a person in the sample has diabetes.

Feature: the features are the input variables, i.e., the x variables in a simple linear regression. A simple machine learning project may use a single feature, while a more complex machine learning project may use millions of features, specified as x1, x2, ..., xN. In this application, a feature may refer to a numerical or Boolean value corresponding to age, glucose, fat, protein, obesity, or genetic history of diabetes.

Sample: a sample is a specific instance of data, x (x is a vector). Samples fall into two categories: labeled samples and unlabeled samples. A labeled sample contains both features and a label, namely {features, label}: (x, y). Labeled samples are used to train the model. In this application, a labeled sample is a user explicitly labeled as "having diabetes" or "not having diabetes". For example, a patient or user sample may include features such as age, gender, weight, height, obesity, and genetic history of diabetes.

The decision tree (ID3) algorithm is described as follows:

Let the training data set be $D$, and let $|D|$ denote its sample capacity, i.e., the number of samples. Suppose there are $K$ classes $C_k$, $k = 1, 2, \dots, K$, where $|C_k|$ is the number of samples belonging to class $C_k$, so that $\sum_{k=1}^{K} |C_k| = |D|$.

Suppose a feature $a_*$ has $V$ different values.

According to the value of feature $a_*$, $D$ is divided into $V$ subsets $D_1, D_2, \dots, D_V$, where $|D_i|$ is the number of samples in $D_i$.

Denote the set of samples in $D_i$ that belong to class $C_k$ as $D_{ik}$, i.e. $D_{ik} = D_i \cap C_k$, where $|D_{ik}|$ is the number of samples in $D_{ik}$. The information gain is then calculated as follows:

(1) Calculate the empirical entropy $H(D)$ of data set $D$:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

(2) Calculate the empirical conditional entropy $H(D \mid a_*)$ of feature $a_*$ on data set $D$:

$$H(D \mid a_*) = \sum_{i=1}^{V} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{V} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$

(3) Calculate the information gain:

$$g(D, a_*) = H(D) - H(D \mid a_*)$$

Assume a given training data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$,

where $x_i = (x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(n)})^{T}$

is the input instance, i.e. the feature vector, $n$ is the number of features, $i = 1, 2, \dots, N$, $N$ is the number of samples, and $y_i \in \{1, 2, \dots, K\}$ is the class label.
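The information gain calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names and the six-row toy data set are hypothetical, and entropies are measured in bits (base-2 logarithm).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Empirical entropy H(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(values, labels):
    """Empirical conditional entropy H(D | a) for one feature column."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Weight each subset's entropy by the share of samples it contains.
    return sum(len(sub) / total * entropy(sub) for sub in subsets.values())


def information_gain(values, labels):
    """g(D, a) = H(D) - H(D | a)."""
    return entropy(labels) - conditional_entropy(values, labels)


# Hypothetical toy data: GL intake level and diabetes label (1 = has diabetes).
gl_intake = ["high", "high", "low", "low", "moderate", "moderate"]
has_diabetes = [1, 1, 0, 0, 1, 0]

print(information_gain(gl_intake, has_diabetes))
```

In this toy set the "high" and "low" groups are pure, so only the "moderate" group contributes to the conditional entropy.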

The technical scheme of the invention is specifically described as follows:

Referring to FIG. 1, a model construction method for predicting diabetes based on eating habits includes the following steps: S101, obtaining a first sample set comprising food material information of the multi-day meals of the samples; S102, extracting a plurality of data from the first sample set and using them as features to form a second sample set; S103, dividing the second sample set into a training set and a verification set, and taking the training set as the input of the decision tree model; S104, training the decision tree model until the information gain of the features is lower than a threshold value, thereby obtaining the decision tree model.
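Step S103 can be illustrated with a short Python sketch. The text does not state a split ratio, so the 80/20 split, the fixed random seed, and the names `split_samples` and `second_sample_set` are assumptions made only for this example.

```python
import random


def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and verification sets.

    train_fraction and seed are illustrative defaults, not values from the text.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical second sample set: 800 placeholder records.
second_sample_set = [{"id": i} for i in range(800)]
train_set, verification_set = split_samples(second_sample_set)
print(len(train_set), len(verification_set))
```

Fixing the seed keeps the split reproducible, which makes it easier to compare trees trained with different information gain thresholds on the same verification set.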

In some embodiments of the invention, the first sample set further comprises the age, sex, weight, height, past history of diabetes, and history of allergies of the sample population. The past history of diabetes comprises the hereditary history of diabetes and the person's own history as a diabetes patient.

In some embodiments of the invention, the nutrient intake condition is calculated from the age, sex, and multi-day meal food material information of each sample in the first sample set, and is used as a feature of the second sample set. The nutrient intake falls into three cases: lower than the standard value, recorded as low; equal to the standard value, recorded as medium or moderate; higher than the standard value, recorded as high.

Referring to fig. 3, in some embodiments of the invention, the nutrients include protein, fat, glucose.

Preferably, the features of the second sample set include age, glycemic load, fat, protein, obesity, and genetic history of diabetes. The following definitions apply. Glycemic load (GL): GL = GI of the food × the amount of available carbohydrate (g) actually ingested from that food / 100. Glycemic index (GI): the percentage value of the blood glucose response in the body after eating a meal containing 50 g of carbohydrate, relative to an equivalent amount of a standard food (glucose or white bread). The calculation formula is: GI = (area under the blood glucose response curve in the two hours after eating a food containing 100 g of glucose-equivalent sugar) / (area under the blood glucose response curve in the two hours after eating 100 g of glucose) × 100. The GI value of glucose is usually set to 100. Therefore, the glycemic load is to some extent related to other nutrients such as glucose, protein, and fat.

Referring to FIGS. 3, 5a and 5b, in some embodiments of the invention, the features of the samples need to be normalized to prevent under-fitting or over-fitting of the model. The food material information and nutrients in the first or second sample set comprise many features whose values span widely different ranges, so features with large values would dominate the classification result and weaken the influence of features with relatively small values; the data therefore needs to be normalized. Min-max (dispersion) normalization applies a linear transformation to the original data so that the result falls into a range such as [0,1] or [0,10]; the range can be adjusted according to actual conditions.
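A minimal sketch of the linear (min-max) normalization described above is given below. The function name, the handling of a constant column, and the sample ages are assumptions for illustration only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical feature column: ages of four samples.
ages = [20, 40, 60, 80]
print(min_max_normalize(ages))           # rescaled to [0, 1]
print(min_max_normalize(ages, 0, 10))    # rescaled to [0, 10]
```

The target range is a parameter, matching the statement that the range can be adjusted according to actual conditions.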

Referring to FIG. 4, in some embodiments of the invention, the decision tree model takes the glycemic load (GL) as its root node.

Referring to FIG. 2, another aspect of the present invention provides a system 1 for predicting diabetes based on eating habits, including an obtaining module 11, a matching module 12, a calculating module 13, and a decision tree model 14. The obtaining module 11 is configured to obtain the user's age, gender, weight, height, past history of diabetes, allergy history, and food material information of multi-day meals; the matching module 12 is used for looking up the recommended daily intake of nutrients and the nutrient content of the food materials according to the gender and age of the user; the calculating module 13 is used for performing a weighted calculation on the nutrient content retrieved by the matching module and comparing the result with the recommended daily intake to obtain the nutrient intake features; the decision tree model 14 predicts the probability of the user suffering from diabetes based on the nutrient intake features.

In some embodiments of the invention, the decision tree model 14 includes a model constructed by the aforementioned model construction method for predicting diabetes based on eating habits.

In some embodiments of the invention, the nutrients are glucose, fat, and protein.

In some embodiments of the invention, the decision tree model 14 is optimized using a random forest.

In some embodiments of the invention, a system 1 for predicting diabetes based on eating habits comprises an obtaining module 11, a matching module 12, a calculating module 13, and a decision tree model 14:

The obtaining module 11: acquires the user's age, sex, weight, height, past history, allergy history, and food material information of multi-day meals;

The matching module 12: according to the data entered by the user, looks up in turn, in each nutrient table (protein, fat, and glucose), the recommended daily intake for the user's age and gender and the nutrient content of the food materials;

The calculating module 13: performs a weighted calculation on the nutrient content according to the retrieved data and benchmarks the weighted result against the recommended daily intake. The nutrient intake condition (low, high, or moderate) is judged from the benchmarking result, and classification statistics are compiled;

Decision tree model 14: the diets of 800 users are recorded for 60 consecutive days, and their protein, fat, and glucose intake is analyzed. The selected user features are: "protein", "fat", "glycemic load", "obesity", and "genetic history of diabetes". The information gain of each feature is calculated from the feature information; the feature with the largest information gain is selected as the root node, the features with successively smaller gains are used as child nodes, and the method is called recursively on the child nodes to construct the decision tree until the information gain of all features is very small or no feature remains to be selected, finally yielding the decision tree. Note that the 800 users (samples) are for illustration only, and the number of samples may be adjusted as appropriate.
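The recursive construction described for decision tree model 14 can be sketched as follows. This is an illustrative ID3-style builder under stated assumptions, not the patented implementation: the seven-row data set, the feature keys `gl` and `fat`, the threshold of 0.01, and the choice to store the fraction of diabetic samples at each leaf are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Entropy (bits) of the 'label' field over a list of sample dicts."""
    total = len(rows)
    counts = Counter(r["label"] for r in rows)
    return -sum(n / total * log2(n / total) for n in counts.values())


def split_by(rows, feature):
    """Group rows by the value they take for the given feature."""
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r)
    return groups


def gain(rows, feature):
    """Information gain of splitting rows on the given feature."""
    groups = split_by(rows, feature)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder


def build_tree(rows, features, threshold=0.01):
    """Return a nested dict; leaves are the fraction of samples with label 1."""
    positive_rate = sum(r["label"] for r in rows) / len(rows)
    if not features or positive_rate in (0.0, 1.0):
        return positive_rate
    best = max(features, key=lambda f: gain(rows, f))
    if gain(rows, best) < threshold:  # stop when the best gain is very small
        return positive_rate
    remaining = [f for f in features if f != best]
    children = {v: build_tree(g, remaining, threshold)
                for v, g in split_by(rows, best).items()}
    return {"feature": best, "default": positive_rate, "children": children}


def predict(tree, sample):
    """Walk the tree; fall back to the node's own rate for unseen values."""
    while isinstance(tree, dict):
        child = tree["children"].get(sample[tree["feature"]])
        if child is None:
            return tree["default"]
        tree = child
    return tree


# Hypothetical training rows (label 1 = has diabetes).
rows = [
    {"gl": "high", "fat": "high", "label": 1},
    {"gl": "high", "fat": "high", "label": 1},
    {"gl": "high", "fat": "low", "label": 0},
    {"gl": "low", "fat": "high", "label": 0},
    {"gl": "low", "fat": "low", "label": 0},
    {"gl": "moderate", "fat": "low", "label": 0},
    {"gl": "low", "fat": "high", "label": 0},
]
tree = build_tree(rows, ["gl", "fat"])
print(tree["feature"], predict(tree, {"gl": "high", "fat": "high"}))
```

Storing a positive-label fraction at each leaf is one simple way to turn the tree's output into a probability-like risk value; the text does not specify how the reported risk percentage is derived.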

Diabetes risk index prediction: based on the user information, the decision tree (ID3) model is applied to predict the diabetes risk index or the probability of having the disease.

The technical solution of the present application will be described below with reference to specific examples.

Example: a 40-year-old woman, 164 cm tall and weighing 58 kg, records her eating behavior for one week. The eating behavior is analyzed, the intake of each nutrient is counted, and the diabetes risk index is predicted using the intake levels (high, moderate, low) of GL, protein, and fat as features.

1. Calculate the daily nutrient intake: from the recorded daily food materials and their weights, the protein and fat contents are looked up in the knowledge base food material table. (Fat is expressed as a percentage while the food material table lists content in grams, so the value must be converted to a percentage for the standard calculation; for example, 100 g of eggs contain 8.8 g of fat, i.e. 8.8/100 = 8.8%.) By weighted calculation, the total daily intakes of protein and fat are obtained and denoted X1 and Y1, respectively.

2. Following step 1, the protein intakes of the remaining six days are calculated and denoted X2, X3, X4, X5, X6, and X7; the fat intakes are denoted Y2, Y3, Y4, Y5, Y6, and Y7. The total protein intake for the week is X = X1 + X2 + ... + X7, and the average daily protein intake is X/7; the total fat intake is Y = Y1 + Y2 + ... + Y7, and the average daily fat intake is Y/7.

3. Label the intake: according to the knowledge base table of recommended daily nutrient intake, the recommended daily protein intake of a 40-year-old female is found to be B1, and the recommended daily fat intake is B2. Comparing X/7 with B1 gives three possibilities: greater than, less than, or equal, corresponding to high, low, or moderate protein intake, respectively. In the same way, fat intake is labeled high, low, or moderate by comparing Y/7 with B2.
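Steps 1 to 3 can be sketched in Python as follows. Only the egg fat content (8.8 g per 100 g) comes from the text; the other table entries, the meal log, and the standard values used for B1 and B2 are hypothetical placeholders, and the function names are chosen for this example.

```python
# Nutrient content in grams per 100 g of food.
# Egg fat (8.8) is from the text; all other numbers are illustrative.
FOOD_TABLE = {
    "egg": {"protein": 13.0, "fat": 8.8},
    "rice": {"protein": 2.5, "fat": 0.5},
}


def daily_intake(meal_log, nutrient):
    """Step 1: weighted sum of grams eaten x content per 100 g."""
    return sum(grams * FOOD_TABLE[food][nutrient] / 100
               for food, grams in meal_log.items())


def average_daily_intake(week_log, nutrient):
    """Step 2: total over the recorded days divided by the number of days."""
    return sum(daily_intake(day, nutrient) for day in week_log) / len(week_log)


def label_intake(average, standard):
    """Step 3: compare the average with the recommended daily intake."""
    if average > standard:
        return "high"
    if average < standard:
        return "low"
    return "moderate"


# Hypothetical week: the same meals every day, weights in grams.
week_log = [{"egg": 100, "rice": 300}] * 7
B1, B2 = 55.0, 60.0  # hypothetical recommended daily protein and fat intake (g)

protein_label = label_intake(average_daily_intake(week_log, "protein"), B1)
fat_label = label_intake(average_daily_intake(week_log, "fat"), B2)
print(protein_label, fat_label)
```

Exact equality with the standard value will rarely occur with real measurements, so a practical implementation might treat a tolerance band around the standard as "moderate"; the text itself only describes the three-way comparison.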

GL = carbohydrate (g) × GI/100. For example, the carbohydrate content of 100 grams of watermelon is 7.5 grams and the glycemic index of watermelon is 72, so its glycemic load (GL) is 7.5 × 72/100 = 5.4; the glycemic load (GL) of 500 grams of watermelon is 37.5 × 72/100 = 27. Foods with GL > 20 are high-GL foods; foods with GL of 10 to 20 are medium-GL foods; foods with GL < 10 are low-GL foods. GL is not benchmarked against a recommended daily intake. Instead, the GL value of each food material is looked up in turn in the knowledge base food material table and compared against these intervals: if the GL value of the food material is 22, the result is high; if the GL value is 15, the result is moderate; if the GL value is 2, the result is low. The GL value in the food material table corresponds to 100 g of food and is scaled in proportion to the weight of the food material the user actually ate. The decision tree (ID3) algorithm is then used to predict the diabetes risk index.
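The watermelon calculation and the GL intervals above can be reproduced with a short Python sketch. The function names are hypothetical; the carbohydrate content, glycemic index, and interval boundaries are the ones given in the text.

```python
def glycemic_load(carbs_per_100g, gi, grams=100.0):
    """GL = carbohydrate actually eaten (g) x GI / 100, scaled by food weight."""
    carbs = carbs_per_100g * grams / 100.0
    return carbs * gi / 100.0


def gl_category(gl):
    """GL > 20 is high, 10 to 20 is moderate, below 10 is low."""
    if gl > 20:
        return "high"
    if gl >= 10:
        return "moderate"
    return "low"


# Watermelon: 7.5 g carbohydrate per 100 g, glycemic index 72.
gl_100g = glycemic_load(7.5, 72)
gl_500g = glycemic_load(7.5, 72, grams=500)
print(gl_100g, gl_category(gl_100g))
print(gl_500g, gl_category(gl_500g))
```

The same food moves from the low interval at 100 g to the high interval at 500 g, which is why the table value is scaled by the weight actually eaten.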

Based on the sample data set (statistics on nutrient intake over 60 consecutive days), the selected features are: "age", "obesity", "sugar intake status", "protein intake status", "fat intake status", and "family history of diabetes"; the number of samples is 800. The class labels fall into two categories: diabetic and non-diabetic.

The calculation steps are illustrated as follows:

1) Calculate the information entropy required to classify the given samples (the smaller the information entropy, the smaller the uncertainty, the greater the certainty, and the higher the purity of the information);

according to the sample set, the number of people with diabetes is 123, and the number of people without diabetes is 800-.
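As a worked instance of this step, and assuming that the remaining 800 − 123 = 677 samples are the ones labeled as not having diabetes (the two-class labeling implies this, but the count is an inference rather than a stated figure), the empirical entropy of the sample set with a base-2 logarithm would be approximately:

$$i = H(D) = -\frac{123}{800}\log_2\frac{123}{800} - \frac{677}{800}\log_2\frac{677}{800} \approx 0.4153 + 0.2038 \approx 0.619$$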

2) Calculate the information entropy and information gain of each feature (age, obesity, GL intake, protein intake, fat intake, and family genetic history of diabetes).

Taking the "GL intake" feature as an example, the intake is divided into three groups: low, high, and moderate. Information entropy for low GL: according to the data set, 177 of the 800 samples have low GL, so the proportion of low-GL samples in the total is P0 = 177/800, and the information entropy i0 of the low-GL group is calculated by formula (2). Similarly, the information entropy i1 of the high-GL group and its proportion P1 of the total samples are calculated, as are the information entropy i2 of the moderate-GL group and its proportion P2 of the total samples. The information entropy of GL intake is: E(GL intake) = P0 × i0 + P1 × i1 + P2 × i2;

3) Calculate the information gain. The information gain of GL intake is G(GL intake) = i − E(GL intake), where i is the information entropy of the whole sample set from step 1). The information gains of the other features (age, obesity, protein intake, fat intake, and family genetic history of diabetes) are calculated in the same way. The results are: information gain of GL: 0.2667; information gain of the genetic history of diabetes: 0.2033; information gain of fat: 0.1624; information gain of obesity: 0.1273; information gain of protein: 0.0968; information gain of age: 0.0183.

According to the information gain results, the feature with the largest gain, GL, is selected as the root node, the features with successively smaller gains are used as child nodes, and the method is called recursively on the child nodes to construct the decision tree until the information gain of all features is very small or no feature remains to be selected, finally yielding the decision tree. In particular, the decision tree model is optimized using a random forest.
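The root selection from the reported information gains can be reproduced with a few lines of Python. The gain values are those listed in the example; the dictionary and variable names are chosen for this sketch.

```python
# Information gains reported in the example above.
information_gain = {
    "GL": 0.2667,
    "genetic history of diabetes": 0.2033,
    "fat": 0.1624,
    "obesity": 0.1273,
    "protein": 0.0968,
    "age": 0.0183,
}

# Rank features from largest to smallest gain; the first becomes the root node.
ranked = sorted(information_gain, key=information_gain.get, reverse=True)
root = ranked[0]
print(root, ranked[1:])
```

In a full ID3 build the gains are recomputed on each child subset after every split, so this global ranking fixes only the root; the order of deeper nodes can differ from branch to branch.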

Based on the information entered by the user, obesity is calculated from height and weight (result: no), and the family genetic history of diabetes is recorded (result: no); according to the statistical results of step 3, protein intake is moderate, fat intake is high, and GL is high. Using the decision tree model, the user's risk of diabetes is predicted to be 56.9%.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit or a module is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium.

Based on such understanding, the technical solution of the present application, or the portions thereof that contribute over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A model construction method for predicting diabetes based on eating habits is characterized by comprising the following steps:

obtaining a first sample set comprising food material information of a sample multi-day meal;

extracting a plurality of data in the first sample set, and forming a second sample set by using the data as features;

dividing the second sample set into a training set and a verification set, and taking the training set as the input of a decision tree model;

and training the decision tree model until the information gain of the features is lower than a threshold value to obtain the decision tree model.

2. The method of claim 1, wherein the first sample set further comprises age, gender, weight, height, past history of diabetes, and history of allergies of the sample population.

3. The method of claim 1, wherein the nutrient intake is calculated from the age, sex, and multi-day meal food material information of each sample in the first sample set, and the nutrient intake is used as a feature of the second sample set.

4. The method of claim 3, wherein the nutrients comprise protein, fat, and glucose.

5. The method of claim 3 or 4, wherein the features of the second sample set include age, glycemic load, fat, protein, obesity, and genetic history of diabetes.

6. The method of any one of claims 1-4, wherein the glycemic load is used as the root node of the decision tree model.

7. A system for predicting diabetes based on eating habits is characterized by comprising an acquisition module, a matching module, a calculation module and a decision tree model,

the acquisition module is used for acquiring the age, sex, weight, height, past history of diabetes, allergy history and food material information of multi-day meals of a user;

the matching module is used for searching the daily intake of nutrients and the content of the nutrients according to the gender and the age of the user;

the calculation module is used for carrying out weighted calculation on the content of the nutrients retrieved by the matching module and comparing the content of the nutrients with the daily intake to obtain the characteristics of the intake of the nutrients;

the decision tree model predicts the probability of the user suffering from diabetes based on the nutrient intake characteristics.

8. The system for predicting diabetes based on eating habits according to claim 7, wherein the decision tree model comprises a model constructed by the method for constructing a model for predicting diabetes based on eating habits according to any one of claims 1 to 6.

9. The system for predicting diabetes based on eating habits of claim 8, wherein the nutrients are glucose, fat, and protein.

10. The system for predicting diabetes based on eating habits of claim 9, wherein the decision tree model is optimized using a random forest.