CN109300549B

CN109300549B - Food-disease association prediction method based on disease weighting and food category constraint

Info

Publication number: CN109300549B
Application number: CN201811180791.5A
Authority: CN
Inventors: 王嫄; 张耀功; 陈赠光; 王靖寰; 杨巨成; 赵青; 陈亚瑞; 孔娜; 王洁
Original assignee: Tianjin University of Science and Technology
Current assignee: Tianjin University of Science and Technology
Priority date: 2018-10-09
Filing date: 2018-10-09
Publication date: 2020-03-17
Anticipated expiration: 2038-10-09
Also published as: CN109300549A

Abstract

The invention relates to a food-disease association prediction method based on disease weighting and food category constraint, which comprises the following steps: constructing a disease weighting relation by using international disease classification data; constructing a food similarity network by using the ingredient list; constructing a food group relation by using a food classification system; constructing a known binary food-disease association network; randomly initializing the representations of food and disease in the latent space; introducing the disease weighting relation and the food group relation, and learning the latent-space representations of food and disease; and outputting the predicted food-disease association result by using the latent-space representations of food and disease. The method is reasonably designed, overcomes the sparseness of food-disease association data, and improves the accuracy of the food-disease association prediction model; at the same time, the computation time complexity of the model is linear in the number of foods in a food group, which reduces computational complexity and the consumption of computing resources.

Description

Food-disease association prediction method based on disease weighting and food category constraint

Technical Field

The invention belongs to the technical field of food safety, and particularly relates to a food-disease association prediction method based on disease weighting and food category constraint.

Background

With the improvement of residents' purchasing power and growing health awareness, people are no longer satisfied with basic material needs, and their requirements for quality of life and healthy living keep rising. Most typically, there is an increasing demand for healthy dietary guidance. Diet has been shown to be closely related to the occurrence and development of diseases, and the relationship is often surprising and far-reaching. For example, a diet based mainly on animal foods can lead to chronic diseases (such as obesity, coronary heart disease, tumors and osteoporosis), whereas a diet based on plant foods is most beneficial to health and most effective in preventing and controlling chronic diseases.

To study these relationships, relevant data are usually obtained by statistical analysis of local population samples, questionnaires, oral accounts, or in vivo experiments. However, these ways of acquiring associations consume huge manpower and material resources; in vivo experiments with high confidence in particular carry huge risk, and it is difficult to meet people's need to be informed about detailed food-disease associations. The typical risks lie mainly in erroneous information filled in by questionnaire respondents, biased statistics of the indicators in the questionnaire, and the fact that respondents are affected by a combination of many factors rather than by a single food variable. The handling of in vivo experiments by the experimenters is a further source of risk. Meanwhile, with the rapid growth in the variety of foods, the cost of experiments and surveys increases exponentially; because manpower and material resources are limited, such empirical research cannot be updated in time and can only focus on a few diseases and a few food categories. Furthermore, the fine-grained relationships between disease and the amount of food eaten and the eating method are unclear, and global statistics are extremely difficult because of the huge number of variables, even though fine-grained analysis of amount and eating method is an important aspect of determining what specifically causes disease.

In summary, the association between food and disease is currently an area of active concern. At present, no prediction method with high confidence and guiding significance exists for associations across a wide range of foods and diseases. How to provide guidance for research on disease-food associations, narrow the scope of investigation, and reduce the heavy consumption of manpower and resources caused by undirected trial experiments is a problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a food-disease association prediction method based on disease weighting and food category constraint, which combines the hierarchical relationship of the categories of diseases with the action of food groups through a computer prediction method of the relevance of food and diseases, enhances the robustness of food disease association prediction and overcomes the problem of sparseness of food disease association data.

The technical problem to be solved by the invention is realized by adopting the following technical scheme:

a food-disease association prediction method based on disease weighting and food category constraint comprises the following steps:

step 1, constructing a disease weighting relation by using international disease classification data;

step 2, constructing a food similarity network by using the ingredient table;

step 3, constructing a food group relation by using a food classification system;

step 4, constructing a known binary food-disease association network;

step 5, randomly initializing the representations of food and disease in the latent space;

step 6, introducing the disease weighting relation and the food group relation, and learning the latent-space representations of food and disease;

step 7, outputting the predicted food-disease association result by using the latent-space representations of food and disease.

Further, the specific implementation method of step 1 is as follows:

first, a disease relevance matrix S₁ is constructed using international disease classification data: if two diseases are in a parent-child relationship in the international disease classification, the element (S₁)_ij of the disease relevance matrix is set to 1, otherwise (S₁)_ij = 0, wherein disease i is the parent of disease j, and disease j is a specific subclass of disease i;

then, depth_(i,j) is defined as the depth of parent node i, and the weight C(depth_(i,j)) of the edge formed by parent node i and child node j is defined as follows:

C(depth_(i,j))＝1+log(depth_(i,j))

finally, weighted based on the hierarchy, the matrix of relevance of the disease is represented as follows:

(S′₁)_ij = (S₁)_ij * C(depth_(i,j)).

Further, the specific implementation method of step 2 is as follows: in the food similarity network, each node is a 'food-amount-eating method' combination; when the 'amount-eating method' of two nodes differs, the relation between them is set to 0; when the 'amount-eating method' is the same, the similarity between the two foods is calculated from the food ingredient table with the cosine formula and used as the node relation value, yielding the food similarity network S₂.

Further, the specific implementation method of step 3 is as follows: food relations are constructed according to the food classification system stipulated by the state, using its 20 specific classes; foods with the same amount, the same eating method and the same class are placed in one group, i.e., each element is a 'food name-amount-eating method' triple; foods are classified into different food groups according to their properties and component ratios.

Further, the specific implementation method of step 4 is as follows: the known food-disease associations are represented by a binary matrix R_(n×m); 'food name-amount-eating method' is used as the refined food item, and modeling is carried out on 'food name-amount-eating method-disease' quadruples; a quadruple with a verified association is set to 1, otherwise 0; the rows of the matrix are the n 'food name-amount-eating method' items and the columns are the m diseases.

Further, the specific implementation method of step 5 is as follows: the latent-space representations of food and disease, U_(n×K) and V_(K×m), are randomly initialized by assigning an arbitrary number between 0 and 1 to each entry of the two matrices.

Further, the specific implementation method of step 6 is as follows:

decomposing the food-disease association matrix R into the product of the food vector U and the disease vector V, the decomposition objective function is defined as:

introducing the hierarchical relationship after disease weighting, two diseases in an adjacent parent-child relationship are constrained to remain close to each other in the latent space:

wherein tr(·) denotes the trace of the matrix in parentheses; S′₁ is the symmetric matrix of the diseases; D′₁ is the diagonal matrix with (D′₁)_ii = ∑_j (S′₁)_ij; the graph Laplacian is L₁ = D′₁ - S′₁; ||A||² is the L₂ regularization value of matrix A; V_·i and V_·j are the column vectors of the i-th and j-th columns of matrix V; A^T denotes the transpose of matrix A;

applying the common graph laplacian operator to the food similarity:

wherein S₂ is the food similarity network; (D₂)_ii = ∑_j (S₂)_ij, i.e., D₂ is a diagonal matrix whose diagonal elements are the row sums of S₂; and L₂ = D₂ - S₂;

introducing the food group relation, the geometric center point in the latent space of all foods in a group is taken as the group center point, and all members of the group should be close to the group center point; in each iteration, the center point of each group is calculated from the U and V of the previous iteration, and these points are treated as fixed variables in the current iteration; the group-center constraint is expressed as follows:

wherein the constraint is defined in terms of the jth element of food group G, the geometric center of food group G, and the Euclidean distance between member j of group G and the center point of its group; R₀, R₁ and R₂ are merged into the basic matrix decomposition objective, and the objective function is obtained as follows:

wherein λ₀, λ₁ and λ₂ are manually specified parameters with the following value ranges: λ₀ and λ₁ are selected from the set {0, 0.001, 0.01, 0.1, 1, 10, 100, 1000}, and λ₂ is selected from the set {1, 10, 100, 1000}; the latent-space representations U and V of food and disease are then obtained by solving with a gradient descent method.

Further, the specific implementation method of step 7 is as follows: the dot product of the i-th row of the food latent-space representation U and the j-th column of the disease latent-space representation V gives the predicted relation value between 'food name-amount-eating method' item i and disease j.

The invention has the advantages and positive effects that:

1. Under a matrix decomposition framework, the invention takes into account the hierarchical relationship of disease classification and the group relationship of food categories, and applies a weighting strategy and a group-center strategy: the disease weighting relation is calculated from the disease classification hierarchy, and food groups are constructed from food classification information and used as a prior constraint when modeling disease-food associations. This overcomes the sparseness of food-disease association data, uses prior knowledge to strengthen the robustness of prediction, and improves the accuracy of the food-disease association prediction model. Meanwhile, the group-center concept defined by the invention makes the computation time complexity of the model linear in the number of foods in a food group, which reduces computational complexity and the consumption of computing resources.

2. The invention combines the hierarchical relationship of disease categories with the effect of food groups, which helps to identify new food-disease associations and can further guide research on healthy diets; the latent-space representations of food and disease obtained by the invention can also be widely applied in other research related to food and disease.

Drawings

FIG. 1 is an overall process flow diagram of the present invention;

fig. 2 is a flowchart of the algorithm of step 6 of the present invention.

Detailed Description

The embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The design idea of the invention is as follows: in the fields of nutrition and food safety, using matrix decomposition and semantic space theory and techniques from machine learning, food-disease association matrix decomposition is taken as the basic framework, and two kinds of prior knowledge are introduced, namely the weighting relation contained in the disease classification hierarchy and food group information. The invention does not consider food-borne diseases caused by pathogenic agents, and is based on the following two assumptions: (1) the higher the level of a disease in the classification hierarchy, the more abstract its meaning; the lower its level, the more specific its meaning. (2) Foods are classified into different groups according to their properties and component ratios; a food group reveals similarity information at the food level, and the nutrients that foods of a group provide to the human body after being eaten, and the effects of their contents, are related. Accordingly, the method constructs a loss function and introduces the group center in the concrete solving process, thereby reducing the computation time complexity of the model, reducing the consumption of computing resources, and overcoming the sparseness of the food-disease associations obtained by earlier exploration.

Based on the above design concept, the food-disease association prediction method of the present invention, as shown in fig. 1, includes the following steps:

Step 1: constructing a disease weighting relation by using international disease classification data.

In this step, a disease relevance matrix S₁ is constructed using the international disease classification data: if two diseases are in a parent-child relationship in the international disease classification, for example if disease i is the parent of disease j and disease j is a specific subclass of disease i, then (S₁)_ij = 1; otherwise (S₁)_ij = 0.

Further, the disease classification has a hierarchical tree structure: general diseases at higher levels are loosely related, while specific diseases at deeper levels are more closely related. To better capture this feature of the disease data, the invention introduces the variable depth_(i,j) and an auxiliary function C(depth_(i,j)). depth_(i,j) denotes the depth of parent node i; if parent node i is the root node, depth_(i,j) = 1.

C(depth_(i,j)) is the depth-based weight of the edge formed by parent node i and child node j. It is defined as follows:

C(depth_(i,j))＝1+log(depth_(i,j))

after weighting based on the hierarchy, the correlation matrix of the disease is represented as follows:

(S′₁)_ij＝(S₁)_ij*C(depth_(i,j)).
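The weighting in step 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the `parent_of` mapping, the function name `weighted_disease_matrix` and the three placeholder disease names are hypothetical, and the natural logarithm is assumed because the text does not state the base of log.

```python
import math
import numpy as np

def weighted_disease_matrix(parent_of, diseases):
    """Build S1' from a child -> parent mapping (hypothetical input format)."""
    index = {d: k for k, d in enumerate(diseases)}

    def depth(node):
        # Depth of a node in the classification tree; the root has depth 1.
        d = 1
        while node in parent_of:
            node = parent_of[node]
            d += 1
        return d

    s1_weighted = np.zeros((len(diseases), len(diseases)))
    for child, parent in parent_of.items():
        i, j = index[parent], index[child]
        # (S1)_ij = 1 for a parent-child edge, scaled by C = 1 + log(depth of parent i).
        s1_weighted[i, j] = 1.0 * (1.0 + math.log(depth(parent)))
    return s1_weighted

# Placeholder hierarchy: root -> middle -> leaf (not real classification codes).
diseases = ["circulatory", "heart disease", "coronary heart disease"]
parent_of = {"heart disease": "circulatory",
             "coronary heart disease": "heart disease"}
S1w = weighted_disease_matrix(parent_of, diseases)
```

An edge whose parent is the root keeps weight 1, and edges deeper in the tree receive larger weights, which matches the stated intent that specific diseases are more closely related than general ones.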

Step 2: constructing a food similarity network by using the ingredient list.

In the food similarity network, each node is a "food-amount-eating method" combination. First, each food is expressed as an ingredient vector, i.e., the values of calories, dietary fibre, calcium, magnesium, iron, manganese, zinc, etc. per 100 grams of the edible part of the food; each dimension represents the amount of one component contained in 100 g of the food. Second, when the "amount-eating method" of two nodes differs, the relation between them is set to 0; when the "amount-eating method" is the same, the cosine formula is applied to the food vectors to calculate the similarity between the two foods as the node relation value, yielding the food similarity network S₂. The cosine formula used here is as follows:

cos(a, b) = (a · b) / (||a|| ||b||)

wherein a and b are the ingredient vectors of the two foods.
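The similarity rule of step 2 can be sketched as below. The function name `food_similarity_network`, the toy two-component ingredient vectors and the context tuples are illustrative assumptions; setting the diagonal (self-similarity) to 0 is also an assumption, since the text does not specify it.

```python
import numpy as np

def food_similarity_network(vectors, contexts):
    """Cosine similarity between nodes that share the same (amount, eating method)."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    n = len(X)
    S2 = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Different "amount-eating method" -> relation is 0 (also skip i == j).
            if i == j or contexts[i] != contexts[j]:
                continue
            denom = norms[i] * norms[j]
            if denom > 0:
                S2[i, j] = X[i] @ X[j] / denom
    return S2

# Toy ingredient vectors (amount of two components per 100 g) and node contexts.
vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 0.0]]
contexts = [("100 g", "raw"), ("100 g", "raw"), ("100 g", "raw"), ("200 g", "fried")]
S2 = food_similarity_network(vectors, contexts)
```

Nodes 0 and 1 point in the same direction and share a context, so their similarity is 1; node 3 has the same ingredient profile as node 0 but a different amount and eating method, so its relation to node 0 stays 0.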

Step 3: constructing food group relations by using a food classification system.

In this step, food relations are constructed according to the food classification system stipulated by the state, using its 20 specific classes: grains and grain products, edible oils, meats and meat products, sterilized fresh milk, dairy products, aquatic products, canned foods, sugar, cold foods, beverages, distilled or prepared liquors, fermented liquors, seasonings, bean products, cakes, confectionery, pickles, health foods (according to the "health food management method"), new resource foods (according to the "new resource food management sanitation method"), and other foods. In the invention, foods with the same amount, the same eating method and the same class are placed in one group, i.e., each element is a "food-amount-eating method" triple.

Foods are classified into different groups according to their properties and component ratios; a food group reveals similarity information at the food level, and the nutrients that foods of a group provide to the human body after being eaten, and the effects of their contents, are related. Thus, one heuristic is that all foods in a group are functionally similar and may cause similar or identical diseases. Each group has exactly one group center.

The invention does not require every pair of foods in a group to remain close in vector space, since the computational complexity would then be quadratic in the number of foods in the group. Instead, the invention introduces the concept of a group center, using the geometric center point in the latent space of all foods in a group as the group center point. The constraint involves the latent-space vector of the jth element of food group G, the geometric center of food group G, and the Euclidean distance between member j of group G and the center point of its group.
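A minimal sketch of the group-center idea follows. The function name `group_center_penalty`, the group names and the toy latent vectors are hypothetical, and summing squared Euclidean distances is an assumption, because the exact form of the constraint is not reproduced in the text.

```python
import numpy as np

def group_center_penalty(U, groups):
    """Return each group's geometric center and the summed squared distances to it."""
    centers = {}
    total = 0.0
    for name, members in groups.items():
        center = U[members].mean(axis=0)        # geometric center of the group
        centers[name] = center
        # One distance per member: cost is linear in the group size,
        # instead of quadratic for all pairwise distances.
        total += float(np.sum((U[members] - center) ** 2))
    return centers, total

# Toy latent vectors for three food nodes (K = 2) and two illustrative groups.
U = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
groups = {"grains": [0, 1], "dairy": [2]}
centers, penalty = group_center_penalty(U, groups)
```

Here the "grains" center is (1, 0), each of its two members lies at distance 1 from it, and the single-member "dairy" group contributes nothing, so the total penalty is 2.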

Step 4: constructing a known binary food-disease association network.

In this step, the known food-disease associations are represented by the binary matrix R_(n×m). Here the association between food and disease is refined: "food name-amount-eating method" is introduced as the refined food item, modeling is performed on "food name-amount-eating method-disease" quadruples, and a quadruple with a verified association is set to 1, otherwise 0. The rows of the matrix are the n "food name-amount-eating method" items, and the columns are the m diseases. Verification here relies mainly on data from scientific research papers. When the experimental results of newer and older papers differ, the more recently published journal paper with the higher impact factor is taken as the standard for deciding whether a "food-amount-eating method" item is associated with a "disease".
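The construction of R can be sketched as follows. The function name `build_association_matrix` and all food, amount, method and disease labels are placeholders chosen for illustration; they do not represent verified findings.

```python
import numpy as np

def build_association_matrix(food_items, diseases, verified):
    """Binary n x m matrix: rows are (name, amount, method) triples, columns are diseases."""
    row = {item: r for r, item in enumerate(food_items)}
    col = {d: c for c, d in enumerate(diseases)}
    R = np.zeros((len(food_items), len(diseases)), dtype=int)
    for name, amount, method, disease in verified:
        # A verified quadruple is set to 1; everything else stays 0.
        R[row[(name, amount, method)], col[disease]] = 1
    return R

food_items = [("food A", "100 g", "raw"),
              ("food A", "100 g", "fried"),
              ("food B", "50 g", "boiled")]
diseases = ["disease X", "disease Y"]
verified = [("food A", "100 g", "fried", "disease X"),
            ("food B", "50 g", "boiled", "disease Y")]
R = build_association_matrix(food_items, diseases, verified)
```

The same food name appears in two rows because the eating method differs, which is the refinement the step describes.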

Step 5: randomly initializing the representations of food and disease in the latent space.

The specific implementation method of this step is as follows: the latent-space representations of food and disease, U_(n×K) and V_(K×m), are randomly initialized, i.e., each entry of the two matrices is assigned an arbitrary number between 0 and 1.
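As a short sketch of this initialization, assuming a uniform distribution on [0, 1) (the text only says "any number between 0 and 1") and a hypothetical helper name `init_latent`:

```python
import numpy as np

def init_latent(n, m, K, seed=0):
    """Draw U (n x K) and V (K x m) with entries uniformly distributed in [0, 1)."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(0.0, 1.0, size=(n, K))
    V = rng.uniform(0.0, 1.0, size=(K, m))
    return U, V

U0, V0 = init_latent(n=5, m=3, K=2)
```

Fixing the seed makes a run reproducible; K, the dimension of the latent space, is a free parameter that the text does not fix.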

Step 6: introducing the disease weighting relation and the food group relation, and learning the latent-space representations of food and disease. The specific implementation method of this step is shown in fig. 2.

On the basis of matrix decomposition, the method considers the following two constraints that external knowledge places on modeling the latent-space representations of food and disease: (1) the higher the level of a disease in the classification hierarchy, the more abstract its meaning; the lower its level, the more specific its meaning. (2) Foods are classified into different groups according to their properties and component ratios; a food group reveals similarity information at the food level, and the nutrients that foods of a group provide to the human body after being eaten, and the effects of their contents, are related. The basic matrix decomposition model is explained first, and then the weighting relation constraint and the food group constraint are introduced step by step.

According to the matrix decomposition method, the food-disease association matrix R is decomposed into the product of a food vector U and a disease vector V, and then a decomposition objective function is defined as:

By introducing the hierarchical relationship after disease weighting, the invention constrains two diseases in an adjacent parent-child relationship to remain close to each other in the latent space.

Here tr(·) denotes the trace of the matrix in parentheses, i.e., the sum of the elements on the main diagonal (from top left to bottom right). S′₁ is the symmetric matrix of the diseases. The invention defines a diagonal matrix D′₁ with (D′₁)_ii = ∑_j (S′₁)_ij,

i.e., each diagonal entry is the sum of the corresponding row of S′₁. The invention defines L₁ = D′₁ - S′₁, i.e., the graph Laplacian. ||A||² is the L₂ regularization value of matrix A, i.e., the sum of the squares of all non-zero elements of A. V_·i and V_·j are the column vectors of the i-th and j-th columns of matrix V. A^T denotes the transpose of matrix A, i.e., the row and column indices of the elements of A are exchanged; the superscript T has the same meaning wherever it appears.
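A regularizer consistent with these definitions can be sketched as below, assuming the standard graph-Laplacian form in which half the weighted sum of squared distances between columns of V equals tr(V L₁ V^T). The factor 1/2, the helper names and the toy matrix are assumptions; the sketch only checks the identity numerically for a symmetric weight matrix.

```python
import numpy as np

def laplacian(S):
    """Graph Laplacian L = D - S, where D holds the row sums of S on its diagonal."""
    return np.diag(S.sum(axis=1)) - S

def pairwise_penalty(V, S):
    """0.5 * sum_ij S_ij * ||V[:, i] - V[:, j]||^2, computed pair by pair."""
    m = V.shape[1]
    return 0.5 * sum(S[i, j] * np.sum((V[:, i] - V[:, j]) ** 2)
                     for i in range(m) for j in range(m))

def trace_penalty(V, S):
    """The same quantity written as tr(V L V^T)."""
    return float(np.trace(V @ laplacian(S) @ V.T))

# Symmetric toy weights between three diseases (values are illustrative).
S1_sym = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.7],
                   [0.0, 1.7, 0.0]])
V_demo = np.random.default_rng(0).uniform(0.0, 1.0, size=(2, 3))
lhs = pairwise_penalty(V_demo, S1_sym)
rhs = trace_penalty(V_demo, S1_sym)
```

The trace form avoids the explicit double loop, and the same construction applies to the food side with S₂ and the rows of U.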

Because there is no hierarchical structure between foods, the invention applies the common graph laplacian operator to the similarity of foods:

wherein S₂ is defined in step 2. (D₂)_ii = ∑_j (S₂)_ij, i.e., D₂ is a diagonal matrix whose diagonal elements are the row sums of S₂, and L₂ = D₂ - S₂.

Introducing the food group relation, the invention takes the geometric center point in the latent space of all foods in a group as the group center point, and all members of the group should be close to the group center point. In each iteration, the invention computes the center point of each group from the U and V of the previous iteration, and these points are treated as fixed variables in the current iteration. The group-center constraint can be expressed as follows:

wherein the constraint is defined in terms of the jth element of food group G, the geometric center of food group G (as specifically defined in step 3), and the Euclidean distance between member j of group G and the center point of its group. R₀, R₁ and R₂ are merged into the basic matrix decomposition objective, and the complete objective function is obtained as follows:

wherein ||A||² is the L₂ regularization value of matrix A, i.e., the sum of the squares of all non-zero elements of A. λ₀ and λ₁ balance the influence of the group constraint and the weighting relation, and λ₂ controls model complexity to avoid overfitting. λ₀ and λ₁ are selected from the set {0, 0.001, 0.01, 0.1, 1, 10, 100, 1000} and λ₂ from the set {1, 10, 100, 1000} by grid search, i.e., all combinations of λ₀, λ₁ and λ₂ are traversed to find the best parameters of the model. Using gradient descent and the Lagrangian method, the following iterative update formulas for U_ik and V_kj are obtained:
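The grid search over the three parameters can be sketched as follows. The candidate sets are those given in the text; the `grid_search` helper and the `toy_score` function are hypothetical stand-ins for training the model and measuring its quality on held-out associations.

```python
from itertools import product

LAMBDA0 = [0, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
LAMBDA1 = [0, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
LAMBDA2 = [1, 10, 100, 1000]

def grid_search(score):
    """Try every (lambda0, lambda1, lambda2) combination and keep the best score."""
    best_score, best_params = None, None
    for params in product(LAMBDA0, LAMBDA1, LAMBDA2):
        s = score(*params)
        if best_score is None or s > best_score:
            best_score, best_params = s, params
    return best_score, best_params

def toy_score(lam0, lam1, lam2):
    # Placeholder objective with a known optimum; a real score would train
    # the model with these parameters and evaluate it.
    return -abs(lam0 - 0.1) - abs(lam1 - 1) - abs(lam2 - 10)

best_score, best_params = grid_search(toy_score)
n_combinations = len(list(product(LAMBDA0, LAMBDA1, LAMBDA2)))
```

With 8 × 8 × 4 candidates the search evaluates 256 combinations, so each evaluation should be cheap or the runs should be parallelized.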

wherein Ψ = (Y_i· ⊙ (U_i· V))(V^T)_·k, and A ⊙ B denotes the element-wise product of matrices A and B, i.e., multiplication of the entries at corresponding positions.

Based on the objective function O₁, U and V are solved as follows:

(1)O′₁←O₁；

(2) calculate each group center:

(3) update U and V using the iterative update formulas for U_ik and V_kj, respectively;

(4) compute the new objective function value O₁ according to the formula for O₁;

(5) if |O′₁ - O₁| < ε, stop the loop and output U and V; otherwise set ite ← ite + 1, and if ite ≥ Max_ites, stop the loop and output U and V; otherwise repeat (1)-(4).
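The loop structure of steps (1)-(5) can be sketched as below. This is a simplified stand-in, not the update rule of the invention: it assumes an objective made only of the squared reconstruction error, a squared group-center penalty and an L₂ term, omits the two Laplacian terms and the matrix Y, and replaces the iterative update formulas (not reproduced in the text) with plain projected gradient descent that keeps U and V non-negative. All names, the step size and the toy data are hypothetical.

```python
import numpy as np

def group_centers(U, groups):
    """Matrix whose j-th row is the geometric center of food j's group."""
    C = U.copy()                       # ungrouped foods act as their own center
    for members in groups:
        C[members] = U[members].mean(axis=0)
    return C

def objective(R, U, V, C, lam0, lam2):
    fit = np.sum((R - U @ V) ** 2)                 # reconstruction error
    group = np.sum((U - C) ** 2)                   # distance to group centers
    l2 = np.sum(U ** 2) + np.sum(V ** 2)           # model complexity
    return fit + lam0 * group + lam2 * l2

def fit_latent(R, groups, K=2, lam0=0.1, lam2=1.0, lr=0.005,
               eps=1e-6, max_ites=500, seed=0):
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.uniform(0.0, 1.0, (n, K))
    V = rng.uniform(0.0, 1.0, (K, m))
    O1 = objective(R, U, V, group_centers(U, groups), lam0, lam2)
    history = [O1]
    for ite in range(max_ites):
        O1_prev = O1                               # (1) O'1 <- O1
        C = group_centers(U, groups)               # (2) centers, fixed this iteration
        E = R - U @ V
        grad_U = -2 * E @ V.T + 2 * lam0 * (U - C) + 2 * lam2 * U
        grad_V = -2 * U.T @ E + 2 * lam2 * V
        U = np.maximum(U - lr * grad_U, 0.0)       # (3) update, projected onto U >= 0
        V = np.maximum(V - lr * grad_V, 0.0)       #     and V >= 0
        O1 = objective(R, U, V, group_centers(U, groups), lam0, lam2)  # (4)
        history.append(O1)
        if abs(O1_prev - O1) < eps:                # (5) convergence check
            break
    return U, V, history

R_toy = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 1, 0],
                  [0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]], dtype=float)
U_fit, V_fit, history = fit_latent(R_toy, groups=[[0, 1], [2, 3], [4, 5]])
```

Treating the centers as constants inside each iteration keeps the gradient with respect to U simple, which mirrors the statement that the center points are fixed variables in the current iteration.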

Step 7: outputting the predicted food-disease association result by using the latent-space representations of food and disease.

The dot product of the i-th row of the food latent-space representation U and the j-th column of the disease latent-space representation V gives the predicted relation value between "food-amount-eating method" item i and disease j.
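The prediction in step 7 reduces to a dot product, as in this short sketch; the `predict` helper and the toy matrices are illustrative only.

```python
import numpy as np

def predict(U, V, i, j):
    """Predicted relation value between food item i and disease j."""
    return float(U[i] @ V[:, j])

# Toy latent representations: 2 food items, 3 diseases, K = 2.
U_toy = np.array([[0.5, 1.0],
                  [0.0, 2.0]])
V_toy = np.array([[1.0, 0.0, 2.0],
                  [0.5, 1.0, 0.0]])

score_0_2 = predict(U_toy, V_toy, 0, 2)   # 0.5 * 2.0 + 1.0 * 0.0
all_scores = U_toy @ V_toy                # every food-disease pair at once
```

Computing `U_toy @ V_toy` yields the full n × m score matrix, which can then be ranked per disease or per food item.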

It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims

1. A food-disease association prediction method based on disease weighting and food category constraint is characterized by comprising the following steps:

step 2, constructing a food similarity network by using the ingredient table;

step 4, constructing a known binary food-disease association network;

step 7, outputting the predicted food-disease association result by using the latent-space representations of food and disease, thereby enhancing the accuracy of food-disease association prediction;

the specific implementation method of the step 6 comprises the following steps:

applying the common graph laplacian operator to the food similarity:

wherein the constraint is defined in terms of the jth element of food group G and the geometric center of food group G;

the objective function is obtained as follows:

s.t.U≥0，V≥0

2. The method of claim 1, wherein the specific implementation method of step 1 is as follows:

C(depth_(i,j))＝1+log(depth_(i,j))

(S′₁)_ij = (S₁)_ij * C(depth_(i,j)).

3. The method of claim 1, wherein the specific implementation method of step 2 is as follows: in the food similarity network, each node is a 'food-amount-eating method' combination; when the 'amount-eating method' of two nodes differs, the relation between them is set to 0; when the 'amount-eating method' is the same, the similarity between the two foods is calculated from the food ingredient table with the cosine formula and used as the node relation value, yielding the food similarity network S₂.

4. The method of claim 1, wherein the specific implementation method of step 3 is as follows: food relations are constructed according to the food classification system stipulated by the state, using its 20 specific classes; foods with the same amount, the same eating method and the same class are placed in one group, i.e., each element is a 'food name-amount-eating method' triple; foods are classified into different food groups according to their properties and component ratios.

5. The method of claim 1, wherein the specific implementation method of step 4 is as follows: the known food-disease associations are represented by a binary matrix R_(n×m); 'food name-amount-eating method' is used as the refined food item, and modeling is carried out on 'food name-amount-eating method-disease' quadruples; a quadruple with a verified association is set to 1, otherwise 0; the rows of the matrix are the n 'food name-amount-eating method' items and the columns are the m diseases.

6. The method of claim 1, wherein the specific implementation method of step 5 is as follows: the latent-space representations of food and disease, U_(n×K) and V_(K×m), are randomly initialized by assigning an arbitrary number between 0 and 1 to each entry of the two matrices.

7. The method of claim 1, wherein the specific implementation method of step 7 is as follows: the dot product of the i-th row of the food latent-space representation U and the j-th column of the disease latent-space representation V gives the predicted relation value between 'food name-amount-eating method' item i and disease j.