CN110853707A

CN110853707A - Gene regulation and control network reconstruction method based on deep learning

Info

Publication number: CN110853707A
Application number: CN201911141752.9A
Authority: CN
Inventors: 张章; 王立飞; 王硕; 陶如意; 牟牧云; 肖镜舒; 张江
Original assignee: Jizhi Academy (beijing) Technology Co Ltd; Beijing Normal University
Current assignee: Jizhi Academy (beijing) Technology Co Ltd; Beijing Normal University
Priority date: 2019-11-20
Filing date: 2019-11-20
Publication date: 2020-02-28

Abstract

The invention discloses a deep-learning-based method for reconstructing a gene regulatory network. From observed time-series data of messenger RNA (mRNA) concentration changes, it reconstructs the network structure of the gene regulatory network, that is, the mutual regulatory relationships between genes. The method provides a data-driven deep learning framework that simultaneously reconstructs the gene regulatory network and simulates gene regulation dynamics. It comprises two jointly trained modules: an adjacency matrix generator, which represents the connection structure of the gene regulatory network, and a dynamics predictor, which predicts the future concentration of each messenger RNA. The model can reconstruct a gene regulatory network with high accuracy, so that the regulatory relationships among genes can be inferred from observational data, which may help in controlling biological traits.

Description

Gene regulation and control network reconstruction method based on deep learning

Technical Field

The invention relates to the interdisciplinary field of deep learning and biological science and can be used to reconstruct gene regulatory networks. The model uses a Gumbel-Softmax mechanism to integrate a plurality of multilayer perceptrons. By simulating the evolution of the gene regulatory network in the forward pass and adjusting the weights of the network generator and the multilayer perceptrons through backpropagation, it reconstructs the structure of the gene regulatory network and simulates its dynamics.

Background

Gene regulatory networks (GRNs) play an important role in cell development and cellular characteristics. Transcription factors (TFs) interact to regulate large numbers of downstream genes, forming a regulatory network. Considerable effort has gone into mapping this network in order to understand the basic principles of biology. A common way to reconstruct gene regulatory networks is through biochemical experiments. However, a gene regulatory network can also be reconstructed by analyzing gene expression time-series data, namely the time series of messenger RNA concentration changes during gene expression. The present method uses deep learning: it simulates the gene regulation process in the forward direction with a neural network and optimizes all parameters of the forward process through backpropagation. It can ultimately infer the structure of the gene regulatory network with high accuracy, and it yields a dynamics predictor that models gene regulation dynamics with high accuracy.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a data-driven method for reconstructing a gene regulatory network from time-series data of messenger RNA concentration changes. The method allows the regulatory relationships among genes to be discovered, gives a clear understanding of the gene regulatory network of an organism, and may thereby help in controlling biological traits.

To achieve this purpose, the invention provides a network reconstruction model based on deep learning. The operating principle of a gene regulatory network can be summarized as follows: gene expression is the process by which a gene is transcribed into RNA and the RNA is translated into protein, and the gene, its messenger RNA and its protein correspond one to one. The regulatory relationship between genes is reflected in the influence of different proteins on the expression of other genes. In this process the messenger RNA concentration is the easiest quantity to measure, so changes in messenger RNA concentration are used to reconstruct the gene regulatory network.

Specifically, the gene regulatory relationships are modeled as a network. Each node is the messenger RNA transcribed from a gene, and the node's state is the concentration of that messenger RNA. If gene A regulates gene B, there is a directed edge from node A to node B.
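A minimal sketch of this representation in plain Python follows. The three gene names, the edges and the concentration values are hypothetical, and the convention that entry (i, j) marks an edge from gene i to gene j is an assumption chosen so that column j lists the regulators of gene j.

```python
# Hypothetical three-gene example; names, edges and values are illustrative only.
genes = ["A", "B", "C"]
index = {name: k for k, name in enumerate(genes)}

# Node state: the messenger RNA concentration of each gene (made-up numbers).
concentration = {"A": 0.8, "B": 0.3, "C": 0.5}

# adjacency[i][j] == 1 means a directed edge from gene i to gene j,
# i.e. gene i regulates gene j (assumed convention).
adjacency = [[0] * len(genes) for _ in genes]
for regulator, target in [("A", "B"), ("C", "B"), ("B", "C")]:
    adjacency[index[regulator]][index[target]] = 1

def regulators_of(target):
    """Read column `target` of the matrix to list the genes that regulate it."""
    j = index[target]
    return [genes[i] for i in range(len(genes)) if adjacency[i][j] == 1]
```

Under this convention, `regulators_of("B")` reads one column of the matrix, which is the same information that is later passed to the dynamics predictor of gene B.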

The implementation can be briefly described as follows. First, a network generator produces an adjacency matrix by Gumbel-Softmax sampling; the adjacency matrix represents the connection pattern of the gene regulatory network. The internal parameters of the adjacency matrix generator are initialized randomly, so at the beginning the generated adjacency matrix does not accurately represent the real gene regulatory network structure. Each element of the adjacency matrix is generated according to the following formula:

where α_ij is the probability that the element in row i and column j of the adjacency matrix is 1; ξ_ij is random noise obtained by drawing a sample u from the uniform distribution on (0, 1) and applying the negative logarithm twice, ξ_ij = -log(-log u); and τ is the temperature parameter.
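The formula itself is not reproduced in this text. A common two-class Gumbel-Softmax expression that is consistent with the symbols defined here is given below as an assumption; it is not necessarily the exact expression of the original filing. The second noise term ξ'_ij is an independent copy of ξ_ij introduced for the "no edge" class.

$$
A_{ij} = \frac{\exp\big((\log \alpha_{ij} + \xi_{ij})/\tau\big)}{\exp\big((\log \alpha_{ij} + \xi_{ij})/\tau\big) + \exp\big((\log(1-\alpha_{ij}) + \xi'_{ij})/\tau\big)},
\qquad \xi_{ij} = -\log(-\log u_{ij}),\; u_{ij} \sim \mathrm{Uniform}(0,1)
$$

In this form A_ij lies between 0 and 1 and remains differentiable with respect to α_ij; as τ approaches 0 the value approaches either 0 or 1, taking the value 1 with probability α_ij.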

Further, because gene regulation dynamics are heterogeneous, that is, different nodes follow different regulatory rules, each node is equipped with its own multilayer perceptron as its dynamics learner. Since a column of the adjacency matrix represents the connections between a specific gene and its incoming neighbors (its regulators), that column is used to filter (i.e., multiply element-wise) the concentrations of all messenger RNAs, and the resulting neighbor information is fed into that gene's multilayer perceptron. The output of the multilayer perceptron is the concentration of that gene's messenger RNA at the next time step.
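The column filtering and the per-gene perceptron can be sketched in plain Python as follows. The single hidden layer, its size, the tanh activation, the random initialization and the function names (`make_mlp`, `mlp_forward`, `predict_next`) are illustrative assumptions; the description fixes only the input dimension (the number of nodes) and the output dimension (1).

```python
import math
import random

def make_mlp(n_inputs, n_hidden, rng):
    """Randomly initialize a one-hidden-layer perceptron (sizes are illustrative)."""
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def mlp_forward(mlp, inputs):
    """Map a length-N input vector to a single scalar output."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(mlp["w1"], mlp["b1"])
    ]
    return sum(w * h for w, h in zip(mlp["w2"], hidden)) + mlp["b2"]

def predict_next(adjacency, x_t, mlps):
    """Predict every concentration at t+1, using one perceptron per gene."""
    n = len(x_t)
    x_next = []
    for j in range(n):
        column_j = [adjacency[i][j] for i in range(n)]   # regulators of gene j
        masked = [a * x for a, x in zip(column_j, x_t)]  # filter by multiplication
        x_next.append(mlp_forward(mlps[j], masked))
    return x_next

rng = random.Random(0)
adjacency = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # hypothetical 3-gene network
x_t = [0.2, 0.7, 0.5]                          # made-up concentrations at time t
mlps = [make_mlp(3, 4, rng) for _ in range(3)]
x_next = predict_next(adjacency, x_t, mlps)
```

Because the column acts as a mask, changing the concentration of a gene that is not a regulator of gene j leaves the prediction for gene j unchanged.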

After the concentrations of all messenger RNAs at the next time step have been obtained, the loss is calculated from the predicted values and the corresponding real values in the data. The loss is backpropagated and the parameters involved are adjusted by gradient descent, including the parameters of the network structure generator and those of the dynamics predictors. The loss function of this process can be expressed as follows:

The above process is repeated until the loss converges. At that point the network generator can sample an adjacency matrix that accurately represents the real network connection pattern, and each dynamics predictor accurately represents how its gene is regulated by its neighbors.
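The loss formula is not reproduced in this text. Consistent with the L1-norm description given in the embodiment steps, one plausible form, summed over the N genes, is sketched below. The hat notation for predicted values is an assumption, and the text does not state whether the sum is also averaged over genes or time steps.

$$
L = \sum_{i=1}^{N} \left| \hat{X}^{t+1}_{i} - X^{t+1}_{i} \right|
$$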

In addition, a structural loss is introduced to optimize the network generator itself. Real gene regulatory networks are known to be sparse, so the number of ones in the adjacency matrix produced by the network generator is used as a penalty term. The more ones the adjacency matrix contains, the denser the generated network and the larger the structural loss, which pushes the generated network toward sparsity. This term matters more when the number of genes is large. The structural loss can be expressed by the following formula:

where Ls denotes the structural loss and α is the structural loss parameter; the larger α is, the stronger the structural penalty. In addition, the network generator and the dynamics predictors correspond to the network structure and the regulation dynamics of the gene regulatory network respectively, so they represent two different kinds of variables, and different learning rates are used for them during gradient descent.
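The structural loss formula is not reproduced in this text. Reading the description literally (count the ones in the adjacency matrix and scale the count by the parameter α), one consistent form would be the following; it is offered as an assumption rather than as the exact expression of the original filing.

$$
L_s = \alpha \sum_{i=1}^{N} \sum_{j=1}^{N} A_{ij}
$$

Here α is the scalar structural loss parameter and should not be confused with the edge probability α_ij used in the sampling formula.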

Advantageous effects

1) The invention reconstructs the gene regulatory network, giving a clearer understanding of the regulatory relationships among genes and potentially facilitating further control of biological traits.

2) Besides reconstructing the gene regulatory network, the invention models the regulation dynamics of each gene separately and can accurately predict the future concentration of a given messenger RNA.

3) The invention uses deep learning to simulate gene regulation dynamics and can therefore capture the highly nonlinear activation or inhibition between genes.

4) In addition to the above advantages, the invention achieves the highest accuracy among current gene regulatory network reconstruction techniques.

Drawings

Fig. 1 is a schematic diagram of the framework: the framework consists of two parts, a network structure generator and a plurality of dynamics predictors, where each dynamics predictor models the dynamics by which one specific gene is regulated.

Fig. 2 shows the network reconstruction result: an ROC curve computed from the trained network generator. The ROC curve lies above the diagonal and the AUC value is greater than 0.5 (the value expected from random guessing), indicating that the network generator has learned the network structure with high accuracy.

Fig. 3 shows the dynamics prediction result: the real messenger RNA concentration trajectory from a given initial concentration, together with the trajectory predicted by the method. The method accurately captures the trend of messenger RNA concentration in the presence of noise.
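For readers unfamiliar with the ROC and AUC measures mentioned for Fig. 2, the sketch below computes an AUC value by comparing every true edge with every non-edge. The edge probabilities and ground-truth labels are hypothetical and are not taken from the figure; the function name `roc_auc` is likewise illustrative.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a true edge outranks a non-edge (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one true edge and one non-edge")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical flattened edge probabilities and ground-truth edges.
edge_probabilities = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8]
true_edges = [1, 0, 1, 0, 0, 0]
auc = roc_auc(edge_probabilities, true_edges)  # 7 of 8 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of edges from non-edges.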

Detailed Description

The gene regulatory network reconstruction process is further explained below with reference to the accompanying drawings.

The problem to be solved by the invention is to reconstruct a gene regulatory network from time-series data of messenger RNA concentration changes using a deep-learning-based method. To this end, two sub-modules are built, a network generator and a dynamics predictor, and all parameters of the model are adjusted by backpropagation and gradient descent. The overall model architecture is shown in Fig. 1.

The overall goal is to use the concentrations of all messenger RNAs at time t, denoted Xt, to predict the concentrations of all messenger RNAs at time t+1, denoted Xt+1. Through repeated prediction and backpropagation, the model learns the network structure together with dynamics learners that accurately fit the gene regulation dynamics. The model consists of two parts: 1) a network generator, whose function is to generate, by sampling, an adjacency matrix representing the connection structure of the network; and 2) a set of dynamics predictors, each of which is a multilayer perceptron that learns how a particular gene is dynamically affected by its regulators. The two parts work together as follows. The network generator generates an adjacency matrix. Each dynamics predictor takes as input a specific column of the adjacency matrix (representing the regulators of one specific gene) and the states of all nodes at time t, and outputs a scalar, the concentration of that gene's messenger RNA at the next time step. The outputs of all dynamics predictors are concatenated into the vector of all messenger RNA concentrations at the next time step, the loss against the real concentration vector is computed and backpropagated, and the parameters of the dynamics predictors and the network generator are adjusted. Eventually the network generator produces an adjacency matrix close to the real one, and each dynamics predictor accurately learns how its gene is regulated by its neighbors.
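The sampling performed by the network generator can be sketched in plain Python as follows. This is a minimal illustration of two-class Gumbel-Softmax sampling, assuming that the generator's internal parameters are edge probabilities strictly between 0 and 1; the function names, the example probabilities, the temperature value and the 0.5 threshold used to read off a hard matrix are all illustrative assumptions. Automatic differentiation, which the actual training would need, is omitted.

```python
import math
import random

def gumbel_noise(rng):
    """Standard Gumbel sample: -log(-log(u)) with u uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def sample_edge(alpha, tau, rng):
    """Relaxed (soft) sample of one adjacency entry with edge probability alpha."""
    on = (math.log(alpha) + gumbel_noise(rng)) / tau
    off = (math.log(1.0 - alpha) + gumbel_noise(rng)) / tau
    m = max(on, off)  # subtract the maximum for numerical stability
    e_on, e_off = math.exp(on - m), math.exp(off - m)
    return e_on / (e_on + e_off)

def sample_adjacency(alphas, tau, rng):
    """Sample every entry of the adjacency matrix independently."""
    return [[sample_edge(a, tau, rng) for a in row] for row in alphas]

rng = random.Random(0)
# Hypothetical edge probabilities for a 3-gene network.
alphas = [[0.5, 0.9, 0.1], [0.1, 0.5, 0.9], [0.9, 0.1, 0.5]]
soft_adjacency = sample_adjacency(alphas, tau=0.5, rng=rng)
hard_adjacency = [[1 if a > 0.5 else 0 for a in row] for row in soft_adjacency]
```

Each soft entry exceeds 0.5 with probability equal to its edge probability, so thresholding the relaxed sample behaves like an ordinary Bernoulli draw while the relaxed value itself stays differentiable with respect to the edge probability.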

In a gene regulatory network, not all genes are regulated by their neighbors according to exactly the same dynamical rules, and the dynamics of gene regulation are highly nonlinear. Therefore, instead of trying to learn many different dynamical rules with a single neural network, a dedicated dynamics learner is created for each gene. Each dynamics learner is a multilayer neural network (MLP) whose number of hidden layers and hidden-layer sizes can be varied. However, because its goal is to receive neighbor information and predict its own gene's concentration at the next time step, its input and output dimensions are fixed to the number of nodes and 1, respectively. A specific column of the adjacency matrix is multiplied element-wise by the vector of all messenger RNA concentrations at time t (this filters out the concentrations of genes that the network generator does not consider to be neighbors of the gene), and the result is input to the MLP, whose output is taken as the node's concentration at the next time step. The outputs of all dynamics predictors are assembled into a vector of length N representing all messenger RNA concentrations at time t+1.

The specific embodiment can be stated in steps as follows:

1) Randomly initialize the internal parameters of the network generator and the dynamics predictors.

2) Sample from the internal parameters of the network generator using the Gumbel-Softmax technique to obtain an adjacency matrix.

3) Apply a sparsity penalty to the adjacency matrix, i.e., count the number of ones in the adjacency matrix as a penalty term. The penalty term can be formulated as:

where Ls denotes the structural loss and α is the structural loss parameter; the larger α is, the stronger the structural penalty.

4) Let i = 1. Input the i-th column of the adjacency matrix and the concentrations Xt of all nodes at time t into the i-th dynamics predictor, which outputs the concentration of node i at time t+1. Iterating the trained dynamics predictor on its own outputs yields a predicted curve, which is shown together with the real curve in Fig. 3.

5) Compare the predicted concentration of node i at time t+1 with the real concentration of node i at time t+1 and calculate the loss function. The loss function is calculated with the L1 norm and can be expressed by the following formula:

6) Repeat steps 4) and 5), incrementing i by 1 each time, until all dynamics predictors have been trained.

7) Perform multiple rounds of training, repeating steps 2) to 6) in each round.

8) Stop training when the loss function converges. At that point the network generator can sample the adjacency matrix with high accuracy, as shown in Fig. 2.

When training is carried out according to the above steps and the loss function has converged, the network generator can generate a relatively accurate gene regulatory network structure, and each dynamics predictor accurately represents the dynamics of its corresponding gene.
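The two loss terms used in the steps above can be sketched in plain Python as follows. The L1 prediction loss and the count-of-ones penalty follow the description; adding the two terms into a single objective is an assumption, since the text does not state how they are combined, and the function names and numeric values are illustrative. The gradient step itself, with its separate learning rates for the generator and the predictors, is omitted.

```python
def l1_loss(predicted, target):
    """Sum of absolute differences between predicted and observed concentrations."""
    return sum(abs(p - y) for p, y in zip(predicted, target))

def structure_loss(adjacency, alpha):
    """Sparsity penalty: alpha times the number (or soft sum) of ones in the matrix."""
    return alpha * sum(sum(row) for row in adjacency)

def total_loss(predicted, target, adjacency, alpha):
    # Summing the two terms is an assumption; the text does not say how they combine.
    return l1_loss(predicted, target) + structure_loss(adjacency, alpha)

adjacency = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # hypothetical sampled matrix
predicted = [0.25, 0.5, 0.75]                  # made-up predictions for time t+1
target = [0.5, 0.5, 0.25]                      # made-up observed values at t+1
loss = total_loss(predicted, target, adjacency, alpha=0.5)
```

With these made-up numbers the prediction term is 0.75 and the structural term is 1.5, so a larger α shifts the balance toward sparser adjacency matrices.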

Claims

1. A gene regulatory network reconstruction method based on deep learning, characterized in that a gene regulatory network is reconstructed from time-series data of messenger RNA concentration changes; two sub-modules, namely a network generator and a dynamics predictor, are built, and all parameters of the model are adjusted by backpropagation and gradient descent as used in deep learning; the goal is to use the concentrations of all messenger RNAs at time t, i.e., Xt, to predict the concentrations of all messenger RNAs at time t+1, i.e., Xt+1; the network structure is learned through continuous prediction and backpropagation adjustment, together with a dynamics learner capable of accurately fitting the gene regulation dynamics;

the method comprises the following specific steps:

1) randomly initializing internal parameters of a network generator and a dynamics predictor;

2) sampling from the internal parameters of the network generator using the Gumbel-Softmax technique to obtain an adjacency matrix, wherein Gumbel-Softmax is a differentiable sampling technique whose computation simulates an ordinary sampling process;

3) applying a sparsity penalty to the adjacency matrix, namely counting the number of ones in the adjacency matrix as a penalty term, wherein the penalty term is expressed by the formula:

wherein Ls denotes the structural loss and α is the structural loss parameter; the larger α is, the stronger the structural penalty;

4) letting i = 1, inputting the i-th column of the adjacency matrix and the concentrations Xt of all nodes at time t into the i-th dynamics predictor, the dynamics predictor outputting the concentration of node i at time t+1;

5) comparing the predicted concentration of node i at time t+1 with the real concentration of node i at time t+1 and calculating a loss function, wherein the loss function is calculated with the L1 norm, the L1 norm being the absolute value of the difference between the predicted value and the real value, expressed by the following formula:

6) repeating steps 4) and 5), incrementing i by 1 each time, until all dynamics predictors have been trained;

7) performing a plurality of rounds of training, each round repeating steps 2) to 6);

8) stopping the training when the loss function converges.