CN108763862B

CN108763862B - Method for deducing gene pathway activity

Info

Publication number: CN108763862B
Application number: CN201810422205.7A
Authority: CN
Inventors: 刘文斌; 沈良忠; 昝乡镇
Original assignee: Wenzhou University
Current assignee: Wenzhou University
Priority date: 2018-05-04
Filing date: 2018-05-04
Publication date: 2021-06-29
Anticipated expiration: 2038-05-04
Also published as: CN108763862A

Abstract

The invention provides a method for inferring gene pathway activity. The method comprises: obtaining samples, the pathway network corresponding to the samples, and the expression values of all genes, and weighting the expression value of each gene using the t value of the gene's t-test and a Pearson correlation coefficient as the gene's weight; obtaining the interaction types and corresponding strengths between genes in the same pathway from the topology of the pathway network, and computing the interaction expression values between genes from the strengths of the interaction types and the weighted gene expression values; and integrating the gene expression values and the interaction expression values, analyzing them by principal component analysis, and defining the resulting first principal component as the activity score of the pathway. By implementing the invention, both the importance of genes and the importance of the interactions between genes are considered when inferring pathway activity, so that the state of a biological pathway in a sample can be evaluated.

Description

Method for deducing gene pathway activity

Technical Field

The invention relates to the technical field of gene detection, in particular to a method for deducing gene pathway activity.

Background

Many recent studies propose searching for more robust biomarkers at the functional level to overcome the instability of single-gene signatures. Genes do not participate in biological processes in isolation; gene products usually act synergistically as functional modules or signaling cascades. A functional module that is dysregulated as a whole may therefore be a more stable biomarker than a single gene, and it is less affected by the various sources of noise. Functional-level biomarkers can effectively reduce tissue heterogeneity and the genetic heterogeneity of samples, while also helping to analyze the relationship between important functional pathways and disease. Integrating the expression profiles of functionally related genes and extracting classification features at the functional level is therefore beneficial for obtaining more robust biomarkers. Functional modules are often embedded in classical pathways and protein networks, and this high-throughput information can be obtained from Gene Ontology, the KEGG database, or other gene sets defined in microarray expression profiling studies, such as the Molecular Signatures Database (MSigDB).

Because pathway information closely reflects the chemical interactions and functional relationships between genes, the expression levels of the genes in a pathway are inseparable from the function the pathway performs: once the expression of significant genes in the pathway is perturbed, part of the pathway's function is also disrupted. Pathway activity is therefore defined by analyzing the gene expression profiles within the pathway and is used in classification experiments to obtain accurate biomarkers. For example, to address the problem of genes recurring in different pathways, Su et al. designed a log-likelihood function to search for linear sub-pathways with discriminative power; the resulting linear sub-pathways classify better, further improving classification performance. Breslin et al. infer pathway activity from the sum of the expression values of the pathway's member genes. Guo et al. infer pathway activity from the mean or median of the member genes' expression values. Bild et al. analyze the expression profile of the pathway's member genes by principal component analysis and use the first principal component as the pathway activity; this approach can also identify dysregulated pathway patterns and oncogenic pathway signatures, providing an important basis for targeted treatment of the relevant cancer subtypes. Lee et al. proposed that the condition-responsive genes (CORGs) in a pathway, rather than all genes in the pathway, play the major role in pathway activity. These results indicate that considering functional modules of genes can identify more stable biomarkers and achieve more accurate classification.

However, the above methods for inferring pathway activity use only the significant genes in a pathway and do not consider interaction information between genes. They treat the pathway as a simple set of individual genes, ignore the topology of the pathway network, and thereby lose much important information about communication between genes.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present invention is to provide a method for inferring gene pathway activity that considers both the importance of genes and the importance of the interactions between genes, thereby enabling evaluation of the state of a biological pathway in a sample.

In order to solve the above technical problems, an embodiment of the present invention provides a method for deriving gene pathway activity, comprising the steps of:

step S1, obtaining samples and the corresponding pathway network, obtaining the expression value of each gene contained in the pathway network, and weighting the expression value of each gene in the pathway network, using as the gene's weight the t value of the t-test of the gene's expression values between two different phenotypes and the Pearson correlation coefficient between the gene's expression values and the sample phenotype;

step S2, obtaining the interaction types and corresponding strengths between genes in the same pathway from the topology of the pathway network, and obtaining the interaction expression values between genes in the pathway network from the strengths corresponding to the interaction types and the weighted expression values of the genes;

step S3, integrating the expression values of all genes in the pathway network with the interaction expression values between genes, analyzing them by principal component analysis, and defining the resulting first principal component as the activity score of the pathway.

In step S1, the expression values of the genes contained in the pathway network are normalized by the formula

$$z_{ij} = \frac{g_{ij} - \mathrm{mean}(g_i)}{\mathrm{std}(g_i)}$$

where $g_{ij}$ is the expression value of gene i in sample j, and mean and std denote the mean and standard deviation of the gene's expression values over all samples.

In step S1, the weighted expression value of a gene is

$$z'_{ij} = t_{\mathrm{score}}(g_i)^2 \cdot \rho(g_i) \cdot z_{ij}$$

where $z'_{ij}$ is the weighted expression value of gene $g_i$ in sample j; $t_{\mathrm{score}}(g_i)$ is the statistic obtained by a two-tailed t-test of the expression values of gene $g_i$ between the two phenotypes; and $\rho(g_i)$ is the Pearson correlation coefficient between the gene's expression values over all samples and the sample phenotype.

In step S2, the interaction expression value between genes is computed from the following quantities: $e_{hj}$ is the expression value of the interaction between gene $g_i$ and gene $g_k$ in sample j; $\beta_{ik}$ is the β value corresponding to the type of interaction between gene $g_i$ and gene $g_k$; $\rho_{ik}$ is the Pearson correlation coefficient between the expression values of gene $g_i$ and gene $g_k$; and $z'_{ij}$ and $z'_{kj}$ are the weighted expression values of gene $g_i$ and gene $g_k$ in sample j.

In step S3, the activity score of each gene pathway is calculated by the formula

$$a(P_j) = w_{1j} z'_{1j} + w_{2j} z'_{2j} + \dots + w_{ij} z'_{ij} + \dots + w_{nj} z'_{nj} + w_{(n+1)j} e_{1j} + \dots + w_{(n+h)j} e_{hj} + \dots + w_{(n+l)j} e_{lj}$$

where $a(P_j)$ is the pathway activity score of sample j; $w_{1j}$ is the weight of the first gene in sample j in the first principal component; $w_{ij}$ is the weight of gene i in sample j in the first principal component; $w_{(n+1)j}$ is the weight of the first inter-gene interaction in sample j in the first principal component; n is the total number of genes; and l is the number of inter-gene interactions.

The embodiment of the invention has the following beneficial effects:

The invention applies principal component analysis to the integrated gene expression values and inter-gene interaction expression values in the pathway network of each sample, and defines the resulting first principal component as the activity score of the pathway. Pathway activity is thus inferred by considering both the importance of genes and the importance of the interactions between genes, and the method is widely applicable.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; other drawings obtained from them by those skilled in the art without inventive effort also fall within the scope of the present invention.

FIG. 1 is a flow chart of a method for inferring gene pathway activity provided in an embodiment of the present invention;

FIG. 2 is a diagram illustrating an application scenario of the method for inferring gene pathway activity provided in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.

Referring to FIG. 1, a method for deriving the activity of a gene pathway is provided in the examples of the present invention, which comprises the following steps:

Specifically, in step S1, the expression values of all genes contained in the pathway network are first standardized so that they lie on a common scale; expression values on different scales could otherwise lead to unreasonable classification results. The formula is:

$$z_{ij} = \frac{g_{ij} - \mathrm{mean}(g_i)}{\mathrm{std}(g_i)} \tag{1}$$

In formula (1), $g_{ij}$ is the expression value of gene i in sample j, and mean and std denote the mean and standard deviation of the gene's expression values over all samples. If the expression value of a gene is missing in sample j, the mean of the gene's expression values in the other samples is used to fill in the missing value.
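The standardization and missing-value handling described for formula (1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `zscore_rows`, the use of `None` to mark a missing value, and the choice of the population standard deviation are all assumptions made for the example.

```python
import math

def zscore_rows(expr):
    """Standardize each gene (row) across samples (columns).

    `expr` is a list of rows; a missing value is given as None and is
    replaced by the mean of that gene's observed values before scaling.
    """
    normalized = []
    for row in expr:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed)
        filled = [fill if v is None else v for v in row]
        mean = sum(filled) / len(filled)
        # Population standard deviation; the source does not say whether
        # the n or n-1 denominator is intended, so this is an assumption.
        std = math.sqrt(sum((v - mean) ** 2 for v in filled) / len(filled))
        if std == 0:
            normalized.append([0.0] * len(filled))  # constant gene: no signal
        else:
            normalized.append([(v - mean) / std for v in filled])
    return normalized

# Two hypothetical genes measured in three samples; one value is missing.
z = zscore_rows([[1.0, 2.0, 3.0], [4.0, None, 8.0]])
```

After this step every gene has mean 0 and unit standard deviation across samples, so genes measured on different scales become comparable.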

Because the t-test value directly characterizes the difference in a gene's expression between the two phenotypes (the larger the t-test value, the more pronounced the difference), it can be used to weight the gene expression values and thereby amplify the differences in a gene's expression between phenotypes.

The weighted expression value of each gene in each sample is:

$$z'_{ij} = t_{\mathrm{score}}(g_i)^2 \cdot \rho(g_i) \cdot z_{ij} \tag{2}$$

In formula (2), $z'_{ij}$ is the weighted expression value of gene $g_i$ in sample j; $t_{\mathrm{score}}(g_i)$ is the statistic obtained by a two-tailed t-test of the expression values of gene $g_i$ between the two phenotypes; and $\rho(g_i)$ is the Pearson correlation coefficient between the gene's expression values over all samples and the sample phenotype.
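As a small illustration of formula (2), the weighting of one gene's standardized values might look like the sketch below. The function name `weight_expression` and the numeric inputs are hypothetical; the t statistic and correlation coefficient are assumed to have been computed beforehand.

```python
def weight_expression(z_row, t_score, rho):
    """Apply formula (2): z'_ij = t_score(g_i)^2 * rho(g_i) * z_ij.

    `z_row` holds one gene's standardized values across samples;
    `t_score` and `rho` are that gene's precomputed t statistic and
    Pearson correlation with the phenotype.
    """
    factor = (t_score ** 2) * rho
    return [factor * z for z in z_row]

# Hypothetical gene with t = 2.0 and rho = -0.5 over three samples.
weighted = weight_expression([-1.0, 0.0, 1.0], t_score=2.0, rho=-0.5)
```

Because the squared t statistic is never negative, the sign of the weight comes entirely from the correlation coefficient, while the magnitude grows with how strongly the gene separates the two phenotypes.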

It should be noted that t-tests can be divided into one-sample and two-sample tests. The one-sample t-test checks whether the difference between the mean of a sample and the mean of the population is significant. The statistic of the one-sample t-test is:

$$t = \frac{\bar{X} - \mu}{\sigma_X / \sqrt{n}}$$

where $\bar{X}$ is the sample mean, $\mu$ is the population mean, n is the sample size, and $\sigma_X$ is the sample standard deviation.

The two-sample t-test measures the difference between two samples at the level of their respective populations. It can be subdivided into the independent-samples t-test and the paired-samples t-test. The independent-samples t-test is commonly used in cancer classification experiments, where the t-test value of a gene between two different phenotypes describes the difference in the gene's expression between them. In the t statistic of a gene between two different phenotypes, $n_1$ and $n_2$ are the total numbers of positive and negative samples, and the remaining terms are the variances and the means of the gene's expression values in the two samples. The null hypothesis assumes that the normal distributions followed by the two samples have the same mean and variance. Strictly, the method is called Student's t-test only when the variances of the two populations are equal; when this assumption does not hold, it is sometimes referred to as Welch's t-test. The t-test can also be used to test whether the difference between two measurements of the same quantity is zero, in which case it is often called a "paired" or "repeated measures" t-test.
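The distinction between Student's and Welch's t-test drawn above can be made concrete with a short sketch. The text does not reproduce which form of the statistic the method uses, so both standard forms are shown; the function names and the expression values are hypothetical.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Pooled-variance (Student) t statistic; assumes equal variances."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

def welch_t(a, b):
    """Welch t statistic; does not assume equal variances."""
    n1, n2 = len(a), len(b)
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / n1 + variance(b) / n2)

# Hypothetical expression values of one gene in positive and negative samples.
positive = [5.1, 4.9, 5.6, 5.8]
negative = [3.9, 4.2, 4.0, 4.5]
t_student = student_t(positive, negative)
t_welch = welch_t(positive, negative)
```

When the two groups have the same size, as here, the two statistics coincide; they differ only when group sizes and variances are both unequal.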

It should be noted that the Pearson correlation coefficient is often used to characterize both the correlation between gene expression values and the sample phenotype and the correlation between two genes. When two genes interact, the Pearson correlation coefficient directly describes the strength of the interaction. The Pearson correlation coefficient of interacting genes i and k is:

$$\rho_{ik} = \frac{\sum_{j=1}^{m} (g_{ij} - \bar{g}_i)(g_{kj} - \bar{g}_k)}{\sqrt{\sum_{j=1}^{m} (g_{ij} - \bar{g}_i)^2}\,\sqrt{\sum_{j=1}^{m} (g_{kj} - \bar{g}_k)^2}}$$

where m is the number of samples and $\bar{g}_i$ and $\bar{g}_k$ are the mean expression values of genes i and k over all samples.

The Pearson correlation coefficient lies between -1 and 1. A value of 1 means that the two genes are perfectly positively linearly correlated, a strong correlation. A value of 0 means that the two genes have no linear correlation. A value of -1 means that the two genes are perfectly negatively linearly correlated, which is also a strong correlation.

The Pearson correlation coefficient is symmetric, that is, corr(X, Y) = corr(Y, X). A key mathematical property of the coefficient is that it is invariant under changes in the location and scale of the two variables: X can be transformed to a + bX and Y to c + dY, where a, b, c, and d are constants with b > 0 and d > 0, without changing the correlation coefficient between them.
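The symmetry and invariance properties just stated can be checked numerically. The sketch below is only an illustration; the function name `pearson` and the two expression vectors are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression values of two interacting genes in four samples.
gene_i = [1.0, 2.0, 4.0, 7.0]
gene_k = [2.0, 1.0, 5.0, 6.0]
r = pearson(gene_i, gene_k)
# Shift and positively rescale both variables: the coefficient is unchanged.
r_scaled = pearson([3 + 2 * v for v in gene_i], [-1 + 0.5 * v for v in gene_k])
```

One practical consequence is that the coefficient is the same whether it is computed on raw or on standardized expression values.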

In step S2, if gene i and gene k interact in the pathway, an expression value for the interaction can be defined from the expression values of the two genes, weighted by the strength and type of the interaction between them. The interaction expression value between gene i and gene k is given by formula (3).

In formula (3), $e_{hj}$ is the expression value of the interaction between gene $g_i$ and gene $g_k$ in sample j; $\beta_{ik}$ is the β value corresponding to the type of interaction between gene $g_i$ and gene $g_k$; $\rho_{ik}$ is the Pearson correlation coefficient between the expression values of gene $g_i$ and gene $g_k$; and $z'_{ij}$ and $z'_{kj}$ are the weighted expression values of gene $g_i$ and gene $g_k$ in sample j.

In the same way, the interaction expression values between all interacting genes in the pathway network can be determined.

In step S3, the activity score of each gene pathway is calculated by the formula:

$$a(P_j) = w_{1j} z'_{1j} + w_{2j} z'_{2j} + \dots + w_{ij} z'_{ij} + \dots + w_{nj} z'_{nj} + w_{(n+1)j} e_{1j} + \dots + w_{(n+h)j} e_{hj} + \dots + w_{(n+l)j} e_{lj} \tag{4}$$

In formula (4), $a(P_j)$ is the pathway activity score of sample j; $w_{1j}$ is the weight of the first gene in sample j in the first principal component; $w_{ij}$ is the weight of gene i in sample j in the first principal component; $w_{(n+1)j}$ is the weight of the first inter-gene interaction in sample j in the first principal component; n is the total number of genes; and l is the number of inter-gene interactions.

It should be noted that principal component analysis (PCA) is an important dimensionality-reduction algorithm in machine learning. Its basic principle is to project the original data onto the eigenvectors of the covariance matrix.

The PCA algorithm roughly comprises the following steps:

1: standardize all sample data, i.e., apply mean normalization;

2: compute the covariance matrix C of the sample data:

$$C = \frac{1}{m} X X^{T} \tag{4-1}$$

where X is the mean-normalized data matrix with one column per sample, m is the number of samples, and n is the number of features per sample;

3: perform singular value decomposition on the covariance matrix obtained in the previous step:

$$[U, S, V] = \mathrm{svd}(C) \tag{4-2}$$

4: form the projection matrix P from the eigenvectors corresponding to the largest eigenvalues;

5: project the original data onto the projection matrix to obtain:

$$Z = P^{T} X \tag{4-3}$$
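The five steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the data matrix is arranged with one row per feature and one column per sample, the function name `first_principal_component` is hypothetical, and the random input merely stands in for the weighted gene values and interaction values of a pathway.

```python
import numpy as np

def first_principal_component(X):
    """Return (loadings, scores) of the first principal component.

    X has shape (n_features, m_samples): one row per feature (for example
    a weighted gene value or an interaction value), one column per sample.
    """
    # Step 1: mean-normalize each feature across samples.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Step 2: covariance matrix C = (1/m) X X^T, shape (n, n).
    m = Xc.shape[1]
    C = (Xc @ Xc.T) / m
    # Step 3: singular value decomposition of the covariance matrix.
    U, S, Vt = np.linalg.svd(C)
    # Step 4: projection matrix from the leading singular vector.
    P = U[:, :1]
    # Step 5: project the data, Z = P^T X; one score per sample.
    Z = P.T @ Xc
    return P[:, 0], Z[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))  # 5 hypothetical features, 20 samples
loadings, scores = first_principal_component(X)
```

Read against formula (4), the entries of `loadings` play the role of the weights w and each entry of `scores` is the activity score of one sample. The sign of a principal component is arbitrary, so scores are meaningful only up to a global sign flip.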

PCA is widely used across research fields and goes by different names; for example, it is called spectral decomposition in noise and vibration analysis and empirical modal analysis in structural dynamics. Feature selection is commonly performed in machine-learning classification problems. In classification experiments with a limited number of samples, using tens of thousands of genes as features is clearly undesirable, because it greatly degrades classifier performance; dimensionality reduction of the biological data is a feasible remedy. Gene data reduced by PCA retain the main information of the original data, and the first principal component, which has the largest variance, is often selected as an important classification feature.

FIG. 2 shows an application scenario of the method for inferring gene pathway activity in an embodiment of the present invention. First, the gene expression values are normalized. Second, gene interactions are established from the gene expression data and the pathway; in the pathway network, each node represents a gene and each edge represents an interaction between two genes. Third, a pathway activity score is calculated for each sample by principal component analysis.


While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A method of deriving gene pathway activity comprising the steps of:

step S3, integrating the expression values of all genes in the pathway network and the interaction expression values among all genes, analyzing by a principal component analysis method, and further defining the obtained first principal components as the activity scores of the pathways;

in step S1, the expression values of the genes contained in the pathway network are normalized by the formula

$$z_{ij} = \frac{g_{ij} - \mathrm{mean}(g_i)}{\mathrm{std}(g_i)}$$

where $g_{ij}$ is the expression value of gene i in sample j, and mean and std denote the mean and standard deviation of the gene's expression values over all samples;

in step S1, the weighted expression value of a gene is

$$z'_{ij} = t_{\mathrm{score}}(g_i)^2 \cdot \rho(g_i) \cdot z_{ij}$$

where $z'_{ij}$ is the weighted expression value of gene $g_i$ in sample j; $t_{\mathrm{score}}(g_i)$ is the statistic obtained by a two-tailed t-test of the expression values of gene $g_i$ between the two phenotypes; and $\rho(g_i)$ is the Pearson correlation coefficient between the gene's expression values over all samples and the sample phenotype;

in the step S2, the interaction expression value between the genes is

2. The method of deriving gene pathway activity according to claim 1 wherein in step S3, the activity score for each gene pathway is calculated by the formula:

$$a(P_j) = w_{1j} z'_{1j} + w_{2j} z'_{2j} + \dots + w_{ij} z'_{ij} + \dots + w_{nj} z'_{nj} + w_{(n+1)j} e_{1j} + \dots + w_{(n+h)j} e_{hj} + \dots + w_{(n+l)j} e_{lj}$$

where $a(P_j)$ is the pathway activity score of sample j; $w_{1j}$ is the weight of the first gene in sample j in the first principal component; $w_{ij}$ is the weight of gene i in sample j in the first principal component; $w_{(n+1)j}$ is the weight of the first inter-gene interaction in sample j in the first principal component; n is the total number of genes; and l is the number of inter-gene interactions.