CN111009285A

CN111009285A - Biological data network processing method based on similarity network fusion algorithm

Info

Publication number: CN111009285A
Application number: CN201910451766.4A
Authority: CN
Inventors: 刘伟; 郑明霞; 赵溶; 丁彦蕊
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2019-05-28
Filing date: 2019-05-28
Publication date: 2020-04-14

Abstract

The invention discloses a biological data network processing method based on a similarity network fusion (SNF) algorithm, and belongs to the technical field of biological information analysis. The method constructs similarity networks for various kinds of biological genetic information such as mRNA, miRNA and lncRNA, fuses the similarity matrices with the SNF algorithm to create a usable sample network, clusters it with spectral clustering, and analyzes the relationships between networks, for use in fields such as the discovery of disease pathogenesis, early diagnosis and later treatment. By exploiting the complementarity of different types of data, the method obtains more comprehensive results, substantially outperforms analysis based on a single data type, and lays a foundation for subsequent comprehensive analysis.

Description

Biological data network processing method based on similarity network fusion algorithm

Technical Field

The invention relates to a biological data network processing method based on a similarity network fusion algorithm, and belongs to the technical field of biological information analysis.

Background

With the progress of the Human Genome Project, bioinformatics has developed and matured rapidly. High-throughput sequencing technologies have enabled more comprehensive and deeper genome analysis. As sequencing costs continue to fall, multi-omics biological data, including genomics and transcriptomics data, keep accumulating. This mass of data helps to mine the biological knowledge it contains comprehensively and effectively, providing abundant data resources for biological information analysis while also posing new challenges. With the continued accumulation of biological data with big-data characteristics and the launch of precision medicine initiatives, biological information analysis is becoming increasingly important and is of great significance for advancing related fields. However, how to mine potential changes in biological networks from experimental data has long been both a hot topic and a difficulty in the systematic study of life phenomena. Conventional methods can analyze only one type of biological data at a time; they cannot analyze multiple types simultaneously and therefore cannot exploit the different characteristics contained in different types of data.

The study by Bin et al. (Analysis of differences in circRNA expression profiles between Luminal subtype breast cancer cells and normal breast cells, Southern Medical University, 2018, 38(8), 1014-1019) is a single-factor bioinformatics analysis. After data were extracted from the circRNA expression profiles of the two cell types, the collected array images underwent quantile normalization and subsequent data processing, followed by volcano plot and clustering heat map analysis. The study concluded that circRNA expression differs markedly between Luminal subtype breast cancer cells and normal breast cells, and that the up-regulated or down-regulated circRNAs may become new targets for diagnosing Luminal subtype breast cancer. In actual disease-gene relationships, however, multiple types of genes jointly act on cells to cause disease, so analysis of a single data type has inherent limitations.

Liuyu intelligence (Liuyu chip and DNA methylation chip integrated analysis explore molecular targets for occurrence and development of nasopharyngeal carcinoma, journal of clinical examination, 2018, (8), 574-. Although that article uses multiple types of data to obtain therapeutic targets related to nasopharyngeal carcinoma, it essentially processes a single type of data at a time and does not analyze the data by fusing the characteristics of multiple data types simultaneously.

Disclosure of Invention

To solve the problem that existing biological data network processing methods can analyze only a single type of data, and do not fuse the characteristics of multiple types of data at the same time in order to determine disease subtypes, the invention provides a biological data network processing method based on a similarity network fusion algorithm.

A biological data network processing method, the method comprising:

S1: constructing, from sample data sets of different biological data types, a sample similarity matrix for each type;

S2: constructing, from the sample similarity matrices of each type constructed in S1, a fused similarity matrix of the multiple types of sample data using the SNF algorithm;

S3: clustering the fused similarity matrix of the multiple types of sample data obtained in S2 using a spectral clustering method, to determine the subclasses of the sample data.

Optionally, the S1 includes:

carrying out normalization processing on each type of data in the sample data set containing different biological data types;

calculating Euclidean distances among samples of the same type after normalization, and constructing a distance matrix;

and constructing the sample similarity matrix of each type of sample data using a Gaussian heat kernel function.

Optionally, the Euclidean distance $d_{ij}$ is calculated as:

$$d_{ij} = \sqrt{\sum_{k=1}^{m_v} \left(x_{ik} - x_{jk}\right)^2}$$

where the sample data set contains M types of sample data, the number of samples is n, $m_v$ is the number of genes included in each type of sample data, v = 1, …, M; $x_{ik}$ denotes the k-th gene of sample i, i and j take values in [1, n], and k takes values in [1, $m_v$]; $x_{jk}$ denotes the k-th gene of sample j.

Optionally, constructing the sample similarity matrix of each type of sample data using the Gaussian heat kernel function includes:

denoting the sample similarity matrix of each type of sample data as $w^{(v)}$, whose entries are:

$$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\mu\,\varepsilon_{ij}}\right)$$

where μ is a hyper-parameter with a value range of [0.3, 0.8], and $\varepsilon_{ij}$ is a parameter used to eliminate the scaling problem.

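Optionally, $\varepsilon_{ij}$ is defined as:

$$\varepsilon_{ij} = \frac{\operatorname{mean}(d(i, N_i)) + \operatorname{mean}(d(j, N_j)) + d_{ij}}{3}$$

where $N_i$ denotes the samples other than sample i, and $\operatorname{mean}(d(i, N_i))$ is the mean distance from sample $x_i$ to the other samples $N_i$.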

Optionally, the S2 includes:

after obtaining the sample similarity matrix $w^{(v)}$ of each type constructed in S1, obtaining the normalized weight matrix $P^{(v)}$ of each type of sample data according to the following formula:

$$P^{(v)}(i,j) = \begin{cases} \dfrac{w_{ij}}{2\sum_{f \neq i} w_{if}}, & j \neq i \\ \dfrac{1}{2}, & j = i \end{cases}$$

where $\sum_{f \neq i} w_{if}$ is the sum of the similarities between sample i and all other samples in the same type of sample data, and f takes values in [1, n];

defining a kernel matrix S that measures local affinity, the kernel matrix of each type of sample data being denoted $S^{(v)}$:

$$S^{(v)}(i,j) = \begin{cases} \dfrac{w_{ij}}{\sum_{f \in G_i} w_{if}}, & j \in G_i \\ 0, & \text{otherwise} \end{cases}$$

where $G_i$ denotes the set of the g samples most similar to sample i, so that $\sum_{f \in G_i} w_{if}$ is the sum of the similarities of those g samples, and g takes values in [20, 30];

updating the sample similarity matrix of each data type using the SNF algorithm, and iterating a predetermined number of times to obtain the updated $P^{(v)\prime}$:

$$P^{(v)\prime} = S^{(v)} \times \left(\frac{\sum_{k \neq v} P^{(k)}}{M-1}\right) \times \left(S^{(v)}\right)^T$$

where $\sum_{k \neq v} P^{(k)}$ is the sum of the normalized matrices $P^{(k)}$ of all data types other than the current data type v;

and fusing the similarity matrices of all the data types to obtain the fused similarity matrix P:

$$P = \frac{1}{M}\sum_{v=1}^{M} P^{(v)}$$

Optionally, the predetermined number of iterations is 10 to 20.

Optionally, the method further includes: and obtaining a sample similarity network according to the sample similarity matrix.

The second object of the present invention is to provide the use of the above method for analysis of disease subtype identification.

The third purpose of the invention is to provide the application of the method in the technical field of biological information analysis.

The invention has the beneficial effects that:

With the SNF algorithm, similarity networks are first constructed for various kinds of biological genetic information such as mRNA, miRNA and lncRNA; the similarity matrices are then fused by the SNF algorithm to create a usable sample network, which is clustered by spectral clustering, and the relationships between networks are analyzed. The method can therefore be used in fields such as the discovery of disease pathogenesis, early diagnosis and later treatment. By exploiting the complementarity of different types of data, the method obtains more comprehensive results, substantially outperforms analysis based on a single data type, and lays a foundation for subsequent comprehensive analysis.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic diagram of a fusion similarity network corresponding to a fusion similarity matrix obtained by fusing through the SNF algorithm.

FIG. 2 is a diagram of a sample similarity network constructed in accordance with the present invention.

FIG. 3 is a graph of the results of a clustering analysis of the fusion similarity matrix using spectral clustering in accordance with the present invention.

Fig. 4 is a schematic diagram of the clustering results shown in fig. 3 with obvious blocks.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The first embodiment is as follows:

In this embodiment, three types of data (mRNA, miRNA and lncRNA) are used as example data sets: a similarity matrix is constructed from each, the matrices are fused, and a sample network is then built for analysis. The input to the following procedure is the mRNA, miRNA and lncRNA data of 177 pancreatic cancer patient samples, taken from the TCGA database (https://www.cancer.gov/TCGA).

The biological data network processing method based on the similarity network fusion algorithm provided by the embodiment comprises the following steps:

(1) respectively constructing a sample similarity matrix and a sample similarity network corresponding to each type of sample data;

Assume the sample data set is $\{x_1, x_2, \dots, x_n\}$. The data set contains M types of sample data in total, the number of samples is n, and the number of genes contained in each type of sample data is $m_v$ (v = 1, …, M). In this example, n = 177 and M = 3, and the numbers of genes in the mRNA, miRNA and lncRNA data are $m_1 = 8073$, $m_2 = 557$ and $m_3 = 17914$, respectively.

First, each type of data in the mRNA, miRNA and lncRNA data sets of the 177 samples is normalized; the Euclidean distances between the normalized samples are calculated to build a distance matrix; and the sample similarity matrix $w^{(v)}$ (v = 1, …, M) of each type of sample data is constructed with a Gaussian heat kernel function. The sample similarity matrices of the three types of sample data are w1, w2 and w3, respectively.

For brevity, the following describes the construction of the sample similarity matrix and the sample similarity network for one type of sample data; when there are multiple types, the same procedure is applied to each type.

The normalization formula is:

$$x' = \frac{x - u}{\sigma}$$

where u is the mean, σ is the standard deviation, x is the sample data, and x' is the normalized value.
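The normalization step can be illustrated with a minimal sketch. It assumes each gene (column) is standardized across samples and that the population standard deviation is used; the patent states neither choice, and the function name `zscore_columns` and the toy values are hypothetical.

```python
import statistics

def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    # Assumption: population standard deviation; the source does not say which variant.
    stds = [statistics.pstdev(c) for c in cols]
    return [
        [(x - u) / s if s > 0 else 0.0 for x, u, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy data: 3 samples x 2 genes (values made up for illustration).
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = zscore_columns(data)
```

Constant columns are mapped to 0.0 here to avoid division by zero; that guard is a design choice of this sketch, not something the source specifies.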

The Euclidean distance is calculated as:

$$d_{ij} = \sqrt{\sum_{k=1}^{m_v} \left(x_{ik} - x_{jk}\right)^2}$$

where i and j take values in [1, n] (in this embodiment, i, j ∈ [1, 177]), $x_{ik}$ denotes the k-th gene of sample i, and k takes values in [1, $m_v$].
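As a sketch of this step, the pairwise distance matrix for one data type could be computed as below. The function name `euclidean_distance_matrix` and the toy samples are hypothetical; a real data set would have 177 rows and thousands of columns.

```python
import math

def euclidean_distance_matrix(rows):
    """Return the symmetric n x n matrix of Euclidean distances between samples."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
            dist[i][j] = dist[j][i] = d  # the distance is symmetric
    return dist

# Toy data: 3 samples x 2 genes.
samples = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = euclidean_distance_matrix(samples)
```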

The sample similarity matrix $w^{(v)}$ of each type of sample data is constructed with the Gaussian heat kernel function as follows:

$$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\mu\,\varepsilon_{ij}}\right)$$

where $w_{ij}$ is the similarity between sample i and sample j; μ is a hyper-parameter with a value range of [0.3, 0.8]; $d_{ij}$ is the Euclidean distance between sample i and sample j; and $\varepsilon_{ij}$ is a parameter used to eliminate the scaling problem, defined as

$$\varepsilon_{ij} = \frac{\operatorname{mean}(d(i, N_i)) + \operatorname{mean}(d(j, N_j)) + d_{ij}}{3}$$

where $N_i$ denotes the samples other than sample i, and $\operatorname{mean}(d(i, N_i))$ is the mean distance from sample $x_i$ to the other samples $N_i$.
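The kernel can be sketched as follows, assuming the scaling term is the average of the two mean distances and the pairwise distance, as in the published SNF formulation; the exact expression in the original filing was an image and is not recoverable here. The name `affinity_matrix` and the value `mu=0.5` (chosen from within the stated range [0.3, 0.8]) are illustrative.

```python
import math

def affinity_matrix(dist, mu=0.5):
    """Gaussian heat kernel similarity w_ij = exp(-d_ij^2 / (mu * eps_ij))."""
    n = len(dist)
    # Mean distance from each sample to all other samples (N_i = everything except i).
    mean_dist = [sum(dist[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    w = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumed scaling term: average of the two mean distances and d_ij.
            eps = (mean_dist[i] + mean_dist[j] + dist[i][j]) / 3.0
            if eps > 0:  # eps == 0 only if all involved samples coincide; keep w = 1
                w[i][j] = math.exp(-dist[i][j] ** 2 / (mu * eps))
    return w

# Toy distance matrix for 3 samples.
dist = [[0.0, 5.0, 10.0], [5.0, 0.0, 5.0], [10.0, 5.0, 0.0]]
W = affinity_matrix(dist)
```

Because ε_ij grows with the typical distances around samples i and j, the same absolute distance yields a higher similarity in a sparse region than in a dense one, which is the scaling problem the parameter is meant to remove.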

After the sample similarity matrix of each type of sample data is obtained, it is represented as a graph, giving the sample similarity network corresponding to that type of sample data.
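One simple way to represent a similarity matrix as a graph is a weighted edge list, sketched below. The source does not say how edges are selected, so the `threshold` parameter and the function name `similarity_network` are assumptions made purely for illustration.

```python
def similarity_network(w, threshold=0.0):
    """Return weighted undirected edges (i, j, weight) with weight above the threshold."""
    n = len(w)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: the graph is undirected
            if w[i][j] > threshold:
                edges.append((i, j, w[i][j]))
    return edges

# Toy similarity matrix for 3 samples.
w = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]]
edges = similarity_network(w, threshold=0.2)
```

Nodes are samples and edge weights are similarities; self-loops are omitted because the diagonal carries no information about relationships between samples.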

(2) Similarity network fusion

After the sample similarity matrices of the different biological data types are obtained, the Similarity Network Fusion (SNF) algorithm iteratively updates the state matrix, i.e. the sample similarity matrix fed into each iteration, until the fused similarity matrix of all types of sample data is obtained. A fused sample network is then built from it for further analysis.

The SNF algorithm takes sample networks as the basis for integration: it builds a sample similarity network for each data type and merges these networks into a single similarity network by a nonlinear combination method. The SNF algorithm goes beyond current subtyping strategies in capturing continuous phenotypes, substantially outperforms analysis based on a single data type, and is effective at identifying tumor subtypes and predicting survival.

Iterative similarity network fusion based on the SNF algorithm integrates the data types well, so that biological information can be mined from a more comprehensive perspective.

After the sample similarity matrix $w^{(v)}$ of each data type is obtained, the normalized weight matrix $P^{(v)}$ of each type of sample data is obtained according to the following formula:

$$P^{(v)}(i,j) = \begin{cases} \dfrac{w_{ij}}{2\sum_{f \neq i} w_{if}}, & j \neq i \\ \dfrac{1}{2}, & j = i \end{cases}$$

The normalized matrix $P^{(v)}$ is not affected by the self-similarity on the diagonal, which avoids numerical instability.
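A minimal sketch of this normalization, assuming the piecewise form used in the published SNF algorithm (off-diagonal entries divided by twice the off-diagonal row sum, diagonal entries fixed at 1/2); the function name `normalized_weights` and the toy matrix are hypothetical.

```python
def normalized_weights(w):
    """Row-normalize a similarity matrix so that each row of P sums to 1."""
    n = len(w)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sum of similarities to all OTHER samples; assumed to be positive.
        off_diag = sum(w[i][f] for f in range(n) if f != i)
        for j in range(n):
            p[i][j] = 0.5 if i == j else w[i][j] / (2.0 * off_diag)
    return p

# Toy similarity matrix for 3 samples.
w = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.4], [0.2, 0.4, 1.0]]
P = normalized_weights(w)
```

Since the diagonal value w_ii never enters the computation, the result does not depend on how self-similarity was scaled, which is the stability property described above.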

A kernel matrix S that measures local affinity is defined, and the kernel matrix of each type of sample data is denoted $S^{(v)}$:

$$S^{(v)}(i,j) = \begin{cases} \dfrac{w_{ij}}{\sum_{f \in G_i} w_{if}}, & j \in G_i \\ 0, & \text{otherwise} \end{cases}$$

where $G_i$ denotes the set of the g samples most similar to sample i, so that $\sum_{f \in G_i} w_{if}$ is the sum of the similarities of those g samples, and g takes values in [20, 30]. This is a k-nearest-neighbor (k-NN) approach: edges with low similarity are filtered out and only the nearest neighbors of each sample are retained.
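The local kernel can be sketched as below. Two assumptions are made: a sample is not counted among its own neighbours, and the toy example uses g = 2 instead of the stated range [20, 30] so that the matrix stays small. The function name `local_kernel` is hypothetical.

```python
def local_kernel(w, g):
    """Keep only each sample's g most similar neighbours and normalize them to sum to 1."""
    n = len(w)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]  # assumption: exclude the sample itself
        # Indices of the g samples most similar to sample i.
        neighbours = sorted(others, key=lambda j: w[i][j], reverse=True)[:g]
        total = sum(w[i][j] for j in neighbours)
        for j in neighbours:
            s[i][j] = w[i][j] / total
    return s

# Toy similarity matrix for 4 samples.
w = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.4, 0.2],
    [0.5, 0.4, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
S = local_kernel(w, g=2)
```

Entries outside the neighbour set are zero, so weak, possibly noisy similarities between distant samples do not propagate through the fusion step.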

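The sample similarity matrix of each data type is updated with the SNF algorithm. After a predetermined number of iterations, here 10 to 20, the updated $P^{(v)\prime}$ is obtained:

$$P^{(v)\prime} = S^{(v)} \times \left(\frac{\sum_{k \neq v} P^{(k)}}{M-1}\right) \times \left(S^{(v)}\right)^T$$

where $\sum_{k \neq v} P^{(k)}$ is the sum of the normalized matrices $P^{(k)}$ of all data types other than the current data type v.

In this embodiment there are 3 types of sample similarity matrices, so with M = 3 the update takes the form:

$$P^{(1)\prime} = S^{(1)} \times \frac{P^{(2)} + P^{(3)}}{2} \times \left(S^{(1)}\right)^T$$

$$P^{(2)\prime} = S^{(2)} \times \frac{P^{(1)} + P^{(3)}}{2} \times \left(S^{(2)}\right)^T$$

$$P^{(3)\prime} = S^{(3)} \times \frac{P^{(1)} + P^{(2)}}{2} \times \left(S^{(3)}\right)^T$$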
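The iterative update can be sketched in plain Python as follows. This is an illustration of the cross-diffusion formula only: it assumes all data types are updated from the previous iteration's state, and it omits any re-normalization between iterations because the source does not describe one. The names `matmul`, `transpose` and `snf_update`, and the toy matrices, are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def snf_update(p_list, s_list, iterations=20):
    """Apply P(v)' = S(v) x mean_{k != v} P(k) x S(v)^T for every data type v."""
    m = len(p_list)
    n = len(p_list[0])
    for _ in range(iterations):
        updated = []
        for v in range(m):
            # Average of the normalized matrices of all OTHER data types.
            avg = [[sum(p_list[k][i][j] for k in range(m) if k != v) / (m - 1)
                    for j in range(n)] for i in range(n)]
            updated.append(matmul(matmul(s_list[v], avg), transpose(s_list[v])))
        p_list = updated  # every type is updated from the previous state
    return p_list

# Toy example: 2 data types, 3 samples, kernels keeping one neighbour per sample.
p1 = [[0.5, 0.4, 0.1], [0.4, 0.5, 0.1], [0.25, 0.25, 0.5]]
p2 = [[0.5, 0.1, 0.4], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
s1 = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
s2 = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
states = snf_update([p1, p2], [s1, s2], iterations=2)
```

Each data type's state is diffused through its own local kernel but driven by the other types' states, which is how information from different data types is exchanged.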

In the feature fusion process over the M sample similarity networks, if two samples i and j are similar in all data types, their similarity is enhanced by the fusion process, and vice versa. The similarity matrices of all the data types are fused to obtain the fused similarity matrix P:

$$P = \frac{1}{M}\sum_{v=1}^{M} P^{(v)}$$
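The final fusion step amounts to an element-wise average of the per-type matrices. A minimal sketch, with the hypothetical function name `fuse` and made-up 2 x 2 inputs:

```python
def fuse(p_list):
    """Element-wise average of the per-type matrices, giving the fused matrix P."""
    m = len(p_list)
    n = len(p_list[0])
    return [[sum(p[i][j] for p in p_list) / m for j in range(n)] for i in range(n)]

# Toy example: two 2 x 2 state matrices.
fused = fuse([[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]])
```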

the fused similarity network corresponding to the fused similarity matrix P is shown in fig. 1, and a sample similarity network is constructed by the fused similarity network, as shown in fig. 2.

(3) Spectral clustering

Clustering the obtained fusion similarity matrix P by using a spectral clustering method to obtain subclasses.

Assume the total number of clusters is C. Each sample $x_i$ has a label indicator vector $y_i \in \{0,1\}^C$. When $x_i$ belongs to the k-th cluster, where k takes values in [1, C],

$$y_i(k) = 1$$

and otherwise

$$y_i(k) = 0$$

A clustering scheme is represented by the partition matrix

$$Y = \begin{pmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{pmatrix}$$

The network is partitioned using the spectral clustering algorithm:

$$\min_{Q \in \mathbb{R}^{n \times C}} \operatorname{Trace}\left(Q^T L^{+} Q\right)$$

s.t. $Q^T Q = I$

where $Q = Y(Y^T Y)^{-1/2}$ is the scaled partition matrix; $L^{+} = I - D^{-1/2} P D^{-1/2}$ is the normalized Laplacian matrix of the fused similarity matrix P; and D is the degree matrix of the similarity network corresponding to the fused similarity matrix P, whose diagonal elements are the degrees of the corresponding nodes and whose off-diagonal elements are 0. The objective function can be solved as an eigenvector decomposition problem. By computing the eigenvectors associated with the k smallest eigenvalues and applying the k-means algorithm to the reduced data, a clustering of the samples is obtained. The analysis result is shown in Fig. 3: the samples are clustered into three subclasses. Compared with Fig. 1, three clear blocks of different sizes can be seen in Fig. 3, each representing a subclass; a schematic diagram of the three blocks is shown in Fig. 4.
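The idea can be illustrated with a deliberately simplified sketch for C = 2 only. Instead of a full eigendecomposition followed by k-means, it finds the eigenvector of the second-smallest eigenvalue of the normalized Laplacian by power iteration and splits the samples on its sign. This is a stand-in for the general procedure, not the procedure itself. The function name `two_way_spectral_partition`, the symmetrization step and the toy matrix are assumptions.

```python
import math

def two_way_spectral_partition(p, iterations=500):
    """Split samples into two groups using the second eigenvector of L+ = I - D^-1/2 P D^-1/2."""
    n = len(p)
    # Assumption: symmetrize P so that the eigenvectors are real.
    a = [[(p[i][j] + p[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    # Normalized affinity N = D^-1/2 A D^-1/2, so that L+ = I - N.
    norm = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Shift to (N + I) / 2: eigenvalues lie in [0, 1], so power iteration finds
    # the largest eigenvalues of N, i.e. the smallest eigenvalues of L+.
    shifted = [[(norm[i][j] + (1.0 if i == j else 0.0)) / 2.0 for j in range(n)]
               for i in range(n)]
    # The eigenvector for eigenvalue 0 of L+ is known in closed form: D^1/2 * 1.
    top = [math.sqrt(d) for d in deg]
    top_len = math.sqrt(sum(t * t for t in top))
    top = [t / top_len for t in top]
    vec = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        # Deflation: remove the component along the trivial eigenvector.
        proj = sum(v * t for v, t in zip(vec, top))
        vec = [v - proj * t for v, t in zip(vec, top)]
        vec = [sum(shifted[i][j] * vec[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(v * v for v in vec))
        vec = [v / length for v in vec]
    # Two-way split on the sign of the second eigenvector.
    return [0 if v < 0 else 1 for v in vec]

# Toy fused matrix with two blocks of three samples each.
fused = [
    [0.50, 0.20, 0.20, 0.01, 0.01, 0.01],
    [0.20, 0.50, 0.20, 0.01, 0.01, 0.01],
    [0.20, 0.20, 0.50, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.50, 0.20, 0.20],
    [0.01, 0.01, 0.01, 0.20, 0.50, 0.20],
    [0.01, 0.01, 0.01, 0.20, 0.20, 0.50],
]
labels = two_way_spectral_partition(fused)
```

For more than two clusters, as in the three-subclass result of this embodiment, the eigenvectors of the k smallest eigenvalues would be computed with a numerical library and clustered with k-means, as the text describes.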

The invention adopts SNF algorithm, and creates a calculation model of a comprehensive view of biological information by calculating sample similarity and performing similarity network fusion. The SNF algorithm can maintain a high signal-to-noise ratio so that the individual data types can be well integrated together. And the spectral clustering algorithm can analyze the relationship between the network nodes. The method can centralize the characteristics of various data types, solves the limitation of single data analysis, and establishes a foundation for subsequent comprehensive analysis such as disease subtype identification and the like.

Some steps in the embodiments of the present invention may be implemented by software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

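1. A method for network processing of biological data, the method comprising:

S1: constructing, from sample data sets of different biological data types, a sample similarity matrix for each type;

S2: constructing, from the sample similarity matrices of each type constructed in S1, a fused similarity matrix of the multiple types of sample data using the SNF algorithm;

S3: clustering the fused similarity matrix of the multiple types of sample data obtained in S2 using a spectral clustering method, to determine the subclasses of the sample data.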

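2. The method according to claim 1, wherein the S1 includes:

normalizing each type of data in the sample data set containing different biological data types;

calculating the Euclidean distances between the normalized samples of the same type, and constructing a distance matrix;

and constructing the sample similarity matrix of each type of sample data using a Gaussian heat kernel function.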

3. The method of claim 2, wherein the Euclidean distance $d_{ij}$ is calculated as:

$$d_{ij} = \sqrt{\sum_{k=1}^{m_v} \left(x_{ik} - x_{jk}\right)^2}$$

where the sample data set contains M types of sample data, the number of samples is n, $m_v$ is the number of genes included in each type of sample data, v = 1, …, M; $x_{ik}$ denotes the k-th gene of sample i, i and j take values in [1, n], and k takes values in [1, $m_v$]; $x_{jk}$ denotes the k-th gene of sample j.

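4. The method of claim 3, wherein constructing the sample similarity matrix of each type of sample data using the Gaussian heat kernel function comprises:

denoting the sample similarity matrix of each type of sample data as $w^{(v)}$, whose entries are:

$$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\mu\,\varepsilon_{ij}}\right)$$

where μ is a hyper-parameter with a value range of [0.3, 0.8], and $\varepsilon_{ij}$ is a parameter used to eliminate the scaling problem.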

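5. The method according to claim 4, wherein $\varepsilon_{ij}$ is defined as:

$$\varepsilon_{ij} = \frac{\operatorname{mean}(d(i, N_i)) + \operatorname{mean}(d(j, N_j)) + d_{ij}}{3}$$

where $N_i$ denotes the samples other than sample i, and $\operatorname{mean}(d(i, N_i))$ is the mean distance from sample $x_i$ to the other samples $N_i$.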

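6. The method according to claim 5, wherein the S2 includes:

after obtaining the sample similarity matrix $w^{(v)}$ of each type constructed in S1, obtaining the normalized weight matrix $P^{(v)}$ of each type of sample data according to the following formula:

$$P^{(v)}(i,j) = \begin{cases} \dfrac{w_{ij}}{2\sum_{f \neq i} w_{if}}, & j \neq i \\ \dfrac{1}{2}, & j = i \end{cases}$$

where $\sum_{f \neq i} w_{if}$ is the sum of the similarities between sample i and all other samples in the same type of sample data, and f takes values in [1, n];

defining a kernel matrix S that measures local affinity, the kernel matrix of each type of sample data being denoted $S^{(v)}$:

$$S^{(v)}(i,j) = \begin{cases} \dfrac{w_{ij}}{\sum_{f \in G_i} w_{if}}, & j \in G_i \\ 0, & \text{otherwise} \end{cases}$$

where $G_i$ denotes the set of the g samples most similar to sample i, and g takes values in [20, 30];

updating the sample similarity matrix of each data type using the SNF algorithm, and iterating a predetermined number of times to obtain the updated $P^{(v)\prime}$:

$$P^{(v)\prime} = S^{(v)} \times \left(\frac{\sum_{k \neq v} P^{(k)}}{M-1}\right) \times \left(S^{(v)}\right)^T$$

where $\sum_{k \neq v} P^{(k)}$ is the sum of the normalized matrices $P^{(k)}$ of all data types other than the current data type v;

and fusing the similarity matrices of all the data types to obtain the fused similarity matrix P:

$$P = \frac{1}{M}\sum_{v=1}^{M} P^{(v)}$$

7. The method of claim 6, wherein the predetermined number of iterations is 10 to 20.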

8. The method of claim 7, further comprising: and obtaining a sample similarity network according to the sample similarity matrix.

9. Use of the method of any one of claims 1 to 8 for analysis of disease subtype identification.

10. Use of the method of any one of claims 1 to 8 in the field of bioinformatic analysis techniques.