CN106650317B

CN106650317B - A method for discovering potential tumor gene targets by collaborative filtering of public databases

Info

Publication number: CN106650317B
Application number: CN201610879877.1A
Authority: CN
Inventors: 江经纬; 孙媛媛
Original assignee: Nanjing Double Transport Biotechnology Co Ltd
Current assignee: Shuangyun biomedical technology (Suzhou) Co.,Ltd.
Priority date: 2016-10-09
Filing date: 2016-10-09
Publication date: 2019-04-16
Anticipated expiration: 2036-10-09
Also published as: CN106650317A

Abstract

The present invention provides a method for discovering potential tumor gene targets by collaborative filtering of public databases. Based on a database of known tumor gene targets and the Jaccard coefficient between different tumors, the method mines gene targets that have been found in one tumor but not yet in another. Compared with the traditional approach of finding potential tumor gene targets by manual retrieval, the method of the invention can substantially improve research efficiency, raise the hit rate of research on potential tumor targets, make the fullest use of the resources of multiple public databases, and save considerable experimental cost.

Description

A method for discovering potential tumor gene targets by collaborative filtering of public databases

Technical field

The present invention relates to the technical field of utilizing public tumor gene databases, and more particularly to a method for discovering potential tumor gene targets by collaborative filtering of public databases.

Background technique

At present, tumor gene targets are discovered mainly by reading the relevant scientific literature, forming a reasonable hypothesis, and verifying it experimentally. As experimental methods continue to be reformed and technology platforms continue to innovate, a large number of high-throughput sequencing experiments are also being applied to the discovery of tumor gene targets. As a result, findings based on large volumes of high-throughput sequencing data are stored in corresponding tumor databases, such as COSMIC and TCGA.

Tumor databases differ in emphasis: there are databases on gene mutations (such as COSMIC), databases on different tumor types and their clinical manifestations (such as TCGA), databases on tumor drug sensitivity (such as GDSC), and databases on tumor prognosis (such as CIViC). When a large number of related databases must be searched purely by hand, problems such as the sheer volume of data usually make proposing a reasonable hypothesis and designing the corresponding experiments very time-consuming and inefficient. Developing a method that uses public databases to discover potential tumor gene targets is therefore an urgent need.

The objects in a tumor gene target database are tumor types and their associated gene targets, and a database with this structure is well suited to a collaborative filtering recommendation algorithm. In general, such an algorithm discovers a user's potential interests from the similarity between users and then makes reasonable recommendations; it works by building a bipartite graph of users and commodities. It is mostly used by e-commerce websites, such as JD.com and Taobao, to recommend commodities based on the similarity between customers. However, a collaborative filtering recommendation algorithm can equally treat tumor types and their corresponding gene targets as the users and commodities: similarity can be computed between different tumor types, and gene targets can then be recommended.

At present, every researcher who relies on traditional manual searching must find potential gene targets in databases whose number and information content are growing explosively, then organize them and design experiments. This process currently takes about 3-6 months. Now and for the foreseeable next several years, the scientific literature is growing exponentially in both quantity and information content, so it is necessary to develop a database-based collaborative filtering recommendation method to solve the problem that the traditional manual search takes too long.

Summary of the invention

In view of the above problems, an object of the present invention is to provide a method for discovering potential tumor gene targets by collaborative filtering of public databases, which can substantially shorten the time researchers need for the whole process of forming a reasonable hypothesis and designing experiments.

To achieve the above object, the technical solution adopted by the invention is as follows. Tumor gene databases generally contain two kinds of information: tumor types and their associated gene mutations. Both are generally obtained by analyzing large numbers of tumor samples with GWAS and high-throughput sequencing technologies, and they carry a certain degree of genetic significance. However, owing to factors such as the heterogeneity of tumor samples, generally only some high-frequency mutated genes can be found in this way, while low-frequency mutations are difficult to identify. In fact, the different mutated genes of a given tumor type are related to one another in genetic terms, and shared metabolic pathways are often found even between different tumor types.

A bipartite graph of tumor types and their corresponding mutated genes is built from a tumor gene database, and collaborative filtering is performed on this bipartite graph, so that potential gene targets of certain tumor types can be found.
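As an illustration of the bipartite graph described above, the following Python sketch stores the graph as two adjacency maps. It is a minimal sketch under stated assumptions, not the patent's implementation: the function name `build_bipartite_graph`, the input layout (a flat list of tumor type and gene pairs), and the toy records are all hypothetical.

```python
from collections import defaultdict


def build_bipartite_graph(records):
    """Build a tumor-type / mutated-gene bipartite graph.

    `records` is an iterable of (tumor_type, gene) pairs. This flat layout is
    an assumption made for illustration; real database exports differ.
    Edges only ever join a tumor type to a gene, so the graph is bipartite
    by construction.
    """
    tumor_to_genes = defaultdict(set)
    gene_to_tumors = defaultdict(set)
    for tumor_type, gene in records:
        tumor_to_genes[tumor_type].add(gene)   # edge seen from the tumor side
        gene_to_tumors[gene].add(tumor_type)   # same edge seen from the gene side
    return dict(tumor_to_genes), dict(gene_to_tumors)


# Toy records invented for illustration only (not taken from any database).
records = [
    ("tumor_A", "GENE1"), ("tumor_A", "GENE2"),
    ("tumor_B", "GENE2"), ("tumor_B", "GENE3"),
    ("tumor_B", "GENE2"),  # a duplicate record does not create a second edge
]
tumor_to_genes, gene_to_tumors = build_bipartite_graph(records)
```

Using sets for the neighbor lists means that duplicate records collapse into a single edge and that the set operations needed later for similarity calculations are available directly.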

The present invention provides a method for discovering potential tumor gene targets by collaborative filtering of public databases, the method comprising the following operating steps:

1) Using the principles of graph theory and the information in a tumor gene database, build a bipartite graph of all tumor types in the database and their corresponding mutated genes. Two kinds of nodes are defined when building the bipartite graph: one kind is a tumor type and the other is a mutated gene. By definition, there is an edge between tumor type X and each mutated gene that the tumor gene database reports for tumor type X; there is no edge between different tumor types; and there is no edge between different mutated genes.

2) Select a specified tumor type A from the bipartite graph and treat every other tumor type as a target tumor type B. Calculate the Jaccard value between the specified tumor type A and the target tumor type B with the following formula:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

where |A| is the number of mutated genes of tumor type A, |B| is the number of mutated genes of tumor type B, |A ∩ B| is the number of mutated genes shared by tumor type A and tumor type B, and |A ∪ B| is the number of mutated genes present in tumor type A or tumor type B. The calculated Jaccard value gives the similarity between the specified tumor type A and the target tumor type B.

3) Repeat step 2) to calculate the Jaccard value between the specified tumor type A and every other target tumor type B in the tumor gene database, and select the target tumor types B_1, B_2, B_3, …, B_n whose Jaccard value is greater than 0.

4) From the tumor gene database, for the target tumor types B_1, B_2, B_3, …, B_n selected in step 3) whose Jaccard value is greater than 0, select the corresponding mutated genes that are not present in the specified tumor type A: B_11' … B_1i', B_21' … B_2j', B_31' … B_3j', …, B_n1' … B_nm'. Assign the Jaccard value between target tumor type B_i and the specified tumor type A to all of these mutated genes B_i1' … B_iq' of target tumor type B_i.

5) Among the mutated genes B_11' … B_1i', B_21' … B_2j', B_31' … B_3j', …, B_n1' … B_nm' from step 4), add together the Jaccard values of mutated genes that have the same name.

6) Rank the candidate mutated genes for the specified tumor type A from high to low by their summed Jaccard values, and judge from the summed Jaccard value whether each mutated gene is a potential gene target of the specified tumor type.

7) Search the literature to determine whether the potential gene targets from step 6) have not yet been studied in the field of the specified tumor type A.
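Steps 2) through 6) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation of the invention: the function names `jaccard` and `recommend_targets`, the dictionary layout (tumor type mapped to a set of gene names), the choice to return 0.0 when both gene sets are empty, and the toy data are all hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard value |a ∩ b| / |a ∪ b| of two gene sets.

    Returning 0.0 when both sets are empty is an assumption; the source
    does not discuss that case.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_targets(tumor_to_genes, specified):
    """Rank genes absent from `specified` by their summed Jaccard value."""
    genes_a = tumor_to_genes[specified]
    scores = defaultdict(float)
    for tumor, genes_b in tumor_to_genes.items():
        if tumor == specified:
            continue
        j = jaccard(genes_a, genes_b)        # step 2: similarity of A and B
        if j <= 0:
            continue                         # step 3: keep only Jaccard > 0
        for gene in genes_b - genes_a:       # step 4: genes of B not in A
            scores[gene] += j                # step 5: sum per gene name
    # step 6: highest summed value first; ties broken by gene name
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data invented for illustration only.
toy = {
    "A":  {"G1", "G2", "G3"},
    "B1": {"G1", "G2", "G4"},   # Jaccard with A = 2/4 = 0.5
    "B2": {"G3", "G4", "G5"},   # Jaccard with A = 1/5 = 0.2
    "B3": {"G6"},               # Jaccard with A = 0, so B3 is skipped
}
ranking = recommend_targets(toy, "A")
# G4 collects 0.5 + 0.2, G5 collects 0.2, and G6 never appears.
```

Step 7), the literature check, is left out of the sketch because it depends on an external search that the source does not specify in detail.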

In the calculation of step 2) of the present invention, the Jaccard value ranges from 0 to 1; the larger the Jaccard value, the more similar tumor type A and tumor type B are.
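The stated range follows directly from the set sizes. The short derivation below writes J(A, B) for the Jaccard value (a shorthand introduced here) and assumes that A ∪ B is not empty; the gene labels g1 to g5 in the worked example are hypothetical.

```latex
% The intersection is never larger than the union, so the ratio lies in [0, 1].
\[
0 \le |A \cap B| \le |A \cup B|
\quad\Longrightarrow\quad
0 \le J(A, B) = \frac{|A \cap B|}{|A \cup B|} \le 1 .
\]
% Worked example with hypothetical gene sets:
% A = \{g_1, g_2, g_3\}, \quad B = \{g_2, g_3, g_4, g_5\}
\[
|A \cap B| = 2, \qquad |A \cup B| = 5, \qquad J(A, B) = \frac{2}{5} = 0.4 .
\]
```

The value is 0 exactly when the two tumor types share no mutated gene, and 1 exactly when their mutated gene sets are identical.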

In step 6) of the present invention, the higher the summed Jaccard value, the higher the probability that the corresponding mutated gene is a potential gene target of the specified tumor type.
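One way to write the summed value used for ranking is given below. The symbol S(g) and the shorthand J(A, B_i) for the Jaccard value are notation introduced here for illustration and do not appear in the source; the numbers in the example are hypothetical.

```latex
% Sum, over the selected target tumor types that contain gene g,
% of their Jaccard value with the specified tumor type A (for g not in A).
\[
S(g) = \sum_{\substack{i = 1, \dots, n \\ g \in B_i}} J(A, B_i),
\qquad g \notin A, \quad J(A, B_i) > 0 .
\]
% Example: g occurs in B_1 and B_2 only, with J(A, B_1) = 0.5 and J(A, B_2) = 0.2.
\[
S(g) = 0.5 + 0.2 = 0.7 .
\]
```

Under this reading, a gene scores highly either because it is mutated in a tumor type very similar to A or because it recurs across many moderately similar tumor types.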

The tumor gene database of the present invention includes all public databases of tumor types and their corresponding gene mutations.

The advantage of the present invention is that it uses a collaborative filtering method based on tumor databases to discover potential tumor gene targets, replacing the traditional manual search. Compared with the traditional method, the present invention greatly reduces search time while supporting rational experimental design.

Brief description of the drawings

Fig. 1 is the tumor type-mutated gene bipartite network diagram built in the present invention, taking the COSMIC database as an example.

Specific embodiment

The present invention is described in further detail below with reference to the accompanying drawing and specific embodiments.

The tumor gene database of the present invention may be any publicly available database that includes tumor types and information on their corresponding gene mutations.

Embodiment 1: a method for discovering potential tumor gene targets by collaborative filtering of public databases, the method comprising the following operating steps:

Wherein |A| is the number of mutated genes of tumor type A, |B| is the number of mutated genes of tumor type B, |A ∩ B| is the number of mutated genes shared by tumor type A and tumor type B, and |A ∪ B| is the number of mutated genes present in tumor type A or tumor type B. The calculated Jaccard value gives the similarity between the specified tumor type A and the target tumor type B. The Jaccard value ranges from 0 to 1; the larger the Jaccard value, the more similar tumor type A and tumor type B are.

4) From the tumor gene database, for the target tumor types B_1, B_2, B_3, …, B_n selected in step 3) whose Jaccard value is greater than 0, select the corresponding mutated genes that are not present in the specified tumor type A: B_11' … B_1i', B_21' … B_2j', B_31' … B_3j', …, B_n1' … B_nm'. Assign the Jaccard value between target tumor type B_i and the specified tumor type A to all of these mutated genes B_i1' … B_iq' of target tumor type B_i.

5) Among the mutated genes B_11' … B_1i', B_21' … B_2j', B_31' … B_3j', …, B_n1' … B_nm' from step 4), add together the Jaccard values of mutated genes that have the same name.

Embodiment 2: as shown in Fig. 1, the tumor type-mutated gene bipartite network diagram built with the method of the invention, taking the COSMIC database as an example.

Dark nodes are tumor types and white nodes are the mutated genes corresponding to tumor types. The larger a dark node, the more mutated genes are involved in that tumor; the larger a white node, the more tumor types are associated with that mutated gene.
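The node sizes described for Fig. 1 correspond to node degrees in the bipartite graph. The Python sketch below shows one way such degrees could be computed; the function name `node_degrees`, the dictionary layout, and the toy data are hypothetical, and how a degree is mapped to a drawn node size is not specified in the source.

```python
from collections import Counter


def node_degrees(tumor_to_genes):
    """Return (tumor_degree, gene_degree) for a tumor-to-gene-set mapping.

    tumor_degree[t] counts the mutated genes linked to tumor type t
    (dark-node size); gene_degree[g] counts the tumor types linked to
    gene g (white-node size).
    """
    tumor_degree = {t: len(genes) for t, genes in tumor_to_genes.items()}
    gene_degree = Counter(
        gene for genes in tumor_to_genes.values() for gene in genes
    )
    return tumor_degree, dict(gene_degree)


# Toy data invented for illustration only.
toy_graph = {
    "T1": {"G1", "G2"},
    "T2": {"G2"},
}
tumor_degree, gene_degree = node_degrees(toy_graph)
```

In this toy graph, T1 would be drawn larger than T2, and G2 larger than G1.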

Embodiment 3: new potential gene targets of a specified tumor type, found from the tumor type-mutated gene bipartite network diagram built from the COSMIC database (taking non-small cell lung cancer, NSCLC, as an example; recommendations were made from the 2014 COSMIC database and the top 10 were chosen; publication counts are from the NCBI PubMed database for 2015-2016)

As seen from the above table, the present invention calculates the similarity between different tumor types on the basis of public tumor databases and finds potential gene targets, greatly reducing the time cost of manual searching. Taking the 2014 COSMIC database as an example, among the top 10 new potential gene targets for NSCLC found with the present invention, 3 appeared in articles published in 2015-2016: PIK3CA (27 articles), MLH1 (1 article), and EP300 (3 articles). Compared with the traditional manual search, the present invention is more efficient and more accurate, while saving substantial experimental cost and reducing blind experimentation.

It should be noted that the above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection; any combination or equivalent transformation made on the basis of the above embodiments falls within the scope of protection of the present invention.

Claims

1. A method for discovering potential tumor gene targets by collaborative filtering of public databases, characterized in that the method comprises the following operating steps:

1) using the principles of graph theory and the information in a tumor gene database, building a bipartite graph of all tumor types in the tumor gene database and their corresponding mutated genes,

wherein two kinds of nodes are defined when building the bipartite graph, one kind being a tumor type and the other a mutated gene; by definition, there is an edge between tumor type X and each mutated gene that the tumor gene database reports for tumor type X, there is no edge between different tumor types, and there is no edge between different mutated genes;

2) selecting a specified tumor type A from the bipartite graph, treating every other tumor type as a target tumor type B, and calculating the Jaccard value between the specified tumor type A and the target tumor type B with the following formula:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

wherein |A| is the number of mutated genes of tumor type A, |B| is the number of mutated genes of tumor type B, |A ∩ B| is the number of mutated genes shared by tumor type A and tumor type B, and |A ∪ B| is the number of mutated genes present in tumor type A or tumor type B,

the calculated Jaccard value giving the similarity between the specified tumor type A and the target tumor type B;

3) repeating step 2) to calculate the Jaccard value between the specified tumor type A and every other target tumor type B in the tumor gene database, and selecting the target tumor types B_1, B_2, B_3, …, B_n whose Jaccard value is greater than 0;

4) from the tumor gene database, for the target tumor types B_1, B_2, B_3, …, B_n selected in step 3) whose Jaccard value is greater than 0, selecting the corresponding mutated genes that are not present in the specified tumor type A, namely B_11' … B_1i', B_21' … B_2j', B_31' … B_3j', …, B_n1' … B_nm', and assigning the Jaccard value between target tumor type B_i and the specified tumor type A to all of these mutated genes B_i1' … B_iq' of target tumor type B_i;

5) among the mutated genes B_11' … B_1i', B_21' … B_2j', B_31' … B_3j', …, B_n1' … B_nm' from step 4), adding together the Jaccard values of mutated genes that have the same name;

6) ranking the candidate mutated genes for the specified tumor type A from high to low by their summed Jaccard values, and judging from the summed Jaccard value whether each mutated gene is a potential gene target of the specified tumor type;

7) searching the literature to determine whether the potential gene targets from step 6) have not yet been studied in the field of the specified tumor type A.

2. The method for discovering potential tumor gene targets by collaborative filtering of public databases according to claim 1, characterized in that, in the calculation of step 2), the Jaccard value ranges from 0 to 1, and the larger the Jaccard value, the more similar tumor type A and tumor type B are.

3. The method for discovering potential tumor gene targets by collaborative filtering of public databases according to claim 1, characterized in that, in step 6), the higher the summed Jaccard value, the higher the probability that the corresponding mutated gene is a potential gene target of the specified tumor type.

4. The method for discovering potential tumor gene targets by collaborative filtering of public databases according to claim 1, 2 or 3, characterized in that the tumor gene database includes all public databases of tumor types and their corresponding gene mutations.