CN110265084A

CN110265084A - Method and related device for predicting elements enriched in or depleted of riboSNitches in cancer genomes

Info

Publication number: CN110265084A
Application number: CN201910484578.1A
Authority: CN
Inventors: 苏志熙; 何馥男
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2019-06-05
Filing date: 2019-06-05
Publication date: 2019-09-20

Abstract

The present invention relates to the technical field of detecting the influence of mutations on RNA secondary structure, and more particularly to a method and related device for predicting elements that are enriched in or depleted of riboSNitches in cancer genomes. The invention provides software named SNIPER, which can be used to predict riboSNitches and to identify non-coding elements that are enriched in or depleted of riboSNitches in tumours. The invention comprises not only a description of the SNIPER software but also the first analysis of riboSNitches among somatic mutations in cancer.

Description

Method and related device for predicting elements enriched in or depleted of riboSNitches in cancer genomes

Technical field

The present invention relates to the technical field of detecting the influence of mutations on RNA secondary structure, and more particularly to a method and related device for predicting elements that are enriched in or depleted of riboSNitches in cancer genomes.

Background art

RNA secondary structure can influence cellular processes in many ways, including RNA stability, RNA localization, RNA transcription, RNA processing and even protein translation. A single nucleotide variant (SNV) that changes RNA secondary structure is called a riboSNitch. Such mutations may affect human health and lead to certain human diseases. In particular, for mutations in some non-coding regions, secondary structure is closely related to the transcription and translation of the gene. The present invention studies riboSNitches in non-coding regions in particular, especially the untranslated regions (UTRs) of genes and non-coding RNAs (ncRNAs). At present, research on mutations that change RNA secondary structure is still very limited. In addition, a large number of mutations arise during the occurrence and development of cancer, yet no one has systematically studied the influence of somatic mutations on RNA secondary structure during this process.

RNA secondary structure plays a key role in gene regulation by influencing RNA localization, stability, splicing and the translation efficiency of proteins. Most of the human genome is transcribed¹, and RNA structure may affect post-transcriptional regulation as well as every stage of protein translation, including initiation, elongation and termination². Therefore, in-depth study of RNA secondary structure may help us better understand its molecular and biological roles in regulation.

In RNA secondary structure, each base is in one of only two states: single-stranded or double-stranded. Several research groups have developed methods that use probes specific to single-stranded or double-stranded regions to detect RNA secondary structure, including SHAPE-Seq, FragSeq, Mod-seq and PARS^3–6. With the rapid development of next-generation sequencing, the efficiency and accuracy of probe-based methods for determining RNA secondary structure have increased accordingly⁷. Researchers have also developed various software tools that predict RNA secondary structure from sequence, including ViennaRNA, RNA-MoIP and RNAsnp^8–10. A recent study used PARS to analyse RNA structure in a family trio (father, mother and child) and found that nearly 15% of the transcribed SNVs in the family changed local RNA secondary structure and were identified as riboSNitches. These mutations were also heritable, which reveals how widespread riboSNitches are in the human genome^6,11. Genome-wide studies have shown that riboSNitches are significantly depleted near RNA regulatory elements such as miRNA and protein binding sites, which indicates that secondary structure in these regions is relatively conserved¹¹. Recent studies have further shown that mutations changing local RNA secondary structure may affect RNA-binding proteins (RBPs) and their binding affinity. These findings highlight the importance of RNA secondary structure around binding sites: by changing local RNA secondary structure, an SNV can alter the binding efficiency of RBPs and miRNAs and thereby play a key role in gene regulation.

Pathogenic SNVs that disrupt key RNA secondary structures may affect RNA function by changing that structure and ultimately lead to disease. Studies have found that the disease-associated point mutations underlying hyperferritinemia-cataract syndrome and retinoblastoma may regulate protein expression by changing the RNA secondary structure of the 5'UTR^14–16; these mutations have all been shown to be riboSNitches. Researchers have also found riboSNitches in the long non-coding RNA (lncRNA) of ribonuclease MRP that may be associated with human cartilage-hair hypoplasia¹⁷. In addition, a riboSNitch in the 3'UTR of FKBP5 significantly changes the secondary structure of the 3'UTR of the gene and thereby affects the binding of miR-320a; the researchers suggest that this change in secondary structure may mediate chronic post-traumatic pain¹⁸. These studies illustrate the importance of riboSNitches in humans and suggest that riboSNitches could serve as targets for treating related diseases.

With the rapid development of next-generation sequencing, cancer genome studies have sequenced thousands of tumour samples and revealed the somatic mutations present in various cancer types. Most previous studies focused mainly on the coding regions (CDS) of the genome and the untranslated regions (UTRs) of coding genes, such as protein-coding genes; few focused on non-coding regions. RiboSNitches have been found in non-small-cell lung cancer, especially in UTRs and around microRNA binding sites¹⁹. In retinoblastoma, an SNV in the 5'UTR of RB1 has been found to change expression and may be carcinogenic through a change in RNA structure¹⁴.

RNA secondary structure has important functions in post-transcriptional regulation and in protein translation, yet no one has so far studied the somatic mutations that change secondary structure in cancer genomes. The studies above lay a foundation for us and point to the prevalence of riboSNitches in cancer development. However, the role of riboSNitches in cancer genomes remains unclear. We therefore used two cancer genome databases, TCGA and ICGC, to analyse riboSNitches in cancer genomes, in order to determine whether riboSNitches are widespread in the non-coding regions of cancer genomes (including the UTRs of coding genes and long non-coding genes) and whether they are associated with tumorigenesis.

Summary of the invention

The present invention provides software named SNIPER, which can be used to predict riboSNitches and to identify non-coding elements that are enriched in or depleted of riboSNitches in tumours. The invention comprises not only a description of the SNIPER software but also the first analysis of riboSNitches among somatic mutations in cancer. We found that riboSNitches in cancer are more likely to be pathogenic. In addition, by constructing a trinucleotide-based random mutation model, we predicted elements that are significantly enriched in or depleted of riboSNitches in cancer. We consider these elements to be very important in cancer progression; non-coding genetic elements closely associated with the occurrence and development of cancer may regulate the expression of genes or proteins by changing RNA secondary structure. In short, the work underlying the invention highlights the importance of RNA secondary structure in cancer genomes and provides a new strategy for identifying new cancer-related genes.

One object of the present invention is to provide a method for predicting the degree to which a mutation affects RNA secondary structure based on the MeanDiff value, comprising: calculating the difference in base-pairing probability between a cancer sequence and the corresponding normal sequence using the equation for the MeanDiff value, the equation for the MeanDiff value being:

where k is the position of the mutation in the transcript, w is the window size, BPP_ref,i and BPP_alt,i denote the base-pairing probability of the i-th base of the reference sequence and of the mutant sequence respectively, and i ranges over [k-w, k+w];

ranking the MeanDiff values of all mutations from largest to smallest, wherein a larger MeanDiff value predicts a greater effect of the mutation on RNA secondary structure.

A further object of the present invention is to provide a method for predicting the degree to which a mutation affects RNA secondary structure based on the EucDiff value, comprising: calculating the difference in base-pairing probability between a cancer sequence and the corresponding normal sequence using the equation for the EucDiff value, the equation for the EucDiff value being:

ranking the EucDiff values of all mutations from largest to smallest, wherein a larger EucDiff value predicts a greater effect of the mutation on RNA secondary structure.
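One form of the two measures that is consistent with the symbol definitions given above (k, w, BPP_ref,i, BPP_alt,i, and i in [k-w, k+w]) is sketched below. It is offered as an assumption rather than as the exact claimed equations; in particular, normalising MeanDiff by the window length 2w+1 is an assumed choice.

```latex
\mathrm{MeanDiff}_k \;=\; \frac{1}{2w+1}\sum_{i=k-w}^{k+w}
  \left|\,\mathrm{BPP}_{\mathrm{ref},i}-\mathrm{BPP}_{\mathrm{alt},i}\right|
\qquad
\mathrm{EucDiff}_k \;=\; \sqrt{\sum_{i=k-w}^{k+w}
  \left(\mathrm{BPP}_{\mathrm{ref},i}-\mathrm{BPP}_{\mathrm{alt},i}\right)^{2}}
```

Under this reading, MeanDiff is an average absolute change per position, while EucDiff gives more weight to a few positions with large changes.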

In further embodiments of the above methods based on the MeanDiff value and the EucDiff value, the mutations in the top 2.5% of the ranking are riboSNitches and the mutations in the bottom 2.5% of the ranking are non-riboSNitches. In a further embodiment, according to any of the above methods, w is 2 bp, 5 bp, 10 bp, 15 bp, 20 bp, 25 bp, 50 bp or 200 bp, preferably 200 bp. In a further embodiment, according to any of the above methods, the RNA secondary structure is the RNA secondary structure of the mature transcript. In a further embodiment, according to any of the above methods, the mutation sequence data used come from the ICGC database, the TCGA database, the 1000 Genomes Project database or other somatic mutation data. In a further embodiment, according to any of the above methods, the method further comprises filtering out all indels; converting mutations annotated on hg38 to hg19; and removing low-confidence single nucleotide variants (SNVs). In a further embodiment, according to any of the above methods, removing low-confidence SNVs means filtering out regions of anomalous alignment, repetitive sequence regions and regions with a very high frequency of errors. In a further embodiment, according to any of the above methods, the method further comprises selecting one principal transcript for each gene and annotating the cancer somatic mutations. In a further embodiment, according to any of the above methods, RNAplfold is used to calculate the base-pairing probabilities of the cancer sequence and the corresponding normal sequence.

A further object of the present invention is to provide a device for predicting the degree to which a mutation affects RNA secondary structure, comprising:

a processor, comprising a module for obtaining sequences from a database; a computing module for calculating the MeanDiff value or the EucDiff value; a sorting module for ranking the MeanDiff values or the EucDiff values; and an output module for outputting the ranking results; and a memory storing instructions which, when executed by the processor, cause the processor to perform the method of any of the above embodiments.

A further object of the present invention is to provide a method for predicting the degree to which a mutation affects RNA secondary structure based on both the MeanDiff value and the EucDiff value, comprising:

calculating the difference in base-pairing probability between a cancer sequence and the corresponding normal sequence using the equations for the MeanDiff value and the EucDiff value respectively, the equations for the MeanDiff value and the EucDiff value being:

ranking the MeanDiff values and the EucDiff values of all mutations from largest to smallest respectively, wherein a mutation that ranks near the top for both MeanDiff and EucDiff is predicted to have a large effect on RNA secondary structure.
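The combined ranking can be sketched as follows. This is a minimal illustration, not the SNIPER implementation: the function names and the dictionary layout (mutation id mapped to score) are hypothetical, and rounding the cut-off up to at least one mutation is an assumption.

```python
import math


def top_fraction(scores, fraction=0.025):
    """Return the ids of the highest-scoring `fraction` of mutations.

    `scores` maps a mutation id to its MeanDiff or EucDiff value.
    Rounding up to at least one mutation is an assumption of this sketch.
    """
    n_top = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)  # largest first
    return set(ranked[:n_top])


def call_ribosnitches(mean_diff, euc_diff, fraction=0.025):
    """Mutations that rank in the top fraction for both measures."""
    return top_fraction(mean_diff, fraction) & top_fraction(euc_diff, fraction)
```

Taking the intersection is the stricter of the possible choices: a mutation must show both a high average change and a high Euclidean change to be called.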

In further embodiments of the above method for predicting the degree to which a mutation affects RNA secondary structure based on the MeanDiff value and the EucDiff value, according to any of the above methods, the mutations in the top 2.5% of the ranking are riboSNitches and the mutations in the bottom 2.5% of the ranking are non-riboSNitches. According to any of the above methods, w is 2 bp, 5 bp, 10 bp, 15 bp, 20 bp, 25 bp, 50 bp or 200 bp, preferably 200 bp. According to any of the above methods, the RNA secondary structure is the RNA secondary structure of the mature transcript. According to any of the above methods, the mutation sequence data used come from the ICGC database, the TCGA database, the 1000 Genomes Project database or other somatic mutation data. According to any of the above methods, the method further comprises filtering out all indels; converting mutations annotated on hg38 to hg19; and removing low-confidence single nucleotide variants (SNVs). According to any of the above methods, removing low-confidence SNVs means filtering out regions of anomalous alignment, repetitive sequence regions and regions with a very high frequency of errors. According to any of the above methods, the method further comprises selecting one principal transcript for each gene and annotating the cancer somatic mutations. According to any of the above methods, RNAplfold is used to calculate the base-pairing probabilities of the cancer sequence and the corresponding normal sequence.

A further object of the present invention is to provide a device for predicting the degree to which a mutation affects RNA secondary structure, comprising:

a processor, comprising a module for obtaining sequences from a database; a computing module for calculating the MeanDiff and EucDiff values; a sorting module for ranking the MeanDiff and EucDiff values; a module for taking the intersection of the two ranking results; and an output module for outputting the ranking and intersection results; and a memory storing instructions which, when executed by the processor, cause the processor to perform the method of any of the above embodiments.

A further object of the present invention is to provide a method for predicting elements enriched in or depleted of riboSNitches in cancer genomes, comprising:

taking predicted mutations as the prediction group; taking mutations actually occurring in cancer as the observation group, wherein the same mutation occurring at the same site in different patients is counted separately; calculating the RNA secondary structure of the mutant sequences of the prediction group and of the actual mutant sequences of the observation group;

calculating the MeanDiff value and the EucDiff value of each mutation in the two groups, the equations being as follows;

ranking the MeanDiff values and the EucDiff values of all mutations from largest to smallest respectively, wherein the intersection of the mutations in the top 2.5% by MeanDiff value and the top 2.5% by EucDiff value constitutes the riboSNitches of the prediction group and of the observation group respectively;

comparing the numbers of riboSNitches in the prediction group and the observation group using a one-sided Fisher's exact test and a hypergeometric test to identify significant enrichment of riboSNitches, and correcting the test p-values with the Benjamini-Hochberg (BH) method to obtain false discovery rate (FDR) values; after FDR correction, results with FDR < 0.05 are regarded as elements enriched in or depleted of riboSNitches.
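The test and correction step can be illustrated with a short standard-library sketch. The 2x2 layout (riboSNitch and non-riboSNitch counts in the observation group versus the prediction group) is an assumption about how the comparison is set up, and all function names are hypothetical. For a 2x2 table the one-sided Fisher's exact p-value equals a hypergeometric tail probability, which is what the code computes.

```python
from math import comb


def hypergeom_tail(k, successes, draws, population, upper=True):
    """P(X >= k) if `upper`, else P(X <= k), for a hypergeometric X."""
    xs = range(k, min(successes, draws) + 1) if upper else range(0, k + 1)
    favourable = sum(
        comb(successes, x) * comb(population - successes, draws - x) for x in xs
    )
    return favourable / comb(population, draws)


def ribosnitch_test(r_obs, n_obs, r_pred, n_pred, alternative="enriched"):
    """One-sided test on the table [[r_obs, n_obs - r_obs], [r_pred, n_pred - r_pred]]."""
    return hypergeom_tail(
        r_obs, r_obs + r_pred, n_obs, n_obs + n_pred,
        upper=(alternative == "enriched"),
    )


def benjamini_hochberg(p_values):
    """BH-adjusted p-values (FDR), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An element would then be reported when its adjusted value falls below 0.05, as stated above.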

In further embodiments of the above method for predicting elements enriched in or depleted of riboSNitches in cancer genomes, according to any of the methods, the mutant sequences of the prediction group are obtained as follows: from the mutation rates of the intronic mutation spectrum and the number of each trinucleotide in each transcript, the number of random mutations for each trinucleotide context of each transcript is obtained; repeated random sampling is performed according to these numbers, and the transcript sequence of each gene is randomly mutated according to the neutral mutation rate, thereby yielding the prediction group; wherein the intronic mutation spectrum is used to represent the neutral mutation rate in the cancer genome. In a further embodiment, according to any of the methods, w is 2 bp, 5 bp, 10 bp, 15 bp, 20 bp, 25 bp, 50 bp or 200 bp, preferably 200 bp. In a further embodiment, according to any of the methods, the random mutation is performed on non-coding regions, preferably the 5'UTR, 3'UTR and/or lncRNA. In a further embodiment, according to any of the methods, the random mutation is performed 1000 times. In a further embodiment, according to any of the methods, the mutation sequence data used come from the ICGC database, the TCGA database, the 1000 Genomes Project database or other somatic mutation data. In a further embodiment, according to any of the methods, the method further comprises comparing the riboSNitches in the cancer genome with other riboSNitches, using a one-sided Fisher's exact test to determine whether riboSNitches are enriched in regions of the cancer genome; all elements with a P value below 10^-3 are regarded as elements with cancer-specific enrichment or depletion of riboSNitches. In a further embodiment, according to any of the methods, RNAplfold is used to calculate the RNA secondary structure of the sequences. In a further embodiment, according to any of the methods, w is 200 bp and the maximum base-pair span is 150 bp.

A further object of the present invention is to provide a device for predicting elements enriched in or depleted of riboSNitches in cancer genomes, comprising: a processor, comprising a module for obtaining sequences from a database; a computing module for calculating the MeanDiff and EucDiff values described in claim 1; a sorting module for ranking the MeanDiff and EucDiff values; a module for taking the intersection of the two ranking results; a module for comparing the numbers of riboSNitches in the prediction group and the observation group; a module for performing the statistical tests; a module for identifying elements enriched in or depleted of riboSNitches; and an output module for outputting the results; and a memory storing instructions which, when executed by the processor, cause the processor to perform the method of any of the above embodiments.

The invention also includes a computer-readable medium storing any of the above instructions, wherein the instructions, when executed by a processor, cause the processor to perform any of the above methods.

The beneficial effects of the present invention are as follows. The method provided by the invention can predict the degree to which somatic mutations affect RNA secondary structure. The software provided by the invention, named SNIPER, can be used to predict riboSNitches and to identify non-coding elements that are enriched in or depleted of riboSNitches in tumours. The invention analyses riboSNitches among somatic mutations in cancer for the first time and finds that riboSNitches in cancer are more likely to be pathogenic. The invention constructs a trinucleotide-based random mutation model and predicts elements that are significantly enriched in or depleted of riboSNitches in cancer. We consider these elements to be very important in cancer progression; non-coding genetic elements closely associated with the occurrence and development of cancer may regulate the expression of genes or proteins by changing RNA secondary structure.

Brief description of the drawings

Fig. 1 is SNIPER flow chart.

Fig. 2 shows the ROC curves of the different methods at different window sizes. Panels A and B show the ROC curves of MeanDiff and EucDiff, respectively, at different window sizes. Panel C shows the ROC curve obtained at a window size of 50 bp when the mutations in the intersection of the top 2.5% by MeanDiff and EucDiff are taken as riboSNitches and the mutations in the intersection of the bottom 2.5% by MeanDiff and EucDiff are taken as non-riboSNitches.

Fig. 3 shows the difference in pathogenicity between riboSNitches and non-riboSNitches. Panels A and B show the distribution of pathogenicity classes for riboSNitches (red) and non-riboSNitches (blue) in the ICGC and TCGA data sets, respectively. All mutations were divided into the five classes shown in the figure according to their FATHMM scores, and significance was calculated with the chi-square test. Panels C and D show the distribution of FATHMM scores for riboSNitches (red) and non-riboSNitches (blue) in the ICGC and TCGA data sets, respectively; P values were calculated with the Mann-Whitney test. Panel E shows the proportion of riboSNitches (red) and non-riboSNitches (blue) among all benign and pathogenic mutations; significance was calculated with the chi-square test.

Figs. 4-7 show the distribution of MeanDiff or EucDiff values across different mutation types.

Fig. 8 shows the non-coding elements enriched in riboSNitches in cancer. The Manhattan plots show the 5'UTRs (panel A), 3'UTRs (panel B) and lncRNAs (panel C) enriched in riboSNitches; only non-coding elements with FDR < 0.2 are labelled. Genes in bold have FDR < 0.05, and genes in blue were identified as elements with cancer-specific enrichment.

Fig. 9 shows the non-coding elements depleted of riboSNitches in cancer. The Manhattan plots show the 5'UTRs (panel A), 3'UTRs (panel B) and lncRNAs (panel C) depleted of riboSNitches; non-coding elements with FDR < 0.2 are labelled. Genes in bold have FDR < 0.05, and genes in blue were identified as elements with cancer-specific depletion.

Detailed description of embodiments

The present invention is described in detail below with reference to specific embodiments, which should not be construed as limiting the scope of the invention.

Embodiment

1.1 materials and method

1.1.1 data collection

Most of the cancer somatic mutation data used in this study come from the ICGC and TCGA databases. In addition, we collected somatic mutations from 25 whole-genome-sequenced melanomas and 100 whole-genome-sequenced gastric cancers^20,21. The mutation data set for normal individuals was obtained from the 1000 Genomes Project database; the present invention uses the Phase 3 data²².

First, we filtered out all indels and retained only point mutations for further analysis. We then used the UCSC liftOver tool to convert all mutations annotated on hg38 to hg19²³. To remove low-confidence SNVs, we filtered the somatic mutations from the cancer databases and the germline mutations from the 1000 Genomes Project against a set of intervals downloaded from the Broad Institute (https://personal.broadinstitute.org/anshul/projects/encode/rawdata/blacklists). This list is the set of all blacklist regions for the hg19 human reference genome; it covers regions prone to alignment errors, such as regions of anomalous alignment, repetitive sequence regions and regions with a very high frequency of errors.
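The indel and blacklist filters described above can be sketched as follows. This is an illustration only: the tuple layout (chromosome, 1-based position, reference allele, alternative allele), the dictionary of BED-style 0-based half-open intervals and the function names are all hypothetical, and the liftOver conversion is assumed to have been done already.

```python
def is_snv(ref, alt):
    """True for a single-base substitution; indels and multi-base changes fail."""
    return (
        len(ref) == 1 and len(alt) == 1
        and ref in "ACGT" and alt in "ACGT" and ref != alt
    )


def in_blacklist(chrom, pos, blacklist):
    """Check a 1-based position against 0-based half-open (start, end) intervals.

    `blacklist` maps a chromosome name to a list of intervals, as they
    would be read from a BED file.
    """
    return any(start < pos <= end for start, end in blacklist.get(chrom, []))


def filter_variants(variants, blacklist):
    """Keep SNVs that fall outside every blacklisted interval."""
    return [
        v for v in variants
        if is_snv(v[2], v[3]) and not in_blacklist(v[0], v[1], blacklist)
    ]
```

A linear scan per variant is adequate for a sketch; for millions of variants a sorted-interval lookup or a dedicated interval tool would be the more practical choice.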

We used fathmm-MKL to predict the potential impact of riboSNitches and non-riboSNitches²⁴. According to the pathogenicity score, we divided all variants into five classes: benign (score ∈ [0, 0.2)), likely benign (score ∈ [0.2, 0.4)), potentially pathogenic (score ∈ [0.4, 0.6)), likely pathogenic (score ∈ [0.6, 0.8)) and pathogenic (score ∈ [0.8, 1]).
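The five score intervals translate directly into a small classifier. The function name and the class labels as strings are choices made for this sketch; the interval boundaries are those listed above.

```python
def pathogenicity_class(score):
    """Map a pathogenicity score in [0, 1] to one of the five classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < 0.2:
        return "benign"
    if score < 0.4:
        return "likely benign"
    if score < 0.6:
        return "potentially pathogenic"
    if score < 0.8:
        return "likely pathogenic"
    return "pathogenic"  # closed interval [0.8, 1]
```

Each lower bound is inclusive and each upper bound exclusive, except for the last class, which includes 1.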

Information on miRNA-target interactions comes from the TargetScan v7.1 and miRanda-mirSVR databases^25,26. Only microRNA binding sites with high confidence were selected for our analysis: for TargetScan, binding sites of conserved miRNA families with PCT ≥ 90 were used; for miRanda, conserved miRNA binding sites with high mirSVR scores (> 1) were used. Information on RBP-target interactions comes from CLIPdb; the present invention mainly uses CLIP-seq data from the HeLa cell line²⁷. The CLIPdb database contains binding-site predictions from several methods; the present invention mainly uses the predictions of PiRaNhA²⁸.

In addition, known cancer genes were taken from the Cancer Gene Census (CGC) in the COSMIC database²⁹, which includes established oncogenes and tumour suppressor genes. Cancer-related lncRNAs were downloaded from the Lnc2Cancer database³⁰, which contains information on all lncRNAs that may be related to cancer.

1.1.2 gene annotation

The GTF file of gene annotation coordinates comes from the GENCODE website (https://www.gencodegenes.org). The present invention mainly uses the GENCODE v19 annotation³¹. In the GTF file, genes are classified by "gene_type" into protein-coding genes, pseudogenes, long non-coding RNAs, other small non-coding genes, and so on. To annotate more accurately the gene in which each mutation lies, we obtained lists of 19,035 human protein-coding genes and 3,435 long non-coding RNAs from the HGNC database³². In the subsequent analyses we considered only somatic mutations in coding genes and lncRNAs present in the HGNC database.

It is worth noting that a gene can produce multiple transcripts through alternative splicing, and different primary sequences may form different secondary structures; that is, different transcripts form different RNA secondary structures. To reduce this uncertainty and the amount of computation, we selected only one principal transcript per gene for RNA structure prediction. The principal transcript was obtained as follows:
(1) For coding genes with multiple transcripts, the transcripts are first ranked by principal splice isoform (APPRIS) level; APPRIS provides a reliable classification of the transcript isoforms of alternatively spliced genes in the human genome³³. APPRIS levels range from 1 to 5, with 1 regarded as the most reliable transcript.
(2) If several transcripts of a gene have the same APPRIS level, a transcript with a CCDS ID is regarded as the more reliable transcript³⁴.
(3) If several transcripts have the same APPRIS level and a CCDS ID, they are ranked by the transcript annotation level in GENCODE, where 1 denotes the most stable transcript.
(4) If no principal transcript can be chosen by the above criteria, the longest transcript is selected as the principal transcript.
In short, principal transcripts are selected with the following priority: APPRIS > CCDS > transcript stability level > transcript length.
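The priority order APPRIS > CCDS > transcript stability level > transcript length maps naturally onto a tuple sort key. The sketch below assumes a hypothetical `Transcript` record whose fields have already been parsed from the annotation; the field names are not taken from GENCODE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    appris_level: int     # 1 (most reliable) to 5
    has_ccds: bool        # True if the transcript carries a CCDS ID
    stability_level: int  # 1 = most stable annotation level
    length: int           # transcript length in bases


def selection_key(t):
    """Smaller tuples win: low APPRIS level, CCDS present, low level, long."""
    return (t.appris_level, not t.has_ccds, t.stability_level, -t.length)


def principal_transcript(transcripts):
    """Pick one principal transcript for a gene."""
    return min(transcripts, key=selection_key)
```

Because tuples compare element by element, each later criterion is consulted only when all earlier ones tie, which matches the stepwise description.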

After selecting the principal transcript of each gene, we annotated all cancer somatic mutations. Note that our analysis considers only the RNA secondary structure of mature transcripts. In total, we identified 3,332,314 cancer somatic mutations from the TCGA database, the ICGC database and previous studies, and 1,917,818 germline mutations from the 1000 Genomes data set.

1.1.3 RNA secondary structure prediction and riboSNitch detection

In the present invention we mainly use RNAplfold (http://www.tbi.univie.ac.at/RNA/) to predict RNA secondary structure; RNAplfold is a program in the ViennaRNA package for predicting local RNA secondary structure⁸. Because RNA folding is co-transcriptional, that is, the RNA folds while it is being transcribed, we set the maximum base-pair span and the window size to 150 bp and 200 bp respectively³⁵. RNAplfold yields the base-pairing probability between each site of a given sequence and every other site. Using the output base-pair probability matrix (BPPM), we can reliably detect differences in RNA structure between wild type and mutant, that is, we can calculate the size of each mutation's effect on local RNA secondary structure.
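One way to reduce a base-pair probability matrix to a single value per site is to sum, for each position, the probabilities of all pairs that involve it. The sketch below assumes this reduction (the text does not state how the per-site value is derived) and assumes the pair probabilities have already been extracted from the folding output as (i, j, p) triples with 1-based positions; the function name is hypothetical. The span and window settings mentioned above presumably correspond to the -L and -W options of the RNAplfold command-line tool.

```python
def per_base_pairing_probability(seq_len, pair_probs):
    """Collapse pair probabilities into one pairing probability per site.

    `pair_probs` is an iterable of (i, j, p): positions i < j (1-based)
    pair with probability p. The returned list is 0-based.
    """
    bpp = [0.0] * seq_len
    for i, j, p in pair_probs:
        bpp[i - 1] += p  # contribution to the 5' partner
        bpp[j - 1] += p  # contribution to the 3' partner
    return bpp
```

Running this on the reference and on the mutant sequence gives two vectors of equal length that can be compared position by position.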

A riboSNitch is defined as an SNV that has a significant effect on local RNA secondary structure⁶. To predict the effect of somatic mutations on secondary structure in cancer genomes, we used RNAplfold to calculate the change in pairing probability between the tumour sequence and the corresponding normal sequence, and from this predicted the riboSNitches among the somatic mutations. Because raw sequencing data were unavailable, we used the reference sequence of each gene as the normal sequence, that is, as a standardized baseline. The reference sequence of each transcript was extracted with the getfasta module of BEDTools³⁶, using the GENCODE v19 GTF file. The corresponding tumour sequence was obtained by replacing the reference base with the mutated base. RNAplfold was then applied to the normal and cancer sequences to predict the base-pairing probability at each site. In our study we removed intron sequences and predicted the RNA secondary structure of mature transcripts only.
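Building the tumour sequence from the reference sequence is a single-base substitution. A minimal sketch, with a hypothetical function name and a 1-based transcript coordinate assumed; the check against the expected reference base guards against coordinate or strand mix-ups.

```python
def apply_snv(reference, pos, ref_base, alt_base):
    """Return `reference` with the base at 1-based `pos` replaced by `alt_base`."""
    if not 1 <= pos <= len(reference):
        raise ValueError("position outside the transcript")
    if reference[pos - 1] != ref_base:
        raise ValueError("reference base does not match the sequence")
    return reference[:pos - 1] + alt_base + reference[pos:]
```

For example, `apply_snv("ACGUACGU", 3, "G", "A")` yields a sequence of the same length that differs from the reference only at the third base.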

To quantify differences in RNA secondary structure, we used two different measures (MeanDiff and EucDiff) to calculate the difference in base-pairing probability (BPP), which expresses the size of the effect of a point mutation on local secondary structure. Because structural changes are not limited to a single base, we calculated the change in base-pairing probability within a window of w bp around the mutation site (w = 200 bp in the subsequent calculations of the present invention). The equations for MeanDiff and EucDiff are respectively:

where k is the position of the mutation in the transcript, w is the window size, BPP_ref,i and BPP_alt,i denote the base-pairing probability of the i-th base of the reference sequence and of the mutant sequence respectively, and i ranges over [k-w, k+w]. Finally, all mutations are ranked from largest to smallest by MeanDiff and EucDiff value.
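The two measures can be sketched in code under the assumption that MeanDiff is the mean absolute BPP difference over the window and EucDiff is the Euclidean distance between the two BPP vectors over the same window. These forms follow from the names and symbol definitions but are not confirmed by the text; clipping the window at the transcript ends and using a 0-based k are further assumptions of this sketch.

```python
import math


def _window(k, w, n):
    """Indices k-w .. k+w (0-based k), clipped to the sequence bounds."""
    return range(max(0, k - w), min(n, k + w + 1))


def mean_diff(bpp_ref, bpp_alt, k, w):
    """Assumed MeanDiff: mean absolute BPP change inside the window."""
    if len(bpp_ref) != len(bpp_alt):
        raise ValueError("BPP vectors must have equal length")
    idx = _window(k, w, len(bpp_ref))
    return sum(abs(bpp_ref[i] - bpp_alt[i]) for i in idx) / len(idx)


def euc_diff(bpp_ref, bpp_alt, k, w):
    """Assumed EucDiff: Euclidean distance between BPP vectors in the window."""
    if len(bpp_ref) != len(bpp_alt):
        raise ValueError("BPP vectors must have equal length")
    idx = _window(k, w, len(bpp_ref))
    return math.sqrt(sum((bpp_ref[i] - bpp_alt[i]) ** 2 for i in idx))
```

With w = 200 the window covers up to 401 positions, so most of a short UTR contributes to the score of any mutation within it.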

In addition, mutations are known to have a large effect on local secondary structure and may affect the binding of miRNAs or RBPs to their target sites. We therefore also mapped all riboSNitches and non-riboSNitches onto known miRNA and RBP binding sites. In our analysis, based on the binding sites defined in 1.1.1, a mutation located in a binding site or within 20 bp of it is considered a somatic mutation with the potential to affect miRNA or RBP binding.
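The 20 bp rule amounts to an interval test with flanks. A minimal sketch, assuming binding sites are given as 1-based inclusive (start, end) pairs on the same coordinate system as the mutation; the function name is hypothetical.

```python
def near_binding_site(pos, sites, flank=20):
    """True if `pos` lies inside a binding site or within `flank` bp of it."""
    return any(start - flank <= pos <= end + flank for start, end in sites)
```

For a site spanning positions 100 to 120, positions 80 to 140 would be counted as potentially affecting binding.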

1.1.4 assessing the performance of MeanDiff and EucDiff with a benchmark data set

To evaluate the performance of MeanDiff and EucDiff, we used as a benchmark the riboSNitch and non-riboSNitch sequences experimentally identified in previously published articles; these benchmark data sets were derived from large-scale experimental data^6,35. The data set comprises 1058 riboSNitch and 1058 non-riboSNitch sequences, each 101 bp long. RNAplfold was used to calculate the RNA structure of all sequences, and MeanDiff and EucDiff were then calculated with different window sizes, with w set to 2 bp, 5 bp, 10 bp, 15 bp, 20 bp, 25 bp and 50 bp (the maximum window). Following previous studies, mutations in the top 2.5% by MeanDiff or EucDiff were considered riboSNitches and mutations in the bottom 2.5% non-riboSNitches³⁵. Finally, ROC curves were calculated for MeanDiff and EucDiff, and performance at the different window sizes was assessed by the AUC value. Both the AUC values and the ROC curves were computed with the R package "pROC"³⁷.
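The AUC used for this assessment can be illustrated without the R package. The sketch below is not pROC; it computes the AUC through its rank interpretation, namely the probability that a randomly chosen riboSNitch scores higher than a randomly chosen non-riboSNitch, with ties counted as one half. The function name is hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))
```

With 1058 sequences in each class this is roughly 1.1 million comparisons per window size, which is small enough for the quadratic form to be practical.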

1.1.5 Detecting non-coding elements significantly enriched in or depleted of riboSNitches

Cancer is an evolutionary process in which a large share of somatic mutations are neutral mutations that are not under selection. We used a random-mutation procedure based on the mutation spectrum to simulate neutral mutations in cancer, and we call this set of random mutations the "prediction group". The random mutation rate of the prediction group can be computed from intergenic or intronic regions, because both are neutrally evolving regions that are not under selective pressure. In this study, to simulate neutral evolution in cancer more accurately, we chose the intronic mutation spectrum of cancer to represent the neutral mutation rate of the cancer genome. Because the TCGA dataset contains very few intronic mutations, only the ICGC dataset was used in the subsequent analyses.

Because different transcripts have different sequence backgrounds, the nucleotide composition of each transcript must also be taken into account when simulating random mutations for the prediction group. Since both mutation rate and mutation type depend on sequence context, we considered 96 mutation types when computing mutation rates; that is, we took into account the base immediately preceding (5' side) and immediately following (3' side) the mutated site. We computed the mutation frequency of each of the 96 possible intronic mutation types: 6 substitution types (C>A, C>G, C>T, T>A, T>C, T>G) × 4 types of 5' base × 4 types of 3' base. During random mutation we likewise accounted for the sequence composition of each transcript by considering the 5' and 3' neighbouring bases, which gives 32 trinucleotide contexts: 2 types of central base × 4 types of 5' base × 4 types of 3' base. Because of base complementarity, only two central bases need to be considered: cytosine (C) and thymine (T). From the intronic mutation rates obtained above and the count of each trinucleotide in each transcript, the number of random mutations for each trinucleotide context of each transcript can be obtained. We then performed repeated random sampling according to these mutation numbers, randomly mutating the transcript sequence of each gene at the neutral mutation rate to obtain the prediction group. Random sampling was simulated in R. In the present invention, the non-coding region of each transcript (including 5' UTR, 3' UTR and lncRNA) was randomly mutated 1000 times.
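The context bookkeeping described above can be sketched as follows. The sampling in this work was done in R, and this Python version is only an illustration under stated assumptions: sequences use the DNA alphabet, `rate` is a hypothetical dictionary mapping (pyrimidine-centred trinucleotide, substituted base) to a relative intronic mutation frequency, and the number of mutations to draw, `n_mut`, is supplied by the caller. Sites are drawn with replacement, so the same mutation can be drawn more than once.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def fold_context(tri):
    """Map a trinucleotide onto its pyrimidine-centred form (central base C or T)."""
    if tri[1] in "CT":
        return tri
    return tri.translate(COMPLEMENT)[::-1]  # reverse complement

def context_counts(seq):
    """Count the 32 pyrimidine-centred trinucleotide contexts in one transcript."""
    seq = seq.upper()
    counts = Counter()
    for i in range(1, len(seq) - 1):
        tri = seq[i - 1:i + 2]
        if set(tri) <= set("ACGT"):
            counts[fold_context(tri)] += 1
    return counts

def simulate_mutations(seq, rate, n_mut, rng):
    """Draw n_mut random substitutions, weighted by context-specific rates.

    Returns (position, ref, alt) tuples on the transcript strand.
    """
    seq = seq.upper()
    candidates, weights = [], []
    for i in range(1, len(seq) - 1):
        tri = seq[i - 1:i + 2]
        if not set(tri) <= set("ACGT"):
            continue
        folded = fold_context(tri)
        for alt_py in "ACGT":
            r = rate.get((folded, alt_py), 0.0)
            if alt_py == folded[1] or r <= 0:
                continue
            # Translate the substitution back when the site was folded.
            alt = alt_py if tri[1] in "CT" else alt_py.translate(COMPLEMENT)
            candidates.append((i, tri[1], alt))
            weights.append(r)
    if not candidates:
        return []
    return rng.choices(candidates, weights=weights, k=n_mut)
```

Calling `simulate_mutations` 1000 times with a seeded `random.Random` would give a reproducible prediction group for one transcript.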

After obtaining the randomly mutated sequences, we used RNAplfold to compute the RNA secondary structure of each mutant sequence, from which the MeanDiff and EucDiff of each mutation, i.e. its effect on secondary structure, can be calculated. To reduce false positives among predicted riboSNitches, in this study the intersection of the mutations in the top 2.5% by MeanDiff and the top 2.5% by EucDiff was regarded as riboSNitches. Because our study focuses on mutations in non-coding regions, random mutation was applied only to non-coding regions; that is, the random mutation based on the intronic mutation spectrum was simulated and evaluated only in non-coding regions.
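The intersection rule can be sketched in Python as below. The function names are illustrative, and the handling of ties and of very small inputs (at least one mutation is always kept) is an assumption that the description does not specify.

```python
def top_fraction(scores, fraction=0.025):
    """Indices of the highest-scoring `fraction` of mutations (at least one)."""
    n = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:n])

def call_ribosnitches(mean_diff, euc_diff, fraction=0.025):
    """Mutations in the top fraction of BOTH scores are called riboSNitches."""
    if len(mean_diff) != len(euc_diff):
        raise ValueError("score lists must describe the same mutations")
    return top_fraction(mean_diff, fraction) & top_fraction(euc_diff, fraction)
```

Requiring both scores to agree trades sensitivity for fewer false positives, which matches the stated motivation.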

After obtaining the prediction group by random mutation, we took the somatic mutations that actually occur in cancer as the observed group. By comparing the number of riboSNitches in the observed group with the number in the prediction group, genetic elements significantly enriched in or depleted of riboSNitches in the cancer genome can be found (in the present invention, elements comprise only non-coding regions: UTRs and lncRNAs). Because the same mutation can occur at the same site in different patients, such mutations were counted separately in the observed group, and in the prediction group random mutation was likewise simulated with replacement. Comparing the riboSNitch counts of the observed and prediction groups therefore predicts the elements significantly enriched in or depleted of riboSNitches.

After obtaining the observed and expected riboSNitch counts, we identified significant riboSNitch enrichment with a one-sided Fisher's exact test and a hypergeometric test, and corrected the P values with the Benjamini-Hochberg (BH) method to obtain FDR values³⁸. After false discovery rate correction, elements with FDR < 0.05 were regarded as enriched in or depleted of riboSNitches. The whole pipeline and code were packaged into the software SNIPER in Perl, and the statistical analysis was done in R.
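The statistics here were run in R. The Python sketch below only illustrates the two ingredients: a one-sided Fisher's exact test, computed as a hypergeometric upper tail, and BH adjustment. The 2×2 layout (observed riboSNitches and other observed mutations in the first row, the corresponding prediction-group counts in the second) is an assumption, since the description does not spell out the table.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher exact P value, P(X >= a), for the table [[a, b], [c, d]].

    Assumed layout: a = observed riboSNitches, b = other observed mutations,
    c = predicted riboSNitches, d = other predicted mutations.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(total - row1, col1 - x) for x in range(a, upper + 1))
    return tail / comb(total, col1)

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P values (FDR), returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted
```

Testing for depletion would use the opposite tail, for example by swapping the two rows of the table.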

1.1.6 Elements with cancer-specific enrichment or depletion of riboSNitches

To obtain cancer-specific elements, we compared the riboSNitches in the cancer genome with those in the 1000 Genomes dataset. A one-sided Fisher's exact test was used to determine whether riboSNitches are enriched in a given region of the cancer genome. All elements with a P value below 10^-3 were regarded as cancer-specific elements.

1.1.7 Overview of the SNIPER software package

The SNIPER pipeline package consists of two parts: the first computes the MeanDiff and EucDiff values of mutations, and the second identifies elements enriched in or depleted of riboSNitches according to the intronic mutation rate of the cancer samples.

First, RNA secondary structures are predicted for all somatic mutations in the ICGC dataset and, for each transcript, for the 1000 random mutations generated from the intronic mutation frequencies of the 96 mutation types and the trinucleotide distribution. Next, MeanDiff and EucDiff are used to compute the secondary-structure difference between the reference sequence and the mutant sequence. Mutations in the top 2.5% of MeanDiff and EucDiff are then defined as riboSNitches, and those in the bottom 2.5% of MeanDiff and EucDiff as non-riboSNitches. Comparing the observed and expected riboSNitch counts then detects elements enriched in or depleted of riboSNitches.

1.2 Conclusions

1.2.1 MeanDiff and EucDiff are effective measures for detecting riboSNitches

For each mutation, we took the gene's reference sequence from the human reference genome and replaced the original base with the mutated base to obtain the mutant sequence. We then used RNA secondary structure prediction software to predict the secondary structures of the reference and mutant sequences, from which the effect of the mutation on RNA structure can be inferred. Following the approach of an earlier riboSNitch study³⁵, we chose RNAplfold, an algorithm based on the base-pairing probability matrix (BPPM), to predict RNA conformation. Moreover, because RNA folding is co-transcriptional, using RNAplfold to predict the local pairing probabilities around the mutation is also more appropriate.
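Constructing the mutant sequence is a single-base substitution. A minimal Python sketch, with a hypothetical function name and 0-based positions, might look like this; the check against the reference base guards against coordinate or strand mix-ups.

```python
def make_mutant(ref_seq, pos, ref_base, alt_base):
    """Return ref_seq with the base at 0-based `pos` replaced by alt_base."""
    if not 0 <= pos < len(ref_seq):
        raise IndexError("mutation position lies outside the transcript")
    if ref_seq[pos].upper() != ref_base.upper():
        raise ValueError(
            f"reference base mismatch at {pos}: sequence has {ref_seq[pos]}, expected {ref_base}"
        )
    return ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
```

The reference and mutant sequences returned this way are the two inputs whose pairing probabilities are then compared.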

The present invention introduces two new measures for calling riboSNitches: MeanDiff and EucDiff. To assess their performance, we used a benchmark dataset of 1,058 riboSNitch and 1,058 non-riboSNitch sequences, each 101 bp long. These riboSNitch and non-riboSNitch sequences were found in a family trio by the experimental technique PARS⁶,³⁵. We took the mutations with the highest and lowest 2.5% of MeanDiff or EucDiff values as riboSNitches and non-riboSNitches, respectively. Comparing ROC curves and AUC values, we found that MeanDiff and EucDiff outperform the best values reported in the earlier article under the same conditions, which were obtained with the software SNPfold (Table 1). This shows that MeanDiff and EucDiff distinguish riboSNitch from non-riboSNitch mutations more accurately.

Table 1 AUC values of the MeanDiff, EucDiff and SNPfold methods at different window sizes

* NA indicates that Corley et al. (2015) did not report a result for that window size

To examine the effect of window size on riboSNitch prediction, we compared window sizes of 2 bp, 5 bp, 10 bp, 15 bp, 20 bp, 25 bp and 50 bp (the largest value the benchmark dataset allows). Fig. 2 shows that the AUC increases with window size, i.e. riboSNitches and non-riboSNitches are discriminated better. In subsequent analyses we can therefore improve the accuracy of riboSNitch prediction by enlarging the window. Since RNA folding is a co-transcriptional process, when we later used RNAplfold to search for riboSNitches in the cancer genome, we used a window size of 200 bp and a maximum span of 150 bp for a single base pair.
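For orientation, the sketch below shows how these two settings might be passed to the ViennaRNA command-line tool RNAplfold from Python, using its `-W` (window size) and `-L` (maximum base-pair span) options. The exact command line used for SNIPER is not given in this description, and reading the 150 bp value as the `-L` span is an interpretation; the helper names are hypothetical.

```python
import subprocess

def rnaplfold_command(winsize=200, span=150):
    """Build an RNAplfold argument list (assumes RNAplfold is on PATH when run)."""
    if span > winsize:
        raise ValueError("base-pair span cannot exceed the window size")
    return ["RNAplfold", "-W", str(winsize), "-L", str(span)]

def run_rnaplfold(fasta_text, workdir=".", winsize=200, span=150):
    """Feed FASTA text to RNAplfold on stdin; output files are written into workdir."""
    subprocess.run(
        rnaplfold_command(winsize, span),
        input=fasta_text,
        text=True,
        check=True,
        cwd=workdir,
    )
```

Running the reference and mutant sequences with identical settings keeps their pairing-probability profiles directly comparable.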

In addition, the ROC curves show that on this benchmark dataset MeanDiff performs slightly better than EucDiff. To reduce the error rate, however, in subsequent analyses we regard as riboSNitches the mutations in the intersection of the top 2.5% of MeanDiff values and the top 2.5% of EucDiff values, and as non-riboSNitches the mutations in the intersection of the bottom 2.5% of both. With riboSNitches and non-riboSNitches defined in this way, the final AUC reaches 0.774 (Fig. 2), which further improves the accuracy of riboSNitch prediction and helps our subsequent prediction of riboSNitches in the cancer genome.

1.2.2 RiboSNitches and non-riboSNitches in the cancer genome

To find riboSNitches in the cancer genome, we collected a large amount of somatic mutation data, including the TCGA and ICGC datasets and previously published somatic mutation data from whole-genome sequencing of 25 melanomas and 100 gastric cancers²⁰,³⁹. Because only a small part of the genome is transcribed into RNA and folded as a transcript, only the small fraction of point mutations falling in that part was used in our study. In addition, intronic mutations were filtered out of our analysis, and only point mutations on mature transcripts were considered. The numbers of somatic mutations in transcribed regions and of riboSNitches for each cancer type are given in Table S1.

After obtaining all mutations that fall in exons, we took into account that a single gene can be alternatively spliced into several different transcripts, so that one point mutation may affect different transcripts differently. To reduce computation, only one principal transcript was selected for each gene in the subsequent analyses; how this transcript is selected is described in detail in the gene annotation part of section 1.1.2. To obtain the riboSNitch and non-riboSNitch mutations among somatic mutations, we used RNAplfold to predict the secondary structure of the sequence before and after each somatic mutation and computed MeanDiff and EucDiff from the resulting probability matrices. Finally, somatic mutations in the intersection of the top 2.5% of MeanDiff values and EucDiff values were regarded as riboSNitches, and those in the intersection of the bottom 2.5% as non-riboSNitches.

1.2.3 RiboSNitches in the cancer genome are more likely to be pathogenic mutations

To determine whether riboSNitches and non-riboSNitches may have different functional consequences, we used fathmm-MKL to annotate the functional impact score of each somatic mutation²⁴. Since the fathmm-MKL score is a continuous value from 0 to 1, we divided the scores into 5 intervals from low to high: benign, likely benign, potentially pathogenic, likely pathogenic and pathogenic. We found that the functional consequences of riboSNitches and non-riboSNitches range from benign to pathogenic, with especially significant differences in the benign and pathogenic categories (Fig. 3A and 3B). Overall, riboSNitches also have higher fathmm-MKL scores than non-riboSNitches. Because a higher score indicates a more pathogenic mutation, we conclude that riboSNitches in the cancer genome are more likely to be pathogenic than non-riboSNitches (Fig. 3C and 3D), which agrees with earlier work on normal human genomes¹⁶. These results also suggest that somatic mutations that significantly alter RNA secondary structure are more likely to be pathogenic.
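The binning step can be sketched in Python as follows. The description states only that the score range was split into five ordered intervals; the equal-width boundaries (0.2 per interval) used here are an assumption made for illustration, and the function name is hypothetical.

```python
CATEGORIES = [
    "benign",
    "likely benign",
    "potentially pathogenic",
    "likely pathogenic",
    "pathogenic",
]

def score_category(score):
    """Assign a fathmm-MKL score in [0, 1] to one of five ordered categories.

    Assumption: five equal-width intervals; the cut points actually used
    are not stated in the description.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("fathmm-MKL scores are expected to lie in [0, 1]")
    return CATEGORIES[min(int(score * 5), 4)]
```

Counting riboSNitches and non-riboSNitches per category then gives the kind of distribution compared in Fig. 3A and 3B.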

To confirm the above finding that riboSNitches are more likely to be pathogenic, we collected a total of 91183 pathogenic mutations and 79090 benign mutations from ClinVar, UniProt and the Human Gene Mutation Database (HGMD)⁴⁰⁻⁴², to determine whether riboSNitches really do occur more readily among pathogenic mutations. Because most mutations in these databases come from normal human samples, we used as cutoffs the MeanDiff and EucDiff values corresponding to the top 2.5% and bottom 2.5% in the 1000 Genomes data; as before, the intersection of the top 2.5% of mutations was taken as riboSNitches and the intersection of the bottom 2.5% as non-riboSNitches. As shown in Fig. 3E, riboSNitches and non-riboSNitches are distributed clearly differently between benign and pathogenic mutations: pathogenic mutations tend to be riboSNitches, whereas benign variants tend to be non-riboSNitches (P value = 2.87E-05, chi-square test). Known benign and pathogenic point mutations from these databases therefore confirm that pathogenic mutations contain more riboSNitches and benign mutations contain more non-riboSNitches.

1.2.4 Features of riboSNitches in the cancer genome

To characterize riboSNitches in the cancer genome further, we first divided mutations into 6 mutation types (C>A, C>G, C>T, T>A, T>C, T>G). We found that the distribution of MeanDiff or EucDiff values differs among mutation types, which indicates that different mutation types may affect RNA secondary structure differently (Fig. 4-7). Interestingly, in both the ICGC and TCGA datasets, C>G mutations have higher MeanDiff and EucDiff values than the other mutation types, indicating that C>G mutations have a larger effect on RNA secondary structure. In addition, mutations in the 5' UTR have higher MeanDiff and EucDiff values than mutations in the 3' UTR, lncRNAs and protein-coding regions (P < 2.2e-16, Mann-Whitney test); that is, mutations in the 5' UTR alter RNA secondary structure more readily, which may be related to the many highly conserved domains in 5' UTRs. Interestingly, more than 80% of the somatic mutations in 5' UTRs in the cancer genome occur at GC base pairs. Because a GC pair has three hydrogen bonds, GC base pairs are structurally more stable than AT base pairs. This result therefore suggests that the many mutations at GC pairs in cancer genomes may affect gene function by destabilizing local RNA structure, and it may also help explain why C>T mutations are more frequent in the cancer genome.
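Collapsing the twelve possible substitutions onto these six pyrimidine-centred types relies on base complementarity (for example, G>A on one strand is C>T on the other). A minimal Python sketch, with hypothetical function names and a DNA alphabet assumed, could group scores by type as follows.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitution_type(ref, alt):
    """Return the pyrimidine-centred substitution type, e.g. ('G', 'A') -> 'C>T'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    if ref in ("A", "G"):  # purine reference: describe it on the opposite strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def group_by_type(mutations):
    """Group structure scores by type, given (ref, alt, score) triples."""
    groups = defaultdict(list)
    for ref, alt, score in mutations:
        groups[substitution_type(ref, alt)].append(score)
    return dict(groups)
```

The per-type score lists can then be compared, for instance with a rank-based test such as the Mann-Whitney test mentioned above.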

1.2.4 RiboSNitches that affect gene function

To determine whether riboSNitches are significantly enriched in functional regions of the cancer genome, we collected high-confidence miRNA binding sites from the TargetScan and miRanda datasets²⁵,²⁶, and also collected RBP binding sites from the CLIPdb database. Compared with non-riboSNitches, riboSNitches are significantly depleted around RBP binding sites in the cancer genome (P value = 1.79E-07, one-sided Fisher's exact test), which is consistent with the earlier study of a family trio⁶ and indicates that RBP binding targets are also under purifying selection in the cancer genome. In contrast, riboSNitches are enriched around miRNA binding targets (P value = 5E-21, one-sided Fisher's exact test), which suggests that riboSNitches may further affect gene function in cancer by affecting miRNA binding.

1.2.5 SNIPER, a computational framework for predicting elements enriched in or depleted of riboSNitches

In this study we aimed to develop a computational framework that identifies riboSNitches among mutations and identifies elements enriched in or depleted of riboSNitches from cancer somatic mutations. We hypothesize that if a riboSNitch-enriched gene carries more mutations affecting RNA secondary structure than other genes in the cancer genome, then that gene has undergone positive selection on RNA structure during the initiation and progression of cancer. Having established MeanDiff and EucDiff for predicting riboSNitches, we continued to use both measures to determine the number of riboSNitches in the cancer genome. By repeated sampling, random mutation was performed according to the cancer mutation rate to obtain the prediction group, and the riboSNitch counts of the prediction group and the observed group were finally compared and tested statistically to obtain the elements significantly enriched in or depleted of riboSNitches. The computational framework and details of SNIPER are described in section 1.1.5. Unlike previous methods, this method simulates mutations using the intronic mutation rate rather than the exonic mutation rate, so the riboSNitch-rich genes detected by SNIPER can be regarded as genes under structural positive selection in cancer. In addition, in section 1.2.3 we observed that riboSNitches are more likely to be pathogenic mutations, so we speculate that riboSNitch-enriched genes may also be functionally important genes in the cancer genome and may play a considerable role in cancer initiation and progression. Similarly, genes depleted of riboSNitches may be structurally conserved genes in cancer, possibly essential genes of the cell, that play a significant role in cancer cells.

Although primary sequence is important for regulating gene expression levels, RNA secondary structure also plays an important role in RNA expression and even protein expression, in particular by affecting post-transcriptional regulation of RNA, such as the interaction of RBPs or miRNAs with their binding sites. Our method can therefore help us understand mutations that alter RNA secondary structure. It can be used to identify point mutations that significantly change structure and to identify genes and elements enriched in or depleted of riboSNitches. We believe that these mutations, and the genes enriched in them, may be related to cancer development and may potentially affect gene function.

1.2.6 Identifying non-coding elements specifically enriched in riboSNitches in cancer

In subsequent analyses, to obtain the non-coding elements enriched in riboSNitches genome-wide, we applied SNIPER to the ICGC dataset to detect riboSNitch-enriched elements and analyzed the function of the genes containing these elements in cancer initiation and progression. The RNA secondary structure of UTR regions is very important; mutations may affect gene expression by altering microRNA or RBP binding and thereby contribute to tumorigenesis¹²,¹⁴,¹⁸,¹⁹,⁴³. In our analysis, both MeanDiff and EucDiff were used to predict the effect of mutations on secondary structure, and the intersection of the mutations in the top 2.5% of each was taken as the final set of riboSNitches. After obtaining the riboSNitch counts of the observed and prediction groups, statistical significance was computed with Fisher's exact test and an enrichment test to obtain P values, which were corrected with the BH method³⁸. Because the function of protein-coding sequence is complex, and because the RNA structure of the 3'UTR and 5'UTR is essential for gene regulation and translational stability, we applied SNIPER only to the detection of non-coding elements. In summary, the rest of this section mainly uses our method SNIPER to identify candidate elements in the non-coding regions (UTRs) of coding genes and in long non-coding RNAs (lncRNAs).

First, we applied SNIPER to somatic mutations in 5'UTRs to find genes whose 5' UTR is enriched in structure-altering mutations. As Fig. 8 shows, the 5' UTRs of two genes have FDR values below 0.05: KAT6A and NOTCH2. Both genes are closely related to cancer, and both are known cancer genes in the COSMIC Cancer Gene Census (CGC). Among genes with riboSNitch-enriched 5'UTRs, we found that cancer genes are enriched 224-fold relative to a random distribution (P value < 2.2e-16, chi-square test); that is, cancer genes are clearly enriched in our results (Table S2: enrichment scores at different FDR thresholds), which further shows that SNIPER can indeed find cancer-related genes. NOTCH2 is a cell-membrane receptor closely related to cell proliferation and differentiation. NOTCH2 acts as both an oncogene and a tumor suppressor gene, and it plays an important role in cancer signaling pathways⁴⁴,⁴⁵. KAT6A is a lysine acetyltransferase gene that previous studies have shown to participate in and control cell growth in breast cancer⁴⁶. In addition, when the q-value cutoff is relaxed to 0.2, the 5' UTR of RALGPS2 is also identified as a riboSNitch-enriched region. RALGPS2 is also regarded as a potential driver in cancer, and studies have shown that it affects the cell survival and cell cycle of lung cancer cells⁴⁷. In summary, we found that NOTCH2, KAT6A and RALGPS2 are potential cancer driver genes, which shows that SNIPER, starting from the perspective of effects on RNA secondary structure, can also help us find cancer-related driver genes.

We also used our method to identify riboSNitch-rich elements in 3'UTRs. As with the 5'UTR, SNIPER identified 7 riboSNitch-rich 3'UTRs at a q-value cutoff of 0.05: CLCNKB, CYP4B1, SLC9B1, CCDC104, POLR2M, ACAD11 and DIO1 (Fig. 8B). CYP4B1 is a cytochrome enzyme, and a previous study found higher expression of this gene in patients with bladder tumors⁴⁸. SLC9B1 is a Na+/H+ transporter that helps maintain cellular homeostasis⁴⁹. POLR2M is RNA polymerase II subunit M; it plays a key role in gene transcription and is regarded as a candidate driver gene of prostate cancer⁵⁰. ACAD11 is a member of the acyl-CoA dehydrogenase family; it participates in cell survival and plays a key role in TP53-related pathways⁵¹. DIO1 encodes type I iodothyronine deiodinase, an important regulator of cell proliferation, differentiation and metabolism⁵². In addition, when the q-value cutoff is relaxed to 0.2, the 3'UTRs of SMO, SRPK1, FOXD4 and DBP are also identified as riboSNitch-enriched. Of these, SRPK1 has been clearly reported to have a tumor-suppressive effect and is a candidate driver gene⁵³,⁵⁴. In summary, SNIPER can identify cancer driver elements not only in 5' UTRs but also in 3' UTRs.

To determine whether the riboSNitch-enriched elements identified above are truly cancer-specific, i.e. whether the riboSNitch count of an element differs clearly between cancer and normal individuals, we compared the riboSNitches observed in the cancer genome with those observed in the 1000 Genomes data. Among 5'UTRs, KAT6A and RALGPS2 were found to be elements with cancer-specific riboSNitch enrichment. Among 3' UTRs, apart from CLCNKB and SMO, the other nine genes (CYP4B1, SLC9B1, CCDC104, POLR2M, ACAD11, DIO1, SRPK1, FOXD4 and DBP) were identified as having cancer-specific riboSNitch-rich 3' UTRs. These results suggest that elements with cancer-specific riboSNitch enrichment are likely to be cancer-related functional elements or putative driver elements.

For lncRNAs, at a q-value cutoff of 0.05 only USP30-AS1 was identified as a riboSNitch-enriched lncRNA, and it is not an element with cancer-specific riboSNitch enrichment. When the q-value cutoff was relaxed to 0.1, three further lncRNAs were identified: LINC01365, ZNF503-AS1 and LINC00689. Of these, ZNF503-AS1 and LINC00689 were predicted to be lncRNAs with cancer-specific riboSNitch enrichment (Fig. 8C). Interestingly, ZNF503-AS1 can promote the proliferation and migration of pigment epithelial cells by regulating its antisense gene ZNF503, and ZNF503-AS1 expression has been shown to be a prognostic indicator for lung squamous cell carcinoma⁵⁵,⁵⁶. On FuncPred, a website that predicts lncRNA function from gene networks⁵⁷, we found that three of the lncRNAs predicted above (USP30-AS1, ZNF503-AS1 and LINC00689) are potentially associated with cancer disease processes, each with an FDR below 0.05.

1.2.7 Identifying non-coding elements specifically depleted of riboSNitches in cancer

Our method can also be used to predict elements significantly depleted of riboSNitches in the cancer genome; these RNA-secondary-structure-related elements may be essential elements in cancer initiation and progression. We identified elements significantly depleted of riboSNitches among 5'UTRs, 3'UTRs and lncRNAs (Table S3). At a q-value cutoff of 0.05, we found 4 5'UTR elements significantly depleted of riboSNitches (ING3, RBM22, NSA2 and TAF2), all of which show cancer-specific riboSNitch depletion (Fig. 9A). We found 22 3'UTR elements significantly depleted of riboSNitches, but only two of them (KPNA4 and GABBR2) were identified as cancer-specific. Among the elements above, ING3, RBM22, NSA2 and KPNA4 are listed in the OGEE v2 database as conditionally essential genes in cancer cells⁵⁸. Our results provide evidence that regions with cancer-specific riboSNitch depletion may be essential elements in cancer. In addition, at a q-value cutoff of 0.05 we found 7 lncRNAs significantly depleted of riboSNitches, of which only LINC00698 is a cancer-specific lncRNA; upregulation of this gene has been shown to be possibly related to cancer initiation and progression⁵⁹.

1.3 Discussion

Our study approaches the problem from the perspective of RNA secondary structure, provides a new method for detecting potential cancer driver genes, and highlights the importance of non-coding mutations that affect RNA secondary structure for the regulation of cancer gene expression. To our knowledge, this is the first comprehensive analysis of somatic mutations that significantly alter RNA secondary structure in the two cancer databases TCGA and ICGC. We found that different mutation types affect RNA secondary structure differently, and that such mutations are enriched around miRNA binding sites but depleted around RBP binding sites. These results suggest that somatic mutations in the cancer genome can also influence cancer initiation and progression by altering RNA secondary structure¹⁹, and may even have the potential to regulate gene or protein expression¹²,¹³,¹⁸.

In addition, we found that riboSNitches in the cancer genome are more likely to be disease-causing mutations, consistent with the earlier conclusion from a family trio⁶. This indicates that mutations that significantly alter RNA secondary structure in the cancer genome are more likely to be related to disease and may even lead to cancer. We therefore developed a new method, SNIPER, to detect riboSNitches in cancer and to predict riboSNitch-rich elements in non-coding regions. We built a neutral mutation model from the cancer intronic mutation spectrum and the trinucleotide background of each transcript, and used it to construct a prediction group of random cancer mutations. By comparing the riboSNitches in the cancer databases with those of the prediction group, elements enriched in or depleted of riboSNitches can be predicted. Because the 96-type mutation spectrum used in our analysis is intronic, SNIPER can detect positive-selection signals in the observed group more effectively. As shown above, non-coding elements rich in riboSNitches are likely to be cancer drivers, whereas most non-coding elements depleted of riboSNitches are essential genes of the cancer genome. We also identified lncRNAs significantly enriched in or depleted of riboSNitches, but more experimental data are needed before the function of these lncRNAs in cancer progression can be studied. In summary, we have successfully built a method, SNIPER, that finds elements enriched in or depleted of riboSNitches in the cancer genome and can also help identify more cancer drivers and essential genes.

Many experimental techniques and software tools have now been developed to detect and analyze RNA secondary structure genome-wide⁶,⁹,¹⁰,⁶⁰,⁶¹. Because RNA secondary structure is highly variable both in vivo and in vitro, it remains difficult to predict it accurately with a single method. Some software identifies the effect of a single mutation on RNA secondary structure by integrating different computational methods. It should be noted, however, that riboSNitches predicted from experimental data are still more reliable than those predicted by existing software⁶². Since experimental data on RNA secondary structure in cancer genomes are lacking, it is still acceptable in the present invention to use MeanDiff and EucDiff to predict the effect of somatic mutations on RNA secondary structure in order to study riboSNitches in genome-wide cancer databases. As shown in Fig. 2, MeanDiff and EucDiff perform much better than the other methods listed in earlier studies. It is therefore feasible to use these two measures to predict riboSNitches in the cancer genome. Of course, further effort is needed to explore why mutations affect RNA secondary structure and to investigate the biological mechanisms by which riboSNitches affect gene expression.

Existing methods for identifying non-coding driver elements in cancer from somatic mutations mainly look for positive-selection signals in non-coding regions by comparing the mutation rate of a target region with that of its flanking regions; this approach can help find more functional elements, such as promoters, enhancers and silencers⁶³,⁶⁴. In the present invention we developed a new method, SNIPER, which uses the effect of somatic mutations on RNA secondary structure as its measure and, by comparing the riboSNitch count of an element with that of the randomly mutated prediction group, identifies positive-selection signals for significant changes of secondary structure in the cancer genome. Although most genes in the genome are transcribed, we focused only on changes in RNA secondary structure in the UTRs of coding genes and in lncRNAs. Although many studies have been carried out to predict potentially functional lncRNAs⁵⁷,⁶⁵,⁶⁶, the molecular functions of a large number of lncRNAs remain to be explored.

Next-generation sequencing allows genome-wide analysis of variation in the human genome and has greatly improved our understanding of variants that affect RNA secondary structure. In cancer genomics in particular, the accumulation of cancer genome sequencing data has produced large databases of cancer somatic mutations. Although the total number of riboSNitches in the observation group we finally used is less than 2% of the total number of somatic mutations, SNIPER can still find regions of the genome that are enriched for or depleted of riboSNitches in cancer samples. As mutation data continue to accumulate, we expect that it will become possible to analyze how riboSNitch-enriched or riboSNitch-depleted elements differ in distribution across cancer types. More data and better riboSNitch prediction methods would also help identify candidate cancer drivers and essential elements in non-coding regions more effectively.

In the present invention, we carried out a preliminary analysis of the characteristics of riboSNitches in cancer genomes and found that riboSNitches in cancer genomes are more likely to be pathogenic mutations. We also built a computational framework, SNIPER, to effectively identify potential drivers and essential genes in cancer. By exploring RNA secondary structure in cancer genomes, we obtained potential functional elements related to RNA secondary structure. Our method and reasoning nevertheless need to be validated with more data. In short, we emphasize the importance of riboSNitches in cancer genomes, but assessing whether these mutations actually participate in tumorigenesis, and how they affect post-transcriptional regulation and translation in cancer, remains a challenge.

The present invention has been described above with reference to specific embodiments. Those skilled in the art may make equivalent modifications or variations without departing from the spirit of the invention, and these likewise fall within the scope of the claims.

Bibliography

1.Abbosh,C.et al.Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution.Nature545,446–451(2017).

2.Mortimer,S.A.,Kidwell,M.A.&Doudna,J.A.Insights into RNA structure and function from genome-wide studies.Nat.Rev.Genet.15,469–479(2014).

3.Julius,B.,Lucks.Multiplexed RNA structure characterization with selective 2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq).

4.Underwood,J.G.et al.FragSeq:transcriptome-wide RNA structure probing using high-throughput sequencing.Nat.Methods7,995–1001(2010).

5.Talkish,J.,May,G.,Lin,Y.,Woolford,J.L.&McManus,C.J.Mod-seq:high-throughput sequencing for chemical probing of RNA structure.RNA20,713–720(2014).

6.Wan,Y.et al.Landscape and variation of RNA secondary structure across the human transcriptome.Nature505,706–709(2014).

7.Bai,Y.,Dai,X.,Harrison,A.,Johnston,C.&Chen,M.Toward a next-generation atlas of RNA secondary structure.Brief.Bioinform.17,63–77(2016).

8.Hofacker,I.L.RNA Secondary Structure Analysis Using the Vienna RNA Package.in Current Protocols in Bioinformatics(eds.Baxevanis,A.D.,Petsko, G.A.,Stein,L.D.&Stormo,G.D.)(John Wiley&Sons,Inc.,2009).

9.Yao,J.,Reinharz,V.,Major,F.&Waldispühl,J.RNA-MoIP:prediction of RNA secondary structure and local 3D motifs from sequence data.Nucleic Acids Res.45,W440–W444(2017).

10.Sabarinathan,R.et al.RNAsnp:Efficient Detection of Local RNA Secondary Structure Changes Induced by SNPs.Hum.Mutat.34,546–556(2013).

11.Lokody,I.RNA:riboSNitch reveal heredity in RNA secondary structure.Nat.Rev.Genet.15,219–219(2014).

12.Luo,Z.,Yang,Q.&Yang,L.RNA Structure Switches RBP Binding.Mol.Cell64,219–220(2016).

13.Taliaferro,J.M.et al.RNA Sequence Context Effects Measured In Vitro Predict In Vivo Protein Binding and Regulation.Mol.Cell64,294–306 (2016).

14.Kutchko,K.M.et al.Multiple conformations are a conserved and regulatory feature of the RB1 5′UTR.RNA21,1274–1285(2015).

15.Martin,J.S.et al.Structural effects of linkage disequilibrium on the transcriptome.RNA18,77–87(2012).

16.Halvorsen,M.,Martin,J.S.,Broadaway,S.&Laederach,A.Disease-Associated Mutations That Alter the RNA Structural Ensemble.PLoS Genet.6,e1001074(2010).

17.Rogler,L.E.et al.Small RNAs derived from lncRNA RNase MRP have gene-silencing activity relevant to human cartilage–hair hypoplasia.Hum.Mol.Genet.23,368–382(2014).

18.Linnstaedt,S.D.et al.A Functional riboSNitch in the 3′ Untranslated Region of FKBP5 Alters MicroRNA-320a Binding Efficiency and Mediates Vulnerability to Chronic Post-Traumatic Pain.J.Neurosci.38,8407–8420(2018).

19.Sabarinathan,R.et al.Transcriptome-Wide Analysis of UTRs in Non-Small Cell Lung Cancer Reveals Cancer-Related Genes with SNV-Induced Changes on RNA Secondary Structure and miRNA Target Sites.PLoS ONE9,e82699(2014).

20.Berger,M.F.et al.Melanoma genome sequencing reveals frequent PREX2 mutations.Nature(2012).doi:10.1038/nature11071

21.Wang,K.et al.Exome sequencing identifies frequent mutation of ARID1A in molecular subtypes of gastric cancer.Nat.Genet.43,1219–1223(2011).

22.Mu,X.J.,Lu,Z.J.,Kong,Y.,Lam,H.Y.K.&Gerstein,M.B.Analysis of genomic variation in non-coding elements using population-scale sequencing data from the 1000 Genomes Project.Nucleic Acids Res.39,7058–7076(2011).

23.Rosenbloom,K.R.et al.The UCSC Genome Browser database:2015 update.Nucleic Acids Res.43,D670–D681(2015).

24.Shihab,H.A.et al.An integrative approach to predicting the functional effects of non-coding and coding sequence variation.Bioinformatics31,1536–1543(2015).

25.Agarwal,V.,Bell,G.W.,Nam,J.-W.&Bartel,D.P.Predicting effective microRNA target sites in mammalian mRNAs.eLife4,e05005(2015).

26.Betel,D.,Koppal,A.,Agius,P.,Sander,C.&Leslie,C.Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites.Genome Biol.11,R90(2010).

27.Yang,Y.-C.T.et al.CLIPdb:a CLIP-seq database for protein-RNA interactions.BMC Genomics16,51(2015).

28.Uren,P.J.et al.Site identification in high-throughput RNA–protein interaction data.Bioinformatics28,3013–3020(2012).

29.Forbes,S.A.et al.COSMIC:exploring the world’s knowledge of somatic mutations in human cancer.Nucleic Acids Res.43,D805–D811(2015).

30.Ning,S.et al.Lnc2Cancer:a manually curated database of experimentally supported lncRNAs associated with various human cancers.Nucleic Acids Res.44,D980–D985(2016).

31.Harrow,J.et al.GENCODE:The reference human genome annotation for The ENCODE Project.Genome Res.22,1760–1774(2012).

32.Yates,B.et al.Genenames.org:the HGNC and VGNC resources in 2017.Nucleic Acids Res.45,D619–D625(2017).

33.Rodriguez,J.M.et al.APPRIS:annotation of principal and alternative splice isoforms.Nucleic Acids Res.41,D110–D117(2013).

34.Pruitt,K.D.et al.The consensus coding sequence(CCDS)project: Identifying a common protein-coding gene set for the human and mouse genomes.Genome Res.19,1316–1323(2009).

35.Corley,M.,Solem,A.,Qu,K.,Chang,H.Y.&Laederach,A.Detecting riboSNitch with RNA folding algorithms:a genome-wide benchmark.Nucleic Acids Res.43,1859–1868(2015).

36.Quinlan,A.R.&Hall,I.M.BEDTools:a flexible suite of utilities for comparing genomic features.Bioinformatics26,841–842(2010).

37.Robin,X.et al.pROC:an open-source package for R and S+to analyze and compare ROC curves.BMC Bioinformatics12,77(2011).

38.Hochberg,Y.&Benjamini,Y.More powerful procedures for multiple significance testing.Stat.Med.9,811–818(1990).

39.Wang,K.et al.Whole-genome sequencing and comprehensive molecular profiling identify new driver mutations in gastric cancer.Nat.Genet.46,573–582(2014).

40.Landrum,M.J.et al.ClinVar:public archive of relationships among sequence variation and human phenotype.Nucleic Acids Res.42,D980–D985(2014).

41.Apweiler,R.et al.UniProt:the Universal Protein knowledgebase.Nucleic Acids Res.32,D115–D119(2004).

42.Stenson,P.D.et al.The Human Gene Mutation Database:building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine.Hum.Genet.133,1–9(2014).

43.Lackey,L.L.,Coria,A.,Tolson,C.,McArthur,E.&Laederach,A.Abstract 505:Somatic and inherited riboSNitch in TPT1 and LCP1 mRNA secondary structures.Cancer Res.77,505–505(2017).

44.Agrawal,N.et al.Exome Sequencing of Head and Neck Squamous Cell Carcinoma Reveals Inactivating Mutations in NOTCH1.Science333,1154–1157 (2011).

45.Hayashi,T.et al.Not all NOTCH Is Created Equal:The Oncogenic Role of NOTCH2 in Bladder Cancer and Its Implications for Targeted Therapy.Clin.Cancer Res.22,2981–2992(2016).

46.Turner-Ivey,B.et al.KAT6A,a Chromatin Modifier from the 8p11-p12 Amplicon is a Candidate Oncogene in Luminal Breast Cancer.Neoplasia16,644–655 (2014).

47.Santos,A.O.,Parrini,M.C.&Camonis,J.RalGPS2 Is Essential for Survival and Cell Cycle Progression of Lung Cancer Cells Independently of Its Established Substrates Ral GTPases.PLOS ONE11,e0154840(2016).

48.Imaoka,S.et al.CYP4B1 Is a Possible Risk Factor for Bladder Cancer in Humans.Biochem.Biophys.Res.Commun.277,776–780(2000).

49.Chintapalli,V.R.et al.Transport proteins NHA1 and NHA2 are essential for survival,but have distinct transport modalities.Proc.Natl.Acad.Sci.112,11720–11725(2015).

50.Schinke,E.N.et al.A novel approach to identify driver genes involved in androgen-independent prostate cancer.Mol.Cancer13,120(2014).

51.Jiang,D.et al.Analysis of p53 transactivation domain mutants reveals Acad11 as a metabolic target important for p53 pro-survival function.Cell Rep.10,1096–1109(2015).

52.P.et al.Restoration of type 1 iodothyronine deiodinase expression in renal cancer cells downregulates oncoproteins and affects key metabolic pathways as well as anti-oxidative system.PLoS ONE12,(2017).

53.Gammons,M.V.et al.Targeting SRPK1 to control VEGF-mediated tumour angiogenesis in metastatic melanoma.Br.J.Cancer111,477–485(2014).

54.Mavrou,A.et al.Serine–arginine protein kinase 1(SRPK1)inhibition as a potential novel targeted therapeutic strategy in prostate cancer.Oncogene34,4311–4319(2015).

55.Tang,R.-X.et al.Identification of a RNA-Seq based prognostic signature with five lncRNAs for lung squamous cell carcinoma.Oncotarget8,50761–50773(2017).

56.Chen,X.et al.LncRNA ZNF503-AS1 promotes RPE differentiation by downregulating ZNF503 expression.Cell Death Dis.8,e3046(2017).

57.Perron,U.,Provero,P.&Molineris,I.In silico prediction of lncRNA function using tissue specific and evolutionary conserved expression.BMC Bioinformatics18,(2017).

58.Chen,W.-H.,Lu,G.,Chen,X.,Zhao,X.-M.&Bork,P.OGEE v2:an update of the online gene essentiality database with special focus on differentially essential genes in human cancer cell lines.Nucleic Acids Res.45,D940–D944 (2017).

59.Wang,H.et al.Comprehensive analysis of aberrantly expressed profiles of lncRNAs and miRNAs with associated ceRNA network in muscle-invasive bladder cancer.Oncotarget7,86174–86185(2016).

60.Lackey,L.,Coria,A.,Woods,C.,McArthur,E.&Laederach,A.Allele-specific SHAPE-MaP assessment of the effects of somatic variation and protein binding on mRNA structure.RNA24,513–528(2018).

61.Ouyang,Z.,Snyder,M.P.&Chang,H.Y.SeqFold:Genome-scale reconstruction of RNA secondary structure integrating high-throughput sequencing data.Genome Res.23,377–387(2013).

62.Woods,C.T.&Laederach,A.Classification of RNA structure change by ‘gazing’at experimental data.Bioinformatics33,1647–1655(2017).

63.Lanzós,A.et al.Discovery of Cancer Driver Long Noncoding RNAs across 1112 Tumour Genomes:New Candidates and Distinguishing Features.Sci.Rep.7,(2017).

64.Mularoni,L.,Sabarinathan,R.,Deu-Pons,J.,Gonzalez-Perez,A.&López-Bigas,N.OncodriveFML:a general framework to identify coding and non-coding regions with cancer driver mutations.Genome Biol.17,(2016).

65.Baytak,E.et al.Whole transcriptome analysis reveals dysregulated oncogenic lncRNAs in natural killer/T-cell lymphoma and establishes MIR155HG as a target of PRDM1.Tumor Biol.39,1010428317701648(2017).

66.Li,Y.et al.LncRNA ontology:inferring lncRNA functions based on chromatin states and expression patterns.Oncotarget6,39793–39805(2015).

Claims

1. A method for predicting elements enriched for or depleted of riboSNitches in a cancer genome, comprising:

using predicted mutations as a prediction group and mutations actually occurring in cancer as an observation group, wherein the same mutation occurring at the same site in different patients is counted separately; calculating the RNA secondary structure of the mutant sequences of the prediction group and of the actual mutant sequences of the observation group;

wherein k is the position of the mutation in the transcript, w is the window size, BPP_ref,i and BPP_alt,i denote the base-pairing probability at the i-th position of the reference sequence and of the mutant sequence, respectively, and i ranges over [k-w, k+w];

ranking all mutations from largest to smallest by MeanDiff value and by EucDiff value separately, wherein the intersection of the mutations in the top 2.5% by MeanDiff value and the top 2.5% by EucDiff value constitutes the riboSNitches of the prediction group and of the observation group;

comparing the number of riboSNitches in the prediction group and in the observation group using a one-sided Fisher's exact test and a hypergeometric test to identify significant riboSNitch enrichment, and correcting the test p-values with the Benjamini-Hochberg (BH) method to obtain false discovery rate (FDR) values; after FDR correction, results with FDR < 0.05 are regarded as elements enriched for or depleted of riboSNitches.

2. The method of claim 1, wherein the mutant sequences of the prediction group are prepared as follows: the number of random mutations for each trinucleotide context of each transcript is obtained from the mutation rates of the intragenic mutation spectrum and the corresponding trinucleotide counts of each transcript; repeated random sampling is performed according to that number of mutations, and the transcript sequence of each gene is randomly mutated according to the neutral mutation rate, thereby obtaining the prediction group; wherein the intragenic mutation spectrum is used to represent the neutral mutation rate in the cancer genome.

3. The method of any one of the preceding claims, wherein w is 2 bp, 5 bp, 10 bp, 15 bp, 20 bp, 25 bp, 50 bp or 200 bp, preferably 200 bp.

4. The method of any one of the preceding claims, wherein the random mutation is performed on non-coding regions, preferably 5′ UTRs, 3′ UTRs and/or lncRNAs.

5. The method of claim 4, wherein the random mutation is performed 1000 times.

6. The method of any one of the preceding claims, wherein the mutant sequence data used come from the ICGC database, the TCGA database, the 1000 Genomes Project database or other somatic mutation data.

7. The method of any one of the preceding claims, further comprising comparing riboSNitches in the cancer genome with other riboSNitches and using a one-sided Fisher's exact test to determine whether riboSNitches are enriched in cancer genome regions, wherein all elements with a P value less than 10^-3 are regarded as elements with cancer-specific riboSNitch enrichment or depletion.

8. The method of any one of the preceding claims, wherein the RNA secondary structure of the sequences is calculated using RNAplfold.

9. The method of any one of the preceding claims, wherein w is 200 bp and the span window for a single base is 150 bp.

10. A device for predicting elements enriched for or depleted of riboSNitches in a cancer genome, comprising:

a processor, comprising: a module for obtaining sequences from a database; a computing module for calculating the MeanDiff and EucDiff values of claim 1; a sorting module for ranking the MeanDiff and EucDiff values; a module for taking the intersection of the two rankings; a module for comparing the number of riboSNitches in the prediction group and in the observation group; a module for performing the tests; a module for identifying elements enriched for or depleted of riboSNitches; and an output module for outputting results; and a memory storing instructions that cause the processor to perform the method of any one of claims 1-9.
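By way of non-limiting illustration, the ranking and intersection steps recited above could be sketched as follows. The sketch assumes that each score set is a dictionary mapping a mutation identifier to its value; the function names, the rounding of the cutoff (ceiling, with at least one mutation kept) and the handling of ties are assumptions made for illustration only.

```python
import math

def top_fraction(scores, fraction=0.025):
    """Return the ids of the highest-scoring mutations.
    Rounding up and keeping at least one id are illustrative assumptions."""
    n_top = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_top])

def call_ribosnitches(mean_diff, euc_diff, fraction=0.025):
    """Mutations ranked in the top fraction by both measures."""
    return top_fraction(mean_diff, fraction) & top_fraction(euc_diff, fraction)
```

The same function would be applied separately to the prediction group and to the observation group, so that the resulting riboSNitch counts can then be compared.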