CN110246544A

CN110246544A - Biomarker selection method and system based on integrative analysis

Info

Publication number: CN110246544A
Application number: CN201910409758.3A
Authority: CN
Inventors: 刘婉婷; 张弓; 何庆瑜
Original assignee: Jinan University
Current assignee: Jinan University; University of Jinan
Priority date: 2019-05-17
Filing date: 2019-05-17
Publication date: 2019-09-17
Anticipated expiration: 2039-05-17
Also published as: CN110246544B

Abstract

The invention discloses a biomarker selection method and system based on integrative analysis. The method comprises the following steps: selecting raw sequencing data; performing mapping analysis on the raw sequencing data with the FANSe algorithm to obtain gene quantification information, and setting the original gene grouping; calculating the importance ranking of genes within each original group with the GWGS algorithm, then integrating the gene importance of every group with the GWRS algorithm to obtain an integrated gene importance list in which genes are sorted from high to low importance; and performing data mining with an SVM-based Wrapper Feature Selection model to distinguish sample types and screen biomarkers from the high-importance genes. According to the characteristics of sequencing data, the invention organically combines multicenter raw sequencing data, resolves systematic differences in platform, sample, and experimental design, and performs deep data mining with a highly robust integrative analysis algorithm to find common, specific, and critical biomacromolecules.

Description

Biomarker selection method and system based on integrative analysis

Technical field

The present invention relates to the technical field of biomarker detection, and in particular to a biomarker selection method and system based on integrative analysis.

Background art

Finding biomacromolecules (including nucleic acids and proteins) that are more common, highly specific, and strongly critical can improve medical treatment outcomes, but existing molecular markers rarely meet the requirements of being common, specific, and critical. Most molecular markers are obtained by analyzing multicenter data, and the conventional way of handling multicenter data is meta-analysis (pooled analysis), which integrates the conclusions of multicenter studies. Because multicenter data often differ in experimental subjects, instruments, and methods, merging the raw data for analysis without distinction is inappropriate. Meta-analysis, in turn, is easily biased by the quality of the original data, the analytical skill of the original researchers, and errors and omissions in the original research tools. As a result, large amounts of valuable data are not fully used.

Summary of the invention

To overcome the shortcomings of the prior art, the present invention provides a biomarker selection method and system based on integrative analysis. It establishes an integrative analysis strategy: a highly robust integration algorithm is built on a high-precision low-level processing algorithm and applied directly to multicenter raw sequencing data, so that massive multicenter sequencing data are fully used to find common, specific, and critical biomacromolecules.

In order to achieve the above object, the invention adopts the following technical scheme:

The present invention provides a biomarker selection method based on integrative analysis, comprising the following steps:

S1: selecting raw sequencing data;

S2: performing mapping analysis on the raw sequencing data with the FANSe algorithm to obtain gene quantification information, and setting the original gene grouping;

S3: calculating the importance ranking of genes within each original group with the GWGS algorithm, then integrating the gene importance of every group with the GWRS algorithm to obtain an integrated gene importance list, in which genes are sorted from high to low importance;

S4: performing data mining with an SVM-based Wrapper Feature Selection model to distinguish sample types and screen biomarkers from the high-importance genes.

As a preferred technical solution, the raw sequencing data in step S1 are sequencing files in fastq format generated by the sequencing machine.
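The fastq input mentioned above can be read with a few lines of code. The following is a minimal sketch, assuming the common four-line-per-record FASTQ layout with no wrapped sequence lines and no blank lines; the function name `read_fastq` is illustrative and is not part of the patent or of FANSe.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file reached
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        # Basic sanity checks on the record structure.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), seq, qual


# Tiny in-memory example with two made-up reads.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
records = list(read_fastq(example))
```

In practice a file opened with `open(path)` would be passed in place of the in-memory `example` stream.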

As a preferred technical solution, the mapping analysis in step S2 comprises the following specific steps:

Each short read is split into multiple non-overlapping seeds of equal length. All seeds are matched against the reference genome, the matched seeds are counted and scored by start site, and the candidate sites are ranked by score. Reference sequences are extracted according to this ranking and the short read is aligned against the extracted reference genome sequences. The highest-ranked position in the alignment is taken as the final position of the short read, from which the gene quantification information is obtained.
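The seed-and-vote idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the FANSe implementation: it uses an exact-match hash index of the reference, votes for candidate start sites, and replaces the fine alignment step with a plain mismatch count. All function names, the seed length `k`, and the `max_mismatches` threshold are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(reference: str, k: int) -> dict:
    """Hash every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def split_seeds(read: str, k: int) -> list:
    """Split a read into non-overlapping seeds of equal length k, with offsets."""
    return [(off, read[off:off + k]) for off in range(0, len(read) - k + 1, k)]


def vote_start_sites(read: str, index: dict, k: int) -> list:
    """Score candidate start sites by how many seeds agree on them."""
    votes = Counter()
    for off, seed in split_seeds(read, k):
        for pos in index.get(seed, ()):
            start = pos - off
            if start >= 0:
                votes[start] += 1
    return votes.most_common()  # highest-scoring sites first


def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches=2):
    """Check candidates in ranked order; return (start, mismatches) or None."""
    best = None
    for start, _ in vote_start_sites(read, index, k):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # candidate runs off the end of the reference
        mm = mismatches(read, window)
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (start, mm)
    return best


reference = "TTGACCATGCGTAAGCTTCAGGATCCTA"
index = build_index(reference, 4)
read = "GCGTACGCTTCA"  # reference[8:20] with one substituted base
hit = map_read(read, reference, index, 4)
```

Because the seeds do not overlap, a single mismatch can spoil at most one seed, so the remaining seeds still vote for the correct start site.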

As a preferred technical solution, in calculating the importance ranking of genes within each original group with the GWGS algorithm in step S3, the GWRS algorithm is first used to evaluate the sequencing data after mapping analysis, assigning different values according to the significance of expression. The specific formula used for the GWRS evaluation is:

where r_ij is the rank value of the i-th gene in the j-th microarray, i ∈ (1, m), j ∈ (1, n), and s_ij is the GWRS value. For a gene whose value in a microarray is NA, s_ij is also set to NA.

As a preferred technical solution, the gene importance of every group is then integrated with the GWRS algorithm in step S3, using the following specific formula:

where ω_j is the weight of the j-th microarray and s_ij is the GWRS value.
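The integration formula itself is not reproduced in this text, so the following is only a sketch of one plausible form: a weighted average of the per-dataset scores s_ij with weights ω_j, in which NA entries (represented here as `None`) are skipped. Both the averaging form and the renormalisation over the available weights are assumptions made for illustration, and the gene names are made up.

```python
def integrate_scores(scores: dict, weights: list) -> list:
    """Combine per-dataset scores into one ranked list.

    scores  maps gene -> [s_i1, ..., s_in], with None standing for NA
    weights holds [w_1, ..., w_n], one weight per dataset
    """
    integrated = {}
    for gene, row in scores.items():
        # Keep only the datasets in which this gene has a score.
        pairs = [(w, s) for w, s in zip(weights, row) if s is not None]
        if not pairs:
            continue  # gene is NA everywhere, so it cannot be ranked
        total_weight = sum(w for w, _ in pairs)
        # Assumed form: weighted average over the available datasets.
        integrated[gene] = sum(w * s for w, s in pairs) / total_weight
    # Sort from high to low importance.
    return sorted(integrated.items(), key=lambda item: item[1], reverse=True)


scores = {
    "geneA": [3.0, 2.0, None],
    "geneB": [1.0, 1.0, 1.0],
    "geneC": [None, None, None],
}
ranking = integrate_scores(scores, weights=[1.0, 2.0, 1.0])
```

Here `geneA` is ranked first with a score of 7/3, `geneB` follows with 1.0, and `geneC` is dropped because it has no usable score.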

As a preferred technical solution, the data mining with the SVM-based Wrapper Feature Selection model in step S4 comprises the following specific steps:

S41: establishing a Wrapper Feature Selection model based on SVM and training the Wrapper Feature Selection model;

S42: inputting the gene set, sorted by importance, into the trained Wrapper Feature Selection model and judging whether the output can separate the sample types. If the preset condition is reached, the corresponding genes are output. If it is not reached, the data mining process is repeated in a loop, adding genes one at a time until the preset condition is reached, and the genes corresponding to the final result are output.
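The loop in S42 can be sketched as a forward wrapper selection. The sketch below assumes that the classifier is hidden behind a scoring callback, `score_subset`, which in practice could return, for example, the cross-validated accuracy of an SVM trained on the selected genes. The callback, the `target` threshold, and the toy accuracy table are hypothetical stand-ins and not values from the patent.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def forward_wrapper_selection(
    ranked_genes: Sequence[str],
    score_subset: Callable[[List[str]], float],
    target: float,
    max_genes: Optional[int] = None,
) -> Tuple[List[str], float]:
    """Add genes in importance order until the score reaches the target."""
    limit = len(ranked_genes) if max_genes is None else min(max_genes, len(ranked_genes))
    selected: List[str] = []
    score = 0.0
    for gene in ranked_genes[:limit]:
        selected.append(gene)           # one more gene than the previous round
        score = score_subset(selected)  # e.g. accuracy of a classifier on these genes
        if score >= target:             # preset condition reached, leave the loop
            break
    return selected, score


# Toy scorer: accuracy depends only on how many genes are used (made-up numbers).
toy_accuracy = {1: 0.70, 2: 0.85, 3: 0.96, 4: 0.97}


def toy_score(genes: List[str]) -> float:
    return toy_accuracy[len(genes)]


chosen, accuracy = forward_wrapper_selection(["g1", "g2", "g3", "g4"], toy_score, target=0.95)
```

Keeping the classifier behind a callback means the same loop works whether the score comes from an SVM or from any other model.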

The present invention also provides a biomarker selection system based on integrative analysis, comprising: a raw sequencing data selection module, a quantitative analysis module, a ranking integration module, and a data mining module;

The raw sequencing data selection module is used to select raw sequencing data, choosing sequencing files in fastq format from the sequencing machine;

The quantitative analysis module performs mapping analysis on the raw sequencing data with the FANSe algorithm to obtain gene quantification information;

The ranking integration module is used to generate the gene importance list: it calculates the importance ranking of genes within each original group with the GWGS algorithm, then integrates the gene importance of every group with the GWRS algorithm to obtain an integrated gene importance list, in which genes are sorted from high to low importance;

The data mining module is used to screen biomarkers: it performs data mining with the SVM-based Wrapper Feature Selection model to distinguish sample types and screen biomarkers from the high-importance genes.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) The present invention establishes an integrative analysis strategy for sequencing data. According to the characteristics of sequencing data, it organically combines multicenter raw sequencing data, resolves systematic differences in platform, sample, and experimental design, and performs deep data mining with a highly robust integrative analysis algorithm to find common, specific, and critical biomacromolecules.

Brief description of the drawings

Fig. 1 is a flow diagram of the biomarker selection method based on integrative analysis of the present embodiment.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawing and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.

Embodiment

The present embodiment provides a biomarker selection method based on integrative analysis. After the raw sequencing data are mapped and quantified with the FANSe series of algorithms, the importance of each gene within a single data set is first calculated with the GWGS algorithm, and the multiple data sets are then integrated with the GWRS algorithm to obtain the importance ranking of genes across all data sets. Genes are put into the screening model one at a time in order of importance, and the biomarkers are finally selected.

The present embodiment introduces the high-precision sequencing analysis algorithm FANSe. The FANSe algorithm performs sequence alignment based on hash seed matching and can align short reads to the reference genome efficiently and with high accuracy. The algorithm is highly accurate and highly error-tolerant, is extremely sensitive to micro-insertions and micro-deletions through the Smith-Waterman algorithm, and its results have been reliably verified by experiment.

The sequencing data in the present embodiment require extensive preprocessing. The mapping step, for example, is computationally very expensive, and the precision of this step directly affects the accuracy of the integrative analysis.

As shown in Fig. 1, the biomarker selection method based on integrative analysis provided in this embodiment comprises the following specific steps:

S1: selecting raw sequencing data;

S2: performing mapping analysis on the raw sequencing data to obtain an accurate quantitative result, obtaining gene quantification information, and setting the original gene grouping:

The raw sequencing data are sequencing files in fastq format generated directly by the sequencing machine. These files must be aligned to the reference sequences of the corresponding species in order to determine which genes are present in the sequenced sample (the qualitative part) and what the expression level of each gene is (the quantitative part). The mapping analysis proceeds as follows: each short read is split into several non-overlapping seeds of equal length; all seeds are matched against the reference genome; the matched seeds are counted and scored by start site, with higher scores ranked first; reference sequences are extracted according to this ranking; and the short read is precisely aligned against the extracted reference genome sequences and scored base by base. The traceback mechanism of the Smith-Waterman algorithm is removed in this step to speed up the computation. The alignment results are then sorted, and the highest-ranked position in the precise alignment is taken as the final position of the short read, which determines the gene and completes the mapping process. Gene expression is then quantified from the number of sequences mapped. The algorithm has been evaluated as highly robust and error-tolerant, so reprocessing data downloaded from different experimental platforms with it removes or reduces the bias introduced by different platforms or experiments;

S3: calculating the importance ranking of genes within each original group with the GWGS algorithm, then integrating the gene importance of every group with the GWRS algorithm to obtain an integrated gene importance list, in which genes are sorted from high to low importance:

First, the GWRS algorithm shown in formula (1) is used to evaluate the single-center sequencing data processed by FANSe, assigning different values according to the significance of expression,

where r_ij is the rank value of the i-th gene in the j-th microarray, i ∈ (1, m), j ∈ (1, n), and s_ij is the GWRS value. For a gene whose value in a microarray is NA, s_ij is also set to NA;

The GWGS algorithm shown in formula (2) is then used to integrate the above GWRS results, producing one set of gene expression data spanning the multicenter data:

where ω_j is the weight of the j-th microarray;

S4: performing data mining with an SVM-based Wrapper Feature Selection model to distinguish sample types and screen biomarkers from the high-importance genes;

In the present embodiment, the model is built on a support vector machine (SVM). The importance-ranked gene set produced by steps S2 and S3 is added to the loop step by step, that is, each round uses one more gene than the previous round. The genes are fed into the previously trained Wrapper Feature Selection model, which judges whether the output reaches the best stable accuracy, that is, whether the sample types can truly be separated. If the best stable accuracy is reached, the loop exits and the genes that produced this result are output. If it is not reached, testing continues and genes are added one at a time until the best result is reached. These steps accurately select, from the gene importance list generated by steps S2 and S3, genes that both rank near the top in importance and accurately distinguish sample types, and these genes serve as markers.

In the present embodiment, the Wrapper Feature Selection model is trained by supervised learning: the sample labels are known, and the model tests whether the input genes can separate samples of different stages. The parameters best suited to the data type are found by randomly sampling 1000 sample data points, that is, under these parameters the relevant genes distinguish the samples with the highest accuracy.

In the present embodiment, to adapt the model to sequencing data, sample-related sequencing data are used for model adjustment and preliminary experiments. The GWRS and GWGS modules and the modules within Wrapper feature selection, such as the SVM, are adjusted according to the characteristics of the data, while computational efficiency, parallelization, and distributed computing are fully taken into account.

In the present embodiment, the model needs to be adjusted appropriately for different clinical samples:

1. For sequencing data, the FANSe series of algorithms must be introduced to guarantee the quality of the sequencing quantification; screening can only proceed on the basis of good quantitative results;

2. GWRS and GWGS also take the characteristics of sequencing data into account. For example, a single quantitative difference measure such as the P value cannot be relied on alone as the parameter, and several parameters may need to be introduced. The present embodiment uses the fold change as the basis and the P value as the weight of the fold change to rank gene importance;

3. The sequencing data are sampled and, according to their characteristics, the screening parameters of the Wrapper Feature Selection model are set so as to guarantee the highest stable accuracy.
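Item 2 above describes ranking genes by fold change with the P value acting as a weight. The text does not state how the weight is applied, so the sketch below shows only one possible reading: the absolute log2 fold change multiplied by -log10 of the P value. The function name, the `p_floor` guard against P values of zero, and the example numbers are all hypothetical.

```python
import math


def rank_by_weighted_fold_change(stats: dict, p_floor: float = 1e-300) -> list:
    """Rank genes by |log2(fold change)| weighted by -log10(P value).

    stats maps gene -> (fold_change, p_value); fold_change must be positive.
    """
    scored = []
    for gene, (fold_change, p_value) in stats.items():
        effect = abs(math.log2(fold_change))           # size of the expression change
        weight = -math.log10(max(p_value, p_floor))    # smaller P value, larger weight
        scored.append((gene, effect * weight))
    # Sort from high to low importance.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up example: geneB has the largest fold change but a weak P value.
stats = {
    "geneA": (4.0, 0.001),
    "geneB": (8.0, 0.5),
    "geneC": (0.25, 0.01),
}
order = [gene for gene, _ in rank_by_weighted_fold_change(stats)]
```

In this example `geneB` drops to the bottom despite its large fold change, which illustrates why a single measure on its own may rank genes poorly.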

In the present embodiment, clinical sequencing data are screened from multiple databases. Following the steps of the technical solution, all data are first mapped and quantified with the FANSe series of algorithms. After the gene quantification information is obtained, the importance ranking of genes within each original group is calculated with the GWGS algorithm, taking the original data grouping as the unit, and the gene importance of every group is then integrated with the GWRS algorithm to obtain one integrated gene importance list. Genes are sorted from high to low importance, and biomacromolecules (that is, biomarkers) are screened from the important genes with the SVM-based Wrapper Feature Selection model. Through calculation and screening on this batch of data, common, specific, and critical biomacromolecules are selected.

The present embodiment also provides a biomarker selection system based on integrative analysis, comprising: a raw sequencing data selection module, a quantitative analysis module, a ranking integration module, and a data mining module;

The ranking integration module is used to obtain the gene importance list: it calculates the importance ranking of genes within each original group with the GWGS algorithm, then integrates the gene importance of every group with the GWRS algorithm to obtain an integrated gene importance list, in which genes are sorted from high to low importance;

The above embodiment is a preferred embodiment of the present invention, but embodiments of the invention are not limited by it. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the invention is an equivalent replacement and falls within the scope of protection of the invention.

Claims

1. A biomarker selection method based on integrative analysis, characterized by comprising the following steps:

S1: selecting raw sequencing data;

S2: performing mapping analysis on the raw sequencing data with the FANSe algorithm to obtain gene quantification information, and setting the original gene grouping;

S3: calculating the importance ranking of genes within each original group with the GWGS algorithm, then integrating the gene importance of every group with the GWRS algorithm to obtain an integrated gene importance list, in which genes are sorted from high to low importance;

S4: performing data mining with an SVM-based Wrapper Feature Selection model to distinguish sample types and screen biomarkers from the high-importance genes.

2. The biomarker selection method based on integrative analysis according to claim 1, characterized in that the raw sequencing data in step S1 are sequencing files in fastq format generated by the sequencing machine.

3. The biomarker selection method based on integrative analysis according to claim 1, characterized in that the mapping analysis in step S2 comprises the following specific steps:

Each short read is split into multiple non-overlapping seeds of equal length; all seeds are matched against the reference genome; the matched seeds are counted and scored by start site and ranked by score; reference sequences are extracted according to this ranking; the short read is aligned against the extracted reference genome sequences; and the highest-ranked position in the alignment is taken as the final position of the short read, from which the gene quantification information is obtained.

4. The biomarker selection method based on integrative analysis according to claim 1, characterized in that, in calculating the importance ranking of genes within each original group with the GWGS algorithm in step S3, the GWRS algorithm is first used to evaluate the sequencing data after mapping analysis, assigning different values according to the significance of expression, and the specific formula used for the GWRS evaluation is:

where r_ij is the rank value of the i-th gene in the j-th microarray, i ∈ (1, m), j ∈ (1, n), and s_ij is the GWRS value; for a gene whose value in a microarray is NA, s_ij is also set to NA.

5. The biomarker selection method based on integrative analysis according to claim 1, characterized in that the gene importance of every group is then integrated with the GWRS algorithm in step S3, using the following specific formula:

where ω_j is the weight of the j-th microarray and s_ij is the GWRS value.

6. The biomarker selection method based on integrative analysis according to claim 1, characterized in that the data mining with the SVM-based Wrapper Feature Selection model in step S4 comprises the following specific steps:

S42: inputting the gene set, sorted by importance, into the trained Wrapper Feature Selection model and judging whether the output can separate the sample types; if the preset condition is reached, outputting the corresponding genes; if the preset condition is not reached, repeating the data mining process in a loop, adding genes one at a time until the preset condition is reached, and outputting the genes corresponding to the final result.

7. A biomarker selection system based on integrative analysis, characterized by comprising: a raw sequencing data selection module, a quantitative analysis module, a ranking integration module, and a data mining module;

The raw sequencing data selection module is used to select raw sequencing data, choosing sequencing files in fastq format from the sequencing machine;

The quantitative analysis module performs mapping analysis on the raw sequencing data with the FANSe algorithm to obtain gene quantification information;

The ranking integration module is used to generate the gene importance list: it calculates the importance ranking of genes within each original group with the GWGS algorithm, then integrates the gene importance of every group with the GWRS algorithm to obtain an integrated gene importance list, in which genes are sorted from high to low importance;