CN106326689A

CN106326689A - Method and device for determining site subject to selection in colony

Info

Publication number: CN106326689A
Application number: CN201510358145.3A
Authority: CN
Inventors: 陈伟芬; 余胜; 王莹; 王崇志; 何伟明
Original assignee: BGI Technology Solutions Co Ltd
Current assignee: BGI Technology Solutions Co Ltd
Priority date: 2015-06-25
Filing date: 2015-06-25
Publication date: 2017-01-11

Abstract

The invention discloses a method and a device for determining sites under selection in a population. The method includes the following steps: acquiring nucleic acid sequencing data of a population sample, wherein the population sample comes from a plurality of individuals of one species and can be divided into 2n first-level subpopulations according to n pairs of preset indicators, n being a natural number; performing detection on the nucleic acid sequencing data to obtain population SNP data, wherein the population SNP data include SNP data for a plurality of first-level subpopulations; and comparing the differences in polymorphism between different first-level subpopulations on the basis of the population SNP data to determine the SNPs under selection, wherein an SNP under selection is a site under selection. The invention further provides a device and a system for determining sites under selection in a population. The method, device and/or system can accurately determine the sites under selection.

Description

Method and device for determining sites under selection in a population

Technical field

The present invention relates to the field of biology, in particular to the field of population genetics, and more particularly to a method for determining sites under selection in a population and a device for determining sites under selection in a population.

Background art

With the maturation of next generation sequencing (NGS) technology and the steady reduction of its cost, research techniques built on it for various purposes keep emerging. RNA-Seq is an NGS-based technique that sequences the transcriptome of a sample. It is mainly used to reveal gene expression patterns in the sample and is widely applied. RNA-Seq sequencing data can also be used to detect polymorphic sites, including SNP sites, across the transcribed regions of the whole genome.

Summary of the invention

According to one aspect of the present invention, the invention provides a method for determining sites under selection in a population, where the selection includes at least one of artificial selection and natural selection. The method comprises the following steps: (1) obtaining nucleic acid sequencing data of a population sample, the population sample coming from multiple individuals of one species; optionally, the population sample comes from the same tissue or the same body site of multiple individuals of one species; the population sample can be divided into 2n first-level subpopulations according to n pairs of preset indicators, n being a natural number; (2) performing detection on the nucleic acid sequencing data from (1) to obtain population SNP data, the population SNP data including SNP data for multiple first-level subpopulations; (3) based on the population SNP data from (2), comparing the differences in polymorphism between different first-level subpopulations to determine the SNPs under selection, an SNP under selection being a site under selection. In one embodiment of the invention, the nucleic acid sequencing data are obtained with RNA-Seq technology and are transcript sequencing data. A preset indicator can be any feature that differs between two individual samples. In one embodiment of the invention, the preset indicators relate to geography and/or biological traits; for example, different geographic origins or the presence of one or more differing traits can serve as indicators for the preliminary division of the population. In one embodiment of the invention, a population structure analysis is carried out before or after step (3) of the method: based on the population SNP data from (2), the population sample is subjected to population structure analysis to obtain a population structure analysis result; optionally, the population structure analysis includes at least one of phylogenetic tree construction, principal component analysis and STRUCTURE analysis. In another embodiment of the invention, the population sample is further re-divided on the basis of the population structure analysis result, that is, the resulting division replaces the original first-level subpopulations, and step (3) is then carried out to determine the sites under selection in the population.

According to another aspect of the present invention, the invention provides a method for analyzing population structure based on population transcript data. The method includes: obtaining nucleic acid sequencing data of a population sample, the population sample coming from multiple individuals of one species; optionally, the population sample comes from the same tissue or the same body site of multiple individuals of one species; the population sample can be divided into 2n first-level subpopulations according to n pairs of preset indicators, n being a natural number; performing detection on the nucleic acid sequencing data to obtain population SNP data, the population SNP data including SNP data for multiple first-level subpopulations; and, based on the population SNP data, comparing the differences in polymorphism between different first-level subpopulations to determine the SNPs under selection, and/or performing population structure analysis on the population.

According to another aspect of the present invention, the invention provides a device for determining sites under selection in a population. The device is used to implement the above method for determining sites under selection in a population and includes: a data input unit for inputting data; a data output unit for outputting data; a processor for executing a machine-executable program, where executing the machine-executable program includes carrying out the method of the above aspect of the invention or of any embodiment; and a storage unit, connected to the data input unit, the data output unit and the processor, for storing data, including the machine-executable program. Those skilled in the art will understand that the machine-executable program can be stored in a storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, and the like.

According to another aspect of the present invention, the invention provides a system for determining sites under selection in a population. The system can be used to implement all or some of the steps of the method of the above aspect of the invention or of any embodiment, and includes: a sequencing data acquisition device for obtaining nucleic acid sequencing data of a population sample, the population sample coming from multiple individuals of one species; optionally, the population sample comes from the same tissue or the same body site of multiple individuals of one species; the population sample can be divided into 2n first-level subpopulations according to n pairs of preset indicators, n being a natural number; an SNP detection device, connected to the sequencing data acquisition device, for performing detection on the nucleic acid sequencing data to obtain population SNP data, the population SNP data including SNP data for multiple first-level subpopulations; and a target site determination device, connected to the SNP detection device, for comparing, based on the population SNP data, the differences in polymorphism between different first-level subpopulations to determine the SNPs under selection, an SNP under selection being a site under selection.

With the above method, device and/or system of the invention, the sites under selection in a population can be determined accurately. The method and/or device of the invention concentrates on the transcribed regions of the genome, which are of more general importance. From the population transcript data obtained, gene expression data can be derived and the gene expression patterns of the samples revealed, which helps to reveal gene expression patterns under differing genetic backgrounds and extends the scope of population research techniques such as RAD and GBS. In addition, population SNP data can be obtained, revealing population structure and the patterns of population genetic evolution. The method, device and/or system of the invention can be used to standardize the analysis workflow for population transcriptome resequencing, reduce analysis risk, and complete the analysis of population projects with high efficiency, high quality and a high standard.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a flow chart of the steps of the method for determining sites under selection in a population in one embodiment of the present invention.

Fig. 2 is a flow chart of the steps of the method for determining sites under selection in a population in one embodiment of the present invention.

Fig. 3 is a flow chart of the steps of the method for determining sites under selection in a population in one embodiment of the present invention.

Fig. 4 is a schematic diagram of the device for determining sites under selection in a population in one embodiment of the present invention.

Fig. 5 is a schematic diagram of the system for determining sites under selection in a population in one embodiment of the present invention.

Fig. 6 is a schematic diagram of the population genetic composition inferred by Frappe from population SNPs in one embodiment of the present invention.

Fig. 7 is a schematic diagram of the phylogenetic tree inferred from population SNPs with the neighbor-joining method in one embodiment of the present invention.

Fig. 8 is a schematic diagram of the PCA result based on population SNPs in one embodiment of the present invention.

Fig. 9 is a schematic diagram of the result of detecting sites under selection from population SNPs with the Arlequin program in one embodiment of the present invention.

Fig. 10 is a schematic diagram of the result of detecting sites under selection from population SNPs with the Global FST test program in one embodiment of the present invention.

Fig. 11 is a schematic diagram of the result of detecting sites under selection from population SNPs with the BayeScan program in one embodiment of the present invention.

Detailed description of the invention

Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote identical or similar elements or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting it. Note that the terms "first-level", "second-level" and the like used herein are for convenience of description only; they are not to be understood as indicating or implying relative importance, nor as implying any order. In the description of the invention, unless otherwise stated, "multiple" means two or more. Unless otherwise expressly specified and limited, terms such as "connected" and "connection" are to be understood broadly: a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal connection between two elements.

According to one embodiment of the present invention, as shown in Fig. 1, the invention provides a method for determining sites under selection in a population, where the selection includes at least one of artificial selection and natural selection. The method includes the following steps: S10, obtaining nucleic acid sequencing data of a population sample, the population sample coming from multiple individuals of one species; optionally, the population sample comes from the same tissue or the same body site of multiple individuals of one species; the population sample can be divided into 2n first-level subpopulations according to n pairs of preset indicators, n being a natural number; S20, performing detection on the nucleic acid sequencing data from S10 to obtain population SNP data, the population SNP data including SNP data for multiple first-level subpopulations; S30, based on the population SNP data from S20, comparing the differences in polymorphism between different first-level subpopulations to determine the SNPs under selection, an SNP under selection being a site under selection.

According to one embodiment of the present invention, the nucleic acid sequencing data are obtained with RNA-Seq technology and are transcript sequencing data. Taking multiple individuals of the same species with different genetic backgrounds as the study objects, transcriptome samples are subjected to high-throughput sequencing, yielding in a single experiment the polymorphism data of the transcribed genomic regions at the population level for that species, including population SNP data and expression information for all genes/transcripts. These data can be used to address biological questions such as the evolutionary relationships and differences in genetic composition between the individuals studied, co-evolving gene clusters, sites subject to artificial/natural selection under a specific selective pressure in subpopulations and individuals, and functional modules and metabolic pathways with significant expression differences between subpopulations. Compared with conventional transcriptome resequencing of a small number of samples, and compared with population research techniques such as RAD and GBS, the region studied by the present invention is concentrated in the transcribed regions of the genome and gene expression can be quantified, which helps to reveal gene expression patterns under differing genetic backgrounds and extends the scope of population research techniques such as RAD and GBS.

A preset indicator can be any feature that differs between two individual samples. According to one embodiment of the present invention, the preset indicators relate to geography and/or biological traits; for example, different geographic origins or the presence of one or more differing traits can serve as indicators for the preliminary division of the population.

According to one embodiment of the present invention, as shown in Fig. 2, the method further includes a population structure analysis S23 before step S30. The population structure analysis S23 includes: based on the population SNP data from S20, performing population structure analysis on the population sample to obtain a population structure analysis result; optionally, the population structure analysis includes at least one of phylogenetic tree construction, principal component analysis (PCA) and Structure analysis.

The phylogenetic tree can be built with the neighbor-joining method, or with the MEGA software (http://www.megasoftware.net). The genotypes of all SNP sites of each sample are concatenated into a sequence, one sequence per individual sample, and the sequences are used as the input file for MEGA. MEGA builds the tree according to the differences between the sequences of the individual samples; the software offers three methods for building the tree (Maximum likelihood, Least Squares and Maximum parsimony).
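The "differences between the sequences of the individual samples" can be illustrated with a simple pairwise distance matrix, which is the usual input for distance-based tree building such as neighbor-joining. The following is a minimal sketch, not the computation performed by MEGA: the function names, the use of the proportion of differing sites (p-distance) and the treatment of `N` as a missing call are all assumptions made for illustration.

```python
from itertools import combinations

MISSING = "N"  # assumed code for a missing genotype call


def p_distance(seq_a, seq_b):
    """Proportion of compared SNP sites at which two samples differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must cover the same SNP sites")
    # Skip sites where either sample has a missing call.
    compared = [(a, b) for a, b in zip(seq_a, seq_b)
                if a != MISSING and b != MISSING]
    if not compared:
        raise ValueError("no sites with calls in both samples")
    return sum(a != b for a, b in compared) / len(compared)


def distance_matrix(sequences):
    """sequences: {sample name: concatenated SNP genotype string}."""
    names = sorted(sequences)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(sequences[a], sequences[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist
```

The resulting symmetric matrix could then be passed to any neighbor-joining implementation.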

In statistics, principal component analysis (Principal Components Analysis, PCA) is a technique for simplifying a data set and is a linear transformation. The transformation maps the data into a new coordinate system such that the largest variance of any projection of the data lies on the first coordinate (called the first principal component), the second largest variance on the second coordinate (the second principal component), and so on. PCA is often used to reduce the dimensionality of a data set while retaining the features that contribute most to its variance. This is done by keeping the low-order principal components and ignoring the high-order ones, since the low-order components tend to retain the most important aspects of the data. Following the reference "A tutorial on Principal Components Analysis", Lindsay I Smith, 2002-02, in this embodiment the SNP data are first converted into a numeric matrix according to the characteristics of the actual SNP data: for example, a genotype identical to the reference sequence is set to 0, the opposite genotype to 2, and a degenerate base to 1, and the values are then normalized. A linear vector equation is then built by the method introduced above, in which i denotes the i-th sample, from 1 to k. Using the equation-solving capabilities of R packages, the matrix a is solved, the first four principal component vectors are extracted according to the data characteristics of each sample, and the clustering of the individuals is displayed with these vectors as coordinate axes.
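The 0/1/2 encoding and the normalization step can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's actual R code: genotypes are assumed to be given as two-character strings such as `"AG"`, a heterozygous call stands in for the "degenerate base" case, and the function names are hypothetical. The standardized matrix would then be handed to any PCA routine to extract the leading components.

```python
from statistics import mean, pstdev


def encode_genotype(genotype, ref_base):
    """Map one diploid genotype (two-character string) to 0, 1 or 2."""
    first, second = genotype[0], genotype[1]
    if first != second:
        return 1  # heterozygous call, i.e. the "degenerate base" case
    return 0 if first == ref_base else 2  # same as reference vs. opposite


def encode_matrix(genotypes, ref_bases):
    """genotypes: one list per sample, one genotype string per SNP site."""
    return [[encode_genotype(g, r) for g, r in zip(sample, ref_bases)]
            for sample in genotypes]


def standardize_columns(matrix):
    """Center each SNP column and scale it by its standard deviation.

    Columns with zero variance (monomorphic sites) become all zeros.
    """
    scaled_columns = []
    for column in zip(*matrix):
        mu = mean(column)
        sd = pstdev(column)
        scaled_columns.append(
            [(value - mu) / sd if sd > 0 else 0.0 for value in column])
    # Transpose back so that rows are samples again.
    return [list(row) for row in zip(*scaled_columns)]
```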

Structure analysis can be carried out with the Structure software (http://pritch.bsd.uchicago.edu/software/structure2_1.html). Based on SNP genotyping data, this software infers whether distinct populations exist and determines the population to which each individual belongs. Following the software manual, the genotype file of the population SNPs is converted into the Structure input format. Up to 50,000 simulations are run under the admixture model and, assuming the existence of multiple populations, the probability that each individual belongs to each (sub)population is calculated. In this way the individuals can be classified. In one embodiment of the invention, individuals can be screened further on the basis of this classification: for example, according to the above population structure analysis result, the individuals are classified, the information of each individual sample is extracted, and doubtful individuals, such as those with an unclear classification or obvious outliers, are rejected.
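The classification and screening step can be illustrated with a short sketch that takes per-individual membership probabilities, assigns each individual to its most probable cluster, and sets aside individuals whose classification is unclear. The input layout, the function name and the 0.8 cutoff are assumptions chosen for illustration; the source does not state a cutoff, and this is not the Structure program's own output parser.

```python
def classify_individuals(membership, min_prob=0.8):
    """Assign individuals to clusters from membership probabilities.

    membership: {individual: [P(cluster 0), P(cluster 1), ...]}
    min_prob:   hypothetical cutoff below which a call counts as unclear
    Returns (assigned, rejected), where assigned maps each individual to
    a cluster index and rejected lists the unclear individuals.
    """
    assigned, rejected = {}, []
    for individual, probs in membership.items():
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= min_prob:
            assigned[individual] = best
        else:
            rejected.append(individual)
    return assigned, rejected
```

The `assigned` mapping is the kind of result that could replace the original first-level subpopulations before the selection test is run.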

According to one embodiment of the present invention, the population sample is further re-divided on the basis of the population structure analysis result: the new subpopulations obtained from the division replace the original first-level subpopulations, and step S30 is then carried out on the new subpopulations and their SNP data to determine the sites under selection in the population. Classifying or reclassifying the population/subpopulations according to the population structure analysis result in this way helps to identify the sites under selection accurately.

According to one embodiment of the present invention, as shown in Fig. 3, the method further includes a population structure analysis S23 after step S30. The population structure analysis S23 includes: based on the population SNP data from S20, performing population structure analysis on the population sample to obtain a population structure analysis result; optionally, the population structure analysis includes at least one of phylogenetic tree construction, principal component analysis (PCA), Structure analysis and detection of the population genetic composition with Frappe.

According to one embodiment of the present invention, the nucleic acid sequencing data of the population sample consist of the nucleic acid sequencing data of each individual sample making up the population sample. The nucleic acid sequencing data of each individual sample should be no less than 4G, which favors accurate SNP detection and, in turn, the accurate determination of sites under selection from accurate population SNP data.

According to one embodiment of the present invention, the population sample comes from individuals of the same species with different genetic backgrounds. For population analysis, it is recommended that the population sample contain no fewer than 30 individual samples, and that all individuals involved can be divided according to some indicator into at least two subpopulations, namely the first-level subpopulations, to allow subsequent difference analysis. According to one embodiment of the present invention, each first-level subpopulation preferably includes at least 10 individual samples, which favors difference analysis. According to one embodiment of the present invention, all individual samples are grown under identical conditions and then sampled from the same tissue or body site to obtain the population sample, so that population analysis based on these data, including differential gene expression analysis, is meaningful. The reason is that the genetic difference between individual samples, which is a variable, already exists. Only when sampling is done under identical conditions can the differentially expressed genes obtained be explained from the angle of genetic difference; otherwise, the presence of several variables makes the cause of differential expression ambiguous. For example, if the study population is divided into a saline-alkali-resistant class and a non-resistant class, all individuals grown under the same environment can be treated with the same dose of saline solution, and the root tips can then be sampled at a specific time after treatment (for example, 1 hour). The differentially expressed genes identified by the subsequent population analysis may then reveal the mechanism by which the species resists saline-alkali stress, and the differential expression can be attributed to the difference in genetic background.
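The recommended sample sizes (no fewer than 30 individuals in total, at least two subpopulations, and at least 10 individuals per subpopulation) lend themselves to a simple pre-analysis check. The sketch below assumes a hypothetical mapping from individual identifiers to subpopulation labels; the function name and the message wording are illustrative only.

```python
from collections import Counter


def check_design(assignments, min_total=30, min_groups=2, min_per_group=10):
    """Return a list of problems with a sampling design (empty if none).

    assignments: {individual id: subpopulation label}
    """
    problems = []
    counts = Counter(assignments.values())
    if len(assignments) < min_total:
        problems.append(
            f"only {len(assignments)} individuals, need >= {min_total}")
    if len(counts) < min_groups:
        problems.append(
            f"only {len(counts)} subpopulation(s), need >= {min_groups}")
    for label, size in sorted(counts.items()):
        if size < min_per_group:
            problems.append(
                f"subpopulation {label!r} has {size} individuals, "
                f"need >= {min_per_group}")
    return problems
```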

According to one embodiment of the present invention, a first-level subpopulation includes at least one second-level subpopulation; optionally, a second-level subpopulation includes at least 10 individuals. Second-level subpopulations can be obtained by dividing a first-level subpopulation with one or more further indicators that differ from those used to divide the population. With the method of any embodiment of the invention, the sites under selection in the multi-level subpopulations obtained after repeated division can be identified accurately.

According to one embodiment of the present invention, comparing the differences in polymorphism between different first-level subpopulations based on the population SNP data to determine the SNPs under selection includes: based on the population SNP data, using at least two test methods to compare the difference in heterozygosity of the same SNP site between different first-level subpopulations, and determining the SNP sites supported by at least two test methods to be the SNPs under selection; optionally, the test methods include F-statistics, analysis of molecular variance and hierarchical Bayesian methods. In some embodiments of the invention, two or all three of the Arlequin program, the Global FST test program and the BayeScan program are used, or at least two or all three of the methods Arlequin, BayeScan and Datacal are used, to assess the degree of heterozygosity difference at a site. When an SNP site is supported by at least two or all three of these test methods, that is, when the results of at least two of them find the difference in heterozygosity of this SNP between subpopulations to be significant, the SNP is judged to be a site under selection. This favors accurate identification.
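The "supported by at least two test methods" rule amounts to a vote across per-method results. The sketch below assumes each method's output has already been reduced to one significance value per SNP, and that a site is supported by a method when its value does not exceed that method's threshold; this input layout, the comparison direction and the function name are assumptions. The example thresholds of 0.05, 0.1 and 0.01 are the ones the source gives for Arlequin, BayeScan and Datacal.

```python
def consensus_sites(results, thresholds, min_support=2):
    """Return SNP sites called significant by at least min_support methods.

    results:    {method name: {site id: significance value}}
    thresholds: {method name: cutoff}; a site is supported by a method
                when its value is <= the cutoff (assumed direction)
    """
    support = {}
    for method, values in results.items():
        cutoff = thresholds[method]
        for site, value in values.items():
            if value <= cutoff:
                support[site] = support.get(site, 0) + 1
    return sorted(site for site, votes in support.items()
                  if votes >= min_support)


# Thresholds as stated in the source for the three methods.
THRESHOLDS = {"Arlequin": 0.05, "BayeScan": 0.1, "Datacal": 0.01}
```

Requiring agreement between methods trades some sensitivity for fewer false positives, which matches the stated aim of accurate identification.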

According to one embodiment of the present invention, using at least two test methods to compare the difference in heterozygosity of the same SNP site between different first-level subpopulations, and determining the SNP sites supported by at least two test methods to be the SNPs under selection, includes: calculating the heterozygosity difference value of the SNP site between different first-level subpopulations, and determining the SNP sites whose heterozygosity difference value is not less than a threshold to be sites under selection. In one embodiment of the invention, the heterozygosity difference value is expressed as F_{ST} (fixation index). F_{ST} can be used to evaluate the genomic distance and the difference between populations; it is a measure of the degree of differentiation between populations and was developed by Sewall Wright in 1922 as a special case of F-statistics. The null hypothesis of F_{ST} is that, when the populations are not differentiated, the minor allele frequency at a polymorphic site does not differ significantly within and between (sub)populations. There are many ways to calculate F_{ST}; although the specific calculations differ, the underlying theory is the same, namely the definition given by Hudson (1992):

F_{ST} = 1 - \frac{\Pi_{Within}}{\Pi_{Between}}

Here, \Pi_{Between} is obtained by drawing one sample from each of two subpopulations (Between) to form a pair, calculating the difference in SNP genotypes for this pair, doing the same for all such pairs, and finally taking the average. \Pi_{Within} is obtained by drawing two samples from one subpopulation (Within) to form a pair, calculating the difference in SNP genotypes for this pair, doing the same for all such pairs, and finally taking the average. If there are two subpopulations, \Pi_{Within} can first be calculated for each of the two subpopulations and then summed. In this embodiment, taking into account the structure of the existing subpopulation SNP data and based on the above principle, the formula is derived as follows:

F_{ST} = \frac{\Pi_{Between} - \Pi_{Within}}{\Pi_{Between}} = 1 - \frac{\Pi_{Within}}{\Pi_{Between}} = 1 - \frac{\left[ \sum_{j} \binom{n_{j}}{2} \sum_{i} 2 \frac{n_{ij}}{n_{ij} - 1} x_{ij} (1 - x_{ij}) \right] / \sum_{j} \binom{n_{j}}{2}}{\sum_{i} 2 \frac{n_{i}}{n_{i} - 1} x_{i} (1 - x_{i})},

where x_{ij} is the frequency of the minor allele (the second base) of SNP site i in subpopulation j, n_{ij} is the physical position of SNP site i on the chromosome in subpopulation j, and n_{j} is the total number of SNP sites used for the comparative analysis in subpopulation j. In one embodiment of the invention, the three methods Arlequin, BayeScan and Datacal are used to test the difference in minor allele frequency of SNP sites between subpopulations, with the significance thresholds for the difference set to 0.05, 0.1 and 0.01 respectively.
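A small numerical sketch may help in reading the derived formula. It rests on several assumptions that go beyond the text: n_{ij} is interpreted as the number of allele copies sampled at site i in subpopulation j (so that n/(n-1) acts as the usual small-sample correction for heterozygosity), every site is fully genotyped so that n_{ij} = n_{j}, and the pooled quantities x_{i} and n_{i} are computed over all subpopulations combined. The function names are hypothetical, and the code is not the calculation performed by Arlequin, BayeScan or Datacal.

```python
from math import comb


def corrected_heterozygosity(allele_count, n):
    """2 * n/(n-1) * x * (1-x), with x = allele_count / n."""
    x = allele_count / n
    return 2.0 * n / (n - 1) * x * (1.0 - x)


def fst(allele_counts, n_copies):
    """F_ST = 1 - Pi_within / Pi_between over a set of SNP sites.

    allele_counts[j][i]: copies of the tracked allele at SNP i in
                         subpopulation j
    n_copies[j]:         allele copies sampled in subpopulation j
                         (assumed identical for every site)
    """
    n_sites = len(allele_counts[0])
    weights = [comb(n, 2) for n in n_copies]  # C(n_j, 2)
    within = sum(
        w * sum(corrected_heterozygosity(c, n) for c in counts)
        for w, counts, n in zip(weights, allele_counts, n_copies)
    ) / sum(weights)
    pooled_n = sum(n_copies)
    between = sum(
        corrected_heterozygosity(
            sum(counts[i] for counts in allele_counts), pooled_n)
        for i in range(n_sites)
    )
    if between == 0:
        raise ValueError("no polymorphism in the pooled sample")
    return 1.0 - within / between
```

With two subpopulations of 20 allele copies each, a site fixed for different alleles in the two groups gives a value of 1, while equal frequencies give a value close to 0 (slightly negative because of the small-sample correction).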

According to one embodiment of the present invention, the invention provides a method for analyzing population structure based on population transcript data. The method includes: obtaining nucleic acid sequencing data of a population sample, the population sample coming from multiple individuals of one species; optionally, the population sample comes from the same tissue or the same body site of multiple individuals of one species; the population sample can be divided into 2n first-level subpopulations according to n pairs of preset indicators, n being a natural number; performing detection on the nucleic acid sequencing data to obtain population SNP data, the population SNP data including SNP data for multiple first-level subpopulations; and, based on the population SNP data, comparing the differences in polymorphism between different first-level subpopulations to determine the SNPs under selection, and/or performing population structure analysis on the population.

According to one embodiment of the present invention, as shown in Fig. 4, the invention provides a device 100 for determining sites under selection in a population. The device 100 is used to implement the above method for determining sites under selection in a population and includes: a data input unit 110 for inputting data; a data output unit 120 for outputting data; a processor 130 for executing a machine-executable program, where executing the machine-executable program includes carrying out the method of the above aspect of the invention or of any embodiment; and a storage unit 140, connected to the data input unit 110, the data output unit 120 and the processor 130, for storing data, including the machine-executable program. Those skilled in the art will understand that the machine-executable program can be stored in a storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, and the like.

According to one embodiment of the present invention, as shown in Fig. 5, the invention provides a system 1000 for determining sites under selection in a population. The system can be used to implement all or some of the steps of the method of the above aspect of the invention or of any embodiment. The system 1000 includes: a sequencing data acquisition device 1100 for obtaining nucleic acid sequencing data of a population sample, the population sample coming from multiple individuals of one species; optionally, the population sample comes from the same tissue or the same body site of multiple individuals of one species; the population sample can be divided into 2n first-level subpopulations according to n pairs of preset indicators, n being a natural number; an SNP detection device 1200, connected to the sequencing data acquisition device 1100, for performing detection on the nucleic acid sequencing data to obtain population SNP data, the population SNP data including SNP data for multiple first-level subpopulations; and a target site determination device 1300, connected to the SNP detection device 1200, for comparing, based on the population SNP data, the differences in polymorphism between different first-level subpopulations to determine the SNPs under selection, an SNP under selection being a site under selection.

With the method, device and/or system of any of the above embodiments of the invention, the sites under selection in a population can be determined accurately. The method and/or device of the invention focuses mainly on the transcribed regions of the genome, which are of more general importance. From the population transcript data obtained, gene expression data can be derived and the gene expression patterns of the samples revealed, which helps to reveal gene expression patterns under differing genetic backgrounds and extends the scope of population research techniques such as RAD and GBS. In addition, population SNP data can be obtained, revealing population structure and the patterns of population genetic evolution. The method, device and/or system of the invention can be used to standardize the analysis workflow for population transcriptome resequencing, reduce analysis risk, and complete the analysis of population projects with high efficiency, high quality and a high standard.

The method for determining a site subject to selection, the population project analysis device and/or the system of the present invention are described in detail below with reference to the accompanying drawings and specific sample data embodiments. The embodiments described with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it. Unless otherwise specified, the reagents, sequences (adapters, tags and primers), software and instruments not specifically described in the following embodiments are all conventional commercial products or open source, for example the transcriptome library construction kit purchased from Illumina.

Embodiment one

Reference sequences, sequencing strategy, sample requirement and other points for attention:

i) Reference sequence: a genome reference sequence of reasonably good quality must be available.

ii) Sequencing strategy: a PE91 sequencing strategy is used (paired-end sequencing, yielding multiple pairs of paired-end reads, each read being 91 bp long), and each sample should reach 4 G of filtered data.

iii) Samples should come from individuals of the same species with different genetic backgrounds.

iv) For the total study population, 30 or more individuals are recommended. In addition, all individuals involved should be divisible by some indicator into two or more subpopulations (to facilitate difference analysis), and each subpopulation should preferably contain more than 10 individuals.

v) All samples are cultivated under the same conditions and then sampled from the same tissue or part. The reason is that the genetic difference between samples (a variable) already exists, so only when sampling is done under the same conditions can the differentially expressed genes obtained be explained in terms of genetic difference. Otherwise, with several variables present, the cause of differential expression becomes ambiguous. For example, suppose the study population is divided into two classes, saline-alkali tolerant and saline-alkali intolerant. All individuals grown in the same environment can be treated with the same dose of saline, and the root tips are then sampled at a specific time after treatment (for example 1 hour). The differentially expressed genes subsequently identified may then reveal the mechanism by which the species tolerates saline-alkali stress, because the differential expression is caused by the difference in genetic background.

To standardize the analysis workflow of population transcriptome resequencing projects and reduce analysis risk, so that projects are completed efficiently, with high quality and to a high standard, a population transcriptome resequencing analysis method is proposed here, which mainly includes:

I. Experimental procedure

After total RNA is extracted from the sample and DNA is digested with DNase I, eukaryotic mRNA is enriched using magnetic beads with Oligo(dT) (for prokaryotes, rRNA is removed with a kit before proceeding to the next step). Fragmentation reagent is added and the mRNA is broken into short fragments at a suitable temperature in a Thermomixer. First-strand cDNA is synthesized using the fragmented mRNA as template, and a second-strand synthesis reaction system is then prepared to synthesize second-strand cDNA. The product is purified and recovered with a kit, the sticky ends are repaired, a base "A" is added to the 3' end of the cDNA and adapters are ligated, followed by fragment size selection and finally PCR amplification. After the constructed library passes quality inspection with an Agilent 2100 Bioanalyzer and an ABI StepOnePlus Real-Time PCR System, it is sequenced on an Illumina HiSeq™ 2000 or another sequencer.

II. Information analysis content

1) Standard RNA-Seq analysis

This includes data filtering, gene expression quantification, identification of differentially expressed genes between groups and their GO and KEGG pathway enrichment analysis, SNP calling and annotation, and so on.

2) Analyses based on population SNP data

Starting from the prediction of the consensus sequence of each single sample in the standard RNA-Seq analysis, which is an intermediate step of SNP calling, the SNP data are compiled at the population level and used for analyses in the following aspects:

a. Population structure analysis: this includes phylogenetic tree construction, principal component analysis (PCA) and STRUCTURE analysis. All three reflect the structure of the population, but each has a different emphasis. Phylogenetic tree construction focuses on revealing the evolutionary relationships between individuals in the population; PCA focuses on revealing the principal factors behind the differences in genetic background between individuals; STRUCTURE analysis focuses on comparing and quantifying the genetic composition of each individual and showing graphically the similarities and differences in genetic composition between individuals.

b. Detection of sites subject to selection: selection (artificial or natural) usually plays a very important role in the differentiation of populations (the formation of subgroups). From the SNP data of the subgroups, the difference in polymorphism (Fst) between subgroups can be computed for every site, and the sites with significantly different Fst can be tested. These sites, as potential sites subject to selection, help researchers further understand the selection process acting on particular subgroups.

Fst (fixation index) is mainly used to evaluate the genetic distance and differentiation between populations. It is a measure of the degree of differentiation between populations and was developed by Sewall Wright in 1922 as a special case of the F-statistics.

The null hypothesis of F_ST is that, when the populations are not differentiated, the difference in minor allele frequency at polymorphic sites within and between groups is not significant. There are many methods of calculating F_ST; although the specific calculations differ, the underlying theory is the same, namely the definition given by Hudson (1992):

F_{ST} = \frac{\Pi_{\text{Between}} - \Pi_{\text{Within}}}{\Pi_{\text{Between}}},

where Π_Between is obtained by drawing one sample from each of the two populations (Between) to form a pair and calculating the difference in SNP genotype for that pair; the differences are calculated in this way for all such pairs of samples, and their average is Π_Between.

Π_Within is obtained by drawing 2 samples from one population (Within) to form a pair and calculating the difference in SNP genotype for that pair; the differences are calculated in this way for all such pairs of samples, and their average is Π_Within.

If there are two populations, Π_Within can first be calculated within each of the two populations separately and the results then added together.
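As an illustration of the pairwise definition, the following minimal Python sketch computes F_ST for two populations. It rests on assumptions that are not stated in the text: each sample is a list of genotype strings in the same site order, the "difference" of a pair is taken as the number of sites with unequal genotypes, and the within-population pairs of both populations are added together before averaging. The function names are hypothetical.

```python
from itertools import combinations, product


def genotype_difference(g1, g2):
    """Number of sites at which two samples carry different genotypes."""
    return sum(a != b for a, b in zip(g1, g2))


def hudson_fst(pop_a, pop_b):
    """F_ST = (Pi_between - Pi_within) / Pi_between for two populations."""
    # One sample from each population per pair.
    between = [genotype_difference(x, y) for x, y in product(pop_a, pop_b)]
    # Two samples from the same population per pair, both populations combined.
    within = [genotype_difference(x, y)
              for pop in (pop_a, pop_b)
              for x, y in combinations(pop, 2)]
    if not within:
        raise ValueError("each population needs at least two samples overall")
    pi_between = sum(between) / len(between)
    pi_within = sum(within) / len(within)
    if pi_between == 0:
        raise ValueError("no between-population difference; F_ST is undefined")
    return (pi_between - pi_within) / pi_between


# Toy data: three sites, two samples per population.
pop_a = [["AA", "CC", "GG"], ["AA", "CC", "GG"]]
pop_b = [["TT", "CC", "GG"], ["TT", "CC", "GA"]]
fst_value = hudson_fst(pop_a, pop_b)
```

With the toy data the mean between-population difference is 1.5 and the mean within-population difference is 0.5, so the value is 2/3.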

3) Additional analyses based on gene expression data

a. Cluster analysis and PCA: based on the gene expression data, the individuals in the population can be subjected to cluster analysis and PCA, showing the differences in gene expression level between individuals. The results can be compared and cross-validated with the phylogenetic tree and PCA results built from the SNP data.

b. Co-expression gene network construction and comparison between groups: in many biological activities, multiple genes (co-expressed genes) are usually expressed in a coordinated manner under many conditions to carry out specific functions. Starting from the gene expression data of multiple different individuals, many modules of co-expressed genes can be constructed. On this basis, researchers can analyze: i) which co-expression modules are active (expressed at higher levels) under given conditions, which helps in understanding the gene expression patterns behind those conditions; ii) which co-expression modules are active in which specific individual or individuals, which helps in resolving the biological functions of some co-expression modules; iii) the co-expression modules constructed above can also be compared between subpopulations. Differences at this higher level of co-expression modules between individuals can reveal new content that cannot be seen in conventional differential gene expression data (which assume that genes are independent of each other and do not consider the interactions between them).

As described above, multiple individuals of the same species with different genetic backgrounds are taken as the study objects, and high-throughput sequencing is performed on transcriptome samples to obtain, in a single pass, population-level polymorphism data for the transcribed regions of the genome of the species (population SNPs) together with expression information for all genes/transcripts. This can in turn reveal biological questions such as (i) the evolutionary relationships and differences in genetic composition between the individuals studied, (ii) gene clusters that co-evolved under specific selection, (iii) sites subject to artificial/natural selection in the subpopulations, and (iv) functional modules and metabolic pathways with significant differences in expression between individuals or subpopulations. Compared with conventional transcriptome resequencing of a small number of samples, the method additionally provides population SNP data, which can be used to reveal biological questions such as population structure, the evolutionary history of the population, the evolutionary relationships of the individuals in the population, and potential sites subject to selection. Compared with population research technologies such as RAD and GBS, the method concentrates on the transcribed regions of the genome, which are of more general importance. At the same time, the present invention can quantify gene expression, which helps reveal gene expression patterns under different genetic backgrounds and extends the scope of population research technologies such as RAD and GBS.

Embodiment two

The operating procedure is described in detail step by step in the following example:

I. Conventional transcriptome resequencing workflow

The samples are giant pandas from different regions, including Qinling, Minshan, Liangshan, Qionglai and Xiangling. A total of 34 giant panda blood or tissue samples were obtained: 2 from Liangshan, numbered GP37 and GP52 (both blood samples); 7 from Minshan, numbered GP14-19 and GP51 (all blood samples); 8 from Qinling, numbered GP3-8 (blood samples), GP10 (tissue sample) and GP12 (blood sample); 15 from Qionglai, numbered GP2, GP13, GP22-31, GP33 and GP35-36 (all blood samples); and 2 from Xiangling, numbered GP38-39 (both blood samples). Transcriptome nucleic acid extraction, library construction and sequencing of the samples were carried out as in the preceding embodiment to obtain the sequencing data of each sample. According to region, the 34 samples were divided into 5 first-level subpopulations.
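The division by region can be written down as a small lookup table. The Python sketch below expands the sample number ranges given above and checks the sizes of the 5 first-level subpopulations. The region keys are romanized names chosen here for illustration, and the helper `expand_ids` is hypothetical.

```python
import re


def expand_ids(spec):
    """Expand a spec such as 'GP14-19' into ['GP14', ..., 'GP19']."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)(?:-(\d+))?", spec)
    if match is None:
        raise ValueError(f"unrecognized sample spec: {spec!r}")
    prefix, start, end = match.group(1), int(match.group(2)), match.group(3)
    stop = int(end) if end is not None else start
    return [f"{prefix}{k}" for k in range(start, stop + 1)]


# Sample numbers per sampling region, as listed in the text.
REGIONS = {
    "Liangshan": ["GP37", "GP52"],
    "Minshan": ["GP14-19", "GP51"],
    "Qinling": ["GP3-8", "GP10", "GP12"],
    "Qionglai": ["GP2", "GP13", "GP22-31", "GP33", "GP35-36"],
    "Xiangling": ["GP38-39"],
}

# First-level subpopulation -> list of individual sample IDs.
subpopulations = {
    region: [sid for spec in specs for sid in expand_ids(spec)]
    for region, specs in REGIONS.items()
}
sizes = {region: len(ids) for region, ids in subpopulations.items()}
total = sum(sizes.values())
```

The per-region counts come out as 2, 7, 8, 15 and 2, which sum to the 34 samples stated in the text.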

Data filtering and quality control are completed. The clean sequencing data (clean data) are aligned to the genome reference sequence, for example using SOAP or BWA with default settings, and SNP calling is performed for each sample. The clean data are aligned to the gene set reference sequence, the expression of each gene is calculated, and identification of differentially expressed genes between groups and GO and KEGG pathway enrichment analysis are carried out. The clean data are also aligned to the genome reference sequence, for example using TopHat or STAR, to predict alternative splicing and novel transcripts. Various statistics are also completed, including the amounts of raw and filtered data, read mapping information, genome coverage, generation of library randomness assessment plots, and so on.

II. Calling population SNPs and population evolution analysis based on population SNPs

Starting from the consensus information of each individual relative to the genome reference sequence obtained in the previous step (that is, the cns files output by SOAPsnp), the data are integrated into population SNP data at the level of all individuals; that is, the union of the SNPs of all individual samples is taken as the population SNP data. Population evolution analysis is carried out based on these population SNPs, including construction of a phylogenetic tree, principal component analysis, analysis of individual genetic composition, and so on. This workflow requires a few simple configuration files to be prepared, as described below:

Individual.txt: sample (individual sample) information file. Each line holds the information of one sample and has 6 columns, as shown in Table 1.

Table 1

Snp.lst: list file of population SNP (genotype) files. The format of a population SNP file is shown in Table 2.

Table 2

First column	Chromosome ID
Second column	Allele position
Third column	Nucleotide at the corresponding reference sequence site
Fourth column	Genotypes of the sequenced samples, separated by spaces; the order must correspond to the individual file

Population.txt: information on the two subgroups used for the site selection analysis. The first column is the subgroup name, which may differ from that in the individual file; the second column is the abbreviated sample ID, which must be present in the fourth column of the individual file.

*.gff: genome gff file, used during the site selection analysis to determine the genes in which the selected sites are located; it is optional.
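The file descriptions above can be made concrete with a small parsing sketch in Python. It assumes, since the text does not specify delimiters, that all fields are separated by whitespace: a population SNP record has chromosome, position and reference nucleotide followed by one genotype per sample, and a Population.txt row has a subgroup name followed by a sample ID. The function names and the toy subgroup labels are hypothetical.

```python
def parse_snp_line(line):
    """Split one population SNP record into (chrom, pos, ref, genotypes)."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected chromosome, position, reference base and genotypes")
    chrom, pos, ref = fields[0], int(fields[1]), fields[2]
    return chrom, pos, ref, fields[3:]


def parse_population(lines):
    """Map subgroup name -> list of sample IDs from 'subgroup sample_id' rows."""
    groups = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, sample_id = line.split()[:2]
        groups.setdefault(name, []).append(sample_id)
    return groups


def genotypes_by_group(record, sample_order, groups):
    """Pick each subgroup's genotypes, using the sample order of the individual file."""
    index = {sid: k for k, sid in enumerate(sample_order)}
    genotypes = record[3]
    return {name: [genotypes[index[sid]] for sid in ids]
            for name, ids in groups.items()}


# Toy example with four samples in two subgroups.
record = parse_snp_line("chr1\t1024\tA\tAA AG GG AA")
groups = parse_population(["QIN GP3", "QIN GP4", "MIN GP14", "MIN GP15"])
split = genotypes_by_group(record, ["GP3", "GP4", "GP14", "GP15"], groups)
```

Keeping the sample order in one place (the individual file) is what lets the genotype column be indexed by position, which is why the text insists that the two files correspond.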

1) Calling population SNPs

SOAPsnp is used to detect the SNPs of each sample, and the SNP data of all single samples are integrated to obtain the population SNP data. Specifically:

First, full use is made of the published giant panda genome information (Zhao S, et al. Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation. Nat Genet. 45(1): 67-71 (2013)). The dbSNP data corresponding to the panda genome are downloaded from the NCBI website and used as prior probabilities for SOAPsnp; in line with current research results, the prior probability of a heterozygous SNP site is set to 0.0010 and that of a homozygous SNP site to 0.0005. With these parameters set, SOAPsnp is used to compare the filtered data with the panda reference genome, and the result is a CNS file. Because the genome of each sample contains some regions of low sequencing depth, the files of genotype probabilities of all samples are combined at this point: the data of all samples are integrated by the maximum likelihood method to generate a pseudo-genome containing every site of all samples. The genotype with the highest probability is selected as the consensus genotype of each sample, and high-quality SNPs are detected using information such as genotype and sequencing depth. After the consensus sequence of each sample is obtained, the results are saved in population SNP format, giving the population SNP data.
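The last two steps (choosing the most probable genotype per sample, then keeping sites that are polymorphic in the population) can be sketched in Python as follows. This is a simplified illustration and not the SOAPsnp algorithm: it assumes the per-sample genotype probabilities are already available as dictionaries, writes genotypes as two-letter strings, and omits the depth and quality filters mentioned in the text. All names are hypothetical.

```python
def consensus_genotype(genotype_probs):
    """Return the genotype with the largest probability for one sample."""
    return max(genotype_probs, key=genotype_probs.get)


def call_population_snps(site_probs, ref_bases):
    """Keep sites where at least one sample differs from homozygous reference.

    site_probs: {(chrom, pos): [ {genotype: probability}, ... one per sample ]}
    ref_bases:  {(chrom, pos): reference nucleotide}
    """
    snps = {}
    for site, per_sample in site_probs.items():
        genotypes = [consensus_genotype(probs) for probs in per_sample]
        hom_ref = ref_bases[site] * 2  # e.g. 'A' -> 'AA'
        if any(g != hom_ref for g in genotypes):
            snps[site] = genotypes
    return snps


# Toy example: two sites, two samples.
site_probs = {
    ("chr1", 10): [{"AA": 0.9, "AG": 0.1}, {"AA": 0.2, "AG": 0.8}],
    ("chr1", 20): [{"CC": 0.99, "CT": 0.01}, {"CC": 0.7, "CT": 0.3}],
}
ref_bases = {("chr1", 10): "A", ("chr1", 20): "C"}
population_snps = call_population_snps(site_probs, ref_bases)
```

Only the first site is retained, because the second sample is most probably heterozygous there, while both samples are most probably homozygous reference at the second site.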

2) Population evolution analysis

The population SNP results are taken as input, and several software packages are called in an integrated way to carry out population evolution analysis based on the population SNPs, including Tree, PCA, Structure and Frappe analyses, as follows.

The program is named PopuStruct.pl, and its parameters are described in Table 3. Note that the population SNP file must correspond to the individual file. Structure takes a long time to run; if time is short, it is recommended to carry out the population structure analysis with Frappe first to obtain preliminary results.

Table 3

Parameter	Explanation
-indi <s>	Information on each individual in the population; the order of individuals must match the population SNP file. Required.
-list <s>	List file of population SNP genotype files. Required.
-OutDir <s>	Output path; defaults to the current path.
-prefix <s>	Prefix for output scripts; defaults to "Pop".
-Struct <y/n>	Whether to run population structure analysis with Structure; defaults to "y".
-Tree <y/n>	Whether to build the phylogenetic tree; defaults to "y".
-Frappe <y/n>	Whether to run population structure analysis with Frappe; defaults to "y".
-PCA <y/n>	Whether to run principal component analysis; defaults to "y".
-queue <s>	Queue to which jobs are submitted; defaults to bc.q.
-project <s>	Value of the -P parameter for job submission; defaults to rdtest.
-help	Help information.

Output files (results)

i) The Frappe and Structure result files can be adjusted and plotted with Excel. The result is shown in Fig. 6, a schematic diagram of the population genetic variation inferred by Frappe from the population SNPs. In the figure, each separated block represents one group and each position on the horizontal axis represents one sample; the different blocks represent K different, or strongly differing, ancestors, and the analysis gives, for the genetic composition of each line, the proportion contributed by each hypothetical ancestral component. If a sample corresponds to two different blocks, the sample may be an intermediate between two subgroups. The larger K is, the more the differences between samples are amplified and the finer the division becomes; which K value fully captures the structural relationships of all samples can be decided from the actual results. In the figure K takes the values 2, 3, 4 and 5; it can be seen that with K=3 the population is divided into 3 subpopulations, which essentially captures the structural relationships of all samples.

ii) The tree result file is adjusted with MEGA, and the result is shown in Fig. 7, a schematic diagram of the phylogenetic tree inferred from the population SNPs by the neighbor-joining method. In the figure, the shorter the distance between two branches, the closer their evolutionary relationship. Samples in the same subgroup should cluster together well or lie close to one another, and the figure shows how close or distant the evolutionary relationships between the lines are. As can be seen from Fig. 7, the population can be divided into 3 subpopulations.

iii) The PCA results need to be plotted with Excel, and the result is shown in Fig. 8, a schematic diagram of the PCA results based on the population SNPs. In the figure, markers of different shapes represent samples of different subgroups and each marker point represents one sample. The horizontal and vertical coordinates of a point are the values of the element at the sample's position in the first and second eigenvectors respectively, and the size of the corresponding eigenvalue represents the proportion of the overall relationship accounted for by that principal component. The figure can be compared with the actual grouping of the samples to judge the quality of the grouping, and in turn to decide whether the samples should be regrouped to obtain new subgroups.
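To illustrate how the coordinates in a PCA plot of this kind arise, the following Python sketch builds a sample-by-sample covariance matrix from genotypes coded as the number of non-reference alleles (0, 1 or 2) and extracts the first two eigenvectors by power iteration. It is an illustrative sketch under these assumptions and not the procedure used by PopuStruct.pl; the coding scheme and all function names are hypothetical.

```python
import math


def encode(genotype, ref):
    """Number of non-reference alleles in a two-letter genotype (0, 1 or 2)."""
    return sum(base != ref for base in genotype)


def sample_covariance(matrix):
    """matrix[s][i] = allele count of sample s at SNP i -> n x n covariance."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[i] for row in matrix) / n for i in range(m)]
    centered = [[row[i] - means[i] for i in range(m)] for row in matrix]
    return [[sum(a * b for a, b in zip(centered[s], centered[t])) / m
             for t in range(n)] for s in range(n)]


def multiply(mat, vec):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def top_eigenpairs(cov, k=2, iterations=500):
    """Leading k (eigenvalue, eigenvector) pairs by power iteration with deflation."""
    n = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        vec = [float(j + 1) for j in range(n)]  # fixed, non-uniform start
        for _ in range(iterations):
            nxt = multiply(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # nothing left to extract
                break
            vec = [x / norm for x in nxt]
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        value = sum(a * b for a, b in zip(vec, multiply(work, vec)))
        pairs.append((value, vec))
        # Deflate: remove the component just found.
        work = [[work[r][c] - value * vec[r] * vec[c] for c in range(n)]
                for r in range(n)]
    return pairs


# Toy data: 4 samples x 3 SNPs, two clearly separated groups.
matrix = [[0, 0, 2], [0, 0, 2], [2, 2, 0], [2, 2, 0]]
cov = sample_covariance(matrix)
(value1, vector1), (value2, vector2) = top_eigenpairs(cov)
explained1 = value1 / sum(cov[s][s] for s in range(len(cov)))
```

Element s of `vector1` and `vector2` gives the horizontal and vertical coordinate of sample s, and `explained1` is the share of the total variance carried by the first component. In the toy data the first eigenvector alone separates the two groups.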

III. Detection of sites subject to selection

Combining Embodiment one with the structure of the population SNP data obtained above, the formula is derived as follows:

\begin{aligned} F_{ST} &= \frac{\Pi_{\text{Between}} - \Pi_{\text{Within}}}{\Pi_{\text{Between}}} = 1 - \frac{\Pi_{\text{Within}}}{\Pi_{\text{Between}}} \\ &= 1 - \frac{\left[ \sum_{j} \binom{n_j}{2} \sum_{i} 2 \frac{n_{ij}}{n_{ij} - 1} x_{ij} (1 - x_{ij}) \right] / \sum_{j} \binom{n_j}{2}}{\sum_{i} 2 \frac{n_i}{n_i - 1} x_i (1 - x_i)} \end{aligned}

In the above formula, x_ij is the frequency of the minor allele (the second base) of SNP site i in subpopulation j; n_ij is the number of chromosomes actually genotyped at SNP site i in subpopulation j; and n_j is the total number of chromosomes of subpopulation j used in the comparative analysis. x_i and n_i are the corresponding quantities for SNP site i over all subpopulations taken together. The index j runs over the subpopulations, whose number is reset to 3 according to the population structure analysis result above, and the index i runs over the finally determined SNP positions.
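A minimal Python sketch of the formula follows. It reads n_ij as the number of chromosomes genotyped at site i in subpopulation j and n_j as the number of chromosomes analyzed in subpopulation j, and takes x_i and n_i as the values pooled over all subpopulations; these readings are assumptions made here so that the factor n/(n - 1) and the binomial coefficient are well defined. The function names and the input layout are hypothetical.

```python
from math import comb


def pi_term(n, x):
    """2 * n/(n-1) * x * (1-x): heterozygosity with small-sample correction."""
    return 2.0 * n / (n - 1) * x * (1.0 - x)


def fst(sites, n_chrom):
    """Evaluate the F_ST formula over a set of SNP sites.

    sites:   list of per-site dicts {subpopulation: (n_ij, minor_allele_count)}
    n_chrom: {subpopulation: n_j}
    """
    weights = {pop: comb(n, 2) for pop, n in n_chrom.items()}
    within = 0.0
    between = 0.0
    for site in sites:
        for pop, (n_ij, k_ij) in site.items():
            within += weights[pop] * pi_term(n_ij, k_ij / n_ij)
        n_i = sum(n for n, _ in site.values())  # pooled chromosome count
        k_i = sum(k for _, k in site.values())  # pooled minor allele count
        between += pi_term(n_i, k_i / n_i)
    if between == 0:
        raise ValueError("no polymorphism in the pooled sample; F_ST is undefined")
    within /= sum(weights.values())
    return 1.0 - within / between


# One site fixed for different alleles in two subpopulations of 4 chromosomes each.
fixed = fst([{"A": (4, 0), "B": (4, 4)}], {"A": 4, "B": 4})
# One site with identical allele frequencies in both subpopulations.
same = fst([{"A": (4, 2), "B": (4, 2)}], {"A": 4, "B": 4})
```

A fixed difference gives a value of 1, while identical frequencies give a slightly negative value (here -1/6), a known effect of the small-sample correction when the subpopulations do not differ.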

The above calculation and analysis process calls several software packages, based on the population SNPs, to detect sites that may be subject to selection between subpopulations. The program is named SnpSelect.pl and uses three software methods: Arlequin, BayeScan and Datacal. The parameters of each, including threshold settings, are described in Table 4.

perl SnpSelect.pl <snp.list> <individual> <2population.txt> [options], where the 2population file holds the information on the two subpopulations taking part in the site selection analysis; see the file description for its specific format.

Table 4

Output files

i) Arlequin analysis results, as shown in Fig. 9. Fig. 9 shows the results of the Arlequin program detecting sites subject to selection based on the population SNPs. The horizontal axis represents the heterozygosity of a given site at the population level, and the vertical axis represents the difference in heterozygosity (Fst) between subgroups at the given site. The points circled in the upper part represent sites under positive selection (q < 0.01 or q < 0.05), and the points circled in the lower part represent sites under balancing selection (q < 0.01 or q < 0.05).

ii) Global FST test results, as shown in Fig. 10. Fig. 10 shows the results of the Global FST test program detecting sites subject to selection based on the population SNPs. The horizontal axis represents the heterozygosity of a given site at the population level, and the vertical axis represents the difference in heterozygosity (Fst) between subgroups at the given site. The sites with the top 1% of Fst values are regarded as candidate sites; that is, the points above the horizontal line are the detected sites subject to selection.

iii) BayeScan analysis results, as shown in Fig. 11. Fig. 11 shows the results of the BayeScan program detecting sites subject to selection based on the population SNPs. The horizontal axis represents the heterozygosity of a given site at the population level, and the vertical axis represents the base-10 logarithm of the q value of the test for the given site. Sites with q value < 0.1 are regarded as candidate sites subject to selection; that is, the points to the right of the vertical line in the figure are the candidate sites.

Combining Figs. 9-11, in the site selection analysis the sites supported by at least two of the above methods are determined to be the final sites subject to selection.
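The final decision rule (keep the sites supported by at least two of the three methods) can be sketched in Python as below. The thresholds follow the text: the top 1% of Fst values, q < 0.05 for Arlequin and q < 0.1 for BayeScan. The input dictionaries, the site labels and the function names are hypothetical, and ties at the 1% cut-off are broken arbitrarily in this sketch.

```python
from collections import Counter


def top_fraction(fst_by_site, fraction=0.01):
    """Sites whose Fst lies in the top `fraction` of all values (at least one site)."""
    ranked = sorted(fst_by_site, key=fst_by_site.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])


def q_below(q_by_site, threshold):
    """Sites whose q value is below the threshold."""
    return {site for site, q in q_by_site.items() if q < threshold}


def supported_sites(candidate_sets, min_support=2):
    """Sites reported by at least `min_support` of the candidate sets."""
    votes = Counter(site for sites in candidate_sets for site in sites)
    return {site for site, count in votes.items() if count >= min_support}


# Toy inputs: 200 sites with increasing Fst, plus q values from two programs.
fst_by_site = {f"s{k}": k / 200 for k in range(200)}
arlequin_q = {"s199": 0.004, "s50": 0.03, "s10": 0.2}
bayescan_q = {"s50": 0.08, "s198": 0.5, "s7": 0.02}

candidates = [
    top_fraction(fst_by_site, 0.01),  # Global FST test: top 1% of Fst
    q_below(arlequin_q, 0.05),        # Arlequin
    q_below(bayescan_q, 0.1),         # BayeScan
]
final_sites = supported_sites(candidates, min_support=2)
```

Requiring agreement between methods trades sensitivity for specificity: in the toy data, s198 and s7 are each flagged by one method only and are dropped.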

In the description of this specification, reference to the terms "an embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and spirit of the present invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for determining a site subject to selection in a population, the selection including at least one of artificial selection and natural selection, characterized in that the method comprises the following steps:

(1) obtaining nucleic acid sequencing data of a population sample, the population sample coming from multiple individuals of one species, optionally from the same tissue or the same part of multiple individuals of one species, wherein the population sample can be divided into 2n first-level subpopulations according to n pairs of predetermined indicators, n being a natural number;

(2) performing detection based on the nucleic acid sequencing data in (1) to obtain population SNP data, the population SNP data including the SNP data of multiple first-level subpopulations;

(3) comparing, based on the population SNP data in (2), the differences in polymorphism between different first-level subpopulations to determine the SNPs subject to selection, the SNPs subject to selection being the sites subject to selection.

2. The method of claim 1, characterized in that, before or after (3) is carried out, the method includes:

carrying out population structure analysis on the population based on the population SNP data to obtain a population structure analysis result;

optionally, the population structure analysis includes at least one of phylogenetic tree construction, principal component analysis and STRUCTURE analysis.

3. The method of claim 2, characterized in that the population sample is divided based on the population structure analysis result, and the first-level subpopulations are replaced with the division result obtained.

4. The method of any one of claims 1-3, characterized in that the nucleic acid sequencing data of the population sample consist of the nucleic acid sequencing data of each individual sample making up the population sample, and the nucleic acid sequencing data of each individual sample amount to no less than 4 G.

5. The method of any one of claims 1-3, characterized in that the nucleic acid sequencing data are transcriptome sequencing data.

6. The method of any one of claims 1-3, characterized in that the predetermined indicators are related to geography and/or biological traits.

7. The method of any one of claims 1-3, characterized in that each first-level subpopulation includes at least 10 individuals.

8. The method of any one of claims 1-3, characterized in that the first-level subpopulation includes at least one second-level subpopulation;

optionally, the second-level subpopulation includes at least 10 individuals.

9. The method of any one of claims 1-3, characterized in that comparing, based on the population SNP data, the differences in polymorphism between different first-level subpopulations to determine the SNPs subject to selection includes:

based on the population SNP data, using at least two test methods to compare the differences in heterozygosity of the same SNP sites in the different first-level subpopulations, and determining the SNP sites supported by at least two test methods to be the SNPs subject to selection;

optionally, the test methods include the F statistic, analysis of molecular variance and multilayer protection.

10. The method of claim 9, characterized in that using at least two test methods to compare the differences in heterozygosity of the same SNP sites in the different first-level subpopulations and determining the SNP sites supported by at least two test methods to be the SNPs subject to selection includes:

calculating the heterozygosity difference value of the SNP site between different first-level subpopulations, and determining the SNP sites whose heterozygosity difference value is not less than a threshold to be sites subject to selection.

11. A device for determining a site subject to selection in a population, characterized in that it includes:

a data input unit, used to input data;

a data output unit, used to output data;

a processor, used to execute a machine-executable program, execution of the machine-executable program including carrying out the method of any one of claims 1-10;

a storage unit, connected to the data input unit, the data output unit and the processor and used to store data, including the machine-executable program.