CN113930492A - Biological information processing method for paternity test of contaminated sample

Biological information processing method for paternity test of contaminated sample

Info

Publication number
CN113930492A
Authority
CN
China
Prior art keywords
sample
typing
samples
paternity
polluted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111056059.9A
Other languages
Chinese (zh)
Inventor
杨功达
曾丰波
巫萍
胡秀娣
黄奎匀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Lansha Medical Laboratory Co ltd
Original Assignee
Wuhan Lansha Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Lansha Medical Laboratory Co ltd
Priority to CN202111056059.9A
Publication of CN113930492A

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C12Q1/6874 Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a biological information processing method for paternity testing of a contaminated sample, belonging to the technical field of bioinformatics. The method comprises the following steps: (1) obtaining sequencing data of the two samples to be subjected to paternity testing, wherein the sequencing data is obtained by second-generation sequencing; (2) judging whether one sample is contaminated by the other and, if so, typing the uncontaminated sample to obtain the number of homozygotes M and the number of heterozygotes N; (3) taking the M homozygous sites as the base, typing the corresponding sites in the contaminated sample to obtain a heterozygote ratio f; (4) calculating, for the M sites, the theoretical heterozygote ratio F0 expected under paternity and the theoretical heterozygote ratio F1 expected under non-paternity; (5) judging the degree of deviation of f from F0 and F1 by a statistical algorithm; if f is not significantly different from F0 and is significantly different from F1, the two samples are in a paternity relationship; if f is significantly different from F0 and is not significantly different from F1, the two samples are not in a paternity relationship.

Description

Biological information processing method for paternity test of contaminated sample
Technical Field
The invention belongs to the technical field of bioinformatics analysis, and particularly relates to a biological information processing method for paternity testing of a contaminated sample, mainly used for paternity testing when one of the two samples has contaminated the other.
Background
Paternity testing is a well-established application of genetic testing. In general, the genetic markers of two test materials are detected and the results for the two samples are compared; if the genetic markers of the two samples conform to Mendel's laws of inheritance, the two samples are considered consistent with paternity. Two types of genetic markers are currently in common use, namely short tandem repeats (STR) and single nucleotide polymorphisms (SNP); in addition, some studies use insertion/deletion (InDel) sites as genetic markers. First-generation sequencing is the most mature detection technology in the field of paternity testing, and paternity is generally determined from 21 STR loci. First-generation sequencing is fast, inexpensive and simple to operate, is widely used by testing institutions, and is currently the main detection technology for paternity testing. However, STR loci mutate easily, low-frequency error noise is easily generated during PCR amplification, and the first-generation sequencing method has inherent technical limits. As a result, if a sample is contaminated, especially when the contamination source accounts for most of the signal and the real sample for only a small part, the STR-based first-generation paternity test method can hardly detect the sample accurately. In particular, when one of the two samples is contaminated by the other, conventional first-generation sequencing cannot use the sample for paternity testing and may even give an erroneous result.
In real application scenarios, one of the two individuals undergoing paternity testing (such as the father or mother) often collects the sample from the other (such as the child). The test material of one party may then contaminate that of the other, and sometimes the contamination signal is even stronger than the signal of the real test material itself.
Disclosure of Invention
To solve the above problems, the invention adopts high-throughput (next-generation) sequencing with SNP sites as genetic markers. Target-region capture sequencing is performed on thousands of SNP sites in the human genome, and low-frequency variants down to one in a thousand can be detected at each SNP site. Through specific bioinformatics analysis and a statistical algorithm, the problem of sample cross-contamination can be solved: even if one test material is 99% contaminated by the other, the invention can still accurately identify whether the two test materials are consistent with a paternity relationship. The technical scheme is as follows:
An embodiment of the invention provides a biological information processing method for paternity testing of a contaminated sample, which comprises the following steps:
S101: obtaining sequencing data of the two samples to be subjected to paternity testing, wherein the sequencing data is obtained by second-generation sequencing;
S102: judging whether one sample is contaminated by the other and, if so, typing the uncontaminated sample to obtain the number of homozygotes M and the number of heterozygotes N;
S103: taking the M homozygous sites as the base, typing the corresponding sites in the contaminated sample and calculating a heterozygote ratio f;
S104: calculating, for the M sites, the theoretical heterozygote ratio F0 expected under paternity and the theoretical heterozygote ratio F1 expected under non-paternity;
S105: judging the degree of deviation of f from F0 and F1 by a statistical algorithm; if f is not significantly different from F0 and is significantly different from F1, the two samples are in a paternity relationship; if f is significantly different from F0 and is not significantly different from F1, the two samples are not in a paternity relationship.
In step S101, polymorphic SNP sites with a mutation rate (the population frequency of the mutant allele) of 0.4-0.6 are sequenced, and the number of polymorphic SNP sites is more than 200.
Specifically, the sequencing method of the second-generation sequencing technology comprises the following steps:
S1011: constructing probes;
S1012: extracting DNA from the sample;
S1013: constructing a library;
S1014: performing hybridization capture and sequencing of the library target regions using the probes of step S1011;
S1015: demultiplexing the sequencing output and filtering it by quality value to obtain the sequencing data.
Judging whether one sample is contaminated by the other comprises: aligning the sequenced reads to a human reference genome; calculating, for each sample, the sequencing depth of A and a at each SNP site; drawing typing maps of the two samples from the wild-type proportion and the sequencing depth of each site; and judging from the typing maps whether each sample is contaminated. If one sample is contaminated and the other is not, the correlation between the two samples is calculated, and if the correlation is high, it is judged that one sample is contaminated by the other.
Specifically, judging from the typing map whether a sample is contaminated comprises: drawing a typing map with the sequencing depth of the SNP site as the abscissa and the wild-type proportion of the SNP site as the ordinate; a typing map in which the points at ordinates 0 and 1 are not scattered is taken as typing map G of a clean sample; a typing map in which the points are uniformly distributed over ordinates 0 to 1 is taken as typing map Y of a severely contaminated sample; and a typing map in which the points at ordinates 0 and 1 are more scattered but do not overlap the points near ordinate 0.5 is taken as typing map L of a slightly contaminated sample. If the typing map of a sample is Y, the sample is contaminated; if the typing map of a sample is G or L, the sample is uncontaminated.
In step S102, M is obtained by correcting to homozygotes the sites at which the proportion of A is 85% or more, or 15% or less.
The correlation of the two samples is judged to be high if it is greater than 0.7.
In step S103, only the sites at which the proportion of A is 100% or 0% are designated homozygotes, and all other sites are designated heterozygotes.
Step S104 specifically comprises: calculating F0 and F1 from the population frequency p of the wild-type allele and the population frequency q of the mutant allele at the M homozygous SNP sites, according to Mendel's laws of inheritance.
Specifically, in step S105, a chi-square test is used to judge the degree of deviation of f from F0 and F1; a p value greater than 0.05 is taken to indicate no significant difference, and a p value less than 0.05 a significant difference.
The method first performs contamination identification and correlation analysis on the test materials submitted for paternity testing, judging whether the two samples are cross-contaminated and thereby avoiding the false positives caused by cross-contamination (wrongly judging the two samples to be consistent with paternity). With the invention, even if test material B is 99% contaminated by test material A, paternity testing of A and B can still be carried out provided the sequencing depth is sufficient, with an accuracy above 99.99%. The invention greatly improves the usability of test materials, and in particular allows paternity testing of precious test materials obtained under special conditions even when cross-contamination is present.
Drawings
FIG. 1 is a flow chart of a method for processing biological information for paternity testing of a contaminated sample according to an embodiment of the present invention;
FIG. 2 is a flowchart of step S101;
FIG. 3 is a typing chart G of a clean sample in the example;
FIG. 4 is a typing chart L of a slightly contaminated sample in the example;
FIG. 5 is a typing chart Y of a severely contaminated sample in the example.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
Example 1
Referring to FIG. 1, Example 1 provides a biological information processing method for paternity testing of a contaminated sample, the method comprising the following steps:
S101: Obtain sequencing data of the two samples to be subjected to paternity testing. The sequencing data is obtained by second-generation sequencing, and the two samples are usually a father sample and a child sample.
S102: Judge from the sequencing data whether one sample is contaminated by the other. If so, type the uncontaminated sample from the sequencing data to obtain the number of homozygotes M and the number of heterozygotes N. If not, a common paternity testing method may be used, such as paternity testing by mismatch rate.
S103: Taking the M homozygous sites as the base, type the corresponding sites in the contaminated sample from the sequencing data and calculate the heterozygote ratio f.
S104: Calculate, for the M sites, the theoretical heterozygote ratio F0 expected under paternity and the theoretical heterozygote ratio F1 expected under non-paternity.
S105: Judge the degree of deviation of f from F0 and F1 by a statistical algorithm. If f is not significantly different from F0 and is significantly different from F1, the two samples are in a paternity relationship; if f is significantly different from F0 and is not significantly different from F1, the two samples are not in a paternity relationship.
Example 2
Referring to FIG. 2, Example 2 provides a biological information processing method for paternity testing of a contaminated sample, the method comprising the following steps:
S101: The two samples are sequenced by second-generation sequencing to obtain sequencing data. Thousands of highly polymorphic SNP sites in the human genome are selected as genetic markers for paternity testing. After the test material is obtained, nucleic acid is first extracted from it and a whole-genome library is constructed. During library construction, a barcode sequence identifying the sample, a sequencing adapter suitable for high-throughput sequencing, and other necessary sequences are added to the DNA of each sample, and whole-genome amplification is performed. After the library is complete, a set of probe sequences is used for liquid-phase hybridization capture of the thousands of SNP sites, followed by high-throughput sequencing and bioinformatics analysis.
In step S101, existing second-generation sequencing technology is used to sequence polymorphic SNP sites with a mutation rate (the population frequency of the mutant allele) of 0.4-0.6. The number of polymorphic SNP sites is more than 200 and, for cost reasons, usually fewer than 20000. To ensure that each SNP site can be detected, it may be covered by 1 to 4 probes. Specifically, referring to FIG. 2, the sequencing method of the second-generation sequencing technology comprises:
S1011: constructing probes;
S1012: extracting DNA from the sample;
S1013: constructing a library;
S1014: performing hybridization capture and sequencing of the library target regions using the probes of step S1011;
S1015: demultiplexing the sequencing output and filtering it by quality value to obtain the sequencing data.
The second-generation sequencing follows a conventional method. The probes may be products sold by Biotechnology engineering (Shanghai) GmbH, with 12000 probes designed in total.
S102: The sequenced reads are aligned to a human reference genome, the sequencing depth of A and a is calculated for each sample at each SNP site, typing maps of the two samples are drawn from the wild-type proportion and the sequencing depth of each site, and whether each sample is contaminated is judged from its typing map. Referring to FIGS. 3 to 5, the typing map is plotted with the sequencing depth of the SNP site as the abscissa and the wild-type proportion of the SNP site as the ordinate. A typing map in which the points at ordinates 0 and 1 are not scattered (or only slightly scattered) is denoted typing map G of a clean sample; a typing map in which the points are uniformly distributed over ordinates 0 to 1 is denoted typing map Y of a severely contaminated sample; and a typing map in which the points at ordinates 0 and 1 are more scattered (showing a fringe) but do not overlap the points near ordinate 0.5 (heterozygotes) is denoted typing map L of a slightly contaminated sample. If the typing map of a sample is Y, the sample is contaminated; if it is G or L, the sample is uncontaminated. If one typing map is Y and the other is G or L, the two samples are recorded as contaminated and uncontaminated, respectively. In that case, the correlation between the two samples is calculated, and if the correlation is high, it is judged that one sample is contaminated by the other.
The correlation for an ordinary parent-child pair is 0.4-0.6. A correlation greater than 0.7 indicates that the contamination source comes from the same family; a correlation below 0.2 indicates contamination from another family. That is, if the correlation between the two samples is greater than 0.7, it is judged to be high, and it can be concluded that one sample is contaminated by the other and that both belong to the same family.
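The thresholds above can be sketched in code. The text does not say how the correlation is computed, so this minimal sketch assumes a Pearson correlation of the per-site wild-type proportions of the two samples; the function names and the "inconclusive" label for values between 0.2 and 0.7 are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed metric)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)

def contamination_source(r, same_family=0.7, other_family=0.2):
    """Apply the 0.7 / 0.2 thresholds stated in the text to a correlation r."""
    if r > same_family:
        return "same_family"    # contamination source is the paired sample
    if r < other_family:
        return "other_family"   # contamination from outside the family
    return "inconclusive"       # hypothetical label; the text gives no rule here

# Wild-type proportions per SNP site (illustrative values only)
clean = [1.0, 0.0, 0.5, 1.0, 0.0, 0.5]
dirty = [0.9, 0.1, 0.5, 1.0, 0.2, 0.6]
verdict = contamination_source(pearson(clean, dirty))
```

In practice the correlation would be computed over all captured SNP sites rather than six illustrative values.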
In step S102, regardless of whether the typing map is G or L, the sites at which the proportion of A is greater than or equal to K1% (including 100%) or less than or equal to K2% (including 0%) are corrected to homozygotes to obtain M. K1 and K2 are chosen according to the actual situation; in this example, based on experience, K1 is 85 and K2 is 15.
Specifically, after sequencing and analysis, each SNP site of each sample has a total sequencing depth as well as a "wild-type" depth and a "mutant" depth determined against the human genome reference sequence. Taking one SNP site as an example, let A denote the wild-type allele and a the mutant allele. If the total depth of the site is 100X, with A at 100X and a at 0X, the site is homozygous wild-type; if A is 0X and a is 100X, the site is homozygous mutant; if the depths of A and a are close to 1:1, the site is heterozygous. In practice, some SNP sites may show, for example, 95X A and 5X a because of sequencing errors or slight contamination of the sample during collection. By computing the wild-type proportion of every SNP in each sample and sorting the sites by sequencing depth, a wild-type proportion distribution map can be drawn, from which it can be preliminarily judged whether the sample is clean, slightly contaminated or severely contaminated. SNP typing can be performed on "clean" and "slightly contaminated" samples by setting thresholds (e.g., 5%, 15%) according to actual needs. For example, a site is typed AA if the proportion of A is greater than 95%, aa if the proportion of A is less than 5%, and Aa in between, thereby correcting low-frequency contamination.
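The threshold typing described above can be sketched as follows. This is a minimal illustration using the K1 = 85 and K2 = 15 values of this example; the function name and the handling of zero-depth sites are assumptions, not part of the original text.

```python
def call_genotype(depth_A, depth_a, k1=0.85, k2=0.15):
    """Type one SNP site of a clean or slightly contaminated sample.

    depth_A / depth_a: read depths of the wild-type and mutant alleles.
    k1 / k2: upper and lower wild-type-proportion thresholds (85% / 15% here).
    Returns "AA", "Aa", "aa", or None when the site has no coverage
    (the None case is an assumption of this sketch).
    """
    total = depth_A + depth_a
    if total == 0:
        return None
    frac_A = depth_A / total
    if frac_A >= k1:
        return "AA"   # low-frequency a signal is treated as noise
    if frac_A <= k2:
        return "aa"   # low-frequency A signal is treated as noise
    return "Aa"

# (A depth, a depth) pairs taken from the cases discussed in the text
calls = [call_genotype(dA, da) for dA, da in [(100, 0), (0, 100), (45, 55), (95, 5)]]
```

Counting the "AA" and "aa" calls over all sites of the uncontaminated sample gives M, and counting the "Aa" calls gives N.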
In the typing map, each point represents one SNP site; the abscissa is the sequencing depth of the site and the ordinate is its wild-type proportion. Taking one SNP site as an example, with A the wild-type allele and a the mutant allele: if the total depth is 100X with A at 100X and a at 0X, the site is homozygous wild-type, with abscissa 100 and ordinate 1. If A is 0X and a is 100X, the site is homozygous mutant, with abscissa 100 and ordinate 0. If A is 45X and a is 55X, so that A and a are close to 1:1, the site is heterozygous, with abscissa 100 and ordinate 0.45. If sample A has genotype AA at a site and sample B has genotype AA or Aa at that site, the site does not exclude paternity between sample A and sample B and is counted as a "matching" site; if sample B is aa, the site excludes paternity between sample A and sample B and is counted as a "mismatch" site. In general, when paternity between sample A and sample B holds, 100% of the detected sites are matching sites; when two unrelated individuals are compared, 80-85% of the sites match Mendelian inheritance by chance and 10-15% are mismatch sites.
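The matching and mismatch counting can be sketched as below. The text states the AA versus aa case; treating aa versus AA as a mismatch as well follows from Mendelian inheritance. The function names are hypothetical.

```python
def is_mismatch(g1, g2):
    """True when the two genotypes share no allele (AA vs aa), which
    excludes a parent-child relationship at this site."""
    return {g1, g2} == {"AA", "aa"}

def mismatch_rate(genos_1, genos_2):
    """Fraction of jointly typed sites that are mismatch sites."""
    pairs = [(a, b) for a, b in zip(genos_1, genos_2)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no jointly typed sites")
    return sum(is_mismatch(a, b) for a, b in pairs) / len(pairs)

# Illustrative genotypes at four sites
father = ["AA", "AA", "aa", "Aa"]
child = ["AA", "aa", "Aa", "aa"]
rate = mismatch_rate(father, child)  # only the second site is a mismatch
```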
When one test material is contaminated, even severely, by the other, detection becomes far more difficult. Suppose sample A can be typed normally, while sample B is contaminated with sample A. If sample A is AA at a given SNP site, the material of sample A that contaminates sample B introduces an A signal into the sequencing result of sample B at that site. Under the conventional typing method, the site in sample B is then typed AA if sample B is truly AA, or Aa if sample B is truly Aa or aa, but it can hardly ever be typed aa, especially when the cross-contamination is severe and the contamination from sample A makes up a large proportion of sample B. Under that method, whether or not sample A and sample B are parent and child, the contamination of sample B by sample A means that no mismatch (AA in sample A and aa in sample B) can appear in the result for sample B, the matching rate reaches 100%, and an erroneous conclusion may be given. In this patent, for such cross-contaminated test materials, the conventional Mendelian matching principle is not used as the basis for the paternity decision. Instead, the homozygote ratio and heterozygote ratio of sample A and sample B are counted and compared with the distribution frequencies of the thousands of SNP sites in the population, and this serves as the basis for deciding whether sample A and sample B are parent and child. In practice, at least one sample must be typeable, with no contamination or only slight contamination that does not affect typing; here this is assumed to be sample A.
When the conventional comparison gives a matching rate of 100% for sample A and sample B but obvious contamination is visible in the SNP typing maps, further analysis is needed, as follows. The typing maps are used to establish that sample B is contaminated and sample A is not; the correlation is then calculated, and if the correlation between sample B and sample A is high, sample B is judged to be contaminated by sample A.
S103: Taking the M homozygous sites as the base, the corresponding sites in the contaminated sample are typed and the heterozygote ratio f is calculated. In this step, only sites at which the proportion of A is 100% or 0% are designated homozygotes; all other sites are designated heterozygotes. Specifically, SNP typing is performed on test material B using the strictest threshold: a site is typed AA or aa only if 100% of the reads are A or a, respectively, and even a site with A:a of 100:1 is typed Aa. The homozygotes and heterozygotes of sample B are then counted over the SNP sites that are homozygous in sample A, and the heterozygote ratio f is calculated.
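Step S103 can be sketched as follows, assuming the clean sample has already been typed and the contaminated sample is given as per-site (A depth, a depth) pairs. The function names and the skipping of zero-coverage sites are assumptions of this sketch.

```python
def strict_call(depth_A, depth_a):
    """Strictest typing: homozygous only when a single allele is observed."""
    if depth_A + depth_a == 0:
        return None          # no coverage (assumed to be skipped)
    if depth_a == 0:
        return "AA"
    if depth_A == 0:
        return "aa"
    return "Aa"              # even A:a = 100:1 counts as heterozygous

def heterozygote_ratio(clean_genotypes, contaminated_depths):
    """Heterozygote ratio f of the contaminated sample over the sites
    that are homozygous in the clean sample.

    Returns (f, homozygote_count, heterozygote_count).
    """
    hom = het = 0
    for genotype, (dA, da) in zip(clean_genotypes, contaminated_depths):
        if genotype not in ("AA", "aa"):
            continue         # only the M homozygous sites form the base
        call = strict_call(dA, da)
        if call is None:
            continue
        if call == "Aa":
            het += 1
        else:
            hom += 1
    if hom + het == 0:
        raise ValueError("no usable homozygous sites")
    return het / (hom + het), hom, het

# Illustrative data: the third site is heterozygous in the clean sample and is ignored
clean = ["AA", "aa", "Aa", "AA"]
depths = [(100, 0), (99, 1), (50, 50), (100, 1)]
f, hom, het = heterozygote_ratio(clean, depths)
```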
S104: The theoretical heterozygote ratio F0 expected under paternity and the theoretical heterozygote ratio F1 expected under non-paternity are calculated from the population frequency p of the wild-type allele and the population frequency q of the mutant allele at the M homozygous SNP sites, according to Mendel's laws of inheritance.
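The text does not give the formulas for F0 and F1 explicitly, so the sketch below is a reconstruction under stated assumptions: the contaminated sample shows the alleles of both contributors, strict typing is used, and the clean sample is the parent. At a site where the parent is AA, an unrelated child yields a homozygous mixture only if the child is also AA (probability p²), while a true child is AA only if the other parent transmits A (probability p). The function name is hypothetical.

```python
def expected_ratios(sites):
    """Expected heterozygote ratios (F0, F1) over the homozygous sites.

    sites: iterable of (parent_genotype, p), where parent_genotype is
    "AA" or "aa" and p is the population frequency of allele A (q = 1 - p).
    F0: expectation under paternity; F1: expectation under non-paternity.
    """
    f0_terms, f1_terms = [], []
    for genotype, p in sites:
        q = 1.0 - p
        if genotype == "AA":
            f0_terms.append(q)            # child is Aa if the other parent gives a
            f1_terms.append(1.0 - p * p)  # unrelated child is anything but AA
        elif genotype == "aa":
            f0_terms.append(p)            # child is Aa if the other parent gives A
            f1_terms.append(1.0 - q * q)  # unrelated child is anything but aa
        else:
            raise ValueError("only homozygous sites are used")
    if not f0_terms:
        raise ValueError("no sites given")
    n = len(f0_terms)
    return sum(f0_terms) / n, sum(f1_terms) / n

# With p = q = 0.5 at every site the sketch gives F0 = 0.5 and F1 = 0.75
F0, F1 = expected_ratios([("AA", 0.5), ("aa", 0.5)])
```

Because the real allele frequencies differ from site to site, F0 and F1 are recomputed for each family from the sites that are homozygous in its clean sample.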
S105: The degree of deviation of f from F0 and F1 is judged by a chi-square test. If f is not significantly different from F0 and is significantly different from F1, the two samples are in a paternity relationship; if f is significantly different from F0 and is not significantly different from F1, the two samples are not in a paternity relationship. Specifically, a p value greater than 0.05 is taken to indicate no significant difference, and a p value less than 0.05 a significant difference. Assuming that p and q are both 50% at all SNP sites, the expected value of F0 is 50% and the expected value of F1 is 75%. The chi-square test shows which of F0 and F1 the observed f conforms to, and thus whether the paternity relationship between sample A and sample B holds.
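The decision rule of S105 can be sketched as below. The exact form of the chi-square test is not specified in the text; this sketch assumes a goodness-of-fit test on the homozygote and heterozygote counts with one degree of freedom, and uses the usual convention that a p value below 0.05 indicates a significant difference. The function names and the "inconclusive" outcome are hypothetical.

```python
from math import erfc, sqrt

def chi_square_p(het, hom, expected_het_ratio):
    """Chi-square goodness-of-fit (1 degree of freedom) of observed
    heterozygote/homozygote counts against an expected heterozygote ratio.
    Returns (chi2, p_value)."""
    n = het + hom
    e_het = n * expected_het_ratio
    e_hom = n - e_het
    if e_het <= 0 or e_hom <= 0:
        raise ValueError("expected ratio must lie strictly between 0 and 1")
    chi2 = (het - e_het) ** 2 / e_het + (hom - e_hom) ** 2 / e_hom
    # Survival function of the chi-square distribution with 1 df
    return chi2, erfc(sqrt(chi2 / 2.0))

def judge(het, hom, F0, F1, alpha=0.05):
    """Compare the observed counts with F0 (paternity) and F1 (non-paternity)."""
    _, p0 = chi_square_p(het, hom, F0)
    _, p1 = chi_square_p(het, hom, F1)
    if p0 >= alpha and p1 < alpha:
        return "paternity"
    if p0 < alpha and p1 >= alpha:
        return "non-paternity"
    return "inconclusive"   # hypothetical label; the text defines no third case

# Illustrative counts with the idealized F0 = 0.5 and F1 = 0.75
result_a = judge(het=250, hom=250, F0=0.5, F1=0.75)
result_b = judge(het=375, hom=125, F0=0.5, F1=0.75)
```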
Example 3
Example 3 describes a specific detection process using the method, as follows:
A family designated RT0281 submitted a father sample and a child sample, both fingernail samples, designated RT0281F and RT0281C, respectively. Sequencing analysis showed that the father sample was clean and uncontaminated, but the child sample presented a mixed signal of two samples in a ratio of 1:4; comparison showed that 80% of the signal in the child sample actually came from the corresponding father sample and only 20% from the child. Following the method, all SNP sites of the RT0281F sample were first typed and counted, giving 536 homozygotes and 575 heterozygotes. At the 536 sites homozygous in RT0281F, RT0281C had 298 homozygotes and 238 heterozygotes, so the heterozygote ratio f was 0.44. From the population frequencies p of the wild-type allele and q of the mutant allele at the 536 SNP sites homozygous in RT0281F, combined with Mendel's laws of inheritance, the expected values of F0 and F1 were calculated as 0.47 and 0.72, respectively. By the chi-square test, the f value of 0.44 is not significantly different from F0 but is significantly different from F1, so f is judged to conform to the F0 distribution, and the paternity relationship between RT0281F and RT0281C is judged to hold.
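The chi-square comparison for this family can be checked by hand. The worked arithmetic below assumes a goodness-of-fit test with one degree of freedom on the 238 heterozygous and 298 homozygous sites (n = 536); the exact test form is not stated in the text, and 3.84 is the standard critical value at the 0.05 level for one degree of freedom.

```latex
% Expected counts: 0.47 x 536 = 251.92 and 0.72 x 536 = 385.92 heterozygotes.
\begin{align*}
\chi^2_{F_0} &= \frac{(238 - 251.92)^2}{251.92} + \frac{(298 - 284.08)^2}{284.08}
              \approx 1.45 < 3.84, \\
\chi^2_{F_1} &= \frac{(238 - 385.92)^2}{385.92} + \frac{(298 - 150.08)^2}{150.08}
              \approx 202.5 \gg 3.84.
\end{align*}
```

Under these assumptions the observed counts are compatible with the 0.47 expectation and incompatible with the 0.72 expectation.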
The specific analysis process is as follows. Assuming that the two are not parent and child, the SNP sites that are homozygous in RT0281F (i.e., AA and aa) are selected; the calculation of the theoretical heterozygote ratio is shown in Table 1.
Table 1. Assuming that the two are not in a parent-child relationship
[Table 1 is provided as an image in the original publication and is not reproduced here.]
Thus, assuming the two are not parent and child, when the AA + aa sites in RT0281F are selected and these sites in RT0281C are contaminated, the ratio of homozygotes to heterozygotes is 1:3 and the heterozygote ratio is 0.75. This assumes that the mutation rate of the selected biallelic SNP sites is exactly 50%; in practice the mutation rate of each SNP site fluctuates around that value (e.g., 0.48 or 0.55), so the expected non-paternity heterozygote ratio must be recalculated for each family. For this family it is 0.72.
Assuming that the two are parent and child, the SNP sites that are homozygous in RT0281F (i.e., AA and aa) are selected; the calculation of the theoretical heterozygote ratio is shown in Table 2.
Table 2. Assuming that the two are in a parent-child relationship
[Table 2 is provided as an image in the original publication and is not reproduced here.]
Thus, assuming the two are parent and child, when the AA + aa sites in RT0281F are selected and these sites in RT0281C are contaminated, the ratio of homozygotes to heterozygotes is 1:1 and the heterozygote ratio is 0.5. This assumes that the mutation rate of the selected biallelic SNP sites is exactly 50%; in practice the mutation rate of each SNP site fluctuates around that value (e.g., 0.48 or 0.55), so the expected paternity heterozygote ratio must be recalculated for each family. For this family it is 0.47.
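Since Tables 1 and 2 are available only as images, the 1:3 and 1:1 ratios can be re-derived as follows. This is a reconstruction under stated assumptions, not a copy of the original tables: the contaminated child sample shows the alleles of both contributors, strict typing is applied, and p and q denote the population frequencies of A and a (p + q = 1).

```latex
% Site where the clean (father) sample is AA; exchange p and q for an aa site.
\begin{align*}
\text{Unrelated:}\quad & P(\text{child is } AA) = p^2
  \;\Rightarrow\; P(\text{mixture typed heterozygous}) = 1 - p^2, \\
\text{Father-child:}\quad & P(\text{child is } AA) = p \ (\text{maternal allele is } A)
  \;\Rightarrow\; P(\text{mixture typed heterozygous}) = 1 - p = q, \\
\text{With } p = q = 0.5:\quad & F_1 = 1 - 0.25 = 0.75 \ (\text{hom:het} = 1:3), \qquad
  F_0 = 0.5 \ (\text{hom:het} = 1:1).
\end{align*}
```

Averaging the per-site expressions over the sites that are homozygous in the clean sample, with each site's own p and q, gives family-specific values such as those quoted above.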
The above describes only preferred embodiments of the present invention and is not to be construed as limiting it; any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to fall within its scope of protection.

Claims (10)

1. A biological information processing method for paternity testing of a contaminated sample, comprising the following steps:
S101: obtaining sequencing data for the two samples to be tested for paternity, the sequencing data being obtained by a second-generation sequencing technology;
S102: determining whether one sample is contaminated by the other sample and, if so, typing the uncontaminated sample to obtain the number of homozygous sites M and the number of heterozygous sites N;
S103: taking the M homozygous sites as the base, typing the corresponding sites in the contaminated sample and calculating the heterozygosity ratio F;
S104: calculating, on the basis of M, the theoretical heterozygosity ratio F0 consistent with paternity and the theoretical heterozygosity ratio F1 inconsistent with paternity;
S105: assessing the deviation of F from F0 and from F1 with a statistical algorithm; if F is not significantly different from F0 and is significantly different from F1, the two samples are in a paternity relationship; if F is significantly different from F0 and is not significantly different from F1, the two samples are not in a paternity relationship.
2. The biological information processing method for paternity testing of a contaminated sample according to claim 1, characterized in that in step S101 more than 200 polymorphic SNP sites with a mutant allele frequency of 0.4-0.6 are sequenced.
3. The method of claim 2, wherein sequencing by the second-generation sequencing technology comprises:
S1011: constructing probes;
S1012: extracting DNA from the samples;
S1013: constructing a library;
S1014: performing hybridization capture of the library target regions with the probes of step S1011, followed by sequencing;
S1015: demultiplexing the raw data and filtering it by quality value to obtain the sequencing data.
4. The method of claim 2, wherein determining whether one sample is contaminated by the other sample comprises: aligning the sequenced reads to a human reference genome; calculating, for each sample, the sequencing depth of alleles A and a at each SNP site; drawing a typing map for each of the two samples from the wild-type proportion and the sequencing depth of each site; judging from the typing maps whether each sample is contaminated; and, if one sample is contaminated and the other is not, calculating the correlation between the two samples and, if the correlation is high, determining that the one sample is contaminated by the other.
5. The method of claim 4, wherein the determining whether the sample is contaminated according to the typing map comprises:
drawing the typing map with the sequencing depth of each SNP site as the abscissa and the wild-type proportion of the site as the ordinate; taking a typing map in which the points at ordinates 0 and 1 are not dispersed as the typing map G of a clean sample; taking a typing map in which the points are uniformly distributed between ordinates 0 and 1 as the typing map Y of a severely contaminated sample; and taking a typing map in which the points at ordinates 0 and 1 are more dispersed but do not overlap the points near ordinate 0.5 as the typing map L of a slightly contaminated sample; if the typing map of a sample is Y, the sample is contaminated; if the typing map of a sample is G or L, the sample is treated as uncontaminated.
6. The method of claim 5, wherein in step S102 M is obtained after correcting to homozygotes the sites at which the proportion of A is 85% or more, or 15% or less.
7. The method of claim 4, wherein if the correlation between the two samples is greater than 0.7, the correlation between the two samples is determined to be high.
8. The method of claim 1, wherein in step S103, only the sites where the proportion of A is 100% or 0% are designated as homozygotes, and the other sites are designated as heterozygotes.
9. The method of claim 2, wherein step S104 comprises: calculating F0 and F1 from the population frequency p of the wild-type allele and the population frequency q of the mutant allele at the M homozygous SNP sites, according to Mendel's laws of inheritance.
10. The method of claim 1, wherein in step S105 the deviation of F from F0 and from F1 is assessed by a chi-square test, and a difference is considered significant when p is less than 0.05.
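Steps S103 and S105 can be illustrated with a minimal sketch. The Python below is an assumption-laden example, not the patented implementation. It applies the strict typing rule of claim 8 to read depths, then runs a one-degree-of-freedom chi-square goodness-of-fit test of the heterozygous and homozygous counts against each theoretical ratio. It follows the conventional reading in which p < 0.05 indicates a significant difference. The function names, the input layout (pairs of read depths for alleles A and a) and the "inconclusive" outcome are hypothetical.

```python
import math
from typing import Iterable, Tuple


def count_heterozygous(depths: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (heterozygous sites, typed sites) from (depth_A, depth_a) pairs.

    A site is homozygous only when every read carries the same allele
    (proportion of A is 100% or 0%). Sites with no reads are skipped.
    """
    het = typed = 0
    for depth_A, depth_a in depths:
        if depth_A + depth_a == 0:
            continue
        typed += 1
        if depth_A > 0 and depth_a > 0:
            het += 1
    return het, typed


def chi_square_p(het: int, m: int, expected_ratio: float) -> float:
    """P-value of a 1-df chi-square test of het vs. hom counts.

    expected_ratio must lie strictly between 0 and 1.
    """
    exp_het = m * expected_ratio
    exp_hom = m - exp_het
    hom = m - het
    chi2 = (het - exp_het) ** 2 / exp_het + (hom - exp_hom) ** 2 / exp_hom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def decide(het: int, m: int, f0: float, f1: float, alpha: float = 0.05) -> str:
    """Compare the observed counts with F0 (paternity) and F1 (non-paternity)."""
    p0 = chi_square_p(het, m, f0)
    p1 = chi_square_p(het, m, f1)
    if p0 >= alpha and p1 < alpha:
        return "paternity"
    if p0 < alpha and p1 >= alpha:
        return "non-paternity"
    return "inconclusive"  # hypothetical label; the claims do not cover this case


if __name__ == "__main__":
    # Illustrative counts only: 96 heterozygous calls among 200 typed sites.
    print(decide(96, 200, f0=0.47, f1=0.72))
```

The counts in the usage line are made up for illustration; 0.47 and 0.72 are the family-specific expectations given in the description.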
CN202111056059.9A 2021-09-09 2021-09-09 Biological information processing method for paternity test of contaminated sample Pending CN113930492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111056059.9A CN113930492A (en) 2021-09-09 2021-09-09 Biological information processing method for paternity test of contaminated sample

Publications (1)

Publication Number Publication Date
CN113930492A true CN113930492A (en) 2022-01-14

Family

ID=79275214

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination