CN117275585A - Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment - Google Patents

Info

Publication number
CN117275585A
Authority
CN
China
Prior art keywords
methylation
fragment
sequencing data
lung cancer
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311311093.5A
Other languages
Chinese (zh)
Inventor
赵杰
李晓敏
薛茹月
吴梦思
杨梅佳
邓望龙
张旭
张超
李砺锋
王小强
祁闯
段晓冉
闫芮
任用
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
First Affiliated Hospital of Zhengzhou University
Original Assignee
First Affiliated Hospital of Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by First Affiliated Hospital of Zhengzhou University
Priority to CN202311311093.5A
Publication of CN117275585A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Public Health (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Primary Health Care (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The embodiments of the application provide a method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation, and an electronic device. The method comprises: collecting peripheral blood from lung cancer patients and healthy individuals, extracting cfDNA, and establishing a sample set; performing low-depth whole genome sequencing and methylation-targeted sequencing on the cfDNA and constructing sequencing libraries; performing feature extraction on the low-depth whole genome sequencing data and the methylation-targeted sequencing data to obtain fragment features of the whole genome sequencing data and methylation features of the methylation-targeted sequencing data; constructing a lung cancer early-screening prediction model based on multi-feature cross stacking from the fragment features of the whole genome sequencing data, the methylation features of the methylation-targeted sequencing data and the fragmentomic features of the methylation data; and training and validating the prediction model on the sample set to obtain a final lung cancer early-screening prediction model and a final prediction result. The method enables noninvasive, high-accuracy prediction of early lung cancer and is applicable to the technical field of biomedicine.

Description

Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment
Technical Field
The application relates to the technical field of biomedicine, and in particular to a method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation, and to an electronic device.
Background
Lung cancer is the most common cancer in China. Most lung cancer patients have no obvious specific symptoms at disease onset, and 64.6% of cases are typically diagnosed at an advanced stage (stage III/IV). Screening at a stage of lower tumor burden is therefore critical to reducing lung cancer mortality. Low-dose computed tomography (LDCT) is commonly used for lung cancer screening, but radiation exposure and ambiguous risk assessment result in a low compliance rate (35.6%) and may lead to overdiagnosis.
Liquid biopsy is a promising approach to cancer screening that requires only a small amount of biological fluid. Samples are easy to obtain, cost-effective and can be collected repeatedly, which favors compliance with cancer screening. Blood-based markers such as cytokeratin 19 fragment (CYFRA 21-1), neuron-specific enolase (NSE) and squamous cell carcinoma antigen (SCC-Ag) have been investigated for many years as potential biomarkers for lung cancer. However, these biomarkers perform unsatisfactorily in early diagnosis, with low sensitivity and a high false positive rate. Circulating free DNA (cfDNA) released by tumor tissue carries genetic and epigenetic patterns similar to those of its tissue of origin.
Compared with the rare somatic mutations detectable in blood, epigenetic dysregulation is an early event in tumorigenesis and involves genome-wide changes in DNA methylation and extensive changes in chromatin structure. Methylation modifications often occur in specific genomic regions, such as CpG islands, which makes it possible to analyze the extensive changes arising during tumorigenesis by targeted sequencing. Recent studies have shown that epigenomics-based models outperform mutation-based models in early lung cancer detection. DNA methylation is closely related to the occurrence and progression of various tumors, and many studies have found that different diseases, and even different stages of the same disease, may have specific methylation patterns. The frequency of CpG island hypermethylation in tumor cells is much higher than that of gene mutation. Thus, the risk of developing lung cancer can be predicted by detecting the methylation level of particular genes or of the entire genome.
Furthermore, studies using whole genome sequencing (WGS) have found that cfDNA is fragmented non-randomly during tumorigenesis, and that fragmentomic characteristics such as fragmentation patterns and end motifs vary greatly across disease processes. Changes in fragmentation patterns reflect changes in chromatin structure prior to cfDNA release, and end motifs reflect changes in chromatin accessibility and nuclease activity. To date, several cfDNA fragment features have been used for lung cancer screening, including fragment size coverage, fragment size distribution, end motifs, breakpoint motifs and copy number variation. Enzymatic digestion experiments indicate that lower DNA methylation levels predict higher nucleosome accessibility and allow nucleases to cleave within the nucleosome, generating shortened DNA fragments; this suggests that DNA methylation may be an important regulator of cfDNA fragmentation.
Combining these and other promising liquid biopsy markers with current screening programs may greatly improve the early screening and diagnosis of lung cancer. To our knowledge, although methylation and fragment features have each been used separately for early detection of lung cancer, few studies have integrated these two types of epigenetic features, which may achieve very high performance given their complementary contributions. The present application is proposed in view of this.
Disclosure of Invention
In order to solve one of the above technical defects, the embodiments of the application provide a method for constructing a noninvasive lung cancer early-screening model based on LP-WGS and DNA methylation, and an electronic device, which enable high-accuracy prediction of early lung cancer.
According to a first aspect of the embodiments of the present application, there is provided a method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation, comprising the following steps:
S10, collecting peripheral blood from lung cancer patients and healthy individuals, extracting cfDNA from the peripheral blood, and establishing a sample set;
S20, performing low-depth whole genome sequencing and methylation-targeted sequencing on the cfDNA, and constructing sequencing libraries;
S30, performing feature extraction on the low-depth whole genome sequencing data and the methylation-targeted sequencing data to obtain fragment features of the whole genome sequencing data, methylation features of the methylation-targeted sequencing data and fragmentomic features of the methylation data;
S40, constructing a lung cancer early-screening prediction model based on multi-feature cross stacking from the fragment features of the whole genome sequencing data and the methylation features of the methylation-targeted sequencing data;
S50, training and validating the lung cancer early-screening prediction model on the sample set to obtain a final lung cancer early-screening prediction model and a final prediction result.
Preferably, constructing the lung cancer early-screening prediction model based on multi-feature cross stacking in S40 comprises:
S401, establishing single-feature prediction models, comprising: building classification models on each of the features of the sequencing data of the samples using machine learning models;
S402, establishing a single-feature ensemble model, comprising: concatenating the scores of the plurality of machine learning models for each feature into a new feature vector, and then establishing a single-feature ensemble model based on logistic regression;
S403, establishing a multi-feature joint ensemble model, comprising: concatenating the plurality of feature scores of each sample into a new feature vector, and then establishing a multi-feature joint ensemble model based on logistic regression.
Preferably, the machine learning models comprise at least one of a gradient boosting model, an XGBoost model, a random forest model, a logistic regression model and a multi-layer perceptron.
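The two logistic-regression layers of S402 and S403 can be illustrated with a minimal sketch. It assumes the base-learner scores of S401 are already available as plain lists; the function names (`fit_logistic`, `stack`), the toy scores and the gradient-descent settings are hypothetical and are not taken from the application.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, X):
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def stack(base_scores, y):
    """base_scores maps a feature name to one row of base-learner scores per sample."""
    names = sorted(base_scores)
    # S402: one logistic-regression ensemble per feature, over its base-learner scores.
    feature_models = {f: fit_logistic(base_scores[f], y) for f in names}
    feature_scores = {f: predict_proba(feature_models[f], base_scores[f]) for f in names}
    # S403: concatenate the per-feature ensemble scores and fit the joint model.
    joint_X = [[feature_scores[f][i] for f in names] for i in range(len(y))]
    joint_model = fit_logistic(joint_X, y)
    return feature_models, joint_model, predict_proba(joint_model, joint_X)

# Toy data (invented): 3 healthy (0) and 3 cancer (1) samples, 2 base learners per feature.
labels = [0, 0, 0, 1, 1, 1]
base_scores = {
    "fragment_size_ratio": [[0.1, 0.2], [0.2, 0.3], [0.3, 0.1],
                            [0.8, 0.7], [0.9, 0.6], [0.7, 0.9]],
    "methylation": [[0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
                    [0.9, 0.8], [0.6, 0.7], [0.8, 0.9]],
}
feature_models, joint_model, probabilities = stack(base_scores, labels)
```

For brevity the sketch scores the same samples it was trained on. In practice each layer would normally be fed out-of-fold predictions so that the upper layers do not see leaked training labels; the application does not state how this is handled.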
Preferably, performing feature extraction in S30 on the low-depth whole genome sequencing data and the methylation-targeted sequencing data in the sequencing library, to obtain fragment features of the whole genome sequencing data, methylation features of the methylation-targeted sequencing data and fragmentomic features of the methylation data, comprises:
S301, preprocessing the low-depth whole genome sequencing data and the methylation-targeted sequencing data separately;
S302, performing feature extraction on the preprocessed low-depth whole genome sequencing data to obtain the fragment features of the whole genome sequencing data;
S303, performing feature extraction on the preprocessed methylation-targeted sequencing data to obtain the methylation features of the methylation-targeted sequencing data and the fragmentomic features of the methylation data.
Preferably, the preprocessed low-depth whole genome sequencing data is sequencing fragment information comprising, for each fragment: the chromosome number in the hg19 human reference genome, the fragment start position, the fragment end position, the fragment length, the GC content and the corrected weight value;
the fragment features of the whole genome sequencing data include: genome-wide copy number variation, cfDNA long and short fragment ratio, cfDNA fragment size distribution, cfDNA nucleosome pattern, and cfDNA 4 bp end-motif ratio.
Preferably, in S301, the methylation features comprise: the hypermethylated abnormal fragment ratio, the methylated fragment length ratio and the methylated fragment 4 bp end-motif ratio.
Preferably, preprocessing the low-depth whole genome sequencing data in the sequencing library in S301 comprises:
S301-11, performing quality control and adapter removal on the raw data off the sequencer using fastp;
S301-12, aligning the fastq data obtained in S301-11 to the hg19 human reference genome using BWA-MEM to obtain an aligned bam file;
S301-13, using public database information, removing from the bam file obtained in S301-12 the reads located in hg19 blacklist regions, gap regions, patch sequences, highly variable regions and centromere and telomere regions;
S301-14, merging each read pair into a single cfDNA fragment using the aligned positions of the paired-end sequencing data, and removing merged fragments longer than 400 bases;
S301-15, performing GC correction.
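The fragment reconstruction and length filter of S301-14 can be sketched as follows. The sketch assumes each properly paired read pair has already been reduced to its aligned coordinates (0-based, half-open); the function names and the toy coordinates are hypothetical, and reading a real bam file is left out.

```python
def merge_pair(chrom, read1_start, read1_end, read2_start, read2_end):
    """Collapse one read pair into a single fragment spanning both mates."""
    start = min(read1_start, read2_start)
    end = max(read1_end, read2_end)
    return {"chrom": chrom, "start": start, "end": end, "length": end - start}

def build_fragments(pairs, max_length=400):
    """Merge every pair and drop fragments longer than max_length bases."""
    fragments = [merge_pair(*pair) for pair in pairs]
    return [f for f in fragments if f["length"] <= max_length]

# Toy read pairs (invented): the first spans 167 bp, the second 450 bp.
pairs = [
    ("chr1", 1000, 1150, 1017, 1167),
    ("chr2", 5000, 5150, 5300, 5450),
]
fragments = build_fragments(pairs)
```

Only the 167 bp fragment survives the 400-base filter in this toy input.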
Preferably, preprocessing the methylation-targeted sequencing data in S301 comprises:
S301-21, performing quality control and sequencing adapter removal on the methylation-targeted sequencing data using fastp to obtain fastq data;
S301-22, aligning the fastq data to the hg19 human reference genome to obtain an aligned bam file;
S301-23, removing PCR duplicates from the bam file and building an index for the de-duplicated bam file;
S301-24, stitching together the CpG sites in the paired reads in the bam file to obtain the methylation-state haplotype of each fragment.
Preferably, performing feature extraction in S303 on the preprocessed methylation-targeted sequencing data, to obtain the methylation features of the methylation-targeted sequencing data and the fragmentomic features of the methylation data, comprises:
S303-1, determining tumor-specific methylation intervals by comparing differences in hypermethylated fragments; for a sample to be tested, within the tumor-specific methylation intervals, counting as abnormal fragments those fragments in which methylated C bases account for more than half of the total CpG sites or in which the total number of methylated C sites is greater than or equal to 5; and dividing the number of abnormal fragments by the total number of fragments covering the interval to obtain the hypermethylated abnormal fragment ratio;
S303-2, obtaining the cfDNA fragments in all panel intervals, extracting the 4 bp base sequences upstream and downstream of the start position and the end position of each fragment, and counting the fragment proportions of the 256 possible 4 bp end sequences to obtain the 4 bp end-motif ratio of the methylated fragments;
S303-3, counting the cfDNA fragments in each tumor-specific methylation interval and, based on the fragment length information, obtaining the number of short fragments of 100-150 bp and the number of long fragments of 151-210 bp, thereby obtaining the length ratio of the methylated fragments in the interval.
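The abnormal-fragment rule of S303-1 can be illustrated with a short sketch. It assumes each fragment's methylation haplotype is encoded as a string with one character per CpG site ("1" methylated, "0" unmethylated); this encoding, the function names and the toy haplotypes are hypothetical.

```python
def is_abnormal(cpg_states, min_methylated=5):
    """True if methylated CpGs exceed half of all CpGs on the fragment,
    or if at least min_methylated CpGs are methylated."""
    total = len(cpg_states)
    methylated = cpg_states.count("1")
    return total > 0 and (methylated > total / 2 or methylated >= min_methylated)

def abnormal_fragment_ratio(haplotypes):
    """Abnormal fragments divided by all fragments covering the interval."""
    if not haplotypes:
        return 0.0
    return sum(is_abnormal(h) for h in haplotypes) / len(haplotypes)

# Toy interval (invented) covered by four fragments.
interval_haplotypes = ["1110", "0000", "0100", "1111100000000"]
ratio = abnormal_fragment_ratio(interval_haplotypes)
```

Here "1110" qualifies through the majority rule and "1111100000000" through the count rule (5 methylated sites out of 13), so two of the four fragments are abnormal.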
According to a second aspect of embodiments of the present application, there is provided an electronic device, including:
a memory; a processor; a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method as described above.
The method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation and the electronic device provided by the application have the following beneficial effects:
a sample set is established from the fragment features of the whole genome sequencing data and the methylation features of the methylation-targeted sequencing data, and the lung cancer early-screening prediction model based on multi-feature cross stacking is trained and validated on this sample set to obtain a final lung cancer early-screening prediction model and a prediction result; the resulting prediction model is noninvasive and has high prediction accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic flow chart of the method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation provided in the embodiments of the present application;
FIG. 2 is a schematic flow chart of preprocessing and feature extraction of low-depth whole genome sequencing data and methylation-targeted sequencing data in a sequencing library according to an embodiment of the present application;
FIG. 3 shows the fragment coverage for the CTCF transcription factor in an embodiment of the present application;
FIG. 4 is a graph showing the copy number variation characteristics of each sample to be tested in the embodiment of the present application;
FIG. 5 is a schematic view of clinical sample characteristics in an embodiment of the present application;
FIG. 6 is a schematic diagram of predictive performance evaluation of a test set and a validation set in one embodiment.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of exemplary embodiments of the present application is given with reference to the accompanying drawings, and it is apparent that the described embodiments are only some of the embodiments of the present application and not exhaustive of all the embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
As shown in FIGS. 1 to 4, an embodiment of the present application provides a method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation, comprising the following steps:
S10, collecting peripheral blood from lung cancer patients and healthy individuals, extracting cfDNA from the peripheral blood, and establishing a sample set;
S20, performing low-depth whole genome sequencing and methylation-targeted sequencing on the cfDNA, and constructing sequencing libraries;
S30, performing feature extraction on the low-depth whole genome sequencing data and the methylation-targeted sequencing data to obtain fragment features of the whole genome sequencing data, methylation features of the methylation-targeted sequencing data and fragmentomic features of the methylation data;
S40, constructing a lung cancer early-screening prediction model based on multi-feature cross stacking from the fragment features of the whole genome sequencing data and the methylation features of the methylation-targeted sequencing data;
S50, training and validating the lung cancer early-screening prediction model on the sample set to obtain a final lung cancer early-screening prediction model and a final prediction result.
In this embodiment, collecting peripheral blood from lung cancer patients and healthy individuals and extracting cfDNA from the peripheral blood in S10 may include: separating plasma by two-step centrifugation, extracting cfDNA from the plasma using an Apostle automatic nucleic acid extractor, and measuring the DNA concentration using a Qubit 4.0 fluorometer;
wherein the volume of peripheral blood collected may be 20 mL per person.
In this embodiment, performing low-depth whole genome sequencing and methylation-targeted sequencing on the cfDNA and constructing sequencing libraries in S20 may include:
S201, for low-depth whole genome sequencing, the sequencing library may be prepared with the KAPA HyperPlus Kit (Roche) following the standard protocol in the manufacturer's instructions, with a cfDNA input of 10 ng; the library is subjected to paired-end sequencing on an Illumina NovaSeq 6000 sequencer with a read length of 150 bp and a sequencing depth of about 3.3x.
S202, for methylation-targeted sequencing, a Next enzymatic-conversion methylation library preparation kit may be used, and the pre-capture library is prepared following the standard protocol in the instructions, with a cfDNA input of 10-50 ng; library capture is performed according to the standard protocol using the TWIST custom probe Lucas22 and its supporting reagents. After capture and library construction, the library is subjected to paired-end sequencing on an Illumina NovaSeq 6000 sequencer with a read length of 150 bp and a sequencing depth of about 1000x.
In this embodiment, a sample set is established from the fragment features of the whole genome sequencing data and the methylation features of the methylation-targeted sequencing data, and the lung cancer early-screening prediction model based on multi-feature cross stacking is trained and validated on this sample set to obtain a final lung cancer early-screening prediction model and a prediction result. The resulting prediction model is noninvasive and has high prediction accuracy; compared with traditional clinical detection methods, it offers better detection sensitivity and specificity, allows real-time dynamic monitoring, and has clinical application value.
In this embodiment, performing feature extraction in S30 on the low-depth whole genome sequencing data and the methylation-targeted sequencing data, to obtain fragment features of the whole genome sequencing data, methylation features of the methylation-targeted sequencing data and fragmentomic features of the methylation data, comprises:
S301, preprocessing the low-depth whole genome sequencing data and the methylation-targeted sequencing data separately;
S302, performing feature extraction on the preprocessed low-depth whole genome sequencing data to obtain the fragment features of the whole genome sequencing data;
S303, performing feature extraction on the preprocessed methylation-targeted sequencing data to obtain the methylation features of the methylation-targeted sequencing data and the fragmentomic features of the methylation data.
Specifically, in S301, the methylation features comprise: the hypermethylated abnormal fragment ratio, the methylated fragment length ratio and the methylated fragment 4 bp end-motif ratio.
Further, preprocessing the low-depth whole genome sequencing data in S301 comprises:
S301-11, performing quality control and adapter removal on the raw data off the sequencer using fastp;
S301-12, aligning the fastq data obtained in S301-11 to the hg19 human reference genome using BWA-MEM to obtain an aligned bam file, marking duplicate reads in the bam file using samblaster, and removing duplicate and low-quality reads from the bam file using samtools.
Low-quality data can affect the results of subsequent analysis, so removing low-quality reads improves the accuracy of the data analysis; in this embodiment, low-quality reads are sequencing reads with an alignment quality value below 30, a threshold commonly used in the industry;
S301-13, using public database information, removing from the bam file obtained in S301-12 the reads located in hg19 blacklist regions, gap regions, patch sequences, highly variable regions and centromere and telomere regions;
S301-14, merging each read pair into a single cfDNA fragment using the aligned positions of the paired-end sequencing data, and removing merged fragments longer than 400 bases;
because differences in fragment GC content cause coverage bias during PCR amplification in the experiment, GC correction must be applied to the fragment data obtained in S301-14;
S301-15, GC correction, comprising:
S301-151, baseline construction: selecting a plurality of healthy individuals (29 in a specific embodiment); for each healthy sample, screening fragments of 50-400 bp from the file obtained through S301-11 to S301-14 and calculating the GC content of each fragment (the percentage of G and C bases in the fragment); stratifying the GC content at 1% intervals, and counting, for each chromosome (excluding the X and Y chromosomes), the number of fragments in each GC-content stratum;
S301-152, from the fragment counts per GC-content stratum per chromosome obtained for the 29 healthy samples in the baseline construction, calculating the median of the 29 distributions, which serves as the GC bias correction reference distribution;
S301-153, for an actual test sample, obtaining a file through S301-11 to S301-14 that contains, for each fragment, the chromosome number in the hg19 human reference genome, the fragment start position, the fragment end position, the fragment length and the GC content;
counting, as in S301-151, the number of fragments (N_i) in each GC-content stratum of each test sample, and calculating the weight value for each GC-content stratum (G_i = N_n / N_i) from the GC bias correction reference distribution (N_n) of the healthy samples obtained in S301-152; a corresponding weight value can thus be assigned to fragments of different GC content, and each fragment is weighted by this value according to its GC content, which completes the GC correction.
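The weight calculation G_i = N_n / N_i described above can be sketched as follows. The sketch assumes fragment counts have already been tabulated per (chromosome, GC percentage) stratum; the function names and the toy counts (three healthy samples instead of 29) are hypothetical, and raw counts are used directly because the text does not say whether counts are first normalized for sequencing depth.

```python
from collections import Counter
from statistics import median

def reference_distribution(healthy_counts):
    """N_n: median fragment count per (chromosome, GC %) stratum across healthy samples."""
    strata = set().union(*healthy_counts)
    return {s: median(c.get(s, 0) for c in healthy_counts) for s in strata}

def gc_weights(sample_counts, reference):
    """G_i = N_n / N_i for every stratum observed in the test sample."""
    return {s: reference.get(s, 0) / n for s, n in sample_counts.items() if n > 0}

# Toy baseline (invented): three healthy samples, one chromosome, two GC strata.
healthy = [
    Counter({("chr1", 40): 100, ("chr1", 50): 80}),
    Counter({("chr1", 40): 120, ("chr1", 50): 100}),
    Counter({("chr1", 40): 110, ("chr1", 50): 90}),
]
reference = reference_distribution(healthy)

# Toy test sample (invented): under-represented at 40% GC, over-represented at 50% GC.
sample = Counter({("chr1", 40): 55, ("chr1", 50): 180})
weights = gc_weights(sample, reference)
```

Each fragment then carries the weight of its own stratum, so under-represented GC strata are up-weighted and over-represented strata are down-weighted in every downstream count.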
The preprocessed low-depth whole genome sequencing data is sequencing fragment information comprising, for each fragment: the chromosome number in the hg19 human reference genome, the fragment start position, the fragment end position, the fragment length, the GC content and the corrected weight value;
the fragment features of the whole genome sequencing data include: genome-wide copy number variation, cfDNA long and short fragment ratio, cfDNA fragment size distribution, cfDNA nucleosome pattern, and cfDNA 4 bp end-motif ratio.
Further, performing feature extraction in S302 on the preprocessed low-depth whole genome sequencing data to obtain the fragment features of the whole genome sequencing data comprises:
S302-1, feature extraction of the cfDNA long and short fragment ratio (FSR), comprising:
splitting the hg19 reference genome into 5 Mb windows, calculating the GC content of each window, and filtering out windows with a GC content below 0.3 and a sequence alignment rate below 0.8;
after the above step, a plurality of retained windows is obtained, each represented by a chromosome number, a start position and an end position;
using the position information of the fragments of the test sample and the position information of the retained windows, the fragments in each 5 Mb window are counted and, based on their length information, divided into short fragments of 100-150 bp and long fragments of 151-210 bp;
the long-short fragment ratio R is then calculated for each window [equation not reproduced in the source text],
where N_long denotes all long cfDNA fragments of 151-210 bp, N_short denotes all short cfDNA fragments of 100-150 bp, and the weight term is the weight value corresponding to the GC content of fragment i on its chromosome.
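Because the expression for R is not legible in this text, the following is only a sketch under a stated assumption: R is taken to be the GC-weighted count of short fragments divided by the GC-weighted count of long fragments within one window. The direction of the ratio, the function name and the toy fragments are all assumptions.

```python
def fragment_size_ratio(fragments, window_start, window_end):
    """GC-weighted short/long ratio for one window (assumed direction: short over long).

    fragments: iterable of (start, length, gc_weight) on the window's chromosome.
    """
    short_sum = 0.0
    long_sum = 0.0
    for start, length, weight in fragments:
        if not (window_start <= start < window_end):
            continue  # fragment belongs to another window
        if 100 <= length <= 150:
            short_sum += weight
        elif 151 <= length <= 210:
            long_sum += weight
    return short_sum / long_sum if long_sum > 0 else float("nan")

# Toy fragments (invented); the last one lies outside the first 5 Mb window.
fragments = [
    (100, 120, 1.0),
    (200, 140, 0.5),
    (300, 167, 1.0),
    (400, 180, 2.0),
    (6_000_000, 130, 1.0),
]
ratio = fragment_size_ratio(fragments, 0, 5_000_000)
```

With these weights the short fragments sum to 1.5 and the long fragments to 3.0, giving a ratio of 0.5 for the window.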
S302-2, feature extraction of the cfDNA fragment size distribution (FSD); comprising:
screening, in each detection sample, the fragments with a length of 50-400 bp, and dividing them into 70 length intervals at a step of 5 bp (for example, 50-55 bp, 56-60 bp, …, 396-400 bp);
stratifying the fragments in each autosomal long-arm and short-arm interval by the 70 length intervals, and calculating the FSD value of each chromosome arm interval from the GC-weighted fragments, finally obtaining a plurality of features;
wherein FSD_i represents the FSD score of the i-th chromosome arm interval; N_i represents the total number of fragments covered by the i-th chromosome arm interval; j represents the j-th fragment covered in the i-th interval;
and each fragment j contributes the weight value corresponding to its GC content.
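The length binning of S302-2 can be illustrated with a short sketch. Because the FSD formula is not reproduced in this text, the definition used here (the GC-weighted proportion of fragments in each of the 70 length intervals of one chromosome arm) is only one plausible reading and is an assumption; the helper names are hypothetical.

```python
def length_bin(length):
    """Map a fragment length (50-400 bp) to one of 70 bins: 50-55, 56-60, ..., 396-400."""
    if not 50 <= length <= 400:
        return None
    return 0 if length <= 55 else (length - 56) // 5 + 1

def fsd_profile(fragments):
    """fragments: (length_bp, gc_weight) pairs for one chromosome arm.
    Returns 70 GC-weighted proportions (assumed definition of the FSD value)."""
    totals = [0.0] * 70
    for length, weight in fragments:
        b = length_bin(length)
        if b is not None:  # fragments outside 50-400 bp are ignored
            totals[b] += weight
    grand = sum(totals)
    return [t / grand for t in totals] if grand else totals

# Hypothetical arm with three in-range fragments and one that is too long.
profile = fsd_profile([(50, 1.0), (58, 1.0), (400, 2.0), (500, 5.0)])
```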
S302-3, feature extraction of the cfDNA 4 bp end-motif proportion; comprising:
for each fragment of each sample, extracting the 4 bp base sequences upstream and downstream of the start position and the end position respectively, and counting, after GC correction, the proportion of fragments carrying each of the 256 possible 4-mer end sequences; these proportions serve as the corrected end-motif features.
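A minimal sketch of the end-motif counting in S302-3 is given below. It assumes that the fragment sequence is already available as a string, that the motif is read as the first and last four bases of the fragment without any strand handling, and that GC correction means adding the fragment's weight instead of 1; these choices and all names are illustrative assumptions.

```python
from itertools import product

MOTIFS = ["".join(p) for p in product("ACGT", repeat=4)]  # the 256 possible 4-mers

def end_motif_proportions(fragments):
    """fragments: (sequence, gc_weight) pairs. Returns {motif: weighted proportion}."""
    totals = dict.fromkeys(MOTIFS, 0.0)
    for seq, weight in fragments:
        seq = seq.upper()
        for motif in (seq[:4], seq[-4:]):  # assumed: four bases at each fragment end
            if motif in totals:  # skips ends that contain N or are too short
                totals[motif] += weight
    grand = sum(totals.values())
    return {m: (t / grand if grand else 0.0) for m, t in totals.items()}

# Hypothetical fragments, both starting with ACGT.
proportions = end_motif_proportions([("ACGTTTTTCCCC", 1.0), ("ACGTAAAAGGGG", 1.0)])
```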
S302-4, feature extraction of cfDNA Nucleosome Patterns (NP); comprising the following steps:
Firstly, transcription-factor-related files are downloaded from the GTRD database and the CIS-BS database, and only the transcription factors present in both databases are retained; transcription factors with fewer than 10000 binding sites on the autosomes are removed, yielding a plurality of transcription factors for subsequent analysis.
Secondly, for each transcription factor, a window is centered on each of its binding sites on the genome; for every position at a given distance from the center, all fragments covering that relative position are collected and their GC weight values are summed as the coverage of that relative position; the coverage within the window is then smoothed to reduce noise, finally giving a fragment coverage distribution waveform for each transcription factor.
Thirdly, three types of features are extracted from the coverage pattern curve to construct a feature matrix:
1) the coverage of the central region of the transcription factor binding site, wherein the central region is the range from -30 bp to +30 bp around the binding site;
2) the average coverage of the region from -1 kb to +1 kb around the transcription factor binding site;
3) the amplitude value calculated using a fast Fourier transform.
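The three features listed above can be sketched for a single coverage profile as follows. The sketch evaluates one discrete Fourier transform bin directly instead of calling an FFT routine (an FFT would return the same value for that bin); taking the central coverage as a mean, and the choice of the frequency index, are assumptions not stated in the text.

```python
import cmath
import math

def coverage_features(profile, half_center=30, freq_index=1):
    """profile: coverage at relative positions -H..+H around the binding-site center
    (odd length 2*H + 1; the text uses H = 1000 bp)."""
    n = len(profile)
    mid = n // 2
    central = profile[mid - half_center: mid + half_center + 1]
    central_cov = sum(central) / len(central)  # feature 1: -30 bp to +30 bp (mean, assumed)
    mean_cov = sum(profile) / n                # feature 2: whole window
    # Feature 3: amplitude of one DFT bin, computed directly.
    coeff = sum(x * cmath.exp(-2j * math.pi * freq_index * t / n)
                for t, x in enumerate(profile))
    return central_cov, mean_cov, abs(coeff)

# Hypothetical profile: flat coverage plus a cosine with 3 cycles per window.
toy = [1 + math.cos(2 * math.pi * 3 * t / 201) for t in range(201)]
central_cov, mean_cov, amplitude = coverage_features(toy, freq_index=3)
```

A pronounced periodic nucleosome signal shows up as a large amplitude at the corresponding frequency, whereas a flat profile gives an amplitude close to zero.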
S302-5, feature extraction of genome-wide copy number variation, including:
dividing the autosomes of the reference genome into non-overlapping windows with a length of 5 Mb (589 non-overlapping windows in total in a specific embodiment), calculating the average sequencing depth of each window, normalizing by the global mean, and correcting the sequencing depth according to the GC content of each window;
constructing a baseline: selecting a plurality of healthy individuals (20 in a specific embodiment), performing the analysis of the previous step on each of them, calculating the median depth of each window across the healthy samples, and filtering windows according to their degree of dispersion in the baseline samples;
for each detection sample, obtaining the GC-corrected depth value of each window through the above analysis; then calculating, for each window of each sample to be tested, the copy number variation log value as log2 (sequencing depth of the sample to be tested divided by the average sequencing depth of the baseline samples), thereby obtaining the copy number variation features of each sample to be tested (485 copy number variation features in a specific embodiment).
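The final log-ratio step of S302-5 can be sketched as below, assuming the per-window depths have already been normalized, GC-corrected and filtered; the function name and the handling of empty windows are illustrative assumptions.

```python
import math

def copy_number_log2(sample_depth, baseline_depth):
    """Both inputs: GC-corrected, normalized depth per retained window, in the same order."""
    return [
        math.log2(s / b) if s > 0 and b > 0 else None  # undefined for empty windows
        for s, b in zip(sample_depth, baseline_depth)
    ]

# Hypothetical depths for four windows against a flat baseline.
log_ratios = copy_number_log2([2.0, 1.0, 0.5, 0.0], [1.0, 1.0, 1.0, 1.0])
```

A value near 0 indicates a depth matching the baseline, positive values suggest a gain and negative values a loss in that window.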
In this embodiment, in S301, the methylation-targeted sequencing data is preprocessed; comprising:
S301-21, performing quality control and sequencing adapter removal on the methylation-targeted sequencing data using fastp software to obtain fastq data;
S301-22, aligning the fastq data to the hg19 human reference genome to obtain an aligned bam file;
S301-23, removing PCR duplicates from the bam file, and building an index for the de-duplicated bam file;
S301-24, stitching the paired-end reads in the bam file and connecting their CpG sites to obtain the methylation status haplotype of each fragment.
Wherein: stitching refers to merging the two reads of a sequenced pair into one fragment; the two reads may overlap, not overlap, or disagree within the overlap. This embodiment handles these cases as follows: in a region covered by neither read, CpG sites are replaced by U; methylated sites are marked as 1 and unmethylated sites as 0; where Read1 and Read2 cover the same site and the bases are identical, the base is retained with the higher of the two quality values; if the bases differ and both base quality values are not less than 30, the site is marked as N; if only one of the quality values is greater than or equal to 30, that base and its quality value are retained; if both quality values are less than 30, the site is marked as N; finally, the haplotype methylation string is generated.
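The rules for a position covered by both reads can be sketched as a small decision function. The function name, the argument layout and the quality value of 0 returned for an N call are assumptions made for illustration.

```python
def merge_overlap(base1, qual1, base2, qual2, min_qual=30):
    """Resolve one position covered by both Read1 and Read2; returns (base, quality)."""
    if base1 == base2:
        return base1, max(qual1, qual2)  # agreement: keep the higher quality value
    ok1, ok2 = qual1 >= min_qual, qual2 >= min_qual
    if ok1 and not ok2:
        return base1, qual1  # only Read1 is reliable
    if ok2 and not ok1:
        return base2, qual2  # only Read2 is reliable
    # Both reliable but conflicting, or both unreliable: mark as N.
    return "N", 0  # quality 0 for N is an assumed placeholder

# Hypothetical conflicting call where only Read1 passes the quality cutoff.
resolved = merge_overlap("C", 35, "T", 20)
```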
In this embodiment, in step S303, feature extraction is performed on the preprocessed methylation-targeted sequencing data to obtain the methylation features of the methylation-targeted sequencing data and the fragment group features of the methylation data, comprising:
S303-1, determining tumor-specific methylation intervals by comparing differences in hypermethylated fragments; for a sample to be tested, within each tumor-specific methylation interval, counting as abnormal fragments those fragments in which methylated C bases exceed half of the total CpG sites or in which the total number of methylated C sites is greater than or equal to 5; and dividing the number of abnormal fragments by the total number of fragments covering the interval to obtain the hypermethylated abnormal fragment proportion;
S303-2, acquiring the cfDNA fragments in all panel intervals, extracting the 4 bp base sequences upstream and downstream of the start position and the end position respectively, and counting the proportion of fragments carrying each of the 256 possible 4 bp end sequences to obtain the 4 bp end-motif proportion of the methylated fragments;
S303-3, counting the cfDNA fragments in each tumor-specific methylation interval, and, based on the length information of the cfDNA fragments, obtaining the number of short fragments with a length of 100-150 bp and the number of long fragments with a length of 151-210 bp respectively, to obtain the long-short fragment ratio of the methylated fragment group in the interval.
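The abnormal-fragment rule of S303-1 can be sketched on haplotype strings of the form produced by the stitching step ('1' methylated, '0' unmethylated). Treating any other character, such as 'U' or 'N', as uninformative when counting total CpG sites is an assumption, as are the function names.

```python
def is_abnormal(haplotype, min_methylated=5):
    """haplotype: per-fragment CpG string; '1' methylated, '0' unmethylated,
    other characters (e.g. 'U', 'N') are treated as uninformative (assumption)."""
    methylated = haplotype.count("1")
    informative = methylated + haplotype.count("0")
    more_than_half = informative > 0 and methylated * 2 > informative
    return methylated >= min_methylated or more_than_half

def abnormal_fraction(haplotypes):
    """Share of fragments in one tumor-specific interval flagged as abnormal."""
    if not haplotypes:
        return 0.0
    return sum(is_abnormal(h) for h in haplotypes) / len(haplotypes)

# Hypothetical interval covered by four fragments.
fraction = abnormal_fraction(["110", "1000", "0U0", "111"])
```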
In this embodiment, combining multi-omics data and multiple features makes full use of the tumor signals of every dimension in cfDNA fragments, giving higher sensitivity and specificity than traditional clinical detection means and single-omics models.
Example two
As shown in Fig. 2, on the basis of the first embodiment, in this embodiment S40, constructing a lung cancer early-screening prediction model based on multi-feature cross stacking, comprises:
S401, establishing single-feature prediction models;
comprising: establishing, with machine learning models, a classification model for each of the plurality of sequencing-data features of the samples;
S402, establishing an integrated model for each single feature;
comprising: concatenating the scores of the plurality of machine learning models for each feature into a new feature vector, and then establishing an integrated model for that single feature based on logistic regression;
S403, establishing a multi-feature joint integrated model;
comprising: concatenating the plurality of feature scores of each sample into a new feature vector, and then establishing the multi-feature joint integrated model based on logistic regression.
Specifically, the machine learning model includes at least one of: a light gradient boosting machine (LGBM) model, an XGBoost model, a random forest (RF) model, a logistic regression (LR) model, and a multi-layer perceptron classifier (MLPC).
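The two logistic-regression stacking levels of S402 and S403 can be sketched in plain Python as follows. The base-model scores of S401 are taken as given toy numbers, the logistic regression is a simple gradient-descent fit written for this illustration, and scores are fitted and evaluated on the same samples for brevity (in practice out-of-fold scores would normally be used); all names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def build_stack(base_scores, labels):
    """base_scores: {feature_name: [[score of each base model] for each sample]}."""
    names = sorted(base_scores)
    # S402: one logistic-regression ensemble per feature type.
    feature_models = {n: fit_logistic(base_scores[n], labels) for n in names}
    # S403: concatenate the per-feature scores of each sample, then fit the joint model.
    joint_rows = [
        [predict(feature_models[n], base_scores[n][i]) for n in names]
        for i in range(len(labels))
    ]
    joint_model = fit_logistic(joint_rows, labels)
    return feature_models, joint_model, joint_rows

# Hypothetical scores from two base models for two feature types, four samples.
toy_scores = {
    "FSR": [[0.1, 0.2], [0.2, 0.3], [0.8, 0.7], [0.9, 0.9]],
    "CNV": [[0.3, 0.1], [0.2, 0.2], [0.7, 0.9], [0.6, 0.8]],
}
toy_labels = [0, 0, 1, 1]  # 0 = healthy, 1 = lung cancer
feature_models, joint_model, joint_rows = build_stack(toy_scores, toy_labels)
final_scores = [predict(joint_model, row) for row in joint_rows]
```

Each stacking level reduces a vector of scores to a single probability, so the joint model sees one value per feature type regardless of how many base models were trained for it.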
In this embodiment, step S50, training and verifying the lung cancer early-screening prediction model with the sample set to obtain the final lung cancer early-screening prediction model and prediction result, may include:
training the single-feature prediction models on the divided training set with a 5-fold cross-validation strategy to obtain trained single-feature prediction models, wherein each type of feature in the trained single-feature prediction models corresponds to its own optimal machine learning model.
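The 5-fold cross-validation split mentioned above can be sketched as follows. The split is a plain shuffled partition; whether the application stratifies the folds by class is not stated, so no stratification is assumed here, and the function name and seed are illustrative.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k nearly equal parts
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

# Hypothetical training set of 23 samples.
splits = list(k_fold_indices(23))
```

For each feature type, every candidate machine learning model would be trained on the four training folds and scored on the held-out fold, and the model with the best average validation performance retained.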
In the embodiment, according to the fragment characteristics of six full genome sequencing data in lung cancer and healthy human cfDNA and the methylation characteristics of methylation targeting sequencing data, a lung cancer early screening prediction model based on multi-characteristic cross stacking is constructed, and the accuracy, compliance and accessibility of lung cancer screening are improved.
The embodiment of the application also provides electronic equipment, which comprises:
a memory; a processor; a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method as described above.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored; the computer program is executed by a processor to implement the method as described above.
In the embodiments of the present application, the method, the electronic device, and the computer readable storage medium are based on the same inventive concept, and since the principles of solving the problems by the method, the electronic device, and the computer readable storage medium are similar, implementation of the method, the electronic device, and the computer readable storage medium may be referred to each other, and repeated descriptions are omitted.
In one particular embodiment, as shown in fig. 5 and 6, the performance of the model is verified.
The study of this example included 326 healthy, 326 tumor samples with stage I lung cancer at 58.5% and stage I lung cancer at 6.6% and a sample list as shown in fig. 6.
Sensitivity and specificity in the validation set were calculated using the model established in Example 1, with the classification threshold set at a specificity of 0.95 in the test set. In the validation set, the LP-WGS single-omics detection achieved an AUC of 0.9439 (95% CI: 0.9148-0.973), a sensitivity of 0.687 (95% CI: 0.6076-0.7664) and a specificity of 0.9847 (95% CI: 0.9637-1); the methylation detection achieved an AUC of 0.948 (95% CI: 0.9199-0.976), a sensitivity of 0.7481 (95% CI: 0.6738-0.8224) and a specificity of 0.9313 (95% CI: 0.888-0.9746); after combination, the model achieved an AUC of 0.9617 (95% CI: 0.9376-0.9857), a sensitivity of 0.7634 (95% CI: 0.6906-0.8362) and a specificity of 0.9542 (95% CI: 0.9184-0.99), an improvement in performance over the single-omics models; see the results in Fig. 6 for details.
Therefore, the model of the application has good early lung cancer prediction capability.
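The way a classification threshold can be fixed at a target specificity and then used to compute sensitivity is sketched below. The scores are hypothetical toy values, not data from the application, and the tie handling (a sample is called positive only if its score is strictly above the threshold) is an assumption.

```python
def threshold_at_specificity(healthy_scores, specificity=0.95):
    """Lowest cutoff such that at most (1 - specificity) of healthy samples score above it."""
    ranked = sorted(healthy_scores, reverse=True)
    allowed_fp = int(len(ranked) * (1 - specificity) + 1e-9)  # tolerated false positives
    return ranked[min(allowed_fp, len(ranked) - 1)]

def sensitivity(tumor_scores, threshold):
    """Fraction of tumor samples scoring strictly above the threshold."""
    return sum(s > threshold for s in tumor_scores) / len(tumor_scores)

# Hypothetical model scores: 20 healthy samples and 4 tumor samples.
healthy = [i / 20 for i in range(20)]
cutoff = threshold_at_specificity(healthy, 0.95)
detected = sensitivity([0.5, 0.92, 0.99, 0.3], cutoff)
```

Fixing the threshold on one data set and reporting sensitivity and specificity on another, as described above, keeps the reported metrics from being tuned on the evaluation samples.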
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The schemes in the embodiments of the present application may be implemented in various computer languages, for example, the C language, the VHDL language, the Verilog language, the object-oriented programming language Java, and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. The lung cancer early-screening model construction method based on LP-WGS and DNA methylation is characterized by comprising the following steps:
s10, collecting peripheral blood of a lung cancer patient and a healthy person, extracting cfDNA in the peripheral blood, and establishing a sample set;
s20, based on cfDNA, performing low-depth whole genome and methylation-targeted sequencing to construct a sequencing library;
s30, extracting features of low-depth whole genome sequencing data and methylation targeting sequencing data to obtain fragment features of the whole genome sequencing data, methylation features of the methylation targeting sequencing data and fragment group features of the methylation data;
s40, constructing a lung cancer early screening prediction model based on multi-feature cross stacking according to fragment features of whole genome sequencing data and methylation features of methylation targeting sequencing data;
s50, training and verifying the lung cancer early-screening prediction model through a sample set to obtain a final lung cancer early-screening prediction model and a final prediction result.
2. The method for constructing an early-screening model of lung cancer based on LP-WGS and DNA methylation according to claim 1, wherein S40, a multi-feature cross-stacked early-screening model of lung cancer is constructed; comprising the following steps:
s401, establishing a single-feature prediction model;
comprising the following steps: establishing a classification model for the characteristics of the plurality of sequencing data of the sample by using the machine learning model;
s402, establishing an integrated model of a single feature;
comprising the following steps: splicing a plurality of machine learning scores of each feature to form a new feature vector, and then establishing an integrated model of a single feature based on logistic regression;
s403, establishing a multi-feature joint integration model;
comprising the following steps: and splicing a plurality of feature scores of each sample to form a new feature vector, and then establishing a multi-feature combined integration model based on logistic regression.
3. The method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation according to claim 2, wherein the machine learning model comprises at least one of: a light gradient boosting machine model, an XGBoost model, a random forest model, a logistic regression model, and a multi-layer perceptron.
4. The method for constructing an early lung cancer screening model based on LP-WGS and DNA methylation according to claim 1, wherein the step S30 is performed with feature extraction on low-depth whole genome sequencing data and methylation-targeted sequencing data to obtain fragment features of the whole genome sequencing data, methylation features of the methylation-targeted sequencing data and fragment group features of the methylation data; comprising the following steps:
s301, preprocessing low-depth whole genome sequencing data and methylation targeting sequencing data respectively;
s302, extracting features of the low-depth whole genome sequencing data after pretreatment to obtain fragment features of the whole genome sequencing data;
and S303, extracting features of the preprocessed methylation targeted sequencing data to obtain methylation features of the methylation targeted sequencing data and fragment group features of the methylation data.
5. The method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation according to claim 4, wherein the preprocessed low-depth whole genome sequencing data is sequencing fragment information, comprising: the chromosome number, fragment start position, fragment end position, fragment length, GC content and corrected weight value of each fragment in the hg19 human reference genome;
the fragment features of the whole genome sequencing data include: genome-wide copy number variation, cfDNA long-short fragment ratio, cfDNA fragment size distribution, cfDNA nucleosome pattern, and cfDNA 4 bp end-motif proportion.
6. The method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation according to claim 5, wherein, in S301, the methylation features comprise: hypermethylated abnormal fragment proportion, methylated fragment long-short ratio and methylated fragment 4 bp end-motif proportion.
7. The method for constructing an early lung cancer screening model based on LP-WGS and DNA methylation according to claim 6, wherein in S301, low-depth whole genome sequencing data in a sequencing library is preprocessed; comprising the following steps:
S301-11, performing quality control and adapter sequence removal on the raw sequencing output data using fastp software;
S301-12, aligning the fastq data obtained in step S301-11 to the hg19 human reference genome using BWA-MEM to obtain an aligned bam file;
s301-13, removing reads of the hg19 genome blacklist region, the interval region, the patch sequence, the highly variable region and the centromere-telomere region in the bam file obtained in the step S301-12 by using public database information;
s301-14, combining the sequencing data into a single cfDNA fragment by using the compared position information of the double-end pairing sequencing data, and removing fragments with the length of more than 400 bases from the combined fragments;
s301-15, performing GC correction.
8. The method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation according to claim 6, wherein in S301, the methylation-targeted sequencing data is preprocessed; comprising:
S301-21, performing quality control and sequencing adapter removal on the methylation-targeted sequencing data using fastp software to obtain fastq data;
S301-22, aligning the fastq data to the hg19 human reference genome to obtain an aligned bam file;
S301-23, removing PCR duplicates from the bam file, and building an index for the de-duplicated bam file;
S301-24, stitching the paired-end reads in the bam file and connecting their CpG sites to obtain the methylation status haplotype of each fragment.
9. The method for constructing a lung cancer early-screening model based on LP-WGS and DNA methylation according to claim 6, wherein step S303, performing feature extraction on the preprocessed methylation-targeted sequencing data to obtain the methylation features of the methylation-targeted sequencing data and the fragment group features of the methylation data, comprises:
S303-1, determining tumor-specific methylation intervals by comparing differences in hypermethylated fragments; for a sample to be tested, within each tumor-specific methylation interval, counting as abnormal fragments those fragments in which methylated C bases exceed half of the total CpG sites or in which the total number of methylated C sites is greater than or equal to 5; and dividing the number of abnormal fragments by the total number of fragments covering the interval to obtain the hypermethylated abnormal fragment proportion;
S303-2, acquiring the cfDNA fragments in all panel intervals, extracting the 4 bp base sequences upstream and downstream of the start position and the end position respectively, and counting the proportion of fragments carrying each of the 256 possible 4 bp end sequences to obtain the 4 bp end-motif proportion of the methylated fragments;
S303-3, counting the cfDNA fragments in each tumor-specific methylation interval, and, based on the length information of the cfDNA fragments, obtaining the number of short fragments with a length of 100-150 bp and the number of long fragments with a length of 151-210 bp respectively, to obtain the long-short fragment ratio of the methylated fragment group in the interval.
10. An electronic device, comprising:
a memory; a processor; a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1 to 9.
CN202311311093.5A 2023-10-10 2023-10-10 Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment Pending CN117275585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311311093.5A CN117275585A (en) 2023-10-10 2023-10-10 Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311311093.5A CN117275585A (en) 2023-10-10 2023-10-10 Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment

Publications (1)

Publication Number Publication Date
CN117275585A true CN117275585A (en) 2023-12-22

Family

ID=89200717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311311093.5A Pending CN117275585A (en) 2023-10-10 2023-10-10 Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment

Country Status (1)

Country Link
CN (1) CN117275585A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117831623A (en) * 2024-03-04 2024-04-05 阿里巴巴(中国)有限公司 Object detection method, object detection model training method, transcription factor binding site detection method, and target object processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination