CN113870949B - Deep learning-based nanopore sequencing data base identification method - Google Patents

Deep learning-based nanopore sequencing data base identification method

Info

Publication number
CN113870949B
CN113870949B CN202111172443.5A CN202111172443A CN113870949B CN 113870949 B CN113870949 B CN 113870949B CN 202111172443 A CN202111172443 A CN 202111172443A CN 113870949 B CN113870949 B CN 113870949B
Authority
CN
China
Prior art keywords
base
sequence
data
base sequence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111172443.5A
Other languages
Chinese (zh)
Other versions
CN113870949A (en)
Inventor
汪国华
高文韬
邹权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Forestry University
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Northeast Forestry University
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Forestry University, Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111172443.5A
Publication of CN113870949A
Application granted
Publication of CN113870949B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biotechnology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A deep learning-based nanopore sequencing data base identification method relates to the field of bioinformatics and addresses the low accuracy of nanopore sequencing in the prior art. The method comprises the following steps. Step one: downloading 50 groups of nanopore raw data including pneumococcus, enterobacter and proteus as a training set. Step two: performing base recognition on the 50 groups of raw data to obtain a base sequence. Step three: acquiring an Illumina sequencing sequence with an accuracy of more than 99%, taking it as the reference genome, taking the reference genome as the ground truth, and correcting the base sequence using the Tombo algorithm. Step four: converting the corrected base sequence into corresponding electrical signal data using the Re-squiggle method, and then labeling the electrical signal data. Step five: training a neural network using the labeled electrical signal data and the raw data, and performing base recognition using the trained neural network. The method achieves high-accuracy recognition of the base sequence of nanopore sequencing data.

Description

Deep learning-based nanopore sequencing data base identification method
Technical Field
The invention relates to the field of bioinformatics, in particular to a deep learning-based nanopore sequencing data base identification method.
Background
Compared with second-generation sequencers and the third-generation sequencers from PacBio, the third-generation nanopore sequencer from Oxford Nanopore Technologies is portable, low-cost and produces long sequencing reads. However, the accuracy of nanopore sequencing is much lower than that of second-generation sequencing and PacBio's HiFi sequencing. The official base recognition tool reaches only about 90% accuracy and is not open source. The nanopore of the sequencer is essentially a nanoscale protein pore with voltage detection devices on both sides. In operation, primers are used to pull single-stranded DNA/RNA through the nanopore, and different types of nucleotides cause different current changes as they pass through. The sequencer records all changes in current, and the electrical signal is then translated into the corresponding base sequence. Because nanopore sequencing is single-molecule sequencing, noise signals and random errors have a large impact on the accuracy of base recognition. The output data of the nanopore sequencer come in two formats, fasta and fast5. The fasta file is the gene sequence obtained with the official base recognition tool (Guppy), with an accuracy of about 90%. The fast5 file contains the raw electrical signal acquired by the sequencer. Taking the official tool Guppy with R9.4 as an example, 5 bases pass through the nanopore at a time, so there are 4^5 = 1024 possible sequences. Base modifications complicate this further. A currently known base modification is 5mC; if 5mC is treated as a fifth class of base signal in addition to A, C, G and T, the 5 bases in a single pass through the nanopore have 5^5 = 3125 possible sequences.
Because both the nucleotides and the nanopore are nanoscale molecular structures, the official base recognition tool cannot reliably predict the true base sequence from the electrical signal. This is a major factor limiting the accuracy of nanopore sequencing. It is therefore necessary to construct a deep learning model that makes reliable predictions from the raw nanopore sequencing data.
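The sequence counts in the background follow from counting the length-5 strings over the base alphabet: 4^5 = 1024 for A, C, G, T, and 5^5 = 3125 when 5mC is added as a fifth symbol. A minimal sketch of that arithmetic, assuming a 5-base window (the function name `kmer_count` is illustrative and does not come from the patent):

```python
def kmer_count(alphabet_size: int, k: int = 5) -> int:
    """Number of distinct length-k strings over an alphabet of the given size."""
    return alphabet_size ** k


canonical = kmer_count(4)  # A, C, G, T
with_5mc = kmer_count(5)   # A, C, G, T plus 5mC treated as a fifth class
print(canonical, with_5mc)
```

Each additional modified base treated as its own class raises the alphabet size by one, so the number of signal states the model must separate grows as a fifth power.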
Disclosure of Invention
The purpose of the invention is: aiming at the problem of low accuracy of the nanopore sequencing in the prior art, a deep learning-based nanopore sequencing data base identification method is provided.
The technical scheme adopted by the invention to solve the technical problems is as follows:
The deep learning-based nanopore sequencing data base identification method comprises the following steps:
step one: downloading 50 groups of nanopore raw data including pneumococcus, enterobacter and proteus as a training set;
step two: performing base recognition on the 50 groups of raw data to obtain a base sequence;
step three: acquiring an Illumina sequencing sequence with an accuracy of more than 99%, taking it as the reference genome, taking the reference genome as the ground truth, and correcting the base sequence using the Tombo algorithm;
step four: converting the corrected base sequence into corresponding electrical signal data using the Re-squiggle method, and then labeling the electrical signal data;
step five: training a neural network using the labeled electrical signal data and the raw data, and performing base recognition using the trained neural network.
Further, the neural network comprises a first convolutional layer, a second convolutional layer, a BERT module, a fully connected layer and a CTC decoding module;
the first convolutional layer is used for down-sampling the labeled electrical signal data,
the second convolutional layer is used for feature extraction on the down-sampled electrical signal data,
a BN layer is arranged after each of the first and second convolutional layers and is used for preventing the mean and the variance from saturating,
the BERT module is trained on the extracted features and outputs the base sequence corresponding to the electrical signal data,
the fully connected layer processes the base sequences corresponding to the electrical signal data using a softmax function to obtain the probability of each base sequence corresponding to the raw electrical signal,
the CTC decoding module processes the probability of each base sequence corresponding to the raw electrical signal to obtain the final base sequence,
the convolution kernel in the first convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,
the convolution kernel in the second convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,
the BERT module comprises 12 Transformer layers, a 768-dimensional embedding hidden layer and a 12-head attention mechanism.
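The hyperparameters listed above can be collected in a small configuration sketch. The class and function names below are hypothetical, the zero-padding default is an assumption (the source does not state the padding), and the chunk length of 4000 samples is an arbitrary illustration. The source also does not say how the 128-channel convolution output is mapped to the 768-dimensional BERT hidden size, so that step is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasecallerConfig:
    conv_kernel: int = 3      # 1 x 3 kernel in both convolutional layers
    conv_stride: int = 2      # 1 x 2 stride in both convolutional layers
    conv_channels: int = 128  # output channels of each convolutional layer
    bert_layers: int = 12     # Transformer layers in the BERT module
    bert_hidden: int = 768    # embedding / hidden dimension
    bert_heads: int = 12      # attention heads

    @property
    def head_dim(self) -> int:
        # The hidden dimension is split evenly across the attention heads.
        return self.bert_hidden // self.bert_heads


def conv_out_len(length: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a 1-D convolution (padding=0 is an assumption)."""
    return (length + 2 * padding - kernel) // stride + 1


cfg = BasecallerConfig()
t = 4000  # hypothetical number of raw signal samples in one chunk
after_first = conv_out_len(t, cfg.conv_kernel, cfg.conv_stride)
after_second = conv_out_len(after_first, cfg.conv_kernel, cfg.conv_stride)
print(cfg.head_dim, after_first, after_second)
```

Under these assumptions the two stride-2 layers shorten the signal by roughly a factor of four before it reaches the BERT module, which matters because self-attention cost grows quadratically with sequence length.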
Further, the marked electrical signal data is characterized by:
Figure BDA0003293814830000021
where c represents the sequencing data, x_c represents the corresponding feature of the sequencing data, ω is the weight of the convolution kernel, the parameter k is set to 3, i and j are position indices in the sequence, T is the length of the sequence, and Σ denotes accumulation (summation).
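The formula image is not reproduced in this text. As an illustration only, the sketch below assumes the usual strided 1-D convolution, in which output position i is the weighted sum of k = 3 consecutive input samples starting at i times the stride. The single channel, the absence of padding and bias, and the function name are all assumptions rather than details taken from the patent.

```python
from typing import List, Sequence


def strided_conv1d(x: Sequence[float], w: Sequence[float], stride: int = 2) -> List[float]:
    """Single-channel 1-D convolution without padding or bias."""
    k = len(w)
    out = []
    # Slide the kernel over the signal, advancing `stride` samples each step.
    for start in range(0, len(x) - k + 1, stride):
        out.append(sum(w[j] * x[start + j] for j in range(k)))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # toy stand-in for a raw current trace
weights = [1.0, 0.0, -1.0]                     # hypothetical kernel with k = 3
features = strided_conv1d(signal, weights)
print(features)
```

A real layer would apply 128 such kernels in parallel, one per output channel.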
Further, the BN layer is represented as:
Figure BDA0003293814830000022
where α, β and ε are model parameters, x_bn is the sequence feature output by the convolutional layer, E is the function that computes the expectation, and Var is the variance function.
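The formula image is not reproduced in this text. A reconstruction consistent with the symbols defined above, assuming the standard batch normalization form in which α is the scale, β is the shift and ε is a small constant for numerical stability, would be:

```latex
y = \alpha \cdot \frac{x_{bn} - E[x_{bn}]}{\sqrt{\mathrm{Var}[x_{bn}] + \epsilon}} + \beta
```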
Further, the softmax function is expressed as:
Figure BDA0003293814830000031
where z_i is the output value of the i-th node, C is the number of classification categories, e is the base of the natural logarithm (a mathematical constant), and z_c is the output value of the c-th node.
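With the symbols defined above, the standard softmax is e^{z_i} divided by the sum of e^{z_c} over all C classes. A minimal sketch follows; subtracting the maximum before exponentiating is a common numerical-stability step that is not mentioned in the source, and the five-class example (four bases plus a blank) is an assumption for illustration.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Softmax over the output values of C nodes."""
    m = max(z)  # shifting by the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


# Hypothetical scores for the classes (A, C, G, T, blank) at one time step.
probs = softmax([2.0, 1.0, 0.1, 0.1, -1.0])
print([round(p, 3) for p in probs])
```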
Further, the CTC decoding module specifically performs the following steps:
for the predicted sequence output by the BERT layer, candidate base sequences are first generated iteratively using a beam search algorithm with a beam width of 3; the candidates are then scored, blank characters and redundant characters in the base sequence are removed, and the base sequence with the highest score is selected as the final prediction result,
the probability of blank characters existing in the base sequence is:
Figure BDA0003293814830000032
where x is the output sequence of the BERT layer, π represents the path corresponding to an intermediate result, β⁻¹(l) represents all paths satisfying the condition during the search, l is the output result, and P(l|x) represents the probability of blank characters in the sequence,
the CTC loss function is expressed using the blank-character probability of the base sequence; it is equivalent to minimizing the negative log-domain probability -ln(P(π|x)), and is expressed as:
Figure BDA0003293814830000033
where ln () represents the natural logarithm.
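The formula images are not reproduced in this text. In the standard CTC formulation, which the definitions above appear to follow, the probability of an output l is the sum of P(π|x) over every path π that maps to l after merging repeats and dropping blanks, and the loss is the negative natural logarithm of that sum. The brute-force sketch below illustrates this on a two-frame toy example; the blank symbol "-", the per-frame probabilities and the function names are assumptions for illustration.

```python
import itertools
import math
from typing import Dict, List

BLANK = "-"  # assumed blank symbol


def collapse(path: str) -> str:
    """CTC mapping: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for ch in path:
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)


def label_probability(frames: List[Dict[str, float]], label: str) -> float:
    """Sum P(path | x) over every path that collapses to `label` (brute force)."""
    symbols = list(frames[0].keys())
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frames)):
        if collapse("".join(path)) == label:
            total += math.prod(frame[s] for frame, s in zip(frames, path))
    return total


# Hypothetical per-frame softmax outputs for two time steps.
frames = [
    {"A": 0.6, "C": 0.1, BLANK: 0.3},
    {"A": 0.5, "C": 0.2, BLANK: 0.3},
]
p = label_probability(frames, "A")  # paths "AA", "A-" and "-A"
loss = -math.log(p)
print(round(p, 2), round(loss, 3))
```

Enumerating all paths is exponential in the number of frames, so practical implementations use the forward-backward recursion instead; the brute-force version is only meant to make the sum over paths explicit.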
Further, the base recognition of the 50 groups of raw data in step two is performed by the base recognition tool Guppy.
The invention has the beneficial effects that:
(1) the invention uses a deep neural network model with better performance, introduces the idea of solving the problem of natural language processing into the base recognition of the nanopore sequencing data, and has better performance compared with an official base recognition tool.
(2) The invention provides a good basis for genomics research, and the high-accuracy base identification is beneficial to the analysis of downstream genome data.
(3) The model of the invention has better generalization performance and is suitable for the base recognition of the nanopore sequencing data of various species including microorganisms, plants, animals and the like.
The method uses convolutional layers to down-sample the nanopore electrical signal data and extract features, uses a BERT module to predict the base sequence corresponding to the electrical signal, and uses the CTC algorithm to remove redundant data, thereby achieving high-accuracy recognition of the base sequence of nanopore sequencing data.
Drawings
FIG. 1 is a flow chart of a method for base recognition of nanopore sequencing data based on a deep neural network model according to an embodiment of the present application;
FIG. 2 is a diagram illustrating the effect of the deep neural network model of the present application;
FIG. 3 is a graphical representation of the comparison of the accuracy of the present application with official base recognition tools and Guppy-KP on the test set;
FIG. 4 is schematic diagram 1 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 5 is schematic diagram 2 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 6 is schematic diagram 3 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 7 is schematic diagram 4 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 8 is schematic diagram 5 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 9 is schematic diagram 6 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 10 is schematic diagram 7 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 11 is schematic diagram 8 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 12 is schematic diagram 9 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 13 is a graph showing the comparison of error rates on a test set of 9 species for the present application and the official base recognition tool.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
The first embodiment: the deep learning-based nanopore sequencing data base identification method according to this embodiment is described in detail with reference to FIG. 1.
As shown in FIG. 1, the method comprises the following steps S1-S8:
S1, downloading 50 groups of nanopore raw data, including pneumococcus (Klebsiella pneumoniae), Enterobacter (Enterobacteriaceae) and Proteobacteria (Proteobacteria), together with sequencing data of 9 further species, to form a data set.
The 50 groups of nanopore raw sequencing data are used as the training set of the model, and the gene sequences of the other 9 species are used as the test set.
S2, performing base recognition on the 50 groups of raw data using the official nanopore base recognition tool Guppy.
Guppy is used to convert the unlabeled nanopore data into base sequences, which are then used to find the corresponding next-generation-sequenced reference genomes.
S3, taking the Illumina sequencing sequence as the reference genome, correcting the Guppy-processed base sequence using the Tombo algorithm, and annotating the corrected sequence using the dynamic time warping algorithm.
S4, converting the true DNA sequence into the true electrical signal using the Re-squiggle method, and generating labeled data in (base sequence, electrical signal) format.
S5, constructing a neural network model based on a convolutional neural network and a BERT network, the model comprising two convolutional layers, a BERT module, a fully connected layer and a CTC decoding module. A convolution module performs feature extraction on the input sequence: two convolutional layers preprocess the input sequence data and extract features, as follows:
S51, the convolution kernel in the first convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels; it down-samples the data and reduces the computational complexity.
S52, the convolution kernel in the second convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels; it is used for feature extraction. The input signal vector x is calculated as follows:
Figure BDA0003293814830000051
S53, a Batch Normalization (BN) layer follows each convolution module; it prevents the mean and the variance from saturating and improves the generalization performance of the model. The calculation formula is as follows:
Figure BDA0003293814830000052
S6, inputting the extracted features into the BERT module and, after processing by the fully connected layer, outputting the probability of each base sequence corresponding to the raw nanopore electrical signal. The features extracted after down-sampling are input into the BERT module for training. The BERT module contains 12 Transformer layers, a 768-dimensional embedding hidden layer and a 12-head attention mechanism. It is followed by a fully connected layer, and the probability of a base at each position is calculated using the softmax function. Here i and j are position indices in the sequence, and x_{i+j} denotes the sequence feature extracted by convolution at each position, which serves as the input of the subsequent BERT layer. The calculation formula is as follows:
Figure BDA0003293814830000053
S7, removing repeated base sequences and blank sequences using the CTC decoding module, and finally outputting the high-accuracy nanopore base sequence. The high-order feature distribution distance between the raw nanopore electrical signal and the base sequence is calculated using the CTC loss function. The CTC decoder iteratively generates candidate base sequences using a beam search algorithm with a beam width of 3 and then scores the candidates. Blank characters in the sequences are removed in the process, and the base sequence with the highest score is selected as the final prediction result. In this way a high-accuracy base sequence is obtained from the raw nanopore signal data. Here e is the base of the natural logarithm (a mathematical constant) and z_c represents the output value of the c-th node. The probability of blank characters existing in the base sequence and the CTC loss function are given by:
Figure BDA0003293814830000061
$$L(S) = -\ln \prod_{(x,z) \in S} p(z \mid x) = -\sum_{(x,z) \in S} \ln p(z \mid x)$$
where x represents an input sequence, π represents a base path searched by the beam search, and z represents an output sequence.
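The decoding in S7 can be illustrated with a standard CTC prefix beam search using a beam width of 3. This is a generic textbook variant and not necessarily the decoder used in the patent. Scoring candidates by their summed path probability, the blank symbol "-", and the toy per-frame probabilities are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BLANK = "-"  # assumed blank symbol


def ctc_beam_search(frames: List[Dict[str, float]], beam_width: int = 3) -> Tuple[str, float]:
    """Return the best collapsed sequence and its (beam-approximated) probability."""
    # prefix -> (prob. of paths ending in blank, prob. of paths ending in non-blank)
    beams: Dict[str, Tuple[float, float]] = {"": (1.0, 0.0)}
    for frame in frames:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for sym, p in frame.items():
                if sym == BLANK:
                    # A blank keeps the prefix unchanged.
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and sym == prefix[-1]:
                    # Repeat with no blank in between: merged into the same prefix.
                    nxt[prefix][1] += p_nb * p
                    # Repeat after a blank: a new copy of the symbol is emitted.
                    nxt[prefix + sym][1] += p_b * p
                else:
                    nxt[prefix + sym][1] += (p_b + p_nb) * p
        # Keep only the `beam_width` highest-scoring prefixes.
        ranked = sorted(nxt.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {prefix: (pb, pnb) for prefix, (pb, pnb) in ranked[:beam_width]}
    best, (pb, pnb) = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return best, pb + pnb


# Hypothetical per-frame softmax outputs for two time steps.
frames = [
    {"A": 0.6, "C": 0.1, BLANK: 0.3},
    {"A": 0.5, "C": 0.2, BLANK: 0.3},
]
best_seq, best_prob = ctc_beam_search(frames, beam_width=3)
print(best_seq, round(best_prob, 2))
```

Tracking the blank-ending and non-blank-ending probabilities separately is what lets the search tell a merged repeat (such as "AA" collapsing to "A") from a genuine doubled base separated by a blank.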
S8, converting the original electric signal of the nanopore sequencing into a base sequence with higher accuracy than that of an official tool by adopting a trained prediction model.
The recognition effect of the present invention is further described below with a set of specific experimental examples.
First, to evaluate the performance of the base recognition tools, we performed a comparative analysis of 4 base recognition tools, including our model, on the same dataset. Here "our method" denotes the deep neural network model of the invention, Guppy and Albacore are the official base recognition tools from Oxford Nanopore, and Guppy-KP is a model retrained on the basis of the official tool.
Table one shows the error rates on the test set for 4 tools including our method.
Here deletion, insertion and mismatch denote the deletion errors, insertion errors and mismatch errors of the sequencing data, respectively. The base recognition accuracy is defined as follows:
Figure BDA0003293814830000062
M represents the number of matched bases, S the number of mismatched bases, I the number of inserted bases, and D the number of deleted bases. On the Klebsiella pneumoniae NUH29 dataset, the error rate of the method of the present invention was 11.06%, lower than that of the other base recognition tools. On the Klebsiella pneumoniae KSB2 dataset, the error rates of the method of the invention, Albacore and Guppy were 11.26%, 15.80% and 15.73%, respectively, so the method of the invention had a lower error rate than the official base recognition tools.
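The accuracy formula image is not reproduced in this text. The sketch below assumes the common definition accuracy = M / (M + S + I + D), with the error rate taken as one minus the accuracy; this form is an assumption based on the symbol definitions, and the counts in the example are hypothetical and not taken from the patent's results.

```python
def base_accuracy(m: int, s: int, i: int, d: int) -> float:
    """Assumed definition: matched bases over all aligned positions."""
    total = m + s + i + d
    if total == 0:
        raise ValueError("no aligned bases")
    return m / total


# Hypothetical alignment counts: matches, mismatches, insertions, deletions.
acc = base_accuracy(m=890, s=40, i=30, d=40)
print(f"accuracy={acc:.2%}, error rate={1 - acc:.2%}")
```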
Secondly, we also used genome assembly consistency as an index to evaluate model performance. FIG. 4 shows the consensus sequence identity of the 4 base recognition tools comprising the present invention. We used 6 indicators of polymer insertion errors, other insertion errors, polymer deletion errors, other deletion errors, substitution errors, and Dcm errors to evaluate model performance.
Performance evaluation on a test set shows that the base recognition error rate and the genome assembly consistency index of the invention are superior to those of base recognition tools provided by the official authorities.
It should be noted that the detailed description is only for explaining and explaining the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. It is intended that all such modifications and variations that fall within the spirit and scope of the invention be limited only by the claims and the description.

Claims (6)

1. A deep learning-based nanopore sequencing data base identification method, comprising the following steps:
step one: downloading 50 groups of nanopore raw data including pneumococcus, enterobacter and proteus as a training set;
step two: performing base recognition on the 50 groups of raw data to obtain a base sequence;
step three: acquiring an Illumina sequencing sequence with an accuracy of more than 99%, taking it as the reference genome, taking the reference genome as the ground truth, and correcting the base sequence using the Tombo algorithm;
step four: converting the corrected base sequence into corresponding electrical signal data using the Re-squiggle method, and then labeling the electrical signal data;
step five: training a neural network using the labeled electrical signal data and the raw data, and performing base recognition using the trained neural network;
the neural network comprises a first convolutional layer, a second convolutional layer, a BERT module, a fully connected layer and a CTC decoding module;
the first convolutional layer is used for down-sampling the labeled electrical signal data,
the second convolutional layer is used for feature extraction on the down-sampled electrical signal data,
a BN layer is arranged after each of the first and second convolutional layers and is used for preventing the mean and the variance from saturating,
the BERT module is trained on the extracted features and outputs the base sequence corresponding to the electrical signal data,
the fully connected layer processes the base sequences corresponding to the electrical signal data using a softmax function to obtain the probability of each base sequence corresponding to the raw electrical signal,
the CTC decoding module processes the probability of each base sequence corresponding to the raw electrical signal to obtain the final base sequence,
the convolution kernel in the first convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,
the convolution kernel in the second convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,
the BERT module comprises 12 Transformer layers, a 768-dimensional embedding hidden layer and a 12-head attention mechanism.
2. The deep learning-based nanopore sequencing data base identification method of claim 1, wherein the features of the labeled electrical signal data are represented as:
Figure FDA0003542724260000011
where c represents the sequencing data, x_c represents the corresponding feature of the sequencing data, ω is the weight of the convolution kernel, the parameter k is set to 3, i and j are position indices in the sequence, T is the length of the sequence, and Σ denotes accumulation (summation).
3. The deep learning-based nanopore sequencing data base identification method of claim 2, wherein the BN layer is represented as:
Figure FDA0003542724260000021
where α, β and ε are model parameters, x_bn is the sequence feature output by the convolutional layer, E is the function that computes the expectation, and Var is the variance function.
4. The deep learning-based nanopore sequencing data base identification method of claim 3, wherein the softmax function is expressed as:
Figure FDA0003542724260000022
where z_i is the output value of the i-th node, C is the number of classification categories, e is the base of the natural logarithm (a mathematical constant), and z_c is the output value of the c-th node.
5. The deep learning-based nanopore sequencing data base identification method of claim 4, wherein the CTC decoding module specifically performs the following steps:
for the predicted sequence output by the BERT layer, candidate base sequences are first generated iteratively using a beam search algorithm with a beam width of 3; the candidates are then scored, blank characters and redundant characters in the base sequence are removed, and the base sequence with the highest score is selected as the final prediction result,
the probability of blank characters existing in the base sequence is:
Figure FDA0003542724260000023
where x is the output sequence of the BERT layer, π represents the path corresponding to an intermediate result, β⁻¹(l) represents all paths satisfying the condition during the search, l is the output result, and P(l|x) represents the probability of blank characters in the sequence,
the CTC loss function is expressed using the blank-character probability of the base sequence; it is equivalent to minimizing the negative log-domain probability -ln(P(π|x)), and is expressed as:
Figure FDA0003542724260000024
where ln () represents the natural logarithm.
6. The deep learning-based nanopore sequencing data base identification method of claim 1, wherein the base recognition of the 50 groups of raw data in step two is performed by the base recognition tool Guppy.
CN202111172443.5A 2021-10-08 2021-10-08 Deep learning-based nanopore sequencing data base identification method Active CN113870949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111172443.5A CN113870949B (en) 2021-10-08 2021-10-08 Deep learning-based nanopore sequencing data base identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111172443.5A CN113870949B (en) 2021-10-08 2021-10-08 Deep learning-based nanopore sequencing data base identification method

Publications (2)

Publication Number Publication Date
CN113870949A (en) 2021-12-31
CN113870949B (en) 2022-05-17

Family

ID=79002054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111172443.5A Active CN113870949B (en) 2021-10-08 2021-10-08 Deep learning-based nanopore sequencing data base identification method

Country Status (1)

Country Link
CN (1) CN113870949B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115029422A (en) * 2022-08-11 2022-09-09 南京普济生物有限公司 System and method for detecting chromosome microdeletion based on droplet type digital PCR
CN116486910B (en) * 2022-10-17 2023-12-22 北京普译生物科技有限公司 Deep learning training set establishment method for nanopore sequencing base recognition and application thereof

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11842794B2 (en) * 2019-03-19 2023-12-12 The University Of Hong Kong Variant calling in single molecule sequencing using a convolutional neural network
CN111243674B (en) * 2020-01-08 2023-07-04 华南理工大学 Base sequence identification method, device and storage medium
US11587551B2 (en) * 2020-04-07 2023-02-21 International Business Machines Corporation Leveraging unpaired text data for training end-to-end spoken language understanding systems
CN112183486B (en) * 2020-11-02 2023-08-01 中山大学 Method for rapidly identifying single-molecule nanopore sequencing base based on deep network
CN113361522B (en) * 2021-06-23 2022-05-17 北京百度网讯科技有限公司 Method and device for determining character sequence and electronic equipment

Also Published As

Publication number Publication date
CN113870949A (en) 2021-12-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant