CN111881324B - High-throughput sequencing data general storage format structure, construction method and application thereof - Google Patents


Info

Publication number
CN111881324B
Authority
CN
China
Prior art keywords
sequence
format
formats
component
quality score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010748559.8A
Other languages
Chinese (zh)
Other versions
CN111881324A (en)
Inventor
郁春江
沈百荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Industrial Park Institute of Services Outsourcing
Original Assignee
Suzhou Industrial Park Institute of Services Outsourcing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Industrial Park Institute of Services Outsourcing
Priority to CN202010748559.8A
Publication of CN111881324A
Application granted
Publication of CN111881324B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion
    • G06F16/86 Mapping to a database
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a general storage format structure of high-throughput sequencing data, a construction method and application thereof. With the invention, different types of high-throughput sequencing data can be stored in one format, which overcomes the harm that the diversity of data formats does to data interoperability. In addition, the general format is structured, so filtering and extracting data is easier and faster than with unstructured text data.

Description

High-throughput sequencing data general storage format structure, construction method and application thereof
Technical Field
The invention belongs to the technical field of biological information processing, and relates to a general storage format structure of high-throughput sequencing data, a construction method and application thereof.
Background
With the rapid development of high-throughput sequencing technologies, differences in sequencing instruments or vendors, in sequencing principles, and in development background or goals (such as readability, integration and space savings) have produced an increasing variety of sequencing data. Many software tools have been designed to analyze these data, but most of them define their own data storage formats (S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, "A survey of tools for variant analysis of next-generation genome sequencing data," Brief Bioinform, vol. 15, no. 2, pp. 256-78, Mar. 2014). For example, BAM/FASTQ/QSEQ, BAM/HDF5/FASTQ and BAM/SFF/FASTQ are the file formats handled by Illumina, PacBio and Ion Torrent sequencers, respectively. As a result, a wide variety of data formats exists.
Data interoperability is a key element of big data analysis, and many format conversion tools have been developed whose main function is to convert high-throughput sequencing data from one format to another (H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078-9, Aug. 15, 2009; M. R. Breese and Y. Liu, "NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets," Bioinformatics, vol. 29, no. 4, pp. 494-6, Feb. 15, 2013).
However, most of these tools are developed for a specific, limited set of formats, and format conversion both loses information and requires substantial computational resources. When no ready-made tool exists to convert a given format into the desired one, users who are not professional programmers must either wait for someone else to develop a tool or write a program themselves, and neither is easy.
Disclosure of Invention
In view of the above technical problems, the invention aims to provide a general storage format structure of high-throughput sequencing data, a construction method and application thereof. Different types of high-throughput sequencing data can be stored in one format, which overcomes the harm that the diversity of data formats does to data interoperability. In addition, the general format is structured, so filtering and extracting data is easier and faster than with unstructured text data.
To achieve this technical purpose, the invention adopts the following technical solutions:
The invention provides a general storage format structure of high-throughput sequencing data, which comprises four components: a header component, a sequence component, a quality score component and a sequence information component, wherein:
the header component is used for storing header description information of the file;
the sequence component is used for storing sequence information, wherein the sequence information is a base sequence or a file path for storing the base sequence;
the quality score component is used for storing the quality score of the sequence, and the quality score of the sequence is a quality score character string or a file path for storing the quality score;
the sequence information component is used for storing records and features of the sequence.
Preferably, the high-throughput sequencing data universal storage format structure is designed based on XML and XML Schema technology.
Preferably, the header component contains a sub-element meta_info, which has a name attribute and a value attribute; the sequence component comprises one or more seq sub-elements, each representing a sequence and each having a unique identifier that is referenced from the sequence information component; the quality score component comprises one or more quality sub-elements, each representing a sequence quality score and each having a unique identifier that is referenced from the sequence information component; the sequence information component contains one or more seqinfo sub-elements, each representing one sequence record.
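As a rough illustration of the header component, the following Python sketch uses the standard `xml.etree.ElementTree` module to turn `##name=value` header lines (the style used by VCF) into meta_info elements carrying name and value attributes. The element name `header_lines` is the one given for the header component in the detailed description; the helper name `header_from_lines` and the rule of splitting at the first `=` are assumptions made for this illustration and are not taken from the invention's specification.

```python
import xml.etree.ElementTree as ET


def header_from_lines(lines):
    """Build a header_lines element from '##name=value' style header lines."""
    header = ET.Element("header_lines")
    for line in lines:
        text = line.lstrip("#").strip()
        if not text:
            continue
        # Split at the first '=' only, so a value may itself contain '='.
        name, _, value = text.partition("=")
        ET.SubElement(header, "meta_info", {"name": name, "value": value})
    return header


header = header_from_lines(["##fileformat=VCFv4.2", "##fileDate=20090805"])
print(ET.tostring(header, encoding="unicode"))
```

Keeping every header entry as a name/value pair means the same component can hold header information from formats whose header syntax differs.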
The invention also provides an editing tool based on the general storage format structure of high-throughput sequencing data, which is used to create and edit files in the general storage format and to convert between NGS files and NGSGF files.
Preferably, the editing tool is written in Java through NetBeans IDE 10.0.
Preferably, the editing tool executes corresponding operations through GUI and command line calls.
Preferably, the formats for which the editing tool supports conversion include FASTA, FASTQ, SAM, VCF and CAF.
The invention also provides a construction method of the high-throughput sequencing data general storage format structure, which comprises the following steps:
1) Existing high-throughput data formats are collected and classified into five types: sequence and quality score formats, alignment formats, assembly formats, mutation formats, and annotation and visualization formats;
2) The specification of each format is analyzed to identify the content that the formats have in common and the content specific to each;
3) The general storage format structure is designed based on that common and format-specific content.
Preferably, the sequence and quality score formats include the FASTA/CSFASTA, FASTQ/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF formats; the alignment formats include the SAM, BAM, bowtie and maq formats; the assembly formats include the ACE, AFG and CAF formats; the mutation formats include the GVF, pileup and VCF formats; the annotation and visualization formats include the BED, bigBED, wig, bigWig, bedGraph and GFF/GTF formats.
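The five-way classification above can be written down as a simple lookup table. The Python sketch below is only an illustration: the category labels and format names come from the text, while the name `category_of`, the case-insensitive matching, and the choice to list slash-separated pairs such as GFF/GTF as separate entries are assumptions.

```python
# Format names grouped by the five types named in the text.
FORMAT_CATEGORIES = {
    "sequence and quality score": [
        "FASTA", "CSFASTA", "FASTQ", "CSFASTQ", "qseq",
        "SCARF", "QUAL", "2bit", "nib", "SFF",
    ],
    "alignment": ["SAM", "BAM", "bowtie", "maq"],
    "assembly": ["ACE", "AFG", "CAF"],
    "mutation": ["GVF", "pileup", "VCF"],
    "annotation and visualization": [
        "BED", "bigBED", "wig", "bigWig", "bedGraph", "GFF", "GTF",
    ],
}

# Reverse index: lower-cased format name -> category.
_LOOKUP = {
    name.lower(): category
    for category, names in FORMAT_CATEGORIES.items()
    for name in names
}


def category_of(format_name):
    """Return the category of a format name, or None if it is not listed."""
    return _LOOKUP.get(format_name.lower())


print(category_of("SAM"))
print(category_of("bigWig"))
```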
The invention also provides application of the general storage format structure of the high-throughput sequencing data in representing, storing, editing and converting the high-throughput sequencing data.
The invention designs a format structure based on XML and XML Schema technology that can store many of the different types of high-throughput sequencing data currently in use. The format structure prescribes how data is stored, while the specific stored content varies with the sequence information, so the format can store data held in existing formats and can also accommodate newly emerging data formats.
The beneficial effects of the invention are as follows:
Firstly, the general storage format structure of the invention uses a component structure to divide the sequence and its description information into four parts, so the format structure is clear, self-describing and convenient to extend in the future.
Secondly, the general storage format structure of the invention introduces the idea of references, a technique widely used in computer science, into a biological data format. In this format, different sequence information records can refer, by way of links, to the same sequence or quality score when the content is the same or similar, which avoids storing duplicate content.
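The reference idea can be sketched in a few lines of Python. This is a minimal illustration that handles identical content only (the text also mentions similar content, which is not covered here); the function name `assign_shared_ids` is hypothetical, and the identifier style `s1`, `s2` follows the values used in Example 2.

```python
def assign_shared_ids(values, prefix):
    """Give each distinct value one identifier and return per-record references.

    Returns (id_by_value, refs): id_by_value maps each distinct value to its
    identifier, and refs lists, for every input record, the identifier it uses.
    """
    id_by_value = {}
    refs = []
    for value in values:
        if value not in id_by_value:
            # First time this content is seen: allocate the next identifier.
            id_by_value[value] = f"{prefix}{len(id_by_value) + 1}"
        refs.append(id_by_value[value])
    return id_by_value, refs


reads = ["ACGT", "TTGA", "ACGT"]
stored, refs = assign_shared_ids(reads, "s")
print(stored)  # two stored sequences for three records
print(refs)
```

Three records end up referring to two stored sequences, so the repeated content is written once.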
Thirdly, the general storage format structure of the invention draws on the strengths of the currently popular NGS data formats and can store most biological sequence information. In addition, it inherits the flexibility and extensibility of XML. Because NGS technology is developing rapidly, new concepts and analysis tools keep emerging, and older data formats struggle to meet current requirements. The extensibility of the general storage format structure overcomes the limitations of format-specific designs, and its flexibility allows it to adapt to future needs.
Finally, the general storage format structure of the invention is highly readable: it is easy for computer programs to process and easy for humans to read, and the stored content is easier to understand. This advantage can be attributed to the tree structure of XML.
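To illustrate how the tree structure makes filtering straightforward, the Python sketch below selects sequences of at least 20 bases with a path query from the standard `xml.etree.ElementTree` module. The element names `ngs`, `list_of_seqs` and `seq` and the names `nid` and `origin` appear in the description and Example 2; storing `nid` and `origin` as XML attributes, and the sample document itself, are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# A small hand-written document; the exact layout is an assumption.
DOC = """
<ngs>
  <list_of_seqs>
    <seq nid="s1" origin="CCCTTCTTGTCTTCAGCGTTTCTCC"/>
    <seq nid="s2" origin="TTGGCAGG"/>
  </list_of_seqs>
</ngs>
""".strip()

root = ET.fromstring(DOC)

# Select every seq element under list_of_seqs, then filter by sequence length.
long_ids = [
    seq.get("nid")
    for seq in root.findall("./list_of_seqs/seq")
    if len(seq.get("origin", "")) >= 20
]
print(long_ids)
```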
Drawings
Fig. 1 shows 26 high-throughput data storage formats commonly used in the prior art.
Fig. 2 shows the general technical architecture of the present invention.
Fig. 3 shows the overall format structure of the NGSGF of the present invention.
Fig. 4 shows the format structure of the NGSGF header component of the present invention.
Fig. 5 shows the format structure of the NGSGF sequence component of the present invention.
Fig. 6 shows the format structure of the NGSGF quality score component of the present invention.
Fig. 7 shows the format structure of the NGSGF sequence information component of the present invention.
Fig. 8 shows screenshots of the NGSGFEditor user interface in Example 2 of the present invention.
Fig. 9 shows a screenshot of the two projects of NGSGFEditor in Example 2 of the present invention.
Fig. 10 shows a screenshot of the interface for "Step 1: Create a new NGSGF file" in Example 2 of the present invention.
Fig. 11 shows screenshots of the interface for "Step 2: Add a sequence" in Example 2 of the present invention.
Fig. 12 shows screenshots of the interface for "Step 3: Add a quality score" in Example 2 of the present invention.
Fig. 13 shows screenshots of the interface for "Step 4: Add sequence information" in Example 2 of the present invention.
Fig. 14 shows a screenshot of the interface for "Step 5: Save the NGSGF file" in Example 2 of the present invention.
Fig. 15 shows screenshots of converting between FASTQ and NGSGF format files through the NGSGFEditor GUI in Example 3 of the present invention.
Fig. 16 shows a screenshot of the help displayed after entering "java -jar NGSGFEditor.jar -h" in Example 3 of the present invention.
Fig. 17 shows screenshots of converting between SAM and NGSGF format files using the NGSGFEditor command line in Example 3 of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the embodiments and the accompanying drawings.
To solve the compatibility problem of current NGS data, we have developed a new XML-based general storage format, hereinafter abbreviated as NGSGF, which can accommodate most NGS data types. NGSGF is based on Extensible Markup Language (XML), which is widely used for data storage on the Internet and in fields such as mathematics and biology. NGSGF describes data produced by NGS technology and integrates the different types of information used in NGS, such as alignment, assembly and annotation information. Because XML is highly extensible, NGSGF can easily be extended with new features.
The invention first surveys the data storage formats currently used in the field of high-throughput sequencing data. A total of 26 commonly used high-throughput data formats were collected and divided into five types: sequence and quality score formats (Sequence or quality score), alignment formats (Alignment), assembly formats (Assembly), mutation formats (Variant), and sequence annotation and visualization formats (Sequence annotation & visualization). As shown in Fig. 1, the sequence and quality score formats may include the FASTA/CSFASTA, FASTQ/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF formats; the alignment formats may include the SAM, BAM, bowtie and maq formats; the assembly formats may include the ACE, AFG and CAF formats; the mutation formats may include the GVF, pileup and VCF formats; the annotation and visualization formats may include the BED, bigBED, wig, bigWig, bedGraph and GFF/GTF formats.
The specification of each format is then analyzed, mainly to determine what content the format stores and how that content is organized. Once the specification of each format is understood, the content common to the formats and the content specific to each is identified, and the general storage format is designed on that basis. As shown in Fig. 3, the NGSGF format proposed by the present invention includes four components: a header component (header_lines), a sequence component (list_of_seqs), a quality score component (list_of_quals) and a sequence information component (list_of_seqinfos).
The header component stores header description information; most existing NGS file formats contain header information that describes the stored content. As shown in Fig. 4, header_lines contains the child element meta_info, which has a name attribute and a value attribute and stores the header description information of the NGS data.
The sequence component stores sequence information, which is either a base sequence or the path of a file that stores the base sequence. Storing a file path allows NGSGF to handle large sequence files. As shown in Fig. 5, the list_of_seqs component contains one or more seq sub-elements, each representing a sequence. Each seq sub-element has a unique identifier that is used in the list_of_seqinfos component.
The quality score component stores sequence quality scores, each of which is either a quality score string or the path of a file that stores the quality scores. Storing a file path allows NGSGF to handle large quality score files. As shown in Fig. 6, the list_of_quals component contains one or more quality sub-elements, each representing the quality score of a sequence. Each quality sub-element has a unique identifier that is used in the list_of_seqinfos component.
The sequence information component stores sequence records and features. As shown in Fig. 7, the list_of_seqinfos component contains one or more seqinfo sub-elements, each representing one sequence record. NGSGF sequence records are typically stored in this component. The general storage format can store the content held in the 26 formats above.
Based on the designed structure, the present invention also provides corresponding editing and conversion software (shown as NGSGF Editor and NGSGF Format Converter in the figure), which can edit high-throughput data files in the format structure and can also convert between existing text-based high-throughput data formats and the XML-based general format, as shown in Fig. 2.
Example 1
The NGSGF format designed by the invention can store data in the FASTA, FASTQ, SAM, CAF and VCF formats commonly used for high-throughput sequencing.
1. Storing sequence format FASTA data in NGSGF format
FASTA data:
>KM081703.1 Abbottina rivularis mitochondrion,complete genome
GCTAGTGTAGCTTAATCCAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAAGAAGCTCCGCATGCAC
>AF511507.1 Alligator sinensis mitochondrion,complete genome
CAATAAAGACTTAGTCCCGGTCTTCTTATTAACTACCACTTAACCTATACATGCAAGCATCCACGAACCA
Corresponding NGSGF data:
2. Storing sequence and quality score format FASTQ data in NGSGF format
FASTQ data:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
Corresponding NGSGF data:
3. Storing sequence alignment format SAM data in NGSGF format
SAM data:
Corresponding NGSGF data:
4. Storing sequence assembly format CAF data in NGSGF format
CAF data:
DNA:22ak93c2.r1t
GTCGCnCATAAGATTACGAGATCTCGAGCTCGGTACCCTTCAAGCGATTCTCCTGCCTCA
BaseQuality:22ak93c2.r1t
4 4 8 4 4 4 4 4 4 4 4 4 6 8 17 21 14 7 6 6 6 7 7 6 8 14 16 21 15 20 20
24 26 21 18 18 14 14 19 23 10 8 8 15 20 16 29 26 34 29 39 29 31 29 31

Sequence:22ak93c2.r1t
Is_read
Padded
Staden_id 11
Clipping QUAL 39 331
Align_to_SCF 1 43 1 43
Align_to_SCF 44 317 45 318
Align_to_SCF 319 716 319 716
SCF_File 22ak93c2.r1tSCF
Primer Universal_primer
Strand Reverse
Dye Dye_terminator
Template 22ak93c2
Clone bK216E10
Sequencing_vector "m13mp18"
Seq_vec SVEC 1 38 "M13mp18"
Tag ALUS 43 180
Tag DONE 43 43 "AUTO-EDIT:deleted C at 43(terminator,isolated,strong)"
Tag DONE 254 254 "AUTO-EDIT:replaced T by g at 254(terminator,isolated,strong)"
Tag ALUS 269 402
Tag DONE 283 283 "AUTO-EDIT:replaced T by g at 283(terminated,isolated,strong)"
Tag AMBG 298 302 "AUTOEDIT:Check this edit cluster!"
Tag DONE 317 317 "AUTO-EDIT:replaced C by a at 317(terminated,compound,strong)"
Tag DONE 318 318 "AUTO-EDIT:inserted g at 318(terminated,compound,strong)"
Corresponding NGSGF data:
5. Storing sequence mutation format VCF data in NGSGF format
VCF data:
##fileformat=VCFv4.2
##fileDate=20090805
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
Corresponding NGSGF data:
Example 2 Creation of NGSGF files
NGSGFEditor is designed for creating and editing NGSGF files. It has a user-friendly GUI and can also be run from the command line, which makes working with NGSGF files considerably easier. Fig. 8 shows the user interface of NGSGFEditor while starting the software (A: start NGSGFEditor), converting a format (B: convert a SAM file), opening a file (C: open an NGSGF file) and editing a file (D: edit an NGSGF file).
The NGSGFEditor in this embodiment is written in Java with NetBeans IDE 10.0 and consists of two projects, as shown in Fig. 9.
Here we use NGSGFEditor to create an NGSGF file from FASTQ-format data. The contents of the FASTQ data are:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
Step 1: Create a new NGSGF file
Click the "New" button to create a new NGSGF file, as shown in Fig. 10.
Step 2: Add a sequence
(1) Right-click the "ngs" node to display the pop-up menu, then click the "list_of_seqs" menu item to create a new node, as shown in Fig. 11 (1).
(2) Right-click the "list_of_seqs" node and add a "seq" child node, as shown in Fig. 11 (2).
(3) Right-click the "seq" node and add the "nid" attribute, as shown in Fig. 11 (3).
(4) Right-click the "nid" node and select "Edit" from the menu to edit the node value, as shown in Fig. 11 (4).
(5) Add an "origin" node in the same way as the "nid" node. Right-click the "origin" node, select "Edit" from the menu and enter the sequence value, as shown in Fig. 11 (5).
Step 3: Add a quality score
(1) Right-click the "ngs" node and add the "list_of_quals" node, as shown in Fig. 12 (1).
(2) Add the "nid" and "origin" nodes. Right-click the "origin" node and enter the quality score, as shown in Fig. 12 (2).
Step 4: Add sequence information
(1) Right-click the "ngs" node and add the "list_of_seqinfos" node, as shown in Fig. 13 (1).
(2) Right-click the "list_of_seqinfos" node and add a "seqinfo" node, as shown in Fig. 13 (2).
(3) Right-click the "seqinfo" node and add a "seq" node, as shown in Fig. 13 (3).
(4) Right-click the "seq" node and add the "seqref" attribute, as shown in Fig. 13 (4).
(5) Right-click the "seqinfo" node and add a "qual" node, as shown in Fig. 13 (5).
(6) Add the "seqref" and "qualref" nodes under the "seq" and "qual" nodes, respectively. Right-click the "seqref" and "qualref" nodes and enter the reference values. In this example, the sequence of the first record is "s1" and its quality score is "q1", as shown in Fig. 13 (6).
The second record of the FASTQ file is added in the same way as the first.
Step 5: Save the NGSGF file
Finally, the sequences are stored in the "list_of_seqs" node, the quality scores in the "list_of_quals" node, and the FASTQ records in the "list_of_seqinfos" node, as shown in Fig. 14.
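The manual steps above can also be approximated programmatically. The Python sketch below builds the same kind of tree with the standard `xml.etree.ElementTree` module; it is an illustration under stated assumptions, not the NGSGFEditor implementation. The names `ngs`, `list_of_seqs`, `list_of_quals`, `list_of_seqinfos`, `seq`, `qual`, `seqinfo`, `nid`, `origin`, `seqref` and `qualref` come from the text. Storing them as XML attributes, using `qual` as the child of `list_of_quals`, the `name` attribute on `seqinfo`, and the function name `fastq_to_ngsgf` are all assumptions.

```python
import xml.etree.ElementTree as ET

FASTQ = """@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
"""


def fastq_to_ngsgf(fastq_text):
    """Build an NGSGF-like tree from FASTQ text (four lines per record)."""
    lines = [ln.strip() for ln in fastq_text.strip().splitlines()]
    root = ET.Element("ngs")
    seqs = ET.SubElement(root, "list_of_seqs")
    quals = ET.SubElement(root, "list_of_quals")
    infos = ET.SubElement(root, "list_of_seqinfos")
    for n, i in enumerate(range(0, len(lines), 4), start=1):
        name, bases, _plus, quality = lines[i:i + 4]
        # Store the sequence and the quality score once, each with its own id.
        ET.SubElement(seqs, "seq", {"nid": f"s{n}", "origin": bases})
        ET.SubElement(quals, "qual", {"nid": f"q{n}", "origin": quality})
        # The record itself only holds references to those ids.
        info = ET.SubElement(infos, "seqinfo", {"name": name.lstrip("@")})
        ET.SubElement(info, "seq", {"seqref": f"s{n}"})
        ET.SubElement(info, "qual", {"qualref": f"q{n}"})
    return root


root = fastq_to_ngsgf(FASTQ)
print(ET.tostring(root, encoding="unicode"))
```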
Example 3 Conversion between NGS files and NGSGF files
The user may also use NGSGFEditor to convert between NGS files and NGSGF files.
Currently, NGSGFEditor supports five formats: FASTA, FASTQ, SAM, VCF and CAF.
NGSGFEditor can be run on Windows and Linux systems.
Format conversion can be invoked through the GUI or the command line.
1. Using the NGSGFEditor GUI
1.1 Conversion of FASTQ to NGSGF
(1) Add the FASTQ file using the "Add" button.
(2) Select the output folder using the "Browse" button.
(3) Click the "Start" button.
As shown in Fig. 15 (1).
1.2 Conversion of NGSGF to FASTQ
(1) Add the NGSGF file using the "Add" button.
(2) Select NGSGF as the input format and FASTQ as the output format.
(3) Select the output folder using the "Browse" button.
(4) Click the "Start" button.
As shown in fig. 15 (2).
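The reverse direction amounts to resolving each record's references and writing four lines per record. The Python sketch below illustrates that idea only and is not the converter used by NGSGFEditor. It assumes that `nid`, `origin`, `seqref` and `qualref` are stored as XML attributes, that the children of `list_of_quals` are named `qual`, and that each `seqinfo` carries a hypothetical `name` attribute holding the read name; the function name `ngsgf_to_fastq` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# A small hand-written NGSGF-like document; the layout is an assumption.
DOC = """
<ngs>
  <list_of_seqs><seq nid="s1" origin="ACGT"/></list_of_seqs>
  <list_of_quals><qual nid="q1" origin=";;;;"/></list_of_quals>
  <list_of_seqinfos>
    <seqinfo name="r1"><seq seqref="s1"/><qual qualref="q1"/></seqinfo>
  </list_of_seqinfos>
</ngs>
""".strip()


def ngsgf_to_fastq(root):
    """Write one four-line FASTQ record per seqinfo element."""
    # Index stored sequences and quality scores by their identifiers.
    seq_by_id = {s.get("nid"): s.get("origin")
                 for s in root.findall("./list_of_seqs/seq")}
    qual_by_id = {q.get("nid"): q.get("origin")
                  for q in root.findall("./list_of_quals/qual")}
    out = []
    for info in root.findall("./list_of_seqinfos/seqinfo"):
        # Follow the references held by the record.
        bases = seq_by_id[info.find("seq").get("seqref")]
        quality = qual_by_id[info.find("qual").get("qualref")]
        out += ["@" + info.get("name"), bases, "+", quality]
    return "\n".join(out) + "\n"


fastq_text = ngsgf_to_fastq(ET.fromstring(DOC))
print(fastq_text, end="")
```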
2. Using the NGSGFEditor command line
This example is carried out on a Linux system.
Entering "java -jar NGSGFEditor.jar -h" displays the help, as shown in Fig. 16.
2.1 Conversion of SAM to NGSGF
Entering "java -jar NGSGFEditor.jar -c SAM2NGSGF -i input_path -o output_path" converts SAM to NGSGF, as shown in Fig. 17 (1).
2.2 Conversion of NGSGF to SAM
Entering "java -jar NGSGFEditor.jar -c NGSGF2SAM -i input_path -o output_path" converts NGSGF to SAM, as shown in Fig. 17 (2).
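For scripted use, the command line can be assembled as an argument list. The Python sketch below only builds and prints the command and does not run it. The `-c` and `-o` options and the direction names SAM2NGSGF and NGSGF2SAM are read from the command text above; the `-i` option and the jar file name `NGSGFEditor.jar` are reconstructions of that text and should be treated as assumptions, as should the helper name `conversion_command` and the example paths.

```python
import shlex


def conversion_command(direction, input_path, output_path,
                       jar="NGSGFEditor.jar"):
    """Return the argument list for one conversion run."""
    # "-i" is an assumed option name; check the tool's own help output.
    return ["java", "-jar", jar, "-c", direction,
            "-i", input_path, "-o", output_path]


cmd = conversion_command("SAM2NGSGF", "reads.sam", "out_dir")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the conversion, provided Java and the jar file are available on the machine.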
It should be understood that all other embodiments that a person of ordinary skill in the art can obtain from the embodiments of the present invention without inventive effort fall within the scope of protection of the present invention.

Claims (3)

1. A method for constructing a general storage format structure of high-throughput sequencing data, wherein the structure is designed based on XML and XML Schema technology and the method comprises the following steps:
1) Existing high-throughput data formats are collected and classified into five types: sequence and quality score formats, alignment formats, assembly formats, mutation formats, and annotation and visualization formats;
2) The specification of each format is analyzed to identify the content that the formats have in common and the content specific to each;
3) The general storage format is designed based on that common and format-specific content, and the general storage format structure of high-throughput sequencing data comprises four components: a header component, a sequence component, a quality score component and a sequence information component, wherein:
the header component is used for storing header description information of a file, and comprises a sub-element meta_info, wherein the meta_info comprises a name attribute and a value attribute;
the sequence component is used for storing sequence information, the sequence information is a base sequence or a file path for storing the base sequence, the sequence component comprises one or more seq sub-elements for representing the sequence, and each seq sub-element has a unique identifier by which it is located from the sequence information component;
the quality score component is used for storing a sequence quality score, the sequence quality score is a quality score character string or a file path for storing the quality score, the quality score component comprises one or more quality sub-elements for representing the sequence quality score, and each quality sub-element has a unique identifier by which it is located from the sequence information component;
the sequence information component is used for storing records and features of a sequence, and comprises one or more seqinfo sub-elements, wherein one seqinfo sub-element represents a sequence record.
2. The method for constructing a general storage format structure of high-throughput sequencing data according to claim 1, wherein the sequence and quality score formats comprise the FASTA/CSFASTA, FASTQ/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF formats; the alignment formats include the SAM, BAM, bowtie and maq formats; the assembly formats include the ACE, AFG and CAF formats; the mutation formats include the GVF, pileup and VCF formats; the annotation and visualization formats include the BED, bigBED, wig, bigWig, bedGraph and GFF/GTF formats.
3. Use of the high-throughput sequencing data generic storage format structure obtained by the construction method of claim 1 or 2 for representing, storing, editing and converting high-throughput sequencing data.
CN202010748559.8A 2020-07-30 2020-07-30 High-throughput sequencing data general storage format structure, construction method and application thereof Active CN111881324B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010748559.8A CN111881324B (en) 2020-07-30 2020-07-30 High-throughput sequencing data general storage format structure, construction method and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010748559.8A CN111881324B (en) 2020-07-30 2020-07-30 High-throughput sequencing data general storage format structure, construction method and application thereof

Publications (2)

Publication Number Publication Date
CN111881324A CN111881324A (en) 2020-11-03
CN111881324B true CN111881324B (en) 2023-12-15

Family

ID=73204229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010748559.8A Active CN111881324B (en) 2020-07-30 2020-07-30 High-throughput sequencing data general storage format structure, construction method and application thereof

Country Status (1)

Country Link
CN (1) CN111881324B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546160A (en) * 2013-09-22 2014-01-29 上海交通大学 Multi-reference-sequence based gene sequence stage compression method
CN104169927A (en) * 2012-02-28 2014-11-26 皇家飞利浦有限公司 Compact next generation sequencing database and efficient sequence processing using same
WO2015180203A1 (en) * 2014-05-30 2015-12-03 周家锐 High-throughput dna sequencing quality score lossless compression system and compression method
WO2016105579A1 (en) * 2014-12-22 2016-06-30 Board Of Regents Of The University Of Texas System Systems and methods for processing sequence data for variant detection and analysis
CN105760706A (en) * 2014-12-15 2016-07-13 深圳华大基因研究院 Compression method for next generation sequencing data
CN106446600A (en) * 2016-05-20 2017-02-22 同济大学 CRISPR/Cas9-based sgRNA design method
WO2017214765A1 (en) * 2016-06-12 2017-12-21 深圳大学 Multi-thread fast storage lossless compression method and system for fastq data
CN107609350A (en) * 2017-09-08 2018-01-19 厦门极元科技有限公司 A kind of data processing method of two generations sequencing data analysis platform
WO2019150287A1 (en) * 2018-01-30 2019-08-08 Encapsa Technology Llc Method and system for encapsulating and storing information from multiple disparate data sources
CN110517726A (en) * 2019-07-15 2019-11-29 西安电子科技大学 A kind of microbe composition and concentration detection method based on high-flux sequence data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3005200A2 (en) * 2013-06-03 2016-04-13 Good Start Genetics, Inc. Methods and systems for storing sequence read data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NGS-FC: A Next-Generation Sequencing Data Format Converter; Chunjiang Yu et al.; IEEE; pp. 1683-1691 *
XML for Data Representation and Model Specification in Neuroscience; Sharon M. Crook et al.; Neuroinformatics; pp. 53-66 *

Also Published As

Publication number Publication date
CN111881324A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
US11651149B1 (en) Event selection via graphical user interface control
US11423216B2 (en) Providing extraction results for a particular field
US10783318B2 (en) Facilitating modification of an extracted field
US9026901B2 (en) Viewing annotations across multiple applications
CN102135938B (en) Software product testing method and system
CN108469952B (en) Code generation method and matched tool for managing game configuration
JP2000181917A (en) Structured document managing method, executing device therefor and medium recording processing program therefor
US20100175055A1 (en) Method and system to identify gui objects for non-markup-language-presented applications
US20060101392A1 (en) Strongly-typed UI automation model generator
WO2016161178A1 (en) System and method for automated cross-application dependency mapping
CN112667735A (en) Visualization model establishing and analyzing system and method based on big data
CN108804300A (en) Automated testing method and system
CN113326026B (en) Method and terminal for generating micro-service business process interface
Kienle et al. Evolution of web systems
Borowski et al. Graph Buddy—an interactive code dependency browsing and visualization tool
CN111881324B (en) High-throughput sequencing data general storage format structure, construction method and application thereof
CN112015382B (en) Processor architecture analysis method, device, equipment and storage medium
CN116107524B (en) Low-code application log processing method, medium, device and computing equipment
US20050154976A1 (en) Method and system for automated metamodel system software code standardization
US20080307432A1 (en) Method and apparatus for exchanging data using data transformation
JP2009211599A (en) Mapping definition creation system and mapping definition creation program
CN112395818A (en) Hardware algorithm model construction method based on SysML
CN111124548B (en) Rule analysis method and system based on YAML file
Verbeek et al. Visualizing state spaces with Petri nets
CN115248803B (en) Collection method and device suitable for network disk file, network disk and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant