CN111881324B

CN111881324B - High-throughput sequencing data general storage format structure, construction method and application thereof

Info

Publication number: CN111881324B
Application number: CN202010748559.8A
Authority: CN
Inventors: 郁春江; 沈百荣
Original assignee: Suzhou Industrial Park Institute of Services Outsourcing
Current assignee: Suzhou Industrial Park Institute of Services Outsourcing
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2023-12-15
Anticipated expiration: 2040-07-30
Also published as: CN111881324A

Abstract

The invention provides a general storage format structure for high-throughput sequencing data, together with a construction method and applications thereof. With the invention, different types of high-throughput sequencing data can be stored in a single format, which overcomes the loss of data interoperability caused by the diversity of data formats. In addition, the general format is structured, so filtering and extracting data is easier and faster than with unstructured text data.

Description

High-throughput sequencing data general storage format structure, construction method and application thereof

Technical Field

The invention belongs to the technical field of biological information processing, and relates to a general storage format structure of high-throughput sequencing data, a construction method and application thereof.

Background

With the rapid development of high-throughput sequencing technologies, differences in sequencing instruments and vendors, in sequencing principles, and in development context or goals (such as readability, integration and space savings) have produced an increasing variety of sequencing data. Many analysis tools have been designed to analyze these data, but most of them define their own data storage formats (S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, "A survey of tools for variant analysis of next-generation genome sequencing data," Brief Bioinform, vol. 15, no. 2, pp. 256-78, Mar. 2014). For example, BAM/FASTQ/QSEQ, BAM/HDF5/FASTQ and BAM/SFF/FASTQ are the file formats handled by Illumina, PacBio and Ion Torrent sequencers, respectively. The result is a wide variety of data formats.

Data interoperability is a key element of big data analysis, and many format conversion tools have been developed whose main function is to convert high-throughput sequencing data from one format to another (H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078-9, Aug. 15, 2009; M. R. Breese and Y. Liu, "NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets," Bioinformatics, vol. 29, no. 4, pp. 494-6, Feb. 15, 2013).

However, most of these tools are developed for a specific, limited set of formats, and format conversion both loses information and consumes considerable computational resources. When no ready-made tool exists to convert a given format into the desired one, users who are not professional programmers must either wait for someone else to develop such a tool or write a program themselves, neither of which is easy.

Disclosure of Invention

In view of the above technical problems, the invention aims to provide a general storage format structure for high-throughput sequencing data, together with a construction method and applications thereof. Different types of high-throughput sequencing data can be stored in a single format, which overcomes the loss of data interoperability caused by the diversity of data formats. In addition, the general format is structured, so filtering and extracting data is easier and faster than with unstructured text data.

In order to achieve the technical purpose, the invention adopts the following technical scheme:

The invention provides a general storage format structure for high-throughput sequencing data, which comprises four components: a header component, a sequence component, a quality score component and a sequence information component, wherein:

the header component is used for storing header description information of the file;

the sequence component is used for storing sequence information, wherein the sequence information is a base sequence or a file path for storing the base sequence;

the quality score component is used for storing the quality score of the sequence, and the quality score of the sequence is a quality score character string or a file path for storing the quality score;

the sequence information component is used for storing records and features of the sequence.

Preferably, the high-throughput sequencing data universal storage format structure is designed based on XML and XML Schema technology.

Preferably, the header component contains a sub-element meta_info, which has name and value attributes; the sequence component comprises one or more seq sub-elements, each representing a sequence and each having a unique identifier that is referenced by the sequence information component; the quality score component comprises one or more quality sub-elements, each representing a sequence quality score and each having a unique identifier that is referenced by the sequence information component; the sequence information component contains one or more seqinfo sub-elements, each representing one sequence record.
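As a hedged illustration of how XML Schema can constrain the header component, the Java sketch below validates a header_lines fragment against a small schema. Only the element names header_lines and meta_info and the name and value attributes come from the text; the schema itself, the class HeaderSchemaSketch, and the choice to make both attributes required are assumptions for illustration, not the schema of the invention.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class HeaderSchemaSketch {
    // Hypothetical schema fragment: one header_lines element holding
    // one or more meta_info elements with required name/value attributes.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='header_lines'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='meta_info' maxOccurs='unbounded'>"
      + "<xs:complexType>"
      + "<xs:attribute name='name' type='xs:string' use='required'/>"
      + "<xs:attribute name='value' type='xs:string' use='required'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the XML fragment conforms to the sketch schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // With no custom ErrorHandler, validation errors are thrown as SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<header_lines><meta_info name='fileformat' value='VCFv4.2'/></header_lines>"));
        System.out.println(isValid(
            "<header_lines><meta_info name='fileformat'/></header_lines>"));
    }
}
```

In this sketch the first fragment is accepted and the second is rejected because the value attribute is missing, which is the kind of structural check a schema-based format makes possible.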

The invention also provides an editing tool based on the high-throughput sequencing data general storage format structure, which is used for creating and editing the high-throughput sequencing data general storage format file and converting the format between the NGS file and the NGSGF file.

Preferably, the editing tool is written in Java through NetBeans IDE 10.0.

Preferably, the editing tool executes corresponding operations through GUI and command line calls.

Preferably, the formats that the editing tool supports conversion include FASTA, FASTQ, SAM, VCF, CAF.

The invention also provides a construction method of the high-throughput sequencing data general storage format structure, which comprises the following steps:

1) Existing high-throughput data formats are collected and classified into five types: sequence and quality score formats, alignment formats, assembly formats, mutation formats, annotation and visualization formats;

2) Analyzing the specific specification of each format, and searching the content of commonality and characteristics;

3) The common storage format structure is designed based on the content of the commonality and the characteristics.

Preferably, the sequence and quality score formats include the Fasta/CSFASTA, Fastq/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF formats; the alignment formats include the SAM, BAM, bowtie and maq formats; the assembly formats include the ACE, AFG and CAF formats; the mutation formats include the GVF, pileup and VCF formats; the annotation and visualization formats include the BED, bigBED, wig, bigWig, bedGraph and GFF/GTF formats.
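The five-way classification above can be written down as a simple lookup table. The following Java sketch is illustrative only: the category and format names are taken from the text, while the class name FormatCatalog, the method categoryOf, and the decision to list slash-separated pairs such as GFF/GTF as separate names are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FormatCatalog {
    // Category -> member formats, as listed in the text. Slash-separated
    // pairs (e.g. "GFF/GTF") are split into separate names here.
    static final Map<String, List<String>> CATEGORIES = new LinkedHashMap<>();
    static {
        CATEGORIES.put("sequence and quality score", Arrays.asList(
            "Fasta", "CSFASTA", "Fastq", "CSFASTQ", "qseq", "SCARF", "QUAL",
            "2bit", "nib", "SFF"));
        CATEGORIES.put("alignment", Arrays.asList("SAM", "BAM", "bowtie", "maq"));
        CATEGORIES.put("assembly", Arrays.asList("ACE", "AFG", "CAF"));
        CATEGORIES.put("mutation", Arrays.asList("GVF", "pileup", "VCF"));
        CATEGORIES.put("annotation and visualization", Arrays.asList(
            "BED", "bigBED", "wig", "bigWig", "bedGraph", "GFF", "GTF"));
    }

    /** Returns the category of a format name (case-insensitive), or null if unknown. */
    static String categoryOf(String format) {
        for (Map.Entry<String, List<String>> entry : CATEGORIES.entrySet()) {
            for (String name : entry.getValue()) {
                if (name.equalsIgnoreCase(format)) {
                    return entry.getKey();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(categoryOf("sam"));
        System.out.println(categoryOf("GTF"));
    }
}
```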

The invention also provides application of the general storage format structure of the high-throughput sequencing data in representing, storing, editing and converting the high-throughput sequencing data.

The invention designs a format structure based on XML and XML Schema technology that can store many of the different types of high-throughput sequencing data currently in use. The format structure prescribes how data are stored, while the specific content stored varies with the sequence information, so the format can not only store data from existing formats but also accommodate newly emerging data formats.

The beneficial effects of the invention are as follows:

First, the general storage format structure of the invention uses a component structure that divides the sequences and their description information into four parts, so the format is clearly structured, self-describing and convenient to extend in the future.

Second, the general storage format structure introduces the idea of references, a technique widely used in computer science, into biological data formats. In this format, different sequence information records can link to the same sequence or quality score when the content is the same or similar, which avoids storing duplicate content.

Third, the general storage format structure makes full use of the advantages of currently popular NGS data formats and can store most biological sequence information. In addition, it inherits the flexibility and extensibility of XML. Because NGS technology is developing rapidly, new concepts and analysis tools keep emerging, and older data formats struggle to meet current requirements. The extensibility of the general storage format structure overcomes this limitation of format-specific designs, and its flexibility can accommodate future needs.

Finally, the general storage format structure is highly readable: it can be processed easily by computer programs, and it is also readable to humans, which makes the stored content easier to understand. This advantage stems from the tree structure of XML.
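The reference idea named among the benefits above can be sketched as a small content pool: identical content is registered once and every later occurrence receives the same identifier. This is a minimal Java sketch under stated assumptions; the class ReferencePool, the method refFor and the identifier scheme (a prefix followed by a running number, in the style of the "s1" and "q1" values used in Example 2) are hypothetical, and only exact duplicates are shared here, not merely similar content.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReferencePool {
    private final String prefix;
    // Maps stored content (a base sequence or a quality string) to its identifier.
    private final Map<String, String> idByContent = new LinkedHashMap<>();

    ReferencePool(String prefix) {
        this.prefix = prefix;
    }

    /** Returns the identifier for the content, registering it only on first sight. */
    String refFor(String content) {
        String id = idByContent.get(content);
        if (id == null) {
            id = prefix + (idByContent.size() + 1);
            idByContent.put(content, id);
        }
        return id;
    }

    /** Number of distinct entries actually stored. */
    int size() {
        return idByContent.size();
    }

    public static void main(String[] args) {
        ReferencePool seqs = new ReferencePool("s");
        System.out.println(seqs.refFor("ACGT"));
        System.out.println(seqs.refFor("TTTT"));
        System.out.println(seqs.refFor("ACGT")); // same identifier as the first call
        System.out.println(seqs.size());
    }
}
```

Two sequence records that carry the same bases would then hold the same reference value, so the bases are stored only once in the sequence component.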

Drawings

Fig. 1 shows 26 high-throughput data storage formats commonly used in the prior art.

Fig. 2 shows the general technical architecture of the present invention.

Fig. 3 shows the overall format structure of the NGSGF of the present invention.

Fig. 4 shows the format structure of the NGSGF header component of the present invention.

Fig. 5 shows the format structure of the NGSGF sequence component of the present invention.

Fig. 6 shows the format structure of the NGSGF quality score component of the present invention.

Fig. 7 shows the format structure of the NGSGF sequence information component of the present invention.

Fig. 8 shows a user interface screenshot of NGSGFEditor in embodiment 2 of the present invention.

Fig. 9 shows a screenshot of the two projects of NGSGFEditor in embodiment 2 of the present invention.

Fig. 10 shows an interface screenshot of "Step 1: create a new NGSGF file" in embodiment 2.

Fig. 11 shows an interface screenshot of "Step 2: add a sequence" in embodiment 2.

Fig. 12 shows an interface screenshot of "Step 3: add a quality score" in embodiment 2.

Fig. 13 shows an interface screenshot of "Step 4: add sequence information" in embodiment 2.

Fig. 14 shows an interface screenshot of "Step 5: save the NGSGF file" in embodiment 2.

Fig. 15 shows a screenshot of embodiment 3 of the present invention for converting FASTQ and NGSGF format files through the NGSGFEditor GUI.

Fig. 16 shows a screenshot of the help displayed by entering "java -jar NGSGFEditor.jar -h" in embodiment 3 of the present invention.

Fig. 17 shows an interface screenshot of embodiment 3 of the present invention for converting SAM and NGSGF format files using NGSGFEditor command line.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments of the present invention and the accompanying drawings.

In order to solve the compatibility problem of the present NGS data, we have developed a new general storage format based on XML, hereinafter abbreviated as NGSGF, which can satisfy most NGS data types. NGSGF is based on extensible markup language (XML), which is widely used in the fields of data storage on the Internet, mathematics, biology, and the like. NGSGF is used to describe data produced by NGS technology, and different types of information used by NGS are integrated into NGSGF, such as alignment, assembly, and annotation information. Because of the high degree of extensibility of XML, NGSGF is easily extended with new features.

The invention first surveys the data storage formats currently used in the high-throughput sequencing field. A total of 26 commonly used high-throughput data formats were collected and divided into five types: sequence and quality score formats (Sequence or quality score), alignment formats (Alignment), assembly formats (Assembly), mutation formats (Variant), and sequence annotation and visualization formats (Sequence annotation & visualization). The sequence and quality score formats may include Fasta/CSFASTA, Fastq/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF; the alignment formats may include SAM, BAM, bowtie and maq; the assembly formats may include ACE, AFG and CAF; the mutation formats may include GVF, pileup and VCF; and the annotation and visualization formats may include BED, bigBED, wig, bigWig, bedGraph and GFF/GTF, as shown in Fig. 1.

The specification of each format is then analyzed, focusing on what content the format stores and how that content is organized. Once the specifications are understood, the content the formats have in common and the content specific to each format are identified, and the general storage format is designed on that basis. As shown in Fig. 3, the newly proposed format, NGSGF, includes four components: a header component (header_lines), a sequence component (list_of_seqs), a quality score component (list_of_quals) and a sequence information component (list_of_seqinfos).

The header component stores header description information; most existing NGS file formats contain header information that describes the stored content. As shown in Fig. 4, header_lines contains the child element meta_info, which has a name attribute and a value attribute and stores the header description information of the NGS file.

The sequence component stores sequence information, which is either a base sequence or the path of a file that stores the base sequence. Storing a file path enables NGSGF to handle large sequence files. As shown in Fig. 5, the list_of_seqs component includes one or more seq sub-elements, each representing a sequence. Each seq sub-element has a unique identifier that is referenced from the list_of_seqinfos component.

The quality score component stores the quality scores of sequences; a quality score is either a quality score string or the path of a file that stores the quality scores. Storing a file path enables NGSGF to handle large quality score files. As shown in Fig. 6, the list_of_quals component includes one or more quality sub-elements, each representing the quality score of a sequence. Each quality sub-element has a unique identifier that is referenced from the list_of_seqinfos component.

The sequence information component stores sequence records and features. As shown in Fig. 7, the list_of_seqinfos component contains one or more seqinfo sub-elements, each representing one sequence record. Typically, NGSGF sequence records are stored in this component. The general storage format can store the content held in the 26 formats above.
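To make the four-component layout concrete, the following Java sketch builds a minimal NGSGF-like document for a single record with the standard DOM API. It is a sketch under stated assumptions, not the invention's actual file layout: the element names come from the text and from Example 2 ("ngs", "nid", "origin", "seqref", "qualref"), but treating "nid" and "origin" as XML attributes, the header entry source_format, and the class NgsgfSkeleton are assumptions.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class NgsgfSkeleton {
    /** Appends a new child element to the parent and returns it. */
    private static Element child(Document doc, Element parent, String name) {
        Element e = doc.createElement(name);
        parent.appendChild(e);
        return e;
    }

    /** Builds a one-record document containing the four components. */
    static Document build(String bases, String quals) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("ngs");
        doc.appendChild(root);

        Element meta = child(doc, child(doc, root, "header_lines"), "meta_info");
        meta.setAttribute("name", "source_format"); // hypothetical header entry
        meta.setAttribute("value", "FASTQ");

        Element seq = child(doc, child(doc, root, "list_of_seqs"), "seq");
        seq.setAttribute("nid", "s1");       // unique identifier
        seq.setAttribute("origin", bases);   // inline bases (could be a file path)

        Element quality = child(doc, child(doc, root, "list_of_quals"), "quality");
        quality.setAttribute("nid", "q1");
        quality.setAttribute("origin", quals);

        // The record itself only links to the stored sequence and quality score.
        Element record = child(doc, child(doc, root, "list_of_seqinfos"), "seqinfo");
        child(doc, record, "seq").setAttribute("seqref", "s1");
        child(doc, record, "qual").setAttribute("qualref", "q1");
        return doc;
    }

    /** Serializes the document to an indented XML string. */
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(build("CCCTTCTTGTCTTCAGCGTTTCTCC",
                                       ";;3;;;;;;;;;;;;7;;;;;;;88")));
    }
}
```

Building the tree through DOM rather than by string concatenation means characters such as "<" or "&", which can occur in quality strings, are escaped automatically on output.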

Based on the designed structure, the invention also provides corresponding editing and conversion software (shown as NGSGF Editor and NGSGF Format Converter in the figures). The software can not only edit high-throughput data files that follow the format structure, but also convert between existing text-based high-throughput data formats and the XML-based general format, as shown in Fig. 2.

Example 1

The NGSGF format designed by the invention can store data in the FASTA, FASTQ, SAM, CAF and VCF formats commonly used for high-throughput sequencing.

1. Storing FASTA data (sequence format) in NGSGF format

FASTA data:

>KM081703.1 Abbottina rivularis mitochondrion, complete genome
GCTAGTGTAGCTTAATCCAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAAGAAGCTCCGCATGCAC
>AF511507.1 Alligator sinensis mitochondrion, complete genome
CAATAAAGACTTAGTCCCGGTCTTCTTATTAACTACCACTTAACCTATACATGCAAGCATCCACGAACCA

Corresponding NGSGF data: [listing not reproduced in the source text]

2. Storing FASTQ data (sequence and quality score format) in NGSGF format

FASTQ data:

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83

Corresponding NGSGF data: [listing not reproduced in the source text]

3. Storing SAM data (sequence alignment format) in NGSGF format

SAM data: [listing not reproduced in the source text]

Corresponding NGSGF data: [listing not reproduced in the source text]

4. Storing CAF data (sequence assembly format) in NGSGF format

CAF data:

DNA : 22ak93c2.r1t
GTCGCnCATAAGATTACGAGATCTCGAGCTCGGTACCCTTCAAGCGATTCTCCTGCCTCA

BaseQuality : 22ak93c2.r1t
4 4 8 4 4 4 4 4 4 4 4 4 6 8 17 21 14 7 6 6 6 7 7 6 8 14 16 21 15 20 20
24 26 21 18 18 14 14 19 23 10 8 8 15 20 16 29 26 34 29 39 29 31 29 31

Sequence : 22ak93c2.r1t
Is_read
Padded
Staden_id 11
Clipping QUAL 39 331
Align_to_SCF 1 43 1 43
Align_to_SCF 44 317 45 318
Align_to_SCF 319 716 319 716
SCF_File 22ak93c2.r1tSCF
Primer Universal_primer
Strand Reverse
Dye Dye_terminator
Template 22ak93c2
Clone bK216E10
Sequencing_vector "m13mp18"
Seq_vec SVEC 1 38 "M13mp18"
Tag ALUS 43 180
Tag DONE 43 43 "AUTO-EDIT: deleted C at 43(terminator,isolated,strong)"
Tag DONE 254 254 "AUTO-EDIT: replaced T by g at 254(terminator,isolated,strong)"
Tag ALUS 269 402
Tag DONE 283 283 "AUTO-EDIT: replaced T by g at 283(terminated,isolated,strong)"
Tag AMBG 298 302 "AUTOEDIT: Check this edit cluster!"
Tag DONE 317 317 "AUTO-EDIT: replaced C by a at 317(terminated,compound,strong)"
Tag DONE 318 318 "AUTO-EDIT: inserted g at 318(terminated,compound,strong)"

Corresponding NGSGF data: [listing not reproduced in the source text]

5. Storing VCF data (sequence mutation format) in NGSGF format

VCF data:

##fileformat=VCFv4.2
##fileDate=20090805
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

Corresponding NGSGF data: [listing not reproduced in the source text]

Example 2 Creation of NGSGF files

NGSGFEditor is designed for creating and editing NGSGF files. It has a user-friendly GUI and can also be run from the command line, which makes working with NGSGF files more convenient. Fig. 8 shows the user interface of NGSGFEditor while starting the software (A: start NGSGFEditor), converting a format (B: convert a SAM file), opening a file (C: open an NGSGF file) and editing a file (D: edit an NGSGF file).

The NGSGFEditor in this embodiment is written in Java using NetBeans IDE 10.0 and contains two projects, as shown in Fig. 9.

Here we use NGSGFEditor to create an NGSGF file from FASTQ-format data. The contents of the FASTQ file are:

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83

Step 1: Create a new NGSGF file

Click the "New" button to create a new NGSGF file, as shown in Fig. 10.

Step 2: Add a sequence

(1) Right-click the "ngs" node to display the popup menu, then click the "list_of_seqs" menu item to create a new node, as shown in Fig. 11 (1).

(2) Right-click the "list_of_seqs" node to add a "seq" child node, as shown in Fig. 11 (2).

(3) Right-click the "seq" node to add the "nid" attribute, as shown in Fig. 11 (3).

(4) Right-click the "nid" node and select the "Edit" menu to edit the node value, as shown in Fig. 11 (4).

(5) Add an "origin" node in the same way as the "nid" node. Right-click the "origin" node, select the "Edit" menu and enter the sequence value, as shown in Fig. 11 (5).

Step 3: Add a quality score

(1) Right-click the "ngs" node to add the "list_of_quals" node, as shown in Fig. 12 (1).

(2) Add the "nid" and "origin" nodes. Right-click the "origin" node to enter the quality score, as shown in Fig. 12 (2).

Step 4: Add sequence information

(1) Right-click the "ngs" node to add the "list_of_seqinfos" node, as shown in Fig. 13 (1).

(2) Right-click the "list_of_seqinfos" node to add a "seqinfo" node, as shown in Fig. 13 (2).

(3) Right-click the "seqinfo" node to add a "seq" node, as shown in Fig. 13 (3).

(4) Right-click the "seq" node to add the "seqref" attribute, as shown in Fig. 13 (4).

(5) Right-click the "seqinfo" node to add a "qual" node, as shown in Fig. 13 (5).

(6) Add the "seqref" and "qualref" nodes to the "seq" and "qual" nodes, respectively. Right-click the "seqref" and "qualref" nodes to enter the reference values. In this example, the sequence reference of the first record is "s1" and its quality score reference is "q1", as shown in Fig. 13 (6).

The second record of the FASTQ file is added in the same way as the first.

Step 5: Save the NGSGF file

In the end, the sequence is stored in the "list_of_seqs" node, the quality score in the "list_of_quals" node, and the FASTQ record in the "list_of_seqinfos" node, as shown in Fig. 14.
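The manual steps above amount to splitting each FASTQ record into a sequence, a quality score and a record that references both. The Java sketch below approximates that split programmatically. It is not the NGSGFEditor implementation: the class FastqRecords, the Record fields and the numbering scheme ("s1", "q1", "s2", "q2", following the reference values used in Step 4) are assumptions, and only the simple four-lines-per-record FASTQ layout is handled.

```java
import java.util.ArrayList;
import java.util.List;

final class FastqRecords {
    /** One FASTQ record plus the identifiers it would reference in NGSGF. */
    static final class Record {
        final String name;
        final String bases;
        final String quals;
        final String seqRef;
        final String qualRef;

        Record(String name, String bases, String quals, String seqRef, String qualRef) {
            this.name = name;
            this.bases = bases;
            this.quals = quals;
            this.seqRef = seqRef;
            this.qualRef = qualRef;
        }
    }

    /** Parses four-line FASTQ text and assigns running identifiers to each record. */
    static List<Record> parse(String fastq) {
        String[] lines = fastq.trim().split("\\R");
        if (lines.length % 4 != 0) {
            throw new IllegalArgumentException("FASTQ records must be four lines each");
        }
        List<Record> records = new ArrayList<>();
        for (int i = 0; i < lines.length; i += 4) {
            String header = lines[i].trim();
            String bases = lines[i + 1].trim();
            String separator = lines[i + 2].trim();
            String quals = lines[i + 3].trim();
            if (!header.startsWith("@") || !separator.startsWith("+")) {
                throw new IllegalArgumentException("Malformed record at line " + (i + 1));
            }
            if (bases.length() != quals.length()) {
                throw new IllegalArgumentException("Sequence/quality length mismatch");
            }
            int n = records.size() + 1;
            records.add(new Record(header.substring(1), bases, quals, "s" + n, "q" + n));
        }
        return records;
    }

    public static void main(String[] args) {
        String fastq = "@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n"
                     + ";;3;;;;;;;;;;;;7;;;;;;;88\n";
        Record r = parse(fastq).get(0);
        System.out.println(r.name + " -> " + r.seqRef + ", " + r.qualRef);
    }
}
```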

Example 3 Conversion between NGS files and NGSGF files

The user may also use NGSGFEditor to convert between NGS files and NGSGF files.

Currently, NGSGFEditor supports five formats: FASTA, FASTQ, SAM, VCF and CAF.

NGSGFEditor can be run on Windows and Linux systems.

Format conversion can be invoked through the GUI or from the command line.

1. Using the NGSGFEditor GUI

1.1 Conversion of FASTQ to NGSGF

(1) Add the FASTQ file using the "Add" button.

(2) Select the output folder using the "Browse" button.

(3) Click the "Start" button.

As shown in Fig. 15 (1).

1.2 Conversion of NGSGF to FASTQ

(1) Add the NGSGF file using the "Add" button.

(2) Select NGSGF as the input format and FASTQ as the output format.

(3) Select the output folder using the "Browse" button.

(4) Click the "Start" button.

As shown in Fig. 15 (2).
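Converting NGSGF back to FASTQ essentially means resolving each record's references. The Java sketch below shows one way this could work; it is not the converter shipped with NGSGFEditor. The element and attribute names follow Example 2, with "nid" and "origin" read as XML attributes (an assumption), and the "name" attribute on seqinfo that carries the read name is hypothetical, since the text does not say where the read name is stored.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NgsgfToFastq {
    /** Indexes the children of one list component by their nid attribute. */
    private static Map<String, String> index(Document doc, String listName, String childName) {
        Map<String, String> byId = new HashMap<>();
        Element list = (Element) doc.getElementsByTagName(listName).item(0);
        NodeList children = list.getElementsByTagName(childName);
        for (int i = 0; i < children.getLength(); i++) {
            Element e = (Element) children.item(i);
            byId.put(e.getAttribute("nid"), e.getAttribute("origin"));
        }
        return byId;
    }

    /** Resolves seqref/qualref for every seqinfo record and emits FASTQ text. */
    static String convert(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable XML document", e);
        }
        Map<String, String> seqs = index(doc, "list_of_seqs", "seq");
        Map<String, String> quals = index(doc, "list_of_quals", "quality");

        StringBuilder out = new StringBuilder();
        NodeList records = doc.getElementsByTagName("seqinfo");
        for (int i = 0; i < records.getLength(); i++) {
            Element rec = (Element) records.item(i);
            Element seqLink = (Element) rec.getElementsByTagName("seq").item(0);
            Element qualLink = (Element) rec.getElementsByTagName("qual").item(0);
            out.append('@').append(rec.getAttribute("name")).append('\n') // hypothetical attribute
               .append(seqs.get(seqLink.getAttribute("seqref"))).append('\n')
               .append("+\n")
               .append(quals.get(qualLink.getAttribute("qualref"))).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<ngs><list_of_seqs><seq nid='s1' origin='ACGT'/></list_of_seqs>"
            + "<list_of_quals><quality nid='q1' origin=';;38'/></list_of_quals>"
            + "<list_of_seqinfos><seqinfo name='read1'><seq seqref='s1'/>"
            + "<qual qualref='q1'/></seqinfo></list_of_seqinfos></ngs>";
        System.out.print(convert(xml));
    }
}
```

Because the lookup goes through identifiers, two records that reference the same "s1" entry would both be expanded to the same bases on output.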

2. Using the NGSGFEditor command line

This example is carried out on a Linux system.

Entering "java -jar NGSGFEditor.jar -h" displays the help, as shown in Fig. 16.

2.1 Conversion of SAM to NGSGF

Entering "java -jar NGSGFEditor.jar -c SAM2NGSGF -i input_path -o output_path" converts SAM to NGSGF, as shown in Fig. 17 (1).

2.2 Conversion of NGSGF to SAM

Entering "java -jar NGSGFEditor.jar -c NGSGF2SAM -i input_path -o output_path" converts NGSGF to SAM, as shown in Fig. 17 (2).
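The conversion tokens used with the -c option follow a source-2-target naming pattern. As a hedged illustration of how such a token could be interpreted, the Java sketch below splits it into source and target formats and checks the non-NGSGF side against the five formats the text says are supported. The class ConversionMode and its behavior are hypothetical and say nothing about how NGSGFEditor actually parses its arguments.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ConversionMode {
    // The five formats the text lists as supported by NGSGFEditor.
    static final List<String> SUPPORTED =
        Arrays.asList("FASTA", "FASTQ", "SAM", "VCF", "CAF");

    final String source;
    final String target;

    private ConversionMode(String source, String target) {
        this.source = source;
        this.target = target;
    }

    /** Parses tokens of the form "SAM2NGSGF" or "NGSGF2SAM" (case-insensitive). */
    static ConversionMode parse(String token) {
        String t = token.toUpperCase(Locale.ROOT);
        String source;
        String target;
        if (t.startsWith("NGSGF2")) {
            source = "NGSGF";
            target = t.substring("NGSGF2".length());
        } else if (t.endsWith("2NGSGF")) {
            source = t.substring(0, t.length() - "2NGSGF".length());
            target = "NGSGF";
        } else {
            throw new IllegalArgumentException("One side must be NGSGF: " + token);
        }
        String other = source.equals("NGSGF") ? target : source;
        if (!SUPPORTED.contains(other)) {
            throw new IllegalArgumentException("Unsupported format: " + other);
        }
        return new ConversionMode(source, target);
    }

    public static void main(String[] args) {
        ConversionMode mode = parse("SAM2NGSGF");
        System.out.println(mode.source + " -> " + mode.target);
    }
}
```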

It is understood that all other embodiments, which can be made by one of ordinary skill in the art without inventive effort, are within the scope of the present invention based on the embodiments of the present invention.

Claims

1. The method for constructing the high-throughput sequencing data general storage format structure is designed based on XML and XML Schema technology and comprises the following steps:

3) The general storage format is designed based on the content of commonality and characteristics, and the high-throughput sequencing data general storage format structure comprises four components: a header component, a sequence component, a quality score component and a sequence information component, wherein:

the header component is used for storing header description information of a file, and comprises a sub-element meta_info, wherein the meta_info comprises a name attribute and a value attribute;

the sequence component is used for storing sequence information, the sequence information is a base sequence or a file path for storing the base sequence, the sequence component comprises one or more seq sub-elements for representing the sequence, and each seq sub-element is provided with a unique identifier and is used for positioning the sequence information component;

the quality score component is used for storing a sequence quality score, the sequence quality score is a quality score character string or a file path for storing the quality score, the quality score component comprises one or more quality sub-elements for representing the sequence quality score, and each quality sub-element has a unique identifier for positioning the sequence information component;

the sequence information component is used for storing records and features of a sequence, and comprises one or more seqinfo sub-elements, wherein one seqinfo sub-element represents a sequence record.

2. The method for constructing a high-throughput sequencing data general storage format structure according to claim 1, wherein the sequence and quality score formats comprise the Fasta/CSFASTA, Fastq/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF formats; the alignment formats include the SAM, BAM, bowtie and maq formats; the assembly formats include the ACE, AFG and CAF formats; the mutation formats include the GVF, pileup and VCF formats; the annotation and visualization formats include the BED, bigBED, wig, bigWig, bedGraph and GFF/GTF formats.

3. Use of the high-throughput sequencing data generic storage format structure obtained by the construction method of claim 1 or 2 for representing, storing, editing and converting high-throughput sequencing data.