CN107480466B - Genome data storage method and electronic equipment - Google Patents


Info

Publication number
CN107480466B
Authority
CN
China
Prior art keywords
storage unit
statistical information
information
indicating
storing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710546293.7A
Other languages
Chinese (zh)
Other versions
CN107480466A (en
Inventor
蔡文君
何光铸
王东辉
孔令雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ronglian Technology Group Co., Ltd
Original Assignee
United Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by United Electronics Co ltd
Priority to CN201710546293.7A
Publication of CN107480466A
Application granted
Publication of CN107480466B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device
    • G06F3/0676 Magnetic disk device
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B99/00 Subject matter not provided for in other groups of this subclass

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a genome data storage method comprising the following steps: during genome alignment, obtaining gene sequence alignment information and creating gene sequence statistical information; storing the gene sequence alignment information on a magnetic disk, and storing corresponding indexes in memory according to the alignment position of the gene sequence alignment information in the genome, where each index is the storage position of the gene sequence alignment information on the disk; classifying the gene sequence statistical information to obtain first statistical information and second statistical information; storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency; and storing the second statistical information on the disk, the second statistical information being statistical information that cannot fit in memory and/or whose access frequency during variant detection is lower than the preset frequency. The invention also discloses electronic equipment adopting the genome data storage method.

Description

Genome data storage method and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a genome data storage method and an electronic device.
Background
A genome variant detection pipeline generally comprises the steps of alignment, sorting, de-duplication, realignment, variant detection, filtering and the like. Each main step writes its output to the hard disk as a SAM/BAM file (SAM stands for Sequence Alignment/Map; BAM is its binary form), and the next step reads that file from the hard disk back into memory before processing it.
In the process of implementing the invention, the inventors found that the prior art has the following problem:
In human genome data analysis, the raw data is generally about 100 GB, and files of hundreds of GB must be read and written during the main intermediate analysis steps, so the whole calculation consumes a large amount of I/O resources and the program is inefficient.
The inventors have found that the main causes of the problem are:
1. The intermediate file is too large to be placed directly into memory.
A typical machine used for bioinformatics analysis has 64 GB of memory. For human whole-genome analysis data, the intermediate result is generally about 100 GB and cannot be held directly in memory; moreover, the variant detection process needs to load a reference sequence and an index file into memory, which further reduces the space available for intermediate results.
2. The format of the intermediate file cannot be used directly for calculation.
The common intermediate file format is SAM/BAM, a line-record format in which each line stores one record; such records cannot be used directly for calculation even when loaded into memory. The data required for variant detection is mainly per-site statistical information about the alignment, including the count distribution of each base type at each site, the insertion/deletion (InDel) sequences and their frequencies, the soft-clipping sequences in the alignment, and the like.
Disclosure of Invention
In view of the above, the present invention provides a genome data storage method and an electronic device, which can solve the problem of low efficiency caused by frequently reading and writing a large number of binary files during genome variant detection.
To this end, the present invention provides a genome data storage method, comprising:
in the alignment process, obtaining gene sequence alignment information and creating gene sequence statistical information;
storing the gene sequence alignment information on a magnetic disk, and storing corresponding indexes in memory according to the alignment position of the gene sequence alignment information in the genome, where each index is the storage position of the gene sequence alignment information on the disk;
classifying the gene sequence statistical information to obtain first statistical information and second statistical information;
storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency;
and storing the second statistical information on the disk, the second statistical information being statistical information that cannot fit in memory and/or whose access frequency during variant detection is lower than the preset frequency.
Optionally, the first statistical information includes base weighted quality value statistical information, positive and negative strand statistical information, insertion/deletion (InDel) statistical information, and soft-clipping statistical information.
Optionally, for a site where no insertion/deletion or soft clipping occurs and at most 2 base types occur, the first statistical information of the site is stored using a first data structure;
the first data structure comprises:
a first header for indicating a base type;
a first quality value storage unit for storing a base weighted quality value;
a first positive strand number storage unit for indicating the number of positive strands;
a first negative strand number storage unit for indicating the number of negative strands.
Optionally, for a site where an insertion/deletion occurs and 3-4 base types occur, the first statistical information of the site is stored using a first data structure and a second data structure;
the second data structure comprises:
base weighted quality value statistical information and positive and negative strand statistical information for each of the 4 base types, where the storage structure for each base type specifically comprises: a second quality value storage unit for indicating a base weighted quality value, a second positive strand number storage unit for indicating the number of positive strands, and a second negative strand number storage unit for indicating the number of negative strands;
first insertion statistical information, specifically comprising: a first insertion sequence storage unit for storing an insertion sequence, and a first low-quality insertion number storage unit for storing the number of low-quality insertions;
first deletion statistical information, specifically comprising: a first deletion length storage unit for indicating a deletion length, a first high-quality deletion number storage unit for indicating the number of high-quality deletions, and a first low-quality deletion number storage unit for indicating the number of low-quality deletions;
the first data structure comprises:
a second header filled with 11;
a first insertion information storage unit for indicating whether there is an insertion, specifically comprising: a first insertion information sub-storage unit for indicating whether there is an insertion, an insertion length sub-storage unit for indicating the insertion length, and a low-quality insertion number sub-storage unit for indicating the number of low-quality insertions;
a first deletion information storage unit for indicating whether there is a deletion, comprising: a first deletion information sub-storage unit for indicating whether there is a deletion;
a pointer to the storage location of the corresponding second data structure.
Optionally, for a site where more than one insertion/deletion occurs and the insertion length is greater than 12 bases, the first statistical information of the site is stored using a first data structure and a third data structure, and a memory pool is created in memory to store the first statistical information of such sites;
the third data structure comprises:
base weighted quality value statistical information and positive and negative strand statistical information for each of the 4 base types, where the storage structure for each base type specifically comprises: a third quality value storage unit for indicating a base weighted quality value, a third positive strand number storage unit for indicating the number of positive strands, and a third negative strand number storage unit for indicating the number of negative strands;
second insertion statistical information, specifically comprising: an insertion length storage unit for indicating the insertion length, a second insertion sequence storage unit for indicating the insertion sequence, a second low-quality insertion number storage unit for indicating the number of low-quality insertions, and a high-quality insertion number storage unit for indicating the number of high-quality insertions;
second deletion statistical information, specifically comprising: a second deletion length storage unit for indicating the deletion length, a second high-quality deletion number storage unit for indicating the number of high-quality deletions, and a second low-quality deletion number storage unit for indicating the number of low-quality deletions;
the first data structure comprises:
a third header filled with 11;
a second insertion information storage unit for indicating whether there is an insertion, specifically comprising: a second insertion information sub-storage unit for indicating whether there is an insertion, a first memory pool information sub-storage unit for indicating whether the memory pool is used, and a first occupied length sub-storage unit for indicating the length occupied in the memory pool;
a second deletion information storage unit for indicating whether there is a deletion, specifically comprising: a second deletion information sub-storage unit for indicating whether there is a deletion, a second memory pool information sub-storage unit for indicating whether the memory pool is used, and a second occupied length sub-storage unit for indicating the length occupied in the memory pool.
Optionally, the soft-clipping statistical information is recorded in a dynamic array, and each record includes:
a soft-clip position storage unit for storing the position of the soft clip on the genome;
a soft-clip left-side number storage unit for indicating the number of times soft clipping occurs to the left of the corresponding site;
a soft-clip right-side number storage unit for indicating the number of times soft clipping occurs to the right of the corresponding site.
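The soft-clip record described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name SoftClipCounts, its methods, and the choice to keep the array sorted by position are hypothetical and are not specified in the text, which only says that a dynamic array holds a position, a left-side count and a right-side count per record.

```python
import bisect


class SoftClipCounts:
    """Hypothetical dynamic array of (position, left count, right count) records."""

    def __init__(self):
        self.positions = []  # genome positions, kept sorted (an assumption)
        self.counts = []     # parallel list of [left_count, right_count]

    def add(self, position, side):
        """Count one soft clip on the given side ('left' or 'right') of a site."""
        i = bisect.bisect_left(self.positions, position)
        if i == len(self.positions) or self.positions[i] != position:
            # First soft clip seen at this site: grow the array.
            self.positions.insert(i, position)
            self.counts.insert(i, [0, 0])
        self.counts[i][0 if side == "left" else 1] += 1

    def get(self, position):
        """Return (left_count, right_count); (0, 0) if the site has no record."""
        i = bisect.bisect_left(self.positions, position)
        if i < len(self.positions) and self.positions[i] == position:
            return tuple(self.counts[i])
        return (0, 0)


clips = SoftClipCounts()
for pos, side in [(100, "left"), (100, "left"), (100, "right"), (50, "right")]:
    clips.add(pos, side)
```

Only sites that actually show soft clipping get a record, which is one plausible reason for using a growable array rather than a fixed per-site field.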
Optionally, the index includes a paired-end alignment information index and a single-end alignment information index;
the paired-end alignment information index is stored in a paired-end alignment array structure, which comprises:
a first ID storage unit for storing the ID of a gene sequence;
a first alignment position storage unit for storing the position at which the gene sequence is aligned on the genome;
an insert length storage unit for storing the insert length of the gene sequence;
a first alignment quality value storage unit for storing the alignment quality value of the gene sequence;
a first average quality value storage unit for storing the average quality value of the gene sequence;
the single-end alignment information index is stored in a single-end alignment array structure, which comprises:
a second ID storage unit for storing the ID of a gene sequence;
a second alignment position storage unit for storing the position at which the gene sequence is aligned on the genome;
a second alignment quality value storage unit for storing the alignment quality value of the gene sequence;
a second average quality value storage unit for storing the average quality value of the gene sequence;
wherein the indexes of the aligned gene sequences are arranged in order of the alignment position of each gene sequence on the genome.
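As a rough illustration of why the index entries are kept in alignment-position order, the sketch below models the two array structures as Python tuples and answers a position-range query by binary search. The type names, field names and the reads_in_range helper are assumptions made for this example; the text lists only which fields each entry holds.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class PairedEndEntry(NamedTuple):
    position: int      # alignment position on the genome
    read_id: int       # ID of the gene sequence
    insert_size: int   # insert length of the pair
    mapq: int          # alignment quality value
    avg_quality: int   # average base quality of the sequence


class SingleEndEntry(NamedTuple):
    position: int
    read_id: int
    mapq: int
    avg_quality: int


def reads_in_range(index, start, end):
    """Return entries whose position lies in [start, end]; index must be sorted by position."""
    positions = [entry.position for entry in index]  # rebuilt per call for brevity
    return index[bisect_left(positions, start):bisect_right(positions, end)]


paired_index = sorted([
    PairedEndEntry(1500, 7, 320, 60, 35),
    PairedEndEntry(200, 3, 310, 60, 36),
    PairedEndEntry(900, 5, 295, 42, 30),
])
hits = reads_in_range(paired_index, 500, 1500)
```

Because the entries are sorted, all reads overlapping a region can be located without scanning the whole index; a real implementation would presumably keep the positions in a packed array instead of rebuilding the list on each query.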
Optionally, storing the gene sequence alignment information on the disk specifically includes:
dividing the gene sequence alignment information into 512 files stored on the disk, each file storing the gene sequence alignment information of one genome interval, where the storage data structure of each piece of gene sequence alignment information comprises:
a sequence length storage unit for storing the sequence length of the gene sequence;
a sequence storage unit for storing the gene sequence;
a quality value storage unit for storing the quality values of the gene sequence;
a start position storage unit for storing the start position used by the alignment algorithm when aligning the gene sequence;
a positive/negative strand storage unit for storing the strand on which the gene sequence is aligned;
a region length storage unit for storing the length of the genomic region selected when the gene sequence is aligned;
a left position storage unit for storing the left anchor position of the gene sequence in the alignment;
a right position storage unit for storing the right anchor position of the gene sequence in the alignment.
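One way to picture the 512-file layout is a function that maps a genome position to the file holding its interval. The sketch below assumes equal-width intervals over a linearised genome coordinate; the text only states that each file covers "a certain genome interval", so the equal-width rule, the function name and the genome length used here are all hypothetical.

```python
NUM_FILES = 512  # number of on-disk files stated in the text


def file_index(position, genome_length, num_files=NUM_FILES):
    """Map a 0-based linear genome position to a file number in [0, num_files)."""
    if not 0 <= position < genome_length:
        raise ValueError("position outside the genome")
    # Integer arithmetic avoids floating-point rounding at interval boundaries.
    return position * num_files // genome_length


# Illustrative genome length chosen so that each interval is exactly 5,859,375 sites.
GENOME_LENGTH = 3_000_000_000
first = file_index(0, GENOME_LENGTH)
boundary = file_index(5_859_375, GENOME_LENGTH)
last = file_index(GENOME_LENGTH - 1, GENOME_LENGTH)
```

With a scheme like this, a step that works on one region only needs to open the one or two files covering it.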
Optionally, the method further includes:
in the de-duplication process, removing from the gene sequence statistical information the interference caused by duplicate sequences;
and/or,
in the realignment process, extracting the gene sequences of the realignment region of the genome and, after realigning those gene sequences, adjusting the gene sequence statistical information for that region.
As can be seen from the above, the genome data storage method and the electronic device provided by the invention use a compact data storage structure designed around the characteristics of the intermediate files of the whole variant detection pipeline. The main intermediate data is kept in memory and can be accessed there directly, so the individual steps of the pipeline no longer need heavy disk I/O, and the efficiency of the whole variant detection analysis is significantly improved.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the genome data storage method provided by the present invention;
FIG. 2a is a schematic diagram of the first data structure when no InDel or soft clipping occurs and at most 2 base types occur;
FIG. 2b is a schematic diagram of the second data structure when an insertion/deletion (InDel) occurs and 3-4 base types occur;
FIG. 2c is a schematic diagram of the first data structure when an insertion/deletion (InDel) occurs and 3-4 base types occur;
FIG. 2d is a schematic diagram of the third data structure for a site where more than one InDel occurs and the insertion length is greater than 12 bases;
FIG. 2e is a schematic diagram of the first data structure for a site where more than one InDel occurs and the insertion length is greater than 12 bases;
FIG. 2f is a schematic diagram of the dynamic array used to record the soft-clipping statistical information;
FIG. 2g is a schematic diagram of the index;
FIG. 2h is a schematic diagram of the storage data structure of each piece of gene sequence alignment information;
FIG. 3 is a schematic flow chart of an embodiment of a method for aligning genomic sequences provided by the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of a genomic data storage device provided by the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two different entities or parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; this note is not repeated in the following embodiments.
In view of the above, a first aspect of the embodiments of the present invention provides an embodiment of a genome data storage method, which can solve the problem of low efficiency caused by frequently reading and writing a large number of binary files during genome variant detection. FIG. 1 is a schematic flow chart of an embodiment of the genome data storage method provided by the present invention.
The genome data storage method comprises the following steps:
Step 101: in the alignment process, obtaining gene sequence alignment information and creating gene sequence statistical information; the gene sequence alignment information is the alignment result information generated during genome alignment, and the gene sequence statistical information can be extracted from that alignment result information;
Step 102: storing the gene sequence alignment information on a magnetic disk, and storing corresponding indexes in memory according to the alignment position of the gene sequence alignment information in the genome; each index is the storage position of the gene sequence alignment information on the disk;
Step 103: classifying the gene sequence statistical information to obtain first statistical information and second statistical information;
Step 104: storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency;
Step 105: storing the second statistical information on the disk, the second statistical information being statistical information that cannot fit in memory and/or whose access frequency during variant detection is lower than the preset frequency.
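The classification in steps 103 to 105 can be sketched as a simple two-tier placement. The sketch below is an illustration only: the StatBlock type, the place_statistics function, the greedy "hottest first" order and the byte budget are assumptions, since the text defines the two categories but not the procedure used to assign them.

```python
from dataclasses import dataclass


@dataclass
class StatBlock:
    name: str
    size_bytes: int
    access_freq: float  # expected accesses during variant detection (hypothetical unit)


def place_statistics(blocks, freq_threshold, memory_budget):
    """Split statistics into an in-memory tier and an on-disk tier."""
    in_memory, on_disk = [], []
    used = 0
    # Consider the most frequently accessed blocks first.
    for block in sorted(blocks, key=lambda b: b.access_freq, reverse=True):
        hot = block.access_freq > freq_threshold
        fits = used + block.size_bytes <= memory_budget
        if hot and fits:
            in_memory.append(block.name)   # "first statistical information"
            used += block.size_bytes
        else:
            on_disk.append(block.name)     # "second statistical information"
    return in_memory, on_disk


blocks = [
    StatBlock("strand_counts", 40, 9.0),
    StatBlock("weighted_quality", 40, 8.0),
    StatBlock("rare_annotations", 30, 0.5),
    StatBlock("indel_patterns", 50, 7.0),
]
in_memory, on_disk = place_statistics(blocks, freq_threshold=1.0, memory_budget=100)
```

In this toy run the InDel block is accessed often but no longer fits in the budget, so it lands on disk, which mirrors the "cannot be stored in memory and/or accessed less often than the preset frequency" wording of step 105.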
It can be seen from the foregoing embodiment that the genome data storage method provided by the embodiments of the present invention uses a compact data storage structure designed around the characteristics of the intermediate files of the whole variant detection pipeline (including alignment, sorting, de-duplication, realignment, variant detection, filtering and the like). The main intermediate data is kept in memory and can be accessed there directly, so the individual steps of the pipeline no longer need heavy disk I/O, and the efficiency of the whole variant detection analysis is significantly improved.
In some alternative embodiments, the first statistical information comprises base weighted quality value statistical information, positive and negative strand statistical information, insertion/deletion (InDel) statistical information, and soft-clipping statistical information, specifically as follows:
the base weighted quality value statistical information (Weighted Count):
each base aligned to the reference gene sequence has a quality value between 0 and 40, and weights are assigned as shown in the following table:
Base Quality Scores    Parameter*           Weight
0–10                   [0–Weight0]          0
11–13                  (Weight0–Weight1)    1
14–17                  (Weight1–Weight2)    2
18–20                  (Weight2–Weight3)    3
21–40                  (Weight3–40)         4
the weights of all bases of the same type aligned to the same position are added to obtain the weighted quality value sum of that base type;
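The weight table can be read as a threshold lookup followed by a sum. The following sketch assumes the parameters Weight0 to Weight3 take the values 10, 13, 17 and 20 implied by the score ranges in the table; the function names are hypothetical.

```python
def base_weight(quality, thresholds=(10, 13, 17, 20)):
    """Map a base quality score (0-40) to a weight from 0 to 4.

    thresholds stands for (Weight0, Weight1, Weight2, Weight3); the default
    values are the ones implied by the score ranges in the table.
    """
    if not 0 <= quality <= 40:
        raise ValueError("quality score must be between 0 and 40")
    # The weight is the number of thresholds the score exceeds.
    return sum(1 for t in thresholds if quality > t)


def weighted_count(qualities, thresholds=(10, 13, 17, 20)):
    """Sum the weights of all bases of one type aligned to one position."""
    return sum(base_weight(q, thresholds) for q in qualities)


# Five reads support the same base at one site with these quality scores.
total = weighted_count([5, 12, 16, 19, 30])
```

Here the five scores fall into the five rows of the table, so the weighted count is 0 + 1 + 2 + 3 + 4.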
the positive and negative strand statistical information (Strand Count): the numbers of gene sequences aligned to the same position in the forward and reverse directions, counted separately;
the insertion/deletion statistical information and insertion sequence information (InDel Count): the insertion/deletion sequences occurring at a given position of the genome in the aligned gene sequences, together with their accumulated numbers of occurrences;
the soft-clipping statistical information (Soft Clip Count): the number of soft clips occurring at a given position of the genome in the aligned gene sequences.
In some alternative embodiments, considering the simplest case, for a site where no insertion/deletion or soft clipping occurs and at most 2 base types occur, the first statistical information of the site is stored using a first data structure; optionally, the first data structure is an 8-byte data structure, Counter, which stores the information of one site. The entire human genome includes about 3 billion (3G) sites, so about 24 GB of memory is required;
as shown in FIG. 2a, the first data structure stores the statistics of two bases (base1 information and base2 information), each using the same 4-byte layout, which includes:
a first header for indicating a base type; optionally, the first header (base) uses 2 bits to represent the base type, with bases A, C, G and T represented by 00, 01, 10 and 11, respectively;
a first quality value storage unit for storing a base weighted quality value; optionally, the first quality value storage unit (weighted count) uses 14 bits to represent the weighted quality value sum, with a maximum value of 16383;
a first positive strand number storage unit for indicating the number of positive strands; optionally, the first positive strand number storage unit (+ve strand count) uses 1 byte (8 bits) to represent the number of positive strands, with a maximum value of 255;
a first negative strand number storage unit for storing the number of negative strands; optionally, the first negative strand number storage unit (-ve strand count) uses 1 byte (8 bits) to represent the number of negative strands, with a maximum value of 255.
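A minimal sketch of the 8-byte Counter follows, packing each base into 32 bits as 2 + 14 + 8 + 8. The field widths and the A/C/G/T codes come from the text; the bit order within the word, the big-endian byte order, the saturation at the maximum values and all function names are assumptions made for illustration.

```python
import struct

BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CODE_BASE = "ACGT"


def pack_base_info(base, weighted, plus, minus):
    """Pack one base into 32 bits: 2-bit base, 14-bit weighted count, two 8-bit strand counts."""
    weighted = min(weighted, 0x3FFF)  # saturate at 16383 (assumed overflow policy)
    plus = min(plus, 0xFF)            # saturate at 255
    minus = min(minus, 0xFF)
    return (BASE_CODE[base] << 30) | (weighted << 16) | (plus << 8) | minus


def unpack_base_info(word):
    """Inverse of pack_base_info."""
    return (CODE_BASE[word >> 30], (word >> 16) & 0x3FFF, (word >> 8) & 0xFF, word & 0xFF)


def pack_counter(base1, base2):
    """Build the 8-byte Counter from two (base, weighted, plus, minus) tuples."""
    return struct.pack(">II", pack_base_info(*base1), pack_base_info(*base2))


counter = pack_counter(("A", 120, 17, 13), ("G", 8, 1, 1))
word1, word2 = struct.unpack(">II", counter)

# Memory estimate from the text: about 3 billion sites at 8 bytes each.
total_bytes = 3_000_000_000 * len(counter)
```

The last line reproduces the arithmetic behind the 24 GB figure: 3 × 10^9 sites × 8 bytes = 2.4 × 10^10 bytes.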
In some alternative embodiments, for a site where an insertion/deletion (InDel) occurs and 3-4 base types occur, the first statistical information of the site is stored using a first data structure and a second data structure; optionally, a 32-byte data structure, OverflowCounter, is used to store the information of one site, in which the statistical information of each of the bases A, C, G and T (base A information, base C information, base G information, and base T information) is represented by 6 bytes, and the insertion information (Insertion Info.) and deletion information (Deletion Info.) are each represented by 4 bytes; empirically, 30X whole-genome data contains about 200M such sites;
the second data structure, as shown in FIG. 2b, includes:
base weighted quality value statistical information and positive and negative strand statistical information for each of the 4 base types, where the storage structure for each base type specifically comprises: a second quality value storage unit (weighted count, optionally 2 bytes for the weighted quality value sum, maximum value 65535) for indicating the base weighted quality value, a second positive strand number storage unit (+ve strand count, optionally 2 bytes, maximum value 65535) for indicating the number of positive strands, and a second negative strand number storage unit (-ve strand count, optionally 2 bytes, maximum value 65535) for indicating the number of negative strands;
first insertion statistical information, specifically comprising: a first insertion sequence storage unit (Insertion Pattern, optionally 3 bytes, holding at most 12 bases) for storing an insertion sequence, and a first low-quality insertion number storage unit (LQ count, optionally 1 byte, maximum 255) for storing the number of low-quality insertions;
first deletion statistical information, specifically comprising: a first deletion length storage unit (Len, optionally 1 byte, maximum 255) for indicating the deletion length, a first high-quality deletion number storage unit (HQ count, optionally 1 byte, maximum 255) for indicating the number of high-quality deletions, and a first low-quality deletion number storage unit (LQ count, optionally 1 byte, maximum 255) for indicating the number of low-quality deletions; optionally, 1 byte of unused space is also included;
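The 32-byte budget (4 × 6 bytes of per-base counts, 4 bytes of insertion information, 4 bytes of deletion information) can be checked with a small packing sketch. The field sizes follow the text; the field order, the little-endian layout, the left-aligned 2-bit encoding of the insertion pattern and all names are assumptions. The insertion length is passed in separately when decoding because, per the text, it is recorded in the Counter rather than here.

```python
import struct

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
# 4 bases x 3 unsigned shorts, 3-byte insertion pattern, LQ insertion count,
# then deletion length, HQ count, LQ count and 1 unused byte.
OVERFLOW_FORMAT = "<12H3sB4B"


def encode_pattern(seq):
    """Pack up to 12 bases into 3 bytes at 2 bits per base, left-aligned."""
    if len(seq) > 12:
        raise ValueError("at most 12 bases fit in 3 bytes")
    value = 0
    for base in seq:
        value = (value << 2) | BASE_CODE[base]
    value <<= 2 * (12 - len(seq))
    return value.to_bytes(3, "big")


def decode_pattern(data, length):
    """Recover the first `length` bases from a 3-byte pattern."""
    value = int.from_bytes(data, "big")
    return "".join("ACGT"[(value >> (2 * (11 - i))) & 3] for i in range(length))


def pack_overflow(base_stats, insertion, ins_lq, del_len, del_hq, del_lq):
    """base_stats maps each base to (weighted count, +ve strand count, -ve strand count)."""
    flat = [n for base in "ACGT" for n in base_stats[base]]
    return struct.pack(OVERFLOW_FORMAT, *flat, encode_pattern(insertion),
                       ins_lq, del_len, del_hq, del_lq, 0)


def unpack_overflow(record, ins_len):
    fields = struct.unpack(OVERFLOW_FORMAT, record)
    base_stats = {b: fields[3 * i:3 * i + 3] for i, b in enumerate("ACGT")}
    pattern, ins_lq, del_len, del_hq, del_lq, _unused = fields[12:]
    return base_stats, decode_pattern(pattern, ins_len), ins_lq, del_len, del_hq, del_lq


stats = {"A": (300, 40, 35), "C": (12, 2, 1), "G": (8, 1, 1), "T": (4, 0, 1)}
record = pack_overflow(stats, "ACGT", 2, 3, 5, 1)
```

The 3-byte pattern field explains the 12-base limit: 24 bits at 2 bits per base.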
when the OverflowCounter is used, the content stored in the corresponding first data structure changes; the first data structure, as shown in FIG. 2c, then includes:
a second header filled with 11; the header fields originally belonging to base1 information and base2 information are both filled with "11" to indicate that an OverflowCounter is used;
a first insertion information storage unit (Insertion Information) for indicating whether there is an insertion, optionally using 14 bits to store the insertion information and specifically including: a first insertion information sub-storage unit (1 bit) for indicating whether an insertion exists, an insertion length sub-storage unit (Ins. len., 4 bits) for indicating the insertion length, and a low-quality insertion number sub-storage unit (LQ count, 8 bits) for indicating the number of low-quality insertions; optionally, the remaining 1 bit is set to 0;
a first deletion information storage unit (Deletion Information) for indicating whether there is a deletion, optionally using 14 bits to store the deletion information, of which 1 bit indicates whether a deletion exists (the first deletion information sub-storage unit), 1 bit is set to 0, and 12 bits are unused (Unused);
a pointer (an array index into the dynamic array of OverflowCounters) for pointing to the storage location of the corresponding second data structure, optionally using 4 bytes to hold the location of the OverflowCounter data.
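The overflow form of the Counter still fits in 8 bytes: two 16-bit halves (a 2-bit "11" header plus 14 bits of insertion or deletion information) followed by a 4-byte array index. The sketch below follows those widths; the position of each bit inside the 14-bit fields, the byte order and the function names are assumptions.

```python
import struct


def pack_overflow_reference(has_insertion, ins_len, ins_lq, has_deletion, index):
    """Build an 8-byte Counter that redirects to OverflowCounter number `index`."""
    if not 0 <= ins_len <= 12:
        raise ValueError("insertion length must fit the 4-bit field and the 12-base pattern")
    # Insertion half: header 11 | flag (1 bit) | 0 (1 bit) | length (4 bits) | LQ count (8 bits)
    ins_half = (0b11 << 14) | (int(has_insertion) << 13) | (ins_len << 8) | min(ins_lq, 255)
    # Deletion half: header 11 | flag (1 bit) | 0 (1 bit) | 12 unused bits
    del_half = (0b11 << 14) | (int(has_deletion) << 13)
    return struct.pack(">HHI", ins_half, del_half, index)


def unpack_overflow_reference(data):
    ins_half, del_half, index = struct.unpack(">HHI", data)
    return {
        "uses_overflow": (ins_half >> 14) == 0b11 and (del_half >> 14) == 0b11,
        "has_insertion": bool((ins_half >> 13) & 1),
        "ins_len": (ins_half >> 8) & 0xF,
        "ins_lq": ins_half & 0xFF,
        "has_deletion": bool((del_half >> 13) & 1),
        "index": index,
    }


ref = pack_overflow_reference(True, 5, 3, False, 42)
info = unpack_overflow_reference(ref)
```

Since 11 is also the code for base T, a reader of this layout has to rely on both headers being 11 at once; presumably that combination cannot arise in the two-base form because base1 and base2 are different bases, although the text does not state this.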
In some alternative embodiments, for a site where more than 1 indels occur and the insertion length is greater than 12 bases, the first statistical information of the site is stored by using a first data structure and a third data structure, and for the first statistical information of such a site, a Memory Pool (specially opening up a block of Memory) is created in the Memory for storage, the statistical information of the bases ACGT (base a information, base C information, base G information, and base T information) are respectively represented by 6bytes, and a pointer of the indels information is recorded in the overlaflowcounter, as shown in fig. 2 d;
the third data structure, as shown in fig. 2d, includes:
statistical information of base weight quality values and statistical information of positive and negative strands for each of the 4 base types; for each base type, the storage structure specifically comprises: a third quality value storage unit (weighted count, optionally 2 bytes holding the weighted quality value sum, maximum value 65535) for indicating the base weight quality value, a third positive strand number storage unit (+ve strand count, optionally 2 bytes holding the positive strand number, maximum value 65535) for indicating the number of positive strands, and a third negative strand number storage unit (-ve strand count, optionally 2 bytes holding the negative strand number, maximum value 65535) for indicating the number of negative strands;
the second insertion statistical information (Insertion Ptr), optionally represented by 4 bytes, specifically includes: an insertion length storage unit (Insertion length, optionally 1 byte) for indicating the insertion length, a second insertion sequence storage unit (Insertion pattern, variable length, 2 bits per base) for indicating the insertion sequence, a second low-quality insertion number storage unit (LQ count, optionally 1 byte) for indicating the number of low-quality insertions, and a high-quality insertion number storage unit (HQ count, optionally 1 byte) for indicating the number of high-quality insertions;
the second deletion statistical information (Deletion Ptr), optionally represented by 4 bytes, specifically includes: a second deletion length storage unit (Deletion length, optionally 1 byte) for indicating the deletion length, a second high-quality deletion number storage unit (HQ count, optionally 1 byte) for indicating the number of high-quality deletions, and a second low-quality deletion number storage unit (LQ count, optionally 1 byte) for indicating the number of low-quality deletions;
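The per-base part of the third data structure (6 bytes for each of A, C, G and T) can be sketched with Python's `struct` module. This is only an illustration: little-endian byte order, the helper name `pack_site`, and clamping values at the 16-bit maximum of 65535 are assumptions, since the description gives only the field sizes and the maximum value.

```python
import struct

BASES = "ACGT"
# Per base: weighted quality sum, positive strand count, negative strand count.
# Three unsigned 16-bit fields = 6 bytes (byte order is an assumption).
BASE_STATS = struct.Struct("<HHH")


def pack_site(stats: dict) -> bytes:
    """Pack {base: (weighted_sum, plus_count, minus_count)} into 24 bytes."""
    out = bytearray()
    for base in BASES:
        values = stats.get(base, (0, 0, 0))
        # Assumed policy: saturate at the 16-bit maximum instead of overflowing.
        out += BASE_STATS.pack(*(min(v, 65535) for v in values))
    return bytes(out)
```

For example, `pack_site({"A": (1200, 30, 28)})` yields a 24-byte block whose first 6 bytes hold the statistics for base A and whose remaining fields are zero.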
meanwhile, as shown in fig. 2e, the information recorded in the Counter changes to include:
a third head filled with 11; the fields originally used to store the base 1 information and the base 2 information are both filled with "11", indicating that an OverflowCounter is used;
a second insertion information storage unit (Insertion Information) for indicating whether or not there is an insertion, optionally represented by 14 bits, specifically including: a second insertion information sub-storage unit (1 bit) for indicating whether or not there is an insertion, a first memory pool information sub-storage unit (1 bit) for indicating whether or not a memory pool is used, and a first occupied length sub-storage unit (12 bits) for indicating the occupied length in the memory pool;
a second deletion information storage unit (Deletion Information) for indicating whether or not there is a deletion, optionally represented by 14 bits, specifically including: a second deletion information sub-storage unit (1 bit) for indicating whether or not there is a deletion, a second memory pool information sub-storage unit (1 bit) for indicating whether or not the memory pool is used, and a second occupied length sub-storage unit (12 bits) for indicating the occupied length in the memory pool.
Empirically, soft clipping occurs at only a few genomic locations, so it is not necessary to allocate a separate piece of storage space for each site. Therefore, in some optional embodiments, the soft-clip statistical information is recorded in a dynamic array; as shown in fig. 2f, each record has the format {position, left counts, right counts} and occupies 12 bytes, specifically including:
a soft-clip position storage unit (position) for indicating the position of the soft clip on the genome, occupying 4 bytes;
a soft-clip left-side number storage unit (left counts) for indicating the number of times soft clips occur to the left of the corresponding site, occupying 4 bytes;
a soft-clip right-side number storage unit (right counts) for indicating the number of times soft clips occur to the right of the corresponding site, occupying 4 bytes.
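The sparse soft-clip record described above can be sketched as follows. The 12-byte {position, left counts, right counts} layout is taken from the description; the class name `SoftClipTable`, the auxiliary position lookup dictionary and the little-endian byte order are assumptions for illustration.

```python
import struct

# position, left counts, right counts: three unsigned 32-bit fields = 12 bytes
SOFT_CLIP = struct.Struct("<III")


class SoftClipTable:
    """Dynamic array holding one record only for sites where soft clips occur."""

    def __init__(self):
        self.records = []   # each entry: [position, left_counts, right_counts]
        self._slot = {}     # position -> index in records (helper, not in the source)

    def add(self, position: int, left_side: bool) -> None:
        slot = self._slot.get(position)
        if slot is None:
            slot = len(self.records)
            self._slot[position] = slot
            self.records.append([position, 0, 0])
        self.records[slot][1 if left_side else 2] += 1

    def to_bytes(self) -> bytes:
        return b"".join(SOFT_CLIP.pack(*record) for record in self.records)
```

Because only clipped sites get a record, the memory cost grows with the number of soft-clip positions rather than with genome length.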
In some alternative embodiments, as shown in fig. 2g, the index comprises a paired-end alignment (Pair End) information index and a single-end alignment (Single End) information index;
the paired-end alignment information index is stored in a paired-end alignment array structure (PairEndAlignmentInfo, occupying 12 bytes), which comprises:
a first ID storage unit (ReadID) for storing an ID indicating a gene sequence, and occupies 4 bytes;
a first alignment position storage (AlignedPosition) for indicating a position at which the gene sequence is aligned on the genome, occupying 4 bytes;
an Insert length storage (Insert Size) for indicating the Insert length of a gene sequence, occupying 2 bytes;
a first comparison quality value storage (MAPQ) for representing an alignment quality value of a gene sequence, occupying 1 byte;
a first Average quality value storage unit (Average base quality) for storing an Average quality value of a gene sequence, which occupies 1 byte;
the single-end alignment information index is stored in a single-end alignment array structure (SingleEndAlignmentInfo, occupying 10 bytes), which comprises:
a second ID storage unit (ReadID) for storing an ID indicating a gene sequence, and occupies 4 bytes;
a second alignment position storage (AlignedPosition) for indicating a position at which the gene sequence is aligned on the genome, occupying 4 bytes;
a second alignment quality value storage (MAPQ) for indicating an alignment quality value of the gene sequence, occupying 1 byte;
a second Average quality value storage unit (Average base quality) for storing an Average quality value of a gene sequence, which occupies 1 byte;
wherein, for each gene sequence for comparison, the corresponding indexes are arranged in sequence according to the comparison position of the gene sequence on the genome.
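The two index records and their position-ordered arrangement can be sketched as below. The field sizes (12 bytes and 10 bytes) follow the description; the little-endian layout and the helper names `build_index` and `reads_in_interval` are assumptions. Keeping the index sorted by aligned position allows a binary search for all reads in a genome interval.

```python
import struct
from bisect import bisect_left, bisect_right

# ReadID (4), AlignedPosition (4), Insert Size (2), MAPQ (1), average base quality (1)
PAIR_END = struct.Struct("<IIHBB")    # 12 bytes
# ReadID (4), AlignedPosition (4), MAPQ (1), average base quality (1)
SINGLE_END = struct.Struct("<IIBB")   # 10 bytes


def build_index(entries):
    """Sort index tuples by aligned position (the second field)."""
    return sorted(entries, key=lambda entry: entry[1])


def reads_in_interval(index, start, end):
    """Return index entries whose aligned position lies in [start, end]."""
    positions = [entry[1] for entry in index]
    return index[bisect_left(positions, start):bisect_right(positions, end)]
```

For instance, an index built from entries aligned at positions 300, 100 and 200 returns the entries at 200 and 300, in that order, for the interval [150, 300].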
Because gene sequences are read in a highly random order during the mutation detection process, in some alternative embodiments, storing the gene sequence alignment information in a disk specifically includes:
dividing the gene sequence alignment information into 512 files (buckets) stored on the disk, each file storing the gene sequence alignment information of a certain genome interval; as shown in fig. 2h, the storage data structure of each piece of gene sequence alignment information includes:
a sequence Length storage unit (Read Length) for storing a sequence Length of a gene sequence, which occupies 2 bytes;
a sequence storage unit (Packed Read) for expressing the gene sequence itself, the length of which is variable, and 2bits are used to express one base;
a quality value storage unit (Base qualities) for expressing the quality values of a gene sequence, the length of which is variable;
a start position storage unit (DP StartPos.) for indicating the alignment algorithm start position of the gene sequence at the time of alignment, and occupies 4 bytes;
a positive and negative chain storage part (Strand) for representing the positive and negative chain information of the gene sequences during comparison, which occupies 1 bit;
a region length storage part (DPref. length) for expressing the length of the genome region selected during the comparison of the gene sequences, occupying 15 bits;
a Left position storage unit (Left Anchor) for indicating the Left-hand position of the gene sequence during alignment, and occupies 4 bytes;
a right position storage part (Right Anchor) for indicating the right-hand anchor position of the gene sequence at the time of alignment, occupying 4 bytes.
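Two of the fields in this record lend themselves to a short illustration: the 2-bits-per-base packed read, and the strand bit that shares 2 bytes with the 15-bit region length. The sketch below assumes the code assignment A=0, C=1, G=2, T=3, assumes the first base occupies the lowest bits of each byte, and assumes the strand flag is the top bit; none of these details is given in the description, and ambiguous bases such as N are not handled.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit code assignment
BASES = "ACGT"


def pack_read(seq: str) -> bytes:
    """Pack a read at 2 bits per base, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_read(packed: bytes, length: int) -> str:
    """Recover the read; the length comes from the Read Length field."""
    return "".join(BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))


def pack_strand_and_region(reverse: bool, region_len: int) -> int:
    """Combine the 1-bit strand flag and the 15-bit region length into 16 bits."""
    if not 0 <= region_len < (1 << 15):
        raise ValueError("region length must fit in 15 bits")
    return (int(reverse) << 15) | region_len
```

With this packing a 150-base read needs 38 bytes for its sequence instead of 150.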
In addition to the steps of creating statistics and indexes during the comparison process in the foregoing examples, in some alternative embodiments, the method further comprises:
removing, in a de-duplication process, the interference caused by duplicate sequences from the genome statistical information;
and/or,
extracting, in a re-alignment (realignment) process, the gene sequences of a re-alignment region of the genome, and adjusting the genome statistical information of the gene sequences of the re-alignment region after those gene sequences have been re-aligned;
in the mutation detection process, the statistical information is directly used to calculate the probability of the various genotypes.
With the genome data storage method provided by the embodiment of the invention, a large number of binary files no longer need to be output repeatedly throughout the analysis process; with the overall algorithm optimization, the analysis of whole-genome data can be completed within 4 hours, whereas a typical analysis process takes dozens of hours. The I/O in the mutation detection analysis process is greatly reduced, and the analysis efficiency of the program is greatly improved.
To facilitate understanding of the foregoing technical solutions, an example of a genome sequence alignment method is briefly introduced here to explain the genome alignment process in step 101 in the foregoing examples. FIG. 3 is a schematic flow chart of an embodiment of the method for aligning genomic sequences according to the present invention.
The genome sequence alignment method comprises the following steps:
Step 201: acquire a reference genome sequence and a genome sequence file to be aligned. The files may be acquired in a conventional manner. The genome sequence file to be aligned may be in FASTQ format.
The genome sequence alignment method divides sequence alignment into 3 levels. Each time, a portion of the sequences is read from the input genome sequence file to be aligned, and the level 1, level 2 and level 3 alignment algorithms are then executed in turn; sequences that are not aligned at one level enter the alignment algorithm of the next level for further alignment. The method specifically comprises the following steps.
Step 202: reading partial genome sequences from the genome sequence files to be aligned.
Step 203: align the partial genome sequence against the reference genome sequence according to a bidirectional BWT alignment algorithm (level 1; bidirectional BWT: Bi-directional Burrows-Wheeler Transform). The bidirectional BWT alignment algorithm handles read alignments that tolerate up to 4 base errors. Reads are the sequences obtained in high-throughput sequencing, each read being a stretch of bases. During bioinformatic analysis, each read is aligned to a reference genome, so that the differences between the sequenced sequence and the reference genome can be obtained and variations can be found.
Optionally, the method for aligning genome sequences according to the bidirectional BWT alignment algorithm may specifically include the following steps:
segmenting reads by using the pigeonhole principle, wherein each segment allows 0-2 base errors;
then using a bidirectional BWT comparison algorithm to perform search comparison, comprising:
establishing BWT of the reference genome sequence, a suffix array and BWT of the reverse order of the reference genome sequence;
backward search (backward) and forward search (forward) are used to search reads or each fragment of a read for its position on the reference genomic sequence in both the right-to-left and left-to-right directions, respectively.
Bidirectional BWT alignment is less efficient at handling many base mismatches. For alignments with up to 4 base mismatches, reads are therefore segmented according to the pigeonhole principle, with 0-2 base mismatches allowed per segment; since bidirectional BWT only has to handle alignments with at most 2 mismatches, its processing efficiency is greatly increased.
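The pigeonhole argument can be made concrete with a small sketch: if a read with at most 4 mismatches is cut into 2 segments, at least one segment must contain at most 2 of them. The helper names `split_read` and `mismatches_per_segment`, and the choice of 2 nearly equal segments, are assumptions for illustration; the description does not state how many segments are used.

```python
def split_read(read: str, n_segments: int) -> list:
    """Split a read into n contiguous segments of nearly equal length."""
    base, extra = divmod(len(read), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)
        segments.append(read[start:end])
        start = end
    return segments


def mismatches_per_segment(read: str, ref_window: str, n_segments: int) -> list:
    """Count mismatches of each segment against the same-offset reference window."""
    counts, start = [], 0
    for segment in split_read(read, n_segments):
        window = ref_window[start:start + len(segment)]
        counts.append(sum(a != b for a, b in zip(segment, window)))
        start += len(segment)
    return counts
```

In the worst case the 4 mismatches are spread evenly, 2 per segment, which is exactly the per-segment bound quoted above.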
After establishing the BWT of the reference sequence, the corresponding index and the SA (suffix array), the common alignment software BWA uses backward search, i.e. it searches for the position of reads, or of each fragment of a read, on the genome from right to left. In addition to the conventional BWT index (denoted by B), the bidirectional BWT used in this patent also establishes a BWT index (denoted by B') for the reverse of the reference sequence. With B, B' and SA, the positions of reads or seeds on the genome are searched in both directions by backward and forward search, which significantly improves the efficiency of sequence alignment.
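For readers unfamiliar with BWT-based search, the following is a minimal sketch of exact-match backward search over a single BWT index with a suffix array. It is a textbook illustration, not the patent's implementation: it covers only the backward direction (the bidirectional scheme would add a second index B' built over the reversed reference), it allows no mismatches, and the naive suffix sorting is suitable only for tiny inputs. The function names are assumptions.

```python
def build_fm_index(text: str):
    """Build the suffix array, the C table and the occurrence table for text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive construction
    bwt = "".join(text[i - 1] for i in sa)                  # text[-1] is '$' for i == 0
    alphabet = sorted(set(text))
    counts, total = {}, 0
    for c in alphabet:            # counts[c]: characters in text smaller than c
        counts[c] = total
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):  # occ[c][i]: occurrences of c in bwt[:i]
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, counts, occ


def backward_search(pattern: str, sa, counts, occ) -> list:
    """Return sorted start positions of exact matches, scanning right to left."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in counts:
            return []
        lo = counts[ch] + occ[ch][lo]
        hi = counts[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

Each pattern character narrows the suffix array interval in constant time, so the search cost depends on the read length rather than on the genome length.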
Step 204: whether only one read of at least one pair of reads is aligned in the partial genomic sequence (i.e., only one read of at least one pair of reads is aligned in the partial genomic sequence); if yes, go to step 208; if not, go to step 205.
Step 205: according to a single-ended dynamic programming alignment algorithm (level 2), re-align to the reference genome sequence each pair of reads in the partial genome sequence for which only one read was aligned. If, after the level 1 bidirectional BWT alignment algorithm, one read (A or A') of a pair of reads (A, A') is aligned to the reference genome sequence and the other (A' or A) is not, alignment continues with the level 2 alignment algorithm.
Optionally, the method for comparing genome sequences according to the single-ended dynamic programming alignment algorithm may specifically include the following steps:
determining that one read (A or A ') of a pair of reads (A, A') aligns to a particular position (pos position) on the reference genomic sequence; data reads obtained by double-end sequencing are paired, and if one read (A or A ') of one pair of reads (A, A ') is aligned to a pos position on a reference genome sequence, the theoretical alignment position of the other read (A ' or A) is in a certain region around the pos position, namely a candidate region (candidate region);
therefore, according to a preset position range threshold value, selecting a specific range around the specific position (pos position); the preset position range threshold value can be selected according to actual needs, for example, set by referring to an error tolerance range; specifically, in paired-end sequencing, when a pair of reads is aligned on the genome, the distance between the two reads plus the sum of the lengths of the two reads equals the length of the sequencing fragment (fragment), and the location of the candidate region is determined based on this principle. For example, if the sequencing fragment is 500 bp and each read is 150 bp, the theoretical distance between the two reads after alignment to the genome is 200 bp. Because sequencing fragments vary in length, the theoretical distance is about 100 bp to 200 bp;
aligning the other read (A' or A) of the pair, which was not aligned, within the specified range by using a dynamic programming algorithm;
Step 206: whether both reads of at least one pair of reads in the partial genome sequence are unaligned (i.e., neither read of at least one pair of reads is aligned); if yes, go to step 208; if not, go to step 207.
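The candidate-region arithmetic used by the level 2 single-ended step can be illustrated with a short sketch. The relation gap = fragment length minus the two read lengths comes from the description; the function names, the use of leftmost start coordinates, and the `mate_downstream` parameter are assumptions for this example.

```python
def expected_gap(fragment_len: int, read_len: int) -> int:
    """Distance between the two reads of a pair, given the fragment length."""
    return fragment_len - 2 * read_len


def candidate_region(anchor_pos: int, read_len: int, min_gap: int, max_gap: int,
                     mate_downstream: bool = True) -> tuple:
    """Return (start, end) of the region in which the unaligned mate is expected.

    anchor_pos is assumed to be the leftmost coordinate of the aligned read.
    """
    if mate_downstream:
        start = anchor_pos + read_len + min_gap
        end = anchor_pos + read_len + max_gap + read_len
    else:
        start = max(0, anchor_pos - max_gap - read_len)
        end = max(0, anchor_pos - min_gap)
    return start, end
```

With a 500 bp fragment and 150 bp reads, `expected_gap(500, 150)` gives the 200 bp distance quoted in the text, and a gap range of 100 bp to 200 bp around a read anchored at position 1000 yields a downstream candidate region of (1250, 1500).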
Step 207: according to a paired-end dynamic programming alignment algorithm (level 3), re-align to the reference genome sequence each pair of reads in the partial genome sequence for which neither read was aligned. If, after the level 1 bidirectional BWT alignment algorithm and the level 2 single-ended dynamic programming alignment algorithm, neither A nor A' of a pair of reads (A, A') is aligned to the reference genome sequence, alignment continues with the level 3 alignment algorithm.
Optionally, the method for comparing genome sequences according to a paired-end dynamic programming alignment algorithm may specifically include the following steps:
respectively constructing seeds (seeds, substrings of a read) for each of a pair of reads (A and A');
specifically, each read of a pair of reads (A, A') is divided into a plurality of segments, and seeds (seeds of a read) are constructed; when a pair of reads are aligned on a genome, the distance between the two reads is within a certain range, so the distance between the seeds of the two reads also should be within a certain range;
aligning each seed to a reference genomic sequence;
specifically, the regions aligned by the seeds in pairs (i.e. the distance between two seeds meets the requirement) are retrieved, and the candidate alignment regions of the pairs of seeds are determined. Then, comparing reads to the candidate area by using a dynamic programming algorithm.
If two (A and A') of the reads have corresponding seed alignment in a certain region of the reference genome sequence, the region is a candidate region of the final alignment position;
aligning the two reads (A and A') to the candidate region respectively by using a dynamic programming algorithm; after the alignment is completed, step 208 is entered;
Step 208: determine whether all of the genome sequence files to be aligned have been aligned; if not, return to step 202; if yes, go to step 209.
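The seed construction and pairing used by the level 3 paired-end step can be sketched as follows. Non-overlapping fixed-length seeds, the function names `make_seeds` and `paired_candidates`, and representing seed hits as plain genome coordinates are assumptions; the description only requires that seeds are substrings of a read and that paired seed hits lie within a distance range.

```python
from bisect import bisect_left, bisect_right


def make_seeds(read: str, seed_len: int) -> list:
    """Cut a read into non-overlapping seeds of a fixed length."""
    return [read[i:i + seed_len] for i in range(0, len(read) - seed_len + 1, seed_len)]


def paired_candidates(hits_a, hits_b, min_dist: int, max_dist: int) -> list:
    """Pair seed hits of read A with seed hits of read A' within the distance range.

    Each returned (pos_a, pos_b) pair marks a candidate region that would then be
    checked with dynamic programming.
    """
    hits_b = sorted(hits_b)
    pairs = []
    for pos_a in sorted(hits_a):
        lo = bisect_left(hits_b, pos_a + min_dist)
        hi = bisect_right(hits_b, pos_a + max_dist)
        pairs.extend((pos_a, pos_b) for pos_b in hits_b[lo:hi])
    return pairs
```

Only regions where both reads have a seed hit at a compatible distance survive, which keeps the expensive dynamic programming step confined to a few candidates.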
Step 209: output the alignment result. Optionally, the output file of genome sequence alignment is a BAM file; BAM is a format for storing genome sequence alignment results, and records the position of each genome sequence in the reference genome sequence and the details of the sequence alignment.
It can be seen from the above embodiments that, in the genome sequence comparison method provided by the present invention, by setting a multi-stage comparison algorithm, after the comparison of the previous-stage algorithm is completed, the next-stage comparison algorithm is used to continue comparing the parts that are not compared, so that the complexity of the algorithm matches the complexity of the data, and each stage of algorithm is optimized, thereby achieving the optimization of the overall algorithm speed. By adopting the genome sequence comparison method provided by the invention, the comparison time of a human whole genome sequence can be shortened to about 4 hours on the premise of ensuring the same resources and the comparison accuracy, the comparison time is obviously shortened compared with the comparison time in the prior art, and the data analysis efficiency is improved.
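The control flow of the three-level cascade can be summarized in a short sketch. The three aligner callables (`level1`, `level2`, `level3`) are hypothetical stand-ins for the bidirectional BWT, single-ended dynamic programming and paired-end dynamic programming algorithms; their signatures, and the use of `None` to mean "not aligned", are assumptions for illustration.

```python
def align_cascade(pairs: dict, level1, level2, level3) -> dict:
    """Run each read pair through the cheapest level that can align it.

    level1(read) -> position or None
    level2(unaligned_read, mate_position) -> position or None
    level3(read_a, read_b) -> (position_a, position_b)
    """
    results = {}
    for pair_id, (read_a, read_b) in pairs.items():
        hit_a, hit_b = level1(read_a), level1(read_b)
        if (hit_a is None) != (hit_b is None):
            # Exactly one read aligned: rescue its mate near the aligned read.
            if hit_a is None:
                hit_a = level2(read_a, hit_b)
            else:
                hit_b = level2(read_b, hit_a)
        elif hit_a is None and hit_b is None:
            # Neither read aligned: fall back to the paired-end algorithm.
            hit_a, hit_b = level3(read_a, read_b)
        results[pair_id] = (hit_a, hit_b)
    return results
```

The more expensive levels see only the reads that the cheaper levels could not place, which is how the cost of the algorithm is matched to the difficulty of the data.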
In view of the above, a second aspect of the embodiments of the present invention provides an embodiment of a genomic data storage device, which can solve the problem of low efficiency caused by frequent input and output of a large number of binary files in a genomic variation detection process. Fig. 4 is a schematic structural diagram of an embodiment of a genome data storage apparatus according to the present invention.
The genomic data storage device comprises:
a creating module 301, configured to obtain gene sequence comparison information and create gene sequence statistical information in a genome comparison process;
the comparison information storage module 302 is configured to store the gene sequence comparison information in a magnetic disk, and store a corresponding index in an internal memory according to a comparison position of the gene sequence comparison information in a genome; the index is the storage position of the gene sequence comparison information in a magnetic disk;
a statistical information classification module 303, configured to classify the genomic statistical information to obtain first statistical information and second statistical information;
a statistical information storage module 304, configured to store first statistical information in a memory, where the first statistical information is statistical information in which an access frequency is higher than a preset frequency in a variation detection process; and storing second statistical information in the magnetic disk, wherein the second statistical information is statistical information which cannot be stored in the memory and/or statistical information of which the access frequency is lower than the preset frequency in the variation detection process.
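The split performed by the classification and storage modules can be sketched as a simple placement policy. The rule itself (frequently accessed statistics stay in memory; statistics that do not fit or are rarely accessed go to disk) follows the description; the function name `classify_statistics`, the greedy ordering by access frequency and the explicit memory budget are assumptions.

```python
def classify_statistics(access_freq: dict, sizes: dict, threshold: int,
                        memory_budget: int) -> tuple:
    """Split statistics into (in_memory, on_disk) lists.

    A statistic stays in memory only if its access frequency exceeds the preset
    threshold and it still fits in the remaining memory budget.
    """
    in_memory, on_disk, used = [], [], 0
    for name in sorted(access_freq, key=access_freq.get, reverse=True):
        fits = used + sizes[name] <= memory_budget
        if access_freq[name] > threshold and fits:
            in_memory.append(name)
            used += sizes[name]
        else:
            on_disk.append(name)
    return in_memory, on_disk
```

In this sketch a frequently used but oversized statistic is sent to disk just like a rarely used one, mirroring the "cannot be stored in the memory and/or" wording above.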
In some alternative embodiments, the first statistical information comprises base weighted quality value statistics, positive and negative strand statistics, indel statistics, and soft-clip statistics.
In some alternative embodiments, for a site where no indel or soft clip has occurred and at most 2 base types have occurred, the first statistical information of the site is stored using a first data structure;
the first data structure, comprising:
a first header for indicating a base type;
a first quality value storage unit for storing a base weight quality value;
a first positive strand number storage unit for indicating the number of positive strands;
a first minus-strand number storage unit for indicating the number of minus strands.
In some alternative embodiments, for a site where an indel occurs and 3-4 base types have occurred, first statistical information for the site is stored using a first data structure and a second data structure;
the second data structure, comprising:
statistical information of base weight quality values and statistical information of positive and negative strands for each of the 4 base types; for each base type, the storage structure specifically comprises: a second quality value storage unit for indicating a base weight quality value, a second positive strand number storage unit for indicating the number of positive strands, and a second negative strand number storage unit for indicating the number of negative strands;
the first insertion statistical information specifically includes: a first insertion sequence storage unit for storing an insertion sequence, and a first low-quality insertion number storage unit for storing a low-quality insertion number;
the first deletion statistical information specifically includes: a first deletion length storage unit for indicating a deletion length, a first high-quality deletion number storage unit for indicating the number of high-quality deletions, and a first low-quality deletion number storage unit for indicating the number of low-quality deletions;
the first data structure, comprising:
a second head filled with 11;
the first insertion information storage unit for indicating whether or not there is an insertion, specifically includes: a first insertion information sub-storage unit for indicating whether or not there is an insertion, an insertion length sub-storage unit for indicating an insertion length, and a low quality insertion number sub-storage unit for indicating a low quality insertion number;
a first deletion information storage unit for indicating whether or not there is a deletion;
a pointer to a corresponding second data structure storage location.
In some alternative embodiments, for a site where more than 1 indel occurs and the insertion length is greater than 12 bases, the first statistical information of the site is stored using the first data structure and the third data structure, and for the first statistical information of such site, a memory pool is created in the memory for storage;
the third data structure, comprising:
statistical information of base weight quality values and statistical information of positive and negative strands for each of the 4 base types; for each base type, the storage structure specifically comprises: a third quality value storage unit for indicating a base weight quality value, a third positive strand number storage unit for indicating the number of positive strands, and a third negative strand number storage unit for indicating the number of negative strands;
the second insertion statistical information specifically includes: an insertion length storage unit for indicating an insertion length, a second insertion sequence storage unit for indicating an insertion sequence, a second low-quality insertion number storage unit for indicating a low-quality insertion number, and a high-quality insertion number storage unit for indicating a high-quality insertion number;
the second deletion statistical information specifically includes: a second deletion length storage unit for indicating the deletion length, a second high-quality deletion number storage unit for indicating the number of high-quality deletions, and a second low-quality deletion number storage unit for indicating the number of low-quality deletions;
the first data structure, comprising:
a third head filled with 11;
the second insertion information storage unit for indicating whether or not there is an insertion, specifically includes: a second insertion information sub-storage unit for indicating whether or not there is an insertion, a first memory pool information sub-storage unit for indicating whether or not a memory pool is used, and a first occupied length sub-storage unit for indicating an occupied length in the memory pool;
the second deletion information storage unit for indicating whether or not there is a deletion specifically includes: a second deletion information sub-storage unit for indicating whether there is a deletion, a second memory pool information sub-storage unit for indicating whether the memory pool is used, and a second occupied length sub-storage unit for indicating the occupied length in the memory pool.
In some optional embodiments, for the soft-clip statistics, a dynamic array is used for recording, each record including:
a soft-clip position storage unit for storing a position of soft clip on the genome;
a soft-clipping left-side number storage section for indicating the number of times soft clipping occurs to the left of the corresponding site;
a soft-clip right-side number storage for indicating the number of times soft-clip occurs to the right of the corresponding site.
In some alternative embodiments, the index comprises a double-ended alignment information index and a single-ended alignment information index;
for the double-end comparison information index, a double-end comparison array structure is adopted for storage, and the double-end comparison array structure comprises:
a first ID storage unit for storing an ID representing a gene sequence;
a first alignment position storage unit for storing a position at which a gene sequence is aligned on a genome;
an insert length storage unit for storing an insert length of a gene sequence;
a first comparison quality value storage unit for storing a comparison quality value indicating a gene sequence;
a first average quality value storage unit for storing an average quality value of a gene sequence;
for the single-ended comparison information index, storing by adopting a single-ended comparison array structure, wherein the single-ended comparison array structure comprises:
a second ID storage unit for storing an ID representing a gene sequence;
a second alignment position storage unit for storing a position of the gene sequence aligned on the genome;
a second alignment quality value storage unit for storing an alignment quality value representing a gene sequence;
a second average quality value storage unit for storing an average quality value of the gene sequence;
wherein, for each gene sequence for comparison, the corresponding indexes are arranged in sequence according to the comparison position of the gene sequence on the genome.
In some alternative embodiments, storing the gene sequence alignment information in a disk specifically comprises:
dividing the gene sequence alignment information into 512 files stored on the disk, each file storing the gene sequence alignment information of a certain genome interval, wherein the storage data structure of each piece of gene sequence alignment information comprises:
a sequence length storage unit for storing a sequence length of a gene sequence;
a sequence storage unit for storing a gene sequence;
a quality value storage unit for storing a quality value representing a gene sequence;
a start position storage unit for storing a start position of an alignment algorithm used for aligning gene sequences;
a positive/negative chain storage unit for storing positive/negative chain information of gene sequences during alignment;
a region length storage unit for storing the length of a genomic region selected when gene sequences are aligned;
a left position storage unit for storing the left anchor position of the gene sequence during alignment;
a right position storage unit for storing the right anchor position of the gene sequence during alignment.
In view of the above object, according to a third aspect of the embodiments of the present invention, an embodiment of an apparatus for performing the method for storing genome data is provided. Fig. 5 is a schematic diagram of a hardware structure of an embodiment of the apparatus for performing the genomic data storage method according to the present invention.
As shown in fig. 5, the apparatus includes:
one or more processors 401 and a memory 402, one processor 401 being exemplified in fig. 5.
The apparatus for performing the genomic data storage method may further include: an input device 403 and an output device 404.
The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 5 illustrates an example of a connection by a bus.
The memory 402, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the genomic data storage method in the embodiment of the present application (for example, the creating module 301, the comparison information storage module 302, the statistical information classification module 303, and the statistical information storage module 304 shown in fig. 4). The processor 401 executes various functional applications of the server and data processing by running the nonvolatile software programs, instructions and modules stored in the memory 402, that is, implements the genome data storage method of the above-described method embodiment.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created according to the use of the genomic data storage device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, which may be connected to the genomic data storage device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the genome data storage device. The output device 404 may include a display device such as a display screen.
The one or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the method of genomic data storage in any of the method embodiments described above. The technical effect of the embodiment of the device for executing the genome data storage method is the same as or similar to that of any method embodiment.
The embodiment of the present application further provides a non-transitory computer storage medium, where the computer storage medium stores computer executable instructions, and the computer executable instructions may execute the genome data storage method in any of the method embodiments described above. Embodiments of the non-transitory computer storage medium may be the same or similar in technical effect to any of the method embodiments described above.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program that can be stored in a computer-readable storage medium and that, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The technical effect of the embodiment of the computer program is the same as or similar to that of any of the method embodiments described above.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the invention, features of the above embodiments or of different embodiments may also be combined, steps may be implemented in any order, and many other variations of the different aspects of the invention described above exist, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (8)

1. A method of genomic data storage, comprising:
in the comparison process, obtaining gene sequence comparison information and creating gene sequence statistical information;
storing the gene sequence comparison information in a magnetic disk, and storing corresponding indexes in an internal memory according to the comparison position of the gene sequence comparison information in a genome; the index is the storage position of the gene sequence comparison information in a magnetic disk;
classifying the genome statistical information to obtain first statistical information and second statistical information;
storing first statistical information in a memory, wherein the first statistical information is statistical information of which the access frequency is higher than a preset frequency in a variation detection process;
storing second statistical information in a magnetic disk, wherein the second statistical information is statistical information which cannot be stored in an internal memory and/or statistical information of which the access frequency is lower than a preset frequency in a variation detection process;
the first statistical information comprises statistical information of base weight quality values, statistical information of positive and negative strands, statistical information of insertions and deletions (indels), and statistical information of soft clipping;
for a site which has no indel and no soft clipping and has at most 2 base types, storing first statistical information of the site by adopting a first data structure;
the first data structure, comprising:
a first header for indicating a base type;
a first quality value storage unit for storing a base weight quality value;
a first positive strand number storage unit for indicating the number of positive strands;
a first negative strand number storage unit for indicating the number of negative strands.
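As an illustration only, the first data structure of claim 1 might be sketched in Python as follows. The field widths (one byte for the header, four bytes for the accumulated quality value, two bytes for each strand count), the base-to-code mapping, and the names `SimpleSiteRecord`, `add_read`, `pack` and `unpack` are assumptions made for this sketch; the claim does not fix any of them.

```python
import struct
from dataclasses import dataclass

# Hypothetical base-type codes for the header; the claim does not define them.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

# Assumed fixed layout: header (1 byte), quality sum (4 bytes),
# positive-strand count (2 bytes), negative-strand count (2 bytes).
_LAYOUT = struct.Struct("<BIHH")


@dataclass
class SimpleSiteRecord:
    """Statistics for one base type at a site with no indel or soft clipping."""
    base: str
    quality_sum: int = 0
    forward: int = 0
    reverse: int = 0

    def add_read(self, quality: int, is_forward: bool) -> None:
        # Accumulate the quality value and bump the matching strand counter.
        self.quality_sum += quality
        if is_forward:
            self.forward += 1
        else:
            self.reverse += 1

    def pack(self) -> bytes:
        # Serialise the four fields into the fixed 9-byte layout.
        return _LAYOUT.pack(BASE_CODES[self.base], self.quality_sum,
                            self.forward, self.reverse)

    @classmethod
    def unpack(cls, raw: bytes) -> "SimpleSiteRecord":
        code, quality_sum, forward, reverse = _LAYOUT.unpack(raw)
        base = next(b for b, c in BASE_CODES.items() if c == code)
        return cls(base, quality_sum, forward, reverse)
```

Under these assumptions, a site with two base types would simply hold two such records, one per observed base type.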
2. The method of claim 1, wherein for a site where an indel occurs and 3 to 4 base types occur, first statistical information of the site is stored using a first data structure and a second data structure;
the second data structure, comprising:
statistical information of base weight quality values and statistical information of positive and negative strands of the 4 base types respectively; the storage structure of the statistical information of the base weight quality value and the statistical information of the positive and negative strands of each base type specifically comprises: a second quality value storage unit for indicating a base weight quality value, a second positive strand number storage unit for indicating the number of positive strands, and a second negative strand number storage unit for indicating the number of negative strands;
the first insertion statistical information specifically includes: a first insertion sequence storage unit for storing an insertion sequence, and a first low-quality insertion number storage unit for storing a low-quality insertion number;
the first deletion statistical information specifically includes: a first deletion length storage unit for indicating a deletion length, a first high-quality deletion number storage unit for indicating the number of high-quality deletions, and a first low-quality deletion number storage unit for indicating the number of low-quality deletions;
the first data structure, comprising:
a second header filled with 11;
the first insertion information storage unit for indicating whether or not there is an insertion, specifically includes: a first insertion information sub-storage unit for indicating whether or not there is an insertion, an insertion length sub-storage unit for indicating an insertion length, and a low quality insertion number sub-storage unit for indicating a low quality insertion number;
the first deletion information storage unit for indicating whether there is a deletion specifically includes: a first deletion information sub-storage unit for indicating whether or not there is a deletion;
a pointer to a corresponding second data structure storage location.
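The relationship between the two structures of claim 2 can be pictured with the following Python sketch, offered under stated assumptions: the "pointer" is modelled as an index into a list, and the names `BaseStats`, `ExtendedSite`, `CompactEntry` and `SiteStore` are hypothetical. Only the header value 11 (binary) and the listed fields come from the claim; the deletion ("missing") fields are named `deletion_*` here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

EXTENDED_MARKER = 0b11  # header value "11" named in the claim


@dataclass
class BaseStats:
    quality_sum: int = 0
    forward: int = 0
    reverse: int = 0


@dataclass
class ExtendedSite:
    """Sketch of the second data structure: per-base statistics plus indel statistics."""
    per_base: Dict[str, BaseStats] = field(
        default_factory=lambda: {b: BaseStats() for b in "ACGT"})
    insertion_sequence: str = ""
    low_quality_insertions: int = 0
    deletion_length: int = 0
    high_quality_deletions: int = 0
    low_quality_deletions: int = 0


@dataclass
class CompactEntry:
    """Sketch of the first data structure when it refers to an extended record."""
    header: int
    has_insertion: bool
    insertion_length: int
    low_quality_insertions: int
    has_deletion: bool
    pointer: int  # assumed to be an index into SiteStore.extended


class SiteStore:
    def __init__(self) -> None:
        self.extended: List[ExtendedSite] = []

    def add_extended(self, site: ExtendedSite) -> CompactEntry:
        # Store the large record once and hand back a small entry that points to it.
        self.extended.append(site)
        return CompactEntry(
            header=EXTENDED_MARKER,
            has_insertion=bool(site.insertion_sequence),
            insertion_length=len(site.insertion_sequence),
            low_quality_insertions=site.low_quality_insertions,
            has_deletion=site.deletion_length > 0,
            pointer=len(self.extended) - 1,
        )

    def resolve(self, entry: CompactEntry) -> ExtendedSite:
        if entry.header != EXTENDED_MARKER:
            raise ValueError("entry does not reference an extended record")
        return self.extended[entry.pointer]
```

The design intent illustrated here is that common sites stay small, while the rarer complex sites pay for the larger record only when needed.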
3. The method according to claim 1, wherein for a site where more than 1 indel occurs and the length of the insertion is greater than 12 bases, the first statistical information of the site is stored using a first data structure and a third data structure, and for the first statistical information of such a site, a memory pool is created in the memory for storage;
the third data structure, comprising:
statistical information of base weight quality values and statistical information of positive and negative strands of the 4 base types respectively; the storage structure of the statistical information of the base weight quality value and the statistical information of the positive and negative strands of each base type specifically comprises: a third quality value storage unit for indicating a base weight quality value, a third positive strand number storage unit for indicating the number of positive strands, and a third negative strand number storage unit for indicating the number of negative strands;
the second insertion statistical information specifically includes: an insertion length storage unit for indicating an insertion length, a second insertion sequence storage unit for indicating an insertion sequence, a second low-quality insertion number storage unit for indicating a low-quality insertion number, and a high-quality insertion number storage unit for indicating a high-quality insertion number;
the second deletion statistical information specifically includes: a second deletion length storage unit for indicating the deletion length, a second high-quality deletion number storage unit for indicating the number of high-quality deletions, and a second low-quality deletion number storage unit for indicating the number of low-quality deletions;
the first data structure, comprising:
a third header filled with 11;
the second insertion information storage unit for indicating whether or not there is an insertion, specifically includes: a second insertion information sub-storage unit for indicating whether or not there is an insertion, a first memory pool information sub-storage unit for indicating whether or not a memory pool is used, and a first occupied length sub-storage unit for indicating an occupied length in the memory pool;
the second deletion information storage unit for indicating whether or not there is a deletion specifically includes: a second deletion information sub-storage unit for indicating whether there is a deletion, a second memory pool information sub-storage unit for indicating whether the memory pool is used, and a second occupied length sub-storage unit for indicating the occupied length in the memory pool.
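A memory pool of the kind referred to in claim 3 could, as one possibility, be a pre-allocated buffer with bump allocation. The sketch below assumes that design; the names `MemoryPool` and `store_insertion`, the recorded `offset`, and the dictionary layout are hypothetical. The 12-base threshold and the "pool used" and "occupied length" fields follow the claim.

```python
from typing import Dict, Tuple

INLINE_LIMIT = 12  # insertions longer than this many bases go to the pool


class MemoryPool:
    """Pre-allocated buffer handing out consecutive slices (bump allocation)."""

    def __init__(self, capacity: int) -> None:
        self._buf = bytearray(capacity)
        self._used = 0

    def store(self, payload: bytes) -> Tuple[int, int]:
        end = self._used + len(payload)
        if end > len(self._buf):
            raise MemoryError("memory pool exhausted")
        offset = self._used
        self._buf[offset:end] = payload
        self._used = end
        return offset, len(payload)  # (where it lives, occupied length)

    def load(self, offset: int, length: int) -> bytes:
        return bytes(self._buf[offset:offset + length])


def store_insertion(pool: MemoryPool, sequence: str) -> Dict[str, object]:
    # Short insertions stay inline; long ones are placed in the pool and
    # the entry only records that the pool is used and how much it occupies.
    if len(sequence) <= INLINE_LIMIT:
        return {"uses_pool": False, "inline": sequence, "length": len(sequence)}
    offset, length = pool.store(sequence.encode("ascii"))
    return {"uses_pool": True, "offset": offset, "length": length}
```

Allocating from one contiguous pool avoids a separate allocation per long insertion, which is presumably the motivation for creating the pool in memory.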
4. The method of claim 1, wherein the statistical information of soft clipping is recorded in a dynamic array, each record comprising:
a soft-clip position storage unit for storing a position of soft clip on the genome;
a soft-clip left-side number storage unit for indicating the number of times soft clipping occurs to the left of the corresponding site;
a soft-clip right-side number storage unit for indicating the number of times soft clipping occurs to the right of the corresponding site.
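The soft-clipping records of claim 4 map naturally onto a growable array. The following Python sketch keeps the array sorted by genome position so that lookups can use binary search; the sorted order, the class name `SoftClipTable` and its methods are assumptions of this sketch rather than features stated in the claim.

```python
import bisect
from typing import List, Tuple


class SoftClipTable:
    """Dynamic array of (position, left count, right count) records."""

    def __init__(self) -> None:
        self._positions: List[int] = []      # genome positions, kept sorted
        self._counts: List[List[int]] = []   # parallel [left, right] counters

    def record(self, position: int, side: str) -> None:
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        i = bisect.bisect_left(self._positions, position)
        # Grow the array only when this position has not been seen before.
        if i == len(self._positions) or self._positions[i] != position:
            self._positions.insert(i, position)
            self._counts.insert(i, [0, 0])
        self._counts[i][0 if side == "left" else 1] += 1

    def get(self, position: int) -> Tuple[int, int]:
        i = bisect.bisect_left(self._positions, position)
        if i < len(self._positions) and self._positions[i] == position:
            return self._counts[i][0], self._counts[i][1]
        return 0, 0
```

Because soft clipping affects only a small fraction of sites, storing records only for positions where it occurs keeps this table small compared with a per-site array.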
5. The method of claim 1, wherein the index comprises a double-ended alignment information index and a single-ended alignment information index;
for the double-end comparison information index, a double-end comparison array structure is adopted for storage, and the double-end comparison array structure comprises:
a first ID storage unit for storing an ID representing a gene sequence;
a first alignment position storage unit for storing a position at which a gene sequence is aligned on a genome;
an insert length storage unit for storing an insert length of a gene sequence;
a first comparison quality value storage unit for storing a comparison quality value indicating a gene sequence;
a first average quality value storage unit for storing an average quality value of a gene sequence;
for the single-ended comparison information index, storing by adopting a single-ended comparison array structure, wherein the single-ended comparison array structure comprises:
a second ID storage unit for storing an ID representing a gene sequence;
a second alignment position storage unit for storing a position of the gene sequence aligned on the genome;
a second alignment quality value storage unit for storing an alignment quality value representing a gene sequence;
a second average quality value storage unit for storing an average quality value of the gene sequence;
wherein, for each gene sequence for comparison, the corresponding indexes are arranged in sequence according to the comparison position of the gene sequence on the genome.
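The two index arrays of claim 5, ordered by alignment position, can be sketched as follows. The field names, the tuple types `PairedIndexEntry` and `SingleIndexEntry`, and the helper `reads_in_window` are hypothetical; the sketch only illustrates why keeping the entries in position order makes a range lookup cheap.

```python
import bisect
from typing import List, NamedTuple, Sequence


class PairedIndexEntry(NamedTuple):
    read_id: int
    position: int         # alignment position on the genome
    insert_length: int
    mapping_quality: int  # alignment quality value
    mean_quality: int     # average quality value


class SingleIndexEntry(NamedTuple):
    read_id: int
    position: int
    mapping_quality: int
    mean_quality: int


def build_index(entries: Sequence) -> List:
    # Arrange the entries in order of alignment position.
    return sorted(entries, key=lambda e: e.position)


def reads_in_window(index: Sequence, start: int, end: int) -> List:
    """Return entries whose position lies in the half-open window [start, end)."""
    positions = [e.position for e in index]
    lo = bisect.bisect_left(positions, start)
    hi = bisect.bisect_left(positions, end)
    return list(index[lo:hi])
```

In a fuller implementation the position list would be kept alongside the index instead of being rebuilt on every query; it is rebuilt here only to keep the sketch short.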
6. The method of claim 1, wherein storing the gene sequence alignment information in a disk comprises:
dividing the gene sequence comparison information into 512 files and storing the files in a disk, wherein each file stores the gene sequence comparison information of a certain genome interval, and the storage data structure of each piece of gene sequence comparison information comprises:
a sequence length storage unit for storing a sequence length of a gene sequence;
a sequence storage unit for storing a gene sequence;
a quality value storage unit for storing a quality value representing a gene sequence;
a start position storage unit for storing a start position of an alignment algorithm used for aligning gene sequences;
a positive/negative chain storage unit for storing positive/negative chain information of gene sequences during alignment;
a region length storage unit for storing the length of a genomic region selected when gene sequences are aligned;
a left position storage unit for storing the left anchor position of the gene sequence during alignment;
a right position storage unit for storing the right anchor position of the gene sequence during alignment.
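One way to assign alignment records to the 512 files of claim 6 is to split the genome into equal-width intervals. The sketch below assumes equal widths, a single global coordinate over the whole genome, and a hypothetical file naming scheme; none of these details is specified in the claim.

```python
NUM_FILES = 512  # number of files named in the claim


def file_index(global_position: int, genome_length: int,
               num_files: int = NUM_FILES) -> int:
    """Map a 0-based global genome position to the file covering its interval."""
    if not 0 <= global_position < genome_length:
        raise ValueError("position outside the genome")
    interval = -(-genome_length // num_files)  # ceiling division
    return global_position // interval


def file_name(index: int) -> str:
    # Hypothetical naming scheme, used only for illustration.
    return f"align_{index:03d}.bin"
```

With this mapping, all records aligned to the same genome interval land in the same file, so a later stage that works on one interval needs to open only one file.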
7. The method of any one of claims 1-6, further comprising:
subtracting interference caused by repeated sequences in the genome statistical information in a de-duplication process;
and/or,
in a re-alignment process, extracting the gene sequences of the re-alignment regions of the genome, and, after aligning the gene sequences of those regions again, adjusting the genome statistical information of the gene sequences of those regions.
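The subtraction described for the de-duplication process can be illustrated as applying a read's contribution with a negative sign. The sketch below assumes per-site counters held in a dictionary and reads that align without gaps; the names `new_stats`, `apply_read` and `remove_duplicate` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def new_stats() -> Dict[int, Dict[str, int]]:
    # One counter set per genome position, created on first use.
    return defaultdict(lambda: {"quality_sum": 0, "forward": 0, "reverse": 0})


def apply_read(stats: Dict[int, Dict[str, int]], start: int,
               qualities: List[int], is_forward: bool, sign: int = 1) -> None:
    """Add (sign=1) or subtract (sign=-1) one read's contribution per site."""
    strand = "forward" if is_forward else "reverse"
    for offset, quality in enumerate(qualities):
        site = stats[start + offset]
        site["quality_sum"] += sign * quality
        site[strand] += sign


def remove_duplicate(stats: Dict[int, Dict[str, int]], start: int,
                     qualities: List[int], is_forward: bool) -> None:
    # A read later found to be a duplicate is backed out of the statistics.
    apply_read(stats, start, qualities, is_forward, sign=-1)
```

The same add-then-subtract pattern would serve the re-alignment case: subtract the read at its old position, then apply it at the new one.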
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of claims 1-7.
CN201710546293.7A 2017-07-06 2017-07-06 Genome data storage method and electronic equipment Active CN107480466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710546293.7A CN107480466B (en) 2017-07-06 2017-07-06 Genome data storage method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710546293.7A CN107480466B (en) 2017-07-06 2017-07-06 Genome data storage method and electronic equipment

Publications (2)

Publication Number Publication Date
CN107480466A CN107480466A (en) 2017-12-15
CN107480466B CN107480466B (en) 2020-08-11

Family

ID=60595629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710546293.7A Active CN107480466B (en) 2017-07-06 2017-07-06 Genome data storage method and electronic equipment

Country Status (1)

Country Link
CN (1) CN107480466B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197433A (en) * 2017-12-29 2018-06-22 厦门极元科技有限公司 Datarams and hard disk the shunting storage method of rapid DNA sequencing data analysis platform
CN108920902A (en) * 2018-06-29 2018-11-30 郑州云海信息技术有限公司 A kind of gene order processing method and its relevant device
CN110879782B (en) * 2019-11-08 2022-06-17 浪潮电子信息产业股份有限公司 Method, device, equipment and medium for testing gene comparison software
CN111081314A (en) * 2019-12-13 2020-04-28 北京市商汤科技开发有限公司 Method and apparatus for identifying genetic variation, electronic device, and storage medium
CN112270959A (en) * 2020-10-22 2021-01-26 深圳华大基因科技服务有限公司 Shared memory-based gene analysis method and device and computer equipment
CN115602246B (en) * 2022-10-31 2023-06-20 哈尔滨工业大学 Sequence alignment method based on group genome

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103201744A (en) * 2010-10-13 2013-07-10 考利达基因组股份有限公司 Methods for estimating genome-wide copy number variations
CN104361264A (en) * 2014-12-11 2015-02-18 天津工业大学 Quick counting method for quantity of nucleic acid fragments of genome
CN106202991A (en) * 2016-06-30 2016-12-07 厦门艾德生物医药科技股份有限公司 The detection method of abrupt information in a kind of genome multiplex amplification order-checking product

Also Published As

Publication number Publication date
CN107480466A (en) 2017-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 1002-1, 10th floor, No.56, Beisihuan West Road, Haidian District, Beijing 100080

Patentee after: Ronglian Technology Group Co., Ltd

Address before: 100080, Beijing, Haidian District, No. 56 West Fourth Ring Road, glorious Times Building, 10, 1002-1

Patentee before: UNITED ELECTRONICS Co.,Ltd.
