CN107480466B

CN107480466B - Genome data storage method and electronic equipment

Info

Publication number: CN107480466B
Application number: CN201710546293.7A
Authority: CN
Inventors: 蔡文君; 何光铸; 王东辉; 孔令雪
Original assignee: United Electronics Co ltd
Current assignee: Ronglian Technology Group Co., Ltd
Priority date: 2017-07-06
Filing date: 2017-07-06
Publication date: 2020-08-11
Anticipated expiration: 2037-07-06
Also published as: CN107480466A

Abstract

The invention discloses a genome data storage method, which comprises the following steps: in the process of genome comparison, obtaining gene sequence comparison information and creating gene sequence statistical information; storing the gene sequence comparison information in a magnetic disk, and storing corresponding indexes in an internal memory according to the comparison position of the gene sequence comparison information in a genome; the index is the storage position of the gene sequence comparison information in a magnetic disk; classifying the genome statistical information to obtain first statistical information and second statistical information; storing first statistical information in a memory, wherein the first statistical information is statistical information of which the access frequency is higher than a preset frequency in a variation detection process; and storing second statistical information in the magnetic disk, wherein the second statistical information is statistical information which cannot be stored in the memory and/or statistical information of which the access frequency is lower than the preset frequency in the variation detection process. The invention also discloses electronic equipment adopting the genome data storage method.

Description

Genome data storage method and electronic equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a genome data storage method and an electronic device.

Background

The genome variation detection workflow generally comprises the steps of comparison (alignment), sorting, de-duplication, re-comparison, variation detection, filtering and the like. Each main step writes its output to the hard disk as a BAM file (the binary form of SAM, the Sequence Alignment/Map format); the next step then reads that file from the hard disk into the memory before processing it.

In the process of implementing the invention, the inventor finds that the prior art has the following problems:

In human genome data analysis, the raw data is typically about 100GB, and the main analysis steps in between need to read and write files of hundreds of GB, so the whole calculation consumes a large amount of I/O resources and the program runs inefficiently.

The inventors have found that the main causes of the problem are:

1. The intermediate file is too large to be placed directly into the memory.

A typical machine configuration for common bioinformatics analysis has 64GB of memory. For human whole-genome analysis data, the intermediate result is generally about 100GB and cannot be held in the memory as a whole; moreover, the variation detection process needs to load a reference sequence and an index file into the memory, which further reduces the space available for intermediate results.

2. The format of the intermediate file cannot be directly used for calculation.

The common intermediate file format is SAM/BAM, a row-oriented record format in which each line stores one record; such records cannot be used directly for calculation even when they are loaded into the memory. The data that variation detection mainly needs is per-site statistical information about the comparison results, including the distribution of the counts of each base type at each site, insertion/deletion (InDel) sequences and their frequencies, soft-clipping sequences in the comparison, and the like.

Disclosure of Invention

In view of the above, the present invention provides a method for storing genome data and an electronic device, which can solve the problem of low efficiency caused by the need of frequently inputting and outputting a large number of binary files in the process of detecting genome variation.

The present invention provides a genome data storage method based on the above object, comprising:

in the comparison process, obtaining gene sequence comparison information and creating gene sequence statistical information;

storing the gene sequence comparison information in a magnetic disk, and storing corresponding indexes in an internal memory according to the comparison position of the gene sequence comparison information in a genome; the index is the storage position of the gene sequence comparison information in a magnetic disk;

classifying the gene sequence statistical information to obtain first statistical information and second statistical information;

storing first statistical information in a memory, wherein the first statistical information is statistical information of which the access frequency is higher than a preset frequency in a variation detection process;

and storing second statistical information in the magnetic disk, wherein the second statistical information is statistical information which cannot be stored in the memory and/or statistical information of which the access frequency is lower than the preset frequency in the variation detection process.

Optionally, the first statistical information includes base weighted quality value statistical information, positive and negative strand statistical information, insertion/deletion (InDel) statistical information, and soft-clipping statistical information.

Optionally, for a site where no insertion/deletion or soft clipping occurs and at most 2 base types occur, the first statistical information of the site is stored by using a first data structure;

the first data structure, comprising:

a first header for indicating a base type;

a first quality value storage unit for storing a base weight quality value;

a first positive strand number storage unit for indicating the number of positive strands;

a first minus-strand number storage unit for indicating the number of minus strands.

Optionally, for a site with an insertion deletion and 3-4 base types, storing first statistical information of the site by using a first data structure and a second data structure;

the second data structure, comprising:

statistical information of base weighted quality values and statistical information of positive and negative strands for each of the 4 base types; the storage structure for each base type specifically comprises: a second quality value storage unit for indicating a base weighted quality value, a second positive strand number storage unit for indicating the number of positive strands, and a second negative strand number storage unit for indicating the number of negative strands;

the first insertion statistical information specifically includes: a first insertion sequence storage unit for storing an insertion sequence, a first low-quality insertion number storage unit for storing a low-quality insertion number;

the first deletion statistical information specifically includes: a first deletion length storage unit for indicating a deletion length, a first high-quality deletion number storage unit for indicating the number of high-quality deletions, and a first low-quality deletion number storage unit for indicating the number of low-quality deletions;

the first data structure, comprising:

a second header filled with 11;

the first insertion information storage unit for indicating whether or not there is an insertion, specifically includes: a first insertion information sub-storage unit for indicating whether or not there is an insertion, an insertion length sub-storage unit for indicating an insertion length, and a low quality insertion number sub-storage unit for indicating a low quality insertion number;

the first deletion information storage unit for indicating whether or not there is a deletion includes: a first deletion information sub-storage unit for indicating whether or not there is a deletion;

a pointer to a corresponding second data structure storage location.

Optionally, for a site with more than 1 insertion deletion and an insertion length greater than 12 bases, the first statistical information of the site is stored by using a first data structure and a third data structure, and for the first statistical information of such a site, a memory pool is created in the memory for storage;

the third data structure, comprising:

statistical information of base weight quality values and statistical information of positive and negative chains of the 4 base types respectively; the storage structure of the statistical information of the base weight quality value and the statistical information of the positive and negative chains of each base type specifically comprises the following steps: a third quality value storage unit for indicating a base weight quality value, a third positive strand number storage unit for indicating the number of positive strands, and a third negative strand number storage unit for indicating the number of negative strands;

the second insertion statistical information specifically includes: an insertion length storage section for indicating an insertion length, a second insertion sequence storage section for indicating an insertion sequence, a second low-quality insertion number storage section for indicating a low-quality insertion number, and a high-quality insertion number storage section for indicating a high-quality insertion number;

the second deletion statistical information specifically includes: a second deletion length storage unit for indicating the deletion length, a second high-quality deletion number storage unit for indicating the number of high-quality deletions, and a second low-quality deletion number storage unit for indicating the number of low-quality deletions;

the first data structure, comprising:

a third header filled with 11;

the second insertion information storage unit for indicating whether or not there is an insertion, specifically includes: a second insertion information sub-storage unit for indicating whether or not there is an insertion, a first memory pool information sub-storage unit for indicating whether or not a memory pool is used, and a first occupied length sub-storage unit for indicating an occupied length in the memory pool;

the second deletion information storage unit for indicating whether or not there is a deletion specifically includes: a second deletion information sub-storage unit for indicating whether or not there is a deletion, a second memory pool information sub-storage unit for indicating whether or not the memory pool is used, and a second occupied length sub-storage unit for indicating the occupied length in the memory pool.

Optionally, for the soft-clip statistical information, a dynamic array is used to record, and each record includes:

a soft-clip position storage unit for storing a position of soft clip on the genome;

a soft-clipping left-side number storage section for indicating the number of times soft clipping occurs to the left of the corresponding site;

a soft-clip right-side number storage for indicating the number of times soft-clip occurs to the right of the corresponding site.

Optionally, the index includes a double-ended comparison information index and a single-ended comparison information index;

for the double-end comparison information index, a double-end comparison array structure is adopted for storage, and the double-end comparison array structure comprises:

a first ID storage unit for storing an ID representing a gene sequence;

a first alignment position storage unit for storing a position at which a gene sequence is aligned on a genome;

an insert length storage unit for storing an insert length of a gene sequence;

a first comparison quality value storage unit for storing a comparison quality value indicating a gene sequence;

a first average quality value storage unit for storing an average quality value of a gene sequence;

for the single-ended comparison information index, storing by adopting a single-ended comparison array structure, wherein the single-ended comparison array structure comprises:

a second ID storage unit for storing an ID representing a gene sequence;

a second alignment position storage unit for storing a position of the gene sequence aligned on the genome;

a second alignment quality value storage unit for storing an alignment quality value representing a gene sequence;

a second average quality value storage unit for storing an average quality value of the gene sequence;

wherein, for each gene sequence for comparison, the corresponding indexes are arranged in sequence according to the comparison position of the gene sequence on the genome.

Optionally, storing the gene sequence alignment information in a disk specifically includes:

dividing the gene sequence comparison information into 512 files and storing them in a disk, wherein each file stores the gene sequence comparison information of a certain genome interval, and the storage data structure of each piece of gene sequence comparison information comprises:

a sequence length storage unit for storing a sequence length of a gene sequence;

a sequence storage unit for storing a gene sequence;

a quality value storage unit for storing a quality value representing a gene sequence;

a start position storage unit for storing a start position of an alignment algorithm used for aligning gene sequences;

a positive/negative chain storage unit for storing positive/negative chain information of gene sequences during alignment;

a region length storage unit for storing the length of a genomic region selected when gene sequences are aligned;

a left position storage unit for storing the left anchor position of the gene sequence during comparison;

a right position storage unit for storing the right anchor position of the gene sequence during comparison.

Optionally, the method further includes:

in a de-duplication process, subtracting the interference caused by repeated sequences from the gene sequence statistical information;

and/or,

in a re-comparison process, extracting the gene sequences of the genome regions to be re-compared and, after those gene sequences have been compared again, adjusting the gene sequence statistical information of those regions.

From the above, in the genome data storage method and the electronic device provided by the invention, a carefully designed data storage structure is built around the characteristics of the intermediate files in the whole variation detection process. Some main intermediate data are stored in the memory and can be called from the memory directly, so the individual steps of the process no longer need a large amount of disk I/O reading and writing, and the efficiency of the whole variation detection analysis is significantly improved.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of a method for storing genome data provided by the present invention;

FIG. 2a is a schematic representation of the first data structure when no insertion deletion (InDel) or soft clipping occurs and at most 2 base types occur;

FIG. 2b is a schematic representation of the second data structure when an insertion deletion (InDel) occurs and 3-4 base types occur;

FIG. 2c is a schematic representation of the first data structure when an insertion deletion (InDel) occurs and 3-4 base types occur;

FIG. 2d is a schematic representation of the third data structure when more than 1 insertion deletion (InDel) occurs at a site and the insertion length is greater than 12 bases;

FIG. 2e is a schematic diagram of the first data structure when more than 1 insertion deletion (InDel) occurs at a site and the insertion length is greater than 12 bases;

FIG. 2f is a schematic diagram of the dynamic array used to record the soft-clip statistical information;

FIG. 2g is a schematic illustration of the index;

FIG. 2h is a diagram showing a stored data structure of the alignment information of each gene sequence;

FIG. 3 is a schematic flow chart of an embodiment of a method for aligning genomic sequences provided by the present invention;

FIG. 4 is a schematic structural diagram of an embodiment of a genomic data storage device provided by the present invention;

FIG. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are not identical. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; this note is not repeated in the following embodiments.

In view of the above, a first aspect of the embodiments of the present invention provides an embodiment of a method for storing genome data, which can solve the problem of low efficiency caused by frequent input and output of a large number of binary files in a genome variation detection process. Fig. 1 is a schematic flow chart of an embodiment of a genomic data storage method provided by the present invention.

The genome data storage method comprises the following steps:

step 101: in the comparison process, obtaining gene sequence comparison information and creating gene sequence statistical information; the gene sequence comparison information is gene sequence comparison result information generated in the process of genome comparison, and the gene sequence statistical information can be extracted from the gene sequence comparison result information;

step 102: storing the gene sequence comparison information in a magnetic disk, and storing corresponding indexes in an internal memory according to the comparison position of the gene sequence comparison information in a genome; the index is the storage position of the gene sequence comparison information in a magnetic disk;

step 103: classifying the gene sequence statistical information to obtain first statistical information and second statistical information;

step 104: storing the first statistical information in a memory, wherein the first statistical information is statistical information of which the access frequency is higher than a preset frequency in a variation detection process;

step 105: and storing the second statistical information in a magnetic disk, wherein the second statistical information is statistical information which cannot be stored in an internal memory and/or statistical information of which the access frequency is lower than a preset frequency in the variation detection process.
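The classification in steps 103 to 105 can be illustrated with a minimal Python sketch. The `StatRecord` type, the `classify_statistics` function, the way access frequency is measured, and the memory budget parameter are all hypothetical names introduced here for illustration; the source only states that frequently accessed statistics go to memory and that statistics which are rarely accessed or do not fit go to disk.

```python
from dataclasses import dataclass


@dataclass
class StatRecord:
    name: str
    access_frequency: float  # hypothetical measure of accesses during variation detection
    size_bytes: int


def classify_statistics(records, preset_frequency, memory_budget_bytes):
    """Split records into (first, second): first stays in memory, second goes to disk."""
    first, second = [], []
    used = 0
    # Visit the most frequently accessed records first so they claim memory first.
    for rec in sorted(records, key=lambda r: r.access_frequency, reverse=True):
        fits = used + rec.size_bytes <= memory_budget_bytes
        if rec.access_frequency > preset_frequency and fits:
            first.append(rec)
            used += rec.size_bytes
        else:
            # Either accessed too rarely or too large for the remaining memory.
            second.append(rec)
    return first, second
```

Visiting records in descending frequency order is one possible policy for deciding which statistics lose out when memory is short; the source does not prescribe one.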

It can be seen from the foregoing embodiments that, in the genome data storage method provided in the embodiments of the present invention, a carefully designed data storage structure is built around the characteristics of the intermediate files in the entire variation detection process (including the steps of comparison, sorting, de-duplication, re-comparison, variation detection, filtering, and the like). Some main intermediate data are stored in the memory and can be called from the memory directly, so the individual steps of the process no longer require a large number of disk I/O reads and writes, and the efficiency of the entire variation detection analysis is significantly improved.

In some alternative embodiments, the first statistical information comprises base weighted quality value statistical information, positive and negative strand statistical information, InDel statistical information, and soft-clip statistical information, specifically as follows:

statistical information of the base Weighted quality value (Weighted Count):

since each base aligned to the reference gene sequence has a quality value between 0 and 40, weights are assigned as shown in the following table:

Base Quality Scores	Parameter*	Weight
0–10	[0–Weight0]	0
11–13	(Weight0–Weight1)	1
14–17	(Weight1–Weight2)	2
18–20	(Weight2–Weight3)	3
21–40	(Weight3–40)	4

the weights of all bases of the same type aligned to the same position are added to obtain the weighted quality value sum of that base type;

the positive and negative strand statistical information (Strand Count): the numbers of gene sequences aligned at the same position in the forward direction and in the reverse direction are counted;

the InDel statistical information and insertion sequence information (InDel Count): the insertion/deletion sequences that the aligned gene sequences show at a given position of the genome are recorded, and their numbers of occurrences are accumulated;

the soft-clip statistical information (Soft Clip Count): the number of soft clips that the aligned gene sequences show at a given position of the genome is counted.
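The weighting scheme in the table can be sketched as follows. This is an illustrative Python sketch only: the function names are hypothetical, and the threshold values (10, 13, 17, 20) are read off the "Base Quality Scores" column as an assumption about the values of Weight0 to Weight3, which the table treats as parameters.

```python
from bisect import bisect_left

# Assumed upper bounds of the first four quality bins (Weight0..Weight3).
DEFAULT_THRESHOLDS = (10, 13, 17, 20)


def quality_to_weight(quality, thresholds=DEFAULT_THRESHOLDS):
    """Map a base quality value in [0, 40] to a weight in 0..4."""
    if not 0 <= quality <= 40:
        raise ValueError("quality value must be between 0 and 40")
    # Index of the first threshold that is >= quality, i.e. the bin number.
    return bisect_left(thresholds, quality)


def weighted_counts(pileup):
    """Sum the weights per base type for all (base, quality) pairs at one site."""
    totals = {}
    for base, quality in pileup:
        totals[base] = totals.get(base, 0) + quality_to_weight(quality)
    return totals
```

For example, three A bases with quality values 40, 12 and 18 contribute weights 4, 1 and 3, giving a weighted count of 8 for A at that site.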

In some alternative embodiments, considering the simplest case, for a site where no insertion/deletion or soft clipping occurs and at most 2 base types occur, the first statistical information of the site is stored by using a first data structure; optionally, the first data structure is an 8-byte data structure Counter, and one Counter stores the information of one site; since the entire human genome includes about 3G sites, about 24GB of memory is required (3G sites × 8 bytes);

as shown in fig. 2a, the first data structure stores the statistics of two bases (base1 information and base2 information), each in the same 4-byte data structure, including:

a first header for indicating a base type; optionally, the first header (base) uses 2 bits to represent the base type, and bases A, C, G and T are represented by 00, 01, 10 and 11, respectively;

a first quality value storage unit for storing a base weight quality value; optionally, the first quality value storage unit (weighted count) may use 14bits to represent the weighted quality value sum, and the maximum value is 16383;

a first positive strand number storage unit for indicating the number of positive strands; optionally, the first positive strand number storage unit (+ve strand count) uses 1 byte (8 bits) to represent the number of positive strands, and the maximum value is 255;

a first minus-strand number storage unit for storing the number of minus strands; optionally, the first minus-strand number storage unit (-ve strand count) uses 1 byte (8 bits) to represent the number of minus strands, and the maximum value is 255.
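The 8-byte Counter described above can be sketched with bit packing. The field widths (2 + 14 + 8 + 8 bits per base) come from the text; the bit order inside each 4-byte word, the little-endian byte order, and all function names are assumptions made for this illustration, since the layout of fig. 2a is not reproduced here.

```python
import struct

BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CODE_BASES = {code: base for base, code in BASE_CODES.items()}


def pack_base_info(base, weighted_count, plus_count, minus_count):
    """Pack one base's statistics into a 32-bit word (assumed field order)."""
    if not 0 <= weighted_count <= 0x3FFF:
        raise ValueError("weighted count must fit in 14 bits (max 16383)")
    if not (0 <= plus_count <= 0xFF and 0 <= minus_count <= 0xFF):
        raise ValueError("strand counts must fit in 8 bits (max 255)")
    return (BASE_CODES[base] << 30) | (weighted_count << 16) | (plus_count << 8) | minus_count


def unpack_base_info(word):
    """Inverse of pack_base_info: returns (base, weighted, plus, minus)."""
    return (CODE_BASES[word >> 30], (word >> 16) & 0x3FFF, (word >> 8) & 0xFF, word & 0xFF)


def pack_counter(base1_info, base2_info):
    """Pack two 4-byte base records into one 8-byte Counter."""
    return struct.pack("<II", pack_base_info(*base1_info), pack_base_info(*base2_info))
```

With one such Counter per site, about 3G sites at 8 bytes each give the roughly 24GB memory figure stated in the text.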

In some alternative embodiments, for a site where an insertion deletion (InDel) occurs and 3-4 base types occur, the first statistical information of the site is stored by using a first data structure and a second data structure; optionally, a 32-byte data structure OverflowCounter is used to store the information of one such site, in which the statistical information of each of the bases A, C, G and T (base A information, base C information, base G information, and base T information) is expressed by 6 bytes, and the insertion information (Insertion Info.) and the deletion information (Deletion Info.) are each expressed by 4 bytes; empirically, there are approximately 200M such sites in 30X whole-genome data;

the second data structure, as shown in fig. 2b, includes:

statistical information of base weighted quality values and statistical information of positive and negative strands for each of the 4 base types; the storage structure for each base type specifically comprises: a second quality value storage unit (weighted count, optionally 2 bytes for the weighted quality value sum, maximum value 65535) for indicating the base weighted quality value, a second positive strand number storage unit (+ve strand count, optionally 2 bytes for the positive strand number, maximum value 65535) for indicating the number of positive strands, and a second negative strand number storage unit (-ve strand count, optionally 2 bytes for the negative strand number, maximum value 65535) for indicating the number of negative strands;

the first insertion statistical information specifically includes: a first insertion sequence storage unit (Insertion Pattern, optionally 3 bytes, which holds at most 12 bases) for storing an insertion sequence, and a first low-quality insertion number storage unit (LQ count, optionally 1 byte, maximum 255) for storing the number of low-quality insertions;

the first deletion statistical information specifically includes: a first deletion length storage unit (len, optionally 1 byte, maximum 255) for indicating the deletion length, a first high-quality deletion number storage unit (HQ count, optionally 1 byte, maximum 255) for indicating the number of high-quality deletions, and a first low-quality deletion number storage unit (LQ count, optionally 1 byte, maximum 255) for indicating the number of low-quality deletions; optionally, 1 byte of unused space is also included;

when the OverflowCounter is used, the stored content of the corresponding first data structure changes accordingly, and the first data structure, as shown in fig. 2c, then includes:

a second header filled with 11; the headers originally used for base1 information and base2 information are both filled with "11", indicating that an OverflowCounter is used;

a first insertion information storage unit (Insertion Information) for indicating whether or not there is an insertion; optionally, 14 bits are used to store the insertion information, specifically including: a first insertion information sub-storage unit (1 bit) for indicating whether or not an insertion exists, an insertion length sub-storage unit (ins. len., using 4 bits) for indicating the insertion length, and a low quality insertion count sub-storage unit (LQ count, using 8 bits) for indicating the number of low-quality insertions; optionally, the remaining 1 bit is set to 0;

a first deletion information storage unit (Deletion Information) for indicating whether or not there is a deletion; optionally, 14 bits are used to store the deletion information, of which 1 bit indicates whether or not there is a deletion (first deletion information sub-storage unit), 1 bit is set to 0, and 12 bits are unused (Unused);

a pointer (array index pointing to the dynamic array of OverflowCounter) for pointing to the storage location of the corresponding second data structure; optionally, 4 bytes are used to hold the pointer to the location of the OverflowCounter data.
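The 32-byte OverflowCounter can be sketched with Python's `struct` module. The field sizes (4 × 6 bytes of base statistics, 4 bytes of insertion information, 4 bytes of deletion information) follow the text; the field order, the little-endian byte order, the left-aligned 2-bits-per-base encoding of the insertion pattern, and all names are assumptions for illustration.

```python
import struct

BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CODE_BASES = {code: base for base, code in BASE_CODES.items()}

# 12 x uint16 (weighted, +strand, -strand for A, C, G, T), 3-byte pattern,
# LQ insertion count, deletion length, HQ count, LQ count, 1 unused byte.
OVERFLOW_FORMAT = "<12H3sBBBBx"


def encode_pattern(seq):
    """Encode up to 12 bases at 2 bits each into 3 bytes, left-aligned."""
    if len(seq) > 12:
        raise ValueError("at most 12 bases fit in 3 bytes")
    value = 0
    for base in seq:
        value = (value << 2) | BASE_CODES[base]
    value <<= 2 * (12 - len(seq))  # pad the unused low bits with zeros
    return value.to_bytes(3, "big")


def decode_pattern(data, length):
    """Decode the first `length` bases; the length is stored in the Counter."""
    value = int.from_bytes(data, "big") >> (2 * (12 - length))
    return "".join(CODE_BASES[(value >> (2 * (length - 1 - i))) & 0b11] for i in range(length))


def pack_overflow_counter(base_stats, insertion, ins_lq, del_len, del_hq, del_lq):
    """Pack per-base statistics plus InDel information into 32 bytes."""
    fields = []
    for base in "ACGT":
        fields.extend(base_stats.get(base, (0, 0, 0)))
    return struct.pack(OVERFLOW_FORMAT, *fields, encode_pattern(insertion),
                       ins_lq, del_len, del_hq, del_lq)
```

Because the pattern bytes alone do not say how many bases are valid, this sketch relies on the insertion length kept in the Counter's 4-bit insertion length field, as the text describes.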

In some alternative embodiments, for a site where more than 1 InDel occurs and the insertion length is greater than 12 bases, the first statistical information of the site is stored by using a first data structure and a third data structure, and a Memory Pool (a block of memory set aside specifically for this purpose) is created in the memory to store the first statistical information of such sites; the statistical information of each of the bases A, C, G and T (base A information, base C information, base G information, and base T information) is represented by 6 bytes, and pointers to the InDel information are recorded in the OverflowCounter, as shown in fig. 2d;

the third data structure, as shown in fig. 2d, includes:

statistical information of base weighted quality values and statistical information of positive and negative strands for each of the 4 base types; the storage structure for each base type specifically comprises: a third quality value storage unit (weighted count, optionally 2 bytes for the weighted quality value sum, maximum value 65535) for indicating the base weighted quality value, a third positive strand number storage unit (+ve strand count, optionally 2 bytes for the positive strand number, maximum value 65535) for indicating the number of positive strands, and a third negative strand number storage unit (-ve strand count, optionally 2 bytes for the negative strand number, maximum value 65535) for indicating the number of negative strands;

the second insertion statistical information (Insertion Ptr), optionally represented by 4 bytes, specifically includes: an insertion length storage unit (Insertion length, optionally 1 byte) for indicating the insertion length, a second insertion sequence storage unit (Insertion pattern, variable length, 2 bits per base) for indicating the insertion sequence, a second low-quality insertion number storage unit (LQ count, optionally 1 byte) for indicating the number of low-quality insertions, and a high-quality insertion number storage unit (HQ count, optionally 1 byte) for indicating the number of high-quality insertions;

the second deletion statistical information (Deletion Ptr), optionally represented by 4 bytes, specifically includes: a second deletion length storage unit (Deletion length, optionally 1 byte) for indicating the deletion length, a second high-quality deletion number storage unit (HQ count, optionally 1 byte) for indicating the number of high-quality deletions, and a second low-quality deletion number storage unit (LQ count, optionally 1 byte) for indicating the number of low-quality deletions;

meanwhile, as shown in fig. 2e, the information recording change in the Counter includes:

a third header filled with 11; the headers originally used for base1 information and base2 information are both filled with "11", indicating that an OverflowCounter is used;

a second insertion information storage unit (Insertion Information) for indicating whether or not there is an insertion, optionally represented by 14 bits, specifically including: a second insertion information sub-storage unit (1 bit) for indicating whether or not there is an insertion, a first memory pool information sub-storage unit (1 bit) for indicating whether or not a memory pool is used, and a first occupied length sub-storage unit (12 bits) for indicating the occupied length in the memory pool;

a second deletion information storage unit (Deletion Information) for indicating whether or not there is a deletion, optionally represented by 14 bits, specifically including: a second deletion information sub-storage unit (1 bit) for indicating whether or not there is a deletion, a second memory pool information sub-storage unit (1 bit) for indicating whether or not the memory pool is used, and a second occupied length sub-storage unit (12 bits) for indicating the occupied length in the memory pool.
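The memory-pool variant can be sketched in the same way. The bit widths (1 + 1 + 12 bits per 14-bit field, 1-byte length and count fields, 2 bits per base) come from the text. Everything else is an assumption for illustration: the bit order, placing each "11" header directly before its 14-bit field, keeping a 4-byte pointer in the second half of the Counter as in the OverflowCounter case, measuring the occupied length in bytes, and all function names.

```python
import struct

BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_pool_info(present, uses_pool, occupied_length):
    """Build one 14-bit field: presence bit, memory-pool bit, 12-bit length."""
    if not 0 <= occupied_length <= 0xFFF:
        raise ValueError("occupied length must fit in 12 bits")
    return (int(present) << 13) | (int(uses_pool) << 12) | occupied_length


def pack_pool_counter(insertion_info, deletion_info, pointer):
    """Pack the 8-byte Counter: two ('11' header + 14-bit field) halves and a pointer."""
    ins_word = (0b11 << 14) | insertion_info
    del_word = (0b11 << 14) | deletion_info
    return struct.pack("<HHI", ins_word, del_word, pointer)


def pack_insertion_record(seq, lq_count, hq_count):
    """Variable-length pool record: length, 2-bit-per-base pattern, LQ count, HQ count."""
    if not 0 < len(seq) <= 255:
        raise ValueError("insertion length must fit in 1 byte")
    value = 0
    for base in seq:
        value = (value << 2) | BASE_CODES[base]
    n_bytes = (len(seq) + 3) // 4  # four bases per byte, rounded up
    value <<= 2 * (4 * n_bytes - len(seq))  # left-align inside the pattern bytes
    return bytes([len(seq)]) + value.to_bytes(n_bytes, "big") + bytes([lq_count, hq_count])
```

Under these assumptions, a 13-base insertion needs 4 pattern bytes and therefore a 7-byte record in the pool, which is the value that would go into the 12-bit occupied length field.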

Empirically, soft clipping occurs at only a few genomic locations, so it is not necessary to set aside a separate piece of storage space for each site. Therefore, in some optional embodiments, the soft-clip statistical information is recorded in a dynamic array; as shown in fig. 2f, each record has the format {position, left counts, right counts} and occupies 12 bytes, specifically including:

a soft-clip position storage unit (position) for indicating the position of the soft clip on the genome, occupying 4 bytes;

a soft-clip left-side number storage unit (left counts) for indicating the number of times soft clipping occurs to the left of the corresponding site, occupying 4 bytes;

a soft-clip right-side number storage unit (right counts) for indicating the number of times soft clipping occurs to the right of the corresponding site, occupying 4 bytes.
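A minimal sketch of such a dynamic array is given here, storing only the positions where soft clipping actually occurs. The 12-byte record of three 4-byte fields follows the text; the `SoftClipTable` class, its helper dictionary for finding an existing record, and the little-endian byte order are assumptions introduced for this illustration.

```python
import struct

RECORD = struct.Struct("<III")  # position, left counts, right counts: 12 bytes


class SoftClipTable:
    """Growable array of 12-byte soft-clip records, one per clipped position."""

    def __init__(self):
        self._buf = bytearray()
        self._slot = {}  # position -> record index (lookup helper, not in the source)

    def add(self, position, side):
        """Count one soft clip on the given side ('left' or 'right') of a position."""
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        index = self._slot.get(position)
        if index is None:
            # First soft clip seen here: append a new zeroed record.
            index = len(self._buf) // RECORD.size
            self._slot[position] = index
            self._buf += RECORD.pack(position, 0, 0)
        offset = index * RECORD.size
        pos, left, right = RECORD.unpack_from(self._buf, offset)
        if side == "left":
            left += 1
        else:
            right += 1
        RECORD.pack_into(self._buf, offset, pos, left, right)

    def records(self):
        """Return all records as (position, left counts, right counts) tuples."""
        return [RECORD.unpack_from(self._buf, off)
                for off in range(0, len(self._buf), RECORD.size)]

    @property
    def nbytes(self):
        return len(self._buf)
```

Memory use then grows with the number of clipped positions rather than with the roughly 3G sites of the genome.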

In some alternative embodiments, as shown in fig. 2g, the index comprises a double-ended (paired-end) alignment information index (Pair End) and a single-ended alignment information index (Single End);

the double-ended alignment information index is stored in a double-ended alignment array structure (PairEndAlignmentInfo, occupying 12 bytes), which comprises:

a first ID storage unit (ReadID) for storing an ID indicating a gene sequence, and occupies 4 bytes;

a first alignment position storage (AlignedPosition) for indicating a position at which the gene sequence is aligned on the genome, occupying 4 bytes;

an Insert length storage (Insert Size) for indicating the Insert length of a gene sequence, occupying 2 bytes;

a first comparison quality value storage (MAPQ) for representing an alignment quality value of a gene sequence, occupying 1 byte;

a first Average quality value storage unit (Average base quality) for storing an Average quality value of a gene sequence, which occupies 1 byte;

the single-ended alignment information index is stored in a single-ended alignment array structure (SingleEndAlignmentInfo, occupying 10 bytes), which comprises:

a second ID storage unit (ReadID) for storing an ID indicating a gene sequence, and occupies 4 bytes;

a second alignment position storage (AlignedPosition) for indicating a position at which the gene sequence is aligned on the genome, occupying 4 bytes;

a second alignment quality value storage (MAPQ) for indicating an alignment quality value of the gene sequence, occupying 1 byte;

a second Average quality value storage unit (Average base quality) for storing an Average quality value of a gene sequence, which occupies 1 byte.
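The two index entry layouts above add up to 12 bytes (4 + 4 + 2 + 1 + 1) and 10 bytes (4 + 4 + 1 + 1). A minimal sketch of such fixed-size entries is shown below; the byte order, the signedness of the insert length and the function names are assumptions for illustration.

```python
import struct

# Assumed little-endian, unpadded layouts.
# ReadID (4), AlignedPosition (4), Insert Size (2, assumed signed),
# MAPQ (1), Average base quality (1)
PAIRED_END = struct.Struct("<IIhBB")
# ReadID (4), AlignedPosition (4), MAPQ (1), Average base quality (1)
SINGLE_END = struct.Struct("<IIBB")


def pack_paired(read_id, position, insert_size, mapq, avg_quality):
    """Serialize one double-ended index entry (12 bytes)."""
    return PAIRED_END.pack(read_id, position, insert_size, mapq, avg_quality)


def pack_single(read_id, position, mapq, avg_quality):
    """Serialize one single-ended index entry (10 bytes)."""
    return SINGLE_END.pack(read_id, position, mapq, avg_quality)
```

Fixed-size entries allow an in-memory array to be indexed directly by entry number, without per-entry length fields.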

Because reads are accessed in a highly random order during the variation detection process, in some alternative embodiments, storing the gene sequence alignment information in a disk specifically includes:

dividing the gene sequence alignment information into 512 files (buckets) stored in the disk, each file storing the gene sequence alignment information of a certain genome interval; as shown in fig. 2h, the storage data structure of each piece of gene sequence alignment information includes:

a sequence Length storage unit (Read Length) for storing a sequence Length of a gene sequence, which occupies 2 bytes;

a sequence storage unit (Packed Read) for expressing the gene sequence itself, the length of which is variable, and 2bits are used to express one base;

a quality value storage unit (Base qualities) for expressing the quality values of a gene sequence, the length of which is variable;

a start position storage unit (DP StartPos.) for indicating the alignment algorithm start position of the gene sequence at the time of alignment, and occupies 4 bytes;

a strand storage part (Strand) for representing whether the gene sequence was aligned to the positive or the negative strand, occupying 1 bit;

a region length storage part (DP ref. length) for expressing the length of the genome region selected during the alignment of the gene sequence, occupying 15 bits;

a Left position storage unit (Left Anchor) for indicating the left-hand position of the gene sequence during alignment, occupying 4 bytes;

a Right position storage unit (Right Anchor) for indicating the right-hand position of the gene sequence during alignment, occupying 4 bytes.
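The record layout above can be illustrated with the following minimal sketch. Several details are assumptions that the text does not fix: the 2-bit base codes, one byte per base quality value, little-endian byte order, the strand bit sharing a 16-bit word with the 15-bit region length, and equal-width genome intervals for the 512 buckets. Bases other than A, C, G and T are not handled here.

```python
import struct

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding
CODE_BASE = "ACGT"


def pack_bases(seq):
    """Pack a base string at 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_bases(data, length):
    """Inverse of pack_bases for a read of the given length."""
    return "".join(CODE_BASE[(data[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(length))


def pack_record(seq, quals, dp_start, reverse_strand, dp_ref_len,
                left_anchor, right_anchor):
    """Serialize one alignment record in the field order of fig. 2h."""
    if len(quals) != len(seq):
        raise ValueError("one quality value per base is assumed")
    if not 0 <= dp_ref_len < (1 << 15):
        raise ValueError("region length must fit in 15 bits")
    strand_and_len = (int(reverse_strand) << 15) | dp_ref_len  # 1 + 15 bits
    return (struct.pack("<H", len(seq))        # Read Length, 2 bytes
            + pack_bases(seq)                  # Packed Read, variable
            + bytes(quals)                     # Base qualities, variable
            + struct.pack("<IHII", dp_start, strand_and_len,
                          left_anchor, right_anchor))


def bucket_index(position, genome_length, n_buckets=512):
    """Map a genome position to one of n_buckets equal-width intervals."""
    return position * n_buckets // genome_length
```

Under these assumptions a 150-base read occupies 2 + 38 + 150 + 14 = 204 bytes, most of which is quality values.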

In addition to the steps of creating statistics and indexes during the comparison process in the foregoing examples, in some alternative embodiments, the method further comprises:

removing, in a de-duplication process, the interference caused by duplicate sequences from the genome statistical information;

and/or,

extracting, in a re-alignment (realignment) process, the gene sequences of a re-alignment region of the genome, and adjusting the genome statistical information of those gene sequences after they have been re-aligned;

in the mutation detection process, the statistical information is directly used to calculate the probability of various genotypes.

With the genome data storage method provided by the embodiment of the invention, a large number of binary files need not be repeatedly output during the whole analysis process; with overall algorithm optimization, the analysis of the data of a whole genome can be completed within 4 hours, whereas a conventional analysis process takes dozens of hours; the I/O in the variation detection analysis process is greatly reduced, and the analysis efficiency of the program is greatly improved.

To facilitate understanding of the foregoing technical solutions, an example of a genome sequence alignment method is briefly introduced here to explain the genome alignment process in step 101 in the foregoing examples. FIG. 3 is a schematic flow chart of an embodiment of the method for aligning genomic sequences according to the present invention.

The genome sequence alignment method comprises the following steps:

Step 201: Acquire a reference genome sequence and the genome sequence files to be aligned. The files may be acquired in a conventional manner. The genome sequence files to be aligned may be in FASTQ format.

The genome sequence alignment method divides sequence alignment into 3 levels: each time, a portion of the sequences is read from the input genome sequence files to be aligned, and the level 1, level 2 and level 3 alignment algorithms are then executed in sequence; sequences that were not aligned at one level enter the alignment algorithm of the next level for further alignment. The method specifically comprises the following steps.

Step 202: reading partial genome sequences from the genome sequence files to be aligned.

Step 203: Align the partial genome sequence with the reference genome sequence according to a bidirectional BWT alignment algorithm (level 1; bidirectional BWT: Bi-directional Burrows-Wheeler Transform). The bidirectional BWT alignment algorithm handles the alignment of reads with at most 4 base errors. Reads are the sequences obtained in high-throughput sequencing, each read being a stretch of bases. In the process of bioinformatics analysis, each read is aligned to a reference genome, so that the differences between the sequenced sequence and the reference genome can be obtained and variations can be found.

Optionally, the method for aligning genome sequences according to the bidirectional BWT alignment algorithm may specifically include the following steps:

segmenting reads by using the pigeonhole principle, wherein each segment allows 0-2 base errors;

then using a bidirectional BWT comparison algorithm to perform search comparison, comprising:

establishing BWT of the reference genome sequence, a suffix array and BWT of the reverse order of the reference genome sequence;

backward search (backward) and forward search (forward) are used to search reads or each fragment of a read for its position on the reference genomic sequence in both the right-to-left and left-to-right directions, respectively.

Bidirectional BWT alignment is less efficient when handling many base mismatches. For reads with at most 4 base mismatches, the reads are therefore segmented according to the pigeonhole principle, with 0-2 base mismatches allowed per segment; because the bidirectional BWT then only has to process alignments with at most 2 base mismatches, its efficiency is greatly increased.
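The pigeonhole argument can be made concrete with a short sketch: if a read with at most k mismatches is cut into n segments, at least one segment contains at most floor(k / n) mismatches. The function names and the near-equal segment lengths are assumptions for illustration; the text does not state how many segments are used.

```python
def split_read(read, n_segments):
    """Cut a read into n_segments pieces of near-equal length.

    Returns a list of (offset_in_read, segment) tuples.
    """
    base, extra = divmod(len(read), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)
        segments.append((start, read[start:end]))
        start = end
    return segments


def per_segment_budget(max_mismatches, n_segments):
    """Mismatches that must be tolerated in at least one segment."""
    return max_mismatches // n_segments
```

For example, with at most 4 mismatches in the whole read, two segments give a budget of 2 mismatches for at least one segment, three segments give a budget of 1, and five segments guarantee one exactly matching segment, which is consistent with the range of 0-2 errors per segment mentioned above.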

After the BWT of the reference sequence, the corresponding index and the SA (suffix array) have been established, the common alignment software BWA uses backward search, i.e., it searches for the position on the genome of a read, or of each fragment of a read, from right to left. In addition to the conventional BWT index (denoted B), the bidirectional BWT used in this patent also establishes a BWT index (denoted B') for the reverse of the reference sequence. With B, B' and SA, the positions of reads or seeds on the genome are searched in both directions by backward and forward search, which significantly improves the efficiency of sequence alignment.
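The roles of B, B' and SA can be illustrated with a toy exact-match index. This is a deliberately simplified sketch: it builds the suffix array naively, handles exact matches only, and uses two independent indexes rather than the synchronized intervals of a full bidirectional BWT. All function names are hypothetical.

```python
def build_index(text):
    """Build a toy FM-index: suffix array, C table and Occ table."""
    text += "$"                                   # sentinel, sorts first
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    c, total = {}, 0
    for ch in alphabet:                           # C[ch]: count of smaller chars
        c[ch] = total
        total += text.count(ch)
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, ch in enumerate(bwt):                  # Occ[a][i]: a's in bwt[:i]
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return sa, c, occ


def backward_search(index, pattern):
    """Extend the match right to left; return sorted match positions."""
    sa, c, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in c:
            return []
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


def forward_search(reverse_index, pattern, text_length):
    """Extend the match left to right using the index of the reversed text."""
    hits = backward_search(reverse_index, pattern[::-1])
    # Convert positions in the reversed text back to forward coordinates.
    return sorted(text_length - j - len(pattern) for j in hits)
```

For a reference `ref`, `build_index(ref)` plays the role of B with SA, and `build_index(ref[::-1])` plays the role of B'; both searches report the same positions, but they consume the pattern in opposite directions.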

Step 204: whether only one read of at least one pair of reads is aligned in the partial genomic sequence (i.e., only one read of at least one pair of reads is aligned in the partial genomic sequence); if yes, go to step 208; if not, go to step 205.

Step 205: According to a single-ended dynamic programming alignment algorithm (level 2), re-align with the reference genome sequence each pair of reads in the partial genome sequence of which only one read was aligned. That is, if, after the aforementioned level 1 bidirectional BWT alignment algorithm, one read (A or A') of a pair of reads (A, A') is aligned to the reference genome sequence and the other (A' or A) is not, alignment continues with the level 2 alignment algorithm.

Optionally, the method for comparing genome sequences according to the single-ended dynamic programming alignment algorithm may specifically include the following steps:

determining that one read (A or A ') of a pair of reads (A, A') aligns to a particular position (pos position) on the reference genomic sequence; data reads obtained by double-end sequencing are paired, and if one read (A or A ') of one pair of reads (A, A ') is aligned to a pos position on a reference genome sequence, the theoretical alignment position of the other read (A ' or A) is in a certain region around the pos position, namely a candidate region (candidate region);

therefore, a specific range around the specific position (pos position) is selected according to a preset position range threshold; the preset position range threshold can be chosen according to actual needs, for example with reference to an error tolerance range; specifically, in paired-end sequencing, when a pair of reads is aligned on the genome, the distance between the two reads plus the sum of the lengths of the two reads equals the length of the sequencing fragment, and the location of the candidate region is determined on this principle. For example, if the sequencing fragment is 500 bp and each read is 150 bp, the theoretical distance between the two reads after alignment to the genome is 500 - 2 × 150 = 200 bp. Because sequencing fragment lengths vary, the theoretical distance is about 100 bp to 200 bp;

aligning the unaligned read (A' or A) of the pair against the specific range by using a dynamic programming algorithm.

Step 206: Determine whether the partial genome sequence contains at least one pair of reads of which neither read is aligned; if yes, go to step 208; if not, go to step 207.
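The candidate-region selection of the level 2 algorithm (step 205) can be sketched as follows. The sketch assumes that the unaligned mate lies downstream of the aligned read and that both are given in reference orientation; it uses a plain edit-distance recurrence as a stand-in for the dynamic programming algorithm, whose scoring scheme is not specified in the text. All names are hypothetical.

```python
def candidate_window(pos, read_len, min_gap, max_gap, genome_len):
    """Half-open genome interval in which the unaligned mate is expected.

    pos is the start of the aligned read; the gap is the distance between
    the two reads (fragment length minus both read lengths).
    """
    start = pos + read_len + min_gap
    end = pos + read_len + max_gap + read_len
    return max(0, start), min(genome_len, end)


def best_fit(read, window):
    """Minimum edit distance of read against any substring of window.

    Returns (distance, end), where end is the exclusive end offset of the
    best match inside window.
    """
    prev = [0] * (len(window) + 1)          # the match may start anywhere
    for i, rc in enumerate(read, 1):
        cur = [i] + [0] * len(window)
        for j, wc in enumerate(window, 1):
            cur[j] = min(prev[j - 1] + (rc != wc),   # match or substitution
                         prev[j] + 1,                # base missing in window
                         cur[j - 1] + 1)             # extra base in window
        prev = cur
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end
```

With the numbers of the example above (150 bp reads and a gap of 100 bp to 200 bp), a read aligned at position 1000 yields the window [1250, 1500), so the dynamic programming step only has to examine 250 bases instead of the whole genome.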

Step 207: According to a paired-end dynamic programming alignment algorithm (level 3), re-align with the reference genome sequence each pair of reads in the partial genome sequence of which neither read was aligned. That is, if, after the aforementioned level 1 bidirectional BWT alignment algorithm and level 2 single-ended dynamic programming alignment algorithm, neither A nor A' of a pair of reads (A, A') is aligned to the reference genome sequence, alignment continues with the level 3 alignment algorithm.

Optionally, the method for comparing genome sequences according to a paired-end dynamic programming alignment algorithm may specifically include the following steps:

respectively constructing seeds (seeds, substrings of a read) for each of a pair of reads (A and A');

specifically, each read of a pair of reads (A, A') is divided into a plurality of segments, and seeds (seeds of a read) are constructed; when a pair of reads are aligned on a genome, the distance between the two reads is within a certain range, so the distance between the seeds of the two reads also should be within a certain range;

aligning each seed to a reference genomic sequence;

specifically, the regions aligned by the seeds in pairs (i.e. the distance between two seeds meets the requirement) are retrieved, and the candidate alignment regions of the pairs of seeds are determined. Then, comparing reads to the candidate area by using a dynamic programming algorithm.

If two (A and A') of the reads have corresponding seed alignment in a certain region of the reference genome sequence, the region is a candidate region of the final alignment position;

aligning the two reads (A and A') respectively against the candidate region by using a dynamic programming algorithm; after the alignment is completed, go to step 208.

Step 208: Determine whether all of the genome sequence files to be aligned have been aligned; if not, return to step 202; if yes, go to step 209.
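The seed pairing of the level 3 algorithm (step 207) can be sketched as follows. This is an illustration under simplifying assumptions: seeds are non-overlapping and matched exactly with a linear scan instead of an index, both reads are given in reference orientation, and read A lies upstream of read A'. The function names and parameters are hypothetical.

```python
def make_seeds(read, seed_len):
    """Cut a read into non-overlapping seeds: (offset_in_read, seed)."""
    return [(off, read[off:off + seed_len])
            for off in range(0, len(read) - seed_len + 1, seed_len)]


def seed_hits(reference, seed):
    """All exact occurrences of a seed in the reference."""
    hits, start = [], reference.find(seed)
    while start != -1:
        hits.append(start)
        start = reference.find(seed, start + 1)
    return hits


def implied_starts(reference, read, seed_len):
    """Read start positions implied by the seed hits."""
    return {hit - off
            for off, seed in make_seeds(read, seed_len)
            for hit in seed_hits(reference, seed)}


def paired_candidates(reference, read_a, read_b, seed_len, max_dist):
    """Pairs of implied start positions whose distance is within range."""
    starts_a = implied_starts(reference, read_a, seed_len)
    starts_b = implied_starts(reference, read_b, seed_len)
    return sorted((a, b) for a in starts_a for b in starts_b
                  if 0 < b - a <= max_dist)
```

Each returned pair marks a candidate region; the reads would then be aligned against those regions with a dynamic programming algorithm.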

Step 209: Output the alignment result. Optionally, the output file of the genome sequence alignment is a BAM file; BAM is a format for storing genome sequence alignment results, and it records the position of each genome sequence in the reference genome sequence and the details of the sequence alignment.

It can be seen from the above embodiments that, in the genome sequence alignment method provided by the present invention, a multi-level alignment algorithm is used: after the alignment of one level is completed, the algorithm of the next level continues to align the parts that were not aligned, so that the complexity of the algorithm matches the complexity of the data, and each level of the algorithm is optimized, thereby optimizing the overall speed. With the genome sequence alignment method provided by the invention, the alignment time of a human whole genome sequence can be shortened to about 4 hours with the same resources and the same alignment accuracy, which is significantly shorter than the alignment time in the prior art and improves the efficiency of data analysis.
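The multi-level cascade can be summarized in a short control-flow sketch. It follows the descriptions of the three levels (level 2 for pairs with exactly one aligned read, level 3 for pairs with no aligned read) rather than the exact branch wording of the flow chart, and the three aligners are passed in as hypothetical callables whose signatures are assumptions.

```python
def align_batch(pairs, level1, level2, level3):
    """Run each read pair through the cheapest level that can align it.

    pairs:  dict mapping pair id -> (read_a, read_b)
    level1: read -> position or None            (bidirectional BWT)
    level2: (read, mate_position) -> position   (single-ended DP)
    level3: (read_a, read_b) -> (pos_a, pos_b)  (paired-end DP)
    """
    results = {}
    for pair_id, (read_a, read_b) in pairs.items():
        hit_a, hit_b = level1(read_a), level1(read_b)
        if hit_a is None and hit_b is None:
            # Neither read aligned at level 1: use the level 3 algorithm.
            hit_a, hit_b = level3(read_a, read_b)
        elif hit_a is None:
            # Only read_b aligned: rescue read_a near it at level 2.
            hit_a = level2(read_a, hit_b)
        elif hit_b is None:
            # Only read_a aligned: rescue read_b near it at level 2.
            hit_b = level2(read_b, hit_a)
        results[pair_id] = (hit_a, hit_b)
    return results
```

The more expensive dynamic programming levels are invoked only for the pairs that the cheaper level could not resolve, which is the sense in which the complexity of the algorithm follows the complexity of the data.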

In view of the above, a second aspect of the embodiments of the present invention provides an embodiment of a genomic data storage device, which can solve the problem of low efficiency caused by frequent input and output of a large number of binary files in a genomic variation detection process. Fig. 4 is a schematic structural diagram of an embodiment of a genome data storage apparatus according to the present invention.

The genomic data storage device comprises:

a creating module 301, configured to obtain gene sequence comparison information and create gene sequence statistical information in a genome comparison process;

the comparison information storage module 302 is configured to store the gene sequence comparison information in a magnetic disk, and store a corresponding index in an internal memory according to a comparison position of the gene sequence comparison information in a genome; the index is the storage position of the gene sequence comparison information in a magnetic disk;

a statistical information classification module 303, configured to classify the genomic statistical information to obtain first statistical information and second statistical information;

a statistical information storage module 304, configured to store first statistical information in a memory, where the first statistical information is statistical information in which an access frequency is higher than a preset frequency in a variation detection process; and storing second statistical information in the magnetic disk, wherein the second statistical information is statistical information which cannot be stored in the memory and/or statistical information of which the access frequency is lower than the preset frequency in the variation detection process.
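The division of labour between the statistical information classification module 303 and the statistical information storage module 304 can be sketched as follows. The greedy policy, the memory budget parameter and all names are assumptions for illustration; the text states only that frequently accessed statistics are kept in memory and that statistics which do not fit, or which are accessed less often than the preset frequency, are stored on disk.

```python
def classify_statistics(stats, preset_frequency, memory_budget):
    """Split statistics into an in-memory set and an on-disk set.

    stats maps a name to (access_frequency, size_in_bytes).
    Returns (first, second): dicts mapping name -> size for the
    statistics kept in memory and on disk respectively.
    """
    first, second, used = {}, {}, 0
    # Consider the most frequently accessed statistics first.
    by_frequency = sorted(stats.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (frequency, size) in by_frequency:
        if frequency > preset_frequency and used + size <= memory_budget:
            first[name] = size
            used += size
        else:
            # Too rarely accessed, or no room left in memory.
            second[name] = size
    return first, second
```

In this sketch a statistic with a high access frequency can still end up on disk when the memory budget is exhausted, which corresponds to the "cannot be stored in the memory" case.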

In some alternative embodiments, the first statistical information comprises base weighted quality value statistics, positive and negative strand statistics, indel statistics, and soft-clip statistics.

In some alternative embodiments, for a site where no indels or soft clips have occurred and at most 2 base types have occurred, first statistical information for the site is stored using a first data structure;

the first data structure, comprising:

a first header for indicating a base type;

a first quality value storage unit for storing a base weight quality value;

In some alternative embodiments, for a site where an indel occurs and 3-4 base types have occurred, first statistical information for the site is stored using a first data structure and a second data structure;

the second data structure, comprising:

the first data structure, comprising:

a second head filled with 11;

a first deletion information storage unit for indicating whether or not there is a deletion;

a pointer to a corresponding second data structure storage location.

In some alternative embodiments, for a site where more than 1 indel occurs and the insertion length is greater than 12 bases, the first statistical information of the site is stored using the first data structure and the third data structure, and for the first statistical information of such site, a memory pool is created in the memory for storage;

the third data structure, comprising:

the first data structure, comprising:

a third head filled with 11;

In some optional embodiments, for the soft-clip statistics, a dynamic array is used for recording, each record including:

In some alternative embodiments, the index comprises a double-ended alignment information index and a single-ended alignment information index;

a first ID storage unit for storing an ID representing a gene sequence;

an insert length storage unit for storing an insert length of a gene sequence;

a second ID storage unit for storing an ID representing a gene sequence;

In some alternative embodiments, the storing the gene sequence alignment information in a disk specifically comprises:

a sequence storage unit for storing a gene sequence;

In view of the above object, according to a third aspect of the embodiments of the present invention, an embodiment of an apparatus for performing the method for storing genome data is provided. Fig. 5 is a schematic diagram of a hardware structure of an embodiment of the apparatus for performing the genomic data storage method according to the present invention.

As shown in fig. 5, the apparatus includes:

one or more processors 401 and a memory 402, one processor 401 being exemplified in fig. 5.

The apparatus for performing the genomic data storage method may further include: an input device 403 and an output device 404.

The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 5 illustrates an example of a connection by a bus.

The memory 402, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the genomic data storage method in the embodiment of the present application (for example, the creating module 301, the comparison information storage module 302, the statistical information classification module 303, and the statistical information storage module 304 shown in fig. 4). The processor 401 executes various functional applications of the server and data processing by running the nonvolatile software programs, instructions and modules stored in the memory 402, that is, implements the genome data storage method of the above-described method embodiment.

The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created according to the use of the genomic data storage device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 402 may optionally include memory located remotely from processor 401, which may be connected to the genomic data storage device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the genome data storage device. The output device 404 may include a display device such as a display screen.

The one or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the method of genomic data storage in any of the method embodiments described above. The technical effect of the embodiment of the device for executing the genome data storage method is the same as or similar to that of any method embodiment.

The embodiment of the present application further provides a non-transitory computer storage medium, where the computer storage medium stores computer executable instructions, and the computer executable instructions may execute the genome data storage method in any of the method embodiments described above. Embodiments of the non-transitory computer storage medium may be the same or similar in technical effect to any of the method embodiments described above.

Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program that can be stored in a computer-readable storage medium and that, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The technical effect of the embodiment of the computer program is the same as or similar to that of any of the method embodiments described above.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.

In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.

While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.

The embodiments of the invention are intended to embrace all such alternatives, modifications and variances that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A method of genomic data storage, comprising:

storing second statistical information in a magnetic disk, wherein the second statistical information is statistical information which cannot be stored in an internal memory and/or statistical information of which the access frequency is lower than a preset frequency in a variation detection process;

the first statistical information comprises statistical information of base weight quality values, statistical information of positive and negative chains, statistical information of insertion deletion and statistical information of soft shearing;

for a site which has no insertion deletion and soft shearing and has at most 2 base types, storing first statistical information of the site by adopting a first data structure;

the first data structure, comprising:

a first header for indicating a base type;

a first quality value storage unit for storing a base weight quality value;

2. The method of claim 1, wherein for a site where an indel occurs and 3 to 4 base types occur, first statistical information of the site is stored using a first data structure and a second data structure;

the second data structure, comprising:

the first data structure, comprising:

a second head filled with 11;

a pointer to a corresponding second data structure storage location.

3. The method according to claim 1, wherein for a site where more than 1 indel occurs and the length of the insertion is greater than 12 bases, the first statistical information of the site is stored using a first data structure and a third data structure, and for the first statistical information of such a site, a memory pool is created in the memory for storage;

the third data structure, comprising:

the first data structure, comprising:

a third head filled with 11;

4. The method of claim 1, wherein for the soft-cut statistics, a dynamic array is used for the records, each record comprising:

5. The method of claim 1, wherein the index comprises a double-ended alignment information index and a single-ended alignment information index;

a first ID storage unit for storing an ID representing a gene sequence;

an insert length storage unit for storing an insert length of a gene sequence;

a second ID storage unit for storing an ID representing a gene sequence;

6. The method of claim 1, wherein storing the gene sequence alignment information in a disk comprises:

a sequence storage unit for storing a gene sequence;

7. The method of any one of claims 1-6, further comprising:

and/or,

8. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of claims 1-7.