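WO2019205963A1 - Gene sequencing quality line data compression preprocessing and decompression restoration methods, and system - Google Patents
Gene sequencing quality line data compression preprocessing and decompression restoration methods, and system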
- Publication number
- WO2019205963A1 (application PCT/CN2019/082466)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- index
- entry
- quality
- column
- Prior art date
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3068—Precoding preceding compression, e.g. Burrows-Wheeler transformation
- H03M7/3077—Sorting
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
Definitions
- the invention relates to compression preprocessing and decompression of gene sequencing quality line data, and in particular to a method and system for compression preprocessing and for decompression and restoration of gene sequencing quality line data.
- Genetic testing is a technique that examines DNA obtained from blood, other body fluids, or cells.
- the DNA molecular information in the subject's cells is read by a dedicated device, and the genotype, genetic defects and their expression functions are analyzed.
- Genetic testing can diagnose diseases and can also be used to predict disease risk.
- With the continuous advance of gene sequencing technology, sequencing throughput keeps rising and the cost of sequencing is falling sharply.
- High-throughput sequencing technology is widely used in scientific research and medical fields.
- the storage and transmission of massive gene sequencing data has become an important technical problem in genetic testing applications.
- a lossless compression algorithm with a high compression ratio is an important way to address this problem.
- compressing the quality line data in gene sequencing results is a difficult part of gene sequencing data compression.
- the usual strategy for compressing quality line data is to first apply compression preprocessing, such as changing the order of the data, and then apply a classical compression algorithm to obtain good compression efficiency.
- the most common method is to use the BWT algorithm for preprocessing, and then perform compression using arithmetic coding or the like.
- the purpose of compression preprocessing is to put the same or similar data together as much as possible, and then use the compression algorithm to improve the efficiency of compression.
- BWT: Burrows-Wheeler Transform
- the BWT algorithm mainly includes the following key steps:
- the BWT method is an efficient compression preprocessing method. It adjusts the order of characters in the string to be compressed by cyclically shifting to the right, so that the same or similar characters are arranged together, thereby improving the efficiency of subsequent compression.
- the BWT algorithm has the following two drawbacks: (1) Large overhead: because the BWT algorithm must save the position I of the original string S in the matrix M, it introduces additional storage overhead in the preprocessing stage. Because of this overhead, preprocessing may fail to improve compression efficiency. (2) Small preprocessing window: the BWT algorithm only reorders the characters within a string. Its preprocessing window is a fixed-length string, which is small, and it does not consider reordering the data from the perspective of a whole file or a large data block.
- because of its small preprocessing window, the BWT algorithm cannot exploit data similarity across large data blocks.
- the overhead of its preprocessing also limits further improvement in compression efficiency.
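The matrix M and the position I mentioned above can be illustrated with a naive Burrows-Wheeler transform. This is a minimal textbook sketch, not code from the patent: the function names `bwt` and `inverse_bwt` are hypothetical, and the quadratic rotation sort is chosen for readability rather than speed. It assumes a non-empty input string.

```python
def bwt(s: str) -> tuple[str, int]:
    """Naive Burrows-Wheeler transform of a non-empty string s.

    Returns (L, I): L is the last column of the sorted rotation matrix M,
    and I is the row of M that holds the original string s.
    """
    n = len(s)
    # Build every cyclic rotation of s and sort them to form the matrix M.
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    last_column = "".join(row[-1] for row in rotations)
    return last_column, rotations.index(s)


def inverse_bwt(last_column: str, index: int) -> str:
    """Rebuild the original string from (L, I) by repeated prepend-and-sort."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepending L to the partial rows and sorting regrows M column by column.
        table = sorted(last_column[i] + table[i] for i in range(n))
    return table[index]
```

The integer returned next to the transformed string is the extra value that has to be stored for every window, which is the storage overhead described in drawback (1).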
- the technical problem to be solved by the present invention, in view of the above problems of the prior art, is to provide a method and system for compression preprocessing and for decompression and restoration of gene sequencing quality line data. The invention introduces no additional storage overhead and, with only a small computational overhead, rearranges data within a large data window, thereby improving compression efficiency.
- the invention is suitable for compressing and preprocessing the quality line data in the gene sequencing process, and the larger the data block, the more obvious the advantage.
- the technical solution adopted by the present invention is:
- the invention provides a compression preprocessing method for gene sequencing quality line data, and the implementation steps include:
- the quality lines of the original data block Data are regrouped and rearranged according to the index column information, and the index column data is deleted, to obtain the group-rearranged data Grouped_Data;
- the detailed steps of step 2) include:
- initialize the number of entries of the group information table IIT to 0. Each entry of the group information table IIT comprises a sequence number, index column information Index, a variable num, a variable start, and a variable temp, where the variable num is the number of quality lines having the corresponding index column information, the variable start is the starting position of the quality lines having that index column information after grouping and sorting, and the variable temp is the number of quality lines having the corresponding index column information that have been processed during group rearrangement;
- step 2.3) sequentially scan the current quality line Data[i] of the original data block Data. If the end of the original data block Data has been reached, jump to step 2.6); otherwise, take out the index column information Index of the current quality line Data[i], where Data[i] is the content of quality line i in the original data block Data, and increment the line number i of the current quality line by 1;
- step 2.7) sequentially scan the entries of the group information table IIT and set, for each index column information, the starting position of its group. If the end of the group information table IIT has been reached, this step ends and execution jumps to step 3); otherwise, for the currently scanned entry j: if the sequence number j of the entry is 0, set the variable start of entry j to 0 and the variable temp to 0, increment the current entry number j by 1, and continue with step 2.7);
- otherwise, set the variable start of entry j to the sum of the variable start and the variable num of the previous entry j-1, set the variable temp of entry j to 0, increment the current entry number j by 1, and continue with step 2.7).
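The construction of the group information table in step 2) can be sketched as follows. This is an illustrative reading of the steps, not the patent's code: the names `Entry`, `get_index` and `build_iit` are hypothetical, quality lines are assumed to be Python strings, and `index_no` is assumed to be a list of 0-based column numbers.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    index: str      # index column information shared by the group
    num: int = 0    # number of quality lines in the group
    start: int = 0  # first row of the group after rearrangement
    temp: int = 0   # lines of the group already placed


def get_index(line: str, index_no: list[int]) -> str:
    """Concatenate the characters of `line` found at the index columns."""
    return "".join(line[c] for c in index_no)


def build_iit(data: list[str], index_no: list[int]) -> list[Entry]:
    iit: list[Entry] = []
    # Steps 2.2-2.5: count lines per distinct index value, in first-seen order.
    for line in data:
        key = get_index(line, index_no)
        for entry in iit:
            if entry.index == key:
                entry.num += 1
                break
        else:
            # No entry matched, so a new entry is appended (step 2.5).
            iit.append(Entry(index=key, num=1))
    # Steps 2.6-2.7: start is the running sum of the preceding num values.
    for j, entry in enumerate(iit):
        entry.start = 0 if j == 0 else iit[j - 1].start + iit[j - 1].num
        entry.temp = 0
    return iit
```

The linear search over the table mirrors the wording of step 2.4); a dictionary keyed by the index value would do the same lookup in constant time when there are many distinct index values.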
- the detailed steps of step 3) include:
- the invention also provides a decompression and restoration method for gene sequencing quality line data, and the implementation steps include:
- the detailed steps of step S3) include:
- initialize the number of entries k of the group information table IIT to 0. Each entry of the group information table IIT comprises a sequence number, index column information Index, a variable num, a variable start, and a variable temp, where the variable num is the number of quality lines having the corresponding index column information, the variable start is the starting position of the quality lines having that index column information after grouping and sorting, and the variable temp is the number of quality lines having the corresponding index column information that have been processed during data restoration;
- step S3.3) sequentially scan the index column data Index_Data. If the end of the index column data Index_Data has been reached, jump to step S3.6); otherwise, take out the index column information Index_Data[i] of the current row of the index column data Index_Data;
- step S3.7) sequentially scan the group information table IIT and set, for each index column information, the starting position of its group. If the end of the group information table IIT has been reached, jump to step S4); otherwise, for entry j of the group information table IIT: if the sequence number j of entry j is 0, set the variable start and the variable temp of entry j to 0, increment the sequence number j by 1, and continue with step S3.7); otherwise, set the variable start of entry j to the sum of the variable start and the variable num of the previous entry j-1, set the variable temp of entry j to 0, increment the sequence number j by 1, and continue with step S3.7).
- the detailed steps of step S4) include:
- step S4.2) obtain the index column information of row Grouped_Data[k] of the group-rearranged data: if the end of the group-rearranged data Grouped_Data has been reached, jump to step S5); otherwise, scan the group information table IIT and find the entry j such that the line number k is greater than or equal to the variable start of entry j and less than or equal to the sum of the variable start and the variable num of entry j; the index column information corresponding to the current row Grouped_Data[k] of the group-rearranged data Grouped_Data is then the index column information Index of entry j;
- step S4.8) determine whether the line number k of the current row has exceeded the maximum number of rows of the group-rearranged data Grouped_Data. If it has not, jump to step S4.2); otherwise, jump to step S5).
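The entry lookup of step S4.2) can be sketched as a scan over the group ranges. The function name `find_group` and the tuple layout `(index, start, num)` are hypothetical. One point is an assumption of this sketch rather than the wording of the text: with 0-based row numbers, row start + num already belongs to the next group, so the upper bound is tested as strictly less than, which keeps the ranges of consecutive entries disjoint.

```python
def find_group(iit: list[tuple[str, int, int]], k: int) -> tuple[int, int]:
    """Return (j, r) for row k of Grouped_Data.

    `iit` holds one (index, start, num) tuple per entry. j is the entry whose
    rows contain k, and r = k - start is the 0-based order of the row inside
    its group.
    """
    for j, (_index, start, num) in enumerate(iit):
        # Half-open range test (assumption): start <= k < start + num.
        if start <= k < start + num:
            return j, k - start
    raise IndexError(f"row {k} is outside Grouped_Data")
```

Because the start values are increasing, a binary search (for example with the standard `bisect` module) could replace the linear scan when the table is large.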
- the present invention also provides a gene sequencing quality line data compression system comprising a computer system programmed to perform the steps of the gene sequencing quality line data compression preprocessing method of the present invention.
- the present invention also provides a gene sequencing quality line data compression system comprising a computer system programmed to perform the steps of the gene sequencing quality line data decompression and reduction method of the present invention.
- quality lines with similar sequencing quality can be gathered together to improve compression efficiency.
- similar quality lines often show strong similarity in certain columns. In particular, the detection results of the first few columns correlate strongly with the quality of the whole quality line, so these columns can be used as index columns.
- the invention gathers the quality lines that have the same index columns, so that quality line data of similar sequencing quality ends up together and the subsequent compression algorithm compresses it better.
- the results of the compression preprocessing of the present invention include Grouped_Data, Index_Data, and Index_No, where Index_Data is the index column information extracted from the original data block, and Grouped_Data is the remaining data of the rearranged quality lines after the index column information has been removed.
- Index_No is the column number information of the index columns. Usually there are only a few index columns, so a few bytes suffice to record their column numbers. Under normal circumstances Index_No can simply take a default value, in which case it does not need to be saved and no additional storage overhead is introduced. If another way of choosing the index columns is used, only a few bytes are added to hold the column numbers, which is negligible relative to gigabytes of quality line data.
- the computational cost is small. After optimization, the compression preprocessing of the method takes about 2 seconds for 4 GB of quality line data, which fully satisfies the requirement of real-time processing of gene sequencing data.
- the decompression and restoration method for gene sequencing quality line data of the invention is the inverse of the compression preprocessing method of the invention and has the corresponding advantages, which are not repeated here.
- the gene sequencing quality line data compression system of the present invention comprises a computer system programmed to perform the steps of the compression preprocessing method or of the decompression and restoration method of the present invention, and likewise has the advantages of the compression preprocessing method, which are not repeated here.
- FIG. 1 is a schematic diagram of a basic flow of a compression preprocessing method according to an embodiment of the present invention.
- FIG. 2 is a schematic diagram of a basic flow of a decompression and reduction method according to an embodiment of the present invention.
- the implementation steps of the compression preprocessing method for gene sequencing quality line data of the present embodiment include:
- the quality lines of the original data block Data are regrouped and rearranged according to the index column information, and the index column data is deleted, to obtain the group-rearranged data Grouped_Data;
- in step 1), the function used to determine the column number Index_No of the index columns is:
- the detailed steps of step 2) include:
- initialize the number of entries of the group information table IIT to 0. Each entry of the group information table IIT comprises a sequence number, index column information Index, a variable num, a variable start, and a variable temp, where the variable num is the number of quality lines having the corresponding index column information, the variable start is the starting position of the quality lines having that index column information after grouping and sorting, and the variable temp is the number of quality lines having the corresponding index column information that have been processed during group rearrangement;
- step 2.7) sequentially scan the entries of the group information table IIT and set, for each index column information, the starting position of its group. If the end of the group information table IIT has been reached, this step ends and execution jumps to step 3); otherwise, for the currently scanned entry j: if the sequence number of entry j is 0, set the variable start of entry j to 0 and the variable temp to 0, and increment j by 1;
- otherwise, set the variable start of entry j to the sum of the variable start and the variable num of the previous entry j-1, set the variable temp of entry j to 0, and increment j by 1.
- the detailed steps of step 3) include:
- the index columns of all the quality lines are taken out of the original data block Data in ascending order of column number:
- Index_Data = get_index_all(Data, Index_No);
- the column number Index_No of the index columns, the index column data Index_Data of the original data block Data, and the group-rearranged data Grouped_Data are output as the compression preprocessing result.
- the compression preprocessing method for gene sequencing quality line data of this embodiment is called grouping by index column (GIC). The basic idea is to take several columns from the input quality line file or data block as index columns and then rearrange all the quality line data so that quality lines with identical index columns are grouped together, ordered within each group by their relative position in the original data block. Because quality lines with the same index columns tend to be more similar, this rearrangement brings similar quality lines of the sequencing result together and improves the local similarity of the data. Applying the BWT and subsequent compression to data preprocessed in this way can further improve the compression efficiency of gene sequencing data.
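The whole preprocessing pass (steps 2 to 4) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patent's implementation: the name `gic_preprocess` is hypothetical, quality lines are Python strings, `index_no` lists 0-based column numbers, and dictionaries stand in for the scan of the group information table IIT.

```python
from collections import Counter


def gic_preprocess(data: list[str], index_no: list[int]):
    """Group quality lines by index columns; return (Grouped_Data, Index_Data)."""
    cols = set(index_no)
    # Index_Data: the index column characters of every line, in original order.
    keys = ["".join(line[c] for c in index_no) for line in data]
    # num per distinct key, in first-seen order (Counter keeps insertion order).
    num = Counter(keys)
    # start per key: running sum of the sizes of the groups before it.
    start, total = {}, 0
    for key, n in num.items():
        start[key] = total
        total += n
    temp = dict.fromkeys(num, 0)  # lines already placed per group
    grouped = [""] * len(data)
    for line, key in zip(data, keys):
        k = start[key] + temp[key]  # target row, as in step 3.5)
        # Store the line without its index columns.
        grouped[k] = "".join(ch for c, ch in enumerate(line) if c not in cols)
        temp[key] += 1
    return grouped, keys
```

For the five lines `["FFA1", "F:B2", "FFC3", "F:D4", "AAE5"]` with index columns 0 and 1, the lines starting with `FF` are placed first, then those starting with `F:`, then `AA`, each group keeping its original relative order. Only the reordered remainder and the per-line index values are produced, so no position table needs to be stored.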
- the present invention does not introduce additional storage overhead, and achieves data rearrangement in a large data window with only a small computational overhead, thereby improving compression efficiency.
- the compression preprocessing method of the present embodiment is suitable for compression preprocessing of the quality line data in the gene sequencing result file (FASTQ), and the larger the data block, the more obvious the advantage.
- the input to the compression preprocessing of the present embodiment is the quality line data obtained by gene sequencing. The volume of quality line data is huge, usually hundreds of MB generated per minute, and it consists of many quality lines.
- the index-column-grouping compression preprocessing of this embodiment rearranges the quality lines according to the information each quality line holds at the index column positions, yielding the converted quality line data, which is then passed to subsequent compression.
- the compression preprocessing method of the present embodiment can improve the local similarity of the data across a large data block, thereby improving the compression efficiency of the gene sequencing data.
- the decompression part of the present invention restores the original data block Data from the index column data Index_Data, the group-rearranged data Grouped_Data, and the column number Index_No of the index columns. Because Index_Data holds the index column content of the original data block Data, the group information table is easily rebuilt from Index_Data. Using the group information table, each row of Grouped_Data can be restored to its original row position in the original data block Data and merged with the index column data Index_Data, which restores the original data block Data. As shown in FIG. 2, the implementation steps of the decompression and restoration method for gene sequencing quality line data of the present embodiment include:
- the detailed steps of step S3) include:
- initialize the number of entries k of the group information table IIT to 0. Each entry of the group information table IIT comprises a sequence number, index column information Index, a variable num, a variable start, and a variable temp, where the variable num is the number of quality lines having the corresponding index column information, the variable start is the starting position of the quality lines having that index column information after grouping and sorting, and the variable temp is the number of quality lines having the corresponding index column information that have been processed during data restoration;
- step S3.3) sequentially scan the index column data Index_Data. If the end of the index column data Index_Data has been reached, jump to step S3.6); otherwise, take out the index column information Index_Data[i] of the current row of the index column data Index_Data;
- otherwise, set the variable start of entry j to the sum of the variable start and the variable num of the previous entry j-1, set the variable temp of entry j to 0, and increment the sequence number j by 1;
- the detailed steps of step S4) include:
- step S4.2) obtain the index column information of row Grouped_Data[k] of the group-rearranged data: if the end of the group-rearranged data Grouped_Data has been reached, jump to step S5); otherwise, scan the group information table IIT and find the entry j such that the line number k is greater than or equal to the variable start of entry j and less than or equal to the sum of the variable start and the variable num of entry j (IIT[j].start ≤ k ≤ IIT[j].start + IIT[j].num); the index column information corresponding to the current row Grouped_Data[k] is then the index column information Index of entry j (IIT[j].index);
- step S4.8) determine whether the line number k of the current row has exceeded the maximum number of rows of the group-rearranged data Grouped_Data. If it has not, jump to step S4.2); otherwise, jump to step S5).
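The restoration can be sketched end to end as follows. This is a reformulation under stated assumptions rather than the step order of S4): instead of scanning Grouped_Data and searching Index_Data for the r-th occurrence, it walks Index_Data in original row order and pulls the next unused row of the matching group, which yields the same mapping. The name `gic_restore` is hypothetical, lines are Python strings, and `index_no` lists 0-based column numbers in the same order used during preprocessing.

```python
def gic_restore(grouped: list[str], index_data: list[str],
                index_no: list[int]) -> list[str]:
    """Rebuild the original quality lines from Grouped_Data and Index_Data."""
    # Row t of Index_Data is the index of original line t, so replaying it in
    # order reproduces the group bookkeeping of the preprocessing pass.
    num: dict[str, int] = {}
    for key in index_data:
        num[key] = num.get(key, 0) + 1
    start, total = {}, 0
    for key, n in num.items():
        start[key] = total
        total += n
    temp = dict.fromkeys(num, 0)  # rows of each group already restored
    data = []
    for key in index_data:
        row = grouped[start[key] + temp[key]]
        temp[key] += 1
        rest = iter(row)
        index_chars = dict(zip(index_no, key))
        width = len(index_no) + len(row)
        # Interleave index characters and remaining characters by column.
        data.append("".join(index_chars[c] if c in index_chars else next(rest)
                            for c in range(width)))
    return data
```

Each group is consumed in order, so the r-th line with a given index value in the original block receives row start + r of Grouped_Data, which is the relationship stated in steps S4.4) and S4.5).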
- the embodiment further provides a gene sequencing quality line data compression system, comprising a computer system, the computer device being programmed to perform the steps of the foregoing gene sequencing quality line data compression preprocessing method of the embodiment.
- the embodiment further provides a gene sequencing quality line data compression system, comprising a computer system, the computer device being programmed to perform the steps of the foregoing gene sequencing quality line data decompression and reduction method of the embodiment.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Genetics & Genomics (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Claims (8)
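- A compression preprocessing method for gene sequencing quality line data, characterized in that the implementation steps comprise: 1) reading the original data block Data of quality line data and determining the column number Index_No of its index columns; 2) building a group information table IIT according to the index columns of the original data block Data; 3) according to the group information table IIT, regrouping and rearranging the quality lines of the original data block Data by index column information and deleting the index column data, to obtain the group-rearranged data Grouped_Data; 4) extracting the index column data Index_Data of the original data block Data, and outputting the column number Index_No of the index columns, the index column data Index_Data of the original data block Data, and the group-rearranged data Grouped_Data as the compression preprocessing result.
- The compression preprocessing method for gene sequencing quality line data according to claim 1, characterized in that the detailed steps of step 2) comprise: 2.1) initializing the number of entries of the group information table IIT to 0, where each entry of the group information table IIT comprises a sequence number, index column information Index, a variable num, a variable start and a variable temp, the variable num being the number of quality lines having the corresponding index column information, the variable start being the starting position of the quality lines having that index column information after grouping and sorting, and the variable temp being the number of quality lines having the corresponding index column information that have been processed during group rearrangement; 2.2) initializing the line number i of the current quality line of the original data block Data to 0; 2.3) sequentially scanning the current quality line Data[i] of the original data block Data; if the end of the original data block Data has been reached, jumping to step 2.6); otherwise taking out the index column information Index of the current quality line Data[i], where Data[i] is the content of the current quality line i in the original data block Data, and incrementing the line number i of the current quality line by 1; 2.4) searching all entries of the group information table IIT; if the index column information of some entry j of the group information table IIT equals the index column information Index of the current quality line Data[i], incrementing the variable num of entry j by 1 and jumping to step 2.3); otherwise, jumping to step 2.5); 2.5) creating a new entry k in the group information table IIT, setting the index column information IIT[k].Index of entry k equal to the index column information Index of the current quality line Data[i] and the variable num of entry k equal to 1, incrementing the sequence number k by 1, and jumping to step 2.3); 2.6) initializing the current entry j of the group information table IIT to 0; 2.7) sequentially scanning the entries of the group information table IIT and setting, for each index column information, the starting position of its group; if the end of the group information table IIT has been reached, this step ends and execution jumps to step 3); otherwise, for the currently scanned entry j of the group information table IIT: if the sequence number j of the entry is 0, setting the variable start of entry j to 0 and the variable temp to 0, incrementing the current entry number j by 1, and continuing with step 2.7); otherwise, setting the variable start of entry j to the sum of the variable start and the variable num of the previous entry j-1, setting the variable temp of entry j to 0, incrementing the current entry number j by 1, and continuing with step 2.7).
- The compression preprocessing method for gene sequencing quality line data according to claim 1, characterized in that the detailed steps of step 3) comprise: 3.1) allocating space for the group-rearranged data Grouped_Data, with the same number of rows as the original data block Data; 3.2) initializing the line number i of the current quality line of the original data block Data to 0; 3.3) scanning the current quality line of the original data block Data, the data of the current quality line being Data[i], where i is the line number of the current quality line, and taking out the index column information Index of the current quality line Data[i]; 3.4) searching the group information table IIT for the entry j whose index information is the same as Index; 3.5) inserting into the group-rearranged data Grouped_Data the quality line data with the index column information removed, where the insertion position k is the sum of the variable start and the variable temp of entry j, and incrementing the variable temp of entry j by 1; 3.6) incrementing the line number i by 1 and determining whether the line number i exceeds the total number of lines of the original data block Data; if it does not, jumping to step 3.3); otherwise, jumping to step 4).
- A decompression and restoration method for gene sequencing quality line data, characterized in that the implementation steps comprise: S1) reading the index column data Index_Data, the group-rearranged data Grouped_Data and the column number Index_No of the index columns obtained after decompression, determining the number of quality lines and the character data of each line of the original data block Data from the group-rearranged data Grouped_Data and the column number information Index_No of the index columns, and allocating space to store the original data block Data; S2) according to the column number Index_No of the index columns, assigning each column of the index column data Index_Data to the corresponding column of the original data block Data whose column number is recorded in Index_No; S3) building the group information table IIT from the index column data Index_Data; S4) according to the group information table IIT, scanning each row of the group-rearranged data Grouped_Data in turn, determining the position of the row in the original data block from the group information table IIT and the index column data Index_Data, and writing it into the corresponding quality line of the original data block Data; S5) outputting the original data block Data.
- The decompression and restoration method for gene sequencing quality line data according to claim 4, characterized in that the detailed steps of step S3) comprise: S3.1) initializing the number of entries k of the group information table IIT to 0, where each entry of the group information table IIT comprises a sequence number, index column information Index, a variable num, a variable start and a variable temp, the variable num being the number of quality lines having the corresponding index column information, the variable start being the starting position of the quality lines having that index column information after grouping and sorting, and the variable temp being the number of quality lines having the corresponding index column information that have been processed during data restoration; S3.2) initializing the line number i of the current row of the index column data Index_Data to 0; S3.3) sequentially scanning the index column data Index_Data; if the end of the index column data Index_Data has been reached, jumping to step S3.6); otherwise taking out the index column information Index_Data[i] of the current row of the index column data Index_Data; S3.4) searching all entries of the group information table IIT; if there is an entry j whose index column information Index is the same as the current index column information Index_Data[i], incrementing the variable num of entry j by 1 and jumping to step S3.3); otherwise, jumping to step S3.5); S3.5) creating a new entry k in the group information table IIT, with the index column information Index of entry k equal to the current index column information Index_Data[i] and the variable num equal to 1, incrementing the number of entries k by 1, and jumping to step S3.3); S3.6) initializing the current entry j of the group information table IIT to 0; S3.7) sequentially scanning the group information table IIT and setting, for the current index column information, the starting position of its group; if the end of the group information table IIT has been reached, jumping to step S4); otherwise, for entry j of the group information table IIT: if the sequence number j of entry j is 0, setting the variable start and the variable temp of entry j to 0, incrementing the sequence number j by 1, and continuing with step S3.7); otherwise, setting the variable start of entry j to the sum of the variable start of the previous entry j-1 and the variable num of the previous entry j-1, setting the variable temp of entry j to 0, incrementing the sequence number j by 1, and continuing with step S3.7).
- The decompression and restoration method for gene sequencing quality line data according to claim 4, characterized in that the detailed steps of step S4) comprise: S4.1) initializing the line number k of the current row of the group-rearranged data Grouped_Data to 0; S4.2) obtaining the index column information of the group-rearranged data Grouped_Data[k]: if the end of the group-rearranged data Grouped_Data has been reached, jumping to step S5); otherwise, scanning the group information table IIT and finding the entry j of the group information table IIT such that the line number k is greater than or equal to the variable start of entry j and less than or equal to the sum of the variable start of entry j and its variable num, whereupon the index column information corresponding to the current row Grouped_Data[k] of the group-rearranged data Grouped_Data is the index column information Index of entry j; S4.3) merging the current row Grouped_Data[k] of the group-rearranged data Grouped_Data with the index column information Index of entry j to generate the complete quality line Temp_Read; S4.4) obtaining the order of occurrence r of the complete quality line Temp_Read among the quality lines of the original data block Data having the same index column information, the value of r being the difference between the line number k of the current row and the variable start of entry j; S4.5) sequentially scanning the index column data Index_Data to find the item t that is the r-th item whose index column information equals the index column information Index of entry j of the group information table IIT, thereby determining the line number t of the complete quality line Temp_Read in the original data block; S4.6) writing the complete quality line Temp_Read to line number t of the original data block Data; S4.7) incrementing the line number k of the current row of the group-rearranged data Grouped_Data by 1; S4.8) determining whether the line number k of the current row has exceeded the maximum number of rows of the group-rearranged data Grouped_Data; if it has not, jumping to step S4.2); otherwise, jumping to step S5).
- A gene sequencing quality line data compression system comprising a computer system, characterized in that the computer device is programmed to perform the steps of the compression preprocessing method for gene sequencing quality line data according to any one of claims 1 to 3.
- A gene sequencing quality line data compression system comprising a computer system, characterized in that the computer device is programmed to perform the steps of the decompression and restoration method for gene sequencing quality line data according to any one of claims 4 to 6.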
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/969,197 US20200402618A1 (en) | 2018-04-27 | 2019-04-12 | Gene sequencing quality line data compression pre-processing and decompression and restoration methods, and system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810392727.7A CN110428868B (zh) | 2018-04-27 | 2018-04-27 | Gene sequencing quality line data compression preprocessing and decompression restoration methods, and system |
CN201810392727.7 | 2018-04-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019205963A1 true WO2019205963A1 (zh) | 2019-10-31 |
Family
ID=68294708
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/082466 WO2019205963A1 (zh) | Gene sequencing quality line data compression preprocessing and decompression restoration methods, and system | 2018-04-27 | 2019-04-12 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20200402618A1 (zh) |
CN (1) | CN110428868B (zh) |
WO (1) | WO2019205963A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113157655A (zh) * | 2020-01-22 | 2021-07-23 | 阿里巴巴集团控股有限公司 | Data compression and decompression method, apparatus, electronic device and storage medium |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11329665B1 (en) * | 2019-12-11 | 2022-05-10 | Xilinx, Inc. | BWT circuit arrangement and method |
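CN113098526B (zh) * | 2021-04-08 | 2022-04-12 | 哈尔滨工业大学 | DNA self-index interval decompression method |
CN113555061B (zh) * | 2021-07-23 | 2023-03-14 | 哈尔滨因极科技有限公司 | Data workflow processing method for variant detection without a reference genome |
CN115083530B (zh) * | 2022-08-22 | 2022-11-04 | 广州明领基因科技有限公司 | Gene sequencing data compression method, apparatus, terminal device and storage medium |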
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
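CN104636349A (zh) * | 2013-11-07 | 2015-05-20 | 阿里巴巴集团控股有限公司 | Method and device for index data compression and index data search |
CN105224828A (zh) * | 2015-10-09 | 2016-01-06 | 人和未来生物科技(长沙)有限公司 | Key-value index data compression method for fast locating of gene sequence fragments |
CN105550535A (zh) * | 2015-12-03 | 2016-05-04 | 人和未来生物科技(长沙)有限公司 | Encoding method for fast encoding of gene character sequences into binary sequences |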
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012177792A2 (en) * | 2011-06-24 | 2012-12-27 | Sequenom, Inc. | Methods and processes for non-invasive assessment of a genetic variation |
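KR101922129B1 (ko) * | 2011-12-05 | 2018-11-26 | 삼성전자주식회사 | Method and apparatus for compressing and decompressing genetic information obtained using next-generation sequencing |
CN107633158B (zh) * | 2016-07-18 | 2020-12-01 | 三星(中国)半导体有限公司 | Method and device for compressing and decompressing gene sequences |
CN106971090A (zh) * | 2017-03-10 | 2017-07-21 | 首度生物科技(苏州)有限公司 | Gene sequencing data compression and transmission method |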
-
2018
- 2018-04-27 CN CN201810392727.7A patent/CN110428868B/zh active Active
-
2019
- 2019-04-12 US US16/969,197 patent/US20200402618A1/en active Pending
- 2019-04-12 WO PCT/CN2019/082466 patent/WO2019205963A1/zh active Application Filing
Non-Patent Citations (2)
Title |
---|
XING, YUTING ET AL.: "GTZ: a fast compression and cloud transmission tool optimized for FASTQ files", BMC BIOINFORMATICS, 28 December 2017 (2017-12-28), pages 233 - 242, XP021252095, DOI: 10.1186/s12859-017-1973-5 * |
ZHOU, QINGHUA ET AL.: "Non-Invasive Prenatal Second-Generation Sequencing Data Analysis And Interpretation Platform Based On Cloud Computing Technology", COMPILATION OF PAPERS OF 15TH NATIONAL CONFERENCE OF MEDICAL GENETICS OF CHINESE MEDICAL ASSOCIATION, FIRST NATIONAL ACADEMIC CONFERENCE OF MEDICAL GENETICS BRANCH OF CHINESE MEDICAL DOCTOR ASSOCIATION, AND 2016 ZHEJIANG MEDICAL GENETICS ANNUAL CONFE, 30 November 2016 (2016-11-30), pages 11 * |
Also Published As
Publication number | Publication date |
---|---|
CN110428868A (zh) | 2019-11-08 |
CN110428868B (zh) | 2021-11-26 |
US20200402618A1 (en) | 2020-12-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
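WO2019205963A1 (zh) | Gene sequencing quality line data compression preprocessing and decompression restoration methods, and system | |
Solomon et al. | Improved search of large transcriptomic sequencing databases using split sequence bloom trees | |
CN107742061B (zh) | Protein interaction prediction method, system and apparatus | |
Russell et al. | A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences | |
CN110021369B (zh) | Gene sequencing data compression and decompression method, system and computer-readable medium | |
CN105760706A (zh) | Compression method for second-generation sequencing data | |
Apostolico et al. | Sequence similarity measures based on bounded hamming distance | |
CN115514376A (zh) | High-frequency time-series data compression method and apparatus based on improved symbolic aggregate approximation | |
CN111010189B (zh) | Multi-path compression method, apparatus and storage medium for data sets | |
CN110310709B (zh) | Reference-sequence-based gene compression method | |
Zhang et al. | A complexity-based method to compare RNA secondary structures and its application | |
CN104284189B (zh) | Improved BWT data compression method and hardware implementation system thereof | |
CN107103206A (zh) | DNA sequence clustering using locality-sensitive hashing based on standard entropy | |
JP2003188735A (ja) | Data compression apparatus, method and program | |
CN107818325A (zh) | Image sparse representation method based on ensemble dictionary learning | |
CN105224697A (zh) | Sorting method with filter conditions and apparatus for performing the method | |
Stošić et al. | Jackstrapping DEA scores for robust efficiency measurement | |
KR100537636B1 (ko) | Apparatus and method for predicting transcription factor binding sites through extraction of similar sequences | |
Beal et al. | Compressing genome resequencing data via the maximal longest factor | |
TW202318434A (zh) | Data processing system for processing gene sequencing data | |
US11929150B2 (en) | Methods and apparatuses for performing character matching for short read alignment | |
JPH08221254A (ja) | Merge sort method and merge sort apparatus | |
He et al. | A Novel Compression Algorithm for High-Throughput DNA Sequence Based on Huffman Coding Method | |
CN115547414B (zh) | Method, apparatus, computer device and storage medium for determining potential virulence factors | |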
Pratas et al. | An experimental sorting method for improving metagenomic data encoding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19793052 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19793052 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17/02/2021) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19793052 Country of ref document: EP Kind code of ref document: A1 |