CN111145834B - Multithreading gene data compression method and device - Google Patents


Info

Publication number
CN111145834B
Authority
CN
China
Prior art keywords
sequence
gene sequence
information
compressed
reference gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911200154.4A
Other languages
Chinese (zh)
Other versions
CN111145834A (en)
Inventor
刘华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Shuguang Nanjing Computing Technology Co ltd
Original Assignee
Zhongke Shuguang Nanjing Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Shuguang Nanjing Computing Technology Co ltd
Priority to CN201911200154.4A
Publication of CN111145834A
Application granted
Publication of CN111145834B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioethics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multithread gene data compression method and device, comprising the following steps: extracting a reference gene sequence from the gene sequence to be compressed; acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; and carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result. By adopting the scheme, the compression rate can be greatly improved.

Description

Multithreading gene data compression method and device
Technical Field
The invention relates to the field of gene data, in particular to a multithreading gene data compression method and device.
Background
The research of genes and DNA is widely applied to a plurality of important fields such as biology, medicine, genetic science and the like.
Gene data volumes are huge. Traditional text compression tools such as gzip and bzip2 compress gene sequence data inefficiently: they can only reduce the original data to 1/4 to 1/3 of its size, which does not meet the challenges posed by large volumes of gene data.
Disclosure of Invention
Purpose of the invention: the invention aims to provide a multithread gene data compression method and device.
The technical scheme is as follows: the embodiment of the invention provides a multithreading gene data compression method, which comprises the following steps: extracting a reference gene sequence from the gene sequence to be compressed; acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; and carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result.
Specifically, extracting a reference gene sequence comprising base information and lower case character information from a gene sequence to be compressed; the gene sequence to be compressed is in the FASTA format.
Specifically, the data information in the reference gene sequence is read, and the identification information, the line width information, the data content information and the position information in the reference gene sequence are determined.
Specifically, reading the reference gene sequence to generate a reference gene data sequence S, and recording the first line of data as identification information identifier;
reading second row data of the reference gene data sequence, and counting the length of the character string to be used as row width information;
inquiring lowercase character information in the reference gene data sequence, acquiring the starting position lowVecBegin and the length information lowVecLength of each segment of lowercase characters, and converting the lowercase character sequence into an uppercase character sequence S_1;
extracting from the uppercase character sequence S_1 the A, C, G and T base information sequence S_2, represented as an array code, together with the position speChaPos and content speChaCh of data that is not A, C, G, T or N;
obtaining the starting position nVecBegin and the length information nVecLength of each segment of N characters in the uppercase character sequence S_1;
encoding lowVecBegin, speChaPos and nVecBegin using the RLE algorithm to obtain new lowVecBegin, speChaPos and nVecBegin.
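The RLE step above can be illustrated with a minimal Python sketch. The text does not say how the position arrays are laid out before run-length encoding, so the (value, count) pair format and the function names rle_encode and rle_decode are assumptions made for illustration only.

```python
def rle_encode(values):
    """Run-length encode a list of integers as (value, count) pairs.

    The pair layout is an assumption; the patent only names the RLE algorithm.
    """
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1          # extend the current run
        else:
            encoded.append([v, 1])       # start a new run
    return [tuple(pair) for pair in encoded]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [v for v, n in pairs for _ in range(n)]
```

Run-length encoding only pays off when the input contains repeated values, so a real implementation would probably transform the positions first (for example into gaps between neighbours); that choice is not stated in the text.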
Specifically, k-mer sequences of length k are read from the reference gene sequence code, each base is encoded as A=1, C=2, G=3 and T=4, and the Hash value H of each k-mer sequence S_j is calculated using the following formula (1):
wherein j = 0, 1, 2, ..., k-1;
creating arrays Ref_loc and Ref_bucket, calculating the Hash value value_i of the i-th k-mer sequence, assigning Ref_loc(i) = Ref_bucket(value_i), then updating Ref_bucket(value_i) = i, and querying by Hash value the position id = n corresponding to the same Hash value;
matching the base information of the reference gene sequence S_ref with the base information of the gene sequence to be compressed S_tar using the Hash algorithm, wherein the matching sequence matchResult is a set of matchEntry elements, and each matchEntry comprises the character string misMatchedStr that failed to match between the last matching position and the current matching position, the length pos from the last matching position to the current matching position, and the length length of the current match.
Specifically, the Hash value value_i of a k-mer sequence in the gene sequence to be compressed is calculated, and the corresponding position id = Ref_bucket(value_i) in the reference gene sequence is queried;
starting from position id of the reference gene sequence and position i of the gene sequence to be compressed, comparing whether the code values are the same, and recording the length length corresponding to each id until the values differ;
traversing, through Ref_loc, all ids matching the Hash value, comparing the length of each id, taking the id with the maximum length maxLength as the matching result position, recording pos as the difference between the current matching position and the last matching position, and recording length as the difference between maxLength and the minimum matching length k+1;
obtaining the positions in the reference gene sequence of the lowVec entries identical to those of the gene sequence to be compressed and generating lowVecMatched, recording in order in diffLowVec the lowVec entries of the gene sequence to be compressed that cannot be matched, and recording the corresponding lowVecMatched.
Specifically, the Hash value of each matchEntry is calculated such that, if any character in misMatchedStr differs, the corresponding Hash value H_me differs; H_me is calculated using the following formula (2):
dividing the matchResult sequence into a plurality of k-mer sequences of length k', calculating the absolute value of the sum of H_me in each k-mer sequence, setting the Hash table length to the minimum prime number seqBucketLen greater than the length of each sequence in the gene sequence to be compressed, and calculating the remainder H_2 using the following formula (3):
calculating the Hash value value_i of the i-th k-mer sequence, assigning Seq_loc(i) = Seq_bucket(value_i), and then updating Seq_bucket(value_i) = i.
Specifically, the compression results obtained by compressing the gene data concurrently in each thread are stored in temporary files;
the temporary files are written into the compressed file.
The embodiment of the invention also provides a multithreading gene data compression device, which comprises: the device comprises a reference unit, an extraction unit, a matching unit and a compression unit, wherein: the reference unit is used for extracting a reference gene sequence from the gene sequence to be compressed; the extraction unit is used for acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; the matching unit is used for matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; the compression unit is used for carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result.
Specifically, the compression unit is configured to store the compression results obtained by compressing the gene data concurrently in each thread in temporary files, and to write the temporary files into the compressed file.
The beneficial effects are that: compared with the prior art, the invention has the following remarkable advantage: because human genomes are about 99.9% similar, matching the gene sequence to be compressed against the reference gene sequence and storing only the matches and the differences greatly improves the compression ratio.
Drawings
Fig. 1 is a flow chart of a multi-threaded gene data compression method according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
Fig. 1 shows a flow chart of a multithread gene data compression method according to an embodiment of the invention; its specific steps are described in detail below.
Step S101, extracting a reference gene sequence from the gene sequence to be compressed.
In the embodiment of the invention, a reference gene sequence comprising base information and lowercase character information is extracted from a gene sequence to be compressed; the gene sequence to be compressed is in the FASTA format.
In a specific implementation, a reference-based compression algorithm requires two inputs: the reference gene sequence and the gene sequence to be compressed. To achieve lossless compression, the identifier information, line width information, basic base information (A, C, G, T), lowercase information, N character information and other character information (R, Y, etc.) of the gene sequence to be compressed must be stored and compressed. N characters and other characters stand for bases that were not determined during sequencing, and the identifier and line width carry little information; none of this has matching value. Less information is therefore extracted from the reference sequence than from the sequence to be compressed: only the base information and the lowercase information.
Step S102, data information comprising base information of the reference gene sequence is obtained from the reference gene sequence.
In the embodiment of the invention, the data information in the reference gene sequence is read, and the identification information, the line width information, the data content information and the position information in the reference gene sequence are determined.
In a specific implementation, the data information extracted from the reference gene sequence is matched against the data information of the gene sequence to be compressed, and only the differing parts and the positions of the matching parts are stored, which greatly improves the compression ratio.
In the embodiment of the invention, obtaining the data information comprising base information of the reference gene sequence from the reference gene sequence comprises the following steps:
reading the reference gene sequence to generate a reference gene data sequence S, and recording the first line of data as identification information identifier;
reading second row data of the reference gene data sequence, and counting the length of the character string to be used as row width information;
inquiring lowercase character information in the reference gene data sequence, acquiring the starting position lowVecBegin and the length information lowVecLength of each segment of lowercase characters, and converting the lowercase character sequence into an uppercase character sequence S_1;
extracting from the uppercase character sequence S_1 the A, C, G and T base information sequence S_2, represented as an array code, together with the position speChaPos and content speChaCh of data that is not A, C, G, T or N;
obtaining the starting position nVecBegin and the length information nVecLength of each segment of N characters in the uppercase character sequence S_1;
encoding lowVecBegin, speChaPos and nVecBegin using the RLE algorithm to obtain new lowVecBegin, speChaPos and nVecBegin.
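The extraction steps above can be sketched in Python as follows. This is a minimal illustration for a single-record FASTA string, not the patented implementation: the helper runs, the function extract, the dictionary keys, and the choice to report positions relative to the uppercase sequence S_1 are all assumptions.

```python
from itertools import groupby

BASES = {"A": 1, "C": 2, "G": 3, "T": 4}   # numeric coding taken from the text


def runs(flags):
    """Return (begin, length) for each maximal run of True values."""
    out, pos = [], 0
    for flag, group in groupby(flags):
        n = len(list(group))
        if flag:
            out.append((pos, n))
        pos += n
    return out


def extract(fasta_text):
    """Split one FASTA record into the pieces named in the description."""
    lines = fasta_text.strip().splitlines()
    identifier = lines[0]                   # first line: identification information
    line_width = len(lines[1])              # length of the second line
    s = "".join(lines[1:])                  # sequence S without line breaks
    low = runs(c.islower() for c in s)      # lowercase runs (lowVecBegin, lowVecLength)
    s1 = s.upper()                          # uppercase sequence S_1
    n_runs = runs(c == "N" for c in s1)     # N runs (nVecBegin, nVecLength)
    # characters that are neither A, C, G, T nor N (speChaPos, speChaCh)
    special = [(i, c) for i, c in enumerate(s1) if c not in BASES and c != "N"]
    code = [BASES[c] for c in s1 if c in BASES]   # base sequence S_2 as numbers
    return {
        "identifier": identifier,
        "line_width": line_width,
        "low": low,
        "n_runs": n_runs,
        "special": special,
        "code": code,
    }
```

For the record `>chr1 test` followed by the lines `ACgtNN` and `RACG`, this yields a line width of 6, one lowercase run starting at position 2, one N run starting at position 4, the special character R at position 6, and the code array for ACGTACG.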
Step S103, matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed.
In the embodiment of the invention, a Hash table is first constructed: k-mer sequences of a specific length k are read from the reference gene sequence code, and each base is encoded as A=1, C=2, G=3 and T=4, so that each k-mer sequence is a numeric sequence S_j of length k, where j = 0, 1, 2, ..., k-1. The Hash value H of S_j is calculated as shown in formula (1), and the Hash algorithm is used to match the base information code of the reference gene sequence S_ref against that of the gene sequence to be compressed S_tar.
As can be seen from the above formula, when the step size k changes, H also changes, which changes the size of the distribution space; a change in step size also changes the matching rate. Because the length of the Hash table is limited and gene information is highly similar, Hash table collisions must be resolved. Here they are resolved by simulating separate chaining (the zipper method) with two arrays, Ref_loc and Ref_bucket. First, the Hash value value_i of the i-th k-mer sequence is calculated and assigned such that Ref_loc(i) = Ref_bucket(value_i); then Ref_bucket(value_i) is updated to i. After the Hash table is constructed, any value_m can be used to query the position id = n corresponding to the same Hash value in the Hash table, where n is always the largest position in the Hash chain; the previous position id = n' in the chain is obtained as n' = Ref_loc(n).
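The two-array chaining scheme described above can be sketched in Python as follows. Formula (1) is not reproduced in this text, so kmer_hash below is an assumed stand-in that reads the k codes as base-4 digits; the -1 marker for an empty slot and the function names are also assumptions.

```python
def kmer_hash(code, i, k):
    """Assumed stand-in for formula (1): read k codes (1..4) as base-4 digits."""
    h = 0
    for j in range(k):
        h = h * 4 + (code[i + j] - 1)
    return h


def build_ref_table(code, k):
    """Build Ref_loc and Ref_bucket for all k-mers of the reference code array."""
    n_kmers = max(len(code) - k + 1, 0)
    ref_bucket = [-1] * (4 ** k)     # newest position per Hash value, -1 = empty
    ref_loc = [-1] * n_kmers         # previous position with the same Hash value
    for i in range(n_kmers):
        value = kmer_hash(code, i, k)
        ref_loc[i] = ref_bucket[value]   # Ref_loc(i) = Ref_bucket(value_i)
        ref_bucket[value] = i            # Ref_bucket(value_i) = i
    return ref_loc, ref_bucket


def chain(ref_loc, ref_bucket, value):
    """List all positions sharing a Hash value, largest position first."""
    ids = []
    n = ref_bucket[value]
    while n != -1:
        ids.append(n)
        n = ref_loc[n]               # n' = Ref_loc(n)
    return ids
```

With the code array for ACACAC and k = 2, the k-mer AC occurs at positions 0, 2 and 4, so following the chain from Ref_bucket visits 4, then 2, then 0, which matches the statement that the bucket always holds the largest position.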
The purpose of matching the base information code of the reference gene sequence S_ref against that of the gene sequence to be compressed S_tar is to generate the sequence matchResult required for the second match; matchResult is a set of matchEntry elements. A matchEntry consists of three parts: misMatchedStr, pos and length. misMatchedStr is the string, represented as numbers, that failed to match between the last matching position and the current matching position; pos is the length from the last matching position to the current matching position; length is the length of the current match.
In the embodiment of the invention, after the construction of the reference sequence Hash table is completed, the Hash value value_i of each k-mer sequence in the gene sequence to be compressed is calculated by the same method, and the corresponding reference sequence position id = Ref_bucket(value_i) is queried in the Hash table. Starting from position id of the reference sequence and position i of the sequence to be compressed, the code values are compared until they differ, and the length length corresponding to that id is recorded. Ref_loc is used to traverse the ids of all matching Hash values on the chain, and the length of each id is compared to obtain the maximum length maxLength. The id corresponding to maxLength is then the final matching result position, and maxLength is the length of this matched segment. To further reduce the size of the matchEntry, pos is recorded as the difference (possibly negative) between the current and last matching positions, and length is recorded as the difference between maxLength and the minimum matching length k+1.
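A minimal Python sketch of this first-level matching is given below, under stated assumptions. kmer_hash is an assumed stand-in for formula (1), which is not reproduced in this text. The sketch reads "minimum matching length k+1" as a threshold below which a candidate is treated as a mismatch, and it measures pos from the reference position just after the previous match (prev_end); both readings, the -1 empty marker, and the helper restore are illustrative choices rather than details confirmed by the text.

```python
def kmer_hash(code, i, k):
    """Assumed stand-in for formula (1): read k codes (1..4) as base-4 digits."""
    h = 0
    for j in range(k):
        h = h * 4 + (code[i + j] - 1)
    return h


def build_ref_table(code, k):
    n_kmers = max(len(code) - k + 1, 0)
    ref_bucket = [-1] * (4 ** k)
    ref_loc = [-1] * n_kmers
    for i in range(n_kmers):
        value = kmer_hash(code, i, k)
        ref_loc[i] = ref_bucket[value]
        ref_bucket[value] = i
    return ref_loc, ref_bucket


def match(ref, tar, k):
    """Return (matchResult, trailing unmatched codes) for tar against ref."""
    ref_loc, ref_bucket = build_ref_table(ref, k)
    min_match = k + 1
    entries, mismatched = [], []      # entries hold (misMatchedStr, pos, length)
    prev_end, i = 0, 0
    while i < len(tar):
        best_id, max_length = -1, 0
        if i + k <= len(tar):
            n = ref_bucket[kmer_hash(tar, i, k)]
            while n != -1:            # walk every id on the chain
                length = 0
                while (n + length < len(ref) and i + length < len(tar)
                       and ref[n + length] == tar[i + length]):
                    length += 1
                if length > max_length:
                    best_id, max_length = n, length
                n = ref_loc[n]
        if max_length >= min_match:
            entries.append((mismatched, best_id - prev_end, max_length - min_match))
            mismatched = []
            prev_end = best_id + max_length
            i += max_length
        else:
            mismatched.append(tar[i])  # no usable match: keep the code literally
            i += 1
    return entries, mismatched


def restore(ref, entries, tail, k):
    """Rebuild the target code array from matchResult, to show losslessness."""
    out, prev_end = [], 0
    for mismatched, pos, length in entries:
        out.extend(mismatched)
        start, n = prev_end + pos, length + k + 1
        out.extend(ref[start:start + n])
        prev_end = start + n
    return out + tail
```

Storing pos and length as small differences, as the description suggests, keeps most matchEntry fields close to zero when the target closely follows the reference, which helps whatever entropy coder runs afterwards.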
In the embodiment of the invention, lowVec is matched as follows: the matching obtains the positions in the reference sequence of the lowVec entries identical to those of the gene sequence to be compressed and generates an array lowVecMatched; lowVec entries that cannot be matched are recorded in order in diffLowVec, and the corresponding lowVecMatched entry is set to 0.
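The lowVec matching above can be sketched in Python as follows. The sketch assumes that each lowVec entry is a (begin, length) pair and that matched entries are stored as 1-based indexes into the reference list, which leaves 0 free to mean "not matched" as the text requires; neither detail is stated explicitly, and the function name match_low_vec is hypothetical.

```python
def match_low_vec(ref_low, tar_low):
    """Return (lowVecMatched, diffLowVec) for the target's lowercase runs."""
    index = {}
    for i, entry in enumerate(ref_low, start=1):   # 1-based so 0 can mean "no match"
        index.setdefault(entry, i)                 # keep the first occurrence
    low_vec_matched, diff_low_vec = [], []
    for entry in tar_low:
        pos = index.get(entry, 0)
        low_vec_matched.append(pos)
        if pos == 0:
            diff_low_vec.append(entry)             # stored literally, in order
    return low_vec_matched, diff_low_vec
```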
In the embodiment of the invention, after the first matching is completed, the code of each gene sequence has been converted into a matchResult set, and the matchEntry sub-elements of matchResult have been compressed to a certain extent, but a large number of identical, consecutive runs of sub-elements still exist across the matchResult sets of the gene sequences. If a reference sequence can be built from matchResult sets, it provides a good reference for further compressing the matchResult of each gene sequence, and the more matchResult sets it holds, the more useful the reference; a dynamically growing reference sequence matchResultVec is therefore constructed. When each of the first n sequences is compressed, it first undergoes the second matching against the existing matchResultVec, and is then itself added to the matchResultVec set as a reference.
In the embodiment of the invention, the Hash value of each matchEntry is calculated separately such that, when any character in misMatchedStr differs, the corresponding Hash value H_me differs as well. H_me is calculated as shown in formula (2):
A Hash table is then constructed from the H_me of each sub-element matchEntry in the matchResult sequence. Hash table collisions are avoided in the same way as for the first-match Hash table: two arrays are used to record and backtrack the chain ids. The Hash table for a matchResult sequence is constructed as follows: the matchResult sequence is decomposed into k-mer sequences of length k', the absolute value of the sum of H_me in each k-mer sequence is calculated, the length of the Hash table is set to the minimum prime number seqBucketLen greater than the length of the gene sequence, and the remainder is taken as the Hash value H_2, as shown in formula (3).
From the above formula, H_2 and the matching rate vary with the value of the step size k. First, the Hash value value_i of the i-th k-mer sequence is calculated and assigned such that Seq_loc(i) = Seq_bucket(value_i); then Seq_bucket(value_i) is updated to i. After the Hash table is constructed, any value_m can be used to query the position id = n corresponding to the same Hash value in the Hash table, where n is always the largest position in the chain; the previous position id = n' in the chain is obtained as n' = Seq_loc(n).
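The second-level Hash table described above can be sketched in Python as follows. Formulas (2) and (3) are not reproduced in this text, so entry_hash is an assumed stand-in for formula (2), and the window hash follows the wording of the description (absolute value of the sum of H_me, taken modulo seqBucketLen). The names next_prime, entry_hash and build_seq_table, the -1 empty marker, and the representation of misMatchedStr as a list of codes 1..4 are all assumptions.

```python
def next_prime(n):
    """Smallest prime strictly greater than n."""
    candidate = max(n + 1, 2)
    while any(candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)):
        candidate += 1
    return candidate


def entry_hash(entry):
    """Assumed stand-in for formula (2).

    For equal pos and length, changing any code in misMatchedStr changes the
    result, because the codes (1..4) act as digits of a base-31 number.
    """
    mismatched, pos, length = entry
    h = 0
    for c in mismatched:
        h = h * 31 + c
    return h * 131 + pos * 17 + length


def build_seq_table(match_result, k2, seq_len):
    """Build Seq_loc and Seq_bucket over windows of k2 consecutive matchEntry."""
    seq_bucket_len = next_prime(seq_len)           # seqBucketLen
    hashes = [entry_hash(e) for e in match_result]
    n_windows = max(len(hashes) - k2 + 1, 0)
    seq_bucket = [-1] * seq_bucket_len
    seq_loc = [-1] * n_windows
    for i in range(n_windows):
        value = abs(sum(hashes[i:i + k2])) % seq_bucket_len   # H_2
        seq_loc[i] = seq_bucket[value]             # Seq_loc(i) = Seq_bucket(value_i)
        seq_bucket[value] = i                      # Seq_bucket(value_i) = i
    return seq_loc, seq_bucket, seq_bucket_len
```

Because a sum ignores the order of the entries in a window, different windows can share the same H_2; candidates found through the table still have to be compared entry by entry, just as in the first-level match.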
Step S104, performing multithread compression on each sequence in the gene sequence to be compressed based on the matching result.
In the embodiment of the invention, the compression results obtained by compressing the gene data concurrently in each thread are stored in temporary files; the temporary files are then written into the compressed file.
In a specific implementation, after the extraction of the reference gene sequence information is completed, threads are started to match and compress the gene sequences to be compressed. A single chromosome can reach 250 MB, so extracting the information of all gene sequences to be compressed before matching and compressing them would consume a huge amount of memory. Therefore, as soon as the data information of one gene sequence to be compressed has been extracted, the matching and compression operation is carried out immediately to generate a compression result, which saves memory. However, the compression result of each sequence also occupies a considerable amount of memory, which likewise needs to be released promptly. If the compression result of each sequence were written directly into the compression result file, multiple threads would write to the same file concurrently, which would scramble the final compressed data and make decompression impossible. Therefore, in the embodiment of the invention, the result of each gene sequence to be compressed is stored in a temporary file, and each temporary file contains the compression result of one sequence. Finally, the temporary files are read in sequence-number order and written into the compression result file.
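The temporary-file scheme described above can be sketched in Python as follows. This is only an illustration of the concurrency pattern: zlib stands in for the matching-based compression of the preceding steps, and the function names, the 4-byte length prefix, and the worker count are assumptions that do not come from the text.

```python
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_one(index, payload, tmp_dir):
    """Compress one sequence and store the result in its own temporary file."""
    path = os.path.join(tmp_dir, f"{index}.tmp")
    with open(path, "wb") as f:
        f.write(zlib.compress(payload))    # zlib is a placeholder coder
    return path


def compress_all(payloads, out_path, workers=4):
    """Compress sequences in parallel, then merge results in sequence order."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(compress_one, i, p, tmp_dir)
                       for i, p in enumerate(payloads)]
            paths = [f.result() for f in futures]   # list order = sequence number
        with open(out_path, "wb") as out:           # single writer, no interleaving
            for path in paths:
                with open(path, "rb") as f:
                    block = f.read()
                out.write(len(block).to_bytes(4, "big"))   # assumed length prefix
                out.write(block)


def decompress_all(path):
    """Read the length-prefixed blocks back, in order."""
    blocks = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            size = int.from_bytes(header, "big")
            blocks.append(zlib.decompress(f.read(size)))
    return blocks
```

Each worker writes only to its own file, so no locking is needed during compression, and each result can be released from memory as soon as it is on disk; only the final merge touches the shared output file.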
The embodiment of the invention also provides a multithreading gene data compression device, which comprises: the device comprises a reference unit, an extraction unit, a matching unit and a compression unit, wherein:
the reference unit is used for extracting a reference gene sequence from the gene sequence to be compressed;
the extraction unit is used for acquiring data information comprising base information of the reference gene sequence from the reference gene sequence;
the matching unit is used for matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed;
the compression unit is used for carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result.
In the embodiment of the invention, the extracting unit is further used for extracting a reference gene sequence comprising base information and lower case character information from the gene sequence to be compressed; the gene sequence to be compressed is in the FASTA format.
In the embodiment of the invention, the extraction unit is further used for reading the data information in the reference gene sequence and determining the identification information, the line width information, the data content information and the position information in the reference gene sequence.
In the embodiment of the present invention, the extracting unit is further configured to read the reference gene sequence to generate a reference gene data sequence S, and record the first line of data as identification information identifier;
reading second row data of the reference gene data sequence, and counting the length of the character string to be used as row width information;
inquiring lowercase character information in the reference gene data sequence, acquiring the starting position lowVecBegin and the length information lowVecLength of each segment of lowercase characters, and converting the lowercase character sequence into an uppercase character sequence S_1;
extracting from the uppercase character sequence S_1 the A, C, G and T base information sequence S_2, represented as an array code, together with the position speChaPos and content speChaCh of data that is not A, C, G, T or N;
obtaining the starting position nVecBegin and the length information nVecLength of each segment of N characters in the uppercase character sequence S_1;
encoding lowVecBegin, speChaPos and nVecBegin using the RLE algorithm to obtain new lowVecBegin, speChaPos and nVecBegin.
In the embodiment of the present invention, the matching unit is further configured to read k-mer sequences of length k from the reference gene sequence code, encode each base as A=1, C=2, G=3 and T=4, and calculate the Hash value H of each k-mer sequence S_j using the following formula (1):
wherein j = 0, 1, 2, ..., k-1;
creating arrays Ref_loc and Ref_bucket, calculating the Hash value value_i of the i-th k-mer sequence, assigning Ref_loc(i) = Ref_bucket(value_i), then updating Ref_bucket(value_i) = i, and querying by Hash value the position id = n corresponding to the same Hash value;
matching the base information of the reference gene sequence S_ref with the base information of the gene sequence to be compressed S_tar using the Hash algorithm, wherein the matching sequence matchResult is a set of matchEntry elements, and each matchEntry comprises the character string misMatchedStr that failed to match between the last matching position and the current matching position, the length pos from the last matching position to the current matching position, and the length length of the current match.
In the embodiment of the invention, the matching unit is further used for calculating the Hash value value_i of a k-mer sequence in the gene sequence to be compressed, and querying the corresponding position id = Ref_bucket(value_i) in the reference gene sequence;
starting from position id of the reference gene sequence and position i of the gene sequence to be compressed, comparing whether the code values are the same, and recording the length length corresponding to each id until the values differ;
traversing, through Ref_loc, all ids matching the Hash value, comparing the length of each id, taking the id with the maximum length maxLength as the matching result position, recording pos as the difference between the current matching position and the last matching position, and recording length as the difference between maxLength and the minimum matching length k+1;
obtaining the positions in the reference gene sequence of the lowVec entries identical to those of the gene sequence to be compressed and generating lowVecMatched, recording in order in diffLowVec the lowVec entries of the gene sequence to be compressed that cannot be matched, and recording the corresponding lowVecMatched.
In the embodiment of the present invention, the matching unit is further configured to calculate the Hash value of each matchEntry such that, if any character in misMatchedStr differs, the corresponding Hash value H_me differs; H_me is calculated using the following formula (2):
dividing the matchResult sequence into a plurality of k-mer sequences of length k', calculating the absolute value of the sum of H_me in each k-mer sequence, setting the Hash table length to the minimum prime number seqBucketLen greater than the length of each sequence in the gene sequence to be compressed, and calculating the remainder H_2 using the following formula (3):
calculating the Hash value value_i of the i-th k-mer sequence, assigning Seq_loc(i) = Seq_bucket(value_i), and then updating Seq_bucket(value_i) = i.
In the embodiment of the invention, the compression unit is further used for storing the compression results obtained by compressing the gene data concurrently in each thread in temporary files, and for writing the temporary files into the compressed file.

Claims (8)

1. A method for multi-threaded genetic data compression comprising:
extracting a reference gene sequence from the gene sequence to be compressed;
acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; querying the reference gene sequence for lowercase character information and converting the lowercase character sequence into an uppercase character sequence S_1; extracting from the uppercase character sequence S_1 the A, C, G and T base information sequence S_2, represented as an array code;
matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; reading k-mer sequences of length k from the reference gene sequence code, encoding each base as A=1, C=2, G=3 and T=4, and calculating the Hash value H of each k-mer sequence S_j using the following formula (1):
(1)
wherein j=0, 1, 2, … …, k-1;
creating arrays Ref_loc and Ref_bucket, calculating the Hash value value_i of the i-th k-mer sequence, assigning Ref_loc(i) = Ref_bucket(value_i), then updating Ref_bucket(value_i) = i, and querying by Hash value the position id = n corresponding to the same Hash value;
matching the base information of the reference gene sequence S_ref with the base information of the gene sequence to be compressed S_tar using the Hash algorithm, wherein the matching sequence matchResult is a set of matchEntry elements, and each matchEntry comprises the character string misMatchedStr that failed to match between the last matching position and the current matching position, the length pos from the last matching position to the current matching position, and the length length of the current match;
calculating the Hash value of each matchEntry such that, if any character in misMatchedStr differs, the corresponding Hash value H_me differs; wherein H_me is calculated using the following formula (2):
(2)
dividing the matchResult sequence into a plurality of k-mer sequences of length k', calculating the absolute value of the sum of H_me in each k-mer sequence, setting the Hash table length to the minimum prime number seqBucketLen greater than the length of the gene sequence, and calculating, using the following formula (3), the remainder H_2 of the absolute value of the sum of H_me divided by seqBucketLen:
(3)
calculating the Hash value value_i of the i-th k-mer sequence and assigning Seq_loc(i) = Seq_bucket(value_i), then updating Seq_bucket(value_i) = i;
and carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result.
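The Ref_loc / Ref_bucket update described in claim 1 behaves like a chained hash index over k-mer positions. The following Python sketch illustrates that idea only. Formula (1) is not reproduced in this text, so the hash used here (sum of code(S_j) · 4^j) is an assumption, as are the function names and the use of a dictionary in place of a fixed-size bucket array.

```python
# Base codes as stated in the claim.
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def kmer_hash(seq, start, k):
    """ASSUMED form of formula (1): sum of code(S_j) * 4**j for j = 0..k-1."""
    return sum(CODE[seq[start + j]] * 4 ** j for j in range(k))


def build_index(ref, k):
    """Build (ref_loc, ref_bucket) so equal-hash k-mer positions form a chain."""
    n = max(len(ref) - k + 1, 0)
    ref_loc = [-1] * n
    ref_bucket = {}  # a dict stands in for the bucket array (assumption)
    for i in range(n):
        value = kmer_hash(ref, i, k)
        ref_loc[i] = ref_bucket.get(value, -1)  # Ref_loc(i) = Ref_bucket(value_i)
        ref_bucket[value] = i                   # Ref_bucket(value_i) = i
    return ref_loc, ref_bucket


def positions(ref_loc, ref_bucket, value):
    """Walk the chain and return every position with this hash, newest first."""
    out = []
    i = ref_bucket.get(value, -1)
    while i != -1:
        out.append(i)
        i = ref_loc[i]
    return out
```

Because each new position stores the previous head of its bucket before replacing it, a lookup by Hash value can enumerate all earlier positions with the same value without any per-bucket list.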
2. The method of claim 1, wherein extracting the reference gene sequence from the gene sequences to be compressed comprises:
extracting a reference gene sequence comprising base information and lower case character information from the gene sequence to be compressed; the gene sequence to be compressed is in the FASTA format.
3. The method of claim 2, wherein the obtaining data information including base information of the reference gene sequence from the reference gene sequence comprises:
reading the data information in the reference gene sequence, and determining the identification information, the line width information, the data content information and the position information in the reference gene sequence.
4. The method of claim 3, wherein the obtaining data information including base information of the reference gene sequence from the reference gene sequence comprises:
reading the reference gene sequence to generate a reference gene data sequence S, and recording the first line of data as identification information identifier;
reading second row data of the reference gene data sequence, and counting the length of the character string to be used as row width information;
querying lower-case character information in the reference gene data sequence, acquiring the starting position lowVecBegin and the length information lowVecLength of each piece of lower-case character information, and converting the lower-case character sequence into an upper-case character sequence S_1;
extracting the A, C, G and T base information sequence S_2 from the upper-case character sequence S_1 and representing it as a plurality of codes, and obtaining the position speChaPos and content speChaCh of the data information other than A, C, G, T and N;
obtaining the starting position nVecBegin and the length information nVecLength of N in the upper-case character sequence S_1;
encoding lowVecBegin, speChaPos and nVecBegin using the RLE algorithm to obtain new lowVecBegin, speChaPos and nVecBegin.
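The extraction steps of claim 4 can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: the function names `runs` and `preprocess` are hypothetical, all positions are taken as indices into S_1, and the final RLE encoding step is left out because the text does not specify its details.

```python
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}  # base codes as stated in claim 1


def runs(text, pred):
    """Return (begin, length) pairs for maximal runs of characters where pred holds."""
    out, start = [], None
    for i, ch in enumerate(text):
        if pred(ch):
            if start is None:
                start = i
        elif start is not None:
            out.append((start, i - start))
            start = None
    if start is not None:
        out.append((start, len(text) - start))
    return out


def preprocess(fasta_text):
    """Split a single-record FASTA string into the parts named in claim 4."""
    lines = fasta_text.strip().splitlines()
    identifier = lines[0]            # first line: identification information
    line_width = len(lines[1])       # second line length: line width information
    s = "".join(lines[1:])           # data sequence S
    low = runs(s, str.islower)       # lower-case runs
    s1 = s.upper()                   # upper-case sequence S_1
    n_runs = runs(s1, lambda c: c == "N")
    special = [(i, c) for i, c in enumerate(s1) if c not in "ACGTN"]
    s2 = [CODE[c] for c in s1 if c in CODE]  # coded A/C/G/T sequence S_2
    return {
        "identifier": identifier,
        "lineWidth": line_width,
        "lowVecBegin": [b for b, _ in low],
        "lowVecLength": [n for _, n in low],
        "nVecBegin": [b for b, _ in n_runs],
        "nVecLength": [n for _, n in n_runs],
        "speChaPos": [i for i, _ in special],
        "speChaCh": [c for _, c in special],
        "S2": s2,
    }
```

Keeping the case runs, the N runs and the special characters in separate side lists leaves S_2 as a pure four-symbol sequence, which is what the k-mer hashing in claim 1 operates on.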
5. The method of claim 4, wherein matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed comprises:
calculating the Hash value value_i of each k-mer sequence in the gene sequence to be compressed, and querying id = Ref_bucket(value_i) in the reference gene sequence;
checking whether the reference gene sequence at position id and the gene sequence to be compressed at position i are consistent, continuing the comparison until they differ, and recording the length corresponding to each id;
traversing, by means of Ref_loc, all ids that match the Hash value, comparing the lengths of the ids, taking the id with the maximum length maxLength as the matching result position, recording pos as the difference between the current matching position and the last matching position, and recording length as the difference between maxLength and the minimum matching length k+1;
and acquiring the position in the reference gene sequence of the data information that is the same as the lowVec in the gene sequence to be compressed, generating lowVecMatched, recording in sequence in diffLowVec the lowVec in the gene sequence to be compressed that cannot be matched, and recording the corresponding lowVecMatched.
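The longest-match search of claim 5 can be sketched as a greedy loop over the sequence to be compressed. The Python below is a hedged illustration: the hash stands in for the unreproduced formula (1), `match` and `rebuild` are hypothetical names, both inputs are assumed to contain only A, C, G and T, pos is interpreted as the offset of the match start from the end of the previous match in the reference, and the raw maxLength is stored rather than the offset form given in the claim.

```python
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def kmer_hash(seq, start, k):
    """ASSUMED stand-in for formula (1)."""
    return sum(CODE[seq[start + j]] * 4 ** j for j in range(k))


def build_index(ref, k):
    """Chained index: ref_loc[i] links to the previous position with equal hash."""
    ref_loc, ref_bucket = [], {}
    for i in range(len(ref) - k + 1):
        value = kmer_hash(ref, i, k)
        ref_loc.append(ref_bucket.get(value, -1))
        ref_bucket[value] = i
    return ref_loc, ref_bucket


def match(ref, tar, k):
    """Return (entries, tail); each entry is (misMatchedStr, pos, length)."""
    ref_loc, ref_bucket = build_index(ref, k)
    entries, pending, prev_end, i = [], [], 0, 0
    while i + k <= len(tar):
        cand = ref_bucket.get(kmer_hash(tar, i, k), -1)
        best_id, max_length = -1, 0
        while cand != -1:  # traverse every id with the same hash via ref_loc
            length = 0
            while (cand + length < len(ref) and i + length < len(tar)
                   and ref[cand + length] == tar[i + length]):
                length += 1
            if length > max_length:
                best_id, max_length = cand, length
            cand = ref_loc[cand]
        if max_length >= k:
            entries.append(("".join(pending), best_id - prev_end, max_length))
            pending = []
            prev_end = best_id + max_length
            i += max_length
        else:
            pending.append(tar[i])  # no usable match: keep the literal base
            i += 1
    return entries, "".join(pending) + tar[i:]


def rebuild(ref, entries, tail):
    """Inverse of match(): reconstruct the target from the reference."""
    out, prev_end = [], 0
    for mis, pos, length in entries:
        start = prev_end + pos
        out.append(mis + ref[start:start + length])
        prev_end = start + length
    return "".join(out) + tail
```

For example, `match("ACGTACGTTTGACCA", "ACGTACGATTGACCA", 4)` yields two entries and an empty tail, and `rebuild` recovers the target exactly, which is the property a lossless referential scheme needs.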
6. The method for multi-threaded gene data compression according to claim 4, wherein the multi-threaded compression of each of the sequences in the gene sequence to be compressed based on the matching result comprises:
storing the compression results obtained by compressing the gene data concurrently in each thread into a temporary file;
writing the temporary file into the compressed file.
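The temporary-file scheme of claim 6 can be sketched with a thread pool. In the Python below, `zlib` is only a placeholder for the matching-based compressor of the claims, and the function names, the file naming and the 4-byte length prefix are assumptions made for illustration.

```python
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_one(job):
    """Compress one sequence and store the result in its own temporary file."""
    index, data, tmp_dir = job
    path = os.path.join(tmp_dir, f"part_{index}.tmp")
    with open(path, "wb") as fh:
        fh.write(zlib.compress(data))  # placeholder compressor (assumption)
    return path


def compress_all(sequences, out_path, threads=4):
    """Compress sequences concurrently, then merge the temporary files in order."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        jobs = [(i, s, tmp_dir) for i, s in enumerate(sequences)]
        with ThreadPoolExecutor(max_workers=threads) as pool:
            paths = list(pool.map(compress_one, jobs))  # map keeps input order
        with open(out_path, "wb") as out:
            for path in paths:
                with open(path, "rb") as fh:
                    blob = fh.read()
                out.write(len(blob).to_bytes(4, "big"))  # length prefix per part
                out.write(blob)


def decompress_all(path):
    """Read the merged file back into the list of original sequences."""
    result = []
    with open(path, "rb") as fh:
        while True:
            header = fh.read(4)
            if not header:
                break
            size = int.from_bytes(header, "big")
            result.append(zlib.decompress(fh.read(size)))
    return result
```

Writing each thread's output to a separate temporary file avoids locking on the final file; the merge step then fixes the order of the parts regardless of which thread finished first.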
7. A multi-threaded genetic data compression device comprising: the device comprises a reference unit, an extraction unit, a matching unit and a compression unit, wherein:
the reference unit is used for extracting a reference gene sequence from the gene sequence to be compressed;
the extraction unit is used for acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; querying the reference gene sequence for lower-case character information and converting the lower-case character sequence into an upper-case character sequence S_1; extracting the A, C, G and T base information sequence S_2 from the upper-case character sequence S_1 and representing it as a plurality of codes;
the matching unit is used for matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; reading k-mer sequences of length k from the encoded reference gene sequence, the individual bases being encoded as A=1, C=2, G=3 and T=4, the Hash value H of each k-mer sequence S_j being calculated using the following formula (1):
(1)
wherein j = 0, 1, 2, …, k-1;
creating arrays Ref_loc and Ref_bucket, calculating the Hash value value_i of the i-th k-mer sequence, setting Ref_loc(i) = Ref_bucket(value_i) and then updating Ref_bucket(value_i) = i, so that the positions id = n corresponding to the same Hash value can be queried through the Hash value;
matching the base information of the reference gene sequence S_ref with the base information of the gene sequence to be compressed S_tar using the Hash algorithm, wherein the matching sequence matchResult is a set of matchEntry items, and each matchEntry comprises a character string misMatchedStr that failed to match between the last matching position and the current matching position, a length pos from the last matching position to the current matching position, and a length length of the current match;
calculating the Hash value of each matchEntry such that, if any character in misMatchedStr differs, the corresponding Hash value H_me differs, wherein H_me is calculated using the following formula (2):
(2)
dividing the matchResult sequence into a plurality of k-mer sequences of length k', calculating the absolute value of the sum of the H_me values in each k-mer sequence, setting seqBucketLen to the smallest prime number greater than the length of the gene sequence, and using the following formula (3) to calculate H_2, the remainder of the absolute value of the sum of H_me divided by seqBucketLen:
(3)
calculating the Hash value value_i of the i-th k-mer sequence and assigning Seq_loc(i) = Seq_bucket(value_i), then updating Seq_bucket(value_i) = i;
the compression unit is used for carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result.
8. The multi-thread gene data compression device according to claim 7, wherein the compression unit is configured to store compression results obtained by compressing gene data in each thread at the same time in a temporary file; the temporary file is written into the compressed file.
CN201911200154.4A 2019-11-29 2019-11-29 Multithreading gene data compression method and device Active CN111145834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911200154.4A CN111145834B (en) 2019-11-29 2019-11-29 Multithreading gene data compression method and device

Publications (2)

Publication Number Publication Date
CN111145834A CN111145834A (en) 2020-05-12
CN111145834B true CN111145834B (en) 2023-10-27

Family

ID=70517347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911200154.4A Active CN111145834B (en) 2019-11-29 2019-11-29 Multithreading gene data compression method and device

Country Status (1)

Country Link
CN (1) CN111145834B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106100641A (en) * 2016-06-12 2016-11-09 深圳大学 Multithreading quick storage lossless compression method and system thereof for FASTQ data
WO2018000174A1 (en) * 2016-06-28 2018-01-04 深圳大学 Rapid and parallelstorage-oriented dna sequence matching method and system thereof
CN108287985A (en) * 2018-01-24 2018-07-17 深圳大学 A kind of the DNA sequence dna compression method and system of GPU acceleration
CN108388808A (en) * 2018-03-05 2018-08-10 郑州轻工业学院 Image encryption method based on Xi Er encryption and dynamic DNA encoding
CN110299187A (en) * 2019-07-04 2019-10-01 南京邮电大学 A kind of parallelization gene data compression method based on Hadoop
CN110310709A (en) * 2019-07-04 2019-10-08 南京邮电大学 A kind of gene compression method based on reference sequences

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
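Yao Haichang et al. HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data. BioMed Research International. 2019, Vol. 2019, pp. 1-14. *
邓清津. Parallel fast compression method for high-throughput DNA sequencing data. China Master's Theses Full-text Database, Information Science and Technology. 2019, (No. 7), full text. *
郭旭. Research on compression algorithms for highly similar genome sequence datasets. China Master's Theses Full-text Database, Information Science and Technology. 2019, (No. 2), full text. *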

Also Published As

Publication number Publication date
CN111145834A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
US20210050074A1 (en) Systems and methods for sequence encoding, storage, and compression
US20110295858A1 (en) Method and apparatus for searching nucleic acid sequence
JP2019537172A (en) Method and system for indexing bioinformatics data
Claude et al. Compressed q-gram indexing for highly repetitive biological sequences
CN109979540B (en) DNA information storage coding method
CN109979537B (en) Multi-sequence-oriented gene sequence data compression method
WO2015180203A1 (en) High-throughput dna sequencing quality score lossless compression system and compression method
CN103546160A (en) Multi-reference-sequence based gene sequence stage compression method
US8972200B2 (en) Compression of genomic data
Janin et al. Adaptive reference-free compression of sequence quality scores
US20200185058A1 (en) Gene sequencing data compression preprocessing, compression and decompression method, system, and computer-readable medium
CN107066837A (en) One kind has with reference to DNA sequence dna compression method and system
WO2019080670A1 (en) Gene sequencing data compression method and decompression method, system, and computer readable medium
Afify et al. Dna lossless differential compression algorithm based on similarity of genomic sequence database
CN111145834B (en) Multithreading gene data compression method and device
CN107633158B (en) Method and apparatus for compressing and decompressing gene sequences
CN110310709B (en) Reference sequence-based gene compression method
CN110120247A (en) A kind of distributed genetic big data storage platform
Selva et al. SRComp: short read sequence compression using burstsort and Elias omega coding
Kuboi et al. Faster str-ic-lcs computation via rle
CN109698703B (en) Gene sequencing data decompression method, system and computer readable medium
CN110111852A (en) A kind of magnanimity DNA sequencing data lossless Fast Compression platform
Yaghoobi A new approach in DNA sequence compression: Fast DNA sequence compression using parallel chaos game representation
Ferragina et al. Computational biology
CN114730616A (en) Information encoding and decoding method, apparatus, storage medium, and information storage and reading method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant