CN111145834B

CN111145834B - Multithreading gene data compression method and device

Info

Publication number: CN111145834B
Application number: CN201911200154.4A
Authority: CN
Inventors: 刘华
Original assignee: Zhongke Shuguang Nanjing Computing Technology Co ltd
Current assignee: Zhongke Shuguang Nanjing Computing Technology Co ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2023-10-27
Anticipated expiration: 2039-11-29
Also published as: CN111145834A

Abstract

The invention discloses a multithread gene data compression method and device, comprising the following steps: extracting a reference gene sequence from the gene sequence to be compressed; acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; and carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result. By adopting the scheme, the compression rate can be greatly improved.

Description

Multithreading gene data compression method and device

Technical Field

The invention relates to the field of gene data, in particular to a multithreading gene data compression method and device.

Background

The research of genes and DNA is widely applied to a plurality of important fields such as biology, medicine, genetic science and the like.

The volume of gene data is huge. Traditional text compression tools such as gzip and bzip2 compress gene sequence data very inefficiently: they can only reduce the original data to 1/4 to 1/3 of its size and cannot meet the challenges posed by large amounts of gene data.

Disclosure of Invention

Object of the invention: the invention aims to provide a multithreaded gene data compression method and device.

The technical scheme is as follows: the embodiment of the invention provides a multithreading gene data compression method, which comprises the following steps: extracting a reference gene sequence from the gene sequence to be compressed; acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; and carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result.

Specifically, extracting a reference gene sequence comprising base information and lower case character information from a gene sequence to be compressed; the gene sequence to be compressed is in the FASTA format.

Specifically, the data information in the reference gene sequence is read, and the identification information, the line width information, the data content information and the position information in the reference gene sequence are determined.

Specifically, reading the reference gene sequence to generate a reference gene data sequence S, and recording the first line of data as the identification information identifier;

reading the second line of data of the reference gene data sequence, and counting the length of the character string as the line width information;

querying the lowercase character information in the reference gene data sequence, acquiring the starting position lowVecBegin and the length lowVecLength of each run of lowercase characters, and converting the lowercase character sequence into an uppercase character sequence S₁;

extracting from the uppercase character sequence S₁ the A, C, G and T base information sequence S₂, expressed as numeric codes, together with the positions speChaPos and contents speChaCh of the characters other than A, C, G, T and N;

obtaining the starting position nVecBegin and the length nVecLength of each run of N in the uppercase character sequence S₁;

encoding lowVecBegin, speChaPos and nVecBegin with the RLE algorithm to obtain new lowVecBegin, speChaPos and nVecBegin.
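By way of illustration only, the extraction steps above can be sketched in Python as follows. The function names, the dictionary keys and the use of 0-based offsets into the concatenated sequence are assumptions made for this sketch and are not taken from the embodiment; the final RLE encoding of the position arrays is omitted.

```python
from typing import Dict, List, Tuple

# Base encoding stated in the description: A=1, C=2, G=3, T=4.
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def runs(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return (begin, length) for every maximal run of True values."""
    found, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            found.append((start, i - start))
            start = None
    if start is not None:
        found.append((start, len(flags) - start))
    return found


def extract_info(fasta_text: str) -> Dict[str, object]:
    """Split one FASTA record into the pieces listed in the description."""
    lines = fasta_text.strip().splitlines()
    identifier = lines[0]                  # first line: identification info
    line_width = len(lines[1])             # length of the second line
    s = "".join(lines[1:])                 # sequence S without line breaks
    low = runs([c.islower() for c in s])   # runs of lowercase characters
    s1 = s.upper()                         # uppercase sequence S1
    n_runs = runs([c == "N" for c in s1])  # runs of N
    # Positions (assumed 0-based offsets in S1) of characters outside ACGTN.
    spe_cha_pos = [i for i, c in enumerate(s1) if c not in "ACGTN"]
    return {
        "identifier": identifier,
        "line_width": line_width,
        "low_vec_begin": [b for b, _ in low],
        "low_vec_length": [n for _, n in low],
        "n_vec_begin": [b for b, _ in n_runs],
        "n_vec_length": [n for _, n in n_runs],
        "spe_cha_pos": spe_cha_pos,
        "spe_cha_ch": [s1[i] for i in spe_cha_pos],
        "s2": [BASE_CODE[c] for c in s1 if c in BASE_CODE],  # sequence S2
    }
```

For example, calling extract_info on a record whose sequence lines are ACgtNN and RACG yields one lowercase run beginning at offset 2, one run of N beginning at offset 4, and the special character R at offset 6.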

Specifically, k-mer sequences of length k are read from the reference gene sequence code, each base is encoded as A=1, C=2, G=3 and T=4, and the Hash value H of each k-mer sequence S_j is calculated using the following formula (1):

wherein j = 0, 1, 2, …, k-1;

creating arrays Ref_loc and Ref_bucket; calculating the Hash value value_i of the i-th k-mer sequence and assigning Ref_loc(i) = Ref_bucket(value_i), then updating Ref_bucket(value_i) = i; querying, by Hash value, the position id = n corresponding to the same Hash value;

matching the base information of the reference gene sequence S_ref with the base information of the gene sequence to be compressed S_tar using a Hash algorithm, wherein the matching sequence matchResult is a set of matchEntry, and each matchEntry comprises the character string misMatchedStr that failed to match between the last matching position and the current matching position, the length pos from the last matching position to the current matching position, and the length length of the current match.

Specifically, the Hash value value_i of each k-mer sequence in the gene sequence to be compressed is calculated, and the position id = Ref_bucket(value_i) in the reference gene sequence is queried;

starting from position id of the reference gene sequence and position i of the gene sequence to be compressed, checking whether the two sequences are consistent, and recording the length corresponding to each id up to the point where the sequences differ;

traversing, by means of Ref_loc, all ids matching the Hash value, comparing the lengths of the ids, taking the id with the maximum length maxLength as the matching result position, recording pos as the difference between the current matching position and the last matching position, and recording length as the difference between maxLength and the minimum matching length k+1;

acquiring the positions in the reference gene sequence of the lowVec entries identical to those of the gene sequence to be compressed and generating lowVecMatched, recording in order in diffLowVec the lowVec entries of the gene sequence to be compressed that cannot be matched, and recording the corresponding lowVecMatched entry.

Specifically, the Hash value of each matchEntry is calculated; if one character in misMatchedStr differs, the corresponding Hash value H_me differs; wherein H_me is calculated using the following formula (2):

dividing the matchResult sequence into a plurality of k-mer sequences of length k', calculating the absolute value of the sum of H_me within each k-mer sequence, setting seqBucketLen to the smallest prime number greater than the length of each sequence in the gene sequence to be compressed, and calculating the remainder H₂ using the following formula (3):

calculating the Hash value value_i of the i-th k-mer sequence and assigning Seq_loc(i) = Seq_bucket(value_i), then updating Seq_bucket(value_i) = i.

Specifically, storing compression results obtained by compressing the genetic data in each thread at the same time into a temporary file;

the temporary file is written into the compressed file.

The embodiment of the invention also provides a multithreading gene data compression device, which comprises: the device comprises a reference unit, an extraction unit, a matching unit and a compression unit, wherein: the reference unit is used for extracting a reference gene sequence from the gene sequence to be compressed; the extraction unit is used for acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; the matching unit is used for matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; the compression unit is used for carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result.

Specifically, the compression unit is configured to store a compression result obtained by compressing the genetic data in each thread at the same time into a temporary file; the temporary file is written into the compressed file.

Beneficial effects: compared with the prior art, the invention has the following significant advantage: because human genomes are about 99.9% similar, matching the gene sequence to be compressed against the reference gene sequence and storing the matching result greatly improves the compression ratio.

Drawings

Fig. 1 is a flow chart of a multi-threaded gene data compression method according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further described below with reference to the accompanying drawings.

Fig. 1 is a flow chart of a multithreaded gene data compression method according to an embodiment of the invention. The specific steps are described in detail below with reference to the drawing.

Step S101, extracting a reference gene sequence from the gene sequence to be compressed.

In the embodiment of the invention, a reference gene sequence comprising base information and lowercase character information is extracted from a gene sequence to be compressed; the gene sequence to be compressed is in the FASTA format.

In a specific implementation, a reference-based compression algorithm requires two inputs: the reference gene sequence and the gene sequence to be compressed. To achieve lossless compression, the identifier information, line width information, basic base information (A, C, G, T), lowercase information, N character information and other character information (R, Y, etc.) of the data must be stored and compressed. N characters and other characters denote bases that were not determined during gene sequencing, and identifiers and line widths carry little information; none of this information has matching value. Less information is therefore extracted from the reference sequence than from the sequence to be compressed: only the base information and the lowercase information.

Step S102, data information comprising base information of the reference gene sequence is obtained from the reference gene sequence.

In the embodiment of the invention, the data information in the reference gene sequence is read, and the identification information, the line width information, the data content information and the position information in the reference gene sequence are determined.

In a specific implementation, the data information extracted from the reference gene sequence is matched against the data information of the gene sequence to be compressed, and the position information of the differing parts and the matching parts between the two is stored, which greatly improves the compression ratio.

In the embodiment of the invention, the obtaining the data information including the base information of the reference gene sequence from the reference gene sequence comprises the following steps:

reading the reference gene sequence to generate a reference gene data sequence S, and recording the first line of data as identification information identifier;

Step S103, matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed.

In the embodiment of the invention, a Hash table is constructed first: k-mer sequences of a specific length k are read from the reference gene sequence code, and each base is encoded as A=1, C=2, G=3 and T=4, so that each k-mer sequence is a numeric sequence S_j of length k, where j = 0, 1, 2, …, k-1. The Hash value of S_j is calculated as shown in formula (1), and a Hash algorithm is used to match the base information codes of the reference gene sequence S_ref and the gene sequence to be compressed S_tar.

As can be seen from the above formula, when the step length k changes, H changes as well, which changes the size of the distribution space; a change in step length also changes the matching rate. Because the length of the Hash table is limited and gene information is highly similar, Hash table collisions must be handled; here they are handled by simulating chaining (the zipper method) with two arrays, Ref_loc and Ref_bucket. First, the Hash value value_i of the i-th k-mer sequence is calculated and the assignment Ref_loc(i) = Ref_bucket(value_i) is made; then Ref_bucket(value_i) = i is updated. After the Hash table has been constructed, any value_m can be used to query the position id = n corresponding to the same Hash value in the Hash table, where n is always the largest position in the Hash chain; the previous position id = n' in the chain is obtained as n' = Ref_loc(n).
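By way of illustration only, the two-array chaining scheme can be sketched in Python as follows. Formula (1) is not reproduced in this text, so kmer_hash below is a hypothetical stand-in (a base-5 positional hash reduced modulo the table length), and the sentinel -1 for an empty bucket is likewise an assumption of the sketch.

```python
from typing import List, Tuple

BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}  # encoding from the description


def encode(seq: str) -> List[int]:
    """Encode an uppercase ACGT string with A=1, C=2, G=3, T=4."""
    return [BASE_CODE[c] for c in seq]


def kmer_hash(codes: List[int], start: int, k: int, bucket_len: int) -> int:
    """Hypothetical stand-in for formula (1): base-5 positional hash."""
    h = 0
    for j in range(k):
        h = h * 5 + codes[start + j]
    return h % bucket_len


def build_index(codes: List[int], k: int,
                bucket_len: int) -> Tuple[List[int], List[int]]:
    """Build Ref_loc and Ref_bucket; -1 (an assumption) marks 'no entry'."""
    ref_bucket = [-1] * bucket_len
    ref_loc = [-1] * max(len(codes) - k + 1, 0)
    for i in range(len(codes) - k + 1):
        value = kmer_hash(codes, i, k, bucket_len)
        ref_loc[i] = ref_bucket[value]   # link to the previous position
        ref_bucket[value] = i            # the bucket keeps the latest position
    return ref_loc, ref_bucket


def chain(value: int, ref_loc: List[int], ref_bucket: List[int]) -> List[int]:
    """List every position stored under a Hash value, largest first."""
    ids = []
    n = ref_bucket[value]
    while n != -1:
        ids.append(n)
        n = ref_loc[n]                   # n' = Ref_loc(n)
    return ids
```

With the reference ACGACGA and k = 3, the k-mer ACG occurs at positions 0 and 3, so walking the chain for its Hash value visits position 3 first and then position 0, which is the "largest position first" behaviour described above.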

The purpose of matching the base information codes of the reference gene sequence S_ref and the gene sequence to be compressed S_tar is to generate the sequence matchResult required for the second matching, which is a set of matchEntry. A matchEntry consists of three parts: misMatchedStr, pos and length. misMatchedStr is the character string, represented numerically, that failed to match between the last matching position and the current matching position; pos is the length from the last matching position to the current matching position; length is the length of the current match.

In the embodiment of the invention, after the Hash table of the reference sequence has been constructed, the Hash value value_i of each k-mer sequence in the gene sequence to be compressed is calculated by the same method, and the corresponding reference sequence position id = Ref_bucket(value_i) is queried in the Hash table. Starting from position id of the reference sequence and position i of the sequence to be compressed, the code values are compared one by one, and the length corresponding to that id is recorded when the code values first differ. Ref_loc is used to traverse all ids on the chain that match the Hash value, and the lengths of the ids are compared to obtain the maximum length maxLength. The id corresponding to maxLength is the final matching result position, and maxLength is the length of the matched segment. To further reduce the size of the matchEntry, pos is recorded as the difference (possibly negative) between the current and last matching positions, and length is recorded as the difference between maxLength and the minimum matching length k+1.
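By way of illustration only, the first-level matching can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements of the embodiment: the hash is a hypothetical stand-in for formula (1), pos is taken relative to the end of the previous match in the reference, length is stored as the raw maxLength rather than as an offset from the minimum matching length, and a decode function is included only to show that the entries are sufficient to rebuild the target.

```python
from typing import List, Tuple

Entry = Tuple[List[int], int, int]  # (misMatchedStr, pos, length)


def kmer_hash(codes: List[int], start: int, k: int, bucket_len: int) -> int:
    """Hypothetical stand-in for formula (1)."""
    h = 0
    for j in range(k):
        h = h * 5 + codes[start + j]
    return h % bucket_len


def build_index(ref: List[int], k: int, bucket_len: int):
    """Ref_loc / Ref_bucket arrays; -1 (an assumption) marks 'no entry'."""
    ref_bucket = [-1] * bucket_len
    ref_loc = [-1] * max(len(ref) - k + 1, 0)
    for i in range(len(ref) - k + 1):
        value = kmer_hash(ref, i, k, bucket_len)
        ref_loc[i], ref_bucket[value] = ref_bucket[value], i
    return ref_loc, ref_bucket


def match(ref: List[int], tar: List[int], k: int,
          bucket_len: int = 1 << 16) -> Tuple[List[Entry], List[int]]:
    """Greedy longest-match of tar against ref; returns entries and leftovers."""
    ref_loc, ref_bucket = build_index(ref, k, bucket_len)
    entries: List[Entry] = []
    mismatched: List[int] = []
    prev_end = 0                      # reference position after the last match
    i = 0
    while i + k <= len(tar):
        best_id, max_length = -1, 0
        n = ref_bucket[kmer_hash(tar, i, k, bucket_len)]
        while n != -1:                # walk every id on the chain
            length = 0
            while (n + length < len(ref) and i + length < len(tar)
                   and ref[n + length] == tar[i + length]):
                length += 1
            if length >= k and length > max_length:
                best_id, max_length = n, length
            n = ref_loc[n]
        if best_id == -1:             # no usable match: keep the literal code
            mismatched.append(tar[i])
            i += 1
            continue
        entries.append((mismatched, best_id - prev_end, max_length))
        mismatched = []
        prev_end = best_id + max_length
        i += max_length
    mismatched.extend(tar[i:])        # codes after the last match
    return entries, mismatched


def decode(ref: List[int], entries: List[Entry],
           trailing: List[int]) -> List[int]:
    """Rebuild the target codes from the entries (round-trip check)."""
    out: List[int] = []
    prev_end = 0
    for mis, pos, length in entries:
        out.extend(mis)
        start = prev_end + pos
        out.extend(ref[start:start + length])
        prev_end = start + length
    out.extend(trailing)
    return out
```

Storing pos as a difference keeps the numbers small when successive matches lie close together in the reference, which is the stated reason for recording differences rather than absolute positions.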

In the embodiment of the invention, lowVec is matched as follows: matching finds the positions in the reference sequence of the lowVec entries identical to those of the gene sequence to be compressed and generates the array lowVecMatched; lowVec entries that cannot be matched are recorded in order in diffLowVec, and the corresponding lowVecMatched entry is set to 0.

In the embodiment of the invention, after the first matching is completed, the code of each gene sequence has been converted into a matchResult set. The sub-elements matchEntry of matchResult are already compressed to some extent, but the matchResult of each gene sequence still contains a large number of identical, consecutive runs of sub-elements. If a reference sequence can be constructed for matchResult, it serves as a good reference for further compressing the matchResult of each gene sequence, and the more matchResult sets it contains, the greater its reference value; a dynamically growing reference sequence matchResultVec is therefore constructed. When each of the first n sequences is compressed, it first undergoes the second matching against the existing matchResultVec and is then itself added to the matchResultVec set as a reference.

In the embodiment of the invention, the Hash value of each matchEntry is calculated separately; when one character in misMatchedStr differs, the corresponding Hash value H_me also differs. H_me is calculated as shown in formula (2):

A Hash table is then constructed from the H_me of each sub-element matchEntry in the matchResult sequence. The scheme adopted to avoid Hash table collisions is the same as that used to construct the first-matching Hash table: two arrays are used to record and backtrack the chain ids. The Hash table is constructed from the matchResult sequence as follows: the matchResult sequence is decomposed into k-mer sequences of length k', the absolute value of the sum of H_me within each k-mer sequence is calculated, the length of the Hash table is set to seqBucketLen, the smallest prime number greater than the length of the gene sequence, and the remainder is taken as the Hash value H₂, as shown in formula (3).

As can be seen from the above formula, H₂ and the matching rate vary with the step length k'. First, the Hash value value_i of the i-th k-mer sequence is calculated and the assignment Seq_loc(i) = Seq_bucket(value_i) is made; then Seq_bucket(value_i) = i is updated. After the Hash table has been constructed, any value_m can be used to query the position id = n corresponding to the same Hash value in the Hash table, where n is always the largest position in the chain; the previous position id = n' in the chain is obtained as n' = Seq_loc(n).
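By way of illustration only, the second-level Hash table can be sketched in Python as follows. Formula (2) is not reproduced in this text, so entry_hash is a hypothetical stand-in whose only intended property is that changing one code in misMatchedStr changes the result; the window value follows the prose description of formula (3), namely the absolute value of the sum of H_me taken modulo seqBucketLen. The sentinel -1 and the function names are assumptions of the sketch.

```python
import math
from typing import List, Tuple

Entry = Tuple[List[int], int, int]  # (misMatchedStr, pos, length)


def next_prime(n: int) -> int:
    """Smallest prime strictly greater than n (used for seqBucketLen)."""
    candidate = max(n + 1, 2)
    while True:
        limit = math.isqrt(candidate)
        if all(candidate % d for d in range(2, limit + 1)):
            return candidate
        candidate += 1


def entry_hash(entry: Entry) -> int:
    """Hypothetical stand-in for formula (2)."""
    mis, pos, length = entry
    h = 0
    for code in mis:
        h = h * 5 + code
    return (h * 31 + pos) * 31 + length


def window_hash(entries: List[Entry], start: int, k2: int,
                seq_bucket_len: int) -> int:
    """H2 as described: |sum of H_me over the window| mod seqBucketLen."""
    total = sum(entry_hash(e) for e in entries[start:start + k2])
    return abs(total) % seq_bucket_len


def build_seq_index(entries: List[Entry], k2: int, seq_bucket_len: int):
    """Build Seq_loc and Seq_bucket over windows of k2 matchEntry items."""
    seq_bucket = [-1] * seq_bucket_len
    seq_loc = [-1] * max(len(entries) - k2 + 1, 0)
    for i in range(len(entries) - k2 + 1):
        value = window_hash(entries, i, k2, seq_bucket_len)
        seq_loc[i] = seq_bucket[value]
        seq_bucket[value] = i
    return seq_loc, seq_bucket
```

Because the window value is built from a sum, windows that contain the same entries in a different order share a Hash value; such candidates would have to be verified by direct comparison when the chain is walked.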

Step S104, performing multithread compression on each sequence in the gene sequence to be compressed based on the matching result.

In the embodiment of the invention, the compression result obtained by compressing the gene data in each thread at the same time is stored in a temporary file; the temporary file is written into the compressed file.

In a specific implementation, after the reference gene sequence information has been extracted, threads are started to match and compress the gene sequences to be compressed. A single chromosome can be as large as 250 MB, so extracting the information of all the gene sequences to be compressed before matching and compressing them would consume a huge amount of memory. Therefore, as soon as the data information of one gene sequence to be compressed has been extracted, the matching and compression operation is carried out immediately to produce a compression result and save memory. The compression result of each sequence also occupies a considerable amount of memory, which must likewise be released promptly. If the compression result of each sequence were written directly into the compression result file, several threads would write to the same file concurrently, corrupting the final compressed data and making decompression impossible. In the embodiment of the invention, therefore, the result of each gene sequence to be compressed is stored in a temporary file, each temporary file holding the compression result of one sequence. Finally, the temporary files are read in sequence-number order and written into the compression result.
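By way of illustration only, the temporary-file scheme can be sketched in Python as follows. zlib is used purely as a stand-in for the matching-based compressor of the embodiment; the file naming, the 4-byte length prefix in the output file and the function names are assumptions of the sketch.

```python
import os
import struct
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compress_one(index: int, sequence: str, tmp_dir: str) -> str:
    """Compress one sequence and write the result to its own temporary file."""
    payload = zlib.compress(sequence.encode("ascii"))  # stand-in compressor
    path = os.path.join(tmp_dir, f"{index}.tmp")
    with open(path, "wb") as handle:
        handle.write(payload)
    return path


def compress_all(sequences: List[str], out_path: str, workers: int = 4) -> None:
    """Compress sequences in parallel, then merge in sequence-number order."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(compress_one, i, seq, tmp_dir)
                       for i, seq in enumerate(sequences)]
            # Results are collected by index, not by completion order.
            paths = [future.result() for future in futures]
        with open(out_path, "wb") as out:
            for path in paths:
                with open(path, "rb") as handle:
                    data = handle.read()
                out.write(struct.pack("<I", len(data)))  # assumed framing
                out.write(data)


def decompress_all(path: str) -> List[str]:
    """Read the merged file back into the original list of sequences."""
    sequences = []
    with open(path, "rb") as handle:
        while True:
            header = handle.read(4)
            if not header:
                break
            (size,) = struct.unpack("<I", header)
            sequences.append(zlib.decompress(handle.read(size)).decode("ascii"))
    return sequences
```

Only the final merge writes to the output file, and it does so from a single thread, so no two threads ever write to the same file; each worker's result lives on disk rather than in memory until the merge.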

The embodiment of the invention also provides a multithreading gene data compression device, which comprises: the device comprises a reference unit, an extraction unit, a matching unit and a compression unit, wherein:

the reference unit is used for extracting a reference gene sequence from the gene sequence to be compressed;

the extraction unit is used for acquiring data information comprising base information of the reference gene sequence from the reference gene sequence;

the matching unit is used for matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed;

the compression unit is used for carrying out multithread compression on each sequence in the gene sequence to be compressed based on the matching result.

In the embodiment of the invention, the extracting unit is further used for extracting a reference gene sequence comprising base information and lower case character information from the gene sequence to be compressed; the gene sequence to be compressed is in the FASTA format.

In the embodiment of the invention, the extraction unit is further used for reading the data information in the reference gene sequence and determining the identification information, the line width information, the data content information and the position information in the reference gene sequence.

In the embodiment of the present invention, the extracting unit is further configured to read the reference gene sequence to generate a reference gene data sequence S, and record the first line of data as identification information identifier;

In the embodiment of the present invention, the matching unit is further configured to read k-mer sequences of length k from the reference gene sequence code, encode each base as A=1, C=2, G=3 and T=4, and calculate the Hash value H of each k-mer sequence S_j using the following formula (1):

wherein j = 0, 1, 2, …, k-1;

matching the base information of the reference gene sequence S_ref with the base information of the gene sequence to be compressed S_tar using a Hash algorithm, wherein the matching sequence matchResult is a set of matchEntry, and each matchEntry comprises the character string misMatchedStr that failed to match between the last matching position and the current matching position, the length pos from the last matching position to the current matching position, and the length length of the current match.

In the embodiment of the invention, the matching unit is further used for calculating the Hash value value_i of each k-mer sequence in the gene sequence to be compressed and querying the position id = Ref_bucket(value_i) in the reference gene sequence;

traversing, by means of Ref_loc, all ids matching the Hash value, comparing the lengths of the ids, taking the id with the maximum length maxLength as the matching result position, recording pos as the difference between the current matching position and the last matching position, and recording length as the difference between maxLength and the minimum matching length k+1;

In the embodiment of the present invention, the matching unit is further configured to calculate the Hash value of each matchEntry; if one character in misMatchedStr differs, the corresponding Hash value H_me differs; wherein H_me is calculated using the following formula (2):

In the embodiment of the invention, the compression unit is further used for storing the compression result obtained by compressing the gene data in each thread at the same time into the temporary file; the temporary file is written into the compressed file.

Claims

1. A method for multi-threaded genetic data compression comprising:

extracting a reference gene sequence from the gene sequence to be compressed;

acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; querying the lowercase character information in the reference gene sequence and converting the lowercase character sequence into an uppercase character sequence S₁; extracting from the uppercase character sequence S₁ the A, C, G and T base information sequence S₂, expressed as numeric codes;

matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; reading k-mer sequences of length k from the reference gene sequence code, encoding each base as A=1, C=2, G=3 and T=4, and calculating the Hash value H of each k-mer sequence S_j using the following formula (1):

(1)

wherein j = 0, 1, 2, …, k-1;

matching the base information of the reference gene sequence S_ref with the base information of the gene sequence to be compressed S_tar using a Hash algorithm, wherein the matching sequence matchResult is a set of matchEntry, and each matchEntry comprises the character string misMatchedStr that failed to match between the last matching position and the current matching position, the length pos from the last matching position to the current matching position, and the length length of the current match;

calculating the Hash value of each matchEntry, wherein if one character in misMatchedStr differs, the corresponding Hash value H_me differs; wherein H_me is calculated using the following formula (2):

(2)

dividing the matchResult sequence into a plurality of k-mer sequences of length k', calculating the absolute value of the sum of H_me within each k-mer sequence, setting seqBucketLen to the smallest prime number greater than the length of the gene sequence, and calculating, using the following formula (3), the remainder H₂ of the absolute value of the sum of H_me divided by seqBucketLen:

(3)

calculating the Hash value value_i of the i-th k-mer sequence and assigning Seq_loc(i) = Seq_bucket(value_i), then updating Seq_bucket(value_i) = i;

and carrying out multithreaded compression on each sequence in the gene sequence to be compressed based on the matching result.

2. The method of claim 1, wherein extracting the reference gene sequence from the gene sequences to be compressed comprises:

extracting a reference gene sequence comprising base information and lower case character information from the gene sequence to be compressed; the gene sequence to be compressed is in the FASTA format.

3. The method of claim 2, wherein the obtaining data information including base information of the reference gene sequence from the reference gene sequence comprises:

and reading the data information in the reference gene sequence, and determining the identification information, the line width information, the data content information and the position information in the reference gene sequence.

4. The method of claim 3, wherein the obtaining data information including base information of the reference gene sequence from the reference gene sequence comprises:

5. The method of claim 4, wherein matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed comprises:

calculating the Hash value value_i of each k-mer sequence in the gene sequence to be compressed, and querying the position id = Ref_bucket(value_i) in the reference gene sequence;

traversing, by means of Ref_loc, all ids matching the Hash value, comparing the lengths of the ids, taking the id with the maximum length maxLength as the matching result position, recording pos as the difference between the current matching position and the last matching position, and recording length as the difference between maxLength and the minimum matching length k+1;

6. The method for multi-threaded gene data compression according to claim 4, wherein the multi-threaded compression of each of the sequences in the gene sequence to be compressed based on the matching result comprises:

storing compression results obtained by compressing the gene data in each thread at the same time into a temporary file;

the temporary file is written into the compressed file.

7. A multi-threaded genetic data compression device comprising: the device comprises a reference unit, an extraction unit, a matching unit and a compression unit, wherein:

the extraction unit is used for acquiring data information comprising base information of the reference gene sequence from the reference gene sequence; querying the lowercase character information in the reference gene sequence and converting the lowercase character sequence into an uppercase character sequence S₁; extracting from the uppercase character sequence S₁ the A, C, G and T base information sequence S₂, expressed as numeric codes;

the matching unit is used for matching the base information of the reference gene sequence with the base information of the gene sequence to be compressed; reading k-mer sequences of length k from the reference gene sequence code, encoding each base as A=1, C=2, G=3 and T=4, and calculating the Hash value H of each k-mer sequence S_j using the following formula (1):

(1)

wherein j = 0, 1, 2, …, k-1;

(2)

(3)

8. The multi-thread gene data compression device according to claim 7, wherein the compression unit is configured to store compression results obtained by compressing gene data in each thread at the same time in a temporary file; the temporary file is written into the compressed file.