CN114489518B

CN114489518B - Sequencing data quality control method and system

Info

Publication number: CN114489518B
Application number: CN202210308643.7A
Authority: CN
Inventors: 刘卫国; 闫立峰; 殷泽坤; 赵展
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2022-03-28
Filing date: 2022-03-28
Publication date: 2022-09-09
Anticipated expiration: 2042-03-28
Also published as: CN114489518A

Abstract

The invention provides a sequencing data quality control method and system in the technical field of gene sequencing data processing. It addresses the problems that sequencing data quality processing does not support multithreading and that processing speed is low. The method comprises: reading FASTQ data into memory, cutting the FASTQ data into data blocks, and placing the processed data blocks into a memory pool; continuously checking the memory pool for required data blocks and taking them out, formatting the retrieved data into Read class objects, and then processing each Read class object according to a quality control flow; and merging the statistical information produced by the quality control flow and outputting it in visual form. The invention allows the quality control task to be completed efficiently on an ordinary PC.

Description

Sequencing data quality control method and system

Technical Field

The invention belongs to the technical field of gene sequencing data processing, and particularly relates to a sequencing data quality control method and system.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

With the advent of sequencing technology in the 1970s, gene sequencing has overturned traditional biological techniques, and people have begun to lift the veil on genes. In particular, the spread of second-generation sequencing yields more sequencing data at lower cost, and related research fields have made breakthroughs, making detailed and comprehensive analysis of the transcriptome and genome of a species possible.

The quality of sequencing data directly affects downstream data analysis. Quality control of sequencing data assesses data quality and applies a degree of filtering and correction, providing clean and reliable data for downstream tasks. Meanwhile, as sequencing technology develops, throughput has risen greatly and data volume keeps growing, so the efficiency of the quality control process is very important; it can be greatly improved by parallelizing the process across multiple threads.

In addition, sequencing data may suffer from low quality scores, adapter contamination, high duplication rates, overly short sequences, high N-base content and similar problems, so both production and research need a modern quality control tool that is full-featured, simple to use and efficient.

Commonly used quality control software currently includes FastQC, fastp, RabbitQC, Trimmomatic, SOAPnuke and the like. Of these, RabbitQC, FastQC and fastp perform better; RabbitQC is about twice as fast as other software under a single thread and much faster under multiple threads. Comparing the speedup of RabbitQC, FastQC and fastp from 1 to 20 threads, all except RabbitQC stop speeding up after 2-4 threads, while RabbitQC keeps a good speedup throughout, which is why its multithreaded performance is far better than that of the other software.

However, further tests show that single-threaded RabbitQC has a throughput of 0.3 M reads/s, which at 300 bytes per read is only about 90 MB/s (0.3 M reads/s × 300 B), far from the I/O performance peak. 15-20 threads are needed to reach the peak, and not every researcher using the software has a 20-core server; in other words, the single thread is not fast enough. In addition, the over-representation module in RabbitQC has high complexity: single-threaded analysis of a 7 GB fastq file takes nearly 40 minutes. RabbitQC also reads and writes compressed fastq data very inefficiently: only tens of MB of data can be processed per second and multithreading is not supported, so when compressed files are read and written, compression and decompression become the performance bottleneck of the whole program.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a sequencing data quality control method, which can realize high-efficiency processing of sequencing data.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

In a first aspect, a sequencing data quality control method is disclosed, comprising:

reading FASTQ data to a memory, cutting the FASTQ data into data blocks, and placing the processed data blocks into a memory pool;

continuously detecting required data blocks from the memory pool, taking out the data blocks, formatting the taken out data into Read objects, and then processing each Read object according to a quality control flow;

and merging the statistical information processed according to the quality control flow, and performing visual output.

As a further technical solution, FASTQ data is read into memory and cut into data blocks of the required size, and the end of each data block is then processed to ensure that no FASTQ record is split across two data blocks.

As a further technical solution, when the memory pool is continuously checked for required data blocks and none is available, the method waits until an available data block is detected in the memory pool.

As a further technical solution, processing each Read class object according to the quality control flow specifically comprises: performing information statistics, adapter removal, repeatability analysis, over-representation analysis and sequence trimming on each Read class object.

As a further technical solution, the method further comprises: outputting the processed FASTQ data to a file.

As a further technical scheme, when each Read class object is processed according to the quality control flow, a vectorization technology is adopted to optimize the single-thread performance.

As a further technical solution, information statistics, adapter removal and repeatability analysis are restructured so that the code logic is more suitable for vectorization;

the information statistics part makes the originally scattered memory accesses contiguous by modifying the storage layout and the loop order, and then constructs vectors using gather and scatter instructions to reduce memory access time;

the repeatability analysis reduces branch decisions in the vectorization process by altering the termination condition of the loop.

As a further technical solution, the Read class object no longer stores 4 string member variables, but instead stores pointers to the fastq data in memory.

As a further technical solution, parallel compression and decompression are used for acceleration in the process of processing each Read class object according to the quality control flow.

In a second aspect, a sequencing data quality control system is disclosed, comprising:

a data processing module configured to: reading FASTQ data to a memory, cutting the FASTQ data into data blocks, and placing the processed data blocks into a memory pool;

a quality control processing module configured to: continuously detecting required data blocks from the memory pool, taking out the data blocks, formatting the taken out data into Read objects, and then processing each Read object according to a quality control flow;

a statistics merging module configured to: and merging the statistical information processed according to the quality control flow, and performing visual output.

The above one or more technical solutions have the following beneficial effects:

Compared with RabbitQC, the basic quality control flow achieves a single-thread speedup of 2.8 times on single-end data and 3.5 times on paired-end data, while retaining the near-linear speedup of RabbitQC. The optimized RabbitQCPlus needs only 4-8 threads to reach the peak write performance of a mainstream SSD, so the quality control task can be completed efficiently on an ordinary PC.

For the over-representation module, comparing performance before and after optimization on 7.5 GB of single-end data, the improvement is 7-9 times at each thread count, and a near-linear speedup is maintained. Only single-end results are given, since in testing the improvement for paired-end data in this part was almost the same.

For reading and writing compressed files in xxx.fastq.gz format, the optimized RabbitQCPlus supports multithreaded compression and decompression of gz files. RabbitQC is bottlenecked on compression and decompression, so multithreading has no effect, and the multithreaded speedup of the other two programs is poor. RabbitQCPlus achieves a speedup approaching one order of magnitude at 4 threads; 8 threads bring little further improvement because processing is again ultimately limited by multithreaded decompression. Only single-end results are given, since in testing the improvement for paired-end data in this part was almost the same.

Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; the exemplary embodiments of the invention and their description serve to explain the invention and do not limit it.

FIG. 1 is a flow chart of an embodiment of the present invention.

Detailed Description

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example one

Referring to the attached FIG. 1, this example discloses a sequencing data quality control method, which comprises the following steps:

Step one: a single producer reads FASTQ data from the hard disk into memory and performs simple data formatting, namely cutting the data into blocks of 4 MB and processing the end of each block so that no FASTQ record is split across two data blocks. The processed data block is then placed in a memory pool for the consumers to use.

Step two: multiple consumers continuously check whether an available data block exists in the memory pool; if not, they wait; if so, they take out the data and perform a more complex formatting step, namely formatting the FASTQ data in memory into dedicated Read class objects. Quality control steps such as information statistics, adapter removal, repeatability analysis, over-representation analysis and sequence trimming are then performed on each Read class object.

Step three: finally, the consumers merge their statistical information and output it as a visual quality control report, while the processed FASTQ data is output to a file.
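The block cutting performed by the producer can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `last_record_end` and `cut_blocks` are hypothetical, and the sketch assumes well-formed FASTQ in which every record occupies exactly four lines. Because each block starts at a record boundary, counting newlines is enough to find the last complete record, and an `@` inside a quality line cannot be mistaken for a record start.

```python
import io

def last_record_end(buf: bytes) -> int:
    """Return the offset just past the last complete 4-line record in buf.

    Assumes buf starts at a record boundary (hypothetical helper).
    """
    n_lines = buf.count(b"\n")
    keep = n_lines - n_lines % 4          # newlines belonging to whole records
    if keep == 0:
        return 0
    pos = len(buf)
    for _ in range(n_lines - keep + 1):   # step back over the partial record
        pos = buf.rfind(b"\n", 0, pos)
    return pos + 1

def cut_blocks(stream, block_size=4 * 1024 * 1024):
    """Yield blocks of roughly block_size bytes, each holding whole records."""
    carry = b""
    while True:
        data = stream.read(block_size)
        if not data:
            break
        buf = carry + data
        end = last_record_end(buf)
        if end:
            yield buf[:end]
        carry = buf[end:]                 # partial record moves to the next block
    if carry:
        yield carry                       # last record without a trailing newline

# Tiny demonstration with a deliberately small block size.
sample = b"".join(b"@r%d\nACGTACGT\n+\nIIII@III\n" % i for i in range(10))
blocks = list(cut_blocks(io.BytesIO(sample), block_size=50))
```

With the 4 MB default, the partial record carried over between reads is small compared with the block, so the extra copy in `carry + data` is minor in this sketch.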

Note that steps one and two run concurrently: the producer and the consumers are started at the beginning of the program and continuously put blocks into and take blocks out of the memory pool, so the process as a whole is a streaming one. On the basis of this framework, the invention optimizes the functions of both the producer and the consumers to improve operating efficiency.
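The producer and consumer arrangement described above can be sketched in Python as follows. This is an illustration under assumptions, not the actual RabbitQCPlus code: a bounded `queue.Queue` stands in for the memory pool, `run_pipeline` is a hypothetical name, and the per-read "quality control" is reduced to counting reads and bases.

```python
import queue
import threading

def run_pipeline(blocks, n_consumers=4, pool_size=8):
    """Run one producer and several consumers over an iterable of FASTQ blocks."""
    pool = queue.Queue(maxsize=pool_size)   # bounded stand-in for the memory pool
    partial = []                            # per-consumer statistics
    lock = threading.Lock()

    def producer():
        for block in blocks:
            pool.put(block)                 # waits while the pool is full
        for _ in range(n_consumers):
            pool.put(None)                  # one stop marker per consumer

    def consumer():
        stats = {"reads": 0, "bases": 0}
        while True:
            block = pool.get()              # waits while the pool is empty
            if block is None:
                break
            lines = block.split(b"\n")
            for i in range(1, len(lines), 4):   # sequence line of each record
                stats["reads"] += 1
                stats["bases"] += len(lines[i])
        with lock:
            partial.append(stats)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    merged = {"reads": 0, "bases": 0}       # merge step (step three)
    for stats in partial:
        for key in merged:
            merged[key] += stats[key]
    return merged

demo_blocks = [b"@a\nACGT\n+\nIIII\n@b\nAC\n+\nII\n", b"@c\nACGTA\n+\nIIIII\n"]
result = run_pipeline(demo_blocks, n_consumers=3)
```

Each consumer keeps private statistics and merges them once at the end, so the pool itself is the only shared state touched on the hot path.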

(1) Single thread performance is optimized using scalable vectorization techniques:

Vectorization means using dedicated vector registers so that a single instruction operates on multiple data elements at once. As vectorization has matured through successive iterations, some Intel processors now provide the AVX-512 instruction set, which can operate on 16 float values simultaneously. Vectorization is an effective means of improving single-thread performance in compute-intensive tasks. The information statistics module, the overlap analysis module and the adapter trimming module in the quality control flow are all compute-intensive with relatively regular memory access, so they are restructured to make the code logic more suitable for vectorization. For example, the information statistics part makes the originally scattered memory accesses contiguous by modifying the storage layout and the loop order, after which vectors can be constructed with gather and scatter instructions to reduce memory access time; the overlap analysis module reduces branch decisions during vectorization by changing the termination condition of the loop, which facilitates vectorization and improves computational efficiency. As a result, the overlap analysis module is accelerated by 6 times, the adapter removal module by 4 times, and the information statistics module by 2 times, and this improvement makes it possible for the quality control process to reach the system's peak write performance on an ordinary PC.
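The idea of changing a loop's termination condition to remove branches can be shown with a small sketch. It is written in Python only to show the shape of the loop: the Python interpreter does not vectorize it, and the function names and the mismatch-limit rule are assumptions made for illustration, not the overlap analysis code of RabbitQCPlus. The first version leaves the loop as soon as the mismatch limit is exceeded; the second always runs the full length and compares once at the end, which is the form a compiler or hand-written SIMD code can process several elements at a time.

```python
def within_limit_early_exit(a: bytes, b: bytes, limit: int) -> bool:
    """Branchy form: the loop exit depends on the data seen so far."""
    diff = 0
    for x, y in zip(a, b):
        if x != y:
            diff += 1
            if diff > limit:
                return False      # data-dependent exit blocks vectorization
    return True

def within_limit_fixed_trip(a: bytes, b: bytes, limit: int) -> bool:
    """Fixed trip count: every position is compared, one test at the end."""
    diff = 0
    for x, y in zip(a, b):
        diff += (x != y)          # comparison result added as 0 or 1
    return diff <= limit

left = b"ACGTACGTAC"
right = b"ACGAACGTTC"             # differs from left at positions 3 and 8
```

Both functions return the same answer for any input; the second does more comparisons in the worst case but has no branch inside the loop body.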

Meanwhile, since not every machine supports the AVX-512 instruction set, RabbitQCPlus also implements optimized versions for other instruction sets, a compiler auto-vectorized version, a non-vectorized version and so on, to make the optimization more general; the version can be specified manually at compile time, ensuring that the highest performance of the CPU can be exploited on different platforms. Overall, the hot-spot functions optimized with vectorization are accelerated by 2-6 times, and overall single-thread performance by 2-3 times.

(2) Single-thread performance is optimized using a new fastq data storage scheme:

The fastq data is text-format sequencing data comprising many entries. One entry is also called a read, and each read has 4 lines: the sequence identifier, the base sequence, a separator, and the base quality scores.

Traditional quality control software stores a single read by constructing a class object containing 4 string variables that represent the 4 lines of information, and the data in memory must be copied again when the class object is constructed (because constructing a string in C++ copies the original data). The innovation of RabbitQCPlus relative to traditional quality control software is that the class object does not store 4 strings but directly stores the memory addresses of the corresponding original data, that is, 4 pointers. This removes one memory copy, saves memory and improves operating efficiency.
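A rough Python analogue of the pointer-based Read object is sketched below. Python has no raw pointers, so the sketch uses `memoryview` slices, which reference the original buffer without copying it; the names `ReadView` and `parse_block` are hypothetical, and the sketch assumes well-formed four-line records.

```python
class ReadView:
    """One read whose four fields are views into the shared block, not copies."""
    __slots__ = ("name", "seq", "sep", "qual")

    def __init__(self, name, seq, sep, qual):
        self.name = name
        self.seq = seq
        self.sep = sep
        self.qual = qual

def parse_block(block: bytes):
    """Split a block of whole FASTQ records into ReadView objects."""
    view = memoryview(block)          # slicing a memoryview does not copy data
    reads = []
    fields = []
    start = 0
    while start < len(block):
        end = block.find(b"\n", start)
        if end == -1:
            end = len(block)          # last line without a trailing newline
        fields.append(view[start:end])
        start = end + 1
        if len(fields) == 4:
            reads.append(ReadView(*fields))
            fields = []
    return reads                      # an incomplete trailing record is dropped

block = b"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n"
reads = parse_block(block)
```

The block must stay alive for as long as its reads are in use, which is the same lifetime constraint a pointer-based C++ object would have.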

Next comes the parsing of the fastq data. RabbitQC splits the data at an arbitrary position and then searches backward for a complete fastq record. This causes no problem for ordinary paired-end data, but for paired-end data whose corresponding read lengths differ greatly, errors accumulate, and a failure can occur once the buffer size (1 Mb) is exceeded. RabbitQCPlus completes this function: it searches forward instead, modifies the termination condition for paired-end data, and can also handle the case where read1 and read2 contain different numbers of reads.

(3) Compression and decompression are accelerated in parallel:

The decompression process applies to the input data of the entire quality control flow. Since sequencing data may reach hundreds of GB, it is typically stored in a compressed format, i.e., xxx.fastq.gz. When quality control software reads compressed data, the data must be decompressed first, and traditional quality control software generally uses the simple single-threaded interface gzread, which is very inefficient and seriously slows down the subsequent quality control process.

Likewise, the compression process applies to the output of the whole quality control flow, i.e., the cleaned fastq data. Most traditional quality control software uses the single-threaded gzwrite interface, which is inefficient: the cleaned data cannot be compressed in time, and the program is slow to finish. Some software currently uses multithreaded compression packages such as bgzip, but such a package only implements block compression of data; compressed files in this format are not widely applicable and cannot be recognized by much decompression software. In addition, samtools (https://github.com/samtools/samtools), which processes BAM files, only supports compressed files in BAM format, and because the BAM format uses block compression, multi-core processing is very simple. The pugz library (https://github.com/Piezoid/pugz) integrated into RabbitQCPlus in the invention performs multithreaded decompression of ordinary gz compressed files and therefore has a wider range of application.

Comparing the performance of each step when RabbitQC reads and writes compressed data: for single-threaded quality control, decompression can barely keep up with the processing speed, while compression runs at only half the processing speed, so processed data cannot be compressed and output in time and the queue blocks. The problem is more serious for multithreaded quality control: the decompression speed is only 1/10 of the processing speed and cannot supply data to the processing module in time, and the compression module is slower still, so the performance of the heavily optimized quality control processing module cannot be realized at all.

Compression and decompression have long been a research hotspot. The inherent format dependence of compressed data makes parallel decompression difficult, but researchers in this area have made great progress, producing the parallel decompression software pugz and the parallel compression software pigz. RabbitQCPlus builds on this work and integrates the two libraries directly into the software to implement streaming processing. Compared with first decompressing xx.fastq.gz to xx.fastq on the hard disk and then reading xx.fastq, the parallel decompression module integrated in RabbitQCPlus saves one full write and read of the file on the hard disk, which improves processing efficiency and also avoids the hard disk space occupied by the intermediate temporary file; the same holds for the compression flow.
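The two ideas in this part, compressing in parallel and consuming compressed input as a stream without an intermediate file, can be sketched with Python's standard library. This is a stand-in, not the pigz or pugz algorithm: the sketch compresses independent chunks in worker threads and concatenates the resulting gzip members, and it decompresses on a single thread. The names `parallel_gzip` and `stream_lines` are hypothetical.

```python
import gzip
import io
from concurrent.futures import ThreadPoolExecutor

def parallel_gzip(data: bytes, chunk_size=1 << 20, workers=4) -> bytes:
    """Compress chunks independently and concatenate the gzip members."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = list(pool.map(gzip.compress, chunks))   # order is preserved
    return b"".join(members)

def stream_lines(gz_bytes: bytes):
    """Yield decompressed lines one by one, with no temporary file on disk."""
    with gzip.open(io.BytesIO(gz_bytes), "rb") as handle:
        for line in handle:
            yield line

fastq = b"@r\nACGT\n+\nIIII\n" * 1000
compressed = parallel_gzip(fastq, chunk_size=4096)
restored = b"".join(stream_lines(compressed))
```

Compressing chunks independently costs some compression ratio because each chunk starts with an empty dictionary; that trade-off is specific to this sketch.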

Finally, compared with other quality control software, RabbitQCPlus achieves a speedup of nearly one order of magnitude when performing quality control on compressed files.

(4) The over-representation process is accelerated using an efficient data structure:

FastQC analyzes only the first 1 M reads in its over-representation module in order to save memory. fastp pointed out the problems with this approach and instead analyzes the entire file to determine the over-represented sequences; to reduce run time, fastp analyzes one in every 20 reads. Although this samples uniformly across all reads, 1/20 of the reads is sometimes insufficient to find all over-represented sequences. If the default parameter of fastp is changed to analyze all reads, efficiency drops sharply: a single thread takes nearly 40 minutes on 7 GB of fastq data.

Quality control software such as RabbitQC and fastp handles the over-representation module as follows. First, the first 10000 reads are analyzed to find frequently occurring substrings, which are inserted into a map with the substring as the key and its number of occurrences as the value, initially 0. Then the substrings of all reads are enumerated, and if a substring exists in the map, its value is incremented. The specific optimization in RabbitQCPlus for this part is to replace the map: first, because lookups in the STL map have high complexity; second, because this scenario involves a very large number of queries of which 99.9% find nothing, a case for which map is poorly suited. RabbitQCPlus implements a chained hash table by hand, using arrays to emulate linked lists, and uses the ntHash idea to reduce the hash computation complexity for adjacent substrings by one dimension. Moreover, since most queries find nothing, the situation is well suited to adding a Bloom filter layer in front: only the queries that pass the filter go on to the chained hash table, which further improves efficiency. As a result, this part of the optimized RabbitQCPlus is accelerated by nearly one order of magnitude under both single and multiple threads, while its analysis results remain fully consistent with those of fastp.
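The combination described above (an O(1) hash update for adjacent substrings, a Bloom filter to reject most queries, and an exact table for the rest) can be sketched as follows. Several simplifications are assumptions of the sketch, not the patented code: a plain polynomial rolling hash stands in for ntHash, a Python `dict` stands in for the array-based chained hash table, all candidates have the same length `k`, and every name is hypothetical.

```python
def rolling_hashes(seq: bytes, k: int, base=131, mod=(1 << 61) - 1):
    """Yield (start, hash) for every k-length substring, O(1) per step."""
    if len(seq) < k:
        return
    top = pow(base, k - 1, mod)            # weight of the outgoing character
    h = 0
    for c in seq[:k]:
        h = (h * base + c) % mod
    yield 0, h
    for i in range(1, len(seq) - k + 1):
        h = ((h - seq[i - 1] * top) * base + seq[i + k - 1]) % mod
        yield i, h

class Bloom:
    """Small Bloom filter keyed by an integer hash (two probe positions)."""
    def __init__(self, bits=1 << 16):
        self.bits = bits
        self.arr = bytearray(bits // 8)

    def _positions(self, h):
        return (h % self.bits, (h // self.bits) % self.bits)

    def add(self, h):
        for p in self._positions(h):
            self.arr[p >> 3] |= 1 << (p & 7)

    def __contains__(self, h):
        return all(self.arr[p >> 3] & (1 << (p & 7)) for p in self._positions(h))

def count_candidates(reads, candidates, k):
    """Count occurrences of each length-k candidate across all reads."""
    counts = {}
    bloom = Bloom()
    for cand in candidates:                # every candidate must have length k
        _, h = next(rolling_hashes(cand, k))
        bloom.add(h)
        counts[cand] = 0
    for seq in reads:
        for i, h in rolling_hashes(seq, k):
            if h not in bloom:
                continue                   # common case: rejected by the filter
            sub = seq[i:i + k]
            if sub in counts:              # exact check removes false positives
                counts[sub] += 1
    return counts

counts = count_candidates([b"ACGTACGT", b"TTTTACGT"], [b"ACGT", b"GGGG"], k=4)
```

The substring is only sliced out after the filter accepts its hash, so the large majority of queries never build a key or touch the table.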

Example two

The present embodiment aims to provide a sequencing data quality control system, which includes:

Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention, and those skilled in the art should understand that various modifications and variations can be made on the basis of the technical solution of the present invention without inventive effort.

Claims

1. A sequencing data quality control method, characterized by comprising the following steps:

continuously checking the memory pool for required data blocks, taking out the data blocks, formatting the retrieved data into Read class objects, and then processing each Read class object according to a quality control flow; optimizing single-thread performance by adopting a vectorization technology when each Read class object is processed according to the quality control flow; restructuring information statistics, adapter removal and repeatability analysis to make the code logic more suitable for vectorization;

the information statistics part makes the originally scattered memory accesses contiguous by modifying the storage layout and the loop order, and then uses gather and scatter instructions to construct vectors to reduce memory access time;

the repeatability analysis reduces branch decisions in the vectorization process by altering the termination condition of the loop;

2. The sequencing data quality control method of claim 1, wherein FASTQ data is read into memory and cut into data blocks of the required size, and the end of each data block is then processed to ensure that no FASTQ record is split across two data blocks.

3. The sequencing data quality control method of claim 1, wherein, when the memory pool is continuously checked for required data blocks and none is available, the method waits until an available data block is detected in the memory pool.

4. The sequencing data quality control method of claim 1, wherein processing each Read class object according to the quality control flow specifically comprises: performing information statistics, adapter removal, repeatability analysis, over-representation analysis and sequence trimming on each Read class object.

5. The sequencing data quality control method of claim 1, further comprising: outputting the processed FASTQ data to a file.

6. The sequencing data quality control method of claim 1, wherein the Read class object no longer stores 4 string member variables, but instead stores pointers to the fastq data in memory.

7. The sequencing data quality control method of claim 1, wherein parallel compression and decompression are used for acceleration during processing of each Read class object according to the quality control flow.

8. A sequencing data quality control system, characterized by comprising:

a quality control processing module configured to: continuously check the memory pool for required data blocks, take out the data blocks, format the retrieved data into Read class objects, and then process each Read class object according to a quality control flow; optimize single-thread performance by adopting a vectorization technology when each Read class object is processed according to the quality control flow; restructure information statistics, adapter removal and repeatability analysis to make the code logic more suitable for vectorization;