CN114489518B - Sequencing data quality control method and system - Google Patents

Publication number
CN114489518B
Authority
CN
China
Prior art keywords
data
quality control
read
data blocks
fastq
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210308643.7A
Other languages
Chinese (zh)
Other versions
CN114489518A (en)
Inventor
刘卫国
闫立峰
殷泽坤
赵展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-03-28
Filing date
2022-03-28
Publication date
2022-09-09
Application filed by Shandong University
Priority to CN202210308643.7A
Publication of CN114489518A
Application granted
Publication of CN114489518B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5018 Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a sequencing data quality control method and system, belonging to the technical field of gene sequencing data processing. It addresses the problems that sequencing data quality control processing does not support multithreading and is slow. The method comprises: reading FASTQ data into memory, cutting the FASTQ data into data blocks, and placing the processed data blocks into a memory pool; continuously checking the memory pool for required data blocks and taking them out, formatting the retrieved data into Read class objects, and then processing each Read class object according to a quality control flow; and merging the statistical information produced by the quality control flow and outputting it in visual form. The invention enables quality control tasks to be completed efficiently on an ordinary PC.

Description

Sequencing data quality control method and system
Technical Field
The invention belongs to the technical field of gene sequencing data processing, and particularly relates to a sequencing data quality control method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Since sequencing technology emerged in the 1970s, gene sequencing has overturned traditional biological techniques, and researchers have begun to unveil the mysteries of genes. In particular, the spread of second-generation sequencing makes it possible to obtain more sequencing data at lower cost, and related research fields have made breakthroughs, so that detailed and comprehensive analysis of the transcriptome and genome of a species has become possible.
The quality of sequencing data directly affects downstream analysis. Quality control of sequencing data assesses data quality and applies a degree of filtering and correction, providing clean and reliable data for downstream tasks. Meanwhile, as sequencing technology develops, throughput has risen sharply and data volumes keep growing, so the efficiency of the quality control process is very important; parallelizing the process across multiple threads can greatly improve that efficiency.
In addition, sequencing data may suffer from problems such as low quality scores, adapter contamination, high duplication rates, overly short sequences, and high N-base content. Both production and research therefore need a modern quality control tool that is full-featured, simple, easy to use, and efficient.
Commonly used quality control software currently includes FastQC, fastp, RabbitQC, Trimmomatic, and SOAPnuke. Of these, RabbitQC, FastQC, and fastp perform best. In a single thread, RabbitQC is about twice as fast as the other tools, and with multiple threads it is much faster still. When the three tools are run with 1 to 20 threads, all except RabbitQC stop speeding up after 2 to 4 threads, whereas RabbitQC maintains a good speedup throughout, which is why its multithreaded performance is far better than that of the other software.
However, further tests show that single-threaded RabbitQC has a throughput of 0.3 M reads/s; at 300 bytes per read this is less than 100 MB/s (0.3 M × 300 B = 90 MB/s), far below the I/O performance peak. Reaching that peak takes 15 to 20 threads, and not every researcher who uses the software has a 20-core server; in other words, the single-thread speed is not fast enough. In addition, the over-representation analysis module in RabbitQC has high complexity: single-threaded analysis of a 7 GB fastq file takes nearly 40 minutes. RabbitQC also reads and writes compressed fastq data very inefficiently: it can process only tens of MB per second and does not support multithreading for this step, so compression and decompression become the performance bottleneck of the whole program whenever compressed files are read or written.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a sequencing data quality control method that processes sequencing data efficiently.
To achieve this object, one or more embodiments of the present invention provide the following technical solutions:
In a first aspect, a sequencing data quality control method is disclosed, comprising:
reading FASTQ data into memory, cutting the FASTQ data into data blocks, and placing the processed data blocks into a memory pool;
continuously checking the memory pool for required data blocks, taking out the data blocks, formatting the retrieved data into Read class objects, and then processing each Read class object according to a quality control flow;
and merging the statistical information produced by the quality control flow and outputting it in visual form.
As a further technical solution, FASTQ data is read into memory and cut into data blocks of the required size, and the end of each data block is then processed to ensure that no single FASTQ record is split across two data blocks.
As a further technical solution, when the memory pool is checked for required data blocks and none is available, the method waits until an available data block is detected in the memory pool.
As a further technical solution, processing each Read class object according to the quality control flow specifically comprises: performing information statistics, adapter removal, duplication analysis, over-representation analysis, and sequence trimming on each Read class object.
As a further technical solution, the method further comprises: outputting the processed FASTQ data to a file.
As a further technical solution, when each Read class object is processed according to the quality control flow, vectorization techniques are used to optimize single-thread performance.
As a further technical solution, information statistics, adapter removal, and duplication analysis are refactored so that the code logic is better suited to vectorization;
the information statistics part consolidates originally scattered memory accesses by modifying the storage layout and loop order, and then uses gather and scatter instructions to construct vectors, reducing memory access time;
the duplication analysis reduces branch decisions during vectorization by changing the loop termination condition.
As a further technical solution, the Read class object no longer stores four string-type member variables; instead it stores pointers to the fastq data in memory.
As a further technical solution, compression and decompression are accelerated in parallel while each Read class object is processed according to the quality control flow.
In a second aspect, a sequencing data quality control system is disclosed, comprising:
a data processing module configured to: read FASTQ data into memory, cut the FASTQ data into data blocks, and place the processed data blocks into a memory pool;
a quality control processing module configured to: continuously check the memory pool for required data blocks, take out the data blocks, format the retrieved data into Read class objects, and then process each Read class object according to a quality control flow;
a statistics merging module configured to: merge the statistical information produced by the quality control flow and output it in visual form.
The above one or more technical solutions have the following beneficial effects:
Compared with RabbitQC, the basic quality control flow runs 2.8 times faster on single-end data and 3.5 times faster on paired-end data in a single thread, while preserving the near-linear speedup of RabbitQC. The optimized RabbitQCPlus needs only 4 to 8 threads to reach the peak write performance of a mainstream SSD, so quality control tasks can be completed efficiently on an ordinary PC.
For the over-representation module, a before-and-after comparison on 7.5 GB of single-end data shows a 7- to 9-fold performance improvement at every thread count, with a near-linear speedup maintained. The improvement measured for this part is almost the same for paired-end data as for single-end data.
For reading and writing compressed files in xxx.fastq.gz format, the optimized RabbitQCPlus supports multithreaded compression and decompression of gz files. RabbitQC is stalled by compression and decompression, so multithreading brings it no benefit, and the multithreaded speedup of the other two tools is poor. RabbitQCPlus approaches an order-of-magnitude speedup at 4 threads; 8 threads bring little further improvement because RabbitQCPlus, too, is ultimately limited by multithreaded decompression. The improvement measured for this part is almost the same for paired-end data as for single-end data.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.
FIG. 1 is a flow chart of an embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
Referring to FIG. 1, this example discloses a sequencing data quality control method comprising the following steps:
Step one: a single producer reads FASTQ data from the hard disk into memory and performs simple data formatting, that is, it cuts the data into blocks of 4 MB and processes the end of each block so that no FASTQ record is split across two data blocks. The processed data blocks are then placed in a memory pool for the consumers to use.
Step two: multiple consumers continuously check whether an available data block exists in the memory pool. If not, they wait; if so, they take out the data and perform the more complex data formatting, namely formatting the FASTQ data in memory into dedicated Read class objects. Each Read class object then goes through the quality control steps: information statistics, adapter removal, duplication analysis, over-representation analysis, sequence trimming, and so on.
Step three: finally, the consumers merge their statistical information and output it as a visual quality control report, and at the same time output the processed FASTQ data to a file.
Note that steps one and two run concurrently. The producer and the consumers are started at the beginning of the program and continuously put blocks into and take blocks out of the memory pool, so the whole procedure is a streaming process. On the basis of this framework, the invention optimizes the functions of both the producer and the consumers to improve operating efficiency.
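The three steps above can be sketched in Python as a single producer feeding a bounded queue that several consumers drain. This is a minimal illustration under stated assumptions, not the RabbitQCPlus implementation: the names `cut_blocks`, `consumer`, and `run_qc` are hypothetical, the block size is shrunk from 4 MB to 64 bytes for demonstration, the input is assumed to be well-formed four-line FASTQ, and the "quality control flow" is reduced to counting reads and bases.

```python
import queue
import threading
from collections import Counter

BLOCK_SIZE = 64  # tiny for demonstration; the text describes 4 MB blocks


def cut_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield blocks of about block_size bytes that each end on a record boundary.

    Assumes well-formed FASTQ with exactly four lines per record.
    """
    start, n = 0, len(data)
    while start < n:
        end = min(start + block_size, n)
        newlines = data.count(b"\n", start, end)
        # Extend the block line by line until it holds whole 4-line records.
        while end < n and (newlines % 4 != 0 or data[end - 1] != 0x0A):
            nxt = data.find(b"\n", end)
            if nxt == -1:
                end = n
                break
            end = nxt + 1
            newlines += 1
        yield data[start:end]
        start = end


def consumer(pool: queue.Queue, results: list, lock: threading.Lock):
    """Take blocks from the pool and accumulate per-thread statistics."""
    stats = Counter()
    while True:
        block = pool.get()          # blocks (waits) while the pool is empty
        if block is None:           # sentinel: the producer has finished
            break
        lines = block.split(b"\n")
        for seq in lines[1::4]:     # every 4th line, starting at the 2nd, is a base sequence
            stats["reads"] += 1
            stats["bases"] += len(seq)
    with lock:
        results.append(stats)


def run_qc(data: bytes, n_consumers: int = 2) -> Counter:
    pool = queue.Queue(maxsize=8)   # bounded queue standing in for the memory pool
    results, lock = [], threading.Lock()
    workers = [threading.Thread(target=consumer, args=(pool, results, lock))
               for _ in range(n_consumers)]
    for w in workers:
        w.start()
    for block in cut_blocks(data):  # single producer
        pool.put(block)
    for _ in workers:
        pool.put(None)
    for w in workers:
        w.join()
    total = Counter()
    for s in results:
        total.update(s)             # merge the per-consumer statistics
    return total
```

Counting lines rather than searching for `@` avoids misreading a quality line that happens to start with `@`; this works here only because every block is known to start on a record boundary.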
(1) Single-thread performance is optimized using scalable vectorization techniques:
Vectorization means using dedicated vector registers so that a single instruction operates on multiple data items at once. As vectorization has matured through successive iterations, some Intel processors have introduced the AVX-512 instruction set, which can operate on 16 float values simultaneously. Vectorization is an effective way to improve single-thread performance in compute-intensive tasks. The information statistics module, the overlap analysis module, and the adapter trimming module in the quality control flow are all compute-intensive and have fairly regular memory access patterns, so they are refactored to make the code logic better suited to vectorization. For example, the information statistics part consolidates originally scattered memory accesses by modifying the storage layout and loop order, and then constructs vectors with gather and scatter instructions to reduce memory access time; the overlap analysis module reduces branch decisions during vectorization by changing the loop termination condition, which makes vectorization easier and improves computational efficiency. As a result, the overlap analysis module is accelerated 6-fold, the adapter removal module 4-fold, and the information statistics module 2-fold. These gains make it possible for the quality control process to reach the peak write performance of the system on an ordinary PC.
Because not every machine supports the AVX-512 instruction set, RabbitQCPlus also implements optimized versions for other instruction sets, a compiler auto-vectorized version, a non-vectorized version, and so on, to make the optimization more widely applicable. The version can be specified manually at compile time, so that the CPU can deliver its highest performance on different platforms. Overall, the hotspot functions optimized by vectorization are accelerated 2- to 6-fold, and overall single-thread performance is accelerated 2- to 3-fold.
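The storage-layout part of this optimization can be illustrated without SIMD. The sketch below is an assumption about what "concentrating discrete memory access" might look like, not the patented code: per-position base counts are kept in one flat, position-major array so that the four counters of one sequencing cycle sit next to each other. Python cannot express gather and scatter instructions, so only the layout is shown; `count_bases_flat` and `BASE_CODE` are hypothetical names.

```python
from array import array

# ASCII codes of A, C, G, T mapped to 0..3; other bases such as N are skipped here.
BASE_CODE = {65: 0, 67: 1, 71: 2, 84: 3}


def count_bases_flat(reads, max_len):
    """Per-position base counts in one flat, position-major array.

    Entry [pos * 4 + code] holds the count of base `code` at cycle `pos`,
    so the four counters belonging to one position are adjacent in memory.
    """
    counts = array("Q", [0]) * (max_len * 4)
    for seq in reads:
        for pos, byte in enumerate(seq):
            code = BASE_CODE.get(byte)
            if code is not None:
                counts[pos * 4 + code] += 1
    return counts
```

With one separate array per base, consecutive positions of a read would update locations that are far apart; the flat layout keeps them within a small contiguous region, which is the kind of access pattern that vector loads and stores handle well.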
(2) Single-thread performance is optimized with a new fastq data storage scheme:
fastq data is text-based sequencing data consisting of many entries. An entry is also called a read, and each read has 4 lines: the sequence identifier, the base sequence, the separator, and the base quality scores.
Traditional quality control software stores a single read by constructing a class object that contains 4 string variables representing the 4 lines of information, and the data in memory has to be copied again when the class object is constructed (because constructing a C++ string copies the original data). The innovation of RabbitQCPlus over traditional quality control software is that the class object does not store 4 strings; it directly stores the addresses of the corresponding original data in memory, that is, 4 pointers. This eliminates one memory copy, saves memory, and improves operating efficiency.
Next comes the parsing of the fastq data. RabbitQC splits the data at an arbitrary position and then searches backward for a complete fastq record. This causes no problem for ordinary paired-end data, but for paired-end data whose corresponding read lengths differ greatly, errors accumulate, and a failure occurs once the accumulated difference exceeds the buffer size (1 MB). RabbitQCPlus improves this function: it searches forward instead, modifies the termination condition for paired-end data, and can also handle the case where read1 and read2 contain different numbers of reads.
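The idea of storing references instead of copies can be sketched in Python with `memoryview`, which plays the role that raw pointers play in C++. This is an illustrative analogue under that assumption, not the RabbitQCPlus class; `ReadRef` and `parse_block` are hypothetical names, and the input block is assumed to contain whole four-line records.

```python
from dataclasses import dataclass


@dataclass
class ReadRef:
    """One FASTQ record as four zero-copy views into a shared block."""
    name: memoryview
    seq: memoryview
    sep: memoryview
    qual: memoryview


def parse_block(block: bytes):
    """Return ReadRef objects whose fields reference `block` without copying it."""
    view = memoryview(block)
    reads, fields, start = [], [], 0
    while start < len(block):
        end = block.find(b"\n", start)
        if end == -1:
            end = len(block)
        fields.append(view[start:end])  # slicing a memoryview does not copy bytes
        start = end + 1
        if len(fields) == 4:
            reads.append(ReadRef(*fields))
            fields = []
    return reads
```

Because every field is a view, the block must stay alive for as long as the `ReadRef` objects are in use; the same lifetime constraint applies to pointer-based records in C++.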
(3) The invention accelerates the compression and decompression processes in parallel:
The decompression process concerns the input data of the entire quality control flow; since sequencing data may reach hundreds of GB, the compression format, i.e., xxx. When quality control software reads compressed data, the data must be decompressed first. Traditional quality control software generally uses the simple single-threaded gzread interface, which is very inefficient and seriously slows down the subsequent quality control process.
Likewise, the compression process concerns the output of the entire quality control flow, namely the cleaned fastq data. Most traditional quality control software uses the single-threaded gzwrite interface, which is inefficient: the cleaned data cannot be compressed in time, and the program is slow to finish. Some software currently uses multithreaded compression packages such as bgzip, but that package only performs block compression, files in that format are not widely applicable, and much decompression software cannot recognize them. In addition, the samtools software (https://github.com/samtools/samtools) for processing BAM files only supports compressed files in the BAM format, and because the BAM format is block-compressed, multi-core processing is straightforward. The pugz library (https://github.com/Piezoid/pugz) integrated into RabbitQCPlus in the invention performs multithreaded decompression of ordinary gz files and therefore has a wider range of application.
A comparison of the performance of each step when RabbitQC reads and writes compressed data shows the following. For single-threaded quality control, decompression can barely keep up with the processing speed, and compression runs at only half the processing speed, so processed data cannot be compressed and output in time and the queue becomes blocked. With multithreaded quality control the problem is more serious: decompression runs at only 1/10 of the processing speed and cannot supply data to the processing module in time, and the compression module is slower still, so the heavily optimized quality control processing module cannot show its performance at all.
Compression and decompression have long been a research hotspot. The inherent format dependence of compressed data makes parallel decompression difficult, but researchers in this area have made great progress and produced the parallel decompression software pugz and the parallel compression software pigz. RabbitQCPlus builds on this work and integrates the two libraries directly into the software to provide streaming processing. Compared with first decompressing xx.fastq.gz to xx.fastq on the hard disk and then reading xx.fastq back, the parallel decompression module integrated in RabbitQCPlus saves one full write and one full read of the entire file on disk, which improves processing efficiency and also avoids the disk space occupied by the intermediate temporary file; the same holds for the compression flow.
As a result, RabbitQCPlus achieves an acceleration of nearly one order of magnitude over other quality control software when reading and writing compressed files for quality control.
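The streaming aspect described above (decompressing in memory instead of writing an intermediate xx.fastq file) can be sketched with the Python standard library. This sketch is single-threaded and uses the `gzip` module, so it does not reproduce the parallelism of pugz or pigz; `stream_decompressed` and `count_reads` are hypothetical names introduced only for illustration.

```python
import gzip
import io


def stream_decompressed(fileobj, chunk_size=1 << 20):
    """Yield decompressed chunks from a gzip stream without a temporary file."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            yield chunk


def count_reads(fileobj):
    """Count FASTQ records by counting newlines in the decompressed stream."""
    newlines = 0
    for chunk in stream_decompressed(fileobj):
        newlines += chunk.count(b"\n")
    return newlines // 4
```

A caller could pass an open file such as `open("sample.fastq.gz", "rb")` (an example path, not one from the source); each decompressed chunk can be handed to the block-cutting producer as soon as it is available.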
(4) The over-representation analysis is accelerated using efficient data structures:
To save memory, the over-representation module of FastQC analyzes only the first 1 M reads. fastp pointed out the problems with this approach and instead analyzes the entire file to determine over-represented sequences. To reduce run time, fastp analyzes one read in every 20; this samples uniformly across all reads, but 1/20 of the reads is sometimes not enough to find every over-represented sequence. If the default fastp parameters are changed to analyze all reads, efficiency drops sharply, and a single-threaded run on 7 GB of fastq data takes nearly 40 minutes.
Quality control software such as RabbitQC and fastp handles the over-representation module as follows. The first 10,000 reads are analyzed to find frequently occurring substrings, which are inserted into a map with the substring as the key and its occurrence count as the value, initially set to 0. The substrings of all reads are then enumerated, and for each substring that exists in the map, its value is incremented. The specific optimization in RabbitQCPlus is to replace the map, first because the query complexity of the STL map is high, and second because this scenario involves a very large number of queries of which 99.9% find nothing, a situation for which a map is poorly suited. RabbitQCPlus implements a chained hash table using hand-written linked lists simulated with arrays, and applies the ntHash idea to reduce the complexity of computing the hash values of adjacent substrings by one dimension. In addition, because most queries find nothing, the scenario is well suited to a Bloom filter placed in front: only queries that pass the filter go on to the chained hash table, which further improves efficiency. As a result, this part of the optimized RabbitQCPlus is accelerated by nearly one order of magnitude in both single-threaded and multithreaded runs, while the analysis results remain fully consistent with fastp.
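The filtering scheme described above can be sketched as follows. Several simplifications are assumptions made for illustration: all candidate substrings have one fixed length `k`, a plain polynomial rolling hash stands in for ntHash, a Python `dict` stands in for the hand-written chained hash table, and the Bloom filter derives two bit positions from a single hash value. The names `rolling_hashes`, `BloomFilter`, and `count_overrepresented` are hypothetical.

```python
BASE = 131
MOD = (1 << 61) - 1


def rolling_hashes(seq: bytes, k: int):
    """Yield (start, hash) for every k-mer of seq, updating the hash in O(1) per step."""
    if len(seq) < k:
        return
    top = pow(BASE, k - 1, MOD)  # weight of the byte that leaves the window
    h = 0
    for b in seq[:k]:
        h = (h * BASE + b) % MOD
    yield 0, h
    for i in range(k, len(seq)):
        h = ((h - seq[i - k] * top) * BASE + seq[i]) % MOD
        yield i - k + 1, h


class BloomFilter:
    def __init__(self, n_bits: int = 1 << 16):
        self.n_bits = n_bits
        self.bits = bytearray(n_bits // 8)

    def _positions(self, h: int):
        # Two bit positions derived from one hash value (a simple illustrative choice).
        return (h % self.n_bits, (h // self.n_bits) % self.n_bits)

    def add(self, h: int):
        for p in self._positions(h):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, h: int) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(h))


def count_overrepresented(reads, candidates, k):
    """Count occurrences of each candidate k-mer across all reads."""
    counts = {c: 0 for c in candidates}     # dict stands in for the chained hash table
    bloom = BloomFilter()
    for c in candidates:
        for _, h in rolling_hashes(c, k):
            bloom.add(h)
    for seq in reads:
        for start, h in rolling_hashes(seq, k):
            if not bloom.might_contain(h):  # most substrings are rejected here
                continue
            kmer = seq[start:start + k]
            if kmer in counts:              # exact lookup removes false positives
                counts[kmer] += 1
    return counts
```

The Bloom filter can report false positives but never false negatives, so the exact lookup that follows keeps the final counts identical to those of an unfiltered scan.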
Example two
This embodiment provides a sequencing data quality control system, comprising:
a data processing module configured to: read FASTQ data into memory, cut the FASTQ data into data blocks, and place the processed data blocks into a memory pool;
a quality control processing module configured to: continuously check the memory pool for required data blocks, take out the data blocks, format the retrieved data into Read class objects, and then process each Read class object according to a quality control flow;
a statistics merging module configured to: merge the statistical information produced by the quality control flow and output it in visual form.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the scope of the invention. Those skilled in the art should understand that various modifications and variations that can be made without inventive effort on the basis of the technical solution of the invention remain within its scope of protection.

Claims (8)

1. The sequencing data quality control method is characterized by comprising the following steps:
reading FASTQ data to a memory, cutting the FASTQ data into data blocks, and placing the processed data blocks into a memory pool;
continuously detecting required data blocks from the memory pool, taking out the data blocks, formatting the taken out data into Read objects, and then processing each Read object according to a quality control flow; optimizing single-thread performance by adopting a vectorization technology when each Read object is processed according to a quality control flow; reconstructing information statistics, joint removal and repeatability analysis to make code logic more suitable for vectorization;
the information statistics part concentrates the original discrete memory access by modifying a storage mode and a circulation order, and then uses gather and scatter instructions to construct vectors to reduce the memory access time;
the repeatability analysis reduces branch decisions in the vectorization process by altering the termination conditions of the loop;
and merging the statistical information processed according to the quality control flow, and performing visual output.
2. The method of sequencing data quality control as claimed in claim 1, wherein FASTQ data is read into memory and cut into data blocks of a desired size, and then processed at the end of the data blocks to ensure that there is not one FASTQ data distributed over two data blocks.
3. The sequencing data quality control method of claim 1, wherein while the desired data block is continuously detected from the memory pool, if not, waiting until a usable data block is detected from the memory pool.
4. The sequencing data quality control method of claim 1, wherein each Read class object is processed according to a quality control flow, specifically: and carrying out information statistics, joint removal, repeatability analysis, over-representation analysis and sequence clipping processing on each Read class object.
5. The sequencing data quality control method of claim 1, further comprising: outputting the processed FASTQ data to a file.
6. The sequencing data quality control method of claim 1, wherein the Read class object stores no more member variables of 4 string types but a pointer to fastq data in the memory.
7. The sequencing data quality control method of claim 1, wherein parallel acceleration compression and decompression is used during processing of each Read class object according to the quality control flow.
8. A sequencing data quality control system, comprising:
a data processing module configured to: read FASTQ data into memory, cut the FASTQ data into data blocks, and place the processed data blocks into a memory pool;
a quality control processing module configured to: continuously detect required data blocks in the memory pool, take out the data blocks, format the retrieved data into Read objects, and then process each Read object according to a quality control flow; optimize single-thread performance with vectorization when each Read object is processed according to the quality control flow; and restructure the information statistics, joint (adapter) removal, and repeatability (duplication) analysis so that the code logic is better suited to vectorization;
wherein the information statistics part concentrates originally scattered memory accesses by changing the storage layout and the loop order, and then constructs vectors using gather and scatter instructions to reduce memory access time;
wherein the repeatability analysis reduces branch decisions during vectorization by changing the loop termination condition; and
a statistics merging module configured to: merge the statistical information produced by the quality control flow and output it in visual form.
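The storage-layout and merging ideas in the system claim can be sketched in plain Python. This is a sketch under stated assumptions: the single flat counter array, the `base & 7` slot index, and the names `count_bases` and `merge_counts` are illustrative choices, not details taken from the patent, and Python cannot show the SIMD gather and scatter instructions themselves.

```python
SLOTS = 8  # one slot per value of (base & 7); A=1, C=3, T=4, N=6, G=7


def count_bases(seqs, max_len):
    """Per-position base counts kept in one contiguous, flattened array."""
    counts = [0] * (max_len * SLOTS)
    for seq in seqs:
        # Iterating over bytes yields integers, so `base & 7` is a cheap,
        # branch-free index into the slots for this position.
        for pos, base in enumerate(seq[:max_len]):
            counts[pos * SLOTS + (base & 7)] += 1
    return counts


def merge_counts(partials):
    """Merge per-thread partial statistics by element-wise summation."""
    return [sum(column) for column in zip(*partials)]
```

Keeping all counters for one position next to each other is what makes the accesses contiguous; each worker fills its own array, and the merging step only needs to add the arrays together.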
CN202210308643.7A 2022-03-28 2022-03-28 Sequencing data quality control method and system Active CN114489518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210308643.7A CN114489518B (en) 2022-03-28 2022-03-28 Sequencing data quality control method and system

Publications (2)

Publication Number Publication Date
CN114489518A (en) 2022-05-13
CN114489518B (en) 2022-09-09

Family

ID=81489165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210308643.7A Active CN114489518B (en) 2022-03-28 2022-03-28 Sequencing data quality control method and system

Country Status (1)

Country Link
CN (1) CN114489518B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117393046B (en) * 2023-12-11 2024-03-19 Shandong University Space transcriptome sequencing method, system, medium and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107003853A (en) * 2014-12-24 2017-08-01 Intel Corporation Systems, devices, and methods for data speculation execution
CN107003850A (en) * 2014-12-24 2017-08-01 Intel Corporation Systems, devices, and methods for data speculation execution

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440229B (en) * 2013-08-12 2017-11-10 Inspur Electronic Information Industry Co., Ltd. Vectorization optimization method based on MIC architecture processors
CN103559020B (en) * 2013-11-07 2016-07-06 Institute of Software, Chinese Academy of Sciences Parallel compression and decompression method for FASTQ files of DNA read data
CN107766696A (en) * 2016-08-23 2018-03-06 武汉生命之美科技有限公司 Eukaryote alternative splicing analysis method and system based on RNA-seq data
US10346166B2 (en) * 2017-04-28 2019-07-09 Intel Corporation Intelligent thread dispatch and vectorization of atomic operations
WO2020157887A1 (en) * 2019-01-31 2020-08-06 Mitsubishi Electric Corporation Sentence structure vectorization device, sentence structure vectorization method, and sentence structure vectorization program
CN110349635B (en) * 2019-06-11 2021-06-11 South China University of Technology Parallel compression method for gene sequencing data quality scores
CN111292805B (en) * 2020-03-19 2023-08-18 Shandong University Third generation sequencing data overlap detection method and system

Also Published As

Publication number Publication date
CN114489518A (en) 2022-05-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant