CN113139798B

CN113139798B - Gene sequencing flow management control method and system

Info

Publication number: CN113139798B
Application number: CN202110633608.8A
Authority: CN
Inventors: 谭光明; 康宁; 张春明; 段勃
Original assignee: Western Research Institute Of China Science And Technology Computing Technology
Current assignee: Western Research Institute Of China Science And Technology Computing Technology
Priority date: 2021-06-07
Filing date: 2021-06-07
Publication date: 2024-02-20
Anticipated expiration: 2041-06-07
Also published as: CN113139798A

Abstract

The invention relates to the technical field of gene sequencing flow management, and in particular to a gene sequencing flow management control method and system. The system comprises a plurality of heterogeneous units, each of which comprises a flow control module. The flow control module is used for receiving flow management information, arbitrating the flow management information to obtain management commands, and distributing the management commands to preset management message queues; it is also used for receiving data information, calling the management command in the corresponding management message queue according to the data information, and executing the management command. The system further comprises a central processing unit, which is used for acquiring the flow management information and sending it to the heterogeneous units; when a heterogeneous unit receives the flow management information, it calls its flow control module to obtain the management message queues. This scheme provides programmable gene sequencing flow control, realizes centralized management of the gene sequencing flow, and effectively reduces the time overhead of the system in gene sequencing flow control.

Description

Gene sequencing flow management control method and system

Technical Field

The invention relates to the technical field of gene sequencing flow management, in particular to a gene sequencing flow management control method and system.

Background

With the rapid development of bioinformatics, gene analysis has become a widely used technique in scientific research and industry, and has been successfully applied to species identification, disease diagnosis and similar tasks. Within gene research, gene sequencing has become an increasingly important field. In general, gene sequencing involves determining the order of the nucleotides of nucleic acids such as RNA or DNA fragments. Shorter gene sequences are analyzed, and the resulting sequence information is used by various bioinformatics methods to logically fit multiple fragments together and reliably determine the sequence of a longer stretch of genetic material.

Gene sequencing technology is closely tied to computer technology, and the overall computer processing flow of gene sequencing can be roughly divided into six steps: BWA-MEM, Sort, Mark Duplicates, Indel Realignment, BQSR and Variant Calling. The existing gene sequencing flow is usually controlled by a CPU. The flow is fixed when the program is written and cannot be adjusted afterwards; for example, it cannot be changed so that the gene sequence is stored after each processing link of each processing step when the sequencing task requires this. Meanwhile, because the flow is controlled by the CPU, the corresponding computer processing is carried out in the CPU, so the CPU load is high, switching is required between different gene sequencing steps, and the time overhead of gene sequencing increases.

Therefore, there is a need for a gene sequencing flow management control method and system that provides programmable control of the gene sequencing flow and reduces the time overhead of flow control.

Disclosure of Invention

One of the objectives of the present invention is to provide a gene sequencing flow management control system, which can provide programmable gene sequencing flow control and reduce the time overhead in the gene sequencing flow control.

The first basic scheme provided by the invention is as follows: the gene sequencing flow management control system comprises a plurality of heterogeneous units, wherein each heterogeneous unit comprises a flow control module. The flow control module is used for receiving flow management information, arbitrating the flow management information to obtain management commands, and distributing the management commands to preset management message queues; the flow control module is also used for receiving data information, calling the management command in the corresponding management message queue according to the data information, and executing the management command.

The first basic scheme has the beneficial effects that: the flow control module is arranged in each heterogeneous unit, the gene sequencing flow in each heterogeneous unit is managed through the flow control module, and the centralized management of the gene sequencing flow is realized through the programmable management of the flow management information.

The flow management information comprises a plurality of management commands. The flow control module arbitrates the flow management information to obtain these management commands and distributes them to the management message queues for storage. When data information is received, the management command is called from the corresponding management message queue and executed, thereby realizing the processing and transmission of the data information. Meanwhile, the processing and transmission of the data information are offloaded to the heterogeneous units, which reduces the delay of data information control and realizes efficient control of the gene sequencing flow.

By adopting the scheme, the programmable gene sequencing flow control is provided through the flow control module and the flow management information, and meanwhile, the centralized management of the gene sequencing flow is realized, so that the time overhead of the system in the gene sequencing flow control is effectively reduced.

Further, the management message queue comprises a queue number, and the flow control module comprises a management message arbitration sub-module, a data flow sub-module and a multi-queue sub-module. The management message arbitration sub-module is used for analyzing the flow management information to obtain a message queue number, and writing the management commands in order into the management message queue whose queue number matches the message queue number;

The data flow sub-module is used for analyzing the data information to obtain a data queue number, screening a corresponding management message queue according to the data queue number and the queue number, calling the management command in the screened management message queue, and the multi-queue sub-module is used for deleting the corresponding management command in the management message queue when the management command in the management message queue is called.

The beneficial effects are that: the management message queue in the flow control module is provided with a unique queue number, and the management message arbitration sub-module is configured to obtain the message queue number from the flow management information, so as to learn the management message queue into which the management command needs to be put. The sequence of the operations executed in the gene sequencing flow is fixed, so that the management commands are written into the management message queue in sequence, and the quick calling of the subsequent management commands is facilitated.

The data information contains the number of the management message queue that holds the gene sequencing flow to be executed on the gene data, namely the data queue number. The data flow sub-module screens the corresponding management message queue based on the data queue number and the queue number, and calls the management command for execution, thereby completing the gene sequencing flow. After a management command is called, it is deleted, so that the command to be executed next in the gene sequencing flow sits at the first position of the management message queue and can be called quickly when the next data information arrives.
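The interplay of the three sub-modules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `FlowControlModule`, the dictionary keys `queue_no` and `commands`, and the command names are all hypothetical, and a `collections.deque` is assumed as the FIFO structure.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagementCommand:
    name: str  # illustrative names only, e.g. "read_local", "seed"


class FlowControlModule:
    """Toy model of the three sub-modules (all names are assumptions)."""

    def __init__(self, queue_numbers):
        # Multi-queue sub-module: one FIFO per unique queue number.
        self.queues = {n: deque() for n in queue_numbers}

    def arbitrate(self, flow_management_info):
        # Management message arbitration sub-module: parse the message
        # queue number and append the commands in their given order.
        queue_no = flow_management_info["queue_no"]
        for cmd in flow_management_info["commands"]:
            self.queues[queue_no].append(cmd)

    def next_command(self, data_info):
        # Data flow sub-module: select the queue whose number matches the
        # data queue number. popleft() returns the head command and deletes
        # it, so the following command moves to the first position.
        queue = self.queues[data_info["queue_no"]]
        return queue.popleft() if queue else None


fcm = FlowControlModule(queue_numbers=[0, 1])
fcm.arbitrate({
    "queue_no": 0,
    "commands": [ManagementCommand("read_local"), ManagementCommand("seed")],
})
first = fcm.next_command({"queue_no": 0})
second = fcm.next_command({"queue_no": 0})
```

Because the order of operations in a sequencing flow is fixed, a plain FIFO is enough here: calling and deleting collapse into a single `popleft()`.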

Further, the flow control module is also used for judging, after executing a management command, whether that command is a write-data command. When it is, the flow control module waits for the next data information; otherwise, it calls the next management command in the management message queue according to the data information obtained after executing the command, and executes it.

The beneficial effects are that: by identifying the management command, it can be judged whether the gene sequencing flow, or a certain step in the gene sequencing flow, has finished. When the gene sequencing flow or one of its steps is completed, data is finally written to the local end or the remote end. Based on this characteristic that a write comes last, the subsequent execution step is determined, thereby realizing control of the gene sequencing flow.

The heterogeneous unit comprises an in-memory computing unit and a storage computing unit, the central processing unit is used for acquiring a gene data reading request and sending the gene data reading request to the storage computing unit, and the storage computing unit is used for calling a corresponding flow control module to acquire pre-stored gene data by taking the gene data reading request as data information when receiving the gene data reading request;

The in-memory computing unit, the central processing unit and the storage computing unit process the gene data in sequence;

The storage computing unit is also used for sending the processed gene data to the in-memory computing unit when it receives the processed gene data. The central processing unit is also used for extracting the gene data from the in-memory computing unit when the in-memory computing unit receives it, processing the gene data to obtain processed gene data, and sending the processed gene data to the storage computing unit. The storage computing unit is also used for compressing and storing the gene data.

The beneficial effects are that: the gene data is stored in the storage computing unit, and when the gene data needs to be subjected to gene sequencing, the corresponding gene data is called through a gene data reading request. And taking the gene data as data information, and executing corresponding operations on the gene data through the in-memory computing unit, the central processing unit and the storage computing unit. After the gene sequencing process is finished or one step is finished, the central processing unit is informed when the in-memory computing unit receives the gene data, and the central processing unit extracts the gene data for corresponding processing.

Further, when the in-memory computing unit, the central processing unit and the storage computing unit process the gene data in sequence,

The in-memory computing unit is used for taking the gene data as data information when receiving the gene data sent by the storage computing unit, and calling the corresponding flow control module to obtain the processed gene data;

the central processing unit is also used for processing the gene data to obtain processed gene data when receiving the gene data sent by the in-memory computing unit;

and the storage computing unit is also used for calling a corresponding flow control module to compress and store the gene data by taking the gene data as data information when receiving the gene data sent by the central processing unit.

The beneficial effects are that: the gene sequencing flow comprises a plurality of steps, a plurality of processing steps exist under each step, and the plurality of processing steps are executed based on the in-memory computing unit, the central processing unit and the storage computing unit, so that the gene sequencing of the gene data is completed.

The second objective of the present invention is to provide a method for managing and controlling a gene sequencing process.

The second basic scheme provided by the invention is as follows: the gene sequencing flow management control method comprises the following steps:

Command management step: receiving flow management information, arbitrating the flow management information to obtain management commands, and distributing the management commands to preset management message queues;

Command execution step: receiving data information, calling a management command in the management message queue according to the data information, and executing the management command.

The second basic scheme has the beneficial effects that: the flow management information comprises a plurality of management commands, the plurality of management commands are obtained by arbitrating the flow management information, and the management commands are distributed to the management message queue for storage. When receiving the data information, the management command is called from the corresponding management message queue to execute, thereby realizing the processing and transmission of the data information.

By adopting the scheme, the programmable gene sequencing flow control is provided through the flow management information, and meanwhile, the centralized management of the gene sequencing flow is realized, so that the time overhead of the system in the gene sequencing flow control is effectively reduced.

Further, the management message queue includes a queue number, and distributing the management commands to the preset management message queues includes the following:

analyzing the flow management information to obtain a message queue number, and writing the management commands in order into the management message queue whose queue number matches the message queue number;

calling a management command in the management message queue according to the data information includes the following:

analyzing the data information to obtain a data queue number, screening the corresponding management message queue according to the data queue number and the queue number, calling the management command in the screened management message queue, and deleting the called management command from the management message queue.

The beneficial effects are that: the management message queue has a unique queue number, and the flow management information is analyzed to obtain the message queue number, so that the management message queue into which the management command needs to be put is known. The operation sequence executed in the gene sequencing flow is fixed, so that the management command is written into the management message queue in sequence, the subsequent quick calling of the management command is facilitated, and the high-efficiency control of the gene sequencing flow is realized.

The data information contains the number of the management message queue that holds the gene sequencing flow to be executed on the gene data, namely the data queue number. The corresponding management message queue is screened based on the data queue number and the queue number, and the management command is called for execution, thereby completing the gene sequencing flow. After a management command is called, it is deleted, so that the command to be executed next in the gene sequencing flow sits at the first position of the management message queue and can be called quickly when the next data information arrives, which reduces the time overhead of the system in gene sequencing flow control.

Further, the management commands include executing a gene sequencing operator, reading data and writing data, and the method further includes the following:

Execution judging step: after a management command is executed, judging whether the management command is a write-data command; when it is, waiting for the next data information; otherwise, calling the command execution step according to the data information obtained after executing the management command.

The beneficial effects are that: the management command is identified by executing the determining step, thereby determining whether the gene sequencing flow or a certain step in the gene sequencing flow is completed. When the gene sequencing flow is completed or one of the steps is completed, data is finally written into the local or the remote end, and based on the characteristic of the finally written data, the subsequent execution step is judged, so that the control of the gene sequencing flow is realized.
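The execution judging step amounts to a loop that keeps calling commands until one of them is a write. The following Python sketch shows that control rule only; the function `run_until_write`, the command names and the handler table are hypothetical stand-ins, and the "operator" handler is a placeholder rather than a real sequencing operator.

```python
from collections import deque

WRITE_COMMANDS = {"write_local", "write_remote"}  # hypothetical command names


def run_until_write(queue, data, handlers):
    """Execute queued commands in order; stop after a write-data command."""
    executed = []
    while queue:
        command = queue.popleft()        # call and delete the head command
        data = handlers[command](data)   # execute the management command
        executed.append(command)
        if command in WRITE_COMMANDS:    # execution judging step
            break                        # wait for the next data information
    return data, executed


handlers = {
    "read_local": lambda d: d,
    "operator": lambda d: d.upper(),     # placeholder for a sequencing operator
    "write_local": lambda d: d,
}
queue = deque(["read_local", "operator", "write_local", "read_local"])
result, executed = run_until_write(queue, "acgt", handlers)
```

After the call, the trailing `read_local` is still queued: it belongs to the next step and is only called once new data information arrives.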

Further, the method is applied to a system comprising an in-memory computing unit, a central processing unit and a storage computing unit, and further includes the following:

Data processing step: the in-memory computing unit calls the command execution step to obtain processed data information; the central processing unit processes the data information processed by the in-memory computing unit; the storage computing unit calls the command execution step to compress and store the data information processed by the central processing unit.

The beneficial effects are that: the gene sequencing flow comprises a plurality of steps, and each step contains a plurality of processing steps. Through the data processing step, the in-memory computing unit, the central processing unit and the storage computing unit are controlled, and the command execution step is called to acquire and execute the management commands, so that gene sequencing is completed and the processed data information is stored, where it can conveniently be called or further processed in subsequent steps. Meanwhile, the processing and transmission of the data information are offloaded to the in-memory computing unit, the central processing unit and the storage computing unit, which reduces the delay of data information control and realizes efficient control of the gene sequencing flow.

Further, the method also comprises a gene sequencing step, wherein the gene sequencing step comprises the following steps:

S1: acquiring flow management information, and calling the command management step to obtain the management message queues;

S2: acquiring a gene data reading request, taking the gene data reading request as data information, and calling the command execution step to acquire pre-stored gene data;

S3: taking the gene data as data information, and calling the data processing step to process the gene data;

S4: the in-memory computing unit receives the processed gene data; when it does, the central processing unit extracts the gene data from the in-memory computing unit, takes the gene data as data information, processes it to obtain processed gene data, and sends the processed gene data to the storage computing unit; the storage computing unit compresses and stores the received gene data.

The beneficial effects are that: the flow management information comprises all the steps required by the gene sequencing flow, different steps in the gene sequencing flow are distributed in different management message queues to control, parallel processing of gene data is realized, and parallel efficiency among different processing steps of gene sequencing is effectively improved.

Gene data is pre-stored at a designated location, and when gene sequencing is required, the corresponding gene data is called by a gene data reading request. The gene data is then taken as data information, and the corresponding operations are performed on it through the data processing step. After the gene sequencing flow or a certain step is finished, the in-memory computing unit notifies the central processing unit when it receives the gene data, so that the central processing unit extracts the gene data for the corresponding processing. The processed gene data is compressed before being stored, which reduces the storage space it requires.

Drawings

FIG. 1 is a schematic diagram of a flow control module of a gene sequencing process management control system according to an embodiment of the present invention;

FIG. 2 is a block diagram of a second embodiment of a control system for managing a gene sequencing process according to the present invention;

FIG. 3 is a schematic diagram of a process module of a gene sequencing flow management control system according to the present invention;

FIG. 4 is a diagram showing the segmentation of the gene data fields of the gene sequencing flow management control system according to the present invention;

FIG. 5 is a flow chart showing the compression steps of the control method for gene sequencing process management according to the present invention.

Detailed Description

The following is a further detailed description of the embodiments:

Example one

The gene sequencing flow management control system comprises a plurality of heterogeneous units, wherein each heterogeneous unit comprises a flow control module. The flow control module is used for receiving flow management information, arbitrating the flow management information to obtain management commands, and distributing the management commands to preset management message queues; the flow control module is also used for receiving data information, calling the management command in the corresponding management message queue according to the data information, and executing the management command. Specifically:

As shown in FIG. 1, the flow control module includes a multi-queue sub-module, a management message arbitration sub-module and a data flow sub-module. The multi-queue sub-module is configured to store a plurality of management message queues together with the context information of each queue. The context information is the information describing a management message queue and includes its queue number and data depth; that is, each management message queue includes a queue number. Providing a plurality of management message queues supports a multi-queue message interface and allows several gene sequencing flows to be processed concurrently, thereby reducing the time overhead of the gene sequencing flow.
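As a rough illustration of the multi-queue sub-module, the sketch below keeps per-queue context alongside each queue and interleaves two flows. It rests on assumptions the text does not settle: "data depth" is read here as the number of commands currently queued, and the names `ManagementMessageQueue`, `MultiQueue`, `push`, `pop` and `context` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ManagementMessageQueue:
    queue_no: int                                   # context: unique queue number
    commands: deque = field(default_factory=deque)

    @property
    def data_depth(self):
        # Context: assumed here to mean the number of queued commands.
        return len(self.commands)


class MultiQueue:
    def __init__(self, count):
        self.queues = {n: ManagementMessageQueue(n) for n in range(count)}

    def push(self, queue_no, command):
        self.queues[queue_no].commands.append(command)

    def pop(self, queue_no):
        return self.queues[queue_no].commands.popleft()

    def context(self):
        # Snapshot of queue number -> data depth.
        return {n: q.data_depth for n, q in self.queues.items()}


mq = MultiQueue(2)
mq.push(0, "seed")
mq.push(0, "filter")
mq.push(1, "compress")
before = mq.context()
# Two flows advance independently; each queue preserves its own order.
order = [mq.pop(0), mq.pop(1), mq.pop(0)]
after = mq.context()
```

Keeping one FIFO per flow means progress on queue 1 never has to wait for queue 0 to drain, which is the property the concurrency claim relies on.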

The management message arbitration sub-module is used for receiving flow management information, which is supplied from outside, for example by flow management software. The sub-module is also used for arbitrating the flow management information to obtain management commands, analyzing the flow management information to obtain a message queue number, and writing the management commands in order into the management message queue whose queue number matches the message queue number. A management message queue contains a plurality of management commands, and each management command comprises a message operation code and operation code parameters. The message operation code marks the operation to be performed on the data: executing a gene sequencing operator, reading or writing data locally, or reading or writing data remotely. In other words, the management commands comprise executing a gene sequencing operator, reading data and writing data, where reading data covers local and remote reads and writing data covers local and remote writes. The operation code parameters provide the sideband information an operation needs; for example, a management command that reads data locally needs the address and size of the local data, and that address and size are the sideband information.
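One way to picture a management command is as an operation code paired with its sideband parameters. The Python sketch below is an assumption-laden illustration: the enum members, the `REQUIRED_PARAMS` table and the parameter keys `operator`, `address` and `size` are hypothetical, and only the address-and-size requirement for a local read comes from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Opcode(Enum):
    EXECUTE_OPERATOR = "execute_operator"
    READ_LOCAL = "read_local"
    READ_REMOTE = "read_remote"
    WRITE_LOCAL = "write_local"
    WRITE_REMOTE = "write_remote"


# Sideband fields each opcode is assumed to need (illustrative only).
REQUIRED_PARAMS = {
    Opcode.EXECUTE_OPERATOR: {"operator"},
    Opcode.READ_LOCAL: {"address", "size"},
    Opcode.READ_REMOTE: {"address", "size"},
    Opcode.WRITE_LOCAL: {"address", "size"},
    Opcode.WRITE_REMOTE: {"address", "size"},
}


@dataclass(frozen=True)
class ManagementCommand:
    opcode: Opcode
    params: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject commands whose sideband information is incomplete.
        missing = REQUIRED_PARAMS[self.opcode] - set(self.params)
        if missing:
            raise ValueError(f"{self.opcode.value} is missing {sorted(missing)}")

    @property
    def is_read(self):
        return self.opcode in (Opcode.READ_LOCAL, Opcode.READ_REMOTE)

    @property
    def is_write(self):
        return self.opcode in (Opcode.WRITE_LOCAL, Opcode.WRITE_REMOTE)


cmd = ManagementCommand(Opcode.READ_LOCAL, {"address": 0x1000, "size": 4096})
```

Validating the parameters when the command is built, rather than when it is executed, keeps malformed commands out of the queues entirely.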

The data flow sub-module is used for receiving data information, which may be supplied from outside or pre-stored by the system and obtained by calling. The data flow sub-module is also used for analyzing the data information to obtain a data queue number, screening the corresponding management message queue according to the data queue number and the queue number, and calling the management command in the screened management message queue. The multi-queue sub-module is further configured to delete the corresponding management command from the management message queue once that command has been called. Deleting the called command puts the management command to be executed next in the gene sequencing flow at the first position of the management message queue, so that it can be called quickly when the next data information arrives.

The flow control module is also used for judging, after executing a management command, whether that command is a write-data command. When it is, the flow control module waits for the next data information; otherwise, it calls the next management command in the management message queue according to the data information obtained after executing the command, and executes it.

With this scheme, a flow control module is arranged in each heterogeneous unit, the gene sequencing flow in each heterogeneous unit is managed through its flow control module, and centralized management of the gene sequencing flow is realized through programmable flow management information. The processing and transmission of the data information are offloaded to the heterogeneous units, which reduces the delay of data information control, effectively reduces the time overhead of the system in gene sequencing flow control, and realizes efficient control of the gene sequencing flow.

In addition, the present embodiment also provides a method for controlling the management of a gene sequencing flow, using the above gene sequencing flow management system, which includes the following steps:

Command management step: receiving flow management information, arbitrating the flow management information to obtain management commands, and distributing the management commands to preset management message queues. The command management step specifically comprises the following:

Flow management information is received; it is supplied from outside, for example by flow management software.

The flow management information is arbitrated to obtain management commands and analyzed to obtain a message queue number, and the management commands are written into the management message queue whose queue number matches the message queue number.

There are a plurality of preset management message queues, and once written, a management message queue contains a plurality of management commands. Each management command comprises a message operation code and operation code parameters. The message operation code marks the operation to be performed on the data: executing a gene sequencing operator, reading or writing data locally, or reading or writing data remotely. In other words, the management commands comprise executing a gene sequencing operator, reading data and writing data, where reading data covers local and remote reads and writing data covers local and remote writes. The operation code parameters provide the sideband information an operation needs; for example, a management command that reads data locally needs the address and size of the local data, and that address and size are the sideband information.

Command execution step: receiving data information, calling a management command in the management message queue according to the data information, and executing the management command. Context information is preset for each management message queue; it is the information describing the queue and includes its queue number and data depth, that is, each management message queue includes a queue number. The command execution step specifically comprises the following:

Data information is received; it may be supplied from outside or pre-stored by the system and obtained by calling.

Management commands are executed. Each management command comprises a message operation code and operation code parameters. The message operation code marks the operation to be performed on the data: executing a gene sequencing operator, reading or writing data locally, or reading or writing data remotely. In other words, the management commands comprise executing a gene sequencing operator, reading data and writing data, where reading data covers local and remote reads and writing data covers local and remote writes. The operation code parameters provide the sideband information an operation needs; for example, a management command that reads data locally needs the address and size of the local data, and that address and size are the sideband information.

Example two

The present embodiment is different from the first embodiment in that:

As shown in FIG. 2, the gene sequencing flow management system further comprises a central processing unit, and the heterogeneous units comprise an in-memory computing unit and a storage computing unit. The central processing unit, the in-memory computing unit and the storage computing unit communicate through a root complex, and this communication covers both the transfer of gene data and the exchange of management messages. Specifically, the root complex is connected to the central processing unit through the FSB, to the in-memory computing unit through the DIMM interface, and to the storage computing unit through PCIe.

The in-memory computing unit and the storage computing unit both comprise flow control modules, which are respectively defined as an in-memory flow control module and a storage flow control module for the convenience of distinguishing.

The central processing unit is used for acquiring the flow management information and sending the flow management information to the heterogeneous unit, and when the heterogeneous unit is used for receiving the flow management information, the corresponding flow control module is called to acquire a management message queue, specifically, the flow control module in the in-memory computing unit is called to acquire the management message queue in the in-memory computing unit, and the flow control module in the storage computing unit is called to acquire the management message queue in the storage computing unit.
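The distribution of flow management information from the central processing unit to the heterogeneous units can be sketched as a simple routing step. The routing key `unit` is an assumption introduced for this illustration (the text does not say how the target unit is encoded), as are the class and method names.

```python
from collections import deque


class HeterogeneousUnit:
    def __init__(self, name):
        self.name = name
        self.queues = {}  # queue number -> FIFO of management commands

    def receive_flow_management(self, info):
        # Stand-in for calling this unit's own flow control module.
        self.queues.setdefault(info["queue_no"], deque()).extend(info["commands"])


class CentralProcessingUnit:
    def __init__(self, units):
        self.units = {u.name: u for u in units}

    def distribute(self, flow_management_infos):
        # Forward each piece of flow management information to its target unit.
        for info in flow_management_infos:
            self.units[info["unit"]].receive_flow_management(info)


in_memory = HeterogeneousUnit("in_memory")
storage = HeterogeneousUnit("storage")
cpu = CentralProcessingUnit([in_memory, storage])
cpu.distribute([
    {"unit": "storage", "queue_no": 0, "commands": ["read_local"]},
    {"unit": "in_memory", "queue_no": 0, "commands": ["seed", "filter"]},
    {"unit": "storage", "queue_no": 0, "commands": ["compress", "store"]},
])
```

Each unit ends up holding only the commands it will execute itself, so later command lookups happen locally in the unit rather than on the CPU.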

The central processing unit is also used for acquiring a gene data reading request and sending it to the storage computing unit. When the storage computing unit receives the gene data reading request, it takes the request as data information and calls the corresponding flow control module to acquire pre-stored gene data. Specifically, the storage computing unit further comprises an SSD module and a processing module, and the gene data to be sequenced is pre-stored in the SSD module. The storage flow control module is used for analyzing the gene data reading request to obtain the data queue number and calling the first management command in the management message queue with the matching queue number; at this point the management command is a local read, and the processing module reads the gene data from the SSD module according to the management command. The storage flow control module is also used for deleting the management command after it has been called.

The in-memory computing unit, the central processing unit and the storage computing unit process the gene data in sequence. On receiving the gene data sent by the storage computing unit, the in-memory computing unit takes the gene data as data information and calls the corresponding flow control module to obtain processed gene data; on receiving the gene data sent by the in-memory computing unit, the central processing unit processes the gene data to obtain processed gene data; and on receiving the gene data sent by the central processing unit, the storage computing unit takes the gene data as data information and calls the corresponding flow control module to compress and store the gene data. Specifically, the in-memory computing unit further comprises a DDR module and a logic computing module, and the central processing unit comprises a custom operation function module. The in-memory flow control module is used for calling the first management command in the management message queue corresponding to the queue number according to the data information; at this point the management command is to execute a seed operation, and the logic computing module executes the seed operation on the gene data according to the management command to obtain processed gene data. After that management command has been executed, the in-memory flow control module again calls the first management command in the management message queue corresponding to the queue number; at this point the management command is to execute a filter operation, and the logic computing module executes the filter operation on the gene data according to the management command to obtain processed gene data. On receiving the gene data sent by the in-memory computing unit, the custom operation function module executes an extension operation on the gene data to obtain processed gene data.
The storage flow control module is used for calling the first management command in the management message queue corresponding to the queue number according to the received gene data; at this point the management command is to execute a compression operation, and the processing module executes the compression operation on the gene data according to the management command to obtain processed gene data, the compressed gene data being in BAM format. After that management command has been executed, the storage flow control module again calls the first management command in the management message queue corresponding to the queue number; at this point the management command is to execute a storage operation, and the processing module executes the storage operation according to the management command and stores the gene data in the SSD module. The seed, filter and extension operations are processing stages of the gene sequencing flow; the remaining operations are executed in the same way and are therefore not described again.

In other embodiments, the processing module is configured to perform a compression operation on the gene data according to the management command. Specifically, as shown in fig. 3, the processing module includes a field separator, an operator pool, an operator selector, an operator combiner, and a field combiner.

The operator pool stores in advance a plurality of types of compression operators. In this embodiment, these include data conversion operators (run-length coding, MTF coding, LZ77, BWT and the like), entropy coding operators (Huffman coding, arithmetic coding and the like) and general coding operators (unary coding, Rice coding and the like). The compression operators in the operator pool are all in the form of configurable hardware libraries.

The field separator is used for dividing the gene data into N data fields according to data type, and further dividing each of the N data fields into M data blocks, where N is the first level of parallel design, at the field level, and M is the second level of parallel design, at the field algorithm level. The size of N is determined by the complexity and richness of the gene data, and the size of M is limited by hardware resources and the compression effect.

The data types comprise name information, the sequence information of the gene, and the quality scores corresponding to the bases in the sequence information; the sequence information stores the relative position information of the G, A, T and C bases. The data fields generated by the division accordingly comprise name fields, sequence fields and quality score fields. As shown in fig. 4, the first and third rows are name fields, collectively referred to as field 1; the second row is the sequence field, referred to as field 2; and the fourth row is the quality score field, referred to as field 3.
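As an illustration of the field separation, the following Python sketch splits FASTQ text into the three fields and then cuts one field into at most M blocks. It assumes the usual four-line FASTQ record layout; the function names `split_fastq_fields` and `split_into_blocks` and the sample reads are hypothetical.

```python
def split_fastq_fields(text):
    """Split FASTQ text into name, sequence and quality score fields."""
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fields = {"name": [], "sequence": [], "quality": []}
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        fields["name"].extend([header, plus])  # rows 1 and 3: field 1
        fields["sequence"].append(seq)         # row 2: field 2
        fields["quality"].append(qual)         # row 4: field 3
    return fields


def split_into_blocks(items, m):
    """Divide one data field into at most m blocks of near-equal size."""
    size = max(1, -(-len(items) // m))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


# Assumed sample input: two short, made-up reads.
sample = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nCCGTA\n+\nFFFFF\n"
fields = split_fastq_fields(sample)
blocks = split_into_blocks(fields["sequence"], 2)
```

Here N is 3 (the number of fields) and M is 2 for the sequence field; in the scheme itself N and M are design parameters rather than values fixed by this sketch.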

The operator selector is used for receiving each data field and the compression requirement corresponding to each data field, wherein the compression requirement comprises a compression rate and a compression performance, the compression performance being the performance and resource occupation when the compression is implemented in hardware. The operator selector is further used for selecting compression operators from the operator pool based on the compression requirement of each data field.

The operator combiner is used for combining the compression operators selected for each data field into a compression algorithm for that data field, each compression algorithm comprising at least one compression operator.

Different compression operators can be selected for the same data field, and the selected compression operators are combined into an optimal compression algorithm based on the differences in their compression rate and compression performance. In this embodiment, taking the gene data in fig. 4 as an example, field 1 is the name field and is encoded with a general coding operator; field 2 is the sequence field and uses a BWT operator combined with an MTF operator; field 3 is the quality score field and uses a differential coding operator combined with a run-length coding operator.
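To make the operator combination for the sequence field concrete, the following Python sketch chains a naive Burrows-Wheeler transform (BWT) with move-to-front (MTF) coding and then inverts both. It is a textbook software version, not the configurable hardware operators of the embodiment; the end marker `$`, the alphabet `$ACGT` and all function names are assumptions made for illustration.

```python
def bwt(s, end="$"):
    """Naive Burrows-Wheeler transform; `end` must not occur in s."""
    s = s + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(t, end="$"):
    """Invert the transform by repeatedly prepending and sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    row = next(r for r in table if r.endswith(end))
    return row[:-1]


def mtf_encode(s, alphabet):
    """Replace each symbol by its index, then move it to the front."""
    symbols = list(alphabet)
    codes = []
    for ch in s:
        i = symbols.index(ch)
        codes.append(i)
        symbols.insert(0, symbols.pop(i))
    return codes


def mtf_decode(codes, alphabet):
    """Inverse of mtf_encode for the same starting alphabet."""
    symbols = list(alphabet)
    out = []
    for i in codes:
        ch = symbols.pop(i)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


ALPHABET = "$ACGT"  # assumed alphabet: end marker plus the four bases


def transform_sequence(seq):
    """Combined operator for field 2: BWT followed by MTF."""
    return mtf_encode(bwt(seq), ALPHABET)


def restore_sequence(codes):
    """Apply the inverse operators in reverse order."""
    return inverse_bwt(mtf_decode(codes, ALPHABET))


codes = transform_sequence("GATTACAGATTACA")
restored = restore_sequence(codes)
```

BWT and MTF do not shrink the data by themselves; they rearrange it so that a following entropy coding operator, such as Huffman or arithmetic coding, can encode it more compactly.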

The field combiner is used for compressing each data field according to its combined compression algorithm and merging the compression results of the data fields. The compression results are merged as follows: the compression result of each data field is stored in a specific format in the same file.

When the compression results are merged, the combination of compression operators contained in the compression algorithm selected for each data field is marked in the file header, so that the corresponding operators can be called for decompression.
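One possible way to record the operator combinations in a file header is sketched below in Python. The text only states that a specific format is used, so the layout here (a 4-byte little-endian header length, a JSON header, then the concatenated field payloads) and the names `merge_fields` and `split_fields` are assumptions for illustration.

```python
import json
import struct


def merge_fields(compressed, operators):
    """Store all field payloads in one blob, with a header that lists
    the operator combination and payload length of each field."""
    header = {
        name: {"operators": operators[name], "length": len(payload)}
        for name, payload in compressed.items()
    }
    header_bytes = json.dumps(header).encode("utf-8")
    body = b"".join(compressed[name] for name in header)
    return struct.pack("<I", len(header_bytes)) + header_bytes + body


def split_fields(blob):
    """Read the header back and return, per field, the operator list
    needed for decompression together with the stored payload."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    offset = 4 + header_len
    result = {}
    for name, meta in header.items():
        payload = blob[offset:offset + meta["length"]]
        result[name] = (meta["operators"], payload)
        offset += meta["length"]
    return result


# Assumed example: placeholder payloads and hypothetical operator names.
blob = merge_fields(
    {"name": b"N1", "sequence": b"SEQ", "quality": b"QQ"},
    {"name": ["unary"], "sequence": ["bwt", "mtf"], "quality": ["delta", "rle"]},
)
restored = split_fields(blob)
```

A decompressor reading such a file would look up the operator list of each field in the header and apply the inverse operators in reverse order.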

By adopting this scheme, the different steps of the gene sequencing flow are distributed by the heterogeneous units to different flow control modules for control, so that the gene data are processed in parallel, the parallel efficiency among the different processing steps of gene sequencing is effectively improved, and the performance of the gene sequencing flow is improved.

In addition, this embodiment also provides a gene sequencing flow management control method, applied to a gene sequencing flow management control system that comprises an in-memory computing unit, a central processing unit and a storage computing unit, the method further comprising the following contents:

In other embodiments, the compression step, as shown in fig. 5, includes the following:

Acquisition step: acquiring gene data and the compression requirements of the corresponding gene data; the compression requirement is the compression rate and compression performance as weighed by the user, and the compression performance is the performance and resource occupation when the compression is implemented in hardware.

Field separation step: dividing the gene data into N data fields according to data type, and further dividing each of the N data fields into M data blocks, where N is the first level of parallel design, at the field level, and M is the second level of parallel design, at the field algorithm level. The size of N is determined by the complexity and richness of the gene data, and the size of M is limited by hardware resources and the compression effect.

The data types comprise name information, the sequence information of the gene, and the quality score information corresponding to the bases in the sequence information; the sequence information stores the relative position information of the G, A, T and C bases. The data fields generated by the division accordingly comprise a name field, a sequence field and a quality score field. Fig. 4 shows part of the data in a FASTQ file of gene data: the first and third rows are name fields, collectively referred to as field 1; the second row is the sequence field, referred to as field 2; and the fourth row is the quality score field, referred to as field 3.

Operator selection and combination step: selecting corresponding compression operators from the preset compression operators according to the compression requirement of each data field, and combining them into a compression algorithm for the corresponding data field, each compression algorithm comprising at least one compression operator.

The preset compression operators comprise data conversion operators (run-length coding, MTF coding, LZ77, BWT and the like), entropy coding operators (Huffman coding, arithmetic coding and the like) and general coding operators (unary coding, Rice coding and the like). The preset compression operators are stored by category in an operator pool, and compression operators are selected from the operator pool according to the compression requirement of each data field. The operator pool also records the compression rate and compression performance of each compression operator in a list.
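A selection from such a list might be sketched in Python as follows. The scoring rule (a weighted sum of the two recorded figures), the weight parameter `ratio_weight` and the numeric values in `POOL` are all assumptions for illustration; the values are placeholders, not measurements of any real operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    category: str
    ratio: float        # placeholder score in [0, 1]; higher compresses better
    performance: float  # placeholder score in [0, 1]; higher is faster or cheaper


# Hypothetical operator pool list with made-up placeholder scores.
POOL = [
    Operator("huffman", "entropy", ratio=0.6, performance=0.9),
    Operator("arithmetic", "entropy", ratio=0.8, performance=0.5),
]


def select_operator(pool, category, ratio_weight):
    """Pick the operator of one category with the best weighted score,
    where ratio_weight in [0, 1] expresses the user's trade-off between
    compression rate and compression performance."""
    candidates = [op for op in pool if op.category == category]
    return max(
        candidates,
        key=lambda op: ratio_weight * op.ratio
        + (1 - ratio_weight) * op.performance,
    )


favor_ratio = select_operator(POOL, "entropy", ratio_weight=0.9)
favor_speed = select_operator(POOL, "entropy", ratio_weight=0.2)
```

With these placeholder scores, a requirement that favors compression rate and one that favors compression performance select different operators, which is the behavior the compression requirement is meant to control.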

The operator selecting and combining step further comprises the following steps:

S101, arranging the compression operators of a compression algorithm in parallel as M identical compression pipelines, so that each data field is allocated M identical compression pipelines; each compression pipeline comprises a plurality of compression algorithms, and each compression algorithm is formed by combining a plurality of compression operators.

S102, acquiring the first parallelism K_N of the compression operators in the compression pipelines, and acquiring the second parallelism M × K_N of the Nth data field according to the first parallelism K_N.

S103, according to the second parallelism M × K_N of each data field, analyzing the completion time for each data field to complete compression, and recording the synchronization rate of completion;

S104, judging whether the synchronization rate meets a set value; if not, adjusting the compression operators or the combination of compression algorithms in the compression pipelines to obtain a new first parallelism K_N' of the compression pipelines and a new second parallelism M × K_N' of each data field;

S105, repeatedly executing step S103 and step S104 until the synchronization rate meets the set value.
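The loop of steps S103 to S105 can be sketched in Python under explicit assumptions, since the text does not give formulas for the completion time or the synchronization rate. Here the completion time of a field is assumed to be its workload divided by its second parallelism M × K_N, the synchronization rate is assumed to be the ratio of the shortest to the longest completion time, and the adjustment is assumed to raise K_N of the slowest field by one. All names and numbers are hypothetical.

```python
def completion_times(workloads, m, k):
    """Assumed model: time of field n = workload_n / (M * K_n)."""
    return [w / (m * k_n) for w, k_n in zip(workloads, k)]


def synchronization_rate(times):
    """Assumed metric: 1.0 means all fields finish at the same time."""
    return min(times) / max(times)


def balance(workloads, m, k, target, max_rounds=100):
    """Repeat S103 and S104 until the rate meets the set value."""
    k = list(k)
    for _ in range(max_rounds):
        times = completion_times(workloads, m, k)    # S103
        if synchronization_rate(times) >= target:    # S104: set value met
            break
        slowest = times.index(max(times))
        k[slowest] += 1                              # S104: K_N -> K_N'
    final_rate = synchronization_rate(completion_times(workloads, m, k))
    return k, final_rate


# Assumed example: three fields with made-up relative workloads.
k_final, rate = balance(workloads=[100, 400, 200], m=2, k=[1, 1, 1], target=0.9)
```

In the scheme itself the adjustment changes the compression operators or the combination of compression algorithms in a pipeline; incrementing K_N directly is a simplification that stands in for that step.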

Per-field compression step: compressing each data field according to its combined compression algorithm to obtain the compression result of each data field.

Field merging step: merging the compression results of the data fields. Specifically, the compression result of each data field is stored in the same file in a specific format. When the compression results are merged, the combination of compression operators contained in the compression algorithm selected for each data field is marked in the file header, so that the corresponding operators can be called for decompression.

Compression performance analysis: analyzing the compression performance of the gene data according to the first parallelism K_N and the second parallelism M × K_N. This specifically comprises the following steps:

obtaining Min(K_N) according to the first parallelism K_N;

obtaining a third parallelism M × N × Min(K_N) according to the second parallelism M × K_N of each data field;

analyzing the compression performance of the gene data according to the third parallelism M × N × Min(K_N).
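The three parallelism figures can be computed as in the short Python sketch below. The function name `parallelism_summary` and the example values of M and K_N are hypothetical.

```python
def parallelism_summary(m, k):
    """Return the second parallelism of each field and the third parallelism.

    m: number of identical compression pipelines per data field (M)
    k: first parallelism K_N of each of the N data fields
    """
    n = len(k)
    second = [m * k_n for k_n in k]  # M x K_N for each field
    third = m * n * min(k)           # M x N x Min(K_N)
    return second, third


# Assumed example: M = 4 pipelines and N = 3 fields.
second, third = parallelism_summary(m=4, k=[2, 3, 5])
```

Because the third parallelism uses Min(K_N), the field with the lowest operator parallelism bounds the overall figure.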

The gene sequencing step comprises the following steps:

By adopting this scheme, the processing and transmission of the data information are offloaded to the in-memory computing unit, the central processing unit and the storage computing unit, which reduces the delay in controlling the data information and achieves efficient control of the gene sequencing flow. Meanwhile, the different steps of the gene sequencing flow are distributed to different management message queues for control, so that the gene data are processed in parallel and the parallel efficiency among the different processing steps of gene sequencing is effectively improved.

The foregoing is merely an embodiment of the present invention. Specific structures and characteristics that are common knowledge in the art are not described in detail here. A person of ordinary skill in the art knows all the common technical knowledge in the field before the filing date or the priority date, has access to all the prior art in the field, and is able to apply the conventional experimental means available before that date; in light of the present application, such a person can complete and implement this scheme in combination with his or her own abilities, and some typical known structures or known methods should not be an obstacle to implementing the present application. It should be noted that those skilled in the art can make modifications and improvements without departing from the structure of the present invention; these shall also be regarded as falling within the protection scope of the present invention and do not affect the effect of implementing the invention or the utility of the patent. The protection scope of the present application shall be subject to the content of the claims, and the specific embodiments and other descriptions in the specification can be used to interpret the content of the claims.

Claims

1. A gene sequencing flow management control system, characterized in that: the system comprises a plurality of heterogeneous units, wherein each heterogeneous unit comprises a flow control module, and the flow control module is used for receiving flow management information, arbitrating the flow management information to obtain a management command and distributing the management command to a preset management message queue; the flow control module is also used for receiving data information, calling the management command in the corresponding management message queue according to the data information, and executing the management command;

the management message queue comprises a queue number, the flow control module comprises a management message arbitration sub-module, a data flow sub-module and a multi-queue sub-module, and the management message arbitration sub-module is used for parsing the flow management information to obtain a message queue number, and writing the management commands in sequence into the corresponding management message queue according to the message queue number and the queue number;

the data flow sub-module is used for analyzing the data information to obtain a data queue number, screening a corresponding management message queue according to the data queue number and the queue number, calling the management command in the screened management message queue, and the multi-queue sub-module is used for deleting the corresponding management command in the management message queue when the management command in the management message queue is called;

2. The gene sequencing flow management control system of claim 1, wherein: the management command comprises executing a gene sequencing operator, reading data and writing data; the flow control module is further used for judging, after executing the management command, whether the management command is writing data; when the management command is writing data, the flow control module waits for the next data information; otherwise, the flow control module calls the next management command in the management message queue according to the data information and executes that management command.

3. The gene sequencing flow management control system of claim 1, wherein: when the in-memory computing unit, the central processing unit and the storage computing unit process the gene data in sequence,

4. The gene sequencing flow management control method is characterized in that the gene sequencing flow management control system as claimed in claim 1 is applied and comprises the following steps:

the command execution step: receiving data information, calling a management command in a management message queue according to the data information, and executing the management command;

The management message queue comprises a queue number, and the management command is distributed to a preset management message queue, and the management message queue comprises the following contents:

5. The method for controlling gene sequencing process management according to claim 4, wherein: the management command comprises the steps of executing a gene sequencing operator, reading data and writing data, and further comprises the following contents:

6. The control method according to any one of claims 4 to 5, comprising an in-memory computing unit, a central processing unit, and a storage computing unit, further comprising:

7. The method for controlling gene sequencing process management according to claim 6, wherein: further comprising a gene sequencing step comprising: