CN108280214A - Fast I/O system for distributed genome analysis

Info

Publication number: CN108280214A
Application number: CN201810102016.1A
Authority: CN
Inventors: 马志强; 薛红; 顾磷; 李威洁
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-02-02
Filing date: 2018-02-01
Publication date: 2018-07-13

Abstract

The invention discloses a fast I/O system for distributed genome analysis. It uses the distributed file system HDFS as the underlying storage system and, together with the local disk of each node, improves high-throughput I/O for genomic data in multi-node distributed computing. The HDFS cluster is coupled with the genome analysis compute nodes; one of the compute nodes serves as the HDFS NameNode; all compute nodes serve as HDFS DataNodes; data written to the file system is stored as 3 replicas on 3 different nodes in the cluster; an NFS proxy is started on each compute node; on each compute node, HDFS is mounted through the NFS proxy as the file system MF; on each compute node, the local file system LF is used as temporary storage; analysis processes on a compute node read data from the mounted file system MF as they would from a traditional file system; data to be written is written to LF; files generated by a process in LF are automatically saved to MF before the process exits.

Description

Fast I/O system for distributed genome analysis

Technical field

The invention relates to the technical field of storage systems, and in particular to a fast I/O system for distributed genome analysis.

Background

Genomic data, such as human whole-genome data, is very large. Genome analysis therefore has to read and write large volumes of data, including input, output, and intermediate files. Traditional centralized I/O systems, such as NFS and SAN, become a bottleneck for genomic data analysis.

Expensive storage devices can be used to improve I/O performance, but not every user can bear that cost. Moreover, if the cluster is expanded to 1000 nodes, such storage may become a technical bottleneck.

Distributed file systems, such as GFS [1] and HDFS [2], can provide high I/O throughput at low cost, and they can scale to run on 1000 nodes. However, the semantics of a distributed file system differ from those of a traditional local file system, so genome analysis tools that use a traditional file system as their interface may not work well with a distributed file system.

Summary of the invention

The invention uses the distributed file system HDFS as the underlying storage system and builds on it a high-throughput I/O facility for distributed computation over genomic data on multiple server nodes.

Compared with a centralized single-copy storage system, the invention stores generated data as 3 replicas on 3 nodes, so that subsequent steps can read it with up to 3 times the I/O throughput.
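The replication idea can be illustrated with a minimal sketch. It is a simplification under stated assumptions: `place_replicas` is a hypothetical helper that assigns replicas round-robin, and the node names are invented for illustration. The actual placement policy of HDFS is not described in this document and is not reproduced here.

```python
def place_replicas(block_index, nodes, replication=3):
    """Pick `replication` distinct nodes for one block.

    Assumption: simple round-robin placement, used only to illustrate
    that every block ends up on 3 different nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_index % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


# Hypothetical 4-node cluster.
nodes = ["node1", "node2", "node3", "node4"]

# Each block is held by 3 distinct nodes, so up to 3 readers can be
# served from different disks at the same time.
placement = {block: place_replicas(block, nodes) for block in range(4)}
```

Because each block has three holders, a later analysis step is not limited to the disk bandwidth of a single node when it reads the block.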

System resources are used more efficiently. During computation, I/O is usually not intensive except at the start and end stages, when large amounts of data must be read or stored. At other times, the idle I/O capacity can be used by processes on other nodes.

The data replication mechanism of HDFS ensures that writing a multi-replica file is not significantly slower than writing to a single-copy system.

Mounting HDFS as a traditional file system lets conventional tools (such as bwa and GATK) read data from HDFS without modification.
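The point of the mount is that ordinary file calls keep working. The sketch below is only an illustration: `count_fastq_records` is a hypothetical function, and the demo runs against a temporary directory that stands in for the mounted file system. On a cluster node the same call would simply receive a path under the mount point.

```python
import os
import tempfile


def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file (4 lines per record).

    Uses only the standard open() call, so it works unchanged whether
    `path` is on a local disk or on a mounted distributed file system.
    """
    with open(path, "r") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


# Demo: a temporary directory stands in for the mount point.
with tempfile.TemporaryDirectory() as mount_dir:
    fastq_path = os.path.join(mount_dir, "sample.fastq")
    with open(fastq_path, "w") as handle:
        handle.write("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    record_count = count_fastq_records(fastq_path)
```

No HDFS client library appears in the code, which is the property that lets unmodified tools read their input this way.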

A local file system that meets the temporary storage needs arising during computation lets conventional tools work together with the distributed storage system.

Description of the drawings

Fig. 1 is an example diagram of the system.

Detailed description

The invention constructs a distributed I/O system for genome analysis. It uses a distributed file system (HDFS) as the underlying storage system and provides high-throughput I/O on multiple server nodes for distributed computation over genomic data.

It works as follows:

An HDFS cluster is coupled with the genome analysis compute nodes.

One of the compute nodes serves as the HDFS NameNode.

All compute nodes serve as HDFS DataNodes.

Data written to the file system is stored as 3 replicas on 3 different nodes in the cluster.

An NFS proxy is started on each compute node.

The NFS proxy provides a file system to the compute node.

On each compute node, HDFS is mounted through the NFS proxy as the file system MF.

On each compute node, the local file system LF is used as temporary storage.

During analysis, a compute node reads data through the file system MF; data to be written is written to files in LF.

Files generated by a process in LF are automatically saved to MF before the process exits.

The analysis compute nodes and the HDFS storage nodes are the same set of nodes.

Data analysis processes on a compute node can perform I/O on the HDFS DataNodes through the file system MF that the proxy mounts on the node.
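The rule that files written to LF are saved to MF before the process exits can be sketched as a context manager. This is a minimal sketch, not the mechanism of the invention, which the text does not specify: `staged_output`, `lf_dir`, and `mf_dir` are hypothetical names, temporary directories stand in for both file systems, and copying only when the step finishes without error is an assumption made here.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def staged_output(lf_dir, mf_dir):
    """Yield a scratch directory under LF; copy its files to MF on exit."""
    scratch = tempfile.mkdtemp(dir=lf_dir)
    try:
        yield scratch
        # Assumption: persist results only if the step raised no error.
        os.makedirs(mf_dir, exist_ok=True)
        for name in os.listdir(scratch):
            src = os.path.join(scratch, name)
            if os.path.isfile(src):
                # copyfile writes the destination sequentially, start to end.
                shutil.copyfile(src, os.path.join(mf_dir, name))
    finally:
        # Free the local scratch space whether or not the step succeeded.
        shutil.rmtree(scratch, ignore_errors=True)


# Demo with temporary directories standing in for LF and MF.
with tempfile.TemporaryDirectory() as root:
    lf = os.path.join(root, "local")
    mf = os.path.join(root, "dfs")
    os.makedirs(lf)
    with staged_output(lf, mf) as scratch:
        with open(os.path.join(scratch, "result.txt"), "w") as out:
            out.write("variant calls\n")
    saved = sorted(os.listdir(mf))
    leftover = os.listdir(lf)
```

Staging in LF keeps the many small or random writes of an analysis tool on the local disk, and MF only sees one sequential copy per finished file.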

Fig. 1 shows an example of the system. It contains 4 nodes. The disk of each node is divided into two parts: one part is mounted at /mnt/local and used by LF, and the other is used as HDFS storage. Each node has a proxy that mounts MF at /mnt/dfs. An analysis program can read data from MF through the proxy, read data from or write data to LF, and save data from LF to MF. Data is replicated to 3 nodes.
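The split between the two mount points in this example can be expressed as a small routing rule. The sketch assumes the mount points /mnt/local and /mnt/dfs from the example; `resolve_path` and its `purpose` values are hypothetical names introduced only for illustration.

```python
import posixpath

LF_ROOT = "/mnt/local"  # local file system LF, mount point from the example
MF_ROOT = "/mnt/dfs"    # mounted file system MF, mount point from the example


def resolve_path(name, purpose):
    """Map a file name to LF or MF depending on how the file is used."""
    if purpose == "input":
        # Inputs are read from the mounted file system MF.
        return posixpath.join(MF_ROOT, name)
    if purpose in ("output", "temp"):
        # Outputs and intermediate files are written to LF first.
        return posixpath.join(LF_ROOT, name)
    raise ValueError(f"unknown purpose: {purpose!r}")
```

For example, `resolve_path("sample.bam", "input")` yields a path under /mnt/dfs, while the same name with `"output"` yields a path under /mnt/local.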

The embodiment described above is only a preferred embodiment of the invention and does not limit its scope of protection. Any insubstantial change or substitution made by a person skilled in the art on the basis of the invention falls within the scope claimed by the invention.

Claims

1. A fast I/O system for distributed genome analysis, characterized in that it uses the distributed file system HDFS as the underlying storage system and uses the local disk of each node to improve high-throughput I/O for genomic data in multi-node distributed computing;

the HDFS cluster is coupled with the genome analysis compute nodes; one of the compute nodes serves as the HDFS NameNode; all compute nodes serve as HDFS DataNodes; data written to the file system is stored as 3 replicas on 3 different nodes in the cluster;

an NFS proxy is started on each compute node; the NFS proxy provides file system support to the compute node; on each compute node, HDFS is mounted through the NFS proxy as the file system MF;

on each compute node, the local file system LF is used as temporary storage; analysis processes on a compute node read data from the mounted file system MF as they would from a traditional file system; data to be written is written to LF; files generated by a process in LF are automatically saved to MF before the process exits;

the HDFS storage nodes and the analysis compute nodes are the same set of nodes; analysis processes on a compute node can perform I/O on the HDFS DataNodes through the mounted file system MF.