CN108280214A - Fast I/O system for distributed genome analysis - Google Patents

Fast I/O system for distributed genome analysis

Info

Publication number
CN108280214A
Authority
CN
China
Prior art keywords
hdfs
file system
compute node
data
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810102016.1A
Other languages
Chinese (zh)
Inventor
马志强
薛红
顾磷
李威洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-02-02
Filing date
2018-02-01
Publication date
2018-07-13
Application filed by Individual
Publication of CN108280214A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture

Abstract

The invention discloses a fast I/O system for distributed genome analysis. It uses the distributed file system HDFS as the underlying storage system and makes use of the local disk of each node to improve high-throughput I/O for genomic data in multi-node distributed computing. An HDFS cluster is coupled with the genome analysis compute nodes. One of the compute nodes serves as the HDFS NameNode, and all compute nodes serve as HDFS DataNodes. Data written to the file system is stored as 3 replicas on 3 different nodes in the cluster. An NFS agent is started on each compute node, and on each compute node HDFS is mounted through the NFS agent as a file system MF. On each compute node, a local file system LF is used as temporary storage. An analysis process on a compute node reads data from the mounted file system MF as it would from a traditional file system; data to be written is written to LF. Files generated by a process in LF are automatically saved to MF before the process exits.

Description

Fast I/O system for distributed genome analysis
Technical field
The present invention relates to the technical field of storage systems, and in particular to a fast I/O system for distributed genome analysis.
Background art
Genomic data, such as human whole-genome data, is very large. Analyzing a genome therefore requires reading and writing large amounts of data, including input, output, and intermediate files. Traditional centralized I/O systems, such as NFS and SAN, have become a bottleneck for genomic data analysis.
Expensive storage devices can be used to improve I/O performance, but not every user can afford that cost. Moreover, if the cluster is expanded to 1000 nodes, this approach may become a technical bottleneck.
Distributed file systems such as GFS [1] and HDFS [2] can provide high I/O throughput at low cost, and they can scale to run on 1000 nodes. However, the semantics of a distributed file system differ from those of a traditional local file system, so genome analysis tools that use a traditional file system as their interface may not work well with a distributed file system.
Summary of the invention
The method uses the distributed file system HDFS as the underlying storage system and builds on it high-throughput I/O for distributed computing on genomic data across multiple server nodes.
Compared with a centralized single-copy storage system, the present invention stores generated data as 3 replicas on 3 nodes, so that subsequent steps can read it with 3 times the I/O throughput.
System resources are used more efficiently. During computation, I/O is usually not intensive except at the beginning and end stages, which need to read or store large amounts of data. In the meantime, the I/O capacity can be used by processes on other nodes.
The data replication mechanism of HDFS ensures that writing a multi-replica file is not significantly slower than writing to a single-copy system.
Mounting HDFS as a traditional file system lets conventional tools (such as bwa and GATK) read data from HDFS without modification.
A local file system that meets the temporary storage needs arising during computation allows conventional tools to be used together with the distributed storage system.
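The replication claims above can be made concrete with a back-of-the-envelope model. The Python sketch below is illustrative only and is not taken from the patent: it assumes every replica holder can serve reads at the same bandwidth, and that a write is split into equal packets that are forwarded from node to node in a pipeline. The function names and the numbers used are hypothetical.

```python
def aggregate_read_bandwidth(per_node_mb_s: float, replicas: int = 3) -> float:
    """Upper bound on read bandwidth when each replica holder serves readers in parallel."""
    return per_node_mb_s * replicas


def pipelined_write_steps(packets: int, replicas: int = 3) -> int:
    """Packet-transfer steps when each node forwards a packet while receiving the next one."""
    return packets + replicas - 1


def sequential_write_steps(packets: int, replicas: int = 3) -> int:
    """Packet-transfer steps if each replica were written in full before the next one started."""
    return packets * replicas


if __name__ == "__main__":
    # Assumed figures: 100 MB/s per node, a file of 1000 packets, 3 replicas.
    print(aggregate_read_bandwidth(100.0))
    print(pipelined_write_steps(1000), sequential_write_steps(1000))
```

Under these assumptions the pipelined write needs only `replicas - 1` extra steps compared with a single copy, which is one way to read the statement that multi-replica writes are not significantly slower.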
Brief description of the drawings
Fig. 1 is an example diagram of the system.
Detailed description of the embodiments
The present invention constructs a distributed I/O system for genome analysis. It uses a distributed file system (HDFS) as the underlying storage system and provides high-throughput I/O across multiple server nodes for distributed genomic data computation.
Its operating principle is as follows:
An HDFS cluster is coupled with the genome analysis compute nodes.
One of the compute nodes serves as the HDFS NameNode.
All compute nodes serve as HDFS DataNodes.
Data written to the file system is stored as 3 replicas on 3 different nodes in the cluster.
An NFS agent is started on each compute node.
The NFS agent provides the file system to the compute node.
On each compute node, HDFS is mounted through the NFS agent as file system MF.
On each compute node, a local file system LF is used as temporary storage.
During analysis, a compute node reads data through file system MF; data to be written is written to LF as files.
Files generated by a process in LF are automatically saved to MF before the process exits.
The analysis compute nodes and the HDFS storage nodes are the same set of nodes.
A data analysis process on a compute node can perform I/O on the HDFS DataNodes through the file system MF that the agent mounts on that node.
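The read-from-MF, write-to-LF, save-to-MF-before-exit flow described above could be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the names `staged_output` and `analyze` are hypothetical, the "analysis" is a toy text transformation, and MF and LF are treated as ordinary directories, which is what a mounted file system looks like to a conventional tool.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_output(lf_dir: Path, mf_dir: Path):
    """Yield a scratch directory on LF; copy every file in it to MF on exit."""
    lf_dir.mkdir(parents=True, exist_ok=True)
    try:
        yield lf_dir
    finally:
        # Runs when the block ends, including on error, so generated files
        # reach MF before the process finishes.
        mf_dir.mkdir(parents=True, exist_ok=True)
        for src in lf_dir.rglob("*"):
            if src.is_file():
                dst = mf_dir / src.relative_to(lf_dir)
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(src, dst)  # copy contents only


def analyze(input_on_mf: Path, lf_dir: Path, mf_out_dir: Path) -> Path:
    """Toy analysis step: read input from MF, write the result to LF, publish to MF."""
    with staged_output(lf_dir, mf_out_dir) as scratch:
        data = input_on_mf.read_text()                      # read through MF
        (scratch / "result.txt").write_text(data.upper())   # write to LF
    return mf_out_dir / "result.txt"
```

A call might look like `analyze(Path("/mnt/dfs/sample.txt"), Path("/mnt/local/job1"), Path("/mnt/dfs/out/job1"))`, where the file and job names are made up. Only plain file contents are copied, since a mounted distributed file system may not accept every local metadata operation.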
The figure shows an example of the system. It contains 4 nodes. The disk of each node has two parts: one part is mounted at /mnt/local and used by LF, and the other part is used for HDFS storage. Each node has an agent that mounts MF at /mnt/dfs. An analysis program can read data from MF through the agent, read data from LF or write data to LF, and store data from LF to MF. Data can be replicated to 3 nodes.
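To illustrate how data can end up on 3 of the 4 nodes in this example, the Python sketch below places each block on 3 distinct nodes using a simple rotation. This is a deliberately simplified stand-in: the node names are hypothetical, and the placement policy actually used by HDFS is not described in this document and is not reproduced here.

```python
from typing import List

# Hypothetical names for the 4 nodes of the example.
NODES = ["node1", "node2", "node3", "node4"]


def place_replicas(block_index: int, nodes: List[str], replicas: int = 3) -> List[str]:
    """Pick `replicas` distinct nodes for a block, rotating the starting node per block."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_index % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


if __name__ == "__main__":
    for block in range(4):
        print(block, place_replicas(block, NODES))
```

With 4 nodes and 3 replicas, every block is available locally on three quarters of the nodes in this toy layout, so many reads through MF can be served without crossing the network.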
The above embodiment is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any insubstantial change or substitution made by those skilled in the art on the basis of the present invention falls within the scope of protection claimed by the present invention.

Claims (1)

1. A fast I/O system for distributed genome analysis, characterized in that it uses the distributed file system HDFS as the underlying storage system and uses the local disk of each node to improve high-throughput I/O for genomic data in multi-node distributed computing;
an HDFS cluster is coupled with the genome analysis compute nodes; one of the compute nodes serves as the HDFS NameNode; all compute nodes serve as HDFS DataNodes; data written to the file system is stored as 3 replicas on 3 different nodes in the cluster;
an NFS agent is started on each compute node; the NFS agent provides file system support to the compute node; on each compute node, HDFS is mounted through the NFS agent as file system MF;
on each compute node, a local file system LF is used as temporary storage; an analysis process on a compute node reads data from the mounted file system MF as it would from a traditional file system; data to be written is written to LF; files generated by a process in LF are automatically saved to MF before the process exits;
the HDFS storage nodes are the same set of nodes as the analysis compute nodes; an analysis process on a compute node can perform I/O on the HDFS DataNodes through the mounted file system MF.
CN201810102016.1A 2017-02-02 2018-02-01 Fast I/O system for distributed genome analysis Pending CN108280214A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762453539P 2017-02-02 2017-02-02
US62/453,539 2017-02-02

Publications (1)

Publication Number Publication Date
CN108280214A (en) 2018-07-13

Family

ID=62807309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810102016.1A Pending CN108280214A (en) Fast I/O system for distributed genome analysis 2017-02-02 2018-02-01

Country Status (1)

Country Link
CN (1) CN108280214A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130151884A1 (en) * 2011-12-09 2013-06-13 Promise Technology, Inc. Cloud data storage system
CN104408047A (en) * 2014-10-28 2015-03-11 浪潮电子信息产业股份有限公司 Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server
US20150370502A1 (en) * 2014-06-19 2015-12-24 Cohesity, Inc. Making more active use of a secondary storage system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
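包永红: "Construction of a Hadoop-based genome analysis platform", China Master's Theses Full-text Database, Information Science and Technology *
邹振宇: "Implementation and optimization of an HDFS-based cloud storage ***", China Master's Theses Full-text Database, Information Science and Technology *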

Similar Documents

Publication Publication Date Title
US20200356901A1 (en) Target variable distribution-based acceptance of machine learning test data sets
Wang et al. Performance prediction for apache spark platform
CN109582433B (en) Resource scheduling method and device, cloud computing system and storage medium
US9135071B2 (en) Selecting processing techniques for a data flow task
CN104881466B (en) The processing of data fragmentation and the deletion method of garbage files and device
JP5939123B2 (en) Execution control program, execution control method, and information processing apparatus
CN105205154B (en) Data migration method and device
CN109643310B (en) System and method for redistribution of data in a database
CN105930479A (en) Data skew processing method and apparatus
US9535743B2 (en) Data processing control method, computer-readable recording medium, and data processing control device for performing a Mapreduce process
CN103917960A (en) Storage apparatus and duplicate data detection method
Ferraro Petrillo et al. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
JP2011170774A (en) Device and method for generation of decision tree, and program
JP6069503B2 (en) Parallel analysis platform for serial data and parallel distributed processing method
JP6193406B2 (en) Serialization for differential encoding
JP2017045080A (en) Business flow specification regeneration method
US20160042097A1 (en) System and method for concurrent multi-user analysis of design models
CN108280214A (en) Fast I/O system for distributed genome analysis
EP3264254B1 (en) System and method for a simulation of a block storage system on an object storage system
JP5637071B2 (en) Processing program, processing method, and processing apparatus
JP4268141B2 (en) Database replication program and database replication apparatus
Ruan et al. Hymr: a hybrid mapreduce workflow system
Becker et al. Accelerated genomics data processing using memory-driven computing
US11610151B2 (en) Distribution system, data management apparatus, data management method, and computer-readable recording medium
JP2018022433A (en) Control program, apparatus, and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180713
