CN108280214A - Fast I/O system for distributed genome analysis - Google Patents
- Publication number
- CN108280214A (application CN201810102016.1A)
- Authority
- CN
- China
- Prior art keywords
- hdfs
- file system
- compute node
- data
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
Abstract
The invention discloses a fast I/O system for distributed genome analysis. It uses the distributed file system HDFS as the underlying storage system and uses the local disk of each node to improve high-throughput I/O for genomic data in multi-node distributed computing. An HDFS cluster is coupled with the genome analysis compute nodes. One of the compute nodes serves as the HDFS NameNode, and all compute nodes serve as HDFS DataNodes. Data written to the file system is stored as 3 replicas on 3 different nodes in the cluster. An NFS proxy is started on each compute node, and on each compute node HDFS is mounted through the NFS proxy as file system MF. On each compute node, the local file system LF is used as temporary storage. An analysis process on a compute node reads data from the mounted file system MF as it would from a traditional file system, and writes its output data to LF. Files that a process generates in LF are automatically saved to MF before the process exits.
Description
Technical field
The present invention relates to the technical field of storage systems, and in particular to a fast I/O system for distributed genome analysis.
Background
Genomic data, such as human whole-genome data, is very large. Analyzing a genome therefore requires reading and writing large amounts of data, including input, output, and intermediate files. Traditional centralized I/O systems, such as NFS and SAN, become the bottleneck of genomic data analysis.
Expensive storage devices can be used to improve I/O performance, but not every use case can bear that cost. Moreover, if the cluster is expanded to 1000 nodes, such storage may itself become a technical bottleneck.
Distributed file systems such as GFS [1] and HDFS [2] can provide high I/O throughput at low cost, and they can scale to run on 1000 nodes. However, the semantics of a distributed file system differ from those of a traditional local file system, so genome analysis tools that use a traditional file system as their interface may not work well with a distributed file system.
Summary of the invention
The method uses the distributed file system HDFS as the underlying storage system and builds on it a high-throughput I/O capability for distributed computation on genomic data across multiple server nodes.
Compared with a centralized single-copy storage system, the invention stores 3 replicas of generated data on 3 nodes, so that subsequent steps can read with 3 times the I/O throughput.
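The throughput statement above can be illustrated with simple arithmetic. The sketch below assumes an idealized model in which each replica sits on a separate disk with the same sustained read bandwidth and concurrent readers are spread evenly across the replicas. The function name and the 150 MB/s figure are hypothetical and do not come from the patent.

```python
def peak_read_throughput(per_disk_mb_s: float, replicas: int) -> float:
    """Idealized upper bound on aggregate read bandwidth for one file.

    Assumes each replica lives on a distinct disk and readers are spread
    evenly across replicas (a modelling assumption, not a measurement).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return per_disk_mb_s * replicas


# Hypothetical per-disk bandwidth of 150 MB/s
single_copy = peak_read_throughput(150.0, replicas=1)
three_copies = peak_read_throughput(150.0, replicas=3)
ratio = three_copies / single_copy
print(f"single copy: {single_copy} MB/s, three replicas: {three_copies} MB/s")
```

In practice the achievable gain also depends on network bandwidth and on how readers are scheduled, which this model ignores.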
System resources are used more efficiently. During computation, I/O is usually not intensive, except in the start and end stages, which must read or store large amounts of data. In the meantime, the I/O capacity can be used by processes on other nodes.
The data replication mechanism of HDFS ensures that writing a multi-replica file is not significantly slower than writing to a single-copy system.
HDFS is mounted as a traditional file system, so that conventional tools (such as bwa and GATK) can read data from HDFS without modification.
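To illustrate why a mount makes modification unnecessary, the sketch below reads a file through ordinary file I/O only, with no HDFS client library. A temporary directory stands in for the mount point, and the helper name and sample FASTQ content are hypothetical.

```python
import tempfile
from pathlib import Path


def count_fastq_reads(path: Path) -> int:
    """Count reads in an uncompressed FASTQ file (4 lines per read)."""
    with open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


# A temporary directory stands in for the HDFS mount point (assumption);
# on a real node the path would be under the mounted file system instead.
with tempfile.TemporaryDirectory() as mount_point:
    sample = Path(mount_point) / "sample.fastq"
    sample.write_text("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    n_reads = count_fastq_reads(sample)

print(n_reads)
```

The reading code would be the same whether the path points at a local disk or at the mounted file system, which is the property the unmodified tools rely on.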
The local file system meets the temporary storage needs that arise during computation, so that conventional tools can be used together with the distributed storage system.
Description of the drawings
Fig. 1 is an exemplary diagram of the system.
Detailed description
The invention constructs a distributed I/O system for genome analysis. It uses the distributed file system (HDFS) as the underlying storage system and provides high-throughput I/O across multiple server nodes for distributed computation on genomic data.
It works as follows:
An HDFS cluster is coupled with the genome analysis compute nodes.
One of the compute nodes serves as the HDFS NameNode.
All compute nodes serve as HDFS DataNodes.
Data written to the file system is stored as 3 replicas on 3 different nodes in the cluster.
An NFS proxy is started on each compute node.
The NFS proxy provides the file system to the compute node.
On each compute node, HDFS is mounted through the NFS proxy as file system MF.
On each compute node, the local file system LF is used as temporary storage.
During analysis, a compute node reads data through file system MF; data to be written is written to files in LF.
Files that a process generates in LF are automatically saved to MF before the process exits.
The analysis compute nodes and the HDFS storage nodes are the same set of nodes.
A data analysis process on a compute node can perform I/O on the HDFS DataNodes through the file system MF mounted on that node.
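The read-from-MF, write-to-LF, save-before-exit pattern in the steps above can be sketched as follows. The patent does not say how the automatic save is implemented; this sketch assumes a context manager that copies the scratch directory to MF when the analysis block ends. The name `scratch_session`, the directory layout, and the use of temporary directories in place of the real mounts are all hypothetical. It requires Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def scratch_session(lf_root: Path, mf_root: Path, job_name: str):
    """Yield a scratch directory in LF and copy its files to MF on exit."""
    scratch = lf_root / job_name
    scratch.mkdir(parents=True, exist_ok=True)
    try:
        yield scratch
    finally:
        # Runs when the block ends, including when it ends with an error
        shutil.copytree(scratch, mf_root / job_name, dirs_exist_ok=True)
        shutil.rmtree(scratch)


with tempfile.TemporaryDirectory() as root:
    mf = Path(root) / "dfs"    # stands in for the mounted file system MF
    lf = Path(root) / "local"  # stands in for the local file system LF
    mf.mkdir()
    lf.mkdir()
    (mf / "input.txt").write_text("acgt\nttga\n")

    with scratch_session(lf, mf, "job1") as scratch:
        data = (mf / "input.txt").read_text()               # read from MF
        (scratch / "output.txt").write_text(data.upper())   # write to LF

    saved = (mf / "job1" / "output.txt").read_text()
    scratch_removed = not (lf / "job1").exists()

print(saved, scratch_removed)
```

Writing intermediate files to local disk keeps small, frequent writes off the distributed file system, and only the finished files are copied to MF.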
Fig. 1 shows an example of the system. It comprises 4 nodes. The hard disk of each node has two parts: one part is mounted at /mnt/local and used by LF, and the other part is used as HDFS storage. Each node has a proxy that mounts MF at /mnt/dfs. Through the proxy, an analysis program can read data from MF; it can also read data from LF, write data to LF, and store data from LF to MF. Data is replicated to 3 nodes.
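The 4-node example can be modelled with a small data structure. The sketch below is a toy: it places each block on 3 distinct nodes in round-robin order, which is an assumption made for illustration and is not the placement policy HDFS actually uses. The class and function names are hypothetical; only the two mount paths come from the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    local_mount: str = "/mnt/local"  # used by LF in the example
    dfs_mount: str = "/mnt/dfs"      # where MF is mounted in the example
    blocks: set = field(default_factory=set)


def place_replicas(nodes, block_id, replicas=3):
    """Toy round-robin placement of one block on `replicas` distinct nodes."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_id % len(nodes)
    chosen = [nodes[(start + i) % len(nodes)] for i in range(replicas)]
    for node in chosen:
        node.blocks.add(block_id)
    return [node.name for node in chosen]


cluster = [Node(f"node{i}") for i in range(1, 5)]
placements = {block: place_replicas(cluster, block) for block in range(4)}

for block, names in placements.items():
    print(block, names)
```

With 4 nodes and 3 replicas, every block is missing from exactly one node, so most reads in this model can be served by a node that holds the data locally.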
The above embodiment is only a preferred embodiment of the invention and does not limit its scope of protection. Any insubstantial variation or substitution made by those skilled in the art on the basis of the invention falls within the scope claimed by the invention.
Claims (1)
1. A fast I/O system for distributed genome analysis, characterized in that it uses the distributed file system HDFS as the underlying storage system and uses the local disk of each node to improve high-throughput I/O for genomic data in multi-node distributed computing;
an HDFS cluster is coupled with the genome analysis compute nodes; one of the compute nodes serves as the HDFS NameNode; all compute nodes serve as HDFS DataNodes; data written to the file system is stored as 3 replicas on 3 different nodes in the cluster;
an NFS proxy is started on each compute node; the NFS proxy provides file system support to the compute node; on each compute node, HDFS is mounted through the NFS proxy as file system MF;
on each compute node, the local file system LF is used as temporary storage; an analysis process on a compute node reads data from the mounted file system MF as it would from a traditional file system; data to be written is written to LF; files that a process generates in LF are automatically saved to MF before the process exits;
the HDFS storage nodes and the analysis compute nodes are the same set of nodes; an analysis process on a compute node can perform I/O on the HDFS DataNodes through the mounted file system MF.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762453539P | 2017-02-02 | 2017-02-02 | |
US62/453,539 | 2017-02-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108280214A true CN108280214A (en) | 2018-07-13 |
Family
ID=62807309
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810102016.1A Pending CN108280214A (en) | 2017-02-02 | 2018-02-01 | Quick I/O systems applied to distributed genetic group analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108280214A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130151884A1 (en) * | 2011-12-09 | 2013-06-13 | Promise Technology, Inc. | Cloud data storage system |
CN104408047A (en) * | 2014-10-28 | 2015-03-11 | 浪潮电子信息产业股份有限公司 | Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server |
US20150370502A1 (en) * | 2014-06-19 | 2015-12-24 | Cohesity, Inc. | Making more active use of a secondary storage system |
-
2018
- 2018-02-01 CN CN201810102016.1A patent/CN108280214A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130151884A1 (en) * | 2011-12-09 | 2013-06-13 | Promise Technology, Inc. | Cloud data storage system |
US20150370502A1 (en) * | 2014-06-19 | 2015-12-24 | Cohesity, Inc. | Making more active use of a secondary storage system |
CN104408047A (en) * | 2014-10-28 | 2015-03-11 | 浪潮电子信息产业股份有限公司 | Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server |
Non-Patent Citations (2)
Title |
---|
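Bao Yonghong (包永红): "Construction of a Hadoop-based genome analysis platform", China Master's Theses Full-text Database, Information Science and Technology * |
Zou Zhenyu (邹振宇): "Implementation and optimization of an HDFS-based cloud storage system", China Master's Theses Full-text Database, Information Science and Technology * |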
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180713 |