CN107066205A - A data storage system

Info

Publication number: CN107066205A
Application number: CN201611257420.3A
Authority: CN
Inventors: 惠润海; 杨浩
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.; Dawning Information Industry Beijing Co Ltd
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2017-08-18
Anticipated expiration: 2036-12-30
Also published as: CN107066205B

Abstract

The present invention proposes a data storage system comprising a Hadoop cluster, a component arranged in the Hadoop cluster, and an NFS server module. The component includes a Map-Reduce framework that executes a Map-Reduce flow, and the Map-Reduce flow includes Map tasks and Reduce tasks. A disk array module is arranged in the NFS server module, and the disk array module and the NFS server module together form a shared storage device that provides storage for the Hadoop cluster. The result of each Map task is stored on the shared storage device so that the shuffle process can be removed, which optimizes the flow of Map tasks and Reduce tasks. The component also splits the files used in the Hadoop cluster into multiple blocks and sends each block to different computer nodes, thereby achieving load balancing. With this data storage system, the system is considerably improved in cost-effectiveness, reliability, maintainability, and performance.

Description

A data storage system

Technical field

The present invention relates to the communications field, and in particular to a data storage system.

Background art

In recent years, the open-source big data project Hadoop has matured and has brought feasible solutions to every industry that uses big data. The Map-Reduce parallel processing framework of a Hadoop cluster can process both structured and semi-structured data in parallel across many nodes, which can greatly increase the speed of data analysis and processing.

Meanwhile, a Hadoop cluster by default stores its data in the bundled distributed file system HDFS, which by default keeps three replicas of the data. For big data applications, however, the default multi-replica storage of HDFS has several drawbacks:

A big data application system usually does more than big data analysis; it also holds many other types of business data. HDFS therefore struggles to meet the needs of the various application scenarios, especially small-file storage, so the data must be imported into HDFS before each analysis, which is highly inconvenient;

The storage space utilization of three-replica HDFS is 33.3%, which makes storing and analyzing big data quite expensive;

HDFS is an open-source project, and its file system has a number of problems in reliability and maintainability, so it is not suitable for storing critical data in a production environment.

No effective solution to these problems in the related art has yet been proposed.

Summary of the invention

In view of the problems in the related art, the present invention proposes a data storage system that replaces HDFS three-replica storage with a combination of a RAID disk array and a second protocol. This considerably improves the reliability and maintainability of the system, and thereby addresses the cost, reliability, and ease-of-use problems of the distributed file system HDFS in the prior art.

The technical solution of the present invention is realized as follows:

According to one aspect of the present invention, a data storage system is provided.

The data storage system includes a Hadoop cluster, a component arranged in the Hadoop cluster, and an NFS server module. The component includes a Map-Reduce framework that executes a Map-Reduce flow, and the Map-Reduce flow includes Map tasks and Reduce tasks. A disk array module is arranged in the NFS server module, and the disk array module and the NFS server module together form a shared storage device that provides storage for the Hadoop cluster. The result of each Map task is stored on the shared storage device so that the shuffle process can be removed, which optimizes the flow of Map tasks and Reduce tasks. The component also splits the files used in the Hadoop cluster into multiple blocks and sends each block to different computer nodes, thereby achieving load balancing.

According to one embodiment of the present invention, the component further comprises: an NFS shared storage module, an HDFS storage protocol conversion module, a Map-Reduce flow Shuffle-stage removal module, and a Map-Reduce task scheduling module.

According to one embodiment of the present invention, the disk array uses a RAID5 or RAID6 storage mode, and the files used in the Hadoop cluster are split into 64 MB blocks.

The beneficial effects of the present invention are:

The present invention forms shared storage by combining an NFS server with a disk array, replacing the three-replica HDFS storage of the prior art, which reduces cost and improves the cost-effectiveness of the system. The files used by Hadoop are split into multiple blocks that are distributed evenly across the compute nodes, thereby achieving load balancing. In addition, the Map-Reduce flow is optimized by removing the shuffle process, which reduces data exchange, shortens task processing time, and thus improves system performance.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below evidently show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a data storage system according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the layout mechanism of a data storage system according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the Map-Reduce task execution process in the prior art;

Fig. 4 is a schematic diagram of the Map-Reduce task execution process according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.

According to an embodiment of the present invention, a data storage system is provided.

As shown in Figs. 1 to 4, a data storage system according to an embodiment of the present invention includes a Hadoop cluster, a component arranged in the Hadoop cluster, and an NFS server module. The component includes a Map-Reduce framework that executes a Map-Reduce flow, and the Map-Reduce flow includes Map tasks and Reduce tasks. A disk array module is arranged in the NFS server module, and the disk array module and the NFS server module together form a shared storage device that provides storage for the Hadoop cluster. The result of each Map task is stored on the shared storage device so that the shuffle process can be removed, which optimizes the flow of Map tasks and Reduce tasks. The component also splits the files used in the Hadoop cluster into multiple blocks and sends each block to different computer nodes, thereby achieving load balancing.

In this embodiment, as shown in Fig. 1, the RAID disk array is arranged in the NFS server, and the combination of the disk array module and the NFS server module forms shared storage that provides storage for the Hadoop cluster. In addition, as shown in Fig. 2, the component splits the files used in the Hadoop cluster into multiple blocks (or segments) and sends each block to 3 different computer nodes, thereby achieving load balancing. As shown in Figs. 3 and 4, the result of each Map task is stored on the shared storage device so that the shuffle process can be removed, which optimizes the flow of Map tasks and Reduce tasks. It will be understood that the block size and the computer nodes to which blocks are distributed can be configured according to actual requirements, and the present invention is not limited in this respect.

With the above scheme of the present invention, shared storage is formed by combining an NFS server with a disk array, replacing the three-replica HDFS storage of the prior art, which reduces cost and improves the cost-effectiveness of the system. The files used by Hadoop are split into multiple blocks that are distributed evenly across the compute nodes, thereby achieving load balancing. In addition, the Map-Reduce flow is optimized by removing the shuffle process, which reduces data exchange, shortens task processing time, and thus improves system performance.

According to one embodiment of the present invention, the component further comprises: an NFS shared storage module, an HDFS storage protocol conversion module, a Map-Reduce flow Shuffle-stage removal module, and a Map-Reduce task scheduling module. The NFS shared storage module sets up the NFS server and the disk array as shared storage. The HDFS storage protocol conversion module converts HDFS protocol data into NFS protocol data, so that the disk array can be accessed. The Map-Reduce flow Shuffle-stage removal module removes the Shuffle flow. The Map-Reduce task scheduling module schedules the Map tasks and Reduce tasks.

According to one embodiment of the present invention, the disk array uses a RAID5 (independent disks with distributed parity) or RAID6 (independent disks with two independent parity codes) storage mode, and the files used in the Hadoop cluster are split into 64 MB blocks.
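The fixed-size splitting described above can be illustrated with a minimal Python sketch. It only computes block boundaries; the function name `split_into_blocks` and the (offset, length) representation are assumptions made for illustration, not details given in the source.

```python
from typing import List, Tuple

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size named in the text


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Hypothetical helper: every block is block_size bytes long except
    possibly the last one, which holds the remainder.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    # A 200 MB file yields three full 64 MB blocks and one 8 MB remainder.
    for offset, length in split_into_blocks(200 * 1024 * 1024):
        print(offset, length)
```

The block size is a parameter because the text states that it can be configured according to actual requirements.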

To better describe the present invention, a specific embodiment is described in detail below.

To solve the cost, reliability, and ease-of-use problems of the distributed file system HDFS in the prior art, this document proposes replacing the three-replica storage of HDFS with RAID disk array storage. On the one hand, this eliminates the process of importing and exporting data; on the other hand, the disk array can be configured as the conventional RAID5 or RAID6 of traditional disk arrays, which improves storage space utilization and reduces cost.

As shown in Fig. 1, the disk array can be accessed through the NFS (Network File System) protocol. A protocol conversion module is therefore added to the Hadoop cluster to convert the HDFS application-layer protocol into the NFS access protocol, so that Hadoop storage accesses become accesses to the disk array in the NFS server. Specifically, computer node 1 accesses the Hadoop cluster, and protocol conversion module 1 converts the data of the Hadoop application-layer protocol (or Hadoop cluster access protocol) of component 1 into access data of the NFS protocol, so that the RAID disk array of the NFS server is accessed. Components 2 and 3 work similarly and are not described in detail here.
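The source does not describe how the protocol conversion module is implemented. As a rough illustration of the idea only, the Python sketch below maps an HDFS-style URI onto a path under a locally mounted NFS export, so that a read or write addressed to HDFS lands on the disk array instead. The mount point `/mnt/nfs_raid` and the function `hdfs_to_nfs_path` are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mount point of the NFS export backed by the RAID array.
NFS_MOUNT = PurePosixPath("/mnt/nfs_raid")


def hdfs_to_nfs_path(hdfs_uri: str, mount: PurePosixPath = NFS_MOUNT) -> PurePosixPath:
    """Translate an hdfs:// URI into a file path under the NFS mount.

    Illustrative only: a real conversion module would also have to
    translate the file operations themselves, not just the path.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {hdfs_uri!r}")
    if not parsed.path.startswith("/"):
        raise ValueError(f"URI has no absolute path: {hdfs_uri!r}")
    relative = PurePosixPath(parsed.path).relative_to("/")
    if ".." in relative.parts:
        raise ValueError("path must not escape the NFS mount")
    return mount / relative


if __name__ == "__main__":
    print(hdfs_to_nfs_path("hdfs://namenode:9000/data/input/log.txt"))
```

Because every compute node mounts the same export, the same translated path resolves to the same file on every node.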

In addition, data in a Hadoop cluster is by default stored in HDFS with three replicas, and Hadoop components such as the MapReduce framework and the HBase system rely on the replica mechanism for fault tolerance. For example, when the node holding the first replica fails, a Hadoop component automatically accesses the data on another replica. Once a disk array replaces HDFS, however, files no longer have the concept of replicas. Although the RAID mechanism of the disk array guarantees the reliability of the data itself, it cannot guarantee that the automatic replica switchover inside Hadoop's fault-tolerance mechanism still works properly. Because the disk array storage is exported over the NFS protocol, all compute nodes see exactly the same data, so any node can be regarded as a storage node of a file replica. As shown in Fig. 2, following the principle of Hadoop file storage, a file to be stored is split into fixed-length objects, for example 64 MB (megabytes) per block. To ensure that MapReduce tasks can be distributed to different compute nodes, 3 compute nodes can be designated for each block (segment) as the nodes where its replicas are stored. Once the components inside Hadoop obtain the data layout of the file, they distribute tasks with the default algorithm and need no changes. The shared nature of NFS is thus used to distribute the stored data pseudo-randomly, which keeps Map-Reduce task scheduling balanced. It will also be understood that the block size after splitting and the number of computer nodes holding replicas can be set according to actual requirements, and the present invention is not limited in this respect.

In addition, the compute nodes for each object's replicas are selected with a pseudo-random algorithm so that every compute node is selected with essentially the same probability. This ensures that Hadoop task scheduling can make full use of every compute node in the system and that no node is starved.
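The pseudo-random layout described in the two paragraphs above can be sketched as follows. The source does not name a specific algorithm, so uniform sampling with Python's `random.Random` is an assumption, as are the function name `assign_replica_nodes` and the node names in the example.

```python
import random
from collections import Counter
from typing import Dict, List


def assign_replica_nodes(num_blocks: int, nodes: List[str],
                         replicas: int = 3, seed: int = 0) -> Dict[int, List[str]]:
    """Pick `replicas` distinct nominal storage nodes for every block.

    Hypothetical sketch: all nodes see the same NFS data, so the chosen
    nodes only tell the scheduler where it may place tasks for a block.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    return {block: rng.sample(nodes, replicas) for block in range(num_blocks)}


if __name__ == "__main__":
    cluster = [f"node{i}" for i in range(6)]  # illustrative node names
    layout = assign_replica_nodes(1000, cluster)
    # Count how often each node was chosen; the counts should be close.
    usage = Counter(node for chosen in layout.values() for node in chosen)
    for node, count in sorted(usage.items()):
        print(node, count)
```

Sampling without replacement guarantees that the nodes chosen for one block are distinct, and uniform sampling gives each node essentially the same selection probability, which is the property the text relies on to avoid starved nodes.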

In addition, Fig. 3 shows the execution of a MapReduce task in a Hadoop cluster. A MapReduce task includes a Map stage and a Reduce stage. The Map stage splits and processes the input file, and the results are then collected and grouped for processing in the Reduce stage, so as to achieve efficient distributed computing. Before each Map stage ends, the multiple result files previously written to disk must be merged into a single result file. Before the Reduce stage starts, the result files of the Map tasks must be pulled from each Map task side, and all Map results must be merged into a final result file before the Reduce computation begins. The whole process from the end of Map to the start of Reduce is called the Shuffle process. For MapReduce tasks with a large workload, the Shuffle process involves a large number of I/O (input/output) operations, especially in the data pull stage, where Reduce worker nodes pull data from Map worker nodes over the network; this step accounts for more than 10% of the time of the whole job. As shown in Fig. 4, however, with the disk array architecture proposed here all data resides on shared storage, so the network transfer can be omitted entirely and no data pull is needed. The shared nature of the NFS file system is thus used to optimize the shuffle process of the Hadoop cluster, avoid data transfer, and shorten task processing time.
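The effect of writing Map results to shared storage can be illustrated with a toy word count in Python. This is a single-process sketch, not Hadoop code: a temporary directory stands in for the NFS mount, and all function and file names are assumptions. The point it shows is that each Reduce task reads the Map result files directly from the shared directory, with no step that fetches them from the Map nodes.

```python
import json
import tempfile
import zlib
from collections import Counter
from pathlib import Path
from typing import List


def partition_of(word: str, num_reducers: int) -> int:
    # Stable hash, so every Map task routes a given word to the same reducer.
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def run_map_task(map_id: int, text: str, shared_dir: Path, num_reducers: int) -> None:
    """Count words in one input split and write one result file per reducer."""
    buckets = [Counter() for _ in range(num_reducers)]
    for word in text.split():
        buckets[partition_of(word, num_reducers)][word] += 1
    for reduce_id, counts in enumerate(buckets):
        out = shared_dir / f"map{map_id}-part{reduce_id}.json"
        out.write_text(json.dumps(counts))


def run_reduce_task(reduce_id: int, shared_dir: Path) -> Counter:
    """Merge this reducer's partitions straight from shared storage."""
    total = Counter()
    for path in sorted(shared_dir.glob(f"map*-part{reduce_id}.json")):
        total.update(json.loads(path.read_text()))
    return total


def word_count(splits: List[str], num_reducers: int = 2) -> Counter:
    # The temporary directory is a stand-in for the shared NFS mount.
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        for map_id, text in enumerate(splits):
            run_map_task(map_id, text, shared, num_reducers)
        result = Counter()
        for reduce_id in range(num_reducers):
            result.update(run_reduce_task(reduce_id, shared))
        return result


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

In a real cluster the Map and Reduce tasks would run on different nodes; the sketch keeps them in one process only to stay runnable.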

In summary, this document proposes replacing Hadoop's three-replica HDFS in big data storage and analysis systems with a combination of a disk array and the NFS network file sharing protocol, which considerably improves the system in cost, data reliability, maintainability, and performance. Specifically: first, space utilization and cost are compared in Table 1 below, where the cost of 1 TB of raw space is assumed to be P.

Table 1

As can be seen from Table 1, the storage space utilization of HDFS is 33.3%: if 300 TB of storage is purchased, only 100 TB is actually usable. With the combination of a RAID disk array and NFS network file sharing proposed here, storage space utilization reaches up to 90%, and storage cost is reduced by about 67%.
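The utilization figures above can be checked with a short calculation. The Python sketch below is illustrative and rests on stated assumptions: a 10-disk RAID5 group (one disk's worth of parity) as one configuration that yields 90% utilization, and a cost model in which the cost per usable terabyte is the raw price P divided by the utilization. Neither assumption comes from Table 1.

```python
def raid_utilization(num_disks: int, parity_disks: int) -> float:
    """Usable fraction of a RAID group: RAID5 spends one disk's worth of
    capacity on parity, RAID6 spends two."""
    if num_disks <= parity_disks:
        raise ValueError("group needs more disks than parity disks")
    return (num_disks - parity_disks) / num_disks


def usable_capacity(raw_tb: float, utilization: float) -> float:
    """Usable capacity in TB for a given raw capacity."""
    return raw_tb * utilization


def cost_per_usable_tb(price_per_raw_tb: float, utilization: float) -> float:
    """Assumed cost model: raw price divided by the usable fraction."""
    return price_per_raw_tb / utilization


if __name__ == "__main__":
    P = 1.0                              # price of 1 TB of raw space, as in the text
    hdfs = 1 / 3                         # three replicas
    raid5 = raid_utilization(10, 1)      # assumed 10-disk RAID5 group
    print(usable_capacity(300, hdfs))    # usable TB out of 300 TB raw on HDFS
    print(usable_capacity(300, raid5))   # usable TB out of 300 TB raw on RAID5
    saving = 1 - cost_per_usable_tb(P, raid5) / cost_per_usable_tb(P, hdfs)
    print(round(saving, 3))
```

Under these assumptions the saving per usable terabyte comes to roughly 63%. A figure of about 67% would correspond to comparing three-replica storage with a single full copy (1 - 1/3), so the quoted value may rest on assumptions from Table 1 that are not reproduced here.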

Second, because HDFS is not a standard storage interface, data to be analyzed must be imported and exported, which considerably reduces analysis efficiency. With the scheme proposed here, production data generated at the front end can be processed by Hadoop directly, without import or export, which greatly simplifies data transfer. In addition, with the disk array combined with the NFS protocol, the Shuffle process of MapReduce tasks is locally optimized to reduce the data transferred over the HTTP protocol between Map task nodes and Reduce task nodes, which improves the efficiency of the whole process.

Third, open-source software is primarily aimed at implementing functionality, so compared with enterprise-grade products its engineering has many shortcomings, and the stability and reliability of the system often carry hidden risks. Because the HDFS storage system of Hadoop is under constant modification, its stability is also at some risk, and storing production data directly on HDFS is therefore very risky. RAID disk array storage, by contrast, has been developed over decades, and its reliability fully meets enterprise-grade standards, making it suitable for storing production data. Replacing HDFS three-replica storage with a combination of a RAID disk array and a second protocol therefore gives much higher data reliability than the distributed file system HDFS.

In summary, by means of the above technical solution of the present invention, shared storage is formed by combining an NFS server with a disk array, replacing the three-replica HDFS storage of the prior art, which reduces cost and improves the cost-effectiveness of the system. The files used by Hadoop are split into multiple blocks that are distributed evenly across the compute nodes, thereby achieving load balancing. In addition, the Map-Reduce flow is optimized by removing the shuffle process, which reduces data exchange, shortens task processing time, and thus improves system performance.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data storage system, comprising a Hadoop cluster, a component arranged in the Hadoop cluster, and an NFS server module, wherein the component includes a Map-Reduce framework, the Map-Reduce framework is used to execute a Map-Reduce flow, and the Map-Reduce flow includes Map tasks and Reduce tasks, characterized in that:

a disk array module is arranged in the NFS server module, and the disk array module and the NFS server module form a shared storage device that provides storage for the Hadoop cluster, the result of each Map task being stored on the shared storage device so as to remove the shuffle process and thereby optimize the flow of Map tasks and Reduce tasks; and

the component splits the files used in the Hadoop cluster into multiple blocks and sends each block to different computer nodes, thereby achieving load balancing.

2. The storage system according to claim 1, characterized in that the component further comprises: an NFS shared storage module, an HDFS storage protocol conversion module, a Map-Reduce flow Shuffle-stage removal module, and a Map-Reduce task scheduling module.

3. The storage system according to claim 1, characterized in that the disk array uses a RAID5 or RAID6 storage mode, and the files used in the Hadoop cluster are split into 64 MB blocks.