CN107729514A

CN107729514A - Hadoop-based method and device for determining a replica placement node

Info

Publication number: CN107729514A
Application number: CN201711007971.9A
Authority: CN
Inventors: 王宜燕; 江超
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-10-25
Filing date: 2017-10-25
Publication date: 2018-02-23

Abstract

The invention discloses a Hadoop-based method, apparatus, device, and computer-readable storage medium for determining a replica placement node. The method includes: determining a target rack server according to the copy type of a target copy; choosing candidate nodes from the target rack server to form a candidate node cluster; and choosing, from the candidate node cluster, the nodes whose work connection count is below a connection count threshold, then determining among those nodes the node with the minimum real-time load as the placement node of the target copy. In this scheme, selecting a node to hold a copy takes into account both the node's real-time load and its number of HDFS working processes, which makes the distribution of copies more reasonable. Compared with the default replica placement policy, the optimized policy has a more specific selection target: it selects the node with the minimum real-time load wherever possible, avoids storing copies on highly loaded nodes, and shortens copy transfer time.

Description

Hadoop-based method and device for determining a replica placement node

Technical field

The present invention relates to the technical field of copy storage in distributed file systems, and more specifically to a Hadoop-based method, apparatus, device, and computer-readable storage medium for determining a replica placement node.

Background art

Hadoop is currently the mainstream enterprise big-data analysis platform. Hadoop stores data in the HDFS distributed file system. HDFS uses a master/slave architecture: one name node (NameNode) and a number of data nodes (DataNode) form an HDFS cluster. HDFS relies on a three-copy redundancy mechanism to keep data safe. The principle of the default HDFS replica placement policy is to store two copies of a data block on one rack and the remaining copy on another rack, which gives a good balance between bandwidth use and reliability.

However, the default replica placement policy has limitations. It chooses storage nodes at random; although HDFS does take into account the load information given by a data node's work connection count, this check is relatively simple and is applied only after a storage node has been randomly selected. As a result, the distribution of copies is highly random. In heterogeneous environments in particular, the nodes that receive the most copies may well be the poorer-performing ones, so that some nodes carry a very high load while others sit idle, and data transmission efficiency drops.

Therefore, how to determine the placement node of a copy so as to balance the load across cluster nodes and ultimately improve data transmission efficiency is a problem that those skilled in the art need to solve.

Summary of the invention

The object of the invention is to provide a Hadoop-based method, apparatus, device, and computer-readable storage medium for determining a replica placement node, so as to determine the placement node of a copy, balance the load across cluster nodes, and ultimately improve data transmission efficiency.

To achieve the above object, the embodiments of the invention provide the following technical solution:

A Hadoop-based method for determining a replica placement node, including:

determining a target rack server according to the copy type of a target copy;

choosing candidate nodes from the target rack server to form a candidate node cluster;

choosing, from the candidate node cluster, the nodes whose work connection count is below a connection count threshold, and determining, among the nodes whose work connection count is below the connection count threshold, the node with the minimum real-time load as the placement node of the target copy.

Determining the target rack server according to the copy type of the target copy includes:

if the copy type of the target copy is the first copy, randomly selecting a rack server as the target rack server;

if the copy type of the target copy is the second copy, choosing the target rack server from the rack servers other than the one on which the first copy corresponding to the target copy is placed;

if the copy type of the target copy is the third copy, judging whether the rack server on which the first copy corresponding to the target copy is placed is the same as the rack server on which the second copy is placed; if they are the same, choosing the target rack server from the rack servers other than the one on which the second copy corresponding to the target copy is placed; if they differ, using the rack server on which the second copy corresponding to the target copy is placed as the target rack server.

Choosing, from the candidate node cluster, the nodes whose work connection count is below the connection count threshold includes:

determining the work connection count of each node in the candidate node cluster;

calculating the average work connection count of the candidate node cluster from the work connection count of each node, using the average work connection count as the connection count threshold, and choosing from the candidate node cluster the nodes whose work connection count is below the connection count threshold.

Determining, among the nodes whose work connection count is below the connection count threshold, the node with the minimum real-time load as the placement node of the target copy includes:

determining the disk I/O load, memory load, CPU load, and network load of each node whose work connection count is below the connection count threshold;

determining the real-time load of each such node from its disk I/O load, memory load, CPU load, and network load together with the load coefficients, and choosing the node with the minimum real-time load as the placement node of the target copy.

Determining the real-time load of each node whose work connection count is below the connection count threshold from its disk I/O load, memory load, CPU load, and network load together with the load coefficients includes:

determining the real-time load of each such node using a real-time load determination rule, where the real-time load determination rule is:

W=λ_io×w_io+λ_mem×w_mem+λ_cpu×w_cpu+λ_band×w_band;

where W is the real-time load, w_io is the disk I/O load, w_mem is the memory load, w_cpu is the CPU load, w_band is the network load, λ_io is the disk weight coefficient, λ_mem is the memory weight coefficient, λ_cpu is the CPU weight coefficient, and λ_band is the network bandwidth weight coefficient, with λ_io+λ_mem+λ_cpu+λ_band=1 and λ_io, λ_mem, λ_cpu, λ_band ∈ [0,1].

A Hadoop-based replica placement node determining device, including:

a target rack server determining module, configured to determine a target rack server according to the copy type of a target copy;

a cluster determining module, configured to choose candidate nodes from the target rack server to form a candidate node cluster;

a node selection module, configured to choose, from the candidate node cluster, the nodes whose work connection count is below a connection count threshold, and to determine, among those nodes, the node with the minimum real-time load as the placement node of the target copy.

The node selection module includes:

a work connection count determining unit, configured to determine the work connection count of each node in the candidate node cluster;

an average work connection count computing unit, configured to calculate the average work connection count of the candidate node cluster from the work connection count of each node;

a node selection unit, configured to use the average work connection count as the connection count threshold and to choose from the candidate node cluster the nodes whose work connection count is below the connection count threshold.

The node selection module further includes:

a load determining unit, configured to determine the disk I/O load, memory load, CPU load, and network load of each node whose work connection count is below the connection count threshold;

a real-time load determining unit, configured to determine the real-time load of each such node from its disk I/O load, memory load, CPU load, and network load together with the load coefficients, and to choose the node with the minimum real-time load as the placement node of the target copy.

A Hadoop-based device for determining a replica placement node, including:

a memory, configured to store a computer program; and a processor, configured to implement the steps of the above method for determining a replica placement node when executing the computer program.

A computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above method for determining a replica placement node.

As the above solution shows, the Hadoop-based method for determining a replica placement node provided by the embodiments of the invention includes: determining a target rack server according to the copy type of a target copy; choosing candidate nodes from the target rack server to form a candidate node cluster; and choosing, from the candidate node cluster, the nodes whose work connection count is below a connection count threshold, then determining among those nodes the node with the minimum real-time load as the placement node of the target copy.

In this scheme, selecting a node to hold a copy takes into account both the node's real-time load and its number of HDFS working processes, which makes the distribution of copies more reasonable. Compared with the default replica placement policy, the optimized policy has a more specific selection target: it selects the node with the minimum real-time load wherever possible, avoids storing copies on highly loaded nodes, and shortens copy transfer time. The invention also discloses a Hadoop-based replica placement node determining device, a device, and a computer-readable storage medium, which achieve the same technical effect.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a Hadoop-based method for determining a replica placement node disclosed in an embodiment of the invention;

Fig. 2 is a schematic structural diagram of a Hadoop-based replica placement node determining device disclosed in an embodiment of the invention.

Detailed description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

The embodiments of the invention disclose a Hadoop-based method, apparatus, device, and computer-readable storage medium for determining a replica placement node, so as to determine the placement node of a copy, balance the load across cluster nodes, and ultimately improve data transmission efficiency.

Referring to Fig. 1, a Hadoop-based method for determining a replica placement node provided by an embodiment of the invention includes:

S101: determining a target rack server according to the copy type of a target copy;

Specifically, before determining the node on which to place a copy, the rack server must first be determined. In this scheme, the rack server is determined from the copy's type information, which indicates which copy (first, second, and so on) it is. In general, copy placement follows the principle of spreading copies across different racks as far as possible to ensure reliability. This embodiment takes the most common three-copy scheme as an example, so the type of the target copy can be the first copy, the second copy, or the third copy. The replica placement selection policy of this scheme is as follows:

1) If the number of copies still to be chosen is greater than 0 and the first copy is to be selected:

Determine whether the client node is a data node. If it is, select that node to store the copy; otherwise, choose a rack server at random and choose the node for the target copy by the method described in this scheme;

2) If the second copy is to be selected:

Choose the target rack server from all racks other than the rack holding the first copy, and choose the node by the method described in this scheme;

3) If the third copy is to be selected:

If the nodes holding the first and second copies are on the same rack, choose the node by the method described in this scheme from all racks other than the rack holding the second copy; otherwise, choose the node by the method described in this scheme from the rack holding the second copy.

It should be noted that when the target rack server is chosen from several rack servers, it can be chosen at random or according to a predetermined selection rule; no specific limitation is imposed here.
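The rack selection policy in items 1) to 3) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `choose_target_rack`, its parameters, and the use of plain rack-name strings are assumptions made for the example, random choice stands in for whatever selection rule is actually configured, and the special case in item 1) where the client is itself a data node is left out.

```python
import random
from typing import Optional, Sequence


def choose_target_rack(
    copy_index: int,
    racks: Sequence[str],
    first_rack: Optional[str] = None,
    second_rack: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the rack for the copy_index-th copy (1, 2 or 3) of a block.

    Hypothetical helper: names and signature are illustrative only.
    """
    rng = rng or random.Random()

    def pick_excluding(excluded: Optional[str]) -> str:
        # Randomly pick a rack other than the excluded one.
        others = [r for r in racks if r != excluded]
        if not others:
            raise ValueError("no other rack is available")
        return rng.choice(others)

    if copy_index == 1:
        # First copy: any rack, chosen at random.
        return pick_excluding(None)
    if copy_index == 2:
        # Second copy: a rack other than the one holding the first copy.
        if first_rack is None:
            raise ValueError("first_rack is required for the second copy")
        return pick_excluding(first_rack)
    if copy_index == 3:
        if first_rack is None or second_rack is None:
            raise ValueError("both earlier racks are required for the third copy")
        if first_rack == second_rack:
            # Both earlier copies share a rack: move to a different rack.
            return pick_excluding(second_rack)
        # Earlier copies are on different racks: reuse the second copy's rack.
        return second_rack
    raise ValueError("copy_index must be 1, 2 or 3")
```

Passing a seeded `random.Random` as `rng` makes the choice reproducible, which is convenient when testing a placement policy.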

S102: choosing candidate nodes from the target rack server to form a candidate node cluster;

Specifically, this scheme chooses a certain number of data nodes from the specified rack to generate the candidate node cluster, from which the node that will hold the target copy is then determined. It should be noted that the number of nodes in the cluster can be set in advance, and that the nodes can be chosen according to a preset node selection rule or at random; no specific limitation is imposed here.
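A minimal sketch of step S102, assuming the random-selection option and a preset cluster size. The function name `build_candidate_cluster` and the representation of nodes as host-name strings are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def build_candidate_cluster(
    rack_nodes: Sequence[str],
    size: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly draw up to `size` distinct data nodes from one rack."""
    rng = rng or random.Random()
    # Never ask for more nodes than the rack actually has.
    k = min(size, len(rack_nodes))
    return rng.sample(list(rack_nodes), k)
```

Capping the sample at the rack size is a design choice of this sketch; the source does not say what happens when the rack holds fewer nodes than the preset number.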

S103: choosing, from the candidate node cluster, the nodes whose work connection count is below a connection count threshold, and determining, among those nodes, the node with the minimum real-time load as the placement node of the target copy.

In this scheme, once the cluster has been determined, the nodes whose work connection count is below the connection count threshold are chosen from it, and among those nodes the one with the minimum real-time load is chosen as the placement node for the copy. It should be noted that the connection count threshold can be preset by the user or changed dynamically according to actual conditions, for example by using the cluster's average work connection count as the threshold; no specific limitation is imposed here.
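The dynamic-threshold variant, in which the cluster's average work connection count serves as the threshold, can be sketched as below. The function name and the dictionary mapping node names to connection counts are assumptions for illustration; how the counts are actually collected from the data nodes is outside this sketch.

```python
from statistics import mean
from typing import Dict, List


def filter_by_average_connections(connections: Dict[str, int]) -> List[str]:
    """Keep the nodes whose work connection count is strictly below the
    average over the candidate cluster (the average acts as the threshold)."""
    if not connections:
        return []
    threshold = mean(list(connections.values()))
    return [node for node, count in connections.items() if count < threshold]
```

For example, with counts `{"dn1": 2, "dn2": 8, "dn3": 5}` the average is 5, so only `dn1` passes. If every node has the same count, no node is strictly below the average and the result is empty; the source does not describe a fallback for that case.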

It should be understood that the number of HDFS working processes of a data node is its number of work connections for HDFS writes, reads, and similar operations. Because the load indicators are ratios, in a heterogeneous environment a node with better performance may show a low level of real-time load, and when node performance is severely unbalanced a large number of the cluster's copies would end up on a few high-performance nodes. The work connection count makes it possible to control the HDFS working processes running on a data node and to keep any data node from taking on too much HDFS service.

In this scheme, selecting a node to hold a copy takes into account both the node's real-time load and its number of HDFS working processes, which makes the distribution of copies more reasonable. Compared with the default replica placement policy, the optimized policy has a more specific selection target: it selects the node with the minimum real-time load wherever possible, avoids storing copies on highly loaded nodes, and shortens copy transfer time.

Based on the above embodiment, determining, among the nodes whose work connection count is below the connection count threshold, the node with the minimum real-time load as the placement node of the target copy includes:

Specifically, this scheme uses a real-time load determination rule to determine the real-time load of each node whose work connection count is below the connection count threshold. The real-time load determination rule is:

W=λ_io×w_io+λ_mem×w_mem+λ_cpu×w_cpu+λ_band×w_band；

Specifically, the real-time load of a data node can be measured by several indicators. This scheme is illustrated using disk I/O load, memory load, CPU load, and network load. Let the real-time load of a data node be W; then:

W=λ_io×w_io+λ_mem×w_mem+λ_cpu×w_cpu+λ_band×w_band；

where w_io is the disk I/O load, w_mem is the memory load, w_cpu is the CPU load, and w_band is the network load; λ_io, λ_mem, λ_cpu, and λ_band are the weights given to the node's disk, memory, CPU, and network bandwidth when measuring its workload, with λ_io+λ_mem+λ_cpu+λ_band=1 and λ_io, λ_mem, λ_cpu, λ_band ∈ [0,1]. Further, the weights in this scheme are chosen using the analytic hierarchy process from operations research, which gives: λ_cpu=0.153, λ_mem=0.072, λ_io=0.531, λ_band=0.245.
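The weighted sum and the final minimum-load selection can be sketched as follows, using the weights quoted above. The names `NodeLoad`, `real_time_load`, and `pick_placement_node` are hypothetical, and the sketch assumes each load value is a utilisation fraction in [0, 1], which the source does not state. Note that the four weights as printed add up to 1.001, presumably through rounding; the sketch uses them unchanged.

```python
from dataclasses import dataclass
from typing import Dict, Mapping

# Weights as quoted in the text (analytic hierarchy process result).
WEIGHTS: Dict[str, float] = {"io": 0.531, "mem": 0.072, "cpu": 0.153, "band": 0.245}


@dataclass(frozen=True)
class NodeLoad:
    """Per-node load indicators; assumed here to be fractions in [0, 1]."""
    io: float
    mem: float
    cpu: float
    band: float


def real_time_load(load: NodeLoad, weights: Mapping[str, float] = WEIGHTS) -> float:
    """W = λ_io·w_io + λ_mem·w_mem + λ_cpu·w_cpu + λ_band·w_band."""
    return (
        weights["io"] * load.io
        + weights["mem"] * load.mem
        + weights["cpu"] * load.cpu
        + weights["band"] * load.band
    )


def pick_placement_node(loads: Mapping[str, NodeLoad]) -> str:
    """Return the name of the node with the smallest real-time load W."""
    return min(loads, key=lambda name: real_time_load(loads[name]))
```

Because disk I/O carries more than half of the total weight, a node with a busy disk scores worse than a node with a busy CPU and memory but an idle disk, which matches the intent of favouring nodes that can absorb a copy write quickly.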

The replica placement node determining device provided by an embodiment of the invention is introduced below. The device described below and the method for determining a replica placement node described above can be cross-referenced.

Referring to Fig. 2, a Hadoop-based replica placement node determining device provided by an embodiment of the invention includes:

a target rack server determining module 100, configured to determine a target rack server according to the copy type of a target copy;

a cluster determining module 200, configured to choose candidate nodes from the target rack server to form a candidate node cluster;

a node selection module 300, configured to choose, from the candidate node cluster, the nodes whose work connection count is below a connection count threshold, and to determine, among those nodes, the node with the minimum real-time load as the placement node of the target copy.

Wherein, the node selection module includes：

Based on any of the above embodiments, this scheme discloses a Hadoop-based device for determining a replica placement node, including:

a memory, configured to store a computer program; and a processor, configured to implement the steps of the above method for determining a replica placement node when executing the computer program.

This scheme also discloses a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above method for determining a replica placement node.

It should be noted that the storage medium may include any medium capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the parts the embodiments have in common can be understood by cross-reference.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A Hadoop-based method for determining a replica placement node, characterized by comprising:

determining a target rack server according to the copy type of a target copy;

choosing, from the candidate node cluster, the nodes whose work connection count is below a connection count threshold, and determining, among the nodes whose work connection count is below the connection count threshold, the node with the minimum real-time load as the placement node of the target copy.

2. The method for determining a replica placement node according to claim 1, characterized in that determining the target rack server according to the copy type of the target copy comprises:

if the copy type of the target copy is the first copy, randomly selecting a rack server as the target rack server;

if the copy type of the target copy is the second copy, choosing the target rack server from the rack servers other than the one on which the first copy corresponding to the target copy is placed;

if the copy type of the target copy is the third copy, judging whether the rack server on which the first copy corresponding to the target copy is placed is the same as the rack server on which the second copy is placed; if they are the same, choosing the target rack server from the rack servers other than the one on which the second copy corresponding to the target copy is placed; if they differ, using the rack server on which the second copy corresponding to the target copy is placed as the target rack server.

3. The method for determining a replica placement node according to claim 1, characterized in that choosing, from the candidate node cluster, the nodes whose work connection count is below the connection count threshold comprises:

calculating the average work connection count of the candidate node cluster from the work connection count of each node, using the average work connection count as the connection count threshold, and choosing from the candidate node cluster the nodes whose work connection count is below the connection count threshold.

4. The method for determining a replica placement node according to any one of claims 1 to 3, characterized in that determining, among the nodes whose work connection count is below the connection count threshold, the node with the minimum real-time load as the placement node of the target copy comprises:

determining the disk I/O load, memory load, CPU load, and network load of each node whose work connection count is below the connection count threshold;

determining the real-time load of each such node from its disk I/O load, memory load, CPU load, and network load together with the load coefficients, and choosing the node with the minimum real-time load as the placement node of the target copy.

5. The method for determining a replica placement node according to claim 4, characterized in that determining the real-time load of each node whose work connection count is below the connection count threshold from its disk I/O load, memory load, CPU load, and network load together with the load coefficients comprises:

determining the real-time load of each such node using a real-time load determination rule, where the real-time load determination rule is:

W=λ_io×w_io+λ_mem×w_mem+λ_cpu×w_cpu+λ_band×w_band;

where W is the real-time load, w_io is the disk I/O load, w_mem is the memory load, w_cpu is the CPU load, w_band is the network load, λ_io is the disk weight coefficient, λ_mem is the memory weight coefficient, λ_cpu is the CPU weight coefficient, and λ_band is the network bandwidth weight coefficient, with λ_io+λ_mem+λ_cpu+λ_band=1 and λ_io, λ_mem, λ_cpu, λ_band ∈ [0,1].

6. A Hadoop-based replica placement node determining device, characterized by comprising:

a cluster determining module, configured to choose candidate nodes from the target rack server to form a candidate node cluster;

a node selection module, configured to choose, from the candidate node cluster, the nodes whose work connection count is below a connection count threshold, and to determine, among those nodes, the node with the minimum real-time load as the placement node of the target copy.

7. The replica placement node determining device according to claim 6, characterized in that the node selection module comprises:

an average work connection count computing unit, configured to calculate the average work connection count of the candidate node cluster from the work connection count of each node;

a node selection unit, configured to use the average work connection count as the connection count threshold and to choose from the candidate node cluster the nodes whose work connection count is below the connection count threshold.

8. The replica placement node determining device according to claim 6 or 7, characterized in that the node selection module comprises:

a load determining unit, configured to determine the disk I/O load, memory load, CPU load, and network load of each node whose work connection count is below the connection count threshold;

a real-time load determining unit, configured to determine the real-time load of each such node from its disk I/O load, memory load, CPU load, and network load together with the load coefficients, and to choose the node with the minimum real-time load as the placement node of the target copy.

9. A Hadoop-based device for determining a replica placement node, characterized by comprising:

a memory, configured to store a computer program;

a processor, configured to implement the steps of the method for determining a replica placement node according to any one of claims 1 to 5 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the method for determining a replica placement node according to any one of claims 1 to 5.