CN106598729A - Data distribution method and system of distributed parallel computing system - Google Patents

Publication number
CN106598729A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611042373.0A
Other languages
Chinese (zh)
Inventor
杨黎
付仲明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhengtong Electronics Co Ltd
Original Assignee
Shenzhen Zhengtong Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhengtong Electronics Co Ltd
Priority to CN201611042373.0A
Publication of CN106598729A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration

Abstract

The invention discloses a data distribution method for a distributed parallel computing system. The method comprises the following steps: estimating the occupied space of each cluster in an input data set; establishing a correspondence between the clusters and the data blocks that store them according to the occupied space of the clusters in the input data set and the preset remaining space of the data blocks; and storing the clusters in the corresponding data blocks according to that correspondence. The invention further discloses a data distribution system for the distributed parallel computing system. With the disclosed data distribution method and system, the load of the reduce tasks can be balanced, thereby improving job execution efficiency and reducing lost time.

Description

Data distribution method and system of distributed parallel computing system
Technical field
The present invention relates to the field of computer application technology, and more particularly to a data distribution method and system for a distributed parallel computing system.
Background
Since the beginning of the 21st century, Internet information technology has developed rapidly. On the one hand, Internet transmission speeds have steadily improved; on the other hand, the number of Internet users worldwide has kept growing. Together these have produced a sharp, geometric growth in data volume. The distributed parallel computing framework MapReduce was published in 2004 at the OSDI (USENIX Symposium on Operating Systems Design and Implementation) conference, and it enables efficient processing of big data. Distributed parallel computing platforms based on MapReduce have several major advantages: simple programming, high reliability, easy addition and removal of nodes, parallel task processing, and low cost. After several years of development and practice, the MapReduce programming model has proven to be an effective method for processing big data. Compared with traditional programming models, a MapReduce job can run multiple parallel map tasks and reduce tasks over distributed data automatically and efficiently across many machines.
Among the currently popular MapReduce implementations, Apache Spark offers a more efficient mechanism for large-scale data processing than Hadoop and other distributed computing frameworks. MapReduce processing in Spark treats all intermediate data as <key, value> pairs, and a cluster is the subset of pairs that share the same key. Mappers and reducers are the containers of map tasks and reduce tasks, respectively. Spark implements a cluster distribution rule that uses a hash algorithm to assign clusters to reducers, and all clusters processed by the same reducer form a partition. The size of a partition depends on the number of <key, value> pairs it contains. For skewed data, the default hash partitioning cannot distribute clusters well, which can lead to large differences in workload between reducers. Because key skew in the intermediate data is always present, program running time increases and job performance suffers.
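To make the skew problem concrete, the following minimal Python sketch mimics a default hash partitioner and counts how many <key, value> pairs land on each reducer. The function names, the integer keys and the four-reducer setup are illustrative assumptions, not part of the patent or of Spark's API; the sketch only shows why one very frequent key overloads a single reducer.

```python
from collections import Counter

def hash_partition(key, num_reducers):
    """Mimic a default hash partitioner: reducer index = hash(key) mod n."""
    return hash(key) % num_reducers

def reducer_loads(pairs, num_reducers):
    """Count how many <key, value> pairs each reducer receives."""
    loads = Counter()
    for key, _value in pairs:
        loads[hash_partition(key, num_reducers)] += 1
    return [loads[r] for r in range(num_reducers)]

# Skewed input: key 0 is far more frequent than keys 1..7.
pairs = [(0, "x")] * 900 + [(k, "x") for k in range(1, 8) for _ in range(20)]
loads = reducer_loads(pairs, num_reducers=4)
print(loads)
```

With this input, the reducer that owns key 0 receives 920 of the 1040 pairs, while each of the other three receives 40.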
Summary of the invention
The main object of the present invention is to provide a data distribution method and system for a distributed parallel computing system that balance the load of the reduce tasks, thereby improving job execution efficiency and reducing lost time.
To achieve the above object, the present invention provides a data distribution method for a distributed parallel computing system, comprising:
estimating the occupied space of each cluster in an input data set;
establishing, according to the occupied space of each cluster in the input data set and the preset remaining space of each data block, a correspondence between each cluster and the data block that stores it;
storing each cluster in its corresponding data block according to the correspondence between each cluster and the data block that stores it.
Preferably, the step of estimating the occupied space of each cluster in the input data set comprises:
extracting, using a reservoir sampling algorithm, clusters amounting to a preset proportion of the total data volume of the input data set as a data set sample;
counting the occupied space of each cluster in the data set sample;
determining the occupied space of each cluster in the input data set according to the occupied space of each cluster in the data set sample and the preset proportion.
Preferably, the step of establishing, according to the occupied space of each cluster in the input data set and the space of each data block, the correspondence between each cluster and the data block that stores it comprises:
calculating the rated capacity of each data block, the rated capacity being equal to the sum of the occupied spaces of all clusters in the input data set divided by the number of data blocks, wherein the initial remaining space of each data block equals its rated capacity;
splitting and combining the clusters according to the rated capacity of the data blocks and dividing them into different data blocks, and establishing a correspondence between the split-and-combined clusters and the data blocks that store them.
Preferably, the step of splitting and combining the clusters according to the rated capacity of the data blocks and dividing them into different data blocks comprises:
calling the clusters in turn in descending order of occupied space, and calling the data blocks in descending order of remaining space;
each time a cluster is called, judging whether the occupied space of the currently called cluster is greater than the remaining space of the currently called data block;
when the occupied space of the currently called cluster is less than or equal to the remaining space of the currently called data block, assigning the currently called cluster to the currently called data block and continuing to call the next cluster;
when the occupied space of the currently called cluster is greater than the remaining space of the currently called data block, cutting the currently called cluster according to the remaining space of the currently called data block;
assigning the portion obtained by cutting to the currently called data block, adding the portion left over to the clusters not yet called, and calling the next data block.
Preferably, after the step of establishing the correspondence between each cluster and the data block that stores it, the data distribution method of the distributed parallel computing system further comprises:
obtaining the actual occupied space of each cluster;
when the actual occupied space of a cluster was not estimated among the estimated occupied spaces, distributing the cluster into a data block using the default hash algorithm;
when the actual occupied space of a cluster was estimated among the estimated occupied spaces, performing the step of storing each cluster in its corresponding data block according to the correspondence between each cluster and the data block that stores it.
In addition, to achieve the above object, the present invention also provides a data distribution system for a distributed parallel computing system, comprising:
an estimation module, configured to estimate the occupied space of each cluster in an input data set;
a relation establishing module, configured to establish, according to the occupied space of each cluster in the input data set and the preset remaining space of each data block, a correspondence between each cluster and the data block that stores it;
a distribution module, configured to store each cluster in its corresponding data block according to the correspondence between each cluster and the data block that stores it.
Preferably, the estimation module comprises:
a sampling unit, configured to extract, using a reservoir sampling algorithm, clusters amounting to a preset proportion of the total data volume of the input data set as a data set sample;
a sample statistics unit, configured to count the occupied space of each cluster in the data set sample;
an estimation unit, configured to determine the occupied space of each cluster in the input data set according to the occupied space of each cluster in the data set sample and the preset proportion.
Preferably, the relation establishing module comprises:
a rated capacity unit, configured to calculate the rated capacity of each data block, the rated capacity being equal to the sum of the occupied spaces of all clusters in the input data set divided by the number of data blocks, wherein the initial remaining space of each data block equals its rated capacity;
a relation establishing unit, configured to split and combine the clusters according to the rated capacity of the data blocks and divide them into different data blocks, and to establish a correspondence between the split-and-combined clusters and the data blocks that store them.
Preferably, the relation establishing unit comprises:
a sorting subunit, configured to call the clusters in turn in descending order of occupied space, and to call the data blocks in descending order of remaining space;
a judging subunit, configured to judge, each time a cluster is called, whether the occupied space of the currently called cluster is greater than the remaining space of the currently called data block;
a first dividing subunit, configured to, when the occupied space of the currently called cluster is less than or equal to the remaining space of the currently called data block, assign the currently called cluster to the currently called data block and continue to call the next cluster;
a second dividing subunit, configured to, when the occupied space of the currently called cluster is greater than the remaining space of the currently called data block, cut the currently called cluster according to the remaining space of the currently called data block, assign the portion obtained by cutting to the currently called data block, add the portion left over to the clusters not yet called, and call the next data block.
Preferably, the data distribution system of the distributed parallel computing system further comprises:
a space acquisition module, configured to obtain the actual occupied space of each cluster;
a default allocation module, configured to distribute a cluster into a data block using the default hash algorithm when the actual occupied space of that cluster was not estimated among the estimated occupied spaces;
the distribution module being further configured to, when the actual occupied space of a cluster was estimated among the estimated occupied spaces, store each cluster in its corresponding data block according to the correspondence between each cluster and the data block that stores it.
In the technical solution proposed by the present invention, the input data set contains many clusters whose occupied spaces differ. To distribute the clusters evenly among the data blocks, the occupied space of the clusters of the input data set is first estimated. Then, according to the occupied space of the clusters and the space of the data blocks, a correspondence is established between each cluster and the data block that stores it, which ensures that the total occupied space of the clusters assigned to each data block is roughly equal. When the key-value pairs of a cluster output by a map task are received, they can be distributed to the corresponding data block according to the correspondence. Because the total occupied space of the clusters assigned to each data block is roughly equal, the data load among the data blocks is balanced, which in turn balances the workload of the reduce tasks, reduces task running time, and indirectly improves system performance.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the first embodiment of the data distribution method of the distributed parallel computing system of the present invention;
Fig. 2 is a detailed flowchart of the step of estimating the occupied space of each cluster in the input data set in the second embodiment of the data distribution method of the distributed parallel computing system of the present invention;
Fig. 3 is a detailed flowchart of the step of establishing, according to the occupied space of each cluster in the input data set and the space of each data block, the correspondence between each cluster and the data block that stores it in the third embodiment of the data distribution method of the distributed parallel computing system of the present invention;
Fig. 4 is a detailed flowchart of the step of splitting and combining the clusters according to the rated capacity of the data blocks and dividing them into different data blocks in the fourth embodiment of the data distribution method of the distributed parallel computing system of the present invention;
Fig. 5 is a functional block diagram of the first embodiment of the data distribution system of the distributed parallel computing system of the present invention;
Fig. 6 is a detailed functional block diagram of the estimation module in the second embodiment of the data distribution system of the distributed parallel computing system of the present invention;
Fig. 7 is a detailed functional block diagram of the relation establishing module in the third embodiment of the data distribution system of the distributed parallel computing system of the present invention;
Fig. 8 is a functional block diagram of the fourth embodiment of the data distribution system of the distributed parallel computing system of the present invention;
Fig. 9 is a shuffle data allocation process diagram of the data distribution method of the distributed parallel computing system of the present invention under a Spark cloud service environment;
Fig. 10 is a system architecture and load balancing structure diagram of the data distribution method of the distributed parallel computing system of the present invention;
Fig. 11 is a flowchart of cluster splitting and combining in the data distribution method of the distributed parallel computing system of the present invention;
Fig. 12 is a performance comparison diagram of the data distribution method of the distributed parallel computing system of the present invention on Sort under different degrees of skew.
The realization of the object, the functional characteristics and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described herein are intended only to explain the present invention and are not intended to limit it.
As shown in Fig. 1, the data distribution method of the distributed parallel computing system proposed by the first embodiment of the present invention comprises:
Step S100, estimating the occupied space of each cluster in the input data set.
Specifically, for skewed data, the occupied space of the clusters in the data set of the input file is unevenly distributed. The occupied space of the clusters can be estimated by sampling, in preparation for the subsequent planning of how the clusters are distributed among the data blocks (buckets). A cluster is in effect a set of <key, value> pairs.
Step S200, establishing, according to the occupied space of each cluster in the input data set and the preset remaining space of each data block, a correspondence between each cluster and the data block that stores it.
Specifically, refer to Fig. 9, the shuffle data allocation process diagram of the data distribution method of the distributed parallel computing system of the present invention under a Spark cloud service environment. The space size of each bucket is obtained. According to the space size of the buckets and the distribution of the space sizes of the clusters of the input data set, a balanced allocation relationship between buckets and clusters can be established. This ensures that the space in every bucket is fully used and avoids large differences in the number of <key, value> pairs assigned to each bucket, which would cause large differences in workload between reducers and increase program running time.
Step S300, storing each cluster in its corresponding data block according to the correspondence between each cluster and the data block that stores it.
Specifically, a mapping relationship exists between buckets and clusters. When the input data set passes through a map task and a cluster is output, the cluster is placed in the corresponding bucket according to the mapping relationship.
In the technical solution proposed by the present invention, the input data set contains many clusters whose occupied spaces differ. To distribute the clusters evenly among the data blocks, the occupied space of the clusters of the input data set is first estimated. Then, according to the occupied space of the clusters and the space of the data blocks, a correspondence is established between each cluster and the data block that stores it, which ensures that the total occupied space of the clusters assigned to each data block is roughly equal. When the key-value pairs of a cluster output by a map task are received, they can be distributed to the corresponding data block according to the correspondence. Because the total occupied space of the clusters assigned to each data block is roughly equal, the data load among the data blocks is balanced, which in turn balances the workload of the reduce tasks, reduces task running time, and indirectly improves system performance.
Further, referring to Fig. 2, which shows the second embodiment of the data distribution method of the distributed parallel computing system of the present invention, based on the first embodiment, the step of estimating the occupied space of each cluster in the input data set comprises:
Step S110, extracting, using a reservoir sampling algorithm, clusters amounting to a preset proportion of the total data volume of the input data set as a data set sample.
Specifically, refer to Fig. 10, the system architecture and load balancing structure diagram of the data distribution method of the distributed parallel computing system of the present invention. The structure adds two parts: sampling and a data distribution strategy.
For the data sampling part, the preset proportion in this embodiment is 20%. Data blocks amounting to 20% of the total data volume are extracted as the sample using the reservoir sampling algorithm.
A pseudo-code implementation of the reservoir sampling algorithm is provided below:
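The original pseudo-code listing does not survive in this text. As a stand-in, the following is a minimal Python sketch of classic reservoir sampling (Algorithm R), which may differ from the variant the patent uses. The function name `reservoir_sample`, the `rng` parameter and the choice of a sample of 200 out of 1000 records (20%) are assumptions made for illustration.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir

# 20% sample of a 1000-record input, matching the preset proportion above.
sample = reservoir_sample(range(1000), k=200, rng=random.Random(0))
print(len(sample))
```

This variant reads the stream once and keeps only k items in memory, so the total size of the input does not have to be known before sampling starts.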
Step S120, counting the occupied space of each cluster in the data set sample;
Step S130, determining the occupied space of each cluster in the input data set according to the occupied space of each cluster in the data set sample and the preset proportion.
Specifically, the occupied space of the clusters in the whole data set sample is counted and then scaled up according to the proportion of the sample's data volume to the total data volume, thereby estimating the distribution of the occupied space of the clusters of the whole input data set.
A pseudo-code implementation of estimating the occupied space of the clusters is provided below:
Estimating the distribution of the occupied space of the clusters of the input data set according to the proportion yields a set SC = {SC_1, SC_2, …, SC_i, …, SC_m}, 1 ≤ i ≤ m, where each SC_i is an integer representing the data size of a specific cluster C_i.
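The estimation step can be sketched as follows. This is a minimal illustration, not the patent's own listing: it assumes that the occupied space of a cluster is measured as its number of <key, value> pairs, and the name `estimate_cluster_sizes` and the sample contents are hypothetical.

```python
from collections import Counter

def estimate_cluster_sizes(sample_pairs, ratio):
    """Scale per-key counts observed in the sample up to the full data set."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    # Occupied space of each cluster inside the sample (assumed: pair count).
    sample_sizes = Counter(key for key, _value in sample_pairs)
    # Divide by the sampling ratio to estimate SC_i for the whole input.
    return {key: round(count / ratio) for key, count in sample_sizes.items()}

# A 20% sample in which key "a" dominates.
sample_pairs = [("a", 1)] * 60 + [("b", 1)] * 30 + [("c", 1)] * 10
estimated = estimate_cluster_sizes(sample_pairs, ratio=0.20)
print(estimated)
```

Keys that never appear in the sample get no estimate at all, which is why the method later falls back to hash partitioning for them.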
In this embodiment, to estimate the distribution of the occupied space of the clusters of the input data set, a reservoir sampling algorithm is used to obtain the sample. This sampling algorithm can extract a sample of a given size without traversing the whole data set, while guaranteeing the randomness of the sampling and reducing the extra burden of the sampling task. The reservoir sampling algorithm yields a relatively accurate approximate distribution of the occupied space of the clusters of the input data set.
Further, referring to Fig. 3, which shows the third embodiment of the data distribution method of the distributed parallel computing system of the present invention, on the basis of the second embodiment above, the step of establishing, according to the occupied space of each cluster in the input data set and the space of each data block, the correspondence between each cluster and the data block that stores it comprises:
Step S210, calculating the rated capacity of each data block, the rated capacity being equal to the sum of the occupied spaces of all clusters in the input data set divided by the number of data blocks, wherein the initial remaining space of each data block equals its rated capacity.
Specifically, the rated capacity W_avg of each bucket is calculated from the set SC as: W_avg = (SC_1 + SC_2 + … + SC_m) / n,
where m is the number of clusters and n is the number of buckets. According to the number of buckets, the current remaining capacity of the buckets can be represented by the set RB = {RB_1, RB_2, …, RB_j, …, RB_n}.
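As a small worked example of step S210, the sketch below computes the rated capacity and initializes the remaining space of every bucket to it. The helper name `rated_capacity` and the input sizes are hypothetical.

```python
def rated_capacity(cluster_sizes, num_buckets):
    """Rated capacity: sum of all cluster sizes divided by the bucket count."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return sum(cluster_sizes) / num_buckets

sc = [300, 150, 50]          # estimated sizes SC_1..SC_m (m = 3)
n = 2                        # number of buckets
w_avg = rated_capacity(sc, n)
rb = [w_avg] * n             # initial remaining space RB_1..RB_n
print(w_avg, rb)
```

Here the three clusters total 500 pairs, so each of the two buckets starts with a remaining space of 250.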
Step S220, splitting and combining the clusters according to the rated capacity of the data blocks and dividing them into different data blocks, and establishing a correspondence between the split-and-combined clusters and the data blocks that store them.
Specifically, according to the rated capacity of the buckets, the SCID algorithm splits and combines the clusters and divides them into different buckets; that is, it decides which bucket or buckets each cluster should be placed in and how much of its data goes into each. This yields an intermediate data placement matrix P. Refer to Fig. 11, the cluster split-and-combine flowchart of the data distribution method of the distributed parallel computing system of the present invention, which describes how the SCID algorithm forms the intermediate data placement, i.e. the data distribution strategy.
A pseudo-code implementation of cluster splitting and combining is provided below:
Once a map task outputs a tuple <K_i, v>, it can be determined which bucket should be chosen to write this key. Based on the analysis of the sample, it is known that the cluster set C may not include all keys. Therefore, it should first be judged whether the tuple <K_i, v> can be allocated under matrix P. If no key in set C matches K_i, the tuple should be allocated using the default hash allocation strategy. Otherwise, the key should be allocated under the current matrix P.
Further, referring to Fig. 4, which shows the fourth embodiment of the data distribution method of the distributed parallel computing system of the present invention, on the basis of the third embodiment above, the step of splitting and combining the clusters according to the rated capacity of the data blocks and dividing them into different data blocks comprises:
Step S221, calling the clusters in turn in descending order of occupied space, and calling the data blocks in descending order of remaining space.
Specifically, the clusters are sorted in ascending order of their occupied space in the set SC, giving the cluster set C and the corresponding set SC of occupied spaces, where C = {C_1, C_2, …, C_i, …, C_m}, SC = {SC_1, SC_2, …, SC_i, …, SC_m}, 1 ≤ i ≤ m.
Step S222, each time a cluster is called, judging whether the occupied space of the currently called cluster is greater than the remaining space of the currently called data block;
when the occupied space of the currently called cluster is less than or equal to the remaining space of the currently called data block, proceeding to step S223, assigning the currently called cluster to the currently called data block and continuing to call the next cluster;
when the occupied space of the currently called cluster is greater than the remaining space of the currently called data block, proceeding to step S224, cutting the currently called cluster according to the remaining space of the currently called data block, assigning the portion obtained by cutting to the currently called data block, adding the portion left over to the clusters not yet called, and calling the next data block.
Specifically, in the first iteration, the cluster set C is sorted in ascending order of the data size of each cluster C_i. In this embodiment, k = 1 is used for illustration, and B_1 denotes the first bucket.
Starting from the largest cluster C_m, if SC_m ≥ RB_1, a new data segment of size W_avg is split off from C_m and filled into B_1. Meanwhile, the remaining part of C_m, of size SC_m − RB_1, enters the next iteration together with the other remaining clusters. In this case a single step fills a whole bucket. In most cases, however, the current largest cluster satisfies SC_m < RB_1 and often cannot fill the bucket.
When SC_m is less than the space of B_1, C_m is filled into B_1. For the remaining space, it is judged whether the second-largest cluster C_(m-1) can fill it: if SC_m + SC_(m-1) ≥ RB_1, cluster C_(m-1) is split; otherwise C_(m-1) is filled into this bucket. This process traverses all remaining clusters until a cluster C_i is found whose size, added to the sizes of the clusters already placed in the bucket, reaches the capacity of the current bucket B_k,
where k denotes the serial number of the current bucket.
In this algorithm, each iteration fills one bucket, so k also represents the current iteration number. In every iteration, the cluster sizes SC are used. Through this process, the data sizes of all buckets under a given mapper are substantially equal.
After steps S221 to S224, the allocation matrix P of the intermediate data over the buckets is obtained.
P = {p_(i,j)}, where p_(i,j) is the number of <key, value> pairs of the j-th cluster that should be assigned to the i-th bucket. The matrix thus records how many <key, value> pairs of each cluster every bucket contains.
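Steps S221 to S224 can be sketched in Python as below. This is one reading of the description, not the patent's SCID listing: the function name `split_and_combine` is hypothetical, the placement matrix P is stored sparsely as one dictionary per bucket, and the rated capacity is rounded up to an integer (an assumption) so that integer cluster sizes always fit.

```python
import math

def split_and_combine(cluster_sizes, num_buckets):
    """Greedy placement; result[i][key] = pairs of cluster `key` placed in bucket i."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # Ignore empty clusters; they need no space.
    pending = {key: size for key, size in cluster_sizes.items() if size > 0}
    # Rated capacity, rounded up (assumption) so that every pair has a slot.
    capacity = math.ceil(sum(pending.values()) / num_buckets)
    remaining = [capacity] * num_buckets
    placement = [{} for _ in range(num_buckets)]
    k = 0  # index of the bucket currently being filled
    while pending:
        # S221: call the largest cluster that has not been fully placed.
        key = max(pending, key=pending.get)
        size = pending.pop(key)
        if size <= remaining[k]:
            # S223: the whole cluster fits into the current bucket.
            placement[k][key] = placement[k].get(key, 0) + size
            remaining[k] -= size
        else:
            # S224: cut the cluster; the leftover rejoins the pending clusters.
            cut = remaining[k]
            placement[k][key] = placement[k].get(key, 0) + cut
            remaining[k] = 0
            pending[key] = size - cut
        if remaining[k] == 0:
            k += 1  # the bucket is full: call the next bucket
    return placement

placement = split_and_combine({"a": 300, "b": 150, "c": 50}, num_buckets=2)
totals = [sum(row.values()) for row in placement]
print(placement, totals)
```

In this example the dominant cluster "a" is cut: 250 pairs fill the first bucket and the remaining 50 join "b" and "c" in the second, so both buckets end up holding 250 pairs. Selecting the maximum on every step keeps the sketch short at the cost of quadratic time; a heap would be the natural choice for many clusters.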
To verify the advantages of the SCID algorithm, the common Spark benchmark Sort is used, and SCID is compared with the Range, LIBRA, PCWC and DEFH strategies, mainly in terms of reduce execution time and degree of load balance. Fig. 12(a) shows that when the skew degree exceeds 0.65, the skew factor of the buckets processed by the reduce tasks grows rapidly. By splitting and combining clusters, the SCID algorithm better balances the data processed by the reduce tasks. Fig. 12(b) shows the impact of load imbalance on reduce task performance. When the skew degree is low, PCWC, DEFH and LIBRA are slightly faster than SCID. This is because the start time of the reduce tasks is counted from the moment the map tasks output intermediate data, and SCID must look up every key under matrix P to decide how to distribute each <key, value> pair, which introduces a certain amount of extra computation. More importantly, sampling the input data before the job runs brings additional overhead that affects overall performance. When the skew degree exceeds 0.7, however, the optimized placement strategy for intermediate data improves performance considerably. By balancing the load of the reduce tasks, the impact of the extra operations is largely eliminated. The reduce execution time of the SCID algorithm clearly grows the most slowly of all the algorithms compared.
The experimental results show that using the SCID algorithm on Spark effectively balances the data volume in each bucket, thereby reducing the execution time of the reduce tasks and improving system performance.
Further, on the basis of any of the first to fourth embodiments above, after the step of establishing the correspondence between each cluster and the data block that stores it, the data distribution method of the distributed parallel computing system further comprises:
Step S400, obtaining the actual occupied space of each cluster;
when the actual occupied space of a cluster was not estimated among the estimated occupied spaces, proceeding to step S500, distributing the cluster into a data block using the default hash algorithm;
when the actual occupied space of a cluster was estimated among the estimated occupied spaces, performing step S300, storing each cluster in its corresponding data block according to the correspondence between each cluster and the data block that stores it.
Specifically, once a map task outputs a tuple <K_i, v>, it can be determined from matrix P which bucket should be chosen to write this key. Based on the analysis of the sample, it is known that the cluster set C may not include all keys. Therefore, it should first be judged whether the tuple <K_i, v> can be allocated under matrix P. If no key in set C matches K_i, the tuple should be allocated using the default hash allocation strategy. Otherwise, the key should be allocated under the current matrix P.
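The lookup-with-fallback behaviour of steps S300 and S500 might look like the following sketch. The class `PlacementRouter`, its method `bucket_for` and the sparse per-bucket representation of P are hypothetical. The text also does not say what happens when a key produces more pairs than were estimated; staying on the last planned bucket is an assumption made here.

```python
class PlacementRouter:
    """Route map output tuples to buckets using a placement plan P."""

    def __init__(self, placement, num_buckets):
        self.num_buckets = num_buckets
        self.slots = {}  # key -> list of [bucket index, pairs still planned]
        for bucket, row in enumerate(placement):
            for key, count in row.items():
                self.slots.setdefault(key, []).append([bucket, count])

    def bucket_for(self, key):
        slots = self.slots.get(key)
        if slots is None:
            # S500: key was never sampled, so use default hash partitioning.
            return hash(key) % self.num_buckets
        # S300: consume the planned quota bucket by bucket.
        for slot in slots:
            if slot[1] > 0:
                slot[1] -= 1
                return slot[0]
        # Assumption: once the estimate is used up, stay on the last planned bucket.
        return slots[-1][0]

placement = [{"a": 2}, {"a": 1, "b": 1}]
router = PlacementRouter(placement, num_buckets=2)
routed_a = [router.bucket_for("a") for _ in range(4)]
print(routed_a)
```

In this usage the first two "a" tuples go to bucket 0 and the rest to bucket 1, while an unsampled key such as the integer 6 is routed by its hash.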
With reference to Fig. 5, in a first embodiment of the data distribution system of the distributed parallel computing system of the present invention, the data distribution system includes:
An estimation module 100, configured to estimate the occupied space of each cluster in the input data set.
Specifically, for skewed data, the occupied space of the clusters in the data set of an input file is unevenly distributed. The occupied space of the clusters can be estimated by sampling, in preparation for determining in advance how the clusters are distributed across the data blocks (buckets). A cluster is in effect a group of <key, value> pairs with the same key.
A relation establishing module 200, configured to establish, according to the occupied space of each cluster of the input data set and the preset remaining space of each data block, the correspondence between each cluster and each data block storing the cluster.
Specifically, Fig. 9 shows the shuffle data allocation process of the data distribution method in a Spark cloud service environment. The space of each bucket is obtained, and a balanced assignment between buckets and clusters is established from the bucket sizes and the distribution of cluster sizes in the input data set. This ensures that the space in every bucket is fully used and avoids large differences in the number of <key, value> pairs assigned to each bucket, which would otherwise give different reducers very different workloads and lengthen the running time of the program.
A distribution module 300, configured to store each cluster in the corresponding data block according to the correspondence between each cluster and each data block storing the cluster.
Specifically, a mapping exists between buckets and clusters. When the input data set passes through a map task and a cluster is output, the cluster is placed in the corresponding bucket according to that mapping.
In the technical solution proposed by the present invention, the input data set contains many clusters of differing occupied space. To distribute the clusters evenly across the data blocks, the occupied space of the clusters of the input data set is first estimated. Then, according to the occupied space of the clusters and the space of the data blocks, a correspondence is established between each cluster and each data block storing the cluster, so that the total occupied space of the clusters assigned to each data block is roughly equal. When the key-value pairs of a cluster output by a map task are received, they can be assigned to the corresponding data block according to this correspondence. Because the total occupied space assigned to each data block is roughly equal, the data load is balanced across the data blocks, which in turn balances the workload of the reduce tasks, reduces task running time, and indirectly improves system performance.
Further, with reference to Fig. 6, in a second embodiment of the data distribution system of the distributed parallel computing system of the present invention, based on the first embodiment of the data distribution system, the estimation module 100 includes:
A sampling unit 110, configured to extract, using a reservoir sampling algorithm, clusters amounting to a preset ratio of the total data volume of the input data set as a data set sample.
Specifically, Fig. 10 shows the system architecture and load-balancing structure of the data distribution method; the structure adds two parts: sampling and the data distribution strategy.
For the data sampling part, the preset ratio in this embodiment is 20%: the reservoir sampling algorithm extracts data amounting to 20% of the total data volume as the sample.
Pseudo-code for one implementation of the reservoir sampling algorithm is given below:
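The pseudo-code itself is not present in this text. As a stand-in, here is a minimal Python sketch of classic reservoir sampling (Algorithm R), which is presumably what is meant. The function name, the fixed sample size `k` and the optional `rng` argument are assumptions made for illustration; the patent specifies only a 20% ratio, so turning that ratio into `k` would need the total record count or an estimate of it.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 200, random.Random(0))
```

The stream is read once and only `k` items are held in memory, which keeps the sampling step cheap relative to the job itself.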
A sample statistics unit 120, configured to count the occupied space of each cluster of the data set sample;
An estimation unit 130, configured to determine the occupied space of each cluster of the input data set according to the occupied space of each cluster of the data set sample and the preset ratio.
Specifically, the occupied space of every cluster in the data set sample is counted and then scaled up by the ratio of the sample's data volume to the total data volume, giving an estimate of the distribution of cluster occupied space over the whole input data set.
Pseudo-code for one implementation of estimating cluster occupied space is given below:
Estimating the distribution of cluster occupied space in the input data set from the ratio yields a set of cluster sizes SC = {SC_1, SC_2, …, SC_i, …, SC_m}, 1 ≤ i ≤ m, where each SC_i is an integer representing the data size of a specific cluster C_i.
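The estimation pseudo-code is likewise missing from this text. A minimal Python sketch of the scaling step might look like the following. It assumes that cluster size is measured as a count of <key, value> pairs and that the sample is a list of `(key, value)` tuples; the function name and both assumptions are illustrative and are not taken from the patent.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def estimate_cluster_sizes(sample: Iterable[Tuple[Hashable, object]],
                           ratio: float) -> Dict[Hashable, int]:
    """Scale per-key counts in the sample up to the whole input data set.

    ratio is the sampled fraction of the data (0.2 in the embodiment).
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    # Count how many sampled pairs belong to each key (cluster).
    counts = Counter(key for key, _ in sample)
    # Divide by the sampling ratio to estimate the full-data size SC_i.
    return {key: round(count / ratio) for key, count in counts.items()}


sizes = estimate_cluster_sizes([("a", 1), ("a", 2), ("b", 3)], 0.2)
```

Keys that never appear in the sample get no estimate at all, which is why a fallback for unseen keys is needed at assignment time.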
In this embodiment, the reservoir sampling algorithm is used to obtain the sample from which the distribution of cluster occupied space in the input data set is estimated. The algorithm can draw a sample of a given size without knowing the size of the whole data set in advance while preserving the randomness of the sampling, which reduces the extra burden of the sampling task. Reservoir sampling thus gives a reasonably accurate approximation of the distribution of cluster occupied space in the input data set.
Further, with reference to Fig. 7, in a third embodiment of the data distribution system of the distributed parallel computing system of the present invention, based on the second embodiment of the data distribution system, the relation establishing module 200 includes:
A rated capacity unit 210, configured to calculate the rated capacity of each data block, the rated capacity being equal to the sum of the occupied space of all clusters of the input data set divided by the number of data blocks, wherein the initial remaining space of each data block is equal to the rated capacity of the data block.
Specifically, the rated capacity of each bucket is calculated from the cluster size set SC as:
$$W_{avg} = \frac{1}{n}\sum_{i=1}^{m} SC_i$$
where m is the number of clusters and n is the number of buckets. Given the number of buckets, the current remaining capacities of the buckets can be represented by the set RB = {RB_1, RB_2, …, RB_j, …, RB_n}.
A relation establishing unit 220, configured to split and combine the clusters according to the rated capacity of the data blocks, divide them into different data blocks, and establish the correspondence between the split-and-combined clusters and the data blocks storing them.
Specifically, according to the rated capacity of the buckets, the SCID algorithm splits and combines the clusters and divides them into different buckets, that is, it determines which bucket or buckets each cluster should be placed in and how much of its data goes into each. This yields an intermediate data placement matrix P. Fig. 11 is the flow chart of cluster splitting and combining in the data distribution method; it describes how the SCID algorithm forms the intermediate data placement, i.e. the data distribution strategy.
Pseudo-code for one implementation of cluster splitting and combining is given below:
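The original pseudo-code is not present in this text. The following Python sketch is one possible reading of the split-and-combine placement described in this document: clusters are taken largest first, a cluster that does not fit is cut at the bucket's remaining space, and the leftover rejoins the pending clusters. The function name, the dictionary-based result (one sparse row of P per bucket) and the use of ceiling division for the rated capacity are assumptions, not the patent's own code.

```python
from typing import Dict, List


def split_and_combine(sizes: Dict[str, int], n_buckets: int) -> List[Dict[str, int]]:
    """Return, for each bucket, a map from cluster key to the amount placed there."""
    total = sum(sizes.values())
    # Rated capacity W_avg; rounded up (assumption) so every pair fits.
    capacity = -(-total // n_buckets)
    # Pending clusters, largest first.
    pending = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    placement: List[Dict[str, int]] = [dict() for _ in range(n_buckets)]
    # All buckets start with the same remaining space, so index order is
    # equivalent to taking them by remaining space, largest first.
    for b in range(n_buckets):
        remaining = capacity
        while pending and remaining > 0:
            key, size = pending.pop(0)
            if size <= remaining:
                # The whole cluster (or leftover segment) fits in this bucket.
                placement[b][key] = placement[b].get(key, 0) + size
                remaining -= size
            else:
                # Cut the cluster at the remaining space; the rest goes back
                # into the pending list, which is re-sorted largest first.
                placement[b][key] = placement[b].get(key, 0) + remaining
                pending.append((key, size - remaining))
                pending.sort(key=lambda kv: kv[1], reverse=True)
                remaining = 0
    return placement


plan = split_and_combine({"a": 70, "b": 20, "c": 10}, 2)
```

For this example the rated capacity is 50: the first bucket takes 50 units of `a`, and the second takes the remaining 20 of `a` together with all of `b` and `c`, so both buckets carry the same load.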
Once a map task outputs a tuple <K_i, v>, it can be determined which bucket should be selected to receive this key. From the analysis of the sampling, it is known that the cluster set C may not contain every key. It must therefore first be determined whether the tuple <K_i, v> can be assigned under matrix P. If no key in set C matches K_i, the tuple is assigned using the default hash allocation strategy; otherwise, the key is assigned according to the current matrix P.
Further, on the basis of the third embodiment of the data distribution system of the distributed parallel computing system above, the relation establishing unit 220 includes:
A sorting subunit 221, configured to call the clusters one by one in descending order of occupied space, and to call the data blocks in descending order of remaining space;
A judging subunit 222, configured to judge, each time a cluster is called, whether the occupied space of the currently called cluster is greater than the remaining space of the currently called data block;
A first dividing subunit 223, configured to, when the occupied space of the currently called cluster is less than or equal to the remaining space of the currently called data block, assign the currently called cluster to the currently called data block and continue by calling the next cluster;
A second dividing subunit 224, configured to, when the occupied space of the currently called cluster is greater than the remaining space of the currently called data block, cut the currently called cluster according to the remaining space of the currently called data block, assign the cut portion to the currently called data block, add the remainder to the clusters not yet called, and call the next data block.
Specifically, the clusters are sorted in ascending order of their occupied space in the set SC, giving the cluster set C and the corresponding set of cluster sizes SC, where C = {C_1, C_2, …, C_i, …, C_m}, SC = {SC_1, SC_2, …, SC_i, …, SC_m}, 1 ≤ i ≤ m.
In the first iteration, the cluster set C is sorted in ascending order of the current data size of each cluster C_i. In this embodiment, k = 1 is used for illustration, and B_1 denotes the first bucket.
Starting from the largest cluster C_m, if SC_m ≥ RB_1, a new data segment of size W_avg is split from C_m and filled into B_1. The remaining part of C_m, of size SC_m − RB_1, enters the next iteration together with the other remaining clusters. In this case a single step fills one bucket. In most cases, however, the current largest cluster satisfies SC_m < RB_1 and cannot fill the bucket by itself.
When SC_m is smaller than the space of B_1, C_m is filled into B_1. For the remaining space, it is determined whether the second-largest cluster C_{m-1} can fill it: if SC_m + SC_{m-1} ≥ RB_1, cluster C_{m-1} is split; otherwise C_{m-1} is filled into this bucket. This process traverses the remaining clusters until a cluster C_i is found whose addition makes the accumulated size reach the capacity of the bucket, that is, a cluster that satisfies the following condition:
$$\sum_{j=i}^{m} SC_j \ge RB_k$$
where k is the index of the current bucket.
In this algorithm, each iteration fills one bucket, so k also denotes the current iteration number. The cluster sizes SC are used in every iteration. Through this procedure, the data sizes in all buckets under a given mapper are roughly equal.
After steps S221 to S224, the allocation matrix of the intermediate data over the buckets is obtained:
P = {p_{i,j}}, where p_{i,j} is the number of <key, value> pairs of the j-th cluster that should be assigned to the i-th bucket. The matrix thus records, for each bucket, the number of <key, value> pairs it holds from each cluster.
To verify the advantages of the SCID algorithm, the common Spark benchmark Sort is used, and performance is compared against the Range, LIBRA, PCWC and DEFH strategies, mainly in terms of reduce execution time and degree of load balance. Figure 12(a) shows that when the skew exceeds 0.65, the skew factor of the buckets processed by the reduce tasks grows rapidly. By splitting and combining clusters, the SCID algorithm balances the data processed by the reduce tasks better than the other strategies. Figure 12(b) shows the impact of load imbalance on reduce task performance. When the skew is low, PCWC, DEFH and LIBRA are slightly faster than SCID. This is because, once the time at which the map tasks output intermediate data is counted in the start time of the reduce tasks, SCID must look up every key under matrix P to decide how to assign each <key, value> pair, which introduces some computational overhead. More importantly, sampling the input data before the job runs adds further cost that affects overall performance. However, when the skew exceeds 0.7, optimizing the placement of intermediate data improves performance considerably, and balancing the load of the reduce tasks largely offsets the cost of the extra operations. The reduce execution time of the SCID algorithm clearly grows the slowest of all the algorithms compared.
The experimental results show that using the SCID algorithm on Spark effectively balances the amount of data in each bucket, which reduces the execution time of the reduce tasks and improves system performance.
Further, with reference to Fig. 8, in a fourth embodiment of the data distribution system of the distributed parallel computing system of the present invention, on the basis of any one of the first to third embodiments of the data distribution system above, the data distribution system of the distributed parallel computing system further includes:
A space acquisition module 400, configured to obtain the actual occupied space of each cluster;
A default distribution module 500, configured to, when the actual occupied space of a cluster was not estimated among the estimated occupied spaces, distribute the cluster to a data block using the default hash algorithm;
The distribution module 300 is further configured to, when the actual occupied space of a cluster was estimated among the estimated occupied spaces, store each cluster in the corresponding data block according to the correspondence between each cluster and each data block storing the cluster.
Specifically, once a map task outputs a tuple <K_i, v>, matrix P determines which bucket should be selected to receive this key. From the analysis of the sampling, it is known that the cluster set C may not contain every key. It must therefore first be determined whether the tuple <K_i, v> can be assigned under matrix P. If no key in set C matches K_i, the tuple is assigned using the default hash allocation strategy; otherwise, the key is assigned according to the current matrix P.
The above are only optional embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A data distribution method for a distributed parallel computing system, characterized by comprising:
Estimating the occupied space of each cluster in an input data set;
Establishing, according to the occupied space of each cluster of the input data set and the preset remaining space of each data block, a correspondence between each cluster and each data block storing the cluster;
Storing each cluster in the corresponding data block according to the correspondence between each cluster and each data block storing the cluster.
2. The data distribution method for a distributed parallel computing system according to claim 1, characterized in that the step of estimating the occupied space of each cluster in the input data set comprises:
Extracting, using a reservoir sampling algorithm, clusters amounting to a preset ratio of the total data volume of the input data set as a data set sample;
Counting the occupied space of each cluster of the data set sample;
Determining the occupied space of each cluster of the input data set according to the occupied space of each cluster of the data set sample and the preset ratio.
3. The data distribution method for a distributed parallel computing system according to claim 2, characterized in that the step of establishing, according to the occupied space of each cluster of the input data set and the remaining space of each data block, the correspondence between each cluster and each data block storing the cluster comprises:
Calculating the rated capacity of each data block, the rated capacity being equal to the sum of the occupied space of all clusters of the input data set divided by the number of data blocks, wherein the initial remaining space of each data block is equal to the rated capacity of the data block;
Splitting and combining the clusters according to the rated capacity of the data blocks, dividing them into different data blocks, and establishing the correspondence between the split-and-combined clusters and the data blocks storing them.
4. The data distribution method for a distributed parallel computing system according to claim 3, characterized in that the step of splitting and combining the clusters according to the rated capacity of the data blocks and dividing them into different data blocks comprises:
Calling the clusters one by one in descending order of occupied space, and calling the data blocks in descending order of remaining space;
Judging, each time a cluster is called, whether the occupied space of the currently called cluster is greater than the remaining space of the currently called data block;
When the occupied space of the currently called cluster is less than or equal to the remaining space of the currently called data block, assigning the currently called cluster to the currently called data block and continuing by calling the next cluster;
When the occupied space of the currently called cluster is greater than the remaining space of the currently called data block, cutting the currently called cluster according to the remaining space of the currently called data block;
Assigning the cut portion to the currently called data block, adding the remainder to the clusters not yet called, and calling the next data block.
5. The data distribution method for a distributed parallel computing system according to any one of claims 1 to 4, characterized in that, after the step of establishing the correspondence between each cluster and each data block storing the cluster, the data distribution method further comprises:
Obtaining the actual occupied space of each cluster;
When the actual occupied space of a cluster was not estimated among the estimated occupied spaces, distributing the cluster to a data block using the default hash algorithm;
When the actual occupied space of a cluster was estimated among the estimated occupied spaces, executing the step of storing each cluster in the corresponding data block according to the correspondence between each cluster and each data block storing the cluster.
6. A data distribution system for a distributed parallel computing system, characterized by comprising:
An estimation module, configured to estimate the occupied space of each cluster in an input data set;
A relation establishing module, configured to establish, according to the occupied space of each cluster of the input data set and the preset remaining space of each data block, a correspondence between each cluster and each data block storing the cluster;
A distribution module, configured to store each cluster in the corresponding data block according to the correspondence between each cluster and each data block storing the cluster.
7. The data distribution system for a distributed parallel computing system according to claim 6, characterized in that the estimation module comprises:
A sampling unit, configured to extract, using a reservoir sampling algorithm, clusters amounting to a preset ratio of the total data volume of the input data set as a data set sample;
A sample statistics unit, configured to count the occupied space of each cluster of the data set sample;
An estimation unit, configured to determine the occupied space of each cluster of the input data set according to the occupied space of each cluster of the data set sample and the preset ratio.
8. The data distribution system for a distributed parallel computing system according to claim 7, characterized in that the relation establishing module comprises:
A rated capacity unit, configured to calculate the rated capacity of each data block, the rated capacity being equal to the sum of the occupied space of all clusters of the input data set divided by the number of data blocks, wherein the initial remaining space of each data block is equal to the rated capacity of the data block;
A relation establishing unit, configured to split and combine the clusters according to the rated capacity of the data blocks, divide them into different data blocks, and establish the correspondence between the split-and-combined clusters and the data blocks storing them.
9. The data distribution system for a distributed parallel computing system according to claim 8, characterized in that the relation establishing unit comprises:
A sorting subunit, configured to call the clusters one by one in descending order of occupied space, and to call the data blocks in descending order of remaining space;
A judging subunit, configured to judge, each time a cluster is called, whether the occupied space of the currently called cluster is greater than the remaining space of the currently called data block;
A first dividing subunit, configured to, when the occupied space of the currently called cluster is less than or equal to the remaining space of the currently called data block, assign the currently called cluster to the currently called data block and continue by calling the next cluster;
A second dividing subunit, configured to, when the occupied space of the currently called cluster is greater than the remaining space of the currently called data block, cut the currently called cluster according to the remaining space of the currently called data block, assign the cut portion to the currently called data block, add the remainder to the clusters not yet called, and call the next data block.
10. The data distribution system for a distributed parallel computing system according to any one of claims 6 to 9, characterized in that the data distribution system further comprises:
A space acquisition module, configured to obtain the actual occupied space of each cluster;
A default distribution module, configured to, when the actual occupied space of a cluster was not estimated among the estimated occupied spaces, distribute the cluster to a data block using the default hash algorithm;
The distribution module is further configured to, when the actual occupied space of a cluster was estimated among the estimated occupied spaces, store each cluster in the corresponding data block according to the correspondence between each cluster and each data block storing the cluster.
CN201611042373.0A 2016-11-18 2016-11-18 Data distribution method and system of distributed parallel computing system Pending CN106598729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611042373.0A CN106598729A (en) 2016-11-18 2016-11-18 Data distribution method and system of distributed parallel computing system

Publications (1)

Publication Number Publication Date
CN106598729A true CN106598729A (en) 2017-04-26

Family

ID=58592938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611042373.0A Pending CN106598729A (en) 2016-11-18 2016-11-18 Data distribution method and system of distributed parallel computing system

Country Status (1)

Country Link
CN (1) CN106598729A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170426