CN103106253B

CN103106253B - Data balancing method based on a genetic algorithm in the MapReduce computation model

Info

Publication number: CN103106253B
Application number: CN201310015988.4A
Authority: CN
Inventors: 伍卫国; 樊源泉; 魏伟; 朱霍; 高颜
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2013-01-16
Filing date: 2013-01-16
Publication date: 2016-05-04
Anticipated expiration: 2033-01-16
Also published as: CN103106253A

Abstract

A data balancing method based on a genetic algorithm in the MapReduce computation model. The method first obtains the global Map output information and then uses a genetic algorithm to perform combinatorial optimization. The metadata is collected and encoded, and the population is randomly divided multiple times, each division forming one gene and the divisions together forming a genome. The fitness function value over all subsets of each gene is calculated, together with the probability of each objective function value. Based on this fitness evaluation, a selection operator is applied to the genome: a roulette wheel algorithm randomly selects several superior genes from the genome. The selected genes undergo a crossover operation followed by a mutation operation. After multiple rounds of evolution, the genes to retain are chosen according to an elite retention strategy and decoded, which yields an optimized combination of the metadata and ensures that the data volume handled by each reducer is approximately equal. The invention solves the problem of unbalanced input data in the reduce stage, saves computing resources, and reduces computing cost.

Description

Data balancing method based on a genetic algorithm in the MapReduce computation model

Technical field

The invention belongs to the technical field of the MapReduce computation model, and specifically relates to a data balancing method based on a genetic algorithm in the MapReduce computation model.

Background art

Hadoop is a distributed storage and parallel computing platform with high reliability and good scalability, developed by the open-source organization Apache. It was first developed as the underlying platform of the open-source search engine project Nutch, later became independent of the Nutch project, and is now one of the typical open-source cloud computing platforms. The Hadoop core implements a distributed file system with block-based storage (Hadoop Distributed File System, HDFS) and the MapReduce computation model for distributed computing.

The MapReduce computation model is divided into two major processing stages, Map and Reduce. During MapReduce processing, the Map stage converts the input data into <Key, Value> key-value pairs and provides them to the Reduce stage for further processing. Before Reduce receives the key-value pairs output by Map and processes them, the data must pass through a Shuffle stage. The Shuffle stage mainly shuffles the output data of each Map task and collects, from the output of these Map tasks, the data that must be processed by the same reduce task. Because the collected data may be large, the Shuffle stage can merge the data and store it in the local file system of the node where the reduce task runs, thereby reducing memory usage.

Each Map task divides its output data into a number of partitions equal to the number of reduce tasks. Each reduce task collects its corresponding partition data from all Map tasks, and all Map output key-value pairs with the same key are assigned to the same reduce task, which ensures that the final result of each reduce task is computed over the global data.

The characteristics of the Shuffle stage mean that the data volumes received by the reduce tasks in the Reduce stage may be extremely unbalanced, which causes computation skew in the Reduce stage.

1) Reduce computation skew caused by user-defined partitioning strategies

When a MapReduce job is submitted, the Map stage must divide its output into partitions according to the specified partitioning strategy and establish the correspondence between Map output and reduce input. A user-defined partitioning strategy places interrelated data in the same partition according to the needs of the application, so that it is processed by the same reduce task. This ensures the correctness of the final result, but it may also leave the reduce tasks with unbalanced data volumes.

When a MapReduce job does not care about the specific partition of the data, hash partitioning is usually used to partition the Map output quickly. The hash value of the Key determines the partition number of the whole <Key, Value> pair, that is, partitionNum = hashCode(Key) % REDUCER_NUM. Because of hash collisions and the limited number of reduce tasks, a large number of keys may converge on the same partition, unbalancing the data volume across reduce tasks.

2) Reduce computation skew caused by the characteristics of the input data

Because partitioning is performed after each Map <Key, Value> pair is output, the partition is usually determined from some feature of the Key, without global statistics on the size of the Value data corresponding to that Key. Therefore, even if the partitioning strategy keeps the number of keys in each partition roughly balanced, the characteristics of the Map stage input may mean that the Value data volume of some keys is far larger than that of others, so that some reduce tasks must process an excessive amount of data. This usually occurs when the input contains hot-spot data. In general, input data skew in the Reduce stage makes some reduce tasks run longer than others, extends the running time of the whole Reduce stage, and ultimately delays the completion of the whole MapReduce job.
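The hash partitioning rule and the hot-spot effect described in this section can be illustrated with a minimal Python sketch. It is not Hadoop code: `zlib.crc32` stands in for `hashCode`, and `REDUCER_NUM = 4`, the record set, and the helper names are assumptions made up for illustration.

```python
import zlib

REDUCER_NUM = 4  # hypothetical number of reduce tasks


def partition_num(key, reducer_num=REDUCER_NUM):
    """partitionNum = hashCode(Key) % REDUCER_NUM, with CRC32 as a stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % reducer_num


def partition_volumes(records, reducer_num=REDUCER_NUM):
    """Sum the value sizes (in characters) that land on each partition."""
    volumes = [0] * reducer_num
    for key, value in records:
        volumes[partition_num(key, reducer_num)] += len(value)
    return volumes


# Made-up records: one hot key carries far more data than the 200 ordinary keys.
records = [("hot", "x" * 1000)] * 50 + [(f"key{i}", "x" * 10) for i in range(200)]
volumes = partition_volumes(records)
print(volumes)  # one partition holds at least the 50,000 characters of the hot key
```

The partitioner only sees the key, so all records of the hot key go to a single partition regardless of how large their values are.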

Summary of the invention

To overcome the shortcomings of the prior art described above, the object of the present invention is to provide a data balancing method based on a genetic algorithm in the MapReduce computation model. The method reduces the processing time of the reducer tasks and hence of the whole MapReduce job, which saves computing resources and reduces computing cost.

To achieve the above object, the present invention adopts the following technical solution:

A data balancing method based on a genetic algorithm in the MapReduce computation model comprises the following steps:

1) Obtain the global Map output information, that is, the metadata of the partitions processed by the reduce tasks. The Reduce metadata is acquired as follows:

1.1 After each Map task finishes processing and writes its output to local disk, its TaskTracker sends a task completion message to the JobTracker by means of a heartbeat message;

1.2 The JobTracker maintains a Map task completion message queue for each MapReduce job. When a TaskTracker running a reduce task requests Map task completion messages, the JobTracker takes messages from the queue of the job to which the reduce task belongs and passes them to the TaskTracker;

1.3 A reduce task in the same job obtains the Map task completion messages from its TaskTracker and extracts the Map task runtime information from them, including the Map task number and the execution node. Using this information, the reduce task establishes an HTTP connection to the execution node and requests the metadata of the Map task output;

1.4 The TaskTracker reads the index file of the corresponding Map task output from the local file system according to the requested Map task number and sends it to the requesting reduce task;

1.5 The reduce task merges the virtual partitions with the same number across the different index files and sums the data volume of all <Key, Value> pairs of the same kind in each virtual partition, so that each reduce task obtains the metadata output by all Map tasks;

2) Process the Map output data. The reduce task obtains the original partition data output by each Map task, and the aggregated metadata is submitted to a repartitioner, which uses a genetic algorithm to balance the metadata. The genetic algorithm operates on bit strings. The specific steps are as follows:

2.1 Collect the metadata of the Map output data into one set, which serves as the population, and encode each element of the population. Encoding means representing each element by a code composed of 0s and 1s; the encoding adopted uses the number of 1s to represent the index of the subset in which the element is placed. The population is randomly divided into N subsets, where N corresponds to the number of reduce tasks. Each division forms one gene, and repeated divisions form a genome;

2.2 In a genetic algorithm, the fitness function measures how well an individual is adapted to its environment: individuals with higher fitness get more opportunities to reproduce, and vice versa. A fitness function is therefore defined as

\min \frac{1}{n} \sum_{j=1}^{n} \left| S_{j} - \bar{S} \right|

Formula (1),

where S_{j} is the element sum of the j-th subset and \bar{S} is the mean of the element sums of all subsets. The objective function in formula (1) describes the average distance between each subset sum and the mean. Using formula (1), the fitness function value of each gene is calculated, forming a new set. The probability of each gene's fitness function value is then obtained, that is, the fitness function value of the gene divided by the sum of the fitness function values of the whole genome;

2.3 Apply the selection operator to the genome. The selection operator adopted is roulette wheel selection: a random function generates a random number in [0,1], and its position in the sequence of fitness probabilities of the genome is determined. If it is greater than at most m values in the sequence, the m-th gene is selected. The number of genes to select can be specified freely;

2.4 Apply the crossover operation to the selected genes, replacing and recombining parts of the structure of superior genes to form new genes. A single-point crossover operator is adopted: a crossover point is set at random and, for the genes chosen by roulette wheel selection, the parts of the two genes before and after this point are exchanged, generating two new individuals. To guarantee that the genome after the exchange contains no empty subset, a nullGen flag is set and the genome after crossover is traversed; if an empty subset is found, the nullGen flag is set to false, which marks the gene for deletion;

2.5 Apply the mutation operation to the genes after crossover. Mutation replaces some genes in the genome with other genes according to the mutation probability, forming a new individual. A fixed-bit mutation operator is adopted and the mutation probability is set to 0.1 in order to obtain the optimal solution. Fixed-bit mutation means mutating one or several fixed, specified bits of an individual's gene: a bit that was 0 becomes 1, and a bit that was 1 becomes 0. After mutation, the mutated gene is checked for empty subsets to ensure that it still has N subsets;

2.6 The steps above describe one round of evolution. After multiple rounds of evolution, the genes to retain are selected according to an elite retention strategy: after the above steps, the objective function value of each gene is calculated and compared with the objective function values of all genes in the genome, and the genes whose value is smaller are retained;

2.7 Decode the retained gene to obtain an optimized combination of the metadata, that is, a division of the metadata into N subsets of roughly equal size. The data corresponding to each subset is then assigned to a reducer, which ensures that the data volume handled by each reducer is approximately equal.

The beneficial effects of the invention are as follows:

A solution is proposed for the computation skew problem of the Reduce stage in the MapReduce platform. The method uses a genetic algorithm to repartition the Map stage output so that the data volume of each partition is roughly the same. Reduce tasks therefore use system resources more efficiently, and the inconsistent processing times caused by unbalanced reducer input are avoided, which reduces the processing time of the reducer tasks and hence of the whole MapReduce job. From a business perspective, the method saves computing resources and reduces computing cost.

Brief description of the drawings

Fig. 1 is a flow chart of Reduce metadata acquisition.

Fig. 2 is a class diagram of the Map output metadata acquisition module.

Fig. 3 is a flow chart of the data balancing method based on a genetic algorithm.

Detailed description of the invention

The present invention is described in detail below with reference to the accompanying drawings.

1) Obtain the global Map output information, that is, the metadata of the partitions processed by the reduce tasks. The Reduce metadata acquisition process is shown in Fig. 1:

1.5 The reduce task merges the virtual partitions with the same number across the different index files and sums the data volume of all <Key, Value> pairs of the same kind in each virtual partition. Because each reduce task must obtain the metadata output by all Map tasks, and in practice the Map tasks are usually numerous and distributed over many computing nodes, an implementation can use multiple threads to speed up metadata acquisition. The main class structure of the Map output metadata acquisition module is shown in Fig. 2;
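The merge in step 1.5 can be sketched in Python as follows. This is only an illustration under stated assumptions: `fetch_index` is a hypothetical stand-in for the HTTP request to an execution node, `FAKE_INDEX_FILES` is made-up data, and the real index file format of Hadoop is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up index data: map task id -> {virtual partition number: data volume}.
FAKE_INDEX_FILES = {
    "map_0": {0: 120, 1: 30, 2: 50},
    "map_1": {0: 80, 1: 10, 2: 70},
    "map_2": {0: 200, 1: 20, 2: 60},
}


def fetch_index(map_task_id):
    """Hypothetical stand-in for the HTTP request that returns one index file."""
    return FAKE_INDEX_FILES[map_task_id]


def collect_metadata(map_task_ids, max_workers=4):
    """Fetch all index files concurrently and sum volumes per virtual partition."""
    totals = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for index in pool.map(fetch_index, map_task_ids):
            for partition, size in index.items():
                totals[partition] = totals.get(partition, 0) + size
    return totals


metadata = collect_metadata(list(FAKE_INDEX_FILES))
print(metadata)  # {0: 400, 1: 60, 2: 180}
```

Only the fetches run in parallel; the per-partition sums are accumulated in the calling thread, so no locking is needed in this sketch.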

2) Process the Map output data. The reduce task obtains the original partition data output by each Map task, and the aggregated metadata is submitted to a repartitioner. So that the input data volume obtained by each reducer is roughly the same, the present invention uses a genetic algorithm to balance the metadata. The genetic algorithm operates on binary bit strings rather than on the data itself. The specific steps are as follows:

2.1 Collect the metadata of the Map output data into one set, which serves as the population, and encode each element of the population. Encoding means representing each element by a code composed of 0s and 1s; the encoding adopted by the present invention uses the number of 1s to represent the index of the subset in which the element is placed. The population is randomly divided into N subsets, where N corresponds to the number of reduce tasks. Each division forms one gene, and repeated divisions form a genome;
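One possible reading of the encoding in step 2.1 is sketched below in Python. The source only states that the number of 1s gives the subset index; the fixed width of N - 1 bits per element, the 0-based indices, the requirement N >= 2, and all function names are assumptions of this sketch.

```python
import random


def random_division(num_elements, n_subsets, rng):
    """Randomly assign every element to one of n_subsets, leaving none empty."""
    if num_elements < n_subsets:
        raise ValueError("need at least one element per subset")
    # Seed each subset with one element, then place the rest at random.
    assignment = list(range(n_subsets))
    assignment += [rng.randrange(n_subsets) for _ in range(num_elements - n_subsets)]
    rng.shuffle(assignment)
    return assignment


def encode(assignment, n_subsets):
    """Encode subset index k as k ones padded with zeros to n_subsets - 1 bits."""
    width = n_subsets - 1
    return "".join("1" * k + "0" * (width - k) for k in assignment)


def decode(gene, n_subsets):
    """Recover subset indices by counting the ones in each fixed-width chunk."""
    width = n_subsets - 1
    return [gene[i:i + width].count("1") for i in range(0, len(gene), width)]


rng = random.Random(0)
N = 3                                  # number of reducers (subsets)
sizes = [400, 60, 180, 90, 250, 120]   # made-up virtual partition volumes
genome = [encode(random_division(len(sizes), N, rng), N) for _ in range(5)]
print(genome[0], decode(genome[0], N))
```

With this width, every bit string decodes to a valid index between 0 and N - 1, which keeps later bit-level operators from producing out-of-range subsets.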

2.2 In a genetic algorithm, the fitness function measures how well an individual is adapted to its environment: individuals with higher fitness get more opportunities to reproduce, and vice versa. The present invention therefore defines a fitness function

\min \frac{1}{n} \sum_{j=1}^{n} \left| S_{j} - \bar{S} \right|

Formula (1),
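Formula (1) and the probability derived from it can be sketched in Python as follows. A gene is represented here in decoded form, as a list of subset indices, and the partition volumes are made up; both are assumptions of the sketch. The probability follows the text literally (value divided by the sum of all values). Because formula (1) is smaller for better-balanced genes, an implementation might invert the value before normalising; the source does not say so.

```python
def subset_sums(assignment, sizes, n_subsets):
    """Element sum S_j of every subset for one decoded gene."""
    sums = [0] * n_subsets
    for subset, size in zip(assignment, sizes):
        sums[subset] += size
    return sums


def fitness(assignment, sizes, n_subsets):
    """Formula (1): mean absolute deviation of the subset sums from their mean."""
    sums = subset_sums(assignment, sizes, n_subsets)
    mean = sum(sums) / n_subsets
    return sum(abs(s - mean) for s in sums) / n_subsets


def selection_probabilities(values):
    """Each value divided by the sum of all values; uniform if the sum is zero."""
    total = sum(values)
    if total == 0:
        return [1 / len(values)] * len(values)
    return [v / total for v in values]


sizes = [400, 60, 180, 90, 250, 120]          # made-up partition volumes
genes = [[0, 1, 2, 0, 1, 2], [0, 1, 1, 2, 2, 1]]
values = [fitness(g, sizes, 3) for g in genes]
print(values)                                  # second gene is better balanced
print(selection_probabilities(values))
```

For the second gene the subset sums are 400, 360, and 340, so its objective value is much smaller than that of the first gene, whose sums are 490, 310, and 300.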

2.3 Apply the selection operator to the genome. The selection operator adopted by the present invention is roulette wheel selection, a commonly used random selection method that resembles the roulette wheel in gambling. Its main idea is to convert the fitness of each individual proportionally into a selection probability and to divide a disk into sectors according to the proportion each individual holds. The disk is spun, and the individual whose sector the pointer rests on when the disk stops is the one selected. The benefit of this selection algorithm is that the larger the probability of an individual, the larger its share of the disk and the larger its chance of being selected. Using this idea, the invention is implemented as follows: a random function generates a random number in [0,1], and its position in the sequence of fitness probabilities of the genome is determined. If it is greater than at most m values in the sequence, the m-th gene is selected. In general, the number of genes to select can be specified freely;
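A minimal Python sketch of roulette wheel selection over a list of selection probabilities is given below. Reading the "fitness probability sequence" as a cumulative sequence is an interpretation, and the function name and sample probabilities are assumptions.

```python
import random
from itertools import accumulate


def roulette_select(probabilities, count, rng):
    """Pick `count` gene indices; each spin lands in one cumulative sector."""
    cumulative = list(accumulate(probabilities))
    chosen = []
    for _ in range(count):
        # Scaling by the last value guards against rounding in the running sum.
        r = rng.random() * cumulative[-1]
        for index, upper in enumerate(cumulative):
            if r < upper:
                chosen.append(index)
                break
        else:
            chosen.append(len(cumulative) - 1)
    return chosen


rng = random.Random(42)
picked = roulette_select([0.1, 0.6, 0.3], count=4, rng=rng)
print(picked)  # indices of the selected genes; index 1 is the most likely
```

A gene with probability 0 owns an empty sector and is never returned, while a gene may be selected more than once.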

2.4 Apply the crossover operation to the selected genes, replacing and recombining parts of the structure of superior genes to form new genes. Crossover is the key feature that distinguishes genetic algorithms from other evolutionary algorithms. The present invention adopts a single-point crossover operator: a crossover point is set at random and, for the genes chosen by roulette wheel selection, the parts of the two genes before and after this point are exchanged, generating two new individuals. To guarantee that the genome after the exchange contains no empty subset, a nullGen flag is set and the genome after crossover is traversed; if an empty subset is found, the nullGen flag is set to false, which marks the gene for deletion;
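Single-point crossover with the empty-subset check might look like the Python sketch below. It assumes a gene is a bit string in which each element occupies N - 1 bits and the count of 1s is the 0-based subset index; that width, the helper names, and the sample parents are assumptions. The `null_gen` value mirrors the nullGen flag of the text: False marks a child that contains an empty subset and should be deleted.

```python
import random


def decode(gene, n_subsets):
    """Count the ones in each chunk of n_subsets - 1 bits to get subset indices."""
    width = n_subsets - 1
    return [gene[i:i + width].count("1") for i in range(0, len(gene), width)]


def has_empty_subset(gene, n_subsets):
    """True if at least one of the n_subsets subsets receives no element."""
    return len(set(decode(gene, n_subsets))) < n_subsets


def single_point_crossover(parent_a, parent_b, n_subsets, rng):
    """Swap the tails of two genes at one random point; flag each child."""
    point = rng.randrange(1, len(parent_a))
    children = [
        parent_a[:point] + parent_b[point:],
        parent_b[:point] + parent_a[point:],
    ]
    # null_gen is False when the child has an empty subset (marked for deletion).
    return [(child, not has_empty_subset(child, n_subsets)) for child in children]


rng = random.Random(3)
results = single_point_crossover("00101100", "11001000", 3, rng)
survivors = [child for child, null_gen in results if null_gen]
print(results, survivors)
```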

2.5 Apply the mutation operation to the genes after crossover. Mutation replaces some genes in the genome with other genes according to the mutation probability, forming a new individual. A genetic algorithm introduces mutation for two purposes. The first is to give the algorithm a local random search capability: once the crossover operator has brought the algorithm close to the neighbourhood of the optimal solution, the local random search of the mutation operator can accelerate convergence to the optimum. In this case the mutation probability should obviously be small, otherwise building blocks close to the optimal solution would be destroyed by mutation. The second is to let the algorithm maintain population diversity and so prevent premature convergence; in this case the convergence probability should take a larger value. Based on these considerations, the present invention adopts a fixed-bit mutation operator and sets the mutation probability to 0.1 in order to obtain the optimal solution. Fixed-bit mutation means mutating one or several fixed, specified bits of an individual's gene: a bit that was 0 becomes 1, and a bit that was 1 becomes 0. After mutation, the mutated gene is checked for empty subsets to ensure that it still has N subsets;
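A sketch of fixed-bit mutation with the non-empty check follows, again in Python. It assumes the bit-string layout of N - 1 bits per element with the count of 1s as the subset index. Applying the probability of 0.1 once per gene, and returning the unmutated gene when the check fails, are choices of this sketch that the source does not specify.

```python
import random


def decode(gene, n_subsets):
    """Count the ones in each chunk of n_subsets - 1 bits to get subset indices."""
    width = n_subsets - 1
    return [gene[i:i + width].count("1") for i in range(0, len(gene), width)]


def has_empty_subset(gene, n_subsets):
    """True if at least one of the n_subsets subsets receives no element."""
    return len(set(decode(gene, n_subsets))) < n_subsets


def fixed_bit_mutation(gene, positions, n_subsets, rng, probability=0.1):
    """With the given probability, flip the bits at the fixed positions."""
    if rng.random() >= probability:
        return gene
    bits = list(gene)
    for p in positions:
        bits[p] = "1" if bits[p] == "0" else "0"   # 0 becomes 1, 1 becomes 0
    mutant = "".join(bits)
    # Non-empty check: keep the original gene if the mutant loses a subset.
    return gene if has_empty_subset(mutant, n_subsets) else mutant


rng = random.Random(5)
mutated = fixed_bit_mutation("00101100", [6], 3, rng, probability=1.0)
print(mutated)  # 00101110
```

In the example the flipped bit moves the fourth element from subset 0 to subset 1, and the first element still keeps subset 0 occupied, so the mutant passes the check.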

2.6 The steps above describe one round of evolution. After multiple rounds of evolution, the genes to retain are selected according to an elite retention strategy. The gene retention strategy adopted by the present invention is: after the above steps, the objective function value of each gene is calculated and compared with the objective function values of all genes in the genome, and the genes whose value is smaller are retained;
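The retention rule in step 2.6 leaves some room for interpretation. The Python sketch below takes one reading, namely keeping the `keep` genes with the smallest objective values; the parameter `keep`, the function name, and the sample values are assumptions.

```python
def retain_elite(genes, objective_values, keep=1):
    """Keep the `keep` genes whose objective value is smallest."""
    ranked = sorted(zip(objective_values, genes), key=lambda pair: pair[0])
    return [gene for _, gene in ranked[:keep]]


genes = ["gene_a", "gene_b", "gene_c"]        # placeholder gene labels
objective_values = [82.2, 22.2, 40.0]         # made-up formula (1) values
elite = retain_elite(genes, objective_values, keep=2)
print(elite)  # ['gene_b', 'gene_c']
```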

2.7 Decode the retained gene to obtain an optimized combination of the metadata, that is, a division of the metadata into N subsets of roughly equal size. The data corresponding to each subset is then assigned to a reducer, which ensures that the data volume handled by each reducer is comparable and solves the problem of input data skew in the reduce stage. The flow chart of the data balancing method based on a genetic algorithm in the MapReduce computation model is shown in Fig. 3.
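Decoding and assignment in step 2.7 can be sketched as follows in Python. The bit-string layout (N - 1 bits per element, count of 1s as the 0-based subset index), the partition ids, and the volumes are assumptions carried over for illustration.

```python
def decode(gene, n_subsets):
    """Count the ones in each chunk of n_subsets - 1 bits to get subset indices."""
    width = n_subsets - 1
    return [gene[i:i + width].count("1") for i in range(0, len(gene), width)]


def assign_to_reducers(gene, partition_ids, sizes, n_reducers):
    """Group virtual partitions by reducer and total the data volume per reducer."""
    plan = {r: [] for r in range(n_reducers)}
    load = {r: 0 for r in range(n_reducers)}
    for reducer, pid, size in zip(decode(gene, n_reducers), partition_ids, sizes):
        plan[reducer].append(pid)
        load[reducer] += size
    return plan, load


sizes = [400, 60, 180, 90, 250, 120]      # made-up virtual partition volumes
best_gene = "001010111110"                # decodes to [0, 1, 1, 2, 2, 1]
plan, load = assign_to_reducers(best_gene, list(range(6)), sizes, 3)
print(plan)  # {0: [0], 1: [1, 2, 5], 2: [3, 4]}
print(load)  # {0: 400, 1: 360, 2: 340}
```

The resulting loads of 400, 360, and 340 are close to the mean of about 367, which is the kind of balance the method aims for.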

Claims

1. A data balancing method based on a genetic algorithm in the MapReduce computation model, characterized in that it comprises the following steps:

1) Obtain the global Map output information, that is, the metadata of the partitions processed by the Reduce tasks. The Reduce metadata is acquired as follows:

1.1 After each Map task finishes processing and writes its output to local disk, its TaskTracker sends a task completion message to the JobTracker by means of a heartbeat message;

1.2 The JobTracker maintains a Map task completion message queue for each MapReduce job. When a TaskTracker running a Reduce task requests Map task completion messages, the JobTracker takes messages from the queue of the job to which the Reduce task belongs and passes them to the TaskTracker;

1.3 A Reduce task in the same job obtains the Map task completion messages from its TaskTracker and extracts the Map task runtime information from them, including the Map task number and the execution node. Using this information, the Reduce task establishes an HTTP connection to the execution node and requests the metadata of the Map task output;

1.4 The TaskTracker reads the index file of the corresponding Map task output from the local file system according to the requested Map task number and sends it to the requesting Reduce task;

1.5 The Reduce task merges the virtual partitions with the same number across the different index files and sums the data volume of all <Key, Value> pairs of the same kind in each virtual partition, so that each Reduce task obtains the metadata output by all Map tasks;

2) Process the Map output data. The Reduce task obtains the original partition data output by each Map task, and the aggregated metadata is submitted to a repartitioner, which uses a genetic algorithm to balance the metadata. The genetic algorithm operates on binary bit strings. The specific steps are as follows:

2.1 Collect the metadata of the Map output data into one set, which serves as the population, and encode each element of the population. Encoding means representing each element by a code composed of 0s and 1s; the encoding adopted by the present invention uses the number of 1s to represent the index of the subset in which the element is placed. The population is randomly divided into N subsets, where N corresponds to the number of Reduce tasks. Each division forms one gene, and repeated divisions form a genome;

2.2 In a genetic algorithm, the fitness function measures how well an individual is adapted to its environment: individuals with higher fitness get more opportunities to reproduce, and vice versa. A fitness function is therefore defined as

\min \frac{1}{n} \sum_{j=1}^{n} \left| S_{j} - \bar{S} \right|

Formula (1),

where S_{j} is the element sum of the j-th subset, \bar{S} is the mean of the element sums of all subsets, and min denotes taking the minimum. The objective function in formula (1) describes the average distance between each subset sum and the mean. Using formula (1), the fitness function value of each gene is calculated, forming a new set. The probability of each gene's fitness function value is then obtained, that is, the fitness function value of the gene divided by the sum of the fitness function values of the whole genome;

2.3 Apply the selection operator to the genome. The selection operator adopted is roulette wheel selection: a random function generates a random number in [0,1], and its position in the sequence of fitness probabilities of the genome is determined. If it is greater than at most m values in the sequence, the m-th gene is selected. The number of genes to select can be specified freely;

2.4 Apply the crossover operation to the selected genes, replacing and recombining parts of the structure of superior genes to form new genes. A single-point crossover operator is adopted: a crossover point is set at random and, for the genes chosen by roulette wheel selection, the parts of the two genes before and after this point are exchanged, generating two new individuals. To guarantee that the genome after the exchange contains no empty subset, a nullGen flag is set and the genome after crossover is traversed; if an empty subset is found, the nullGen flag is set to false, which marks the gene for deletion;

2.5 Apply the mutation operation to the genes after crossover. Mutation replaces some genes in the genome with other genes according to the mutation probability, forming a new individual. A fixed-bit mutation operator is adopted and the mutation probability is set to 0.1 in order to obtain the optimal solution. Fixed-bit mutation means mutating one or several fixed, specified bits of an individual's gene: a bit that was 0 becomes 1, and a bit that was 1 becomes 0. After mutation, the mutated gene is checked for empty subsets to ensure that it still has N subsets;

2.6 The steps above describe one round of evolution. After multiple rounds of evolution, the genes to retain are selected according to an elite retention strategy: after the above steps, the objective function value of each gene is calculated and compared with the objective function values of all genes in the genome, and the genes whose value is smaller are retained;

2.7 Decode the retained gene to obtain an optimized combination of the metadata, that is, a division of the metadata into N subsets of roughly equal size. The data corresponding to each subset is then assigned to a reducer, which ensures that the data volume handled by each reducer is comparable.