CN103218263A - Dynamic determining method and device for MapReduce parameter - Google Patents

Info

Publication number
CN103218263A
Authority
CN
China
Prior art keywords
task
reduce task
reduce
adjusted
default
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100785074A
Other languages
Chinese (zh)
Other versions
CN103218263B (en)
Inventor
林学练
于晨晖
韩军
叶玥
崔晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201310078507.4A
Publication of CN103218263A
Application granted
Publication of CN103218263B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a method and a device for dynamically determining a MapReduce parameter. The method comprises: acquiring a MapReduce job request that includes the data set to be processed, a preset number of Reduce tasks, and a parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started; if the parameter allows adjustment, monitoring the execution of the Map tasks; if the number of executed Map tasks satisfies a preset first threshold, determining an adjusted number of Reduce tasks; and, according to the adjusted number, mapping each unexecuted preset Reduce task to an adjusted Reduce task, so that a reasonable number of Reduce tasks is determined dynamically while the MapReduce job runs.

Description

Method and device for dynamically determining a MapReduce parameter
Technical field
The present invention relates to the field of distributed computing, and in particular to a method and a device for dynamically determining a MapReduce parameter.
Background
MapReduce is a distributed computing framework that borrows ideas from functional programming to process large data sets efficiently in a distributed manner. The framework divides a computing job (Job) into several Map tasks and Reduce tasks. The input and output of both the Map phase and the Reduce phase use a key-value data model, and the Reduce phase takes the output of the Map phase as its input. The number of Map tasks is determined by the input data set, whereas the number of Reduce tasks is specified by the user. Because the input data set is generally large, it is split into multiple data blocks (chunks). After a MapReduce job is submitted, the scheduler (Master) of the MapReduce framework determines the number of Map tasks from the number of data blocks in the input data set, so that each Map task processes one data block.
The data block input to each Map task is converted into key-value form. The Map computation outputs intermediate results in key-value form, which are sorted by key and written to the local disk of the computing node on which the Map task runs. According to the number of Reduce tasks specified by the user, the MapReduce framework partitions the intermediate results by key and aggregates the values that fall into the same partition. The input of each Reduce task is a part of the intermediate results output by multiple Map tasks. For example, if the user specifies n Reduce tasks, there are n partitions; the intermediate results belonging to each partition are transferred over the network to the Reduce task responsible for that partition, which runs the user-specified Reduce algorithm and finally outputs the result.
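The partitioning step described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: it assumes hash partitioning (a common default in MapReduce implementations, but not specified in this document), and the names `partition_for` and `partition_pairs` are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reduce_tasks: int) -> int:
    # Assumption: hash partitioning. crc32 is used because it is
    # deterministic across processes, unlike Python's built-in str hash.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def partition_pairs(pairs, num_reduce_tasks):
    """Sort intermediate (key, value) pairs by key and group them by partition."""
    partitions = defaultdict(list)
    for key, value in sorted(pairs, key=lambda kv: kv[0]):
        partitions[partition_for(key, num_reduce_tasks)].append((key, value))
    return dict(partitions)


# With n = 4 Reduce tasks there are at most 4 partitions, and every pair
# with the same key lands in the same partition.
parts = partition_pairs([("b", 1), ("a", 2), ("b", 3)], 4)
```

Because the partition index depends on `num_reduce_tasks`, the Map phase cannot partition its output until that number is known, which is the constraint the following paragraph discusses.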
In existing MapReduce frameworks, the exact number of Reduce tasks must be known before the Map tasks run, so that the Map phase can partition its intermediate output according to the user-specified number of Reduce tasks. Because this number is normally set manually by the user, the Reduce tasks are run according to the set number regardless of how much intermediate data the Map phase produces. When the Map phase outputs very little intermediate data, 1 or 2 Reduce tasks would suffice, yet the user-specified number may be much larger than 2, and running that many Reduce tasks wastes resources. Conversely, when the Map phase outputs a large amount of intermediate data and the user-specified number of Reduce tasks is small, running only that number of Reduce tasks makes the execution time too long.
Summary of the invention
The object of the present invention is to provide a method and a device for dynamically determining a MapReduce parameter, so that a reasonable number of Reduce tasks is determined dynamically while a MapReduce job runs.
A first aspect of the present invention provides a method for dynamically determining a MapReduce parameter, comprising:
obtaining a MapReduce job request, where the MapReduce job request includes the data set to be processed, a preset number of Reduce tasks, and a parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started;
if the parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started is set to allow, monitoring the execution of the Map tasks;
if the number of executed Map tasks satisfies a preset first threshold, determining the adjusted number of Reduce tasks, where the output of the executed Map tasks is mapped to partitions whose number equals the preset number of Reduce tasks;
according to the adjusted number of Reduce tasks, mapping each unexecuted preset Reduce task to an adjusted Reduce task, so that the adjusted Reduce tasks are executed.
Another aspect of the present invention provides a device for dynamically determining a MapReduce parameter, comprising:
a job request acquisition module, configured to obtain a MapReduce job request, where the MapReduce job request includes the data set to be processed, a preset number of Reduce tasks, and a parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started;
a monitoring module, configured to monitor the execution of the Map tasks if the parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started is set to allow;
a determination module, configured to determine the adjusted number of Reduce tasks if the number of executed Map tasks satisfies a preset first threshold, where the output of the executed Map tasks is mapped to partitions whose number equals the preset number of Reduce tasks;
a mapping module, configured to map, according to the adjusted number of Reduce tasks, each unexecuted preset Reduce task to an adjusted Reduce task, so that the adjusted Reduce tasks are executed.
The beneficial effect of the above technical solution is as follows: by monitoring the Master of the MapReduce framework, the number of Reduce tasks in a MapReduce job request can be adjusted dynamically according to the execution status of the Map tasks, and the Reduce tasks are then executed according to the adjusted number. This solves problems in the prior art, such as wasted resources or excessive execution time, that are caused by the user statically specifying the number of Reduce tasks.
Brief description of the drawings
Fig. 1 is a flowchart of a method for dynamically determining a MapReduce parameter according to Embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of a device for dynamically determining a MapReduce parameter according to Embodiment 2 of the present invention.
Detailed description of the embodiments
Fig. 1 is a flowchart of a method for dynamically determining a MapReduce parameter according to Embodiment 1 of the present invention. As shown in Fig. 1, the method may include the following steps:
Step 101: obtain a MapReduce job request. The MapReduce job request includes the data set to be processed, a preset number of Reduce tasks, and a parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started.
It should be noted that the method of this embodiment may be executed by a device for dynamically determining a MapReduce parameter. This device monitors the Master of the MapReduce framework and can therefore obtain the MapReduce job request through the Master. The job request may include the data set to be processed, a preset number of Reduce tasks, and a parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started. Because the exact number of Reduce tasks must be known before the Map tasks run, so that the Map phase can partition its intermediate output by that number, every MapReduce job request presets a number of Reduce tasks so that the Map tasks can be executed. In addition, the device decides whether to start adjusting the number of Reduce tasks according to the parameter indicating whether the adjustment mechanism may be started. For example, the parameter may take a value meaning that starting is allowed or not allowed, such as True, False, or left unset (default). In this embodiment, True may indicate that, while the MapReduce job is run on the data set, the device is allowed to adjust the number of Reduce tasks dynamically, so that the Reduce tasks are executed according to the adjusted number. False or an unset value may indicate that the device is not allowed to adjust the number of Reduce tasks dynamically, so that the Reduce tasks can only be executed according to the preset number of Reduce tasks.
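The three fields of the job request and the True / False / unset semantics can be sketched as below. The class and field names (`JobRequest`, `allow_adjustment`, and so on) are hypothetical and chosen only for illustration; the document does not name them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRequest:
    """Hypothetical container for the three fields named in Step 101."""
    dataset_path: str                        # data set to be processed
    preset_reduce_tasks: int                 # preset number of Reduce tasks
    allow_adjustment: Optional[bool] = None  # True, False, or unset (None)


def adjustment_enabled(request: JobRequest) -> bool:
    # Only an explicit True enables dynamic adjustment; False or an
    # unset value keeps the preset number of Reduce tasks.
    return request.allow_adjustment is True
```

Treating the unset value the same as False follows the description above, where both mean that the preset number must be used.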
Step 102: if the parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started is set to allow, monitor the execution of the Map tasks.
In this embodiment, when the parameter is set to allow, it triggers the device to start adjusting the number of Reduce tasks. The device therefore continues to monitor the Master of the MapReduce framework and determines the number of Map tasks from the data set to be processed, or obtains the number of Map tasks in this MapReduce job from the Master; at the same time it monitors the execution of the Map tasks according to the job request.
Step 103: if the number of executed Map tasks satisfies the preset first threshold, determine the adjusted number of Reduce tasks.
In this embodiment, so that the number of Reduce tasks can be adjusted reasonably, the total input data volume of the Reduce tasks in this MapReduce job may be determined after the Map tasks have run for some time, from the number of executed Map tasks and the data volume of the intermediate results they have output. Specifically, because every Map task receives the same amount of input data, the data volume of the intermediate results that each one outputs is also substantially the same. And because the input of the Reduce tasks is exactly the output of the Map tasks, the total input data volume of the Reduce tasks can be determined from the number of executed Map tasks, the data volume of their intermediate results, and the number of Map tasks in the job; the number of Reduce tasks is then adjusted according to this total. Because the intermediate output was partitioned according to the preset number of Reduce tasks when the Map tasks began to run, the intermediate results output by the executed Map tasks are mapped to partitions whose number equals the preset number of Reduce tasks. The first threshold serves as the trigger for the device to start adjusting the number of Reduce tasks: when the number of executed Map tasks satisfies the preset first threshold, the device is triggered to adjust the number of Reduce tasks dynamically according to the executed Map tasks, thereby determining the adjusted number of Reduce tasks. In general, the adjusted number of Reduce tasks is smaller than the preset number.
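The extrapolation in Step 103 can be illustrated with a short sketch. It assumes, as the paragraph above does, that each Map task produces roughly the same amount of intermediate data; the function name and the byte units are hypothetical.

```python
def estimate_reduce_input(completed_output_bytes, total_map_tasks):
    """Extrapolate the total Reduce input from the Map tasks finished so far.

    completed_output_bytes: intermediate output size of each executed Map task.
    total_map_tasks: number of Map tasks in the whole job.
    """
    completed = len(completed_output_bytes)
    if completed == 0:
        raise ValueError("need at least one executed Map task")
    # Assumption from the text: every Map task outputs about the same
    # amount, so the average of the finished tasks stands in for all tasks.
    average = sum(completed_output_bytes) / completed
    return average * total_map_tasks


# Three of 30 Map tasks have finished with 100, 120 and 80 bytes of output.
total_input = estimate_reduce_input([100, 120, 80], 30)
```

The estimate becomes more reliable as more Map tasks finish, which is one reason to wait for the first threshold before adjusting.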
Step 104: according to the adjusted number of Reduce tasks, map each unexecuted preset Reduce task to an adjusted Reduce task, so that the adjusted Reduce tasks are executed.
In this embodiment, because the intermediate output was partitioned according to the preset number of Reduce tasks when the Map tasks began to run, each preset Reduce task corresponds to one partition. Therefore, after the number of Reduce tasks is adjusted, the partitions corresponding to the unexecuted preset Reduce tasks are mapped to the adjusted Reduce tasks. That is, one adjusted Reduce task may correspond to the partitions of one or more preset Reduce tasks, so that the Master can schedule and execute the adjusted Reduce tasks.
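One way to picture the mapping in Step 104 is the sketch below, which groups the partitions of the unexecuted preset Reduce tasks into the adjusted Reduce tasks. The round-robin assignment is an assumption made for illustration; the document only requires that each adjusted task cover one or more preset partitions.

```python
def group_partitions(pending_partitions, adjusted_count):
    """Assign each pending preset partition to exactly one adjusted Reduce task."""
    if adjusted_count < 1:
        raise ValueError("adjusted_count must be at least 1")
    # Never create more adjusted tasks than there are pending partitions.
    count = min(adjusted_count, len(pending_partitions))
    groups = [[] for _ in range(count)]
    for index, partition in enumerate(pending_partitions):
        groups[index % count].append(partition)  # round-robin (assumption)
    return groups


# Five pending preset partitions handled by two adjusted Reduce tasks.
groups = group_partitions([0, 1, 2, 3, 4], 2)
```

A size-aware grouping that balances the data volume per adjusted task would also satisfy the description; round-robin is simply the shortest correct instance.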
In this embodiment, by monitoring the Master of the MapReduce framework, the number of Reduce tasks in a MapReduce job request can be adjusted dynamically according to the execution status of the Map tasks, and the Reduce tasks are then executed according to the adjusted number. This solves problems in the prior art, such as wasted resources or excessive execution time, that are caused by the user statically specifying the number of Reduce tasks.
Specifically, the preset first threshold in the above embodiment may be a preset threshold on the number of Map tasks, for example 30 Map tasks or 100 Map tasks. The specific value can be set according to the actual job and is not limited in this embodiment. When the number of executed Map tasks satisfies this threshold, the device is triggered to start adjusting the number of Reduce tasks, thereby determining the adjusted number of Reduce tasks.
Preferably, the preset first threshold in the above embodiment may be a preset ratio, for example 1/5 or 1/3. The specific value can be set according to the actual job and is not limited in this embodiment. When the ratio of the number of executed Map tasks to the total number of Map tasks satisfies the preset ratio, the device is triggered to start adjusting the number of Reduce tasks, thereby determining the adjusted number of Reduce tasks. In this embodiment, the total number of Map tasks may be determined from the data set to be processed, or obtained from the Master.
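The two forms of the first threshold can be checked with one small function. This is a sketch under two assumptions: "satisfies" is read as "greater than or equal to", and an integer threshold means an absolute count while a `Fraction` means a ratio. The function name is hypothetical.

```python
from fractions import Fraction


def first_threshold_met(completed_maps, total_maps, threshold):
    """Return True once enough Map tasks have been executed.

    threshold: an int is an absolute number of Map tasks (e.g. 30);
               a Fraction is a ratio of all Map tasks (e.g. Fraction(1, 5)).
    """
    if isinstance(threshold, int):
        return completed_maps >= threshold
    # Exact rational comparison avoids floating-point edge cases at the boundary.
    return Fraction(completed_maps, total_maps) >= threshold


met_by_count = first_threshold_met(30, 200, 30)
met_by_ratio = first_threshold_met(40, 200, Fraction(1, 5))
```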
Further, on the basis of any of the above embodiments, the MapReduce job request may also include a preset second threshold used to schedule the Reduce tasks for early execution. The second threshold may likewise be a preset threshold on the number of Map tasks or a preset ratio; that is, when the number of executed Map tasks satisfies the second threshold, the Master is triggered to begin executing the preset Reduce tasks.
In this embodiment, when the second threshold is smaller than the first threshold, that is, when the number of executed Map tasks satisfies the preset second threshold but not the preset first threshold, the device may also monitor the execution of the preset Reduce tasks, where each preset Reduce task corresponds to one of the partitions whose number equals the preset number of Reduce tasks. After the number of executed Map tasks satisfies the preset first threshold, the device may also instruct that the unexecuted preset Reduce tasks be stopped. The data volume of the unexecuted Reduce tasks is then determined from the number of executed Map tasks and the total data volume of the Reduce tasks in this MapReduce job, and the number of Reduce tasks that will handle it is determined from that data volume. Specifically, when the unexecuted Map tasks are run, their intermediate output is mapped to partitions whose number equals the redetermined number of Reduce tasks; that is, the unexecuted preset Reduce tasks are carried out by the redetermined Reduce tasks.
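The interplay of the two thresholds can be sketched as a simple phase function. The phase names, the function names, and the reading of "satisfies" as "greater than or equal to" are assumptions for illustration; only the case where the second threshold is smaller than the first is covered.

```python
def scheduling_phase(completed_maps: int, first_threshold: int,
                     second_threshold: int) -> str:
    """Classify job progress when second_threshold < first_threshold."""
    if second_threshold >= first_threshold:
        raise ValueError("sketch only covers second_threshold < first_threshold")
    if completed_maps >= first_threshold:
        # Stop preset Reduce tasks that have not run yet and adjust the count.
        return "stop_pending_and_adjust"
    if completed_maps >= second_threshold:
        # The Master may already start preset Reduce tasks early.
        return "run_preset_reduces"
    return "map_only"


def pending_partitions(preset_count: int, started: set) -> list:
    """Preset partitions whose Reduce task has not started yet."""
    return [p for p in range(preset_count) if p not in started]


phase = scheduling_phase(12, first_threshold=30, second_threshold=10)
remaining = pending_partitions(6, {0, 2})
```

Only the partitions returned by `pending_partitions` are candidates for being regrouped into adjusted Reduce tasks; partitions whose preset Reduce task already ran are left alone.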
For a MapReduce job with its two phases, Map and Reduce, the number of Map tasks is fixed, so the cost of the Map tasks is also relatively fixed. Using the MapReduce performance model to calculate the total time cost and the execution time of the same MapReduce job under different numbers of Reduce tasks shows that the total time cost of the job usually increases as the number of Reduce tasks increases, while at the same time a larger number of Reduce tasks improves the computing capacity of the cluster and the parallelism between tasks and so shortens the execution time of the job; the converse also holds. Therefore, in one specific implementation of the present invention, the cost of the Reduce tasks is calculated with the MapReduce performance model, and a balance point between time cost and execution time is sought in order to adjust the number of Reduce tasks.
For example, in the MapReduce performance model, the time cost of a Reduce task comprises ten terms: TR1_init, TR2_read, TR3_net, TR4_merge, TR5_serial, TR6_io, TR7_parse, TR8_Reducer, TR9_net and TR10_write. They are usually calculated as follows:
TR1_init is the system time cost of initializing a Reduce task, that is, the time to initialize, open and close the task, load programs, and so on: $\text{TR1\_init} = \text{RedSysCost} + \text{RedInit}$;
TR2_read is the I/O cost of reading data when the Reduce task starts: $\text{TR2\_read} = \text{ReduceInput} / \text{SeqRead}$;
TR3_net is the network cost of transferring data when the Reduce task runs: $\text{TR3\_net} = \text{ReduceInput} / \text{BandWidth}$;
TR4_merge is the CPU cost of sorting when the Reduce task runs: $\text{TR4\_merge} = \text{SortCEF} \cdot \text{RIRNumber} \cdot \log \text{NumberOfMap}$;
TR5_serial is the time cost of serialization when the Reduce task runs: $\text{TR5\_serial} = \text{se1} \cdot \text{RIRNumber} + \text{se2} \cdot \text{ReduceInput}$;
TR6_io is the disk I/O cost incurred when the Reduce task runs: $\text{TR6\_io} = \text{ReduceInput} \cdot \left( \frac{1}{\text{SeqRead}} + \frac{1}{\text{SeqWrite}} \right)$;
TR7_parse is the time cost of parsing data when the Reduce task runs: $\text{TR7\_parse} = \text{pa1} \cdot \text{RIRNumber} + \text{pa2} \cdot \text{ReduceInput}$;
TR8_Reducer is the time cost of the Reduce function computation: $\text{TR8\_Reducer} = \text{ReduceInput} \cdot \text{ComplexOfReduce} \cdot \text{CEF}$;
TR9_net is the network cost of transferring data after the Reduce task has run: $\text{TR9\_net} = \frac{\text{HDFSReplica} \cdot \text{ReduceOutput}}{\text{BandWidth}}$;
TR10_write is the I/O cost of writing data to disk after the Reduce task has run: $\text{TR10\_write} = \frac{\text{HDFSReplica} \cdot \text{ReduceOutput}}{\text{SeqWrite}}$.
Therefore, the time cost TR of each Reduce task equals the sum of the above terms, where $TR_k$ denotes the k-th term:
$$TR = \sum_{k=1}^{10} TR_k$$
Suppose the MapReduce job has n Reduce tasks. The total time cost TRS of all Reduce tasks in the job is then:
$$TRS = n \cdot TR$$
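Substituting the above expressions into the cost formula for TR, the time cost TR of executing one Reduce task equals:
$$
\begin{aligned}
TR ={} & \text{RedSysCost} + \text{RedInit} + \frac{\text{ReduceInput}}{\text{SeqRead}} + \frac{\text{ReduceInput}}{\text{BandWidth}} \\
& + \text{SortCEF} \cdot \text{RIRNumber} \cdot \log \text{NumberOfMap} + (\text{se1} \cdot \text{RIRNumber} + \text{se2} \cdot \text{ReduceInput}) \\
& + \text{ReduceInput} \cdot \left( \frac{1}{\text{SeqRead}} + \frac{1}{\text{SeqWrite}} \right) + \text{pa1} \cdot \text{RIRNumber} + \text{pa2} \cdot \text{ReduceInput} \\
& + \text{ReduceInput} \cdot \text{ComplexOfReduce} \cdot \text{CEF} + \frac{\text{HDFSReplica} \cdot \text{ReduceOutput}}{\text{BandWidth}} + \frac{\text{HDFSReplica} \cdot \text{ReduceOutput}}{\text{SeqWrite}}
\end{aligned}
$$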
Here, RedSysCost is the time cost of initializing, opening and closing a Reduce task, generally set to 2000 milliseconds (ms); RedInit is the time needed to load additional programs or data when the Reduce task is initialized; ReduceInput is the data volume input to one Reduce task; SeqRead is the amount of data read from disk in a preset time; BandWidth is the network transfer rate in a preset time; SortCEF is the sorting coefficient; RIRNumber is the number of input data records of a Reduce task; NumberOfMap is the number of Map tasks; SeqWrite is the amount of data written to disk in a preset time; ComplexOfReduce is the complexity of the computation performed by the Reduce task; HDFSReplica is the number of replicas; ReduceOutput is the data volume output by the Reduce task. se1 and se2 are the two parameters of the serialization stage: in the MapReduce performance model the time cost of serialization is linear in the data volume, and se1 and se2 are the two parameters of this linear relationship; for a given cluster they do not change and can therefore be regarded as constants. pa1 and pa2 are the two constant parameters of the parsing stage (they appear in TR7_parse). CEF is the time the cluster machines need to process a standard operation and is also a constant. Except for ReduceInput, ReduceOutput, RIRNumber and NumberOfMap, which depend on the data volume of the Map tasks and Reduce tasks in the MapReduce job, all the other parameters are known values preset by the system.
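The ten-term cost of one Reduce task can be written out directly. The sketch below follows the expanded formula for TR term by term; the class and function names are hypothetical, and the base of the logarithm (base 2 here) is an assumption, since the document does not state it.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelParams:
    """System-preset parameters of the performance model (names from the text)."""
    red_sys_cost: float       # RedSysCost
    red_init: float           # RedInit
    seq_read: float           # SeqRead
    seq_write: float          # SeqWrite
    bandwidth: float          # BandWidth
    sort_cef: float           # SortCEF
    se1: float
    se2: float
    pa1: float
    pa2: float
    complex_of_reduce: float  # ComplexOfReduce
    cef: float                # CEF
    hdfs_replica: int         # HDFSReplica


def reduce_task_cost(p, reduce_input, reduce_output, rir_number, number_of_map):
    """Time cost TR of one Reduce task as the sum of TR1 .. TR10."""
    log_maps = math.log2(number_of_map)  # log base is an assumption
    terms = [
        p.red_sys_cost + p.red_init,                        # TR1_init
        reduce_input / p.seq_read,                          # TR2_read
        reduce_input / p.bandwidth,                         # TR3_net
        p.sort_cef * rir_number * log_maps,                 # TR4_merge
        p.se1 * rir_number + p.se2 * reduce_input,          # TR5_serial
        reduce_input * (1 / p.seq_read + 1 / p.seq_write),  # TR6_io
        p.pa1 * rir_number + p.pa2 * reduce_input,          # TR7_parse
        reduce_input * p.complex_of_reduce * p.cef,         # TR8_Reducer
        p.hdfs_replica * reduce_output / p.bandwidth,       # TR9_net
        p.hdfs_replica * reduce_output / p.seq_write,       # TR10_write
    ]
    return sum(terms)


# Illustrative (made-up) values; only RedSysCost = 2000 ms comes from the text.
params = ModelParams(2000, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1)
cost = reduce_task_cost(params, reduce_input=10, reduce_output=5,
                        rir_number=2, number_of_map=4)
```

Keeping the terms in a list, in the same order as the text, makes it easy to compare each line with its formula.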
Further, let n be the number of Reduce tasks in the MapReduce job. Then ReduceInput = Input/n, where Input is the total input data volume of all Reduce tasks in the job. Let Y be the data conversion rate of Reduce, that is, the ratio of the Reduce output data volume to the input data volume, Y = ReduceOutput/ReduceInput. Then ReduceOutput = Y*ReduceInput = Y*Input/n, and
RIRNumber = ReduceInput/RIRLength = Input/(n*RIRLength), where RIRLength is the average length of an input data record.
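Substituting these into the above formula and simplifying gives:
$$
\begin{aligned}
TR ={} & \text{RedSysCost} + \text{RedInit} + \Big[ \frac{2}{\text{SeqRead}} + (1 + \text{HDFSReplica} \cdot Y) \cdot \left( \frac{1}{\text{SeqWrite}} + \frac{1}{\text{BandWidth}} \right) \\
& + \frac{\text{se1} + \text{pa1} + \text{SortCEF} \cdot \log \text{NumberOfMap}}{\text{RIRLength}} + (\text{se2} + \text{pa2} + \text{ComplexOfReduce} \cdot \text{CEF}) \Big] \cdot \frac{\text{Input}}{n}
\end{aligned}
$$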
Let $a = \text{RedSysCost} + \text{RedInit}$ and
$$
b = \frac{2}{\text{SeqRead}} + (1 + \text{HDFSReplica} \cdot Y) \cdot \left( \frac{1}{\text{SeqWrite}} + \frac{1}{\text{BandWidth}} \right) + \frac{\text{se1} + \text{pa1} + \text{SortCEF} \cdot \log \text{NumberOfMap}}{\text{RIRLength}} + (\text{se2} + \text{pa2} + \text{ComplexOfReduce} \cdot \text{CEF})
$$
For a given MapReduce job, the values of a and b can be computed from the above parameters and are effectively constants. The time cost of executing one Reduce task is therefore $TR = a + b \cdot \text{Input} / n$, and correspondingly $TRS = n \cdot TR = a \cdot n + b \cdot \text{Input}$.
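The constants a and b and the two cost functions can be computed as in the sketch below. The function names and the sample values are hypothetical, and the logarithm is again taken to base 2 as an assumption.

```python
import math


def coefficients(*, red_sys_cost, red_init, seq_read, seq_write, bandwidth,
                 hdfs_replica, y, se1, se2, pa1, pa2, sort_cef,
                 number_of_map, rir_length, complex_of_reduce, cef):
    """Return (a, b) such that TR = a + b * Input / n."""
    a = red_sys_cost + red_init
    b = (2 / seq_read
         + (1 + hdfs_replica * y) * (1 / seq_write + 1 / bandwidth)
         + (se1 + pa1 + sort_cef * math.log2(number_of_map)) / rir_length
         + (se2 + pa2 + complex_of_reduce * cef))
    return a, b


def per_task_cost(a, b, total_input, n):
    """TR: time cost of one of n Reduce tasks."""
    return a + b * total_input / n


def total_cost(a, b, total_input, n):
    """TRS = n * TR, which equals a * n + b * Input."""
    return n * per_task_cost(a, b, total_input, n)


# Illustrative (made-up) parameter values.
a, b = coefficients(red_sys_cost=2000, red_init=0, seq_read=1, seq_write=1,
                    bandwidth=1, hdfs_replica=1, y=0.5, se1=0, se2=0,
                    pa1=0, pa2=0, sort_cef=1, number_of_map=4, rir_length=5,
                    complex_of_reduce=0, cef=1)
tr = per_task_cost(a, b, total_input=100, n=10)
trs = total_cost(a, b, total_input=100, n=10)
```

The sketch makes the trade-off visible: raising n lowers `per_task_cost` (shorter execution time) but raises `total_cost` by a for every extra task.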
Given the forms of the time cost TR of one Reduce task and the total time cost TRS of all Reduce tasks in the MapReduce job, the point at which the first derivatives of TR and TRS with respect to n sum to 0 may be chosen as the balance point for adjusting the number of Reduce tasks, that is:
$$\frac{\partial (TR)}{\partial n} + \frac{\partial (TRS)}{\partial n} = 0$$
Substituting TR and TRS into this equation gives:
$$n = \sqrt{\frac{b \cdot \text{Input}}{a}} = \sqrt{\frac{b}{a} \cdot \text{Input}}$$
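The intermediate steps of this result can be written out as a short worked derivation. It uses only $TR = a + b \cdot \text{Input}/n$ and $TRS = a \cdot n + b \cdot \text{Input}$ from the preceding paragraph and treats n as a continuous positive variable, which is an assumption implicit in taking derivatives.

```latex
\begin{aligned}
\frac{\partial (TR)}{\partial n} &= -\frac{b \cdot \text{Input}}{n^{2}}, \qquad
\frac{\partial (TRS)}{\partial n} = a, \\
-\frac{b \cdot \text{Input}}{n^{2}} + a &= 0
\;\Longrightarrow\; n^{2} = \frac{b \cdot \text{Input}}{a}
\;\Longrightarrow\; n = \sqrt{\frac{b}{a} \cdot \text{Input}} \quad (n > 0).
\end{aligned}
```

Only the positive root is kept because n is a number of tasks.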
The number of Reduce tasks can thus be determined from the above formula. Further, to avoid the situation where each Reduce task handles so much data that a single Reduce task runs for too long, the maximum data volume processed by a single Reduce task may be limited to U. Each MapReduce job then needs at least Input/U Reduce tasks to process all the data. The value of U may specifically be U = ReduceJVM*ShuffleBufferPercent*ShuffleMergePercent. Therefore:
$$
n = \begin{cases}
\sqrt{\dfrac{b}{a} \cdot \text{Input}}, & \text{if } \text{Input} < U^{2} \cdot \dfrac{b}{a} \\[2ex]
\dfrac{\text{Input}}{U}, & \text{otherwise}
\end{cases}
$$
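The piecewise rule can be sketched as follows. Rounding the result up to a whole number of tasks, and never returning fewer than one task, are assumptions added for illustration; the document gives only the real-valued formula. The function names are hypothetical.

```python
import math


def max_input_per_task(reduce_jvm, shuffle_buffer_percent, shuffle_merge_percent):
    """U = ReduceJVM * ShuffleBufferPercent * ShuffleMergePercent."""
    return reduce_jvm * shuffle_buffer_percent * shuffle_merge_percent


def adjusted_reduce_count(a, b, total_input, u):
    """Number of Reduce tasks from the balance point, capped by U per task."""
    if total_input < u ** 2 * b / a:
        # Balance point already keeps each task's input at or below U.
        n = math.sqrt(b / a * total_input)
    else:
        # Otherwise use the minimum count that respects the cap U.
        n = total_input / u
    return max(1, math.ceil(n))  # rounding up is an assumption


u = max_input_per_task(1000, 0.5, 0.5)
small_job = adjusted_reduce_count(a=4, b=1, total_input=100, u=50)
large_job = adjusted_reduce_count(a=4, b=1, total_input=1000, u=50)
```

The condition Input < U² · b/a is exactly the case in which the balance point is at least Input/U, so in both branches each task receives at most U of input.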
In this embodiment, by monitoring the Master of the MapReduce framework, the total data volume of the Reduce tasks in the MapReduce job can be determined from the execution status of the Map tasks. The number of Reduce tasks is then adjusted according to this total data volume while also taking the time cost of executing the Reduce tasks into account, so that a reasonable number of Reduce tasks is determined and the Reduce tasks are executed according to the adjusted number. This solves problems in the prior art, such as wasted resources or excessive execution time, that are caused by the user statically specifying the number of Reduce tasks, and at the same time brings the time cost of executing the Reduce tasks close to its minimum.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical discs.
Fig. 2 is a schematic structural diagram of a device for dynamically determining a MapReduce parameter according to Embodiment 2 of the present invention. As shown in Fig. 2, the device of this embodiment may include:
a job request acquisition module 201, configured to obtain a MapReduce job request, where the MapReduce job request includes the data set to be processed, a preset number of Reduce tasks, and a parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started;
a monitoring module 202, configured to monitor the execution of the Map tasks if the parameter indicating whether the mechanism for adjusting the number of Reduce tasks may be started is set to allow;
a determination module 203, configured to determine the adjusted number of Reduce tasks if the number of executed Map tasks satisfies the preset first threshold, where the output of the executed Map tasks is mapped to partitions whose number equals the preset number of Reduce tasks;
a mapping module 204, configured to map, according to the adjusted number of Reduce tasks, each unexecuted preset Reduce task to an adjusted Reduce task, so that the adjusted Reduce tasks are executed.
The device of this embodiment may be used to carry out the technical solution of the method embodiment shown in Fig. 1; its implementation principle and technical effect are similar and are not repeated here.
Further, the determination module may specifically be configured to determine the adjusted number of Reduce tasks according to the total data volume of the unexecuted Reduce tasks.
Preferably, the determination module may specifically be configured to:
determine the adjusted number of Reduce tasks according to the following formula:
$$\frac{\partial (TR)}{\partial n} + \frac{\partial (TRS)}{\partial n} = 0$$
where n is the adjusted number of Reduce tasks, TR is the time cost of executing one adjusted Reduce task, TRS is the total time cost of executing all adjusted Reduce tasks, and TR depends on the total data volume of the unexecuted Reduce tasks and on the adjusted number of Reduce tasks.
Specifically, the MapReduce job request in the above embodiment may also include a preset second threshold used to schedule the Reduce tasks for early execution. If the second threshold is smaller than the first threshold, the monitoring module may specifically be configured to:
monitor the execution of the preset Reduce tasks if the number of executed Map tasks satisfies the second threshold but not the preset first threshold, where each preset Reduce task corresponds to one of the partitions whose number equals the preset number of Reduce tasks;
and, after the number of executed Map tasks satisfies the preset first threshold, instruct that the unexecuted preset Reduce tasks be stopped.
Specifically, in any of the above embodiments, the preset first threshold may be a preset threshold on the number of Map tasks or a preset ratio, and the determination module is then specifically configured to:
start the mechanism for adjusting the number of Reduce tasks and determine the adjusted number of Reduce tasks if the number of executed Map tasks satisfies the preset threshold on the number of Map tasks; or
start the mechanism for adjusting the number of Reduce tasks and determine the adjusted number of Reduce tasks if the ratio of the number of executed Map tasks to the total number of Map tasks satisfies the preset ratio, where the total number of Map tasks is determined from the data set to be processed.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for dynamically determining a MapReduce parameter, characterized by comprising:
obtaining a MapReduce job request, the MapReduce job request comprising a dataset to be processed, a preset number of Reduce tasks, and a parameter indicating whether starting an adjustment mechanism for the number of Reduce tasks is allowed;
if the parameter indicating whether starting the adjustment mechanism for the number of Reduce tasks is allowed is set to allowed, monitoring the execution of Map tasks;
if the number of executed Map tasks satisfies a preset first threshold, the output results of the executed Map tasks having been mapped to partitions equal in number to the preset number of Reduce tasks, determining an adjusted number of Reduce tasks;
according to the adjusted number of Reduce tasks, mapping each unexecuted preset Reduce task to a corresponding adjusted Reduce task, so that each adjusted Reduce task is executed.
2. The method according to claim 1, characterized in that determining the adjusted number of Reduce tasks specifically comprises:
determining the adjusted number of Reduce tasks according to the total data volume of the unexecuted Reduce tasks.
3. The method according to claim 2, characterized in that determining the adjusted number of Reduce tasks according to the total data volume of the unexecuted Reduce tasks specifically comprises:
determining the adjusted number of Reduce tasks according to the following formula:
$$\frac{\partial (TR)}{\partial (n)} + \frac{\partial (TRS)}{\partial (n)} = 0;$$
where n is the adjusted number of Reduce tasks, TR is the time cost of executing one adjusted Reduce task, TRS is the total time cost of executing all the adjusted Reduce tasks, and TR depends on the total data volume of the unexecuted Reduce tasks and on the adjusted number of Reduce tasks.
4. The method according to claim 1, characterized in that the MapReduce job request further comprises a preset second threshold used to schedule the execution of Reduce tasks in advance, and, if the second threshold is less than the first threshold, the method further comprises, before the number of executed Map tasks satisfies the preset first threshold:
if the number of executed Map tasks satisfies the second threshold but not the preset first threshold, monitoring the execution of the preset Reduce tasks, each preset Reduce task corresponding to one of the partitions, the number of partitions being equal to the preset number of Reduce tasks;
and further comprises, after the number of executed Map tasks satisfies the preset first threshold:
instructing that execution of the unexecuted preset Reduce tasks be stopped.
5. The method according to any one of claims 1 to 4, characterized in that the preset first threshold is a preset threshold on the number of Map tasks or a preset ratio, and determining the adjusted number of Reduce tasks if the number of executed Map tasks satisfies the preset first threshold is specifically:
if the number of executed Map tasks satisfies the preset threshold on the number of Map tasks, starting the adjustment mechanism for the number of Reduce tasks and determining the adjusted number of Reduce tasks; or,
if the ratio of the number of executed Map tasks to the total number of Map tasks satisfies the preset ratio, starting the adjustment mechanism for the number of Reduce tasks and determining the adjusted number of Reduce tasks, the total number of Map tasks being determined from the dataset to be processed.
6. A device for dynamically determining a MapReduce parameter, characterized by comprising:
a job request acquisition module, configured to obtain a MapReduce job request, the MapReduce job request comprising a dataset to be processed, a preset number of Reduce tasks, and a parameter indicating whether starting an adjustment mechanism for the number of Reduce tasks is allowed;
a monitoring module, configured to monitor the execution of Map tasks if the parameter indicating whether starting the adjustment mechanism for the number of Reduce tasks is allowed is set to allowed;
a determination module, configured to determine an adjusted number of Reduce tasks if the number of executed Map tasks satisfies a preset first threshold, the output results of the executed Map tasks having been mapped to partitions equal in number to the preset number of Reduce tasks;
a mapping module, configured to map, according to the adjusted number of Reduce tasks, each unexecuted preset Reduce task to a corresponding adjusted Reduce task, so that each adjusted Reduce task is executed.
7. The device according to claim 6, characterized in that the determination module is specifically configured to:
determine the adjusted number of Reduce tasks according to the total data volume of the unexecuted Reduce tasks.
8. The device according to claim 7, characterized in that the determination module is specifically configured to:
determine the adjusted number of Reduce tasks according to the following formula:
$$\frac{\partial (TR)}{\partial (n)} + \frac{\partial (TRS)}{\partial (n)} = 0;$$
where n is the adjusted number of Reduce tasks, TR is the time cost of executing one adjusted Reduce task, TRS is the total time cost of executing all the adjusted Reduce tasks, and TR depends on the total data volume of the unexecuted Reduce tasks and on the adjusted number of Reduce tasks.
9. The device according to claim 6, characterized in that the MapReduce job request further comprises a preset second threshold used to schedule the execution of Reduce tasks in advance, and, if the second threshold is less than the first threshold, the monitoring module is specifically configured to:
if the number of executed Map tasks satisfies the second threshold but not the preset first threshold, monitor the execution of the preset Reduce tasks, each preset Reduce task corresponding to one of the partitions, the number of partitions being equal to the preset number of Reduce tasks;
and, after the number of executed Map tasks satisfies the preset first threshold, instruct that execution of the unexecuted preset Reduce tasks be stopped.
10. The device according to any one of claims 6 to 9, characterized in that the preset first threshold is a preset threshold on the number of Map tasks or a preset ratio, and the determination module is specifically configured to:
if the number of executed Map tasks satisfies the preset threshold on the number of Map tasks, start the adjustment mechanism for the number of Reduce tasks and determine the adjusted number of Reduce tasks; or,
if the ratio of the number of executed Map tasks to the total number of Map tasks satisfies the preset ratio, start the adjustment mechanism for the number of Reduce tasks and determine the adjusted number of Reduce tasks, the total number of Map tasks being determined from the dataset to be processed.
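To make the formula in claims 3 and 8 and the final mapping step of claims 1 and 6 more concrete, the following Python sketch works through one possible instance. It is not the cost model of the patent, which the claims do not spell out: it assumes, purely for illustration, that TR(n) = c·D/n (per-task time proportional to the data per task, with D the total data volume of the unexecuted Reduce tasks) and that TRS(n) = s·n (a fixed overhead s per adjusted task). Under these assumptions the condition ∂TR/∂n + ∂TRS/∂n = 0 becomes −c·D/n² + s = 0, so n = √(c·D/s). The round-robin grouping of unexecuted preset partitions is likewise an assumed policy, and all function and parameter names are hypothetical.

```python
import math
from typing import Dict, List


def adjusted_reduce_count(total_data: float,
                          cost_per_unit: float,
                          overhead_per_task: float) -> int:
    """Solve d(TR)/dn + d(TRS)/dn = 0 for an ASSUMED cost model.

    Assumed model (not taken from the patent):
        TR(n)  = cost_per_unit * total_data / n
        TRS(n) = overhead_per_task * n
    Stationary point: n = sqrt(cost_per_unit * total_data / overhead_per_task).
    """
    if total_data <= 0 or cost_per_unit <= 0 or overhead_per_task <= 0:
        raise ValueError("all inputs must be positive")
    n_star = math.sqrt(cost_per_unit * total_data / overhead_per_task)
    # The number of tasks must be a positive integer.
    return max(1, round(n_star))


def assign_partitions(unexecuted_partitions: List[int],
                      n: int) -> Dict[int, List[int]]:
    """Map each unexecuted preset partition to one of n adjusted tasks.

    Round-robin is an assumed policy; the claims only require that every
    unexecuted preset Reduce task corresponds to some adjusted Reduce task.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    groups: Dict[int, List[int]] = {task: [] for task in range(n)}
    for index, partition in enumerate(unexecuted_partitions):
        groups[index % n].append(partition)
    return groups
```

With `total_data=6400`, `cost_per_unit=1.0` and `overhead_per_task=4.0`, the stationary point is √1600 = 40 adjusted Reduce tasks. A caller would probably also cap n at the number of unexecuted preset partitions, since a larger n leaves some adjusted tasks with no partition in this sketch.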
CN201310078507.4A 2013-03-12 2013-03-12 The dynamic defining method of MapReduce parameter and device Active CN103218263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310078507.4A CN103218263B (en) 2013-03-12 2013-03-12 The dynamic defining method of MapReduce parameter and device

Publications (2)

Publication Number Publication Date
CN103218263A true CN103218263A (en) 2013-07-24
CN103218263B CN103218263B (en) 2016-03-23

Family

ID=48816085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310078507.4A Active CN103218263B (en) 2013-03-12 2013-03-12 The dynamic defining method of MapReduce parameter and device

Country Status (1)

Country Link
CN (1) CN103218263B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770402A (en) * 2008-12-29 2010-07-07 ***通信集团公司 Map task scheduling method, equipment and system in MapReduce system
CN102096603A (en) * 2009-12-14 2011-06-15 ***通信集团公司 Task decomposition control method in MapReduce system and scheduling node equipment
US20120304186A1 (en) * 2011-05-26 2012-11-29 International Business Machines Corporation Scheduling Mapreduce Jobs in the Presence of Priority Classes
US20120317579A1 (en) * 2011-06-13 2012-12-13 Huan Liu System and method for performing distributed parallel processing tasks in a spot market

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KEWEN WANG et al.: "Predator - An Experience Guided Configuration Optimizer for Hadoop MapReduce", 《2012 IEEE 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING TECHNOLOGY AND SCIENCE》 *
XUELIAN LIN et al.: "A Practical Performance Model for Hadoop MapReduce", 《2012 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING WORKSHOPS》 *
周锋: "一种改进的MapReduce并行编程模型", 《科协论坛(下半月)》 *
奚建清: "基于MapReduce的封闭立方体并行计算方法", 《华南理工大学学报(自然科学版)》 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103645952B (en) * 2013-08-08 2017-06-06 中国人民解放军国防科学技术大学 A kind of non-precision tasks in parallel processing method based on MapReduce
CN103645952A (en) * 2013-08-08 2014-03-19 中国人民解放军国防科学技术大学 Non-accurate task parallel processing method based on MapReduce
CN104598304A (en) * 2013-10-31 2015-05-06 国际商业机器公司 Dispatch method and device used in operation execution
CN104598304B (en) * 2013-10-31 2018-03-13 国际商业机器公司 Method and apparatus for the scheduling in Job execution
CN104978228B (en) * 2014-04-09 2019-08-30 腾讯科技(深圳)有限公司 A kind of dispatching method and device of distributed computing system
CN104978228A (en) * 2014-04-09 2015-10-14 腾讯科技(深圳)有限公司 Scheduling method and scheduling device of distributed computing system
CN113157448B (en) * 2014-06-30 2024-04-12 亚马逊科技公司 System and method for managing feature processing
CN113157448A (en) * 2014-06-30 2021-07-23 亚马逊科技公司 System and method for managing feature processing
US10831716B2 (en) 2014-07-31 2020-11-10 International Business Machines Corporation Method and apparatus for configuring relevant parameters of MapReduce applications
CN105302536A (en) * 2014-07-31 2016-02-03 国际商业机器公司 Configuration method and apparatus for related parameters of MapReduce application
WO2017031961A1 (en) * 2015-08-24 2017-03-02 华为技术有限公司 Data processing method and apparatus
CN106484689A (en) * 2015-08-24 2017-03-08 杭州华为数字技术有限公司 Data processing method and device
CN106484689B (en) * 2015-08-24 2019-09-03 杭州华为数字技术有限公司 Data processing method and device
WO2017162027A1 (en) * 2016-03-21 2017-09-28 阿里巴巴集团控股有限公司 Control method and device for map end aggregation regarding user task in mr computing platform
CN107220247A (en) * 2016-03-21 2017-09-29 阿里巴巴集团控股有限公司 The control method and device that user task map ends polymerize in MR calculating platforms
TWI730051B (en) * 2016-03-21 2021-06-11 香港商阿里巴巴集團服務有限公司 Method and device for controlling user task mapping (map) end aggregation in a mapping induction (MR) computing platform
CN107402952A (en) * 2016-05-20 2017-11-28 伟萨科技有限公司 Big data processor accelerator and big data processing system
CN108196970A (en) * 2017-12-29 2018-06-22 东软集团股份有限公司 The dynamic memory management method and device of Spark platforms
CN110209645A (en) * 2017-12-30 2019-09-06 ***通信集团四川有限公司 Task processing method, device, electronic equipment and storage medium
CN110795301A (en) * 2018-08-01 2020-02-14 马上消费金融股份有限公司 Job monitoring method, device, terminal and computer storage medium
CN110222105A (en) * 2019-05-14 2019-09-10 联动优势科技有限公司 Data summarization processing method and processing device
CN110222105B (en) * 2019-05-14 2021-06-29 联动优势科技有限公司 Data summarization processing method and device
CN110413396A (en) * 2019-07-30 2019-11-05 广东工业大学 A kind of resource regulating method, device, equipment and readable storage medium storing program for executing
CN110413396B (en) * 2019-07-30 2022-02-15 广东工业大学 Resource scheduling method, device and equipment and readable storage medium

Also Published As

Publication number Publication date
CN103218263B (en) 2016-03-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant