CN103617087B - MapReduce optimizing method suitable for iterative computations - Google Patents


Publication number
CN103617087B
Authority
CN
China
Prior art keywords
task
node
hadoop
map
tasks
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310600745.7A
Other languages
Chinese (zh)
Other versions
CN103617087A (en)
Inventor
金海
郑然
余根茂
章勤
朱磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Huazhong University of Science and Technology
Priority to CN201310600745.7A
Publication of CN103617087A
Application granted
Publication of CN103617087B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a MapReduce optimization method suitable for iterative computation. The method is applied to a Hadoop cluster system comprising one master node and a plurality of slave nodes, and comprises the following steps: the master node receives a plurality of Hadoop jobs submitted by users; the job service process of the master node places the jobs in a job queue, where they wait to be scheduled by the job scheduler of the master node; the master node waits for task requests sent by the slave nodes; after the master node receives a task request, its job scheduler preferentially schedules local tasks; and if the slave node that sent the task request has no local task, predictive scheduling is performed according to the task type of the Hadoop job. The method supports traditional data-intensive applications and also supports iterative computation transparently and efficiently; it treats dynamic data and static data separately and reduces the volume of data transmitted.

Description

A MapReduce optimization method suitable for iterative computation
Technical field
The invention belongs to the field of parallel computing and massive data processing, and more particularly relates to a MapReduce optimization method suitable for iterative computation.
Background
In the 21st century the scale of data to be processed keeps growing: terabyte-scale data sets are increasingly common, and petabyte-scale data sets have appeared. Data at this scale is far beyond the processing capacity of a PC, and this demand has driven the development of parallel and distributed computing platforms. Google's MapReduce model emerged against this background; it is a popular data-intensive computing model for large cluster environments.
MapReduce is a programming model for parallel operations on large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and the main idea behind them, are borrowed from functional programming, together with features borrowed from vector programming languages. The model makes it easy for programmers who are not familiar with distributed parallel programming to run their programs on a distributed system. In this model all data is organized as <key, value> pairs, and the programmer only needs to implement the Map and Reduce functions. The Map function processes an input <key, value> pair and outputs zero or more key-value pairs; the Reduce function reads the intermediate output of Map and produces zero or more results. The model follows the principle of relative independence: there are no data dependencies among Map tasks or among Reduce tasks.
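As an illustration of the programming model described above, the following is a minimal single-process sketch in Python, not Hadoop code. The names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and word counting is chosen only because it is a familiar example.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: record identifier (unused here); value: a line of text.
    # Emits zero or more <key, value> pairs.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Receives one key with all of its intermediate values.
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Group the intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    results = []
    for ik in sorted(groups):
        results.extend(reduce_fn(ik, groups[ik]))
    return results

result = run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)
```

Because each call to `map_fn` depends only on its own record, and each call to `reduce_fn` only on its own key group, the calls could run on different machines; that independence is what the model relies on.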
The design of the MapReduce model makes it well suited to batch-mode computation, such as log analysis and text processing. Besides these batch applications, however, there are applications based on machine learning or pattern recognition, typically computer vision and data mining. The core algorithms of such applications are iterative. Current Hadoop (the open-source implementation of the MapReduce model) cannot support iterative computation transparently and efficiently, and some characteristics of Hadoop are even unsuited to it. With the development of social networks, computer vision, data mining and the like, the data processed by such applications keeps growing, and so does the demand for a parallel computing model that supports them effectively.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a MapReduce optimization method suitable for iterative computation. Its purpose is to improve Hadoop so that it both supports traditional data-intensive applications and supports iterative computation transparently and efficiently, and to reduce the volume of transmitted data by addressing dynamic data and static data separately.
To achieve the above object, according to one aspect of the invention, a MapReduce optimization method suitable for iterative computation is provided. It is applied in a Hadoop cluster system comprising one master node and a plurality of slave nodes, and comprises the following steps:
(1) The master node receives a plurality of Hadoop jobs submitted by users; the job service process of the master node puts the jobs into a job queue, where they wait for the job scheduler of the master node to schedule them;
(2) The master node waits for task requests sent by the slave nodes. After a task request is received, the job scheduler of the master node preferentially schedules local tasks. If the slave node that sent the task request has no local task, predictive scheduling is performed according to the task type of the Hadoop job: for a compute-type task the Hadoop job is scheduled directly; for a transmission-type task scheduling is delayed by a certain interval, and the Hadoop job is scheduled only when the total delay reaches a delay threshold;
(3) After receiving a task of a Hadoop job scheduled by the master node, the slave node determines the job type and processes the task accordingly. Jobs are divided into two types, iterative and non-iterative. A non-iterative job is processed in the conventional Hadoop manner. For an iterative job, a Map-side shuffle process is added before the Map phase to read dynamic data for the Map tasks; in the Reduce phase the dynamic data is cached locally and managed by the dynamic data cache component of the slave node; and the final result is saved in HDFS after the job finishes.
Preferably, step (2) specifically comprises the following sub-steps:
(2-1) The job service process on the master node listens and waits for heartbeat information sent by the task service process of a slave node; the heartbeat information contains the current running information of the slave node, specifically the total number of slots and the number of running slots;
(2-2) After receiving the heartbeat information from the slave node, the master node calculates from it the number of idle slots on the current slave node and the average number of running slots across the whole Hadoop cluster system, and decides from the result whether a task of a job needs to be assigned to the current slave node; if no task needs to be assigned, return to step (2-1); otherwise execute step (2-3);
(2-3) Set counter i = 0;
(2-4) Determine whether the i-th Hadoop job has a local task on the current slave node, that is, whether the current slave node stores an input data split (Split) of the i-th Hadoop job; if not, go to step (2-5); if so, go to step (2-11);
(2-5) Set i = i + 1 and determine whether i equals the number of Hadoop jobs; if so, go to step (2-7); otherwise return to step (2-4);
(2-6) Set counter j = 0;
(2-7) Determine whether the task type of the j-th Hadoop job is compute-type or transmission-type; if compute-type, go to step (2-11); if transmission-type, go to step (2-8);
(2-8) Delay the task scheduling of the j-th Hadoop job by one heartbeat interval;
(2-9) Determine whether the total delay of task scheduling for the j-th Hadoop job has reached a threshold; if so, go to step (2-11); otherwise go to step (2-10);
(2-10) Set j = j + 1 and determine whether j equals the number of Hadoop jobs; if so, go to step (2-12); otherwise return to step (2-1);
(2-11) Dispatch the local task of the i-th Hadoop job to the current slave node; the process then ends;
(2-12) Dispatch a task of the j-th Hadoop job to the current slave node; the process then ends.
Preferably, in step (2-2), the number of idle slots on the current slave node equals the total number of slots minus the number of running slots, and the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes monitored by the tracker process divided by the number of slave nodes. If the number of idle slots on the current node equals 0, no task needs to be assigned; likewise, if the number of running slots on the current slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be assigned.
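The slot arithmetic of step (2-2) can be sketched as follows. This is an illustrative Python sketch, not Hadoop code: the function names are hypothetical, and it assumes that the cluster average is the sum of running slots divided by the number of slave nodes.

```python
def idle_slots(total_slots, running_slots):
    # Idle slots = total slots minus slots currently running.
    return total_slots - running_slots

def average_running_slots(running_per_node):
    # Assumption: average over slave nodes, one entry per node.
    return sum(running_per_node) / len(running_per_node)

def should_assign_task(total_slots, running_slots, running_per_node):
    # No idle slot: nothing can be assigned to this node.
    if idle_slots(total_slots, running_slots) == 0:
        return False
    # Node is busier than the cluster average: skip it for now.
    if running_slots > average_running_slots(running_per_node):
        return False
    return True
```

For example, on a three-node cluster where the nodes run 1, 3 and 2 slots, the average is 2, so a node with 4 total slots and 1 running slot is eligible, while a node running 3 slots is not.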
Preferably, step (3) specifically comprises the following sub-steps:
(3-1) Receive a task of a Hadoop job scheduled by the master node;
(3-2) Determine whether the job to which the task belongs is an iterative job or a non-iterative job; if iterative, go to step (3-3); if non-iterative, go to step (3-4);
(3-3) Determine whether the task of the iterative job is a Map task or a Reduce task; if a Map task, go to step (3-5); if a Reduce task, go to step (3-9);
(3-4) Determine whether the task of the non-iterative job is a Map task or a Reduce task; if a Map task, go to step (3-8); if a Reduce task, go to step (3-9);
(3-5) Determine whether the iterative job is running for the first time; if not, go to step (3-6); if so, go to step (3-7);
(3-6) The task service process of the slave node where the Map task process (Mapper) runs starts multiple data copy threads, which request, over HTTP, the dynamic data files computed by the Reduce task processes from the slave nodes where those Reduce task processes run; then go to step (3-8);
(3-7) The Map task process reads the initial value of the dynamic data; then go to step (3-8);
(3-8) The Hadoop cluster system divides the input file of the job into splits, and the Map task process processes a split; then go to step (3-14);
(3-9) The Reduce task process (Reducer) on the slave node starts data copy threads, which request, over HTTP, the intermediate output files of the Map task processes from the slave nodes where those Map task processes run. The intermediate output files are stored on the local disks of those slave nodes. Copied files are first placed in a memory buffer, and the multiple copied files are finally merged into one large file sorted by key; then go to step (3-10);
(3-10) The Reduce task process reads records from the resulting large file in <key, iterator> form and executes the Reduce() method; then go to step (3-11);
(3-11) Determine whether the job is iterative or non-iterative; if iterative, go to step (3-12); if non-iterative, go to step (3-13);
(3-12) The dynamic data cache component of the slave node caches the result produced by the Reduce task process in memory, spilling to a local disk file when the buffer is full; then go to step (3-14);
(3-13) The Reduce task process writes its result to HDFS; then go to step (3-14);
(3-14) Task execution ends; return to step (3-1).
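The branching in sub-steps (3-2) to (3-5) and (3-11) amounts to a small decision table, sketched below in Python. The function names and the string labels are hypothetical; the returned step numbers are the ones used in the text.

```python
def next_step(iterative, task_kind, first_run=False):
    # Returns the step a slave node jumps to after classifying a task.
    if task_kind == "reduce":
        return "3-9"            # both job types copy Map output first
    if task_kind != "map":
        raise ValueError(f"unknown task kind: {task_kind}")
    if not iterative:
        return "3-8"            # conventional Map processing
    # Iterative Map task: read initial values on the first run,
    # otherwise fetch the previous iteration's dynamic data.
    return "3-7" if first_run else "3-6"

def reduce_output_target(iterative):
    # Step (3-11): iterative jobs keep Reduce output in the local
    # dynamic data cache; non-iterative jobs write it to HDFS.
    return "dynamic-data-cache" if iterative else "hdfs"
```

Only the iterative Map path differs from conventional Hadoop processing before the Map phase, which is why non-iterative jobs are unaffected by the added Map-side shuffle.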
Preferably, the dynamic data files in step (3-6) are managed by the dynamic data cache component of the slave node and stored in memory and on local disk. The copied dynamic data is also managed by the dynamic data cache component; multiple Map task processes on the same slave node request the dynamic data files from the slave node locally, and this data serves as the dynamic data input required by the Map task processes.
Preferably, in step (3-8), the split size defaults to the HDFS block size, which is set in the configuration file. The Map task process parses the split into records of the <key, value> form it requires, executes the Map() method, and caches the results in memory, spilling to disk when the buffer is full. A spill file records partition information; within a single spill file, records are sorted first by partition and then by key. If multiple spill files need to be merged into one large file, a merge sort is performed over them.
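The spill ordering and merge described for step (3-8) can be sketched with in-memory lists standing in for spill files. This is an illustrative Python sketch, not Hadoop's implementation: `partition_of`, `make_spill` and `merge_spills` are hypothetical names, and the partition function is a deliberately simple deterministic stand-in.

```python
import heapq

def partition_of(key, num_partitions):
    # Hypothetical partitioner: sum of code points modulo partitions.
    return sum(ord(c) for c in key) % num_partitions

def make_spill(records, num_partitions):
    # A spill holds (partition, key, value) records sorted first by
    # partition and then by key.
    tagged = [(partition_of(k, num_partitions), k, v) for k, v in records]
    return sorted(tagged, key=lambda r: (r[0], r[1]))

def merge_spills(spills):
    # Merge sort over already-sorted spills, keeping the same order.
    return list(heapq.merge(*spills, key=lambda r: (r[0], r[1])))

spill_a = make_spill([("b", 1), ("a", 2)], num_partitions=2)
spill_b = make_spill([("c", 3), ("d", 4)], num_partitions=2)
merged = merge_spills([spill_a, spill_b])
```

Because every spill is already sorted, the merge only has to compare the heads of the spills, so the combined file never needs to be sorted from scratch.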
In general, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:
(1) Better data locality of tasks: owing to step (2), the scheduling strategy proposed in the invention balances the locality requirement of tasks against the delay overhead better than the delay scheduling strategy does. It divides Hadoop tasks into two classes, compute-intensive and transmission-intensive, and predicts the delay time in real time from the load of the cluster network. This raises the proportion of local tasks while effectively reducing the overall delay overhead of a job. The invention therefore has a clear advantage.
(2) Lower dynamic data transmission overhead: owing to step (3), the dynamic data caching strategy proposed by the invention substantially reduces the cluster network overhead caused by reading and writing dynamic data. Theoretical proof and experimental verification show that the total dynamic data transmitted by an iterative job is proportional only to the number of nodes on which its tasks run, with a definite upper bound given by the number of cluster nodes. The invention therefore has a clear advantage.
(3) Higher cluster efficiency in multi-job, multi-user environments: in such environments the cluster's network resources become the bottleneck of cluster efficiency and severely limit the effective utilization of the cluster. By optimizing Hadoop's network data flows, the invention reduces the cluster's network transmission overhead, relieves the network load, reduces competition for network resources between users and between jobs, and improves the effective utilization of the cluster under multiple jobs and multiple users. The invention therefore has a clear advantage.
(4) Efficient, transparent support for iterative computation: compared with traditional Hadoop, the invention supports both traditional batch jobs and iterative jobs, so its field of use is wider, covering for example social networks, computer vision and data mining. The invention therefore has a clear advantage.
Brief description of the drawings
Fig. 1 is a flow chart of the MapReduce optimization method suitable for iterative computation according to the invention.
Fig. 2 is a detailed flow chart of step (2) of the invention.
Fig. 3 is a detailed flow chart of step (3) of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
The technical terms of the invention are first explained:
Dynamic data: in an iterative computation problem, a variable whose new value is repeatedly derived, directly or indirectly, from its old value.
Static data: in an iterative computation problem, data that never changes, usually the original input data of the algorithm.
Compute-type task: a task whose computation time accounts for the major part of the whole processing of the Map task.
Transmission-type task: a task whose data transmission time accounts for the major part of the whole processing of the Map task.
Local task: a Map task whose input data split is stored locally on the slave node.
Delay scheduling strategy: a strategy that delays the scheduling of non-local tasks.
The overall idea of the invention is to target multi-user, multi-job cluster environments and to reduce the network transmission load of the cluster by optimizing the static data flow and the shared data flow. For the static data flow, the main contribution of the invention is a predictive scheduling algorithm; for the shared data flow, the invention achieves its goal through a data caching method and an added Map-side shuffle process.
The MapReduce optimization method suitable for iterative computation according to the invention is applied in a Hadoop cluster system comprising one master (Master) node and a plurality of slave (Slave) nodes. The method comprises the following steps (as shown in Fig. 1):
(1) The master node receives a plurality of Hadoop jobs submitted by users; the job service process of the master node (JobTracker) puts the jobs into a job queue, where they wait for the job scheduler of the master node to schedule them;
(2) The master node waits for task requests sent by the slave nodes. After a task request is received, the job scheduler of the master node preferentially schedules local tasks. If the slave node that sent the task request has no local task, predictive scheduling is performed according to the task type of the Hadoop job: for a compute-type task the Hadoop job is scheduled directly; for a transmission-type task scheduling is delayed by a certain interval, and the Hadoop job is scheduled only when the total delay reaches a delay threshold. This step specifically comprises the following sub-steps (as shown in Fig. 2):
(2-1) The job service process on the master node listens and waits for heartbeat information sent by the task service process (TaskTracker) of a slave node; the heartbeat information contains the current running information of the slave node, including the total number of slots and the number of running slots;
(2-2) After receiving the heartbeat information from the slave node, the master node calculates from it the number of idle slots on the current slave node and the average number of running slots across the whole Hadoop cluster system, and decides from the result whether a task of a job needs to be assigned to the current slave node; if no task needs to be assigned, return to step (2-1); otherwise execute step (2-3). Specifically, the number of idle slots on the current slave node equals the total number of slots minus the number of running slots, and the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes monitored by the tracker process divided by the number of slave nodes. If the number of idle slots on the current node equals 0, no task needs to be assigned; likewise, if the number of running slots on the current slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be assigned;
(2-3) Set counter i = 0;
(2-4) Determine whether the i-th Hadoop job has a local task on the current slave node, that is, whether the current slave node stores an input data split (Split) of the i-th Hadoop job; if not, go to step (2-5); if so, go to step (2-11);
(2-5) Set i = i + 1 and determine whether i equals the number of Hadoop jobs; if so, go to step (2-7); otherwise return to step (2-4);
(2-6) Set counter j = 0;
(2-7) Determine whether the task type of the j-th Hadoop job is compute-type or transmission-type; if compute-type, go to step (2-11); if transmission-type, go to step (2-8);
(2-8) Delay the task scheduling of the j-th Hadoop job by one heartbeat interval. Specifically, the heartbeat interval is the interval at which a slave node sends heartbeat information, namely 3 seconds;
(2-9) Determine whether the total delay of task scheduling for the j-th Hadoop job has reached a threshold; if so, go to step (2-11); otherwise go to step (2-10). The threshold can be configured by the cluster administrator on the following basis: the larger the threshold, the higher the proportion of local tasks but also the larger the delay overhead; the smaller the threshold, the lower the proportion of local tasks but also the smaller the delay overhead. The threshold defaults to 3 minutes;
(2-10) Set j = j + 1 and determine whether j equals the number of Hadoop jobs; if so, go to step (2-12); otherwise return to step (2-1);
(2-11) Dispatch the local task of the i-th Hadoop job to the current slave node; the process then ends;
(2-12) Dispatch a task of the j-th Hadoop job to the current slave node; the process then ends.
The advantage of this step is that tasks are classified: compute-type tasks are scheduled in the default manner, and predictive scheduling is applied to transmission-type tasks. This both raises the proportion of local tasks and reduces the overhead caused by delay.
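A simplified sketch of the scheduling decision in step (2) is given below in Python. It is not the patent's exact control flow: the jump targets between sub-steps are flattened into two passes over the job queue per heartbeat, which is an assumption. The `Job` class, its fields and `schedule` are hypothetical names; the 3-second heartbeat and 3-minute threshold are the defaults stated in the text.

```python
from dataclasses import dataclass, field

HEARTBEAT_SECONDS = 3          # heartbeat interval from the text
DELAY_THRESHOLD_SECONDS = 180  # default threshold: 3 minutes

@dataclass
class Job:
    name: str
    task_type: str                                  # "compute" or "transmission"
    split_nodes: set = field(default_factory=set)   # nodes storing pending splits
    delayed_seconds: int = 0

def schedule(jobs, node):
    # Pass 1: prefer any job that has a local task on this node.
    for job in jobs:
        if node in job.split_nodes:
            return job.name, "local"
    # Pass 2: no local task exists, so schedule by task type.
    for job in jobs:
        if job.task_type == "compute":
            return job.name, "non-local"   # compute-type: schedule at once
        job.delayed_seconds += HEARTBEAT_SECONDS
        if job.delayed_seconds >= DELAY_THRESHOLD_SECONDS:
            return job.name, "non-local"   # waited long enough
    return None                            # keep waiting for a better node
```

With these defaults a transmission-type job is passed over for at most 180 / 3 = 60 heartbeats before it is scheduled on a node that does not hold its input data, which bounds the delay overhead that the threshold trades against locality.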
(3) After receiving a task of a Hadoop job scheduled by the master node, the slave node determines the job type and processes the task accordingly. Jobs are divided into two types, iterative and non-iterative. A non-iterative job is processed in the conventional Hadoop manner. For an iterative job, a Map-side shuffle process is added before the Map phase to read dynamic data for the tasks of the Map phase (that is, the Map tasks); in the Reduce phase the dynamic data is cached locally and managed by the dynamic data cache component of the slave node; and the final result is saved in the Hadoop Distributed File System (HDFS) after the job finishes. This step specifically comprises the following sub-steps (as shown in Fig. 3):
(3-1) Receive a task of a Hadoop job scheduled by the master node;
(3-2) Determine whether the job to which the task belongs is an iterative job or a non-iterative job; if iterative, go to step (3-3); if non-iterative, go to step (3-4);
(3-3) Determine whether the task of the iterative job is a Map task or a Reduce task; if a Map task, go to step (3-5); if a Reduce task, go to step (3-9);
(3-4) Determine whether the task of the non-iterative job is a Map task or a Reduce task; if a Map task, go to step (3-8); if a Reduce task, go to step (3-9);
(3-5) Determine whether the iterative job is running for the first time; if not, go to step (3-6); if so, go to step (3-7);
(3-6) The task service process of the slave node where the Map task process (Mapper) runs starts multiple data copy threads, which request, over HTTP, the dynamic data files computed by the Reduce task processes (Reducer) from the slave nodes where those Reduce task processes run; then go to step (3-8). These dynamic data files are managed by the dynamic data cache component of the slave node and stored in memory and on local disk. The copied dynamic data is also managed by the dynamic data cache component; multiple Map task processes on the same slave node request the dynamic data files from the slave node locally, and this data serves as the dynamic data input required by the Map task processes;
The advantages of this sub-step are: 1. the dynamic data produced by the Reduce phase is stored locally, which avoids the overhead of writing it to HDFS; 2. the dynamic data produced by the Reduce phase is requested by the slave node hosting the Map task processes and stored locally on that slave node, and the Map task processes request the data from the slave node locally, which greatly reduces the volume of dynamic data transmitted.
(3-7) The Map task process reads the initial value of the dynamic data; then go to step (3-8). In short, an iterative job both consumes and produces dynamic data, and for the first execution of the job the user must supply the initial value of that data;
(3-8) The Hadoop cluster system divides the input file of the job into splits, and the Map task process processes a split; then go to step (3-14). Specifically, the split size defaults to the HDFS block size, which is set in the configuration file. The Map task process parses the split into records of the <key, value> form it requires, executes the Map() method, and caches the results in memory, spilling to disk when the buffer is full. A spill file records partition information; within a single spill file, records are sorted first by partition and then by key. If multiple spill files need to be merged into one large file, a merge sort is performed over them;
(3-9) The Reduce task process (Reducer) on the slave node starts data copy threads, which request, over HTTP, the intermediate output files of the Map task processes from the slave nodes where those Map task processes run. The intermediate output files are stored on the local disks of those slave nodes. Copied files are first placed in a memory buffer, and the multiple copied files are finally merged into one large file sorted by key; then go to step (3-10);
(3-10) The Reduce task process reads records from the resulting large file in <key, iterator> form and executes the Reduce() method; then go to step (3-11);
(3-11) Determine whether the job is iterative or non-iterative; if iterative, go to step (3-12); if non-iterative, go to step (3-13);
(3-12) The dynamic data cache component of the slave node caches the result produced by the Reduce task process in memory, spilling to a local disk file when the buffer is full; then go to step (3-14);
(3-13) The Reduce task process writes its result to HDFS; then go to step (3-14);
(3-14) Task execution ends; return to step (3-1).
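The behaviour of the dynamic data cache component in step (3-12), which buffers Reduce output in memory and spills to a local file when the buffer is full, can be sketched as follows. This Python sketch is illustrative only: the class name, the entry-count capacity and the use of `pickle` for spill files are assumptions, not details from the source.

```python
import os
import pickle
import tempfile

class DynamicDataCache:
    """Hypothetical in-memory buffer that spills to local disk when full."""

    def __init__(self, capacity, spill_dir):
        self.capacity = capacity      # max entries held in memory (assumed unit)
        self.spill_dir = spill_dir
        self.buffer = {}
        self.spill_files = []

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            self._spill()

    def _spill(self):
        # Write the full buffer to a local file and start a fresh buffer.
        path = os.path.join(self.spill_dir, f"spill-{len(self.spill_files)}.bin")
        with open(path, "wb") as f:
            pickle.dump(self.buffer, f)
        self.spill_files.append(path)
        self.buffer = {}

    def items(self):
        # Combine spilled data with the in-memory buffer; newer values win.
        merged = {}
        for path in self.spill_files:
            with open(path, "rb") as f:
                merged.update(pickle.load(f))
        merged.update(self.buffer)
        return merged

with tempfile.TemporaryDirectory() as spill_dir:
    cache = DynamicDataCache(capacity=2, spill_dir=spill_dir)
    for i in range(3):
        cache.put(f"centroid-{i}", [float(i), float(i)])
    spilled = len(cache.spill_files)
    snapshot = cache.items()
```

Keeping this data on the slave node rather than in HDFS is what lets the Map tasks of the next iteration read it locally or from a peer node over HTTP, as steps (3-6) and (3-12) describe.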
Example:
To verify the feasibility and effectiveness of the invention, a computer program was written and run in the experimental environment shown in Table 1 below to test the invention; the test results are shown in Table 2 and Table 3 below:
Table 1: Experimental configuration environment
In Table 2 and Table 3 the invention is compared with Hadoop-0.20.0 and HaLoop, and the experimental algorithm is fuzzy C-Means. Table 2 compares the network transmission volume of dynamic data for the three MapReduce implementations at different experimental scales. Table 3 compares the execution time of the three MapReduce implementations at a fixed experimental scale for different numbers of iterations. The experimental results show that the invention achieves a satisfactory improvement in both network data transmission and execution time.
Table 2: Comparison of dynamic data transmission volume in fuzzy C-Means
Table 3: Comparison of fuzzy C-Means execution time
Those skilled in the art will readily understand that the foregoing is only a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (5)

1. A MapReduce optimization method suitable for iterative computation, applied in a Hadoop cluster system that comprises one master node and a plurality of slave nodes, characterized in that the method comprises the following steps:
(1) the master node receives a plurality of Hadoop jobs submitted by a user; the job service process of the master node places the jobs in a job queue, where they wait to be scheduled by the job scheduler of the master node;
(2) the master node waits for a task request sent by a slave node; after the task request is received, the job scheduler of the master node preferentially schedules localized tasks; if the slave node that sent the task request has no localized task, predictive scheduling is performed according to the task type of the Hadoop job: for a computation-type task the Hadoop job is scheduled directly, whereas for a transmission-type task scheduling is delayed by a certain interval and the Hadoop job is scheduled only once the total delay reaches a delay threshold; here, a computation-type task is a task whose computation time accounts for the major part of the whole processing of the Map task, and a transmission-type task is a task whose data transmission time accounts for the major part of the whole processing of the Map task;
wherein step (2) specifically comprises the following sub-steps:
(2-1) the job service process on the master node monitors and waits for heartbeat messages sent by the task service process of a slave node; the heartbeat message contains the current running information of the slave node, specifically the total number of slots and the number of running slots;
(2-2) after receiving the heartbeat message sent by the slave node, the master node calculates from it the number of idle slots of the current slave node and the average number of running slots of the whole Hadoop cluster system, and judges from the result whether a task of the job needs to be allocated to the current slave node; if no task needs to be allocated, return to step (2-1); otherwise execute step (2-3);
(2-3) set counter i=0;
(2-4) judge whether the i-th Hadoop job has a localized task on the current slave node, i.e. whether the current slave node stores an input data split (Split) of the i-th Hadoop job; if not, proceed to step (2-5); if so, proceed to step (2-11);
(2-5) set i=i+1 and judge whether i equals the number of Hadoop jobs; if so, enter step (2-7); otherwise return to step (2-4);
(2-6) set counter j=0;
(2-7) judge whether the task type of the j-th Hadoop job is computation-type or transmission-type; if computation-type, enter step (2-11); if transmission-type, enter step (2-8);
(2-8) delay the task scheduling of the j-th Hadoop job by one heartbeat interval;
(2-9) judge whether the total delay of task scheduling for the j-th Hadoop job has reached a threshold; if so, proceed to step (2-11); otherwise proceed to step (2-10);
(2-10) set j=j+1 and judge whether j equals the number of Hadoop jobs; if so, enter step (2-12); otherwise return to step (2-1);
(2-11) schedule the localized task of the i-th Hadoop job to the current slave node, after which the process ends;
(2-12) schedule the task of the j-th Hadoop job to the current slave node, after which the process ends;
(3) after receiving a task of a Hadoop job scheduled by the master node, the slave node judges the type of the Hadoop job and processes it accordingly; job types are divided into iterative and non-iterative; a non-iterative job is processed in the conventional Hadoop manner; for an iterative job, a Map-side shuffle process is added before the Map phase to read the dynamic data for the Map task, and in the Reduce phase the dynamic data is cached locally and managed by the dynamic data caching component of the slave node; the final result is stored in HDFS after the job has finished.
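The predictive scheduling of step (2) can be illustrated with a minimal Python sketch. It is a simplified reading, not the patented implementation: the names `Job`, `pick_job`, `kind`, `local_nodes` and `delayed` are hypothetical, the jobs are scanned in a single pass per task request, and the sub-step cross-references of the claim are followed only loosely.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    kind: str                                       # "compute" or "transfer" (hypothetical labels)
    local_nodes: set = field(default_factory=set)   # slave nodes that store one of its input splits
    delayed: int = 0                                # heartbeats this job has been postponed so far


def pick_job(jobs, node, delay_threshold):
    """Return the job whose task should go to `node`, or None to keep waiting."""
    # Localized tasks first: any job with an input split stored on this node.
    for job in jobs:
        if node in job.local_nodes:
            return job
    # No localized task: decide by task type.
    for job in jobs:
        if job.kind == "compute":
            return job                  # computation-type: schedule directly
        job.delayed += 1                # transmission-type: postpone by one heartbeat
        if job.delayed >= delay_threshold:
            return job                  # total delay has reached the threshold
    return None
```

Calling `pick_job` once per heartbeat from a slave node with idle slots reproduces the intended behaviour: a transmission-type job is handed to a non-local node only after it has been postponed `delay_threshold` times.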
2. The MapReduce optimization method according to claim 1, characterized in that step (2-2) is specifically: the number of idle slots of the current slave node equals the total number of slots minus the number of running slots; the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes monitored by the tracking process, divided by the number of slots of all slave nodes; if the number of idle slots of the current slave node equals 0, no task needs to be allocated; likewise, if the number of running slots of the current slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be allocated.
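A short Python sketch of the slot check in claim 2 follows. The function name `needs_task` and its parameters are hypothetical. The wording of the denominator is ambiguous in translation; the sketch assumes the average is taken per slave node (sum of running slots divided by the number of slave nodes) so that it is directly comparable with one node's running-slot count. That choice is an assumption, not something the claim confirms.

```python
def needs_task(total_slots, running_slots, cluster_running):
    """Decide whether the current slave node should be given a task.

    cluster_running: running-slot counts reported by every slave node,
    including the current one (assumed to come from heartbeat messages).
    """
    idle_slots = total_slots - running_slots
    if idle_slots == 0:
        return False                    # nothing free on this node
    # Assumption: average over slave nodes, see the note above.
    average_running = sum(cluster_running) / len(cluster_running)
    # A node busier than the cluster average is skipped.
    return running_slots <= average_running
```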
3. The MapReduce optimization method according to claim 1, characterized in that step (3) specifically comprises the following sub-steps:
(3-1) receive a task of a Hadoop job scheduled by the master node;
(3-2) judge whether the job to which the task belongs is an iterative job or a non-iterative job; if it is an iterative job, proceed to step (3-3); if it is a non-iterative job, proceed to step (3-4);
(3-3) judge whether the task of the iterative job is a Map task or a Reduce task; if it is a Map task, proceed to step (3-5); if it is a Reduce task, proceed to step (3-9);
(3-4) judge whether the task of the non-iterative job is a Map task or a Reduce task; if it is a Map task, proceed to step (3-8); if it is a Reduce task, proceed to step (3-9);
(3-5) judge whether the iterative job is running for the first time; if not, proceed to step (3-6); if so, proceed to step (3-7);
(3-6) the task service process of the slave node where the Map task process (Mapper) is located starts a plurality of data copy threads, which request over HTTP the slave node where the Reduce task process is located and obtain the dynamic data file computed by the Reduce task process; then proceed to step (3-8);
(3-7) the Map task process reads the initial values of the dynamic data, then proceeds to step (3-8);
(3-8) the Hadoop cluster system divides the input file of the job into individual splits, and the Map task process processes a split; then proceed to step (3-14);
(3-9) the Reducer task process on the slave node starts data copy threads, which request over HTTP the slave node where the Map task process is located and obtain the intermediate output file of the Map task process; the intermediate output file is stored on the local disk of the slave node; copied files are first placed in a memory buffer, and multiple copied files are eventually merged into one large file, which is sorted by key; then proceed to step (3-10);
(3-10) the Reduce task process reads records in <key, iterator> form from the resulting large file and executes the Reduce() method, then proceeds to step (3-11);
(3-11) judge whether the job is iterative or non-iterative; if it is an iterative job, proceed to step (3-12); if it is a non-iterative job, proceed to step (3-13);
(3-12) the dynamic data caching component of the slave node caches the result produced by the Reduce task process in memory and spills it to a local disk file when the buffer is full; then proceed to step (3-14);
(3-13) the Reduce task process writes its result to HDFS, then proceeds to step (3-14);
(3-14) task execution ends, then return to step (3-1).
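The branching of sub-steps (3-1) to (3-14) can be summarized as a small Python function that lists the sub-steps a task passes through. This is only an illustrative sketch; the function name `task_steps`, its parameters and the step descriptions are hypothetical labels.

```python
def task_steps(iterative, task, first_run=True):
    """Return, in order, the claim-3 sub-steps visited by one task."""
    if task == "map":
        steps = []
        if iterative:
            # Iterative Map tasks obtain dynamic data before processing the split.
            if first_run:
                steps.append("3-7 read initial dynamic data")
            else:
                steps.append("3-6 fetch dynamic data over HTTP")
        steps.append("3-8 process input split")
    elif task == "reduce":
        steps = ["3-9 copy and merge map output", "3-10 run reduce"]
        # Iterative results stay in the local cache; others go to HDFS.
        if iterative:
            steps.append("3-12 cache result locally")
        else:
            steps.append("3-13 write result to HDFS")
    else:
        raise ValueError(f"unknown task type: {task}")
    steps.append("3-14 task finished")
    return steps
```

For example, `task_steps(True, "map", first_run=False)` passes through (3-6), (3-8) and (3-14), which is the Map-side shuffle path that distinguishes iterative jobs from conventional ones.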
4. The MapReduce optimization method according to claim 3, characterized in that the dynamic data file in step (3-6) is managed by the dynamic data caching component of the slave node and is stored in memory and on the local disk; the copied dynamic data is likewise managed by the dynamic data caching component; multiple Map task processes on the same slave node request the dynamic data file from the local slave node, and this data serves as the dynamic data input required by the Map task processes.
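A dynamic data caching component of the kind described in step (3-12) and claim 4 might look like the following Python sketch. The class name `DynamicDataCache`, the record-count capacity, and the use of `pickle` files for spills are all assumptions made for illustration; the patent does not specify the storage format.

```python
import os
import pickle
import tempfile


class DynamicDataCache:
    """Keeps dynamic data in memory and spills it to local disk when full."""

    def __init__(self, capacity, spill_dir=None):
        self.capacity = capacity                  # max records held in memory (assumed unit)
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="dyncache-")
        self.buffer = {}
        self.spill_files = []

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            self._spill()

    def _spill(self):
        # Write the whole buffer to a new local file, then start a fresh buffer.
        path = os.path.join(self.spill_dir, f"spill-{len(self.spill_files)}.bin")
        with open(path, "wb") as f:
            pickle.dump(self.buffer, f)
        self.spill_files.append(path)
        self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for path in reversed(self.spill_files):   # the newest spill wins
            with open(path, "rb") as f:
                data = pickle.load(f)
            if key in data:
                return data[key]
        raise KeyError(key)
```

Several Map task processes on the same slave node could then share one such instance and call `get` for the dynamic data they need, instead of fetching it over the network again.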
5. The MapReduce optimization method according to claim 3, characterized in that, in step (3-8), the size of a split defaults to the HDFS block size, which is set through a configuration file; the Map task process parses the split into records of the <key, value> form that it requires and executes the Map() method; the results are cached in memory and spilled to disk when the buffer is full; a spill file records partition information, and a single spill file is sorted first by partition and then by key; if there are multiple spill files, they need to be merged into one large file, which is done by merge-sorting the spill files.
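The spill ordering and merge described in claim 5 can be sketched in Python as follows. The helper names `partition_of`, `make_spill` and `merge_spills` are hypothetical, and the CRC32-based partition function is an assumption chosen only because it is deterministic; the claim does not say how partitions are assigned.

```python
import heapq
import zlib


def partition_of(key, num_partitions):
    # Assumed partition function: deterministic hash of the key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def make_spill(records, num_partitions):
    """Sort buffered (key, value) records by partition, then by key."""
    tagged = [(partition_of(k, num_partitions), k, v) for k, v in records]
    return sorted(tagged, key=lambda r: (r[0], r[1]))


def merge_spills(spills):
    """Merge already-sorted spills into one sorted run (k-way merge sort)."""
    return list(heapq.merge(*spills, key=lambda r: (r[0], r[1])))
```

Because each spill is already ordered by (partition, key), `heapq.merge` only has to interleave them, which mirrors the merge sort over multiple spill files that the claim describes.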
CN201310600745.7A 2013-11-25 2013-11-25 MapReduce optimizing method suitable for iterative computations Active CN103617087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310600745.7A CN103617087B (en) 2013-11-25 2013-11-25 MapReduce optimizing method suitable for iterative computations

Publications (2)

Publication Number Publication Date
CN103617087A (en) 2014-03-05
CN103617087B (en) 2017-04-26

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310600745.7A Active CN103617087B (en) 2013-11-25 2013-11-25 MapReduce optimizing method suitable for iterative computations

Country Status (1)

Country Link
CN (1) CN103617087B (en)

Also Published As

Publication number Publication date
CN103617087A (en) 2014-03-05

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant