CN103617087B

CN103617087B - MapReduce optimizing method suitable for iterative computations

Info

Publication number: CN103617087B
Application number: CN201310600745.7A
Authority: CN
Inventors: 金海; 郑然; 余根茂; 章勤; 朱磊
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2013-11-25
Filing date: 2013-11-25
Publication date: 2017-04-26
Anticipated expiration: 2033-11-25
Also published as: CN103617087A

Abstract

The invention discloses a MapReduce optimization method suitable for iterative computation. The method is applied in a Hadoop cluster system that comprises one master node and a plurality of slave nodes. The method comprises the following steps: the master node receives a plurality of Hadoop jobs submitted by users; the job service process of the master node places the jobs in a job queue, where they wait to be scheduled by the job scheduler of the master node; the master node waits for a task request sent by a slave node; after the master node receives the task request, its job scheduler preferentially schedules localized tasks; and if the slave node that sent the task request has no localized task, prediction scheduling is performed according to the task type of the Hadoop job. The method supports traditional data-intensive applications and also supports iterative computation transparently and efficiently; it addresses dynamic data and static data separately and thereby reduces the volume of data transmitted.

Description

A MapReduce optimization method suitable for iterative computation

Technical field

The invention belongs to the field of parallel computing and massive data processing, and more particularly relates to a MapReduce optimization method suitable for iterative computation.

Background art

Since the start of the 21st century, the scale of data processing has grown steadily. Terabyte-scale data sets have become common, and petabyte-scale data sets have appeared. Data at this scale far exceeds the processing capability of a single PC, and this demand for processing capability has driven the development of parallel and distributed computing platforms. Against this background, Google's MapReduce model emerged as a popular data-intensive computing model for large cluster environments.

MapReduce is a programming model for parallel operations on large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and the main ideas behind them, are borrowed from functional programming, together with features borrowed from vector programming languages. The model makes it much easier for programmers who are not familiar with distributed parallel programming to run their programs on a distributed system. In this model, all data is organized as <key, value> pairs. When programming, the programmer only needs to implement the Map and Reduce functions. The Map function processes an input <key, value> pair and outputs zero or more key-value pairs; the Reduce function reads the intermediate output of Map and finally produces zero or more results. The MapReduce model follows the principle of relative independence: there are no data dependencies between Map tasks or between Reduce tasks.
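As an illustration of the programming model described above, the following is a minimal single-process sketch of a word count written as a Map function and a Reduce function. The names `map_fn`, `reduce_fn` and `run_job` are hypothetical and are not part of Hadoop; `run_job` merely stands in for the framework's grouping of intermediate pairs by key.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_fn(key: int, value: str) -> Iterator[Pair]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in value.split():
        yield word, 1


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Pair]:
    """Reduce: sum the counts collected for one word."""
    yield key, sum(values)


def run_job(records: Iterable[Tuple[int, str]],
            mapper: Callable[[int, str], Iterator[Pair]],
            reducer: Callable[[str, Iterable[int]], Iterator[Pair]]) -> List[Pair]:
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results: List[Pair] = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


counts = run_job([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)
```

Because each call to `map_fn` and each call to `reduce_fn` depends only on its own input, the calls could be distributed across machines, which is the independence property the paragraph above refers to.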

The design of the MapReduce model makes it well suited to batch-mode computation, such as log analysis and text processing. Besides these batch applications, however, there are applications based on machine learning or pattern recognition, typically in computer vision and data mining, whose core algorithms are iterative. Current Hadoop (the open-source implementation of the MapReduce model) cannot support iterative computation transparently and efficiently, and some of its characteristics are even unsuited to iterative computation. With the development of social networks, computer vision, data mining and similar fields, the data processing scale of such applications keeps growing, and so does the demand for a parallel computing model that supports them effectively.

Summary of the invention

In view of the above defects of, or needs for improvement in, the prior art, the invention provides a MapReduce optimization method suitable for iterative computation. Its purpose is to improve on Hadoop so that it supports traditional data-intensive applications while also supporting iterative computation transparently and efficiently, and to reduce the volume of transmitted data by addressing dynamic data and static data separately.

To achieve the above object, according to one aspect of the invention, a MapReduce optimization method suitable for iterative computation is provided. It is applied in a Hadoop cluster system that includes one master node and multiple slave nodes, and the method comprises the following steps:

(1) The master node receives multiple Hadoop jobs submitted by users; the job service process of the master node puts the jobs into a job queue, where they wait for the job scheduler of the master node to schedule them;

(2) The master node waits for a task request sent by a slave node. After receiving the task request, the job scheduler of the master node preferentially schedules localized tasks. If the slave node that sent the task request has no localized task, prediction scheduling is performed according to the task type of the Hadoop job: a computation-type task is scheduled directly, whereas a transmission-type task is delayed by a certain interval, and the Hadoop job is scheduled only when the total delay reaches the delay threshold;

(3) After receiving a task of a Hadoop job scheduled by the master node, the slave node determines the type of the Hadoop job and handles it accordingly. Jobs are either iterative or non-iterative. A non-iterative job is handled in the conventional Hadoop manner. For an iterative job, a Map-side shuffle process is added before the Map stage to read dynamic data for the Map tasks; in the Reduce stage, the dynamic data is cached locally and handed to the dynamic data cache component of the slave node for management; and the final result is saved in HDFS after the job completes.

Preferably, step (2) specifically includes the following sub-steps:

(2-1) The job service process on the master node listens and waits for heartbeat information sent by the task service process on a slave node. The heartbeat information contains the current running information of the slave node, specifically the total number of slots and the number of slots currently running;

(2-2) After receiving the heartbeat information from the slave node, the master node uses it to calculate the number of idle slots on the current slave node and the average number of running slots across the whole Hadoop cluster system, and decides from the result whether a task of a job needs to be assigned to the current slave node. If no task needs to be assigned, return to step (2-1); otherwise execute step (2-3);

(2-3) Set counter i = 0;

(2-4) Determine whether the i-th Hadoop job has a localized task on the current slave node, that is, whether the current slave node stores an input data split (Split) of the i-th Hadoop job. If not, go to step (2-5); if so, go to step (2-11);

(2-5) Set i = i + 1 and determine whether i equals the number of Hadoop jobs. If it does, go to step (2-7); otherwise return to step (2-4);

(2-6) Set counter j = 0;

(2-7) Determine whether the task type of the j-th Hadoop job is computation-type or transmission-type. If it is computation-type, go to step (2-11); if it is transmission-type, go to step (2-8);

(2-8) Delay the task scheduling of the j-th Hadoop job by one heartbeat interval;

(2-9) Determine whether the total delay of the task scheduling of the j-th Hadoop job has reached a threshold. If it has, go to step (2-11); otherwise go to step (2-10);

(2-10) Set j = j + 1 and determine whether j equals the number of Hadoop jobs. If it does, go to step (2-12); otherwise return to step (2-1);

(2-11) Schedule the localized task of the i-th Hadoop job to the current slave node; the process then ends;

(2-12) Schedule the task of the j-th Hadoop job to the current slave node; the process then ends.
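The sub-steps of step (2) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented scheduler: the `Job` class and `pick_task` function are hypothetical names, the default heartbeat of 3 seconds and threshold of 180 seconds are the defaults given in the detailed description, and resetting a job's accumulated delay once it has been scheduled is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Job:
    name: str
    kind: str                                            # "computation" or "transmission"
    split_hosts: Set[str] = field(default_factory=set)   # nodes storing pending input splits
    delayed_seconds: int = 0                              # total delay accumulated so far


def pick_task(jobs: List[Job], node: str,
              heartbeat: int = 3, threshold: int = 180) -> Optional[Tuple[str, str]]:
    """Return (job name, "local" or "remote"), or None if every job was only delayed."""
    # First pass (steps 2-3 to 2-5): prefer a job with an input split on this node.
    for job in jobs:
        if node in job.split_hosts:
            return job.name, "local"
    # Second pass (steps 2-6 to 2-10): no localized task, so decide by task type.
    for job in jobs:
        if job.kind == "computation":
            return job.name, "remote"       # computation-type: schedule directly
        job.delayed_seconds += heartbeat    # transmission-type: wait one heartbeat
        if job.delayed_seconds >= threshold:
            job.delayed_seconds = 0         # assumption: reset once scheduled
            return job.name, "remote"
    return None
```

A caller would invoke `pick_task` once per task request; a return value of `None` means the requesting slave node receives no task on this heartbeat.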

Preferably, in step (2-2), the number of idle slots on the current slave node equals its total number of slots minus its number of running slots. The average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes observed by the tracking process, divided by the number of slots of all slave nodes. If the number of idle slots on the current node equals 0, no task needs to be assigned; likewise, if the number of slots currently running on the current slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be assigned.
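The assignment check in step (2-2) might look like the sketch below. The function name `should_assign` and the `slots` mapping are hypothetical. The text compares a node's running slot count with a cluster-wide average that is itself a ratio; this sketch normalizes the node's running slots by its own total so that the two quantities are comparable, which is an interpretation and not something the text states.

```python
from typing import Dict, Tuple


def should_assign(node: str, slots: Dict[str, Tuple[int, int]]) -> bool:
    """slots maps node name -> (total slots, running slots) from the latest heartbeats."""
    total, running = slots[node]
    idle = total - running                  # idle slots = total slots - running slots
    if idle == 0:
        return False                        # nothing free on this node
    cluster_total = sum(t for t, _ in slots.values())
    cluster_running = sum(r for _, r in slots.values())
    average_load = cluster_running / cluster_total
    # Assumption: compare the node's own fraction of busy slots with the cluster average.
    return running / total <= average_load


cluster = {"n1": (4, 4), "n2": (4, 1), "n3": (4, 3)}
decisions = {name: should_assign(name, cluster) for name in cluster}
```

With these example numbers the cluster average is 8/12, so only the lightly loaded node `n2` is offered a task.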

Preferably, step (3) specifically includes the following sub-steps:

(3-1) Receive a task of a Hadoop job scheduled by the master node;

(3-2) Determine whether the job to which the task belongs is an iterative job or a non-iterative job. If it is an iterative job, go to step (3-3); if it is a non-iterative job, go to step (3-4);

(3-3) Determine whether the task of the iterative job is a Map task or a Reduce task. If it is a Map task, go to step (3-5); if it is a Reduce task, go to step (3-9);

(3-4) Determine whether the task of the non-iterative job is a Map task or a Reduce task. If it is a Map task, go to step (3-8); if it is a Reduce task, go to step (3-9);

(3-5) Determine whether the iterative job is running for the first time. If not, go to step (3-6); if so, go to step (3-7);

(3-6) The task service process on the slave node where the Map task process (Mapper) resides starts multiple data copy threads, which request, over HTTP, the dynamic data files computed by the Reduce task processes from the slave nodes where those Reduce task processes reside; then go to step (3-8);

(3-7) The Map task process reads the initial value of the dynamic data; then go to step (3-8);

(3-8) The Hadoop cluster system divides the input file of the job into individual splits, and the Map task process processes its split; then go to step (3-14);

(3-9) The Reducer task process on the slave node starts data copy threads, which request, over HTTP, the intermediate output files of the Map task processes from the slave nodes where those Map task processes reside. The intermediate output files are stored on the local disk of the slave node; copied files are first placed in a memory buffer, and the multiple copied files are eventually merged into one large file sorted by key; then go to step (3-10);

(3-10) The Reduce task process reads records in <key, iterator> form from the resulting large file and executes the Reduce() method; then go to step (3-11);

(3-11) Determine whether the job is iterative or non-iterative. If it is an iterative job, go to step (3-12); if it is a non-iterative job, go to step (3-13);

(3-12) The dynamic data cache component of the slave node caches the result of the Reduce task process in memory and spills it to a local disk file when the buffer is full; then go to step (3-14);

(3-13) The Reduce task process writes its result to HDFS; then go to step (3-14);

(3-14) The task finishes; then return to step (3-1).
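The branching in sub-steps (3-2) to (3-14) can be summarized by a small routing function. The sketch below only returns the sequence of step labels a task would pass through; the function name `route` and the label strings are illustrative choices, not part of the source.

```python
from typing import List


def route(iterative: bool, task: str, first_run: bool = False) -> List[str]:
    """Return the sub-steps a task passes through on the slave node.

    task is "map" or "reduce"; first_run only matters for Map tasks of iterative jobs.
    """
    steps: List[str] = []
    if task == "map":
        if iterative:
            # (3-5): the first iteration reads initial values, later ones fetch cached data.
            steps.append("3-7 read initial dynamic data" if first_run
                         else "3-6 fetch dynamic data from reducer nodes")
        steps.append("3-8 process input split")
    else:
        steps.append("3-9 copy and merge map output")
        steps.append("3-10 run reduce")
        # (3-11): iterative jobs keep results local, non-iterative jobs write to HDFS.
        steps.append("3-12 cache result locally" if iterative
                     else "3-13 write result to HDFS")
    steps.append("3-14 finish")
    return steps
```

For example, `route(True, "map")` yields the path for a Map task in the second or a later iteration, which is the only path that includes the added Map-side shuffle of step (3-6).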

Preferably, the dynamic data files in step (3-6) are managed by the dynamic data cache component of the slave node and stored in memory and on the local disk. The copied dynamic data is also managed by the dynamic data cache component. Multiple Map task processes on the same slave node request the dynamic data files from the slave node's local store, and this data serves as the dynamic data input required by the Map task processes.
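A cache component that keeps dynamic data in memory and spills to local disk when its buffer is full could be sketched as follows. The class name `DynamicDataCache`, the entry-count capacity and the use of `pickle` files are assumptions made for illustration; the source does not describe the component's internal layout.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, Optional


class DynamicDataCache:
    """Keeps entries in memory and spills them to local disk when the buffer is full."""

    def __init__(self, capacity: int, spill_dir: Optional[str] = None) -> None:
        self.capacity = capacity                      # max entries held in memory (assumed unit)
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="dyn-cache-")
        self.memory: Dict[str, Any] = {}
        self.spilled: Dict[str, str] = {}             # key -> path of the spilled file
        self._next_id = 0

    def put(self, key: str, value: Any) -> None:
        if key not in self.memory and len(self.memory) >= self.capacity:
            self._spill()
        self.memory[key] = value

    def get(self, key: str) -> Any:
        if key in self.memory:                        # the newest value always lives in memory
            return self.memory[key]
        with open(self.spilled[key], "rb") as handle:
            return pickle.load(handle)

    def _spill(self) -> None:
        """Write every in-memory entry to its own file, then empty the buffer."""
        for key, value in self.memory.items():
            path = os.path.join(self.spill_dir, "entry-%d.pkl" % self._next_id)
            self._next_id += 1
            with open(path, "wb") as handle:
                pickle.dump(value, handle)
            self.spilled[key] = path
        self.memory.clear()
```

Readers on the same node call `get` without knowing whether the value is still in memory or has been spilled, which mirrors the idea that Map task processes simply request the data from the node's local store.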

Preferably, in step (3-8), the split size defaults to the HDFS block size, which is set in a configuration file. The Map task process parses the split into records of the <key, value> form it requires, executes the Map() method, and caches the results in memory, spilling them to disk when the buffer is full. Each spill file records partition information; a single spill file is sorted first by partition and then by key. If multiple spill files need to be merged into one large file, the merge performs a merge sort over the spill files.
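The spill ordering and the merge described above can be illustrated with in-memory lists standing in for spill files. The helper names are hypothetical, and `partition_of` is a deterministic stand-in for a hash partitioner, not Hadoop's own partitioning function.

```python
import heapq
from typing import Iterable, List, Tuple

Record = Tuple[int, str, int]   # (partition, key, value)


def partition_of(key: str, num_reducers: int) -> int:
    """Deterministic stand-in for a hash partitioner (an assumption of this sketch)."""
    return sum(key.encode("utf-8")) % num_reducers


def make_spill(records: Iterable[Tuple[str, int]], num_reducers: int) -> List[Record]:
    """One spill: tag each record with its partition, then sort by partition, then key."""
    return sorted((partition_of(k, num_reducers), k, v) for k, v in records)


def merge_spills(spills: List[List[Record]]) -> List[Record]:
    """Merge already-sorted spills into one sorted run, as a merge sort would."""
    return list(heapq.merge(*spills))


spill_1 = make_spill([("c", 1), ("a", 1), ("b", 1)], num_reducers=2)
spill_2 = make_spill([("d", 1), ("a", 2)], num_reducers=2)
merged = merge_spills([spill_1, spill_2])
```

Because every spill is already ordered by partition and key, `heapq.merge` only has to interleave them, so the merged run keeps each reducer's records contiguous and key-sorted.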

In general, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:

(1) Better data locality of tasks: owing to step (2), the scheduling strategy proposed by the invention balances the demand for task locality against the overhead of delaying tasks better than the delay scheduling strategy does. It divides Hadoop tasks into two classes, computation-intensive and transmission-intensive, and predicts the delay time in real time using the load information of the cluster network. This both raises the proportion of localized tasks and effectively reduces the overall delay overhead of jobs. The invention therefore has a clear advantage.

(2) Lower dynamic data transmission overhead: owing to step (3), the dynamic data caching strategy proposed by the invention significantly reduces the cluster network transmission overhead incurred by reading and writing dynamic data. Theoretical proof and experimental verification show that the total dynamic data transmitted by an iterative job is proportional only to the number of nodes on which its tasks run, and has a definite upper bound, namely the number of cluster nodes. The invention therefore has a clear advantage.

(3) Higher cluster efficiency in multi-job, multi-user environments: in such environments, the cluster's network resources become the bottleneck of cluster efficiency and greatly limit the effective utilization of the cluster. By optimizing Hadoop's network data flows, the invention reduces cluster network transmission overhead, relieves cluster network load, reduces competition for network resources between users and between jobs, and improves effective cluster utilization under multiple jobs and multiple users. The invention therefore has a clear advantage.

(4) Efficient and transparent support for iterative computation: compared with traditional Hadoop, the invention supports traditional batch jobs and also supports iterative jobs better, so it applies to a wider range of fields, such as social networks, computer vision and data mining. The invention therefore has a clear advantage.

Brief description of the drawings

Fig. 1 is a flowchart of the MapReduce optimization method suitable for iterative computation according to the invention.

Fig. 2 is a detailed flowchart of step (2) of the invention.

Fig. 3 is a detailed flowchart of step (3) of the invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

The technical terms of the invention are first explained below:

Dynamic data: in an iterative computation problem, a variable whose new value is repeatedly derived, directly or indirectly, from its old value.

Static data: in an iterative computation problem, data that never changes, usually the original input data of the algorithm.

Computation-type task: a task whose computation time accounts for the major part of the whole processing of the Map task.

Transmission-type task: a task whose data transmission time accounts for the major part of the whole processing of the Map task.

Localized task: a Map task whose input data split is stored locally on the slave node.

Delay scheduling strategy: a strategy that delays the scheduling of non-localized tasks.
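The distinction between computation-type and transmission-type tasks can be expressed as a simple comparison. The sketch below assumes that a Map task's processing time splits into just two measured or predicted components and that ties count as computation-type; this section does not state how the times are obtained or how ties are handled, and the function name `classify_task` is hypothetical.

```python
def classify_task(compute_seconds: float, transfer_seconds: float) -> str:
    """Label a Map task by whichever component dominates its processing time."""
    # Assumption: only two components, and ties are treated as computation-type.
    if compute_seconds >= transfer_seconds:
        return "computation"
    return "transmission"


labels = [classify_task(12.0, 3.0), classify_task(2.0, 9.5)]
```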

The overall idea of the invention is to focus on multi-user, multi-job cluster environments and reduce the network transmission load of the cluster by optimizing the static data flow and the shared data flow. For the static data flow, the main contribution of the invention is a prediction scheduling algorithm; for the shared data flow, the invention achieves its goal through a data caching method and an added Map-side shuffle process.

The MapReduce optimization method suitable for iterative computation according to the invention is applied in a Hadoop cluster system that includes one master (Master) node and multiple slave (Slave) nodes. The method comprises the following steps (as shown in Fig. 1):

(1) The master node receives multiple Hadoop jobs submitted by users; the job service process (JobTracker) of the master node puts the jobs into a job queue, where they wait for the job scheduler of the master node to schedule them;

(2) The master node waits for a task request sent by a slave node. After receiving the task request, the job scheduler of the master node preferentially schedules localized tasks. If the slave node that sent the task request has no localized task, prediction scheduling is performed according to the task type of the Hadoop job: a computation-type task is scheduled directly, whereas a transmission-type task is delayed by a certain interval, and the Hadoop job is scheduled only when the total delay reaches the delay threshold. This step specifically includes the following sub-steps (as shown in Fig. 2):

(2-1) The job service process on the master node listens and waits for heartbeat information sent by the task service process (TaskTracker) on a slave node. The heartbeat information contains the current running information of the slave node, including the total number of slots and the number of slots currently running;

(2-2) After receiving the heartbeat information from the slave node, the master node uses it to calculate the number of idle slots on the current slave node and the average number of running slots across the whole Hadoop cluster system, and decides from the result whether a task of a job needs to be assigned to the current slave node. If no task needs to be assigned, return to step (2-1); otherwise execute step (2-3). Specifically, the number of idle slots on the current slave node equals its total number of slots minus its number of running slots, and the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes observed by the tracking process, divided by the number of slots of all slave nodes. If the number of idle slots on the current node equals 0, no task needs to be assigned; likewise, if the number of slots currently running on the current slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be assigned;

(2-3) Set counter i = 0;

(2-6) Set counter j = 0;

(2-8) Delay the task scheduling of the j-th Hadoop job by one heartbeat interval. Specifically, the heartbeat interval is the interval at which a slave node sends heartbeat information, namely 3 seconds;

(2-9) Determine whether the total delay of the task scheduling of the j-th Hadoop job has reached a threshold. If it has, go to step (2-11); otherwise go to step (2-10). The threshold can be configured by the cluster administrator on the following basis: the larger the threshold, the higher the proportion of localized tasks but the greater the delay overhead; the smaller the threshold, the lower the proportion of localized tasks but the smaller the delay overhead. The threshold defaults to 3 minutes;

The advantage of this step is that tasks are classified: computation-type tasks are scheduled in the default manner, while transmission-type tasks are scheduled predictively. This raises the proportion of localized tasks while also reducing the overhead caused by delay.
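As a quick check on the default values given above, the following sketch computes how many heartbeat-sized delays fit before the threshold is reached. It assumes that each deferral adds exactly one heartbeat interval to the job's total delay; the function name is hypothetical.

```python
import math


def deferrals_until_threshold(threshold_seconds: float, heartbeat_seconds: float) -> int:
    """Number of one-heartbeat delays needed for the total delay to reach the threshold."""
    return math.ceil(threshold_seconds / heartbeat_seconds)


default_deferrals = deferrals_until_threshold(threshold_seconds=3 * 60, heartbeat_seconds=3)
```

With a 3-second heartbeat and a 3-minute threshold this gives 60 deferrals, which is the trade-off the administrator tunes: a larger threshold allows more chances to find a localized task at the cost of more waiting.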

(3) After receiving a task of a Hadoop job scheduled by the master node, the slave node determines the type of the Hadoop job and handles it accordingly. Jobs are either iterative or non-iterative. A non-iterative job is handled in the conventional Hadoop manner. For an iterative job, a Map-side shuffle process is added before the Map stage to read dynamic data for the tasks of the Map stage (that is, the Map tasks); in the Reduce stage, the dynamic data is cached locally and handed to the dynamic data cache component of the slave node for management; and the final result is saved in the Hadoop Distributed File System (HDFS) after the job completes. This step specifically includes the following sub-steps (as shown in Fig. 3):

(3-1) Receive a task of a Hadoop job scheduled by the master node;

(3-6) The task service process on the slave node where the Map task process (Mapper) resides starts multiple data copy threads, which request, over HTTP, the dynamic data files computed by the Reduce task processes (Reducer) from the slave nodes where those Reduce task processes reside; then go to step (3-8). These dynamic data files are managed by the dynamic data cache component of the slave node and stored in memory and on the local disk. The copied dynamic data is also managed by the dynamic data cache component. Multiple Map task processes on the same slave node request the dynamic data files from the slave node's local store, and this data serves as the dynamic data input required by the Map task processes;

The advantages of this sub-step are: 1. the dynamic data produced by the Reduce stage is stored locally, which avoids the overhead of writing it to HDFS; 2. the dynamic data produced by the Reduce stage is requested by the slave node on which the Map task processes reside and is stored locally on that slave node, and the Map task processes request the data from the slave node's local store, which greatly reduces the transmission volume of dynamic data.
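The node-level sharing described in this sub-step can be sketched as a per-node store that fetches each reducer's dynamic data once, using several copy threads, and then serves every local Map task from its own copy. The class name `NodeLocalDynamicData` and the `fetch` callable are hypothetical; in a real deployment `fetch` would issue an HTTP request to the reducer's node, whereas a stub is used here so that the example runs without a network. The sketch is not safe for concurrent `load` calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class NodeLocalDynamicData:
    """Per-slave-node store shared by every Map task running on that node."""

    def __init__(self, fetch: Callable[[str], bytes], copy_threads: int = 4) -> None:
        self._fetch = fetch                    # hypothetical: e.g. an HTTP GET to a reducer node
        self._copy_threads = copy_threads
        self._files: Dict[str, bytes] = {}     # reducer node -> dynamic data already copied
        self.remote_requests = 0               # fetches that actually left this node

    def load(self, reducer_nodes: List[str]) -> Dict[str, bytes]:
        """Return dynamic data for the given reducers, copying only what is missing."""
        missing = [node for node in reducer_nodes if node not in self._files]
        if missing:
            with ThreadPoolExecutor(max_workers=self._copy_threads) as pool:
                for node, payload in zip(missing, pool.map(self._fetch, missing)):
                    self._files[node] = payload
            self.remote_requests += len(missing)
        return {node: self._files[node] for node in reducer_nodes}


store = NodeLocalDynamicData(lambda node: ("dynamic-data-from-" + node).encode())
first_map_task = store.load(["r1", "r2"])
second_map_task = store.load(["r1", "r2"])
```

After both calls `store.remote_requests` is 2, not 4: the second Map task on the node is served locally, which illustrates why the transmitted volume depends on the number of nodes and not on the number of Map tasks.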

(3-7) The Map task process reads the initial value of the dynamic data; then go to step (3-8). Briefly, an iterative job both consumes and produces dynamic data, so when the job is executed for the first time the user needs to provide the initial value of this dynamic data;

(3-8) The Hadoop cluster system divides the input file of the job into individual splits, and the Map task process processes its split; then go to step (3-14). Specifically, the split size defaults to the HDFS block size, which is set in a configuration file. The Map task process parses the split into records of the <key, value> form it requires, executes the Map() method, and caches the results in memory, spilling them to disk when the buffer is full. Each spill file records partition information; a single spill file is sorted first by partition and then by key. If multiple spill files need to be merged into one large file, the merge performs a merge sort over the spill files;

(3-14) The task finishes; then return to step (3-1).

Example：

To verify the feasibility and effectiveness of the invention, a computer program was written and run in the experimental environment shown in Table 1 below to test the invention. The test results are shown in Tables 2 and 3 below:

Table 1: Experimental environment configuration

In Tables 2 and 3, the invention is compared with Hadoop-0.20.0 and HaLoop, and the experimental algorithm is fuzzy C-Means. Table 2 compares the network transmission volume of dynamic data for the three MapReduce implementations at different experiment scales. Table 3 compares the execution times of the three MapReduce implementations for different numbers of iterations at a given experiment scale. The experimental results show that the invention achieves a satisfactory improvement in both network data transmission and execution time.

Table 2: Comparison of dynamic data transmission volume in fuzzy C-Means

Table 3: Comparison of fuzzy C-Means execution times

Those skilled in the art will readily understand that the foregoing is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A MapReduce optimization method suitable for iterative computation, applied in a Hadoop cluster system that includes one master node and multiple slave nodes, characterized in that the method comprises the following steps:

(1) The master node receives multiple Hadoop jobs submitted by users; the job service process of the master node puts the jobs into a job queue, where they wait for the job scheduler of the master node to schedule them;

(2) The master node waits for a task request sent by a slave node. After receiving the task request, the job scheduler of the master node preferentially schedules localized tasks. If the slave node that sent the task request has no localized task, prediction scheduling is performed according to the task type of the Hadoop job: a computation-type task is scheduled directly, whereas a transmission-type task is delayed by a certain interval, and the Hadoop job is scheduled only when the total delay reaches the delay threshold; wherein a computation-type task is a task whose computation time accounts for the major part of the whole processing of the Map task, and a transmission-type task is a task whose data transmission time accounts for the major part of the whole processing of the Map task;

wherein step (2) specifically includes the following sub-steps:

(2-1) The job service process on the master node listens and waits for heartbeat information sent by the task service process on a slave node. The heartbeat information contains the current running information of the slave node, specifically the total number of slots and the number of slots currently running;

(2-2) After receiving the heartbeat information from the slave node, the master node uses it to calculate the number of idle slots on the current slave node and the average number of running slots across the whole Hadoop cluster system, and decides from the result whether a task of a job needs to be assigned to the current slave node. If no task needs to be assigned, return to step (2-1); otherwise execute step (2-3);

(2-3) Set counter i = 0;

(2-4) Determine whether the i-th Hadoop job has a localized task on the current slave node, that is, whether the current slave node stores an input data split (Split) of the i-th Hadoop job. If not, go to step (2-5); if so, go to step (2-11);

(2-5) Set i = i + 1 and determine whether i equals the number of Hadoop jobs. If it does, go to step (2-7); otherwise return to step (2-4);

(2-6) Set counter j = 0;

(2-7) Determine whether the task type of the j-th Hadoop job is computation-type or transmission-type. If it is computation-type, go to step (2-11); if it is transmission-type, go to step (2-8);

(2-8) Delay the task scheduling of the j-th Hadoop job by one heartbeat interval;

(2-9) Determine whether the total delay of the task scheduling of the j-th Hadoop job has reached a threshold. If it has, go to step (2-11); otherwise go to step (2-10);

(2-10) Set j = j + 1 and determine whether j equals the number of Hadoop jobs. If it does, go to step (2-12); otherwise return to step (2-1);

(2-11) Schedule the localized task of the i-th Hadoop job to the current slave node; the process then ends;

(2-12) Schedule the task of the j-th Hadoop job to the current slave node; the process then ends;

(3) After receiving a task of a Hadoop job scheduled by the master node, the slave node determines the type of the Hadoop job and handles it accordingly. Jobs are either iterative or non-iterative. A non-iterative job is handled in the conventional Hadoop manner. For an iterative job, a Map-side shuffle process is added before the Map stage to read dynamic data for the Map tasks; in the Reduce stage, the dynamic data is cached locally and handed to the dynamic data cache component of the slave node for management; and the final result is saved in HDFS after the job completes.

2. The MapReduce optimization method according to claim 1, characterized in that, in step (2-2), the number of idle slots on the current slave node equals its total number of slots minus its number of running slots; the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes observed by the tracking process, divided by the number of slots of all slave nodes; if the number of idle slots on the current node equals 0, no task needs to be assigned; and likewise, if the number of slots currently running on the current slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be assigned.

3. The MapReduce optimization method according to claim 1, characterized in that step (3) specifically includes the following sub-steps:

(3-1) Receive a task of a Hadoop job scheduled by the master node;

(3-2) Determine whether the job to which the task belongs is an iterative job or a non-iterative job. If it is an iterative job, go to step (3-3); if it is a non-iterative job, go to step (3-4);

(3-3) Determine whether the task of the iterative job is a Map task or a Reduce task. If it is a Map task, go to step (3-5); if it is a Reduce task, go to step (3-9);

(3-4) Determine whether the task of the non-iterative job is a Map task or a Reduce task. If it is a Map task, go to step (3-8); if it is a Reduce task, go to step (3-9);

(3-5) Determine whether the iterative job is running for the first time. If not, go to step (3-6); if so, go to step (3-7);

(3-6) The task service process on the slave node where the Map task process (Mapper) resides starts multiple data copy threads, which request, over HTTP, the dynamic data files computed by the Reduce task processes from the slave nodes where those Reduce task processes reside; then go to step (3-8);

(3-7) The Map task process reads the initial value of the dynamic data; then go to step (3-8);

(3-8) The Hadoop cluster system divides the input file of the job into individual splits, and the Map task process processes its split; then go to step (3-14);

(3-9) The Reducer task process on the slave node starts data copy threads, which request, over HTTP, the intermediate output files of the Map task processes from the slave nodes where those Map task processes reside. The intermediate output files are stored on the local disk of the slave node; copied files are first placed in a memory buffer, and the multiple copied files are eventually merged into one large file sorted by key; then go to step (3-10);

(3-10) The Reduce task process reads records in <key, iterator> form from the resulting large file and executes the Reduce() method; then go to step (3-11);

(3-11) Determine whether the job is iterative or non-iterative. If it is an iterative job, go to step (3-12); if it is a non-iterative job, go to step (3-13);

(3-12) The dynamic data cache component of the slave node caches the result of the Reduce task process in memory and spills it to a local disk file when the buffer is full; then go to step (3-14);

(3-13) The Reduce task process writes its result to HDFS; then go to step (3-14);

(3-14) The task finishes; then return to step (3-1).

4. The MapReduce optimization method according to claim 3, characterized in that the dynamic data files in step (3-6) are managed by the dynamic data cache component of the slave node and stored in memory and on the local disk; the copied dynamic data is managed by the dynamic data cache component; multiple Map task processes on the same slave node request the dynamic data files from the slave node's local store; and this data serves as the dynamic data input required by the Map task processes.

5. The MapReduce optimization method according to claim 3, characterized in that, in step (3-8), the split size defaults to the HDFS block size, which is set in a configuration file; the Map task process parses the split into records of the <key, value> form it requires, executes the Map() method, and caches the results in memory, spilling them to disk when the buffer is full; each spill file records partition information, and a single spill file is sorted first by partition and then by key; and if multiple spill files need to be merged into one large file, the merge performs a merge sort over the spill files.