CN103617087B - MapReduce optimizing method suitable for iterative computations
Publication number: CN103617087B
Application number: CN201310600745.7A
Authority: CN (China)
Prior art keywords: task, node, hadoop, map, tasks
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a MapReduce optimization method suitable for iterative computation. The method is applied to a Hadoop cluster system comprising one master node and multiple slave nodes, and includes the following steps: the master node receives multiple Hadoop jobs submitted by users; the job service process of the master node places the jobs in a job queue, where they wait to be scheduled by the job scheduler of the master node; the master node waits for task requests sent by the slave nodes; after the master node receives a task request, its job scheduler schedules local tasks first; and if the slave node that sent the task request has no local task, predictive scheduling is performed according to the task type of the Hadoop job. The method supports traditional data-intensive applications and also supports iterative computation transparently and efficiently. It addresses dynamic data and static data separately and reduces the volume of data transferred.
Description
Technical field
The invention belongs to the field of parallel computing and massive data processing, and more particularly relates to a MapReduce optimization method suitable for iterative computation.
Background art
Since the start of the 21st century, the scale of data processing has grown steadily. Terabyte-scale data sets are increasingly common, and petabyte-scale data sets have appeared. Data at this scale far exceeds the processing capacity of a single PC. It is this demand for processing capacity that has driven the development of parallel and distributed computing platforms. Against this background Google's MapReduce model emerged; it is a popular data-intensive computing model for large cluster environments.
MapReduce is a programming model for parallel operations on large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and the main ideas behind them, are borrowed from functional programming languages, together with features borrowed from vector programming languages. The model makes it easy for programmers who are not familiar with distributed parallel programming to run their programs on a distributed system. In this model all data is organized as <key, value> pairs. When programming, the programmer only needs to implement the Map and Reduce functions. The Map function processes an input <key, value> pair and outputs zero or more key-value pairs; the Reduce function reads the intermediate output of Map and finally produces zero or more results. The MapReduce model follows a principle of relative independence: there are no data dependencies between Map tasks or between Reduce tasks.
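The Map and Reduce contract described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the names map_fn, reduce_fn and run_job are hypothetical, and the grouping step merely stands in for the shuffle that a real cluster performs between the two stages.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit one (word, 1) pair for every word in a line of text."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: sum all counts that were collected for one word."""
    yield key, sum(values)


def run_job(records):
    """Run map, group the intermediate pairs by key, then run reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results


# Input records are (line offset, line text) pairs.
counts = run_job([(0, "map reduce map"), (1, "reduce map")])
```

Because each call to map_fn depends only on its own input pair, and each call to reduce_fn only on the values of one key, the calls could run on different machines, which is the independence property the paragraph describes.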
The design of the MapReduce model makes it well suited to batch-mode computation, such as log analysis and text processing. Besides these batch applications, however, there are applications based on machine learning or pattern recognition, typically computer vision and data mining. The core algorithms of such applications are iterative. Current Hadoop (an open-source implementation of the MapReduce model) cannot support iterative computation transparently and efficiently, and some characteristics of Hadoop are actually unsuited to it. With the development of social networks, computer vision, data mining and similar fields, the data processing scale of such applications keeps growing, and so does the demand for a parallel computing model that supports them effectively.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a MapReduce optimization method suitable for iterative computation. Its purpose is to improve on Hadoop so that it supports traditional data-intensive applications and also supports iterative computation transparently and efficiently, and to reduce the volume of data transferred by addressing dynamic data and static data separately.
To achieve the above purpose, according to one aspect of the invention, a MapReduce optimization method suitable for iterative computation is provided. It is applied in a Hadoop cluster system that includes one master node and multiple slave nodes, and comprises the following steps:
(1) The master node receives multiple Hadoop jobs submitted by users; the job service process of the master node places the jobs in a job queue, where they wait for the job scheduler of the master node to schedule them;
(2) The master node waits for task requests sent by the slave nodes. After receiving a task request, the job scheduler of the master node schedules local tasks first. If the slave node that sent the task request has no local task, predictive scheduling is performed according to the task type of the Hadoop job: a compute-type task is scheduled directly, whereas a transfer-type task is delayed by a certain interval, and the Hadoop job is scheduled only when the total delay reaches the delay threshold;
(3) After receiving a task of a Hadoop job scheduled by the master node, the slave node determines the Hadoop job type and processes the task accordingly. Jobs are divided into two types, iterative and non-iterative. A non-iterative job is processed in the conventional Hadoop manner. For an iterative job, a Map-side shuffle process is added before the Map stage to read dynamic data for the Map tasks; in the Reduce stage, the dynamic data is cached locally and managed by the dynamic data cache component of the slave node, and the final result is stored in HDFS after the job has completed.
Preferably, step (2) specifically includes the following sub-steps:
(2-1) The job service process on the master node listens and waits for heartbeat messages sent by the task service processes of the slave nodes. A heartbeat message contains the current running information of the slave node, specifically its total number of slots and its number of running slots;
(2-2) After receiving a heartbeat message from a slave node, the master node uses it to calculate the number of idle slots on the current slave node and the average number of running slots of the whole Hadoop cluster system, and decides from the result whether a task of a job needs to be assigned to the current slave node. If no task needs to be assigned, return to step (2-1); otherwise go to step (2-3);
(2-3) Set counter i = 0;
(2-4) Determine whether the i-th Hadoop job has a local task on the current slave node, that is, whether the current slave node stores an input data split (Split) of the i-th Hadoop job. If not, go to step (2-5); if so, go to step (2-11);
(2-5) Set i = i + 1 and determine whether i equals the number of Hadoop jobs. If so, go to step (2-7); otherwise return to step (2-4);
(2-6) Set counter j = 0;
(2-7) Determine whether the task type of the j-th Hadoop job is compute-type or transfer-type. If compute-type, go to step (2-11); if transfer-type, go to step (2-8);
(2-8) Delay the task scheduling of the j-th Hadoop job by one heartbeat interval;
(2-9) Determine whether the total delay of task scheduling for the j-th Hadoop job has reached a threshold. If so, go to step (2-11); otherwise go to step (2-10);
(2-10) Set j = j + 1 and determine whether j equals the number of Hadoop jobs. If so, go to step (2-12); otherwise return to step (2-1);
(2-11) Schedule the local task of the i-th Hadoop job to the current slave node; the process then ends;
(2-12) Schedule a task of the j-th Hadoop job to the current slave node; the process then ends.
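The scheduling decision in these sub-steps can be sketched as follows. This is a simplified reading rather than the patented flow itself: it examines every job within a single heartbeat instead of returning to step (2-1) between jobs, and the Job class, its fields and the schedule function are hypothetical names. The 3-second heartbeat and the 3-minute threshold are the defaults given in the detailed description.

```python
from dataclasses import dataclass, field

HEARTBEAT_SECONDS = 3    # heartbeat interval stated in the detailed description
DELAY_THRESHOLD = 180    # default delay threshold of 3 minutes, in seconds


@dataclass
class Job:
    name: str
    task_type: str                                  # "compute" or "transfer"
    local_nodes: set = field(default_factory=set)   # nodes that store an input split
    delayed: int = 0                                # accumulated delay in seconds


def schedule(node, jobs):
    """Return (job name, locality) for the requesting node, or None to keep waiting."""
    # First pass: a job with an input split on this node is scheduled at once.
    for job in jobs:
        if node in job.local_nodes:
            return job.name, "local"
    # Second pass: no local task exists, so decide by task type.
    for job in jobs:
        if job.task_type == "compute":
            return job.name, "non-local"
        # Transfer-type: wait one more heartbeat unless the threshold is reached.
        job.delayed += HEARTBEAT_SECONDS
        if job.delayed >= DELAY_THRESHOLD:
            job.delayed = 0
            return job.name, "non-local"
    return None


jobs = [Job("pagerank", "transfer")]
decisions = [schedule("node-1", jobs) for _ in range(60)]
```

With these defaults a transfer-type job with no local data is held back for 59 heartbeats and released on the 60th, while a compute-type job is never delayed.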
Preferably, in step (2-2), the number of idle slots on the current slave node equals its total number of slots minus its number of running slots, and the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes, as monitored by the job tracking process, divided by the slot count of all slave nodes. If the number of idle slots on the current node equals 0, no task needs to be assigned; likewise, if the number of running slots on the current slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be assigned.
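The two conditions above amount to a small predicate, sketched below. One point is an assumption: the text describes the divisor of the cluster average as the slot count of all slave nodes, which is ambiguous, so the sketch takes a per-node average (divisor equal to the number of slave nodes) so that the result is comparable with a slot count. The function and parameter names are hypothetical.

```python
def needs_task(total_slots, running_slots, cluster_running, slave_count):
    """Return True if the master should try to assign a task to this slave node.

    cluster_running is the sum of running slots over all slave nodes.
    slave_count is the divisor used for the cluster average (an assumption).
    """
    free_slots = total_slots - running_slots
    if free_slots == 0:
        # The node has no idle slot, so nothing can be assigned.
        return False
    average_running = cluster_running / slave_count
    if running_slots > average_running:
        # The node is already busier than the cluster average.
        return False
    return True
```

For example, in a four-node cluster with eight running slots in total, a node with four slots and one running slot would be offered a task, whereas a node with three running slots would not.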
Preferably, step (3) specifically includes the following sub-steps:
(3-1) Receive a task of a Hadoop job scheduled by the master node;
(3-2) Determine whether the job to which the task belongs is an iterative job or a non-iterative job. If iterative, go to step (3-3); if non-iterative, go to step (3-4);
(3-3) Determine whether the task of the iterative job is a Map task or a Reduce task. If it is a Map task, go to step (3-5); if it is a Reduce task, go to step (3-9);
(3-4) Determine whether the task of the non-iterative job is a Map task or a Reduce task. If it is a Map task, go to step (3-8); if it is a Reduce task, go to step (3-9);
(3-5) Determine whether the iterative job is running for the first time. If not, go to step (3-6); if so, go to step (3-7);
(3-6) The task service process of the slave node on which the Map task process (Mapper) runs starts multiple data copy threads, which request over HTTP the dynamic data files computed by the Reduce task processes from the slave nodes on which those Reduce task processes run; then go to step (3-8);
(3-7) The Map task process reads the initial values of the dynamic data; then go to step (3-8);
(3-8) The Hadoop cluster system divides the input file of the job into splits, and the Map task process processes a split; then go to step (3-14);
(3-9) The Reduce task process (Reducer) on the slave node starts data copy threads, which request over HTTP the intermediate output files of the Map task processes from the slave nodes on which those Map task processes run. The intermediate output files are stored on the local disk of those slave nodes. Copied files are first placed in a memory buffer, and the copied files are finally merged into one large file sorted by key; then go to step (3-10);
(3-10) The Reduce task process reads records in <key, iterator> form from the resulting large file and executes the Reduce() method; then go to step (3-11);
(3-11) Determine whether the job type is iterative or non-iterative. If iterative, go to step (3-12); if non-iterative, go to step (3-13);
(3-12) The dynamic data cache component of the slave node caches the result produced by the Reduce task process in memory and spills it to a local disk file when the buffer is full; then go to step (3-14);
(3-13) The Reduce task process writes its result to HDFS; then go to step (3-14);
(3-14) Task execution ends; return to step (3-1).
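The branching in sub-steps (3-2) to (3-13) can be summarized by a small dispatch function. The sketch below only returns the ordered actions for one task as labels; the function name, its parameters and the label strings are hypothetical, and no real Hadoop work is performed.

```python
def slave_steps(iterative, task_kind, first_run=False):
    """Return the ordered actions a slave node performs for one task."""
    steps = []
    if task_kind == "map":
        if iterative:
            if first_run:
                steps.append("read initial dynamic data")        # step (3-7)
            else:
                steps.append("copy dynamic data from reducers")  # step (3-6)
        steps.append("process input split")                      # step (3-8)
    elif task_kind == "reduce":
        steps.append("copy and merge map output")                # step (3-9)
        steps.append("run reduce")                               # step (3-10)
        if iterative:
            steps.append("cache result locally")                 # step (3-12)
        else:
            steps.append("write result to HDFS")                 # step (3-13)
    else:
        raise ValueError("task_kind must be 'map' or 'reduce'")
    return steps
```

The only differences between the two job types are the extra dynamic data read before the Map stage and the destination of the Reduce output, which is why non-iterative jobs keep their conventional Hadoop behaviour.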
Preferably, the dynamic data files in step (3-6) are managed by the dynamic data cache component of the slave node and are stored in memory and on local disk. The copied dynamic data is also managed by the dynamic data cache component. Multiple Map task processes on the same slave node request the dynamic data files from the slave node locally, and this data serves as the dynamic data input required by the Map task processes.
Preferably, in step (3-8), the split size defaults to the HDFS block size, which is set through a configuration file. The Map task process parses the split into records of the <key, value> form it requires and executes the Map() method. The results are cached in memory and spilled to disk when the buffer is full. A spill file records partition information; within a single spill file, records are sorted first by partition and then by key. If multiple spill files need to be merged into one large file, this process performs a merge sort over the spill files.
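The spill ordering and the final merge described above can be sketched in a few lines. The partitioner below is a hypothetical stand-in chosen only because it is deterministic; spills are kept as in-memory lists rather than disk files, and the function names are illustrative.

```python
import heapq


def partition_of(key, num_partitions):
    """Hypothetical partitioner: spread keys by the sum of their character codes."""
    return sum(ord(ch) for ch in key) % num_partitions


def make_spill(records, num_partitions):
    """Sort one buffer of (key, value) records by partition, then by key."""
    tagged = [(partition_of(k, num_partitions), k, v) for k, v in records]
    return sorted(tagged, key=lambda t: (t[0], t[1]))


def merge_spills(spills):
    """Merge already-sorted spills into one sequence with the same ordering."""
    return list(heapq.merge(*spills, key=lambda t: (t[0], t[1])))


spill_a = make_spill([("b", 1), ("a", 2)], num_partitions=2)
spill_b = make_spill([("c", 3), ("b", 4)], num_partitions=2)
merged = merge_spills([spill_a, spill_b])
```

Because every spill is already sorted by (partition, key), the merge only needs to compare the head of each spill, so it never has to hold all records in memory at once.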
In general, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:
(1) Better data locality for tasks: because of step (2), the scheduling strategy proposed in the invention balances the locality requirement of tasks against the delay overhead better than the delay scheduling strategy does. It divides tasks in Hadoop into two classes, compute-intensive and transfer-intensive, and predicts the delay time in real time using the load information of the cluster network. This raises the proportion of local tasks while effectively reducing the overall delay overhead of a job. The invention therefore has a clear advantage.
(2) Lower dynamic data transfer overhead: because of step (3), the dynamic data caching strategy proposed by the invention significantly reduces the cluster network transfer overhead caused by reading and writing dynamic data. Theoretical proof and experimental verification show that the total dynamic data transferred by an iterative job is proportional only to the number of nodes on which its tasks run, with a definite upper bound given by the number of cluster nodes. The invention therefore has a clear advantage.
(3) Higher cluster efficiency with multiple jobs and multiple users: in a multi-job, multi-user environment, cluster network resources become the bottleneck of cluster efficiency and greatly limit effective utilization of the cluster. By optimizing the network data flows of Hadoop, the invention reduces cluster network transfer overhead, effectively relieves the cluster network load, reduces competition for network resources between users and between jobs, and improves effective cluster utilization under multiple jobs and multiple users. The invention therefore has a clear advantage.
(4) Efficient and transparent support for iterative computation: compared with traditional Hadoop, the invention supports traditional batch jobs and also supports iterative jobs better, so it applies to a wider range of fields, such as social networks, computer vision and data mining. The invention therefore has a clear advantage.
Description of the drawings
Fig. 1 is a flow chart of the MapReduce optimization method suitable for iterative computation according to the invention.
Fig. 2 is a detailed flow chart of step (2) of the invention.
Fig. 3 is a detailed flow chart of step (3) of the invention.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.
The technical terms of the invention are first explained below:
Dynamic data: in an iterative computation problem, a variable whose new value is repeatedly derived, directly or indirectly, from its old value.
Static data: in an iterative computation problem, data that never changes, usually the original input data of the algorithm.
Compute-type task: a task whose computation time accounts for the major part of the whole processing of the Map task.
Transfer-type task: a task whose data transfer time accounts for the major part of the whole processing of the Map task.
Local task: a Map task for which the slave node stores an input data split locally.
Delay scheduling strategy: a strategy that delays the scheduling of non-local tasks.
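The distinction between compute-type and transfer-type tasks can be expressed as a simple classifier. The document does not say how the two times are measured or predicted, nor what share counts as "the major part"; the sketch below assumes both times are already known in seconds and uses a 50% cut-off, and the function name is hypothetical.

```python
def classify_task(compute_seconds, transfer_seconds):
    """Label a Map task by whichever cost dominates its total processing time."""
    total = compute_seconds + transfer_seconds
    if total <= 0:
        raise ValueError("task times must sum to a positive value")
    # Assumption: "major part" means more than half of the total time.
    return "compute" if compute_seconds / total > 0.5 else "transfer"
```

Under this assumption a task that computes for 40 seconds and transfers data for 10 seconds is compute-type, so running it on a node without local data costs relatively little, whereas the reverse split would make it transfer-type.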
The overall idea of the invention is to focus on multi-user, multi-job cluster environments and to reduce the network transfer load of the cluster by optimizing the static data flow and the shared data flow. For the static data flow, the main contribution of the invention is a predictive scheduling algorithm; for the shared data flow, the invention achieves its goal through a data caching method and an added Map-side shuffle process.
The MapReduce optimization method suitable for iterative computation according to the invention is applied in a Hadoop cluster system that includes one master (Master) node and multiple slave (Slave) nodes. The method comprises the following steps (as shown in Fig. 1):
(1) The master node receives multiple Hadoop jobs submitted by users; the job service process (JobTracker) of the master node places the jobs in a job queue, where they wait for the job scheduler of the master node to schedule them;
(2) The master node waits for task requests sent by the slave nodes. After receiving a task request, the job scheduler of the master node schedules local tasks first. If the slave node that sent the task request has no local task, predictive scheduling is performed according to the task type of the Hadoop job: a compute-type task is scheduled directly, whereas a transfer-type task is delayed by a certain interval, and the Hadoop job is scheduled only when the total delay reaches the delay threshold. This step specifically includes the following sub-steps (as shown in Fig. 2):
(2-1) The job service process on the master node listens and waits for heartbeat messages sent by the task service processes (TaskTracker) of the slave nodes. A heartbeat message contains the current running information of the slave node, including its total number of slots (slot) and its number of running slots;
(2-2) After receiving a heartbeat message from a slave node, the master node uses it to calculate the number of idle slots on the current slave node and the average number of running slots of the whole Hadoop cluster system, and decides from the result whether a task of a job needs to be assigned to the current slave node. If no task needs to be assigned, return to step (2-1); otherwise go to step (2-3). Specifically, the number of idle slots on the current slave node equals its total number of slots minus its number of running slots, and the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes, as monitored by the job tracking process, divided by the slot count of all slave nodes. If the number of idle slots on the current node equals 0, no task needs to be assigned; likewise, if the number of running slots on the current slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be assigned;
(2-3) Set counter i = 0;
(2-4) Determine whether the i-th Hadoop job has a local task on the current slave node, that is, whether the current slave node stores an input data split (Split) of the i-th Hadoop job. If not, go to step (2-5); if so, go to step (2-11);
(2-5) Set i = i + 1 and determine whether i equals the number of Hadoop jobs. If so, go to step (2-7); otherwise return to step (2-4);
(2-6) Set counter j = 0;
(2-7) Determine whether the task type of the j-th Hadoop job is compute-type or transfer-type. If compute-type, go to step (2-11); if transfer-type, go to step (2-8);
(2-8) Delay the task scheduling of the j-th Hadoop job by one heartbeat interval. Specifically, the heartbeat interval is the interval at which a slave node sends heartbeat messages, namely 3 seconds;
(2-9) Determine whether the total delay of task scheduling for the j-th Hadoop job has reached a threshold. If so, go to step (2-11); otherwise go to step (2-10). The threshold can be configured by the cluster administrator on the following basis: the larger the threshold, the higher the proportion of local tasks but also the greater the delay overhead; the smaller the threshold, the lower the proportion of local tasks but also the smaller the delay overhead. The threshold defaults to 3 minutes;
(2-10) Set j = j + 1 and determine whether j equals the number of Hadoop jobs. If so, go to step (2-12); otherwise return to step (2-1);
(2-11) Schedule the local task of the i-th Hadoop job to the current slave node; the process then ends;
(2-12) Schedule a task of the j-th Hadoop job to the current slave node; the process then ends.
The advantage of this step is that tasks are classified: compute-type tasks are scheduled in the default manner, and transfer-type tasks are scheduled predictively. This both raises the proportion of local tasks and reduces the overhead caused by delay.
(3) After receiving a task of a Hadoop job scheduled by the master node, the slave node determines the Hadoop job type and processes the task accordingly. Jobs are divided into two types, iterative and non-iterative. A non-iterative job is processed in the conventional Hadoop manner. For an iterative job, a Map-side shuffle (shuffle) process is added before the Map stage to read dynamic data for the tasks of the Map stage (that is, the Map tasks); in the Reduce stage, the dynamic data is cached locally and managed by the dynamic data cache component of the slave node, and the final result is stored in the Hadoop distributed file system (Hadoop Distributed File System, HDFS for short) after the job has completed. This step specifically includes the following sub-steps (as shown in Fig. 3):
(3-1) Receive a task of a Hadoop job scheduled by the master node;
(3-2) Determine whether the job to which the task belongs is an iterative job or a non-iterative job. If iterative, go to step (3-3); if non-iterative, go to step (3-4);
(3-3) Determine whether the task of the iterative job is a Map task or a Reduce task. If it is a Map task, go to step (3-5); if it is a Reduce task, go to step (3-9);
(3-4) Determine whether the task of the non-iterative job is a Map task or a Reduce task. If it is a Map task, go to step (3-8); if it is a Reduce task, go to step (3-9);
(3-5) Determine whether the iterative job is running for the first time. If not, go to step (3-6); if so, go to step (3-7);
(3-6) The task service process of the slave node on which the Map task process (Mapper) runs starts multiple data copy threads, which request over HTTP the dynamic data files computed by the Reduce task processes (Reducer) from the slave nodes on which those Reduce task processes run; then go to step (3-8). These dynamic data files are managed by the dynamic data cache component of the slave node and are stored in memory and on local disk. The copied dynamic data is also managed by the dynamic data cache component. Multiple Map task processes on the same slave node request the dynamic data files from the slave node locally, and this data serves as the dynamic data input required by the Map task processes;
The advantages of this sub-step are as follows. First, the dynamic data produced by the Reduce stage is stored locally, which avoids the overhead of writing it to HDFS. Second, the dynamic data produced by the Reduce stage is requested by the slave node on which the Map task process runs and is stored locally on that slave node, and the Map task process requests the data from the slave node locally. This greatly reduces the volume of dynamic data transferred.
(3-7) The Map task process reads the initial values of the dynamic data; then go to step (3-8). In brief, an iterative job needs and produces dynamic data, and when the job is executed for the first time the user must provide the initial values of this dynamic data;
(3-8) The Hadoop cluster system divides the input file of the job into splits, and the Map task process processes a split; then go to step (3-14). Specifically, the split size defaults to the HDFS block size, which is set through a configuration file. The Map task process parses the split into records of the <key, value> form it requires and executes the Map() method. The results are cached in memory and spilled to disk when the buffer is full. A spill file records partition information; within a single spill file, records are sorted first by partition and then by key. If multiple spill files need to be merged into one large file, this process performs a merge sort over the spill files;
(3-9) The Reduce task process (Reducer) on the slave node starts data copy threads, which request over HTTP the intermediate output files of the Map task processes from the slave nodes on which those Map task processes run. The intermediate output files are stored on the local disk of those slave nodes. Copied files are first placed in a memory buffer, and the copied files are finally merged into one large file sorted by key; then go to step (3-10);
(3-10) The Reduce task process reads records in <key, iterator> form from the resulting large file and executes the Reduce() method; then go to step (3-11);
(3-11) Determine whether the job type is iterative or non-iterative. If iterative, go to step (3-12); if non-iterative, go to step (3-13);
(3-12) The dynamic data cache component of the slave node caches the result produced by the Reduce task process in memory and spills it to a local disk file when the buffer is full; then go to step (3-14);
(3-13) The Reduce task process writes its result to HDFS; then go to step (3-14);
(3-14) Task execution ends; return to step (3-1).
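The behaviour attributed to the dynamic data cache component in step (3-12), buffering Reduce output in memory and spilling to local disk when the buffer is full, can be sketched as below. The class name, the record-count limit and the JSON spill format are all assumptions made for illustration; the document does not specify how the component is implemented.

```python
import json
import os
import tempfile


class DynamicDataCache:
    """Keep Reduce output in memory and spill it to a local file when the buffer is full."""

    def __init__(self, max_records, spill_dir):
        self.max_records = max_records   # buffer capacity, counted in records
        self.spill_dir = spill_dir       # local directory that receives spill files
        self.buffer = []
        self.spill_files = []

    def put(self, key, value):
        """Cache one record, spilling the whole buffer once it reaches capacity."""
        self.buffer.append((key, value))
        if len(self.buffer) >= self.max_records:
            self._spill()

    def _spill(self):
        """Write the buffered records to a new local file and clear the buffer."""
        fd, path = tempfile.mkstemp(suffix=".spill", dir=self.spill_dir)
        with os.fdopen(fd, "w") as handle:
            json.dump(self.buffer, handle)
        self.spill_files.append(path)
        self.buffer = []

    def records(self):
        """Yield every cached record: spilled files first, then the memory buffer."""
        for path in self.spill_files:
            with open(path) as handle:
                for key, value in json.load(handle):
                    yield key, value
        yield from self.buffer


with tempfile.TemporaryDirectory() as spill_dir:
    cache = DynamicDataCache(max_records=2, spill_dir=spill_dir)
    for i in range(5):
        cache.put(f"centroid-{i}", i)
    cached = list(cache.records())
```

In this sketch a Map task on the same node would read the next iteration's dynamic data through records() instead of fetching it from HDFS, which mirrors the saving the method aims for.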
Example:
To verify the feasibility and effectiveness of the invention, a computer program was written and executed in the experimental configuration shown in Table 1 below to test the invention. The test results are shown in Tables 2 and 3 below:
Table 1: Experimental configuration
In Tables 2 and 3 the invention is compared with Hadoop-0.20.0 and Haloop, and the experimental algorithm is fuzzy C-Means. Table 2 compares the network transfer volume of dynamic data for the three MapReduce implementations at different experiment scales. Table 3 compares the execution times of the three MapReduce implementations at a given experiment scale for different numbers of iterations. The experimental results show that the invention achieves a fairly satisfactory improvement in both network data transfer and time performance.
Table 2: Comparison of dynamic data transfer volume in fuzzy C-Means
Table 3: Comparison of fuzzy C-Means execution times
Those skilled in the art will readily understand that the foregoing is only a preferred embodiment of the invention and is not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (5)
1. A MapReduce optimization method suitable for iterative computation, applied in a Hadoop cluster system that includes one master node and multiple slave nodes, characterized in that the method comprises the following steps:
(1) the master node receives multiple Hadoop jobs submitted by users, and the job service process of the master node places the jobs in a job queue, where they wait for the job scheduler of the master node to schedule them;
(2) the master node waits for task requests sent by the slave nodes; after a task request is received, the job scheduler of the master node schedules local tasks first; if the slave node that sent the task request has no local task, predictive scheduling is performed according to the task type of the Hadoop job, in which a compute-type task is scheduled directly, whereas a transfer-type task is delayed by a certain interval and the Hadoop job is scheduled only when the total delay reaches the delay threshold, wherein a compute-type task is a task whose computation time accounts for the major part of the whole processing of the Map task, and a transfer-type task is a task whose data transfer time accounts for the major part of the whole processing of the Map task;
Wherein step (2) specifically comprises the following sub-steps:
(2-1) the job service process on the master node listens and waits for the heartbeat message sent by the task service process of a slave node; the heartbeat message contains the current running information of the slave node, specifically its total number of slots and the number of slots currently running;
(2-2) after receiving the heartbeat message from the slave node, the master node calculates from it the number of idle slots of the current slave node and the average number of running slots of the whole Hadoop cluster, and judges from the result whether a task of the job needs to be allocated to the current slave node; if no task needs to be allocated, return to step (2-1), otherwise execute step (2-3);
(2-3) set counter i=0;
(2-4) judge whether the i-th Hadoop job has a local task on the current slave node, i.e., whether the current slave node stores an input split (Split) of the i-th Hadoop job; if not, go to step (2-5); if so, go to step (2-11);
(2-5) set i=i+1 and judge whether i equals the number of Hadoop jobs; if equal, go to step (2-7), otherwise return to step (2-4);
(2-6) set counter j=0;
(2-7) judge whether the task type of the j-th Hadoop job is a compute-type task or a transmission-type task; if it is a compute-type task, go to step (2-11); if it is a transmission-type task, go to step (2-8);
(2-8) postpone the task scheduling of the j-th Hadoop job by one heartbeat interval;
(2-9) judge whether the total delay of the task scheduling of the j-th Hadoop job has reached a threshold; if so, go to step (2-11), otherwise go to step (2-10);
(2-10) set j=j+1 and judge whether j equals the number of Hadoop jobs; if equal, go to step (2-12), otherwise return to step (2-1);
(2-11) schedule the local task of the i-th Hadoop job to the current slave node, and the process ends;
(2-12) schedule the task of the j-th Hadoop job to the current slave node, and the process ends;
(3) after receiving a task of a Hadoop job scheduled by the master node, the slave node judges the type of the Hadoop job and processes it accordingly; jobs are divided into two types, iterative and non-iterative; a non-iterative job is processed in the conventional Hadoop manner; for an iterative job, a Map-side shuffle process is added before the Map phase so that Map tasks can read dynamic data; in the Reduce phase, the dynamic data is cached locally and handed over to the dynamic data caching component of the slave node for management, and the final result is stored in HDFS after the job has been processed.
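As a rough illustration of the scheduling in step (2), the following Python sketch shows one possible reading of sub-steps (2-3) to (2-12): local tasks are preferred, compute-type jobs are scheduled at once, and transmission-type jobs are postponed one heartbeat at a time until a delay threshold is reached. The names `Job`, `pick_job`, `split_hosts`, `delayed`, and the labels "compute" and "transfer" are hypothetical and do not come from the claims, and the loops simplify the claimed step-by-step jumps.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    kind: str                                        # "compute" or "transfer" (hypothetical labels)
    split_hosts: set = field(default_factory=set)    # nodes that store an input split of this job
    delayed: int = 0                                 # heartbeats this job has been postponed so far


def pick_job(jobs, node, delay_threshold):
    """Return the job whose task should go to `node`, or None to keep waiting."""
    # Steps (2-3) to (2-5): prefer any job that has a local split on this node.
    for job in jobs:
        if node in job.split_hosts:
            return job
    # Steps (2-6) to (2-10): no local task, so decide by predicted task type.
    for job in jobs:
        if job.kind == "compute":
            return job                  # compute-type: locality matters little, schedule now
        job.delayed += 1                # transmission-type: postpone by one heartbeat
        if job.delayed >= delay_threshold:
            return job                  # waited long enough, accept a non-local task
    return None


jobs = [Job("wordcount", "transfer", {"n1"}), Job("pi", "compute")]
print(pick_job(jobs, "n1", delay_threshold=3).name)  # local split on n1
print(pick_job(jobs, "n2", delay_threshold=3).name)  # no local split, compute-type job chosen
```

The delay counter lives on the job, so repeated heartbeats from nodes without local data eventually let a transmission-type job run non-locally instead of starving.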
2. The MapReduce optimization method according to claim 1, characterized in that, in step (2-2), the number of idle slots of the current slave node equals its total number of slots minus the number of slots currently running; the average number of running slots of the whole Hadoop cluster is the sum of the running slots of all slave nodes monitored by the tracking process, divided by the number of slots of all slave nodes; if the number of idle slots of the current slave node equals 0, no task needs to be allocated; likewise, if the number of slots currently running on the current slave node exceeds the average number of running slots of the whole Hadoop cluster, no task needs to be allocated.
3. The MapReduce optimization method according to claim 1, characterized in that step (3) specifically comprises the following sub-steps:
(3-1) receive a task of a Hadoop job scheduled by the master node;
(3-2) judge whether the job to which the task belongs is an iterative job or a non-iterative job; if it is an iterative job, go to step (3-3); if it is a non-iterative job, go to step (3-4);
(3-3) judge whether the task of the iterative job is a Map task or a Reduce task; if it is a Map task, go to step (3-5); if it is a Reduce task, go to step (3-9);
(3-4) judge whether the task of the non-iterative job is a Map task or a Reduce task; if it is a Map task, go to step (3-8); if it is a Reduce task, go to step (3-9);
(3-5) judge whether the iterative job is running for the first time; if not, go to step (3-6); if so, go to step (3-7);
(3-6) the task service process of the slave node on which the Map task process (Mapper) resides starts multiple data copy threads, which request via HTTP the slave node on which the Reduce task process resides and fetch the dynamic data file computed by the Reduce task process; then go to step (3-8);
(3-7) the Map task process reads the initial values of the dynamic data; then go to step (3-8);
(3-8) the Hadoop cluster decomposes the input file of the job into individual splits, and the Map task process processes the splits; then go to step (3-14);
(3-9) the Reducer task process on the slave node starts data copy threads, which request via HTTP the slave node on which the Map task process resides and fetch the intermediate output file of the Map task process; the intermediate output file is stored on the local disk of the slave node; copied files are first placed in a memory buffer, the multiple copied files are finally merged into one large file, and this large file is sorted by key; then go to step (3-10);
(3-10) the Reduce task process reads records from the resulting large file in <key, iterator> form and executes the Reduce() method; then go to step (3-11);
(3-11) judge whether the job is iterative or non-iterative; if it is an iterative job, go to step (3-12); if it is a non-iterative job, go to step (3-13);
(3-12) the dynamic data caching component of the slave node caches the result produced by the Reduce task process in memory, spilling it to a local disk file when the buffer is full; then go to step (3-14);
(3-13) the Reduce task process writes its result to HDFS; then go to step (3-14);
(3-14) task execution ends, and the process returns to step (3-1).
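The branching in sub-steps (3-2) to (3-13) can be summarized with a small Python sketch that returns, for a given task, the sequence of actions a slave node would take. The function name `task_path` and the action strings are hypothetical labels chosen for illustration; the sketch models only the control flow, not the actual data movement.

```python
def task_path(iterative, task_kind, first_run=False):
    """Return the ordered actions for one task, following steps (3-2) to (3-13).

    iterative: True for an iterative job, False for a non-iterative job.
    task_kind: "map" or "reduce".
    first_run: whether an iterative job is in its first iteration.
    """
    if task_kind == "map":
        actions = []
        if iterative:
            if first_run:
                actions.append("read initial dynamic data")            # step (3-7)
            else:
                actions.append("fetch dynamic data over HTTP")         # step (3-6)
        actions.append("process input splits")                         # step (3-8)
        return actions
    # Reduce task: steps (3-9) and (3-10) are shared by both job types.
    actions = ["copy and merge map outputs", "run reduce"]
    if iterative:
        actions.append("cache result on slave node")                   # step (3-12)
    else:
        actions.append("write result to HDFS")                         # step (3-13)
    return actions


print(task_path(iterative=True, task_kind="map", first_run=False))
print(task_path(iterative=False, task_kind="reduce"))
```

Only the first and last action differ between job types, which is why non-iterative jobs can keep the conventional Hadoop path unchanged.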
4. The MapReduce optimization method according to claim 3, characterized in that the dynamic data file in step (3-6) is managed by the dynamic data caching component of the slave node and is stored in memory and on local disk; the copied dynamic data is managed by the dynamic data caching component; multiple Map task processes on the same slave node request the dynamic data file locally from that slave node, and this data serves as the dynamic data input required by the Map task processes.
5. The MapReduce optimization method according to claim 3, characterized in that, in step (3-8), the split size defaults to the HDFS block size, and the block size is set through a configuration file; the Map task process parses a split into records of the <key, value> form that it requires and executes the Map() method; the results are cached in memory and spilled to disk when the buffer is full; each spill file records partition information; a single spill file is sorted first by partition and then by key; if multiple spill files need to be merged into one large file, the merge is performed by merge-sorting the multiple spill files.
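The spill ordering and merge described in claim 5 can be illustrated with a short Python sketch. It keeps spills as in-memory lists rather than disk files, and the names `partition_of`, `sort_spill`, and `merge_spills`, as well as the byte-sum partition function, are assumptions made for the example rather than anything specified by the claim or by Hadoop.

```python
import heapq


def partition_of(key, num_partitions):
    """Hypothetical deterministic partitioner: byte sum of the key modulo the partition count."""
    return sum(key.encode("utf-8")) % num_partitions


def sort_spill(records, num_partitions):
    """Tag each (key, value) record with its partition, then sort by partition and key."""
    tagged = [(partition_of(k, num_partitions), k, v) for k, v in records]
    return sorted(tagged, key=lambda r: (r[0], r[1]))


def merge_spills(spills):
    """Merge already sorted spills into one sorted run (a k-way merge sort step)."""
    return list(heapq.merge(*spills, key=lambda r: (r[0], r[1])))


spill_1 = sort_spill([("b", 1), ("a", 2)], num_partitions=2)
spill_2 = sort_spill([("c", 3), ("d", 4)], num_partitions=2)
print(merge_spills([spill_1, spill_2]))
# [(0, 'b', 1), (0, 'd', 4), (1, 'a', 2), (1, 'c', 3)]
```

Because every spill is already ordered by partition and key, the merge only needs to compare the heads of the spills, so each reducer's partition ends up contiguous and sorted in the merged output.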
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310600745.7A CN103617087B (en) | 2013-11-25 | 2013-11-25 | MapReduce optimizing method suitable for iterative computations |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310600745.7A CN103617087B (en) | 2013-11-25 | 2013-11-25 | MapReduce optimizing method suitable for iterative computations |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103617087A (en) | 2014-03-05 |
CN103617087B (en) | 2017-04-26 |
Family
ID=50167790
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310600745.7A Active CN103617087B (en) | 2013-11-25 | 2013-11-25 | MapReduce optimizing method suitable for iterative computations |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103617087B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105204920B (en) * | 2014-06-18 | 2019-07-23 | 阿里巴巴集团控股有限公司 | A kind of implementation method and device of the distributed computing operation based on mapping polymerization |
CN104270412A (en) * | 2014-06-24 | 2015-01-07 | 南京邮电大学 | Three-level caching method based on Hadoop distributed file system |
CN104158860B (en) * | 2014-07-31 | 2017-09-29 | 国家超级计算深圳中心(深圳云计算中心) | A kind of job scheduling method and job scheduling system |
CN104503820B (en) * | 2014-12-10 | 2018-07-24 | 华南师范大学 | A kind of Hadoop optimization methods based on asynchronous starting |
US10033570B2 (en) * | 2015-01-15 | 2018-07-24 | International Business Machines Corporation | Distributed map reduce network |
CN106528288A (en) * | 2015-09-10 | 2017-03-22 | 中兴通讯股份有限公司 | Resource management method, device and system |
CN106547609B (en) * | 2015-09-18 | 2020-09-18 | 阿里巴巴集团控股有限公司 | Event processing method and device |
CN105117286B (en) * | 2015-09-22 | 2018-06-12 | 北京大学 | The dispatching method of task and streamlined perform method in MapReduce |
CN106354563B (en) * | 2016-08-29 | 2020-05-22 | 广州市香港科大***研究院 | Distributed computing system for 3D reconstruction and 3D reconstruction method |
CN106506255B (en) * | 2016-09-21 | 2019-11-05 | 微梦创科网络科技(中国)有限公司 | A kind of method, apparatus and system of pressure test |
CN108153583B (en) * | 2016-12-06 | 2022-05-13 | 阿里巴巴集团控股有限公司 | Task allocation method and device and real-time computing framework system |
CN108270634B (en) * | 2016-12-30 | 2021-08-24 | 中移(苏州)软件技术有限公司 | Heartbeat detection method and system |
CN106897133B (en) * | 2017-02-27 | 2020-09-29 | 苏州浪潮智能科技有限公司 | Implementation method for managing cluster load based on PBS job scheduling |
CN107122238B (en) * | 2017-04-25 | 2018-05-25 | 郑州轻工业学院 | Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame |
CN107316124B (en) * | 2017-05-10 | 2018-08-31 | 中国航天***科学与工程研究院 | Extensive affairs type job scheduling and processing general-purpose system under big data environment |
CN107391250B (en) * | 2017-08-11 | 2021-02-05 | 成都优易数据有限公司 | Controller scheduling method for improving performance of Mapreduce task Shuffle |
CN107562926B (en) * | 2017-09-14 | 2023-09-26 | 丙申南京网络技术有限公司 | Multi-hadoop distributed file system for big data analysis |
CN107807983B (en) * | 2017-10-30 | 2021-08-24 | 辽宁大学 | Design method of parallel processing framework supporting large-scale dynamic graph data query |
CN108376104B (en) * | 2018-02-12 | 2020-10-27 | 上海帝联网络科技有限公司 | Node scheduling method and device and computer readable storage medium |
CN108563497B (en) * | 2018-04-11 | 2022-03-29 | 中译语通科技股份有限公司 | Efficient multi-dimensional algorithm scheduling method and task server |
CN109117285B (en) * | 2018-07-27 | 2021-12-28 | 高新兴科技集团股份有限公司 | Distributed memory computing cluster system supporting high concurrency |
CN110297714B (en) * | 2019-06-19 | 2023-05-30 | 上海冰鉴信息科技有限公司 | Method and device for acquiring PageRank based on large-scale graph dataset |
CN112148202B (en) * | 2019-06-26 | 2023-05-26 | 杭州海康威视数字技术股份有限公司 | Training sample reading method and device |
CN110908796B (en) * | 2019-11-04 | 2022-03-18 | 北京理工大学 | Multi-operation merging and optimizing system and method in Gaia system |
CN111813527B (en) * | 2020-07-15 | 2022-06-14 | 江苏方天电力技术有限公司 | Data-aware task scheduling method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737114A (en) * | 2012-05-18 | 2012-10-17 | 北京大学 | MapReduce-based big picture distance connection query method |
CN103279328A (en) * | 2013-04-08 | 2013-09-04 | 河海大学 | BlogRank algorithm parallelization processing construction method based on Haloop |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120304186A1 (en) * | 2011-05-26 | 2012-11-29 | International Business Machines Corporation | Scheduling Mapreduce Jobs in the Presence of Priority Classes |
2013-11-25: application CN201310600745.7A filed in CN (patent CN103617087B, status: Active)
Non-Patent Citations (2)
Title |
---|
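HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment; Sangwon Seo et al.; IEEE International Conference on Cluster Computing and Workshops, 2009; 2009-12-31; pp. 1 and 4 * |
Research on Iterative Distributed Data Processing Based on MapReduce; Feng Xinjian (冯新建); China Master's Theses Full-text Database, Information Science and Technology; 2013-10-15 (No. 10); pp. I137-20 * |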
Also Published As
Publication number | Publication date |
---|---|
CN103617087A (en) | 2014-03-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103617087B (en) | MapReduce optimizing method suitable for iterative computations | |
Kalia et al. | Analysis of hadoop MapReduce scheduling in heterogeneous environment | |
CN105117286B (en) | The dispatching method of task and streamlined perform method in MapReduce | |
CN101359333B (en) | Parallel data processing method based on latent dirichlet allocation model | |
CN108469988A (en) | A kind of method for scheduling task based on isomery Hadoop clusters | |
CN106547627A (en) | The method and system that a kind of Spark MLlib data processings accelerate | |
Arfat et al. | Big data for smart infrastructure design: Opportunities and challenges | |
Huynh et al. | An efficient approach for mining sequential patterns using multiple threads on very large databases | |
Laccetti et al. | Performance enhancement of a dynamic K-means algorithm through a parallel adaptive strategy on multicore CPUs | |
Gandomi et al. | HybSMRP: a hybrid scheduling algorithm in Hadoop MapReduce framework | |
Kim et al. | Load-balancing in distributed selective search | |
CN108170861B (en) | Distributed database system collaborative optimization method based on dynamic programming | |
Escobar et al. | Parallel high-dimensional multi-objective feature selection for EEG classification with dynamic workload balancing on CPU–GPU architectures | |
Shi et al. | MapReduce short jobs optimization based on resource reuse | |
CN113255165A (en) | Experimental scheme parallel deduction system based on dynamic task allocation | |
CN104778088B (en) | A kind of Parallel I/O optimization methods and system based on reduction interprocess communication expense | |
Shanker et al. | ACTIVE-a real time commit protocol | |
CN108509259A (en) | Obtain the method and air control system in multiparty data source | |
US8027996B2 (en) | Commitment control for less than an entire record in an in-memory database in a parallel computer system | |
CN115688906A (en) | Automatic data arranging and loading method and system for hyperparametric training | |
Lee et al. | ARLS: A MapReduce-based output analysis tool for large-scale simulations | |
Xu et al. | EdgeMesh: A hybrid distributed training mechanism for heterogeneous edge devices | |
Zhao et al. | A holistic cross-layer optimization approach for mitigating stragglers in in-memory data processing | |
CN106033434A (en) | Virtual asset data replica processing method based on data size and popularity | |
Enokido et al. | Energy-Saving Multi-version Timestamp Ordering Algorithm for Virtual Machine Environments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||