CN101309208B - Job scheduling system suitable for grid environment and based on reliable expense - Google Patents


Info

Publication number
CN101309208B
Authority
CN
China
Prior art keywords
module
resource
scheduling
job
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008100481627A
Other languages
Chinese (zh)
Other versions
CN101309208A (en)
Inventor
金海
陶永才
吴松
邹德清
石宣化
曹海军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN2008100481627A
Publication of CN101309208A
Application granted
Publication of CN101309208B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a job scheduling system that is suitable for grid environments and is based on reliability cost. As shown in Fig. 1, the system comprises a job submission interface module 1 as the first layer and a job scheduling module 2 as the second layer, with the grid resource platform 7 as the underlying layer. In terms of operating principle, the core of the invention is the job scheduling module in the second layer, which includes a pre-scheduling module 3, a scheduling decision module 4, a job completion time prediction module 5 and a resource information module 6. The system proposes a job running time prediction model and a resource availability prediction model; the former is based on a mathematical model and the latter on a Markov model, and both offer high accuracy and generality. According to the quality-of-service requirements of a job and the characteristics of the resources, the system applies a replication fault-tolerance strategy, a primary-replica asynchronous-run fault-tolerance strategy, or a retry fault-tolerance strategy, which makes it flexible and effective. The system supports both computation-intensive and data-intensive jobs, so it is broadly applicable. Compared with prior-art scheduling systems, it supports more concurrent users, improves resource utilization, and offers good generality, good extensibility and high system throughput.

Description

A job scheduling system based on reliability cost that is suitable for grid environments
Technical field
The invention belongs to the field of grid computing, and specifically relates to a job scheduling system based on reliability cost that is suitable for grid environments.
Background art
A grid integrates dynamic, autonomous and heterogeneous network resources distributed across the Internet (including high-speed networks, computers, large databases, sensors, remote devices, etc.). It hides the dynamic, heterogeneous and distributed nature of these resources and provides users with an efficient environment for resource sharing and collaborative work. Since it was first proposed, the grid has attracted great attention from academia and industry and has developed rapidly. Compared with traditional distributed high-performance computing, the grid has the following advantages: (1) it makes effective use of resources distributed over a wide area; (2) it enables efficient collaboration between heterogeneous organizations; (3) it handles computation-intensive and data-intensive tasks effectively; (4) the OGSA-based approach moves workflows toward "service flows". The grid offers an effective way to solve complex, large-scale scientific research tasks, for example aircraft design, gene sequencing and atmospheric environment analysis.
Grid resources mostly belong to different organizations, and most of them are non-dedicated resources that can join and leave dynamically. In addition, changes in resource-sharing policies, hardware and software failures, and network outages can make grid resources unavailable. This dynamic behavior causes grid jobs to fail frequently, so that the quality of service expected by users cannot be guaranteed. Job scheduling in a grid environment therefore faces many new challenges, and reliable job scheduling becomes key to the performance of a grid system if the grid's abundant resources and extensibility are to be fully exploited.
Grid resources are presented to users as services, and users consume them by submitting jobs to the grid system. After receiving a user's job, the grid job scheduling system interacts with the information center to find the set of grid service resources that satisfy the user's QoS requirements. The job scheduling system then selects the best resource for the job according to a specific scheduling strategy. Existing job scheduling strategies are mostly based on a performance-driven model, an economy-driven model or a trust-driven model [see K. Krauter, R. Buyya, and M. Maheswaran, A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing, Software Practice and Experience, 32(2): 135-164, February 2002]. The performance-driven model focuses on performance metrics such as system throughput and job execution efficiency. The economy-driven model focuses on selecting the cheapest resource service that still satisfies the user's QoS requirements. The trust-driven model builds a trust model for each resource from its service history (for example, resource failure rate and job success rate) and schedules jobs in a trustworthy way based on that model. When a job is interrupted during execution because of a resource failure or for other reasons, the system applies a fault-tolerance strategy. Commonly used strategies are checkpointing, multi-site replication and retry. Checkpointing periodically saves the job's intermediate results and state; when a resource fails, the system rolls the job back to the checkpoint recorded before the failure, restores its state and resumes execution from that checkpoint instead of starting over, which saves resources and reduces the job loss rate. Multi-site replication schedules a job onto two or more different resource nodes at the same time; as long as one resource node runs normally, the job completes successfully. Retry reschedules the job when it fails; the job may be dispatched to the same resource node or to another one.
Traditional job scheduling systems do not fully account for the dynamic nature of resources in a grid environment, so jobs fail frequently. In addition, traditional scheduling systems mostly use a single fault-tolerance mechanism, which lacks flexibility and wastes system resources.
Summary of the invention
In view of the deficiencies of existing job scheduling systems, the object of the invention is to provide a job scheduling system based on reliability cost that is suitable for grid environments. The system takes full account of the QoS requirements of jobs and the reliability of resources, automatically applies a suitable fault-tolerance strategy to each job, and is efficient and general-purpose.
To achieve the above object, the job scheduling system based on reliability cost that is suitable for grid environments is characterized in that it comprises a job submission interface module and a job scheduling module;
The job submission interface module is used by users to submit jobs, and passes them to the job scheduling module;
The job scheduling module receives the jobs submitted through the job submission interface module and, after scheduling and customizing a fault-tolerance strategy, dispatches each job to the corresponding resource node in the grid resource platform; it comprises a pre-scheduling module, a scheduling decision module, a job completion time prediction module and a resource information module;
The pre-scheduling module analyzes the quality-of-service requirements of jobs, and classifies and prioritizes user jobs; it receives the jobs sent by the job submission interface module, interacts with the job completion time prediction module, and classifies and prioritizes the jobs according to that module's predictions; the pre-scheduling module also serves as the job pool of the scheduling decision module, supplying it with jobs;
The job completion time prediction module predicts the completion time of each job on each resource node; it accepts completion time prediction requests from the pre-scheduling module and the scheduling decision module and returns the prediction results to each of them; it interacts with the resource information module, through which it queries the performance information of each resource;
The resource information module is responsible for collecting real-time status information about the grid resource nodes; it accepts resource query requests from the scheduling decision module and the job completion time prediction module and returns the corresponding query results to them; it interacts with each resource in the underlying grid through periodic polling and a publish/subscribe mechanism, so that the performance information of each resource is updated in a timely manner;
The scheduling decision module schedules user jobs based on reliability cost according to the future availability of the resource nodes, and formulates a fault-tolerance strategy for each scheduled job according to the job's quality-of-service requirements and the future availability of the resource node to which it is scheduled; it fetches a job to be scheduled from the pre-scheduling module and asks the job completion time prediction module to predict the job's running time on each resource; then, according to the job's requirements and its predicted running time on each resource, it interacts with the resource information module to match the job with the best resource; finally, it dispatches the job to the corresponding resource node in the grid resource platform.
The system proposes a job running time prediction model and a resource availability prediction model; the former is based on a mathematical model and the latter on a Markov model, and both have high accuracy and generality. Depending on the characteristics of the job and the resources, the system applies a primary-replica simultaneous-run strategy, a primary-replica asynchronous-run strategy or a retry strategy, which makes it highly flexible and effective. The system also supports both computation-intensive and data-intensive jobs, so it is broadly applicable. Compared with existing scheduling systems, the invention supports more concurrent users, improves resource utilization, and offers good generality, good extensibility and high system throughput. Specifically, the invention has the following characteristics.
(1) High accuracy of the prediction models
The job running time prediction model is based on a mathematical model and the resource availability prediction model is based on a Markov model; both have high accuracy and generality.
(2) Improved reliability of job execution
The system proposes a reliability cost model for successful job execution that takes full account of the job's QoS requirements and the future availability of resources. Jobs are scheduled reliably on the basis of this model.
(3) Improved system throughput
The system selects suitable resources according to the job's QoS requirements, for example choosing low-reliability resources for low-QoS jobs so that high-reliability resources are reserved for high-QoS jobs. This satisfies job requirements while conserving resources, balances the system load, and thereby improves system throughput.
(4) Support for multiple types of grid jobs
The system uses a general-purpose scheduling design that supports both computation-intensive and data-intensive jobs, so it is broadly applicable.
Description of the drawings
Fig. 1 is a schematic structural diagram of the job scheduling system based on reliability cost that is suitable for grid environments;
Fig. 2 is a schematic structural diagram of the pre-scheduling module;
Fig. 3 is a schematic structural diagram of the scheduling decision module;
Fig. 4 is a schematic structural diagram of the resource information module;
Fig. 5 is a schematic flow chart of job pre-scheduling;
Fig. 6 is a schematic flow chart of the job scheduling decision process.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the job scheduling system based on reliability cost that is suitable for grid environments (hereinafter DGSS, Dependable Grid Scheduling System) is divided into two layers: the first layer is the job submission interface module 1 and the second layer is the job scheduling module 2. Beneath the system lies the grid resource platform 7. In terms of operating principle, the core of the invention is the job scheduling module of the second layer, which comprises the pre-scheduling module 3, the scheduling decision module 4, the job completion time prediction module 5 and the resource information module 6.
The job submission interface module 1 is used by users to submit jobs and passes them to the pre-scheduling module 3 in the job scheduling module 2. It supports several ways of submitting jobs, including Web-based, command-line and JSDL-based submission.
The job scheduling module 2 is responsible for reliable scheduling of jobs and for customizing an effective fault-tolerance strategy. It receives the jobs submitted through the job submission interface module 1 and, after scheduling and fault-tolerance strategy customization, dispatches each job to the corresponding resource node in the grid resource platform 7.
The pre-scheduling module 3 analyzes the QoS requirements of jobs, and classifies and prioritizes user jobs. It receives the jobs sent by the job submission interface module 1, interacts with the job completion time prediction module 5, and classifies and prioritizes the jobs according to that module's predictions. The pre-scheduling module 3 also serves as the job pool of the scheduling decision module 4, supplying it with jobs.
The job completion time prediction module 5 predicts the completion time of each job on each resource node. Its theoretical model for the completion time of job i on resource node j is:

$$CT_{ij} = \beta \left( \frac{Q_j}{PC_j} + \frac{D_{in}}{BW_{kj}} + \frac{Q}{PC_j} + \frac{D_{out}}{BW_{kj}} \right)$$

where $CT_{ij}$ is the expected completion time of job i on resource node j; $\beta$ is a correction parameter whose value is determined from the accuracy of past completion time predictions; $Q_j$ is the workload waiting to be processed on resource node j; $Q$ is the computation amount of job i; $PC_j$ is the computing capacity of resource node j; $D_{in}$ is the amount of data needed to run job i; $D_{out}$ is the amount of data that job i must output after it finishes; and $BW_{kj}$ is the data bandwidth between the resource node k that stores the data and resource node j. The job completion time prediction module 5 accepts completion time prediction requests from the pre-scheduling module 3 and the scheduling decision module 4 and returns the prediction results to each of them. It interacts with the resource information module 6, through which it queries the performance information of each resource.
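The completion time model above can be illustrated with a minimal Python sketch. The names `Node` and `predict_completion_time`, the field names, and the default β = 1.0 are assumptions made for illustration only; the patent gives the formula but no implementation or units.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of resource node j as seen by the predictor."""
    queued_work: float       # Q_j: workload already waiting on the node
    compute_capacity: float  # PC_j: work units processed per unit time
    bandwidth: float         # BW_kj: data units per unit time from storage node k


def predict_completion_time(node, job_work, data_in, data_out, beta=1.0):
    """Return CT_ij = beta * (Q_j/PC_j + D_in/BW_kj + Q/PC_j + D_out/BW_kj)."""
    wait_time = node.queued_work / node.compute_capacity
    stage_in_time = data_in / node.bandwidth
    run_time = job_work / node.compute_capacity
    stage_out_time = data_out / node.bandwidth
    return beta * (wait_time + stage_in_time + run_time + stage_out_time)


# Example with made-up numbers: 10 + 4 + 5 + 2 time units.
node = Node(queued_work=100.0, compute_capacity=10.0, bandwidth=5.0)
print(predict_completion_time(node, job_work=50.0, data_in=20.0, data_out=10.0))
```

In this sketch β is simply passed in; how the system derives it from the accuracy of past predictions is not specified in the text.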
The resource information module 6 is responsible for collecting real-time status information about the grid resource nodes, including each resource's CPU, memory usage, disk space and future availability. It accepts resource query requests from the scheduling decision module 4 and the job completion time prediction module 5 and returns the corresponding query results to them. It interacts with each resource in the underlying grid resource platform 7 through periodic polling and a publish/subscribe mechanism, so that the performance information of each resource is updated in a timely manner.
The scheduling decision module 4 schedules user jobs based on reliability cost according to the future availability of the resource nodes, and formulates a flexible fault-tolerance strategy for each scheduled job according to the job's QoS requirements and the future availability of the resource node to which it is scheduled. It fetches a job to be scheduled from the pre-scheduling module 3 and asks the job completion time prediction module 5 to predict the job's running time on each resource; then, according to the job's requirements and its predicted running time on each resource, it interacts with the resource information module 6 to match the job with the best resource; finally, it dispatches the job to the corresponding resource node in the grid resource platform 7.
The bottom layer is the grid resource platform 7, which aggregates a large number of resource nodes. Grid services are deployed on the resource nodes, and user jobs are ultimately dispatched to resource nodes for execution.
The DGSS system selects suitable resources according to the job's QoS requirements, for example choosing low-reliability resources for low-QoS jobs so that high-reliability resources are reserved for high-QoS jobs. This satisfies job requirements while conserving resources, balances the system load, and improves system throughput.
A preferred implementation of each module in the job scheduling module 2 is described below by way of example; persons of ordinary skill in the art can implement the modules in various other ways on the basis of this disclosure, and the scope of protection of the invention is not limited to the following examples.
As shown in Fig. 2, the pre-scheduling module 3 comprises a job QoS parsing module 31, a job queuing module 32, a no-QoS queue 33, a low-QoS queue 34 and a high-QoS queue 35.
The job QoS parsing module 31 receives the jobs that users submit through the job submission interface module 1, parses each job and extracts its QoS requirements, for example CPU, memory usage, disk space and the completion deadline. It then passes the job and the parsed information to the job queuing module 32.
The job queuing module 32 receives the jobs and parsed information from the job QoS parsing module 31, interacts with the job completion time prediction module 5, and determines the QoS level of each job from the job's QoS requirements and the characteristics of the resource nodes. A job with no completion deadline is treated as a no-QoS job. A job whose completion deadline is more than (1+λ) times its minimum running time on the resource nodes (0<λ≤1) is treated as a low-QoS job; otherwise it is a high-QoS job. Finally, the job queuing module 32 submits each job to the high-QoS queue 35, the low-QoS queue 34 or the no-QoS queue 33 according to its class.
The no-QoS queue 33 queues no-QoS jobs. It receives the jobs delivered by the job queuing module 32, and the job extraction module 41 in the scheduling decision module 4 periodically extracts jobs from it.
The low-QoS queue 34 queues low-QoS jobs. It receives the jobs delivered by the job queuing module 32, and the job extraction module 41 in the scheduling decision module 4 periodically extracts jobs from it.
The high-QoS queue 35 queues high-QoS jobs. It receives the jobs delivered by the job queuing module 32, and the job extraction module 41 in the scheduling decision module 4 periodically extracts jobs from it. The queuing policy of the no-QoS queue 33, the low-QoS queue 34 and the high-QoS queue 35 is first come, first served; jobs in the no-QoS queue 33 have the lowest priority and jobs in the high-QoS queue 35 have the highest.
As shown in Fig. 3, the scheduling decision module 4 comprises a job extraction module 41, a fault-tolerance strategy module 42, a scheduling strategy module 43 and a resource-job matching module 44.
The job extraction module 41 is responsible for extracting jobs from the no-QoS queue 33, the low-QoS queue 34 and the high-QoS queue 35 of the pre-scheduling module 3, in priority order from high QoS to low QoS to no QoS, and passes them to the resource-job matching module 44.
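The three first-come, first-served queues and the priority-ordered extraction can be sketched as follows. This is an illustrative sketch only: the class name `PreSchedulingQueues`, its methods and the level labels are assumptions, not names from the patent.

```python
from collections import deque


class PreSchedulingQueues:
    """Hypothetical job pool: three FIFO queues drained in priority order."""

    ORDER = ("high", "low", "none")  # high-QoS first, no-QoS last

    def __init__(self):
        self.queues = {level: deque() for level in self.ORDER}

    def submit(self, job, level):
        """Append a job to the queue for its QoS level (first come, first served)."""
        self.queues[level].append(job)

    def extract(self):
        """Return the oldest job from the highest-priority non-empty queue, or None."""
        for level in self.ORDER:
            if self.queues[level]:
                return self.queues[level].popleft()
        return None


pool = PreSchedulingQueues()
pool.submit("job-a", "none")
pool.submit("job-b", "low")
pool.submit("job-c", "high")
print(pool.extract())  # job-c, because the high-QoS queue is drained first
```

A strict priority order like this can starve no-QoS jobs under sustained load; the text does not say whether the system guards against that, so the sketch does not either.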
The fault-tolerance strategy module 42 is responsible for providing flexible fault-tolerance strategies for job scheduling. For no-QoS jobs it provides the retry strategy: after a job fails, it is rescheduled onto another resource node. For low-QoS jobs it provides the primary-replica strategy, in which the replica has an active running mode and a passive running mode: when the latest start time of the replica overlaps with the expected completion time of the primary, the replica runs in active mode; otherwise it runs in passive mode. For high-QoS jobs it provides the replication strategy: the job is dispatched to two or more resource nodes at the same time to ensure that it completes successfully within the deadline specified by the user. The fault-tolerance strategy module 42 accepts fault-tolerance decision requests from the resource-job matching module 44 and returns the corresponding strategy to it.
The scheduling strategy module 43 is responsible for providing a reliable scheduling strategy for job scheduling. The invention provides a scheduling strategy based on the reliability cost of running a job, where the reliability cost is the product of the job's expected running time on a resource and the probability that the resource runs normally during that expected running time. The scheduling strategy module 43 accepts scheduling decision requests from the resource-job matching module 44 and returns the corresponding scheduling strategy to it.
The resource-job matching module 44 receives the jobs delivered by the job extraction module 41 and interacts with the job completion time prediction module 5 to predict each job's completion time; then, according to the job's characteristics, it interacts with the resource information module 6 and the scheduling strategy module 43 to formulate the job's scheduling strategy and select the optimal resource; it also interacts with the fault-tolerance strategy module 42 to formulate the job's fault-tolerance strategy; finally, it dispatches the job to the corresponding resource node in the grid resource platform 7.
The job completion time prediction module 5 is responsible for predicting the expected completion time of a job on each grid resource node. It receives job information from the resource-job matching module 44 in the scheduling decision module 4, obtains dynamic performance information about the grid resource nodes from the resource information module 6, predicts the job completion time, and returns the prediction result to the resource-job matching module 44.
As shown in Fig. 4, the resource information module 6 comprises a resource information detection module 61, a resource availability prediction module 62 and a database 63.
The resource information detection module 61 is responsible for collecting grid resource information and monitoring changes in grid resources. It monitors each grid resource node in the grid resource platform 7 and periodically stores the static and dynamic resource information it collects in the database 63. If a resource joins or leaves (or fails and becomes inaccessible), the resource information detection module 61 promptly notifies the resource availability prediction module 62.
The resource availability prediction module 62 is responsible for predicting the future availability of resources. Based on a Markov prediction model, it builds an availability state space and a state transition matrix for each resource. It receives signals from the resource information detection module 61, interacts with the database 63 to update the state space and state transition matrix of each resource node in a timely manner, and stores the updated results in the database 63.
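A Markov availability predictor of the kind described above might look like the following sketch. The two-state space ("up", "down"), the function names and the counting-based estimate of the transition matrix are assumptions for illustration; the patent does not specify the state space or how the matrix is estimated.

```python
from collections import Counter

STATES = ("up", "down")  # assumed two-state availability space


def estimate_transition_matrix(history):
    """Estimate P[s][t] by counting observed s -> t transitions in a state history."""
    counts = Counter(zip(history, history[1:]))
    matrix = {}
    for s in STATES:
        total = sum(counts[(s, t)] for t in STATES)
        if total:
            matrix[s] = {t: counts[(s, t)] / total for t in STATES}
        else:
            # No observations for s: assume the node stays in s (sketch-only choice).
            matrix[s] = {t: 1.0 if t == s else 0.0 for t in STATES}
    return matrix


def predict_distribution(matrix, current, steps):
    """Return the state distribution after `steps` transitions from `current`."""
    dist = {s: 1.0 if s == current else 0.0 for s in STATES}
    for _ in range(steps):
        dist = {t: sum(dist[s] * matrix[s][t] for s in STATES) for t in STATES}
    return dist


history = ["up", "up", "up", "down", "up"]
matrix = estimate_transition_matrix(history)
print(predict_distribution(matrix, "up", steps=2))
```

Re-running `estimate_transition_matrix` whenever the detection module reports a join, leave or failure corresponds to the timely update of the transition matrix mentioned in the text.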
The database 63 stores the performance information of the grid resource nodes and the availability state space and state transition matrix of each resource node. It receives information from the resource information detection module 61 and the resource availability prediction module 62 and updates itself accordingly; it also accepts information queries from the job completion time prediction module 5 and from the resource-job matching module 44 in the scheduling decision module 4.
The workflows of the invention are described below:
(1) Job pre-scheduling flow (as shown in Fig. 5)
(1.1) The system receives the job that a user submits through the job submission interface module 1;
(1.2) The job QoS parsing module 31 parses the job and extracts its QoS requirement information;
(1.3) The job queuing module 32 determines the job's QoS level; if the job has no deadline, it is added to the no-QoS queue 33;
(1.4) The job queuing module 32 obtains the predicted expected completion time of the job on each grid resource node;
(1.5) The average running time T of the job on the resource nodes is computed;
(1.6) If Tn + (1+λ)T ≥ Td, the job is added to the high-QoS queue (Tn is the current system time, Td is the job's completion deadline, and 0<λ≤1);
(1.7) Otherwise, the job is added to the low-QoS queue.
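Steps (1.3) to (1.7) amount to a small classification rule, sketched below in Python. The function name `classify_job`, the string labels and the default λ = 0.5 are assumptions for illustration; the text only constrains λ to 0<λ≤1.

```python
def classify_job(now, deadline, avg_runtime, lam=0.5):
    """Classify a job as 'none', 'high' or 'low' QoS following steps (1.3)-(1.7).

    now         -- Tn, the current system time
    deadline    -- Td, the completion deadline, or None if the job has none
    avg_runtime -- T, the job's average predicted running time on the resource nodes
    lam         -- lambda, slack factor with 0 < lambda <= 1 (default is an assumption)
    """
    if not 0 < lam <= 1:
        raise ValueError("lambda must satisfy 0 < lambda <= 1")
    if deadline is None:
        return "none"  # step (1.3): no deadline, so no-QoS queue
    if now + (1 + lam) * avg_runtime >= deadline:
        return "high"  # step (1.6): little slack before the deadline
    return "low"       # step (1.7): enough slack remains


print(classify_job(now=0, deadline=12, avg_runtime=10))   # tight deadline
print(classify_job(now=0, deadline=100, avg_runtime=10))  # loose deadline
```

A larger λ moves more jobs into the high-QoS queue, since a job then needs more slack before it counts as low QoS.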
(2) Job scheduling decision flow (as shown in Fig. 6)
(2.1) The job extraction module 41 extracts jobs from the high-QoS, low-QoS and no-QoS queues 35, 34 and 33 respectively;
(2.2) The resource-job matching module 44 computes the reliability cost of the job on each resource;
(2.3) The resource-job matching module 44 interacts with the scheduling strategy module 43 and schedules the job to the resource node with the lowest reliability cost;
(2.4) The resource-job matching module 44 interacts with the fault-tolerance strategy module 42; if the job is a high-QoS job, the primary-replica simultaneous-run strategy is applied to it;
(2.5) If the job is a low-QoS job, the primary-replica asynchronous-run strategy is applied to it;
(2.6) If the job is a no-QoS job, the retry fault-tolerance strategy is applied to it.
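Steps (2.2) to (2.6) can be sketched as a single decision function. This is a minimal illustration: `schedule`, the strategy labels and the node names are hypothetical, and the per-node reliability costs are taken as already computed inputs rather than derived here.

```python
# Mapping from QoS level to fault-tolerance strategy, following steps (2.4)-(2.6).
FAULT_TOLERANCE = {
    "high": "primary-replica simultaneous run",
    "low": "primary-replica asynchronous run",
    "none": "retry",
}


def schedule(job_qos, reliability_costs):
    """Pick the node with the lowest reliability cost and the job's strategy.

    job_qos           -- 'high', 'low' or 'none'
    reliability_costs -- dict mapping node name to the job's reliability cost there
    """
    if not reliability_costs:
        raise ValueError("no candidate resource nodes")
    node = min(reliability_costs, key=reliability_costs.get)  # step (2.3)
    return node, FAULT_TOLERANCE[job_qos]


costs = {"node-1": 3.0, "node-2": 1.5, "node-3": 2.0}  # made-up values
print(schedule("low", costs))
```

For the two primary-replica strategies a second node would also have to be chosen for the replica; the text does not describe how, so the sketch returns only the primary node.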
The configuration used when implementing the system is described below by way of example.
To verify the feasibility and effectiveness of the system, it was deployed in a real environment. A job scheduling system based on reliability cost and suitable for grid environments was built on a cluster of 40 nodes. The basic configuration is shown in Table 1:
Table 1 System configuration
CPU | Memory | Hard disk | Network card | Operating system | Network
2 CPUs × 4 cores | 8 GB | 74 GB | 3C905B | Linux AS 4 | 100M switch
One machine hosts the job submission interface module 1 and serves as the front end, through which users submit jobs. Two machines host the job scheduling module 2: one hosts the pre-scheduling module 3 and the scheduling decision module 4, and the other hosts the job completion time prediction module 5 and the resource information module 6. The remaining 37 machines simulate the grid resource platform 7 and represent grid resource nodes, each with a different reliability.
Experimental testing shows that the system achieves the following:
(1) The resource information module 6 detects performance changes in the underlying grid resource nodes in a timely manner and effectively predicts the future availability of the grid resource nodes.
(2) The job completion time prediction module 5, built on a mathematical model and combined with the performance information of grid resources, accurately predicts the expected completion time of jobs.
(3) Based on the job's QoS requirements, the predicted job completion time and the predicted future availability of the grid resource nodes, jobs are scheduled according to reliability cost and an optimized fault-tolerance strategy is selected for each job.
(4) Test data show that the system improves the job success rate, system throughput and load balancing.

Claims (4)

1. A job scheduling system based on reliable expense and suitable for a grid environment, characterized in that it comprises a job submission interface module (1) and a job scheduling module (2);
The job submission interface module (1) is used by users to submit jobs, and passes the jobs to the job scheduling module (2);
The job scheduling module (2) receives the jobs submitted through the job submission interface module (1) and, after scheduling and fault-tolerant strategy customization, dispatches each job to the corresponding resource node in the grid resource platform (7); it comprises a pre-scheduling module (3), a scheduling decision module (4), a job completion time prediction module (5) and a resource information module (6);
The pre-scheduling module (3) classifies and prioritizes user jobs by analyzing their quality-of-service (QoS) requirements; the pre-scheduling module (3) receives the jobs sent by the job submission interface module (1), interacts with the job completion time prediction module (5), and classifies and prioritizes the jobs according to the prediction information from the job completion time prediction module (5); the pre-scheduling module (3) also serves as the job pool of the scheduling decision module (4), supplying jobs to the scheduling decision module (4);
The job completion time prediction module (5) predicts the completion time of each job on each resource node; the job completion time prediction module (5) accepts job completion time prediction requests from the pre-scheduling module (3) and the scheduling decision module (4) and, after predicting, returns the results to the pre-scheduling module (3) and the scheduling decision module (4) respectively; the job completion time prediction module (5) interacts with the resource information module (6), through which it queries the performance information of each resource;
The resource information module (6) is responsible for collecting the real-time status information of the grid resource nodes; the resource information module (6) accepts resource query requests from the scheduling decision module (4) and the job completion time prediction module (5), and returns the corresponding query results to the scheduling decision module (4) and the job completion time prediction module (5); the resource information module (6) interacts with each resource in the underlying grid resource platform (7) through periodic polling and a publish/subscribe mechanism, and updates the performance information of each resource in a timely manner;
The scheduling decision module (4) schedules user jobs based on reliable expense according to the future availability of the resource nodes, and also formulates a fault-tolerant strategy for each scheduled job according to the QoS requirements of the job and the future availability of the resource node to which it is scheduled; the scheduling decision module (4) fetches jobs to be scheduled from the pre-scheduling module (3), and requests the job completion time prediction module (5) to predict the running time of each job on each resource; then, according to the running requirements of the job and its running time on each resource, it interacts with the resource information module (6) to match the job with the best resource; finally, the scheduling decision module (4) schedules the job to the corresponding resource node on the grid resource platform (7).
2. The job scheduling system according to claim 1, characterized in that the pre-scheduling module (3) comprises a job QoS parsing module (31), a job queuing module (32), a no-QoS queue (33), a low-QoS queue (34) and a high-QoS queue (35);
The job QoS parsing module (31) receives the jobs that users submit through the job submission interface module (1), parses each job, extracts the QoS requirements of the job, and after processing delivers the job and the parsed information to the job queuing module (32);
The job queuing module (32) receives the jobs and parsed information delivered by the job QoS parsing module (31), interacts with the job completion time prediction module (5), and determines the QoS level of each job according to the QoS requirements of the job and the characteristics of the resource nodes: a job with no completion deadline requirement is regarded as a no-QoS job; a job whose completion deadline is greater than (1+λ) times its minimum running time on the resource nodes, 0<λ≤0, is regarded as a low-QoS job; otherwise the job is a high-QoS job; the job queuing module (32) classifies the jobs and submits them to the high-QoS queue (35), the low-QoS queue (34) and the no-QoS queue (33) respectively;
The no-QoS queue (33) queues no-QoS jobs; it receives the jobs delivered by the job queuing module (32), and the job extraction module (41) in the scheduling decision module (4) periodically extracts jobs from the no-QoS queue (33);
The low-QoS queue (34) queues low-QoS jobs; it receives the jobs delivered by the job queuing module (32), and the job extraction module (41) in the scheduling decision module (4) periodically extracts jobs from the low-QoS queue (34);
The high-QoS queue (35) queues high-QoS jobs; it receives the jobs delivered by the job queuing module (32), and the job extraction module (41) in the scheduling decision module (4) periodically extracts jobs from the high-QoS queue (35); the queuing policy of the no-QoS queue (33), the low-QoS queue (34) and the high-QoS queue (35) is first come, first served; jobs in the no-QoS queue (33) have the lowest priority, and jobs in the high-QoS queue (35) have the highest priority.
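The classification rule and queue priorities of claim 2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `classify_job`, `PreScheduler`, `submit` and `next_job` are hypothetical, and the value chosen for λ is an assumption made only for illustration.

```python
from collections import deque
from typing import Optional

NO_QOS, LOW_QOS, HIGH_QOS = "none", "low", "high"


def classify_job(deadline: Optional[float], min_runtime: float, lam: float) -> str:
    """Return the QoS level of a job.

    deadline    -- completion deadline, or None if the job has no deadline
    min_runtime -- minimum predicted running time over the resource nodes
    lam         -- the factor λ from the claim (its value here is an assumption)
    """
    if deadline is None:
        return NO_QOS
    if deadline > (1 + lam) * min_runtime:
        return LOW_QOS
    return HIGH_QOS


class PreScheduler:
    """Three first-come, first-served queues, drained in priority order."""

    def __init__(self, lam: float = 0.5):  # 0.5 is a hypothetical value
        self.lam = lam
        self.queues = {HIGH_QOS: deque(), LOW_QOS: deque(), NO_QOS: deque()}

    def submit(self, job_id: str, deadline: Optional[float], min_runtime: float) -> str:
        level = classify_job(deadline, min_runtime, self.lam)
        self.queues[level].append(job_id)
        return level

    def next_job(self) -> Optional[str]:
        # High-QoS jobs are extracted first, then low-QoS, then no-QoS.
        for level in (HIGH_QOS, LOW_QOS, NO_QOS):
            if self.queues[level]:
                return self.queues[level].popleft()
        return None
```

Within each queue, jobs leave in the order in which they arrived; a job in a lower-priority queue is extracted only when all higher-priority queues are empty.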
3. The job scheduling system according to claim 1, characterized in that the scheduling decision module (4) comprises a job extraction module (41), a fault-tolerant strategy module (42), a scheduling strategy module (43) and a resource-job matching module (44);
The job extraction module (41) is responsible for extracting jobs from the no-QoS queue (33), the low-QoS queue (34) and the high-QoS queue (35) in the pre-scheduling module (3), extracting jobs in priority order of high, low, then none, and delivering the jobs to the resource-job matching module (44);
The fault-tolerant strategy module (42) is responsible for providing flexible fault-tolerant strategies for job scheduling, including: providing the retry fault-tolerant strategy for no-QoS jobs, that is, after a job fails, the job is rescheduled to run on another resource node; providing the primary-backup fault-tolerant strategy for low-QoS jobs, in which the backup copy has an active running mode and a passive running mode: when the latest start time of the backup copy overlaps with the expected completion time of the primary copy, the backup copy adopts the active running mode, otherwise it adopts the passive running mode; and providing the replication fault-tolerant strategy for high-QoS jobs, in which the job is scheduled to run on two or more resource nodes simultaneously to ensure that the job completes successfully within the deadline specified by the user; the fault-tolerant strategy module (42) accepts fault-tolerant decision requests from the resource-job matching module (44), and returns the corresponding fault-tolerant strategy to the resource-job matching module (44);
The scheduling strategy module (43) is responsible for providing job scheduling with a scheduling strategy based on the reliable expense of job execution; it accepts scheduling decision requests from the resource-job matching module (44), and returns the corresponding scheduling strategy to the resource-job matching module (44); the reliable expense is the product of the expected running time of the job on a resource and the probability that the resource operates normally during the expected running time of the job;
The resource-job matching module (44) receives the jobs delivered by the job extraction module (41), and interacts with the job completion time prediction module (5) to predict the completion time of each job; then, according to the characteristics of the job, it interacts with the resource information module (6) and the scheduling strategy module (43) to formulate the scheduling strategy of the job and select the optimal resource; it also interacts with the fault-tolerant strategy module (42) to formulate the fault-tolerant strategy of the job; finally, the resource-job matching module (44) schedules the job to the corresponding resource node in the grid resource platform (7).
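As an illustration of claim 3, the following Python sketch computes the reliable expense as the product stated in the claim and picks a fault-tolerant strategy per QoS level. It is a sketch under stated assumptions, not the patented implementation: all function and field names are hypothetical, and the overlap test (the backup's latest start time falling before the primary's expected completion time) is one reading of the claim wording.

```python
from dataclasses import dataclass
from typing import Optional


def reliable_expense(expected_runtime: float, p_normal: float) -> float:
    """Expected running time multiplied by the probability that the resource
    operates normally during that time, as worded in the claim."""
    if not 0.0 <= p_normal <= 1.0:
        raise ValueError("p_normal must be a probability")
    return expected_runtime * p_normal


@dataclass
class FaultTolerancePlan:
    strategy: str                      # "retry", "primary-backup" or "replication"
    backup_mode: Optional[str] = None  # "active" or "passive" (primary-backup only)
    replicas: int = 1                  # number of nodes running the job


def choose_fault_tolerance(
    qos_level: str,
    primary_expected_finish: Optional[float] = None,
    backup_latest_start: Optional[float] = None,
    replicas: int = 2,
) -> FaultTolerancePlan:
    if qos_level == "none":
        # Retry: on failure the job is rescheduled to another resource node.
        return FaultTolerancePlan("retry")
    if qos_level == "low":
        if primary_expected_finish is None or backup_latest_start is None:
            raise ValueError("primary-backup needs both time estimates")
        # Assumption: "overlap" means the backup would have to start
        # before the primary copy is expected to finish.
        overlaps = backup_latest_start < primary_expected_finish
        mode = "active" if overlaps else "passive"
        return FaultTolerancePlan("primary-backup", backup_mode=mode, replicas=2)
    if qos_level == "high":
        if replicas < 2:
            raise ValueError("replication needs two or more resource nodes")
        return FaultTolerancePlan("replication", replicas=replicas)
    raise ValueError(f"unknown QoS level: {qos_level!r}")
```

How the latest start time of the backup is derived (for example from the deadline and the backup's own predicted running time) is left to the caller, since the claim does not spell it out.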
4. The job scheduling system according to claim 1, characterized in that the resource information module (6) comprises a resource information detection module (61), a resource availability prediction module (62) and a database (63);
The resource information detection module (61) is responsible for collecting grid resource information and monitoring changes in the grid resources; it monitors each grid resource node in the grid resource platform (7) and stores the periodically collected static and dynamic resource information in the database (63); in addition, if a resource joins or leaves, the resource information detection module (61) promptly notifies the resource availability prediction module (62);
The resource availability prediction module (62) is responsible for predicting the future availability of resources; it builds the resource availability state space and state-transition matrix based on a Markov prediction model; the resource availability prediction module (62) receives signals from the resource information detection module (61), interacts with the database (63) to update the state space and state-transition matrix of each resource node in a timely manner, and stores the updated results in the database (63);
The database (63) stores the performance information of the grid resource nodes and the availability state space and state-transition matrix of each resource node; it receives information from the resource information detection module (61) and the resource availability prediction module (62) and updates itself accordingly, and it also accepts information queries from the job completion time prediction module (5) and from the resource-job matching module (44) in the scheduling decision module (4).
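The Markov-based availability prediction of claim 4 can be sketched as follows. This is a simplified illustration, not the patented model: the two-state space ("up", "down"), the function names, and the maximum-likelihood estimate of the transition matrix from an observed state history are all assumptions.

```python
from typing import List, Sequence


def estimate_transition_matrix(history: Sequence[str], states: Sequence[str]) -> List[List[float]]:
    """Estimate a row-stochastic transition matrix from an observed state sequence."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    for current, following in zip(history, history[1:]):
        counts[index[current]][index[following]] += 1
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # Assumption: a state never observed as a source stays unchanged.
            matrix.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            matrix.append([c / total for c in row])
    return matrix


def predict_distribution(matrix: List[List[float]], start: Sequence[float], steps: int) -> List[float]:
    """Propagate a state distribution `steps` transitions into the future."""
    dist = list(start)
    n = len(dist)
    for _ in range(steps):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


# Usage: a node observed as up, up, down, up; predict its state two steps ahead.
states = ["up", "down"]
matrix = estimate_transition_matrix(["up", "up", "down", "up"], states)
future = predict_distribution(matrix, [1.0, 0.0], steps=2)
```

When a node joins, leaves, or reports a new state, the history grows and the matrix can be re-estimated, which corresponds to the timely update of the state space and state-transition matrix described in the claim.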
CN2008100481627A 2008-06-21 2008-06-21 Job scheduling system suitable for grid environment and based on reliable expense Expired - Fee Related CN101309208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100481627A CN101309208B (en) 2008-06-21 2008-06-21 Job scheduling system suitable for grid environment and based on reliable expense

Publications (2)

Publication Number Publication Date
CN101309208A CN101309208A (en) 2008-11-19
CN101309208B true CN101309208B (en) 2010-12-01

Family

ID=40125436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100481627A Expired - Fee Related CN101309208B (en) 2008-06-21 2008-06-21 Job scheduling system suitable for grid environment and based on reliable expense

Country Status (1)

Country Link
CN (1) CN101309208B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101448026B (en) * 2008-12-16 2012-05-23 中国科学技术大学 Method for computing node selection in grid market on the basis of trust filtering
CN101488872B (en) * 2009-01-09 2011-02-09 哈尔滨工业大学 Biological information computing grid system
CN101694712B (en) * 2009-08-28 2016-04-06 曙光信息产业(北京)有限公司 The bill generation system of cluster job scheduling and method
CN102232282B (en) * 2010-10-29 2014-03-26 华为技术有限公司 Method and apparatus for realizing load balance of resources in data center
EP2685693A4 (en) * 2011-03-09 2014-05-07 Computer Network Inf Ct Cas Method for gathering queue information and job information in computation environment
CN102185759A (en) * 2011-04-12 2011-09-14 田文洪 Multi-physical server load equalizing method and device capable of meeting requirement characteristic
CN102185779B (en) * 2011-05-11 2015-02-25 田文洪 Method and device for realizing data center resource load balance in proportion to comprehensive allocation capability
CN102223395B (en) * 2011-05-11 2014-05-07 田文洪 Method and device for balancing dynamic load of middleware in radio frequency identification (RFID) network
CN102831012A (en) * 2011-06-16 2012-12-19 日立(中国)研究开发有限公司 Task scheduling device and task scheduling method in multimode distributive system
CN102902581B (en) 2011-07-29 2016-05-11 国际商业机器公司 Hardware accelerator and method, CPU, computing equipment
JP5891664B2 (en) * 2011-09-08 2016-03-23 富士ゼロックス株式会社 Information management apparatus, program, and information management system
US9628402B2 (en) 2011-09-22 2017-04-18 International Business Machines Corporation Provisioning of resources
CN104915253B (en) * 2014-03-12 2019-05-10 ***通信集团河北有限公司 A kind of method and job processor of job scheduling
KR102101319B1 (en) * 2014-06-30 2020-04-16 콘비다 와이어리스, 엘엘씨 Network node availability prediction based on past history data
CN105718316A (en) * 2014-12-01 2016-06-29 ***通信集团公司 Job scheduling method and apparatus
CN105718479B (en) * 2014-12-04 2020-02-28 中国电信股份有限公司 Execution strategy generation method and device under cross-IDC big data processing architecture
CN105610814B (en) * 2015-12-25 2018-09-21 盛科网络(苏州)有限公司 Reduce the method and system of message Forwarding Latency
US10692012B2 (en) * 2016-05-29 2020-06-23 Microsoft Technology Licensing, Llc Classifying transactions at network accessible storage
CN106326003B (en) * 2016-08-11 2019-06-28 中国科学院重庆绿色智能技术研究院 A kind of job scheduling and computational resource allocation method
CN106790529B (en) * 2016-12-20 2019-07-02 北京并行科技股份有限公司 Dispatching method, control centre and the scheduling system of computing resource
CN107168790B (en) * 2017-03-31 2020-04-03 北京奇艺世纪科技有限公司 Job scheduling method and device
CN107450983A (en) * 2017-07-14 2017-12-08 中国石油大学(华东) It is a kind of based on the hierarchical network resource regulating method virtually clustered and system
CN108536528A (en) * 2018-03-23 2018-09-14 湖南大学 Using the extensive network job scheduling method of perception
CN110362390B (en) * 2019-06-06 2021-09-07 银江股份有限公司 Distributed data integration job scheduling method and device
CN114981778A (en) * 2020-01-14 2022-08-30 华为技术有限公司 Method for determining chip state, method for scheduling cluster resources and device thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1713150A (en) * 2005-06-16 2005-12-28 武汉理工大学 Construction of grid experimental system under single machine system environment
CN1744035A (en) * 2005-09-15 2006-03-08 上海交通大学 J2EE operating platform for supporting grid computing standard WSRF
WO2007017557A1 (en) * 2005-08-08 2007-02-15 Techila Technologies Oy Management of a grid computing network using independent softwqre installation packages

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105245472A (en) * 2015-10-22 2016-01-13 国网安徽省电力公司 Inheriting power supply information priority scheduling policy
CN105245472B (en) * 2015-10-22 2018-09-14 国网安徽省电力公司 A kind of inheritance power supply information priorities dispatching method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20101201

Termination date: 20130621