CN101309208B - Job scheduling system suitable for grid environment and based on reliable expense

Info

Publication number: CN101309208B
Application number: CN2008100481627A
Authority: CN
Inventors: 金海; 陶永才; 吴松; 邹德清; 石宣化; 曹海军
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2008-06-21
Filing date: 2008-06-21
Publication date: 2010-12-01
Anticipated expiration: 2028-06-21
Also published as: CN101309208A

Abstract

The invention relates to a job scheduling system based on reliability cost that is suitable for grid environments. As shown in Fig. 1, the system has three layers: the first layer is job submission interface module 1, the second layer is job scheduling module 2, and the bottom layer is grid resource platform 7. In terms of operating principle, the core of the invention is the job scheduling module of the second layer, which comprises pre-scheduling module 3, scheduling decision module 4, job completion time prediction module 5 and resource information module 6. The system proposes a job running time prediction model and a resource availability prediction model; the former is based on a mathematical model and the latter on a Markov model, and both offer high accuracy and generality. Depending on the QoS requirements of the job and the characteristics of the resources, the system applies a replication fault-tolerance strategy, a primary-backup asynchronous fault-tolerance strategy or a retry fault-tolerance strategy, which makes it flexible and effective. The system supports both computation-intensive and data-intensive jobs and is therefore general. Compared with prior-art scheduling systems, it supports more concurrent users, improves resource utilization, and offers good generality, good extensibility and high system throughput.

Description

A job scheduling system based on reliability cost, suitable for grid environments

Technical field

The invention belongs to the field of grid computing, and specifically relates to a job scheduling system based on reliability cost that is suitable for grid environments.

Background technology

A grid integrates the dynamic, autonomous and heterogeneous network resources distributed across the Internet (including high-speed networks, computers, large databases, sensors, remote devices, etc.). It shields the dynamism, heterogeneity and distribution of these resources and provides users with an efficient environment for resource sharing and collaborative work. Since it was proposed, the grid has attracted great attention from academia and industry and has developed rapidly. Compared with traditional distributed high-performance computing, the grid has the following advantages: (1) it makes effective use of resources distributed over a wide area; (2) it enables efficient collaboration between heterogeneous organizations; (3) it handles computation-intensive and data-intensive tasks effectively; (4) following the ideas of OGSA, it moves workflows toward "service flows". The grid offers an effective way to solve complex, large-scale scientific research tasks, for example aircraft design, gene sequencing and atmospheric environment analysis.

Grid resources mostly belong to different organizations, and most of them are non-dedicated resources that can join and leave dynamically. In addition, changes in resource sharing policy, hardware and software failures, and network outages can make grid resources unavailable. The dynamic nature of resources therefore causes grid jobs to fail frequently, and user QoS cannot be guaranteed. Job scheduling in a grid environment thus faces many new challenges. To better exploit the grid's abundant resources and extensibility, reliable job scheduling becomes key to the performance of a grid system.

Grid resources are presented to users as services, and users consume them by submitting jobs to the grid system. After receiving a user's job, the grid job scheduling system interacts with the information center according to the user's QoS requirements and matches a set of qualified grid service resources. The job scheduling system then selects the best resource for the user's job according to a specific scheduling strategy. Existing job scheduling strategies are mostly based on the performance-driven model, the economy-driven model and the trust-driven model [see K. Krauter, R. Buyya, and M. Maheswaran, A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing, Software Practice and Experience, 32(2): 135-164, February 2002]. The performance-driven model focuses on performance metrics such as system throughput and job execution efficiency. The economy-driven model focuses on selecting the cheapest resource service that still satisfies the user's QoS requirements. The trust-driven model builds a trust model for each resource from its service history (for example, resource failure rate and job success rate) and performs trustworthy job scheduling based on that model. When a job is interrupted during execution because of a resource failure or for other reasons, the system applies a fault-tolerance strategy. Commonly used fault-tolerance strategies include checkpointing, multi-site replication and retry. The checkpointing strategy periodically saves the results and state of the job; when a resource fails, the system rolls the job back to the checkpoint recorded before the failure and, after restoring the state, resumes execution from that checkpoint rather than from the beginning, which saves resources and reduces job loss. The multi-site replication strategy schedules the job onto two or more different resource nodes at the same time; as long as one resource node runs normally, the job is guaranteed to complete. The retry strategy reschedules the job when it fails; the job may be rescheduled onto the same resource node or onto another one.

Traditional job scheduling systems do not take full account of the dynamic nature of resources in a grid environment, so job failures occur frequently. In addition, traditional scheduling systems mostly use a single fault-tolerance mechanism, which lacks flexibility and wastes system resources.

Summary of the invention

The object of the invention is to address the deficiencies of existing job scheduling systems by providing a job scheduling system based on reliability cost that is suitable for grid environments. The system takes full account of the QoS requirements of jobs and the reliability of resources, automatically selects a suitable fault-tolerance strategy for each job, and is efficient and general.

To achieve the above object, the job scheduling system based on reliability cost that is suitable for grid environments is characterized in that it comprises a job submission interface module and a job scheduling module;

The job submission interface module is used by users to submit jobs, and passes them to the job scheduling module;

The job scheduling module receives the jobs submitted through the job submission interface module and, after scheduling them and customizing a fault-tolerance strategy, assigns each job to the corresponding resource node in the grid resource platform; it comprises a pre-scheduling module, a scheduling decision module, a job completion time prediction module and a resource information module;

The pre-scheduling module analyzes the QoS requirements of jobs, and classifies and prioritizes user jobs; it receives the jobs sent by the job submission interface module, interacts with the job completion time prediction module, and classifies and prioritizes the jobs according to that module's predictions; the pre-scheduling module also serves as the job pool of the scheduling decision module, supplying it with jobs;

The job completion time prediction module predicts the completion time of each job on each resource node; it accepts job completion time prediction requests from the pre-scheduling module and the scheduling decision module and, after predicting, returns the results to the pre-scheduling module and the scheduling decision module respectively; it also interacts with the resource information module, through which it queries the performance information of each resource;

The resource information module collects real-time status information about the grid resource nodes; it accepts resource query requests from the scheduling decision module and the job completion time prediction module and returns the corresponding query results to them; it interacts with each resource in the underlying grid resources through periodic polling and a publish/subscribe mechanism, and keeps the performance information of each resource up to date;

The scheduling decision module schedules user jobs on the basis of reliability cost according to the future availability of resource nodes, and also formulates a fault-tolerance strategy for each scheduled job according to the job's QoS requirements and the future availability of the resource node it is scheduled on; the scheduling decision module fetches jobs awaiting scheduling from the pre-scheduling module and requests the job completion time prediction module to predict each job's running time on each resource; then, according to the job's running requirements and its running time on each resource, it interacts with the resource information module to match the job with the best resource; finally, it schedules the job onto the corresponding resource node in the grid resource platform.

The system proposes a job running time prediction model and a resource availability prediction model; the former is based on a mathematical model and the latter on a Markov model, and both offer high accuracy and generality. Depending on the characteristics of the job and the resources, the system applies a primary-backup simultaneous-run strategy, a primary-backup asynchronous-run strategy or a retry strategy, which makes it highly flexible and effective. The system also supports both computation-intensive and data-intensive jobs, which gives it good generality. Compared with existing scheduling systems, the invention supports more concurrent users, improves resource utilization, and offers good generality, good extensibility and high system throughput. Specifically, the invention has the following features.

(1) High accuracy of the prediction models

The job running time prediction model is based on a mathematical model and the resource availability prediction model on a Markov model; both offer high accuracy and generality.

(2) Improved reliability of job execution

The system proposes a reliability cost model for successful job execution, which takes full account of the QoS requirements of jobs and the future availability of resources. Jobs are scheduled reliably on the basis of this model.

(3) Improved system throughput

The system selects suitable resources according to the job's QoS requirements, for example choosing low-reliability resources for low-QoS jobs so that high-reliability resources are reserved for high-QoS jobs. This satisfies job requirements while saving resources and balancing the system load, which in turn improves system throughput.

(4) Support for multiple types of grid job

The system uses a general-purpose scheduling design that supports both computation-intensive and data-intensive jobs, giving it high generality.

Description of drawings

Fig. 1 is a structural diagram of the job scheduling system based on reliability cost that is suitable for grid environments;

Fig. 2 is a structural diagram of the pre-scheduling module;

Fig. 3 is a structural diagram of the scheduling decision module;

Fig. 4 is a structural diagram of the resource information module;

Fig. 5 is a flowchart of job pre-scheduling;

Fig. 6 is a flowchart of the job scheduling decision process.

Embodiment

The invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, the job scheduling system based on reliability cost that is suitable for grid environments (hereinafter DGSS, Dependable Grid Scheduling System) is divided into two layers: the first layer is job submission interface module 1 and the second layer is job scheduling module 2. Beneath the system is grid resource platform 7. In terms of operating principle, the core of the invention is the job scheduling module of the second layer, which comprises pre-scheduling module 3, scheduling decision module 4, job completion time prediction module 5 and resource information module 6.

Job submission interface module 1 is used by users to submit jobs and passes them to pre-scheduling module 3 in job scheduling module 2. It supports several ways of submitting jobs, including Web-based, command-line and JSDL-based submission.

Job scheduling module 2 receives the jobs submitted by users and is responsible for scheduling them reliably and customizing an effective fault-tolerance strategy. It receives the jobs submitted through job submission interface module 1 and, after scheduling and fault-tolerance strategy customization, assigns each job to the corresponding resource node in grid resource platform 7.

Pre-scheduling module 3 analyzes the QoS requirements of jobs, and classifies and prioritizes user jobs. It receives the jobs sent by job submission interface module 1, interacts with job completion time prediction module 5, and classifies and prioritizes the jobs according to that module's predictions. Pre-scheduling module 3 also serves as the job pool of scheduling decision module 4, supplying it with jobs.

Job completion time prediction module 5 predicts the completion time of each job on each resource node. Its theoretical model for the completion time of job i on resource node j is:

$$CT_{ij} = \beta \left( \frac{Q_j}{PC_j} + \frac{D_{in}}{BW_{kj}} + \frac{Q}{PC_j} + \frac{D_{out}}{BW_{kj}} \right)$$

where $CT_{ij}$ is the expected completion time of job i on resource node j; $\beta$ is a correction parameter whose value is determined from the accuracy of past completion time predictions; $Q_j$ is the workload waiting to be processed on resource node j; $Q$ is the computation amount of job i; $PC_j$ is the computing capability of resource node j; $D_{in}$ is the amount of data needed to run job i; $D_{out}$ is the amount of data job i outputs after it finishes; and $BW_{kj}$ is the data bandwidth between resource node k, which stores the data, and resource node j. Job completion time prediction module 5 accepts job completion time prediction requests from pre-scheduling module 3 and scheduling decision module 4 and, after predicting, returns the results to pre-scheduling module 3 and scheduling decision module 4 respectively. It also interacts with resource information module 6, through which it queries the performance information of each resource.
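The completion time model above can be sketched in a few lines of Python. This is a minimal illustration only: the `Node` class, the function name and the sample numbers are hypothetical and do not come from the patent, and a single bandwidth value is assumed for both staging data in and staging results out, as the formula uses the same bandwidth term for both.

```python
from dataclasses import dataclass


@dataclass
class Node:
    queued_work: float  # Q_j: workload already waiting on node j
    capacity: float     # PC_j: computing capability of node j


def completion_time(job_work, data_in, data_out, node, bandwidth, beta=1.0):
    """Expected completion time of a job on a node, following the four-term model."""
    wait = node.queued_work / node.capacity  # Q_j / PC_j: time to drain the queue
    stage_in = data_in / bandwidth           # D_in / BW_kj: time to fetch input data
    run = job_work / node.capacity           # Q / PC_j: time to compute the job
    stage_out = data_out / bandwidth         # D_out / BW_kj: time to ship output data
    return beta * (wait + stage_in + run + stage_out)


# Hypothetical sample values, chosen only to exercise the function.
node = Node(queued_work=200.0, capacity=100.0)
ct = completion_time(job_work=500.0, data_in=50.0, data_out=10.0,
                     node=node, bandwidth=10.0, beta=1.1)
print(ct)
```

With these sample values the four terms are 2, 5, 5 and 1 time units before the correction factor is applied, which makes it easy to see that data staging can matter as much as computation for data-intensive jobs.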

Resource information module 6 collects real-time status information about the grid resource nodes, including each resource's CPU, memory usage, disk space and future availability. It accepts resource query requests from scheduling decision module 4 and job completion time prediction module 5 and returns the corresponding query results to them. Resource information module 6 interacts with each resource in the underlying grid resource platform 7 through periodic polling and a publish/subscribe mechanism, and keeps the performance information of each resource up to date.

Scheduling decision module 4 schedules user jobs on the basis of reliability cost according to the future availability of resource nodes, and also formulates a flexible fault-tolerance strategy for each scheduled job according to the job's QoS requirements and the future availability of the resource node it is scheduled on. Scheduling decision module 4 fetches jobs awaiting scheduling from pre-scheduling module 3 and requests job completion time prediction module 5 to predict each job's running time on each resource; then, according to the job's running requirements and its running time on each resource, it interacts with resource information module 6 to match the job with the best resource; finally, it schedules the job onto the corresponding resource node in grid resource platform 7.

The bottom layer is grid resource platform 7, which aggregates a large number of resource nodes. Grid services are deployed on the resource nodes, and user jobs are ultimately scheduled onto resource nodes for execution.

The DGSS system selects suitable resources according to the job's QoS requirements, for example choosing low-reliability resources for low-QoS jobs so that high-reliability resources are reserved for high-QoS jobs. This satisfies job requirements while saving resources, balancing the system load and improving system throughput.

A preferred implementation of each module in job scheduling module 2 is described below by way of example; persons skilled in the art can implement the modules in many other ways based on the disclosure of the invention, and the scope of protection of the invention is not limited to the following example.

As shown in Fig. 2, pre-scheduling module 3 comprises job QoS parsing module 31, job queue module 32, no-QoS queue 33, low-QoS queue 34 and high-QoS queue 35.

Job QoS parsing module 31 receives the jobs that users submit through job submission interface module 1, parses each job, and extracts its QoS requirements, for example CPU, memory usage, disk space and completion deadline. After processing, it passes the job and the parsed information to job queue module 32.

Job queue module 32 receives the jobs and parsed information passed on by job QoS parsing module 31, interacts with job completion time prediction module 5, and determines the QoS level of each job from the job's QoS requirements and the characteristics of the resource nodes. A job with no completion deadline is treated as a no-QoS job; a job whose completion deadline is greater than (1+λ) times its minimum running time on the resource nodes is treated as a low-QoS job (0 < λ ≤ 1); otherwise it is a high-QoS job. Finally, job queue module 32 classifies the jobs and submits them to high-QoS queue 35, low-QoS queue 34 and no-QoS queue 33 respectively.

No-QoS queue 33 queues no-QoS jobs. It receives the jobs passed on by job queue module 32, and job extraction module 41 in scheduling decision module 4 periodically extracts jobs from no-QoS queue 33.

Low-QoS queue 34 queues low-QoS jobs. It receives the jobs passed on by job queue module 32, and job extraction module 41 in scheduling decision module 4 periodically extracts jobs from low-QoS queue 34.

High-QoS queue 35 queues high-QoS jobs. It receives the jobs passed on by job queue module 32, and job extraction module 41 in scheduling decision module 4 periodically extracts jobs from high-QoS queue 35. No-QoS queue 33, low-QoS queue 34 and high-QoS queue 35 all use a first-come, first-served queuing policy; jobs in no-QoS queue 33 have the lowest priority and jobs in high-QoS queue 35 the highest.

As shown in Fig. 3, scheduling decision module 4 comprises job extraction module 41, fault-tolerance strategy module 42, scheduling strategy module 43 and resource-job matching module 44.

Job extraction module 41 extracts jobs from no-QoS queue 33, low-QoS queue 34 and high-QoS queue 35 in pre-scheduling module 3, in the priority order high, low, none, and passes them to resource-job matching module 44.
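The three queues and the extraction order can be illustrated with the following sketch. It assumes strict priority between queues (a lower queue is served only when every higher queue is empty) and first-come, first-served order within a queue; the class and method names are hypothetical, and the patent does not say whether extraction is strictly prioritized in this way.

```python
from collections import deque


class JobPool:
    """Three first-come, first-served queues served in the order high, low, none."""

    LEVELS = ("high", "low", "none")  # extraction order, highest priority first

    def __init__(self):
        self.queues = {level: deque() for level in self.LEVELS}

    def submit(self, job_id, level):
        """Append a classified job to the tail of its QoS queue."""
        self.queues[level].append(job_id)

    def extract(self):
        """Return (level, job_id) for the next job, or None if all queues are empty."""
        for level in self.LEVELS:
            if self.queues[level]:
                return level, self.queues[level].popleft()
        return None


pool = JobPool()
pool.submit("a", "none")
pool.submit("b", "low")
pool.submit("c", "high")
pool.submit("d", "high")
order = [pool.extract() for _ in range(5)]
print(order)
```

A strict-priority pool like this can starve no-QoS jobs under sustained load; an implementation that needs fairness would have to add aging or quotas, which this sketch leaves out.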

Fault-tolerance strategy module 42 provides flexible fault-tolerance strategies for job scheduling. For no-QoS jobs it provides the retry strategy: after a job fails, it is rescheduled onto another resource node. For low-QoS jobs it provides the primary-backup strategy, in which the backup has an active mode and a passive mode: when the latest start time of the backup overlaps the expected completion time of the primary, the backup runs in active mode; otherwise it runs in passive mode. For high-QoS jobs it provides the replication strategy, which schedules the job onto two or more resource nodes at the same time to ensure that the job completes within the deadline specified by the user. Fault-tolerance strategy module 42 accepts fault-tolerance decision requests from resource-job matching module 44 and returns the corresponding fault-tolerance strategy to it.
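A possible shape for this decision is sketched below. The function name and the returned labels are hypothetical. The sketch also makes one interpretive assumption: "overlap" is read as the backup's latest start time falling before the primary's expected completion time, meaning the backup cannot afford to wait for the primary to finish and must therefore run alongside it.

```python
def choose_fault_tolerance(qos_level, backup_latest_start=None,
                           primary_expected_finish=None):
    """Map a job's QoS level to a fault-tolerance strategy label."""
    if qos_level == "none":
        return "retry"
    if qos_level == "high":
        return "replication"
    if qos_level == "low":
        if backup_latest_start is None or primary_expected_finish is None:
            raise ValueError("low-QoS jobs need both timing values")
        # Assumed reading of "overlap": the backup must start before the
        # primary is expected to finish, so it runs concurrently (active).
        if backup_latest_start < primary_expected_finish:
            return "primary-backup (active)"
        # Otherwise the backup can wait and start only if the primary fails.
        return "primary-backup (passive)"
    raise ValueError(f"unknown QoS level: {qos_level!r}")


print(choose_fault_tolerance("none"))
print(choose_fault_tolerance("low", backup_latest_start=40, primary_expected_finish=60))
print(choose_fault_tolerance("low", backup_latest_start=80, primary_expected_finish=60))
print(choose_fault_tolerance("high"))
```

Passive mode saves resources because the backup consumes a node only after a failure, which fits the text's goal of reserving reliable resources for jobs that need them.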

Scheduling strategy module 43 provides a reliable scheduling strategy for job scheduling. The invention provides a scheduling strategy based on the reliability cost of running a job, where the reliability cost is the product of the job's expected running time on a resource and the probability that the resource operates normally during that expected running time. Scheduling strategy module 43 accepts scheduling decision requests from resource-job matching module 44 and returns the corresponding scheduling strategy to it.
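As a rough illustration, the reliability cost can be computed as a plain product and the node chosen by minimum cost, which is the selection rule given in step (2.3) of the scheduling decision flow. The function names, node identifiers and numbers are hypothetical; the probability term is taken as a ready-made input exactly as the text defines it, and how it is derived from the availability model is not shown here.

```python
def reliability_cost(expected_runtime, probability):
    """Product of the expected running time and the probability term from the text."""
    return expected_runtime * probability


def pick_node(candidates):
    """Return the node id with the lowest reliability cost.

    candidates maps node id -> (expected_runtime, probability).
    """
    return min(candidates, key=lambda node: reliability_cost(*candidates[node]))


# Hypothetical candidate nodes for one job.
candidates = {
    "node-a": (120.0, 0.90),
    "node-b": (100.0, 0.95),
    "node-c": (150.0, 0.99),
}
best = pick_node(candidates)
print(best)
```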

Resource-job matching module 44 receives the jobs passed on by job extraction module 41 and interacts with job completion time prediction module 5 to predict their completion times; then, according to the job's characteristics, it interacts with resource information module 6 and scheduling strategy module 43 to formulate the job's scheduling strategy and select the optimal resource; it also interacts with fault-tolerance strategy module 42 to formulate the job's fault-tolerance strategy; finally, it schedules the job onto the corresponding resource node in grid resource platform 7.

Job completion time prediction module 5 predicts the expected completion time of a job on each grid resource node. It receives job information from resource-job matching module 44 in scheduling decision module 4, obtains the dynamic performance information of the grid resource nodes from resource information module 6, predicts the job's completion time, and returns the prediction to resource-job matching module 44.

As shown in Fig. 4, resource information module 6 comprises resource information detection module 61, resource availability prediction module 62 and database 63.

Resource information detection module 61 collects grid resource information and monitors changes in grid resources. It monitors every grid resource node in grid resource platform 7 and periodically stores the static and dynamic resource information it collects in database 63. If a resource joins or leaves (or fails and becomes unreachable), resource information detection module 61 promptly notifies resource availability prediction module 62.
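The combination of periodic collection and join/leave notification can be sketched as a small in-memory monitor. Everything here is an assumption for illustration: the class, the callback signature and the event names are hypothetical, a dictionary stands in for database 63, and real polling over the network is replaced by direct calls to `report`.

```python
class ResourceMonitor:
    """Keeps the latest status per node and notifies subscribers on join and leave."""

    def __init__(self):
        self.database = {}     # node id -> latest status (stand-in for database 63)
        self.subscribers = []  # callbacks taking (event, node_id)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def _publish(self, event, node_id):
        for callback in self.subscribers:
            callback(event, node_id)

    def report(self, node_id, status):
        """Record a periodic status sample; the first sample counts as a join."""
        is_new = node_id not in self.database
        self.database[node_id] = status
        if is_new:
            self._publish("join", node_id)

    def mark_unreachable(self, node_id):
        """Drop a node that left or failed, and tell subscribers about it."""
        if self.database.pop(node_id, None) is not None:
            self._publish("leave", node_id)


events = []
monitor = ResourceMonitor()
# The availability predictor would subscribe here; a list stands in for it.
monitor.subscribe(lambda event, node_id: events.append((event, node_id)))
monitor.report("node-1", {"cpu_load": 0.2})
monitor.report("node-1", {"cpu_load": 0.5})
monitor.mark_unreachable("node-1")
print(events)
```

Routine status samples only update the stored record, while membership changes are pushed to subscribers, so the predictor is not woken for every poll.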

Resource availability prediction module 62 predicts the future availability of resources. It builds the availability state space and state-transition matrix of each resource based on the Markov prediction model. It receives signals from resource information detection module 61, interacts with database 63 to update the state space and state-transition matrix of the resource node in a timely manner, and stores the updated result in database 63.
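A minimal Markov-chain sketch of this kind of prediction follows. The patent does not list the states it uses, so the two-state space, the function names and the sample history are all assumptions; the transition matrix is estimated by counting observed transitions, and a state with no observed outgoing transition is assumed to stay where it is.

```python
from collections import Counter

STATES = ("available", "unavailable")  # assumed two-state space


def transition_matrix(history):
    """Estimate transition probabilities from a sequence of observed states."""
    counts = Counter(zip(history, history[1:]))
    matrix = {}
    for src in STATES:
        total = sum(counts[(src, dst)] for dst in STATES)
        matrix[src] = {
            # With no observations for src, assume the state persists.
            dst: (counts[(src, dst)] / total) if total else float(src == dst)
            for dst in STATES
        }
    return matrix


def predict(matrix, current, steps):
    """Return the state distribution after the given number of steps."""
    dist = {s: float(s == current) for s in STATES}
    for _ in range(steps):
        dist = {dst: sum(dist[src] * matrix[src][dst] for src in STATES)
                for dst in STATES}
    return dist


history = ["available", "available", "unavailable", "available", "available"]
matrix = transition_matrix(history)
forecast = predict(matrix, "available", steps=2)
print(matrix)
print(forecast)
```

Each new observation only changes the transition counts, so the matrix can be refreshed cheaply whenever the detection module reports a state change.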

Database 63 stores the performance information of the grid resource nodes and the availability state space and state-transition matrix of each resource node. It receives information from resource information detection module 61 and resource availability prediction module 62 and updates itself accordingly; it also accepts information queries from job completion time prediction module 5 and from resource-job matching module 44 in scheduling decision module 4.

The workflows of the invention are described below:

(1) Job pre-scheduling flow (as shown in Fig. 5)

(1.1) The system receives a job that a user submits through job submission interface module 1;

(1.2) Job QoS parsing module 31 parses the job and extracts its QoS requirements;

(1.3) Job queue module 32 determines the job's QoS level; if the job has no deadline, it is added to no-QoS queue 33;

(1.4) Job queue module 32 predicts the job's expected completion time on each grid resource node;

(1.5) The job's average running time T over the resource nodes is computed;

(1.6) If Tn + (1+λ)T ≥ Td, the job is added to the high-QoS queue (Tn is the current system time, Td is the job's completion deadline, and 0 < λ ≤ 1);

(1.7) Otherwise, the job is added to the low-QoS queue.
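Steps (1.3) to (1.7) can be sketched as a single classification function. The function name, the default value of λ and the sample inputs are hypothetical; a missing deadline is represented by `None`, and the per-node running times are assumed to have been predicted already.

```python
def classify_job(now, deadline, run_times, lam=0.5):
    """Return the QoS level of a job following the pre-scheduling flow.

    now       -- current system time Tn
    deadline  -- completion deadline Td, or None if the job has none
    run_times -- predicted running times of the job on the candidate nodes
    lam       -- slack factor, required to satisfy 0 < lam <= 1
    """
    if deadline is None:
        return "none"                           # step (1.3): no deadline
    if not 0 < lam <= 1:
        raise ValueError("lam must satisfy 0 < lam <= 1")
    mean_run = sum(run_times) / len(run_times)  # step (1.5): average running time T
    if now + (1 + lam) * mean_run >= deadline:  # step (1.6): deadline is tight
        return "high"
    return "low"                                # step (1.7): deadline has slack


print(classify_job(now=0, deadline=None, run_times=[10]))
print(classify_job(now=0, deadline=100, run_times=[10, 20, 30]))
print(classify_job(now=0, deadline=25, run_times=[10, 20, 30]))
```

A larger λ treats more jobs as deadline-critical, so it acts as a knob for how conservatively the scheduler reserves the stronger fault-tolerance strategies.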

(2) Job scheduling decision flow (as shown in Fig. 6)

(2.1) Job extraction module 41 extracts jobs from the high-, low- and no-QoS queues 35, 34 and 33 respectively;

(2.2) Resource-job matching module 44 computes the job's reliability cost on each resource;

(2.3) Resource-job matching module 44 interacts with scheduling strategy module 43 and schedules the job onto the resource node with the lowest reliability cost;

(2.4) Resource-job matching module 44 interacts with fault-tolerance strategy module 42; if the job is a high-QoS job, the primary-backup simultaneous-run strategy is applied to it;

(2.5) If the job is a low-QoS job, the primary-backup asynchronous-run strategy is applied to it;

(2.6) If the job is a no-QoS job, the retry fault-tolerance strategy is applied to it.

The configuration used when implementing the system is described below by way of example.

To verify the feasibility and effectiveness of the system, it was deployed in a real environment. A job scheduling system based on reliability cost suitable for grid environments was built on a cluster of 40 nodes. The basic configuration is shown in Table 1:

Table 1. System configuration

CPU	Memory	Hard disk	Network card	Operating system	Network
2 CPUs × 4 cores	8 GB	74 GB	3C905B	Linux AS 4	100M switch

One node hosts job submission interface module 1 and acts as the front-end machine, through which users submit jobs. Two nodes host job scheduling module 2: one runs pre-scheduling module 3 and scheduling decision module 4, and the other runs job completion time prediction module 5 and resource information module 6. The remaining 37 nodes simulate grid resource platform 7 and represent the grid resource nodes; each resource node has a different reliability.

Experimental testing showed that the system achieves the following:

(1) Resource information module 6 detects performance changes of the underlying grid resource nodes in a timely manner and predicts the future availability of the grid resource nodes effectively.

(2) Job completion time prediction module 5, which is built on a mathematical model and uses the performance information of the grid resources, predicts the expected completion time of jobs accurately.

(3) Based on the job's QoS requirements, the predicted job completion time and the predicted future availability of the grid resource nodes, jobs are scheduled according to reliability cost and an optimized fault-tolerance strategy is selected for each job.

(4) Test data show that the system improves the job success rate, system throughput and load balancing.

Claims

1. job scheduling system based on reliable expense that is applicable to grid environment is characterized in that: it comprises that operation submits interface module (1) and job scheduling module (2) to;

Operation submits to interface module (1) to be used for user's submit job, and sends job scheduling module (2) to;

Job scheduling module (2) is used to receive the operation that operation submits to interface module (1) to submit to, dispatch with the fault-tolerant strategy customization after, operation is assigned to corresponding resource node in the gridding resource platform (7); It comprises pre-scheduling module (3), scheduling decision module (4), operation deadline prediction module (5) and resource information module (6);

Pre-scheduling module (3) is analyzed by the QoS requirement to operation, and user job is classified and list in order of importance and urgency; Pre-scheduling module (3) receives the operation that operation submits to module (1) to send, and carries out according to the information of forecasting of operation deadline prediction module (5) operation being classified and list in order of importance and urgency alternately with operation deadline prediction module (5); Pre-scheduling module (3) while is as the operating pool of scheduling decision module (4), for scheduling decision module (4) provides operation;

The job completion-time prediction module (5) predicts the completion time of each job on each resource node; it accepts job completion-time prediction requests from the pre-scheduling module (3) and the scheduling decision module (4) and, after prediction, returns the results to the pre-scheduling module (3) and the scheduling decision module (4) respectively; the job completion-time prediction module (5) interacts with the resource information module (6), through which it queries the performance information of each resource;

The resource information module (6) collects the real-time status information of the grid resource nodes; it accepts resource query requests from the scheduling decision module (4) and the job completion-time prediction module (5), and returns the corresponding query results to the scheduling decision module (4) and the job completion-time prediction module (5); the resource information module (6) interacts with each resource in the underlying grid resource platform (7) through periodic polling and a publish/subscribe mechanism, and updates the performance information of each resource in a timely manner;

The scheduling decision module (4) schedules user jobs on the basis of reliability cost according to the future availability of the resource nodes, and at the same time formulates a fault-tolerance strategy for each scheduled job according to the QoS requirements of the job and the future availability of the resource node to which it is scheduled; the scheduling decision module (4) fetches jobs to be scheduled from the pre-scheduling module (3), and requests the job completion-time prediction module (5) to predict the running time of each job on each resource; then, according to the running requirements of the job and its running time on each resource, it interacts with the resource information module (6) to match the job with the best resource; finally, the scheduling decision module (4) dispatches the job to the corresponding resource node on the grid resource platform (7).

2. The job scheduling system according to claim 1, characterized in that: the pre-scheduling module (3) comprises a job QoS parsing module (31), a job queuing module (32), a no-QoS queue (33), a low-QoS queue (34) and a high-QoS queue (35);

The job QoS parsing module (31) receives the jobs that users submit through the job submission interface module (1), parses each job, extracts its QoS requirements and, after processing, delivers the job and the parsed information to the job queuing module (32);

The job queuing module (32) receives the jobs and parsed information delivered by the job QoS parsing module (31), interacts with the job completion-time prediction module (5), and determines the QoS level of each job according to the QoS requirements of the job and the characteristics of the resource nodes: a job with no completion time limit is regarded as a no-QoS job; a job whose completion time limit is greater than (1+λ) times its minimum running time on the resource nodes, 0 < λ ≤ 0, is regarded as a low-QoS job; otherwise the job is a high-QoS job; the job queuing module (32) classifies the jobs and submits them to the high-QoS queue (35), the low-QoS queue (34) and the no-QoS queue (33) respectively;

The no-QoS queue (33) queues no-QoS jobs; it receives the jobs delivered by the job queuing module (32), and the job extraction module (41) in the scheduling decision module (4) periodically extracts jobs from the no-QoS queue (33);

The low-QoS queue (34) queues low-QoS jobs; it receives the jobs delivered by the job queuing module (32), and the job extraction module (41) in the scheduling decision module (4) periodically extracts jobs from the low-QoS queue (34);

The high-QoS queue (35) queues high-QoS jobs; it receives the jobs delivered by the job queuing module (32), and the job extraction module (41) in the scheduling decision module (4) periodically extracts jobs from the high-QoS queue (35); the queuing policy of the no-QoS queue (33), the low-QoS queue (34) and the high-QoS queue (35) is first-come, first-served; jobs in the no-QoS queue (33) have the lowest priority, and jobs in the high-QoS queue (35) have the highest priority.
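The classification and queuing rules of claim 2 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `QoS`, `classify` and `PreScheduler` are hypothetical, the completion time limit is assumed to be a duration in the same unit as the running time, and λ is left as a caller-supplied parameter `lam` because the bound printed for it in the claim is an empty range. Extraction follows the priority order high, low, none, with first-come, first-served order inside each queue.

```python
from collections import deque
from enum import IntEnum
from typing import Optional


class QoS(IntEnum):
    """Hypothetical QoS levels; a larger value means a higher priority."""
    NONE = 0
    LOW = 1
    HIGH = 2


def classify(time_limit: Optional[float], min_runtime: float, lam: float) -> QoS:
    """Classify a job from its completion time limit.

    time_limit  -- completion time limit, or None if the job has none
    min_runtime -- minimum predicted running time over the resource nodes
    lam         -- the coefficient lambda (assumed to be supplied by the caller)
    """
    if time_limit is None:
        return QoS.NONE
    if time_limit > (1 + lam) * min_runtime:
        return QoS.LOW      # generous time limit
    return QoS.HIGH         # tight time limit


class PreScheduler:
    """Three first-come, first-served queues, one per QoS level."""

    def __init__(self, lam: float) -> None:
        self.lam = lam
        self.queues = {level: deque() for level in QoS}

    def submit(self, job_id: str, time_limit: Optional[float],
               min_runtime: float) -> QoS:
        level = classify(time_limit, min_runtime, self.lam)
        self.queues[level].append(job_id)
        return level

    def extract(self) -> Optional[str]:
        """Return the oldest job of the highest non-empty priority level."""
        for level in sorted(QoS, reverse=True):
            if self.queues[level]:
                return self.queues[level].popleft()
        return None
```

In this sketch `min_runtime` stands in for the value that the job queuing module (32) would obtain from the job completion-time prediction module (5).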

3. The job scheduling system according to claim 1, characterized in that: the scheduling decision module (4) comprises a job extraction module (41), a fault-tolerance strategy module (42), a scheduling strategy module (43) and a resource-job matching module (44);

The job extraction module (41) extracts jobs from the no-QoS queue (33), the low-QoS queue (34) and the high-QoS queue (35) of the pre-scheduling module (3), in priority order of high QoS, low QoS and no QoS, and delivers the jobs to the resource-job matching module (44);

The fault-tolerance strategy module (42) provides flexible fault-tolerance strategies for job scheduling: for no-QoS jobs it provides a retry fault-tolerance strategy, that is, after a job fails, the job is rescheduled to run on another resource node; for low-QoS jobs it provides a primary-backup fault-tolerance strategy, in which the backup copy has an active running mode and a passive running mode; when the latest start time of the backup copy overlaps the expected completion time of the primary copy, the backup copy adopts the active running mode, and otherwise it adopts the passive running mode; for high-QoS jobs it provides a replication fault-tolerance strategy, in which the job is dispatched to two or more resource nodes to run simultaneously, ensuring that the job completes successfully within the time limit specified by the user; the fault-tolerance strategy module (42) accepts fault-tolerance decision requests from the resource-job matching module (44), and returns the corresponding fault-tolerance strategy to the resource-job matching module (44);

The scheduling strategy module (43) provides job scheduling with a scheduling strategy based on the reliability cost of job execution; it accepts scheduling decision requests from the resource-job matching module (44), and returns the corresponding scheduling strategy to the resource-job matching module (44); the reliability cost is the product of the expected running time of the job on a resource and the probability that the resource operates normally during the expected running time of the job;

The resource-job matching module (44) receives the jobs delivered by the job extraction module (41), and interacts with the job completion-time prediction module (5) to predict the completion time of each job; then, according to the characteristics of the job, it interacts with the resource information module (6) and the scheduling strategy module (43) to formulate the scheduling strategy of the job and select the optimal resource; it also interacts with the fault-tolerance strategy module (42) to formulate the fault-tolerance strategy of the job; finally, the resource-job matching module (44) dispatches the job to the corresponding resource node in the grid resource platform (7).
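The decisions recited in claim 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation. All names are hypothetical. The latest start time of the backup copy is assumed to be the completion time limit minus the backup copy's expected running time, and "overlap" is read as that latest start time falling before the primary copy's expected completion time. The claim defines the reliability cost as a product but does not state here how the optimal resource is chosen from it, so the sketch only computes the value.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical labels for the three fault-tolerance strategies."""
    RETRY = "retry"
    PRIMARY_BACKUP = "primary-backup"
    REPLICATION = "replication"


def select_strategy(qos_level: str) -> Strategy:
    """Map a QoS level ('none', 'low' or 'high') to its strategy."""
    table = {
        "none": Strategy.RETRY,           # reschedule on another node after failure
        "low": Strategy.PRIMARY_BACKUP,   # one primary copy plus one backup copy
        "high": Strategy.REPLICATION,     # run on two or more nodes at once
    }
    return table[qos_level]


def backup_mode(time_limit: float, backup_runtime: float,
                primary_expected_finish: float) -> str:
    """Choose the running mode of the backup copy of a low-QoS job.

    Assumption: the backup copy must start no later than
    time_limit - backup_runtime to finish in time. If that moment comes
    before the primary copy is expected to finish, the backup cannot wait
    for the primary's outcome and has to run actively.
    """
    latest_start = time_limit - backup_runtime
    return "active" if latest_start < primary_expected_finish else "passive"


def reliability_cost(expected_runtime: float, p_normal: float) -> float:
    """Product of the expected running time on a resource and the probability
    that the resource operates normally during that time."""
    if not 0.0 <= p_normal <= 1.0:
        raise ValueError("p_normal must be a probability")
    return expected_runtime * p_normal
```

For example, with a time limit of 100, a backup running time of 30 and a primary copy expected to finish at 60, the backup copy can wait until 70 and therefore stays passive in this reading.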

4. The job scheduling system according to claim 1, characterized in that: the resource information module (6) comprises a resource information detection module (61), a resource availability prediction module (62) and a database (63);

The resource information detection module (61) collects grid resource information and monitors changes in the grid resources; it monitors each grid resource node in the grid resource platform (7) and stores the periodically collected static and dynamic resource information in the database (63); in addition, if a resource joins or leaves, the resource information detection module (61) promptly notifies the resource availability prediction module (62);

The resource availability prediction module (62) predicts the future availability of resources; it builds a resource availability state space and a state-transition matrix based on a Markov prediction model; the resource availability prediction module (62) receives signals from the resource information detection module (61), interacts with the database (63) to update the state space and state-transition matrix of each resource node in a timely manner, and stores the updated results in the database (63);

The database (63) stores the performance information of the grid resource nodes and the availability state space and state-transition matrix of each resource node; it receives information from the resource information detection module (61) and the resource availability prediction module (62) and updates its contents accordingly; it also accepts information queries from the job completion-time prediction module (5) and from the resource-job matching module (44) in the scheduling decision module (4).
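The Markov-based availability prediction of claim 4 can be illustrated with a short Python sketch. The claim does not specify the state space or how the state-transition matrix is estimated, so the sketch assumes a hypothetical two-state space ("up", "down"), estimates the matrix by counting transitions in an observed state history, and propagates a state distribution forward by a given number of steps. Function names and the treatment of unobserved states are assumptions made for illustration only.

```python
from typing import List, Sequence

STATES = ("up", "down")  # hypothetical availability state space


def estimate_transition_matrix(history: Sequence[str],
                               states: Sequence[str] = STATES) -> List[List[float]]:
    """Estimate P[i][j] = P(next state is j | current state is i)
    from the observed sequence of states of one resource node."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    for current, nxt in zip(history, history[1:]):
        counts[index[current]][index[nxt]] += 1
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # Assumption: a state never left in the history is treated as absorbing.
            matrix.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            matrix.append([c / total for c in row])
    return matrix


def predict(distribution: Sequence[float], matrix: Sequence[Sequence[float]],
            steps: int) -> List[float]:
    """Propagate a state distribution `steps` transitions into the future."""
    dist = list(distribution)
    n = len(dist)
    for _ in range(steps):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


if __name__ == "__main__":
    observed = ["up", "up", "down", "up", "up", "up"]
    P = estimate_transition_matrix(observed)
    # Probability of each state two steps ahead, starting from "up".
    print(predict([1.0, 0.0], P, steps=2))
```

In such a sketch, the probability of the "up" state returned by `predict` could serve as the normal-operation probability used in the reliability cost, and the matrix would be re-estimated whenever the resource information detection module (61) reports a change.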