CN112698931B - Distributed scheduling system for cloud workflow - Google Patents

Publication number
CN112698931B
Authority
CN
China
Prior art keywords
cloud workflow
scheduler
cloud
workflow
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110033497.7A
Other languages
Chinese (zh)
Other versions
CN112698931A (en)
Inventor
李怡然
夏元清
杨立文
王冠
李亚兴
叶玲娟
单成刚
闫策
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110033497.7A priority Critical patent/CN112698931B/en
Publication of CN112698931A publication Critical patent/CN112698931A/en
Application granted granted Critical
Publication of CN112698931B publication Critical patent/CN112698931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a distributed scheduling system for cloud workflows. The system comprises a scheduler controller, a distributed cloud workflow scheduling module, and a cloud workflow state database module. The scheduler controller allows cloud workflows to be dynamically allocated and scheduled quickly. In the distributed cloud workflow scheduling module, the number of schedulers can be flexibly adjusted, so the system suits cloud workflow scheduling at different scales and can save user cost while improving cloud resource sharing efficiency. The cloud workflow state database module stores the cloud workflow state, so that when a fault occurs during cloud workflow execution the system remembers the breakpoint and the cloud workflow is re-executed from that breakpoint.

Description

Distributed scheduling system for cloud workflow
Technical Field
The invention relates to the field of cloud workflow scheduling, in particular to a distributed scheduling system for cloud workflow.
Background
With the development of cloud computing technology, computing resources and data resources can be shared and interconnected, and users' computing tasks can be customized according to complex business logic or the precedence dependencies among complex computing tasks, thereby forming a cloud workflow. When cloud computing first emerged, cloud workflow scheduling and computation were performed on virtual machine clusters and mainly targeted scientific cloud workflows with many interdependent computing tasks. With the development of container technology, the isolation, sharing, and ease of orchestration of cloud computing resources have improved, and multiple cloud workflows can be executed simultaneously in a heterogeneous cluster environment. However, existing cloud workflow scheduling systems cannot handle application environments in which large-scale cloud workflows arrive at the same time, mainly in two respects: (1) Most traditional cloud workflow scheduling systems schedule on virtual machines and cannot support cloud workflow scheduling in a container environment; (2) Existing container-based cloud workflow scheduling systems use a centralized scheduling mode, so the number of cloud workflows a scheduler can accept is limited by that scheduler's capacity. They therefore cannot meet the demand of users submitting large-scale cloud workflows simultaneously, which slows the scheduling system's response and can even bring it down.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a distributed scheduling system for cloud workflow.
In order to achieve the purpose, the invention provides the following scheme:
a cloud workflow distributed scheduling system, comprising: a scheduler controller, a distributed cloud workflow scheduling module and a cloud workflow state database module;
one end of the scheduler controller is connected with a user side, and the other end of the scheduler controller is connected with one end of the distributed cloud workflow scheduling module; the other end of the distributed cloud workflow scheduling module is connected with the cloud workflow state database module;
the scheduler controller is used for receiving the cloud workflow sent by the user side and distributing the received cloud workflow to the distributed cloud workflow scheduling module; the distributed cloud workflow scheduling module is used for dynamically scheduling and allocating the received cloud workflow.
Preferably, the distributed cloud workflow scheduling module comprises a plurality of schedulers;
the scheduler is respectively connected with the scheduler controller and the cloud workflow state database module;
the scheduler is used for releasing the ready cloud workflow tasks according to the dependency relationship among the cloud workflow tasks and distributing computing resources for the ready cloud workflow tasks.
Preferably, each of the schedulers includes: a monitoring unit, a storage unit, a cloud workflow analysis unit, a resource monitoring unit, a resource allocation unit, an updating unit and a resource calculation unit;
the monitoring unit, the cloud workflow analysis unit, the resource monitoring unit, the resource allocation unit and the updating unit are all connected with the storage unit; the resource calculation unit is connected with the resource allocation unit; the resource monitoring unit is connected with the cloud workflow state database module; the storage unit includes: a cloud workflow buffer, a task buffer, a cloud workflow task execution state linked list and a ready task pool;
the monitoring unit is used for receiving the cloud workflow file sent by the scheduler controller, numbering the cloud workflow in the received cloud workflow file with a snowflake algorithm, and storing the numbered cloud workflow in the cloud workflow buffer of the storage unit; the cloud workflow analysis unit reads the cloud workflow from the cloud workflow buffer, parses the cloud workflow file and stores the parsed result in the task buffer; the monitoring unit is also used for receiving the cloud workflow task execution result fed back by the scheduler controller, storing the result in the cloud workflow task execution state linked list, and updating the information stored in the cloud workflow state database module according to that result; the updating unit is used for moving the executable tasks in the cloud workflow buffer to the ready task pool according to the task states in the cloud workflow task execution state linked list; the resource calculation unit is used for calculating the total computing resources needed by the cloud workflow tasks in the ready task pool, sending that total to the resource allocation unit, receiving the total computing resources that the cloud platform can provide as returned by the resource allocation unit, allocating computing resources to the cloud workflow tasks according to their priorities, and then passing the resulting allocation scheme to the resource allocation unit.
Preferably, the system further comprises: a keep-alive signal module, a scheduler pressure evaluation module and a container resource allocation/monitoring module;
the keep-alive signal module, the scheduler pressure evaluation module and the container resource allocation/monitoring module are all connected with the scheduler controller and the scheduler;
the container resource allocation/monitoring module is used for creating a scheduler according to the control signal of the scheduler controller; the keep-alive signal module is used for sending keep-alive signals to the scheduler controller at regular intervals; the scheduler controller is used for judging the state of the scheduler according to the keep-alive signals; when the scheduler controller does not receive the keep-alive signal of a scheduler within a set time period, it determines that the scheduler is dead; the scheduler controller then kills that scheduler, notifies the container resource allocation/monitoring module to create a new scheduler, and reallocates the cloud workflow to the newly created scheduler;
the scheduler pressure evaluation module is used for acquiring the scheduler pressure; when the pressure of the scheduler is greater than a first set pressure threshold, the scheduler controller notifies the container resource allocation/monitoring module to create a new scheduler.
Preferably, the cloud workflow state database module stores a Redis database.
Preferably, when the number of schedulers in the system is zero, the scheduler controller is further configured to store the cloud workflow received from the user side and to allocate the cached cloud workflow to the first scheduler after the first scheduler is started and registered.
Preferably, when a certain scheduler in the distributed cloud workflow scheduling module crashes or the scheduler controller detects that the scheduler dies, the cloud workflow scheduling execution path is:
when the pressure of the scheduler is smaller than a second set pressure threshold, the scheduler controller allocates the cloud workflow to be allocated to the scheduler; if the pressure of the scheduler is greater than the first set pressure threshold, the scheduler controller notifies the container resource allocation/monitoring module to create a new scheduler, and the scheduler controller allocates the cloud workflow to the newly created scheduler.
Preferably, when the pressure of each scheduler in the distributed cloud workflow scheduling module is greater than the first set pressure threshold, the scheduler controller notifies the container resource allocation/monitoring module to create a new scheduler, and then the scheduler controller allocates a cloud workflow according to the pressure of each scheduler.
Preferably, when a scheduler with pressure lower than the second set pressure threshold exists in the distributed cloud workflow scheduling module, the scheduler controller kills the scheduler with pressure lower than the second set pressure threshold, and reclaims and redistributes the cloud workflow scheduled by the scheduler with pressure lower than the second set pressure threshold.
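The buffering behaviour described above for the zero-scheduler case can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the class name `SchedulerController`, its methods, and the choice of the least-pressured scheduler for new arrivals are assumptions made for the example.

```python
from collections import deque


class SchedulerController:
    """Hypothetical controller that buffers workflows while no scheduler exists."""

    def __init__(self):
        self.pressure = {}      # scheduler id -> last reported pressure
        self.assigned = {}      # scheduler id -> workflow ids handed to it
        self.pending = deque()  # workflows received while no scheduler is registered

    def submit(self, workflow_id):
        """Accept a workflow from the user side and return the chosen scheduler."""
        if not self.pressure:
            # No scheduler yet: keep the workflow until the first one registers.
            self.pending.append(workflow_id)
            return None
        # Assumed rule: pick the scheduler with the lowest reported pressure.
        target = min(self.pressure, key=self.pressure.get)
        self.assigned[target].append(workflow_id)
        return target

    def register(self, scheduler_id, pressure=0.0):
        """Register a started scheduler and flush any buffered workflows to it."""
        self.pressure[scheduler_id] = pressure
        self.assigned[scheduler_id] = []
        # The buffer is only non-empty when this is the first scheduler.
        while self.pending:
            self.assigned[scheduler_id].append(self.pending.popleft())
```

Workflows submitted before any scheduler exists stay in `pending` and are handed over, in arrival order, when the first scheduler registers.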
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
in the cloud workflow distributed scheduling system provided by the invention, the scheduler controller receives the cloud workflow sent by the user side and distributes it to the distributed cloud workflow scheduling module, which dynamically schedules and allocates the received cloud workflow, so that cloud resource sharing efficiency can be improved and user cost can be saved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is an architecture diagram of a distributed scheduling system for cloud workflow provided by the present invention;
FIG. 2 is an internal structure diagram of a distributed scheduling system for cloud workflows provided by the present invention;
FIG. 3 is a flowchart of an optimal execution path of a cloud workflow according to an embodiment of the present invention;
FIG. 4 is a schematic view of a life cycle of a cloud workflow provided in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a distributed scheduling system for cloud workflows, which can dynamically allocate and adjust the cloud workflows, thereby improving the cloud resource sharing efficiency and saving the user cost.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is an architecture diagram of a cloud workflow distributed scheduling system provided by the present invention, and FIG. 2 is an internal architecture diagram of the cloud workflow distributed scheduling system provided by the present invention. As shown in FIG. 1 and FIG. 2, the cloud workflow distributed scheduling system includes: a scheduler controller, a distributed cloud workflow scheduling module and a cloud workflow state database module.
One end of the scheduler controller is connected with the user side, and the other end of the scheduler controller is connected with one end of the distributed cloud workflow scheduling module. The other end of the distributed cloud workflow scheduling module is connected with the cloud workflow state database module. The scheduler controller is used for receiving the cloud workflow sent by the user side and distributing the received cloud workflow to the distributed cloud workflow scheduling module. The distributed cloud workflow scheduling module is used for dynamically scheduling and allocating the received cloud workflow.
The distributed cloud workflow scheduling module is the core module of the whole cloud workflow distributed scheduling system and is implemented in Java. Its main functions are: receiving the cloud workflows distributed by the scheduler controller, parsing cloud workflow dependencies, tracking the execution of cloud workflow tasks, updating the schedulable task set, allocating resources to schedulable tasks, and generating container resources in which the tasks execute.
In order to implement the above functions, the distributed cloud workflow scheduling module provided by the present invention preferably includes a plurality of schedulers. The scheduler is respectively connected with the scheduler controller and the cloud workflow state database module. The scheduler is used for releasing the ready cloud workflow tasks according to the dependency relationship among the cloud workflow tasks and distributing computing resources for the ready cloud workflow tasks.
Each scheduler includes: a monitoring unit, a storage unit, a cloud workflow analysis unit, a resource monitoring unit, a resource allocation unit, an updating unit and a resource calculation unit.
The monitoring unit, the cloud workflow analysis unit, the resource monitoring unit, the resource allocation unit and the updating unit are all connected with the storage unit. The resource calculation unit is connected with the resource allocation unit. The resource monitoring unit is connected with the cloud workflow state database module. The storage unit includes: a cloud workflow buffer, a task buffer, a cloud workflow task execution state linked list and a ready task pool.
The monitoring unit receives the cloud workflow files sent by the scheduler controller, numbers the cloud workflows in the received files with a snowflake algorithm, and stores the numbered cloud workflows in the cloud workflow buffer of the storage unit. The cloud workflow analysis unit reads the cloud workflow from the cloud workflow buffer, parses the cloud workflow file and stores the parsed result in the task buffer. The monitoring unit also receives the cloud workflow task execution result fed back by the scheduler controller, stores it in the cloud workflow task execution state linked list, and updates the information stored in the cloud workflow state database module according to that result. The updating unit moves the executable tasks in the cloud workflow buffer to the ready task pool according to the task states in the cloud workflow task execution state linked list. The resource calculation unit calculates the total computing resources needed by the cloud workflow tasks in the ready task pool, sends that total to the resource allocation unit, receives the total computing resources that the cloud platform can provide as returned by the resource allocation unit, allocates computing resources to each cloud workflow task according to task priority, and then passes the resulting allocation scheme to the resource allocation unit.
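The text names a snowflake algorithm for numbering workflows but gives no bit layout. The sketch below assumes the commonly used layout (millisecond timestamp, 10-bit worker id, 12-bit per-millisecond sequence) and an arbitrary epoch of 2021-01-01 UTC; the class name, parameters, and epoch are illustrative choices rather than details from the patent.

```python
import threading
import time


class SnowflakeGenerator:
    """Snowflake-style 64-bit id generator (layout and epoch are assumptions)."""

    EPOCH_MS = 1609459200000  # 2021-01-01T00:00:00Z, chosen arbitrarily
    WORKER_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock            # injectable millisecond clock
        self._lock = threading.Lock()  # ids may be requested from several threads
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                mask = (1 << self.SEQUENCE_BITS) - 1
                self._sequence = (self._sequence + 1) & mask
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            shift = self.WORKER_BITS + self.SEQUENCE_BITS
            return (((now - self.EPOCH_MS) << shift)
                    | (self.worker_id << self.SEQUENCE_BITS)
                    | self._sequence)
```

Giving each scheduler its own `worker_id` would keep workflow numbers unique across schedulers without any coordination between them, which fits a system in which schedulers are created and killed dynamically.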
When users submit only a small number of cloud workflows to the cloud platform, a single scheduler can handle cloud workflow scheduling for the whole platform. When the number of cloud workflows increases sharply or the scheduler pressure is high, however, the scheduler layer must be designed to scale. Therefore, in addition to the basic functions described above, the scheduling system preferably further includes: a keep-alive signal module, a scheduler pressure evaluation module and a container resource allocation/monitoring module.
The keep-alive signal module, the scheduler pressure evaluation module and the container resource allocation/monitoring module are all connected with the scheduler controller and the scheduler.
The container resource allocation/monitoring module creates a scheduler according to the control signal of the scheduler controller. The keep-alive signal module sends keep-alive signals to the scheduler controller at regular intervals. The scheduler controller judges the state of each scheduler from the keep-alive signals. When the scheduler controller does not receive a scheduler's keep-alive signal within a set time period, it judges that scheduler to be dead; it then kills the scheduler, notifies the container resource allocation/monitoring module to create a new scheduler, and reallocates the cloud workflow to the newly created scheduler.
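The keep-alive check can be illustrated with a small timeout tracker. This is a sketch under stated assumptions: the `KeepAliveMonitor` name, the `timeout_s` parameter, and the `replace` helper that moves workflows to a freshly created scheduler are hypothetical, and creating the new scheduler container itself is outside the example.

```python
import time


class KeepAliveMonitor:
    """Hypothetical tracker of the last keep-alive time of each scheduler."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # the "set time period" from the text
        self._clock = clock         # injectable clock, useful for testing
        self._last_seen = {}        # scheduler id -> time of last keep-alive

    def heartbeat(self, scheduler_id):
        """Record a keep-alive signal from a scheduler."""
        self._last_seen[scheduler_id] = self._clock()

    def dead_schedulers(self):
        """Return schedulers whose last keep-alive is older than the timeout."""
        now = self._clock()
        return sorted(s for s, seen in self._last_seen.items()
                      if now - seen > self.timeout_s)

    def replace(self, dead_id, new_id, assignments):
        """Forget the dead scheduler and move its workflows to the new one."""
        self._last_seen.pop(dead_id, None)
        self.heartbeat(new_id)
        assignments[new_id] = assignments.pop(dead_id, [])
```

In a deployment the controller would call `dead_schedulers()` periodically and, for each id returned, ask the container resource allocation/monitoring module for a replacement before calling `replace`.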
The scheduler pressure evaluation module obtains the scheduler pressure. When the pressure of a scheduler is greater than the first set pressure threshold, the scheduler controller notifies the container resource allocation/monitoring module to create a new scheduler.
The cloud workflow distributed scheduling system provided by the invention can elastically scale the number of distributed schedulers according to the number of cloud workflows submitted by users and the pressure on the schedulers. To achieve this elastic scaling, the system provides a keep-alive signal module, a scheduler pressure evaluation module and a cloud workflow state database that work together. The keep-alive signal module reports the health of each scheduler to the scheduler controller, so that the scheduler controller can judge whether each scheduler is working normally or is dead. If a scheduler is dead, the scheduler controller starts a new scheduler to take over its work. The scheduler pressure evaluation module evaluates the load of each scheduler in terms of the number of cloud workflows it holds and the number of tasks it is currently scheduling, and supports the following behaviour: the scheduler controller preferentially assigns newly arrived cloud workflows to the scheduler with the lower pressure. If the pressure of every scheduler in the system is high, the scheduler controller can start a new scheduler to share the load of the existing schedulers in the cluster. If the pressure of some schedulers is too low, the scheduler controller can kill them according to a certain rule, saving computing resources across the cloud platform.
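The two-threshold scaling rule can be sketched as below. The text does not define how pressure is computed or which rule decides which idle schedulers are killed, so the weighted sum in `pressure`, the default weights, and the "always keep one scheduler" rule in `plan_scaling` are assumptions made for illustration.

```python
def pressure(num_workflows, num_scheduling_tasks, wf_weight=1.0, task_weight=0.1):
    """Assumed pressure metric: weighted sum of workflows held and tasks in scheduling."""
    return wf_weight * num_workflows + task_weight * num_scheduling_tasks


def plan_scaling(pressures, high, low):
    """Return (create_new, to_kill) for a mapping of scheduler id -> pressure.

    create_new is True when every scheduler exceeds the first (high) threshold,
    or when there is no scheduler at all. to_kill lists schedulers below the
    second (low) threshold, always sparing at least one scheduler (assumption).
    """
    if not pressures:
        return True, []
    create_new = all(p > high for p in pressures.values())
    idle = sorted(s for s, p in pressures.items() if p < low)
    if len(idle) == len(pressures):
        idle = idle[1:]  # never kill the last remaining scheduler
    return create_new, idle


def pick_scheduler(pressures):
    """New workflows go to the scheduler with the lowest pressure."""
    return min(pressures, key=pressures.get)
```

For example, `plan_scaling({"s1": 0.5, "s2": 0.1}, high=0.8, low=0.2)` asks for no new scheduler and marks `s2` for removal, after which the workflows of `s2` would be reclaimed and redistributed.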
The cloud workflow state database stores the original information and the execution state of each cloud workflow. When a scheduler dies or is killed by the controller, the cloud workflows it managed are redistributed within the distributed scheduling system, and the system ensures that each of them resumes scheduling from its breakpoint, which saves cloud computing resources and scheduling cost.
With the invention, when large-scale cloud workflows arrive at the same time, the scheduler controller can quickly allocate them to the distributed cloud workflow scheduling module for scheduling and execution, and the specific scheduling path of each cloud workflow can be flexibly adjusted through the distributed cloud workflow scheduling module, so the system suits cloud workflow scheduling at different scales. That is, when the cloud workflow scale is large, schedulers are added so that the system can meet the demand for simultaneous scheduling, and when the scale is small, schedulers are removed to save cloud computing resources. The container resource allocation/monitoring module is the interface between the scheduling system and the container orchestration system, and by building different APIs the scheduling system can be adapted quickly to different cluster container orchestration systems. The cloud workflow state database module stores the cloud workflow state, so that when a fault occurs during cloud workflow execution the system remembers the breakpoint and the cloud workflow is re-executed from that breakpoint.
In order to schedule large-scale cloud workflows in a cluster simultaneously, the system, based on the five functional units in the scheduler provided by the invention, implements five threads in Java in each scheduler of the distributed scheduling module. Together they receive the workflows distributed by the controller, parse workflow dependencies, track the execution of cloud workflow tasks, update the schedulable task set, allocate resources to schedulable tasks, and generate container resources for task execution.
Four buffers are used: a cloud workflow receiving buffer, a cloud workflow task pool, a cloud workflow task execution state linked list and a ready cloud workflow task pool. They store, respectively, the cloud workflows received from the controller that are awaiting parsing, the parsed cloud workflow tasks with their precedence dependencies, the execution status of cloud workflow tasks, and the tasks whose predecessor tasks have all completed. As shown in FIG. 2, specifically:
Thread one: listens to the scheduler controller and receives the YAML file of the cloud workflow that the scheduler controller sends to the scheduler. After the scheduler receives the cloud workflow, it numbers the cloud workflow with the snowflake algorithm and stores it in the cloud workflow receiving buffer.
Thread two: the cloud workflow analysis unit takes the cloud workflow out of the cloud workflow buffer, parses the YAML file of the cloud workflow into tasks with dependency relationships, and stores the tasks in the task buffer.
Thread three: listens to the container resource allocation/monitoring module, receives the cloud workflow task execution result {succeeded, failed} fed back by the scheduler controller, stores the result in the cloud workflow task execution state linked list, and updates the cloud workflow task execution status in the cloud workflow state database.
Thread four: moves the executable tasks in the task buffer to the ready task pool according to the tasks marked as succeeded in the task execution state linked list.
Thread five: computes the total computing resources required by the cloud workflow tasks in the ready task pool and sends the total to the resource allocation unit, receives the total computing resources that the cloud platform can provide as returned by the resource allocation unit, allocates computing resources to the tasks according to cloud workflow task priority, and finally passes the resulting allocation scheme to the container resource allocation/monitoring module.
In this process, the five threads in the scheduler have no priority differences and run in parallel, each driven by the number of tasks waiting in its buffer or by the data arriving on the port it listens to.
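The work of thread four, releasing tasks whose predecessors have all succeeded, can be sketched as follows. The example is in Python for brevity rather than the Java the text mentions, and the function name `release_ready` and the plain dict and set data structures are assumptions standing in for the task buffer, the execution state linked list, and the ready task pool.

```python
def release_ready(tasks, succeeded, released):
    """Return tasks that became ready and mark them as released.

    tasks:     dict mapping task id -> set of parent task ids (task buffer)
    succeeded: set of task ids whose execution result was "succeeded"
    released:  set of task ids already moved to the ready pool (mutated here)
    """
    ready = [task_id for task_id, parents in tasks.items()
             if task_id not in released and parents <= succeeded]
    released.update(ready)
    return ready


if __name__ == "__main__":
    # Diamond-shaped workflow: a -> (b, c) -> d
    tasks = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
    succeeded, released = set(), set()
    print(release_ready(tasks, succeeded, released))  # entry task only
    succeeded.add("a")
    print(release_ready(tasks, succeeded, released))  # both children of a
```

A failed parent never enters `succeeded`, so its descendants are simply not released; how failed tasks are retried is not specified in the text and is left out of the sketch.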
It should be noted that, throughout the scheduling process, the scheduler only releases ready cloud workflow tasks according to the precedence dependencies among cloud workflow tasks and allocates computing resources to them according to a certain algorithm. The system is not responsible for orchestrating cluster containers; that work is handled by the corresponding cluster orchestration system, such as Kubernetes, Swarm or YARN. The system can therefore be deployed quickly into most existing cloud-native container services and is compatible with cross-cluster cloud-native services. For a different container orchestration system, only the corresponding API of the resource allocation and monitoring module needs to be changed.
The cloud workflow state database stored in the cloud workflow state database module preferably uses Redis, a database that offers high-speed I/O for small data volumes. After the scheduler receives a cloud workflow, it assigns the workflow a unique number with the snowflake algorithm and creates a record in the cloud workflow state database. The cloud workflow state database is divided into a part A and a part B. Part A stores workflow-level information: {cloud workflow number, number of tasks, execution state}. Part B stores task-level information: {cloud workflow number, task number, task execution state}. Part A is updated in real time by thread one, and part B is updated in real time by thread three. The cloud workflow state database ensures that, after a scheduler dies or is killed, the controller can reassign any cloud workflow that has not finished executing to another scheduler, and that scheduler can continue scheduling the workflow's tasks from the breakpoint until the whole cloud workflow has finished executing.
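The two-part state layout and the breakpoint lookup can be sketched with an in-memory stand-in. No Redis client is used here: the `WorkflowStateStore` class, the plain dictionaries, the `"pending"` and `"running"` state names, and the rule that flips the workflow-level state once every task has succeeded are all assumptions for illustration.

```python
class WorkflowStateStore:
    """In-memory stand-in for the two-part state database (not a Redis client)."""

    def __init__(self):
        self.part_a = {}  # workflow id -> {"task_count": int, "state": str}
        self.part_b = {}  # (workflow id, task id) -> task execution state

    def register_workflow(self, wf_id, task_ids):
        """Create the workflow-level record and one task-level record per task."""
        self.part_a[wf_id] = {"task_count": len(task_ids), "state": "running"}
        for task_id in task_ids:
            self.part_b[(wf_id, task_id)] = "pending"  # assumed initial state

    def record_result(self, wf_id, task_id, result):
        """Store a task result ("succeeded" or "failed")."""
        self.part_b[(wf_id, task_id)] = result
        states = [s for (w, _), s in self.part_b.items() if w == wf_id]
        if all(s == "succeeded" for s in states):
            # Assumed rule for updating the workflow-level execution state.
            self.part_a[wf_id]["state"] = "succeeded"

    def remaining_tasks(self, wf_id):
        """Tasks a replacement scheduler still has to schedule (the breakpoint)."""
        return [t for (w, t), s in self.part_b.items()
                if w == wf_id and s != "succeeded"]
```

A scheduler that takes over a workflow would call `remaining_tasks` and schedule only those tasks, which is what lets execution resume from the breakpoint instead of restarting the whole workflow.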
Based on this, the overall working principle of the distributed scheduling system for cloud workflows provided by the invention is as follows: the scheduler controller receives a cloud workflow sent by a user and allocates it, according to a certain rule, to one scheduler in the distributed cloud workflow scheduling module for task decomposition and scheduling; the scheduler and the container resource allocation/monitoring module read the resource data of the cloud platform to determine how to schedule the cloud workflow tasks; the cloud workflow state database is used to store cloud workflow information; and the container resource allocation/monitoring module is the interface between the cloud workflow distributed scheduling system and the cloud platform.
The following describes a specific execution path of each cloud workflow to be executed in detail based on the working principle of the cloud workflow distributed scheduling system provided by the present invention.
After a user submits a cloud workflow to a cloud platform, according to different conditions of the pressure of a scheduler, the cloud workflow can experience the following execution paths:
(1) Optimal case execution path, as shown in FIG. 3:
step1, submitting the cloud workflow to a cloud platform by a user.
step2, the scheduler controller receives the cloud workflow xml file and distributes it to a scheduler in the distributed cloud workflow scheduling module according to a certain rule.
step3, after receiving the yaml file of the cloud workflow, the scheduler uniquely numbers the cloud workflow and stores it in the cloud workflow receiving cache area and in part A of the cloud workflow state database as {cloud workflow number, number of tasks, execution state}.
step4, the cloud workflow parser takes the cloud workflow out of the cloud workflow receiving cache area, parses it into cloud workflow tasks with precedence dependency relationships, each of the form {parent node set, child node set, {current task information}}, and stores the tasks in the cloud workflow task pool and in part B of the cloud workflow state database as {cloud workflow number, task number, task execution state}.
step5, the task state tracking unit receives the cloud workflow task execution result {succeeded, failed}, stores it in the cloud workflow task execution state linked list, and at the same time updates the state in part B of the cloud workflow state database.
step6, the ready cloud workflow task pool is updated according to the current cloud workflow task execution state linked list and the contents of the cloud workflow task pool.
step7, the total amount of computing resources required by the tasks in the ready task pool is computed and requested from the resource allocation module. The resource allocation module computes the total remaining resources of the cloud platform and the future resource demand and, through a certain algorithm, returns the computing resources it can allocate to the current scheduler. The scheduler allocates these resources to the cloud workflow tasks according to task priority and sends the allocation scheme to the cloud platform, which starts containers to execute the tasks.
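Steps 6 and 7 above can be illustrated with a small sketch. It assumes that a task becomes ready once every parent has succeeded, and it uses a simple greedy allocation in descending priority; the patent only says "a certain algorithm", so this stands in for it. The `Task` fields, the CPU unit, and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    parents: set = field(default_factory=set)
    cpu: float = 1.0   # requested CPU cores (hypothetical resource unit)
    priority: int = 0  # larger value means more urgent (assumption)


def ready_tasks(task_pool, states):
    """Return tasks with no recorded state whose parents have all succeeded."""
    return [
        task for task in task_pool.values()
        if states.get(task.task_id) is None
        and all(states.get(parent) == "succeeded" for parent in task.parents)
    ]


def total_demand(ready):
    """Total resources the scheduler would request for the ready task pool."""
    return sum(task.cpu for task in ready)


def allocate(ready, granted_cpu):
    """Greedily give granted CPU to ready tasks in descending priority order."""
    plan = {}
    remaining = granted_cpu
    for task in sorted(ready, key=lambda t: (-t.priority, t.task_id)):
        if task.cpu <= remaining:
            plan[task.task_id] = task.cpu
            remaining -= task.cpu
    return plan
```

For a diamond-shaped workflow in which `b` and `c` depend on `a`, only `a` is ready at first; once `a` is recorded as succeeded, both `b` and `c` become ready, and the one with the higher priority is served first if the granted resources do not cover both.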
(2) When a certain scheduler in the distributed cloud workflow scheduling module crashes or the scheduler controller detects that the scheduler is dead, the cloud workflow scheduling execution path is as follows:
When the pressure of a scheduler is less than the second set pressure threshold, the scheduler controller allocates the cloud workflow to be allocated to that scheduler. If the pressure of the scheduler is greater than the first set pressure threshold, the scheduler controller informs the container resource allocation/monitoring module to create a new scheduler, and the scheduler controller allocates the cloud workflow to the newly created scheduler.
That is, when a scheduler crashes under excessive pressure or the controller detects that a scheduler has died, the cloud workflow execution path is as follows. The pressure of all schedulers in the current system is evaluated. If a scheduler under low pressure exists, the scheduler controller assigns the unfinished cloud workflow to that existing scheduler. If all schedulers in the current system are under high pressure, a new scheduler is started and the scheduler controller allocates the cloud workflow to it. Inside the scheduler, the execution path is: receive the cloud workflow and read the state that the cloud workflow has reached. If the state is 1, the scheduler reads the execution status directly from the cloud workflow state database, takes out the data of the breakpoint tasks that have not been executed, and stores it in the cloud workflow task buffer area; subsequent execution follows the optimal-case execution path. If the state is 0, the scheduler performs the subsequent parsing and scheduling directly according to the optimal-case execution path.
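The breakpoint-resume branch inside the scheduler can be sketched as below. The sketch assumes that state 1 means the workflow was already parsed and partly executed, that state 0 means it has not been parsed yet, and that any task not recorded as succeeded counts as unexecuted. The dictionary layout and the function name `tasks_to_resume` are hypothetical.

```python
def tasks_to_resume(workflow_id, part_a, part_b):
    """Return unfinished task numbers for a state-1 workflow, else None.

    part_a maps workflow number -> {"task_count": int, "state": 0 or 1}.
    part_b maps (workflow number, task number) -> task execution state.
    None tells the caller to follow the normal parse-and-schedule path.
    """
    record = part_a.get(workflow_id)
    if record is None or record["state"] == 0:
        return None
    return sorted(
        task_id
        for (wf_id, task_id), state in part_b.items()
        if wf_id == workflow_id and state != "succeeded"
    )
```

The returned task numbers would be loaded into the task buffer area, after which scheduling proceeds as in the optimal case.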
(3) When the pressure of each scheduler in the distributed cloud workflow scheduling module is greater than a first set pressure threshold, the scheduler controller informs the container resource allocation/monitoring module to newly establish a scheduler, and then the scheduler controller allocates the cloud workflow according to the pressure of each scheduler.
That is, when the pressure of every scheduler in the system is high, the scheduler pressure evaluation module informs the scheduler controller to start a new scheduler so as to balance the pressure across schedulers, and the scheduler controller subsequently allocates cloud workflows according to the pressure of all schedulers in the system.
(4) When a scheduler whose pressure is less than the second set pressure threshold exists in the distributed cloud workflow scheduling module, the scheduler controller kills that scheduler, and reclaims and redistributes the cloud workflows it was scheduling.
That is, when the pressure of some schedulers in the system is low, the scheduler pressure evaluation module informs the scheduler controller, the scheduler controller kills some schedulers according to a certain rule, the cloud workflows scheduled by the killed schedulers are reclaimed and redistributed, and the receiving schedulers schedule and execute them according to case (2).
(5) When the number of schedulers in the system is zero, the scheduler controller is further configured to store the cloud workflow received from the user side and to allocate the cached cloud workflow to the first scheduler after the first scheduler is started and registered.
That is, at a cold start the number of schedulers in the system is 0. After the system receives a cloud workflow, the scheduler controller temporarily stores it, starts and registers the first scheduler in the system, then allocates the temporarily stored cloud workflow to that scheduler, which carries out the subsequent scheduling work according to case (1).
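The cold-start behaviour in case (5) amounts to a buffer in the controller that is drained when the first scheduler registers. The toy class below is only a sketch under assumptions: the class and method names are hypothetical, and picking the scheduler with the fewest workflows stands in for the unspecified allocation rule.

```python
from collections import deque


class SchedulerController:
    """Toy controller that buffers workflows while no scheduler is registered."""

    def __init__(self):
        self.buffered = deque()  # workflows received during cold start
        self.assigned = {}       # scheduler name -> list of workflows

    def submit(self, workflow):
        """Return the chosen scheduler, or None if the workflow was buffered."""
        if not self.assigned:
            self.buffered.append(workflow)
            return None
        # Assumption: choose the scheduler currently holding the fewest workflows.
        name = min(self.assigned, key=lambda n: (len(self.assigned[n]), n))
        self.assigned[name].append(workflow)
        return name

    def register(self, name):
        """Register a started scheduler; the first one receives the buffer."""
        first = not self.assigned
        self.assigned[name] = []
        if first:
            while self.buffered:
                self.assigned[name].append(self.buffered.popleft())
```

A `None` result from `submit` signals to the caller that a scheduler still has to be started before the workflow can be scheduled.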
The first set pressure threshold and the second set pressure threshold adopted in the invention are set manually according to the actual usage scenario.
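The scale-up and scale-down rules of cases (3) and (4) can be summarised in a short decision function. This is a sketch under assumptions: pressure is treated as a number where larger means busier, `high` and `low` stand for the first and second set pressure thresholds, and the rule of never killing the last remaining scheduler is an added safeguard that the text does not state.

```python
def plan_scaling(pressures, high, low):
    """Decide whether to start a scheduler and which schedulers to kill.

    pressures maps scheduler name -> current pressure.
    """
    if low >= high:
        raise ValueError("the second threshold must be below the first")
    if not pressures:
        # Cold start: no scheduler exists yet.
        return {"start_new": True, "kill": []}
    if all(p > high for p in pressures.values()):
        # Case (3): every scheduler is above the first threshold.
        return {"start_new": True, "kill": []}
    # Case (4): schedulers below the second threshold, least loaded first.
    idle = sorted(
        (name for name, p in pressures.items() if p < low),
        key=lambda name: (pressures[name], name),
    )
    if len(idle) == len(pressures):
        idle = idle[:-1]  # assumption: keep at least one scheduler alive
    return {"start_new": False, "kill": idle}
```

Workflows held by any scheduler in the `kill` list would then be reclaimed and redistributed as described in case (4).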
In summary, the full life cycle of a cloud workflow is shown in FIG. 4. After a user submits a cloud workflow, the scheduler controller allocates it to a scheduler in the distributed cloud workflow scheduling system. While the cloud workflow waits in the cloud workflow cache area of the scheduler to be decomposed, it is in the pending state. The decomposed cloud workflow tasks are held in the task cache area in the consumed state and are executed in the order given by the cloud workflow dependency relationships. Once all predecessor tasks of a decomposed task have finished executing, the task enters the ready task pool and is in the ready state. The scheduler allocates resources to tasks in the ready state and creates containers to execute them, at which point they enter the running state. When a task finishes executing, its state is returned to the task execution state linked list and the task is in the inner-finished state. The ready task pool is continuously updated, and its tasks executed, according to the completed tasks and the cloud workflow structure until all tasks of the whole cloud workflow have executed successfully. The cloud workflow is then in the finished state, and the scheduling and execution of its tasks are complete.
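The task-level part of this life cycle is a linear state machine, which can be sketched as follows. The state names are taken from the text; the enum, the transition table, and the helper names are hypothetical, and failure handling is left out for brevity.

```python
from enum import Enum


class TaskState(Enum):
    CONSUMED = "consumed"
    READY = "ready"
    RUNNING = "running"
    INNER_FINISHED = "inner-finished"


# Each task state has at most one successor in the life cycle described above.
NEXT_STATE = {
    TaskState.CONSUMED: TaskState.READY,
    TaskState.READY: TaskState.RUNNING,
    TaskState.RUNNING: TaskState.INNER_FINISHED,
}


def advance(state):
    """Move a task to its next life-cycle state."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} is a terminal state")
    return NEXT_STATE[state]


def workflow_finished(task_states):
    """A workflow is finished once it has tasks and all are inner-finished."""
    states = list(task_states)
    return bool(states) and all(s is TaskState.INNER_FINISHED for s in states)
```

A workflow whose tasks have not yet been produced by the parser has an empty task list and is therefore not reported as finished, which corresponds to the pending state.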
In summary, the distributed scheduling system for cloud workflows provided by the present invention has the following advantages over the prior art:
1. The system can schedule cloud workflows in a container environment and can support cloud workflow scheduling through the APIs of different container orchestration environments.
2. When a plurality of users submit large-scale cloud workflows simultaneously, the cloud workflows can be scheduled according to the QoS requirements of the users and the precedence dependency relationships of the cloud workflow tasks.
3. When large-scale cloud workflows arrive at the same time, the number of the schedulers in the system can be dynamically adjusted according to the pressure of the schedulers, so that the cloud workflows are guaranteed to be processed in time, and the cloud computing resources occupied by the system can be reduced as far as possible under the condition that the system pressure is low.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (5)

1. A cloud workflow distributed scheduling system, comprising: the system comprises a scheduler controller, a distributed cloud workflow scheduling module, a cloud workflow state database module, a keep-alive signal module, a scheduler pressure evaluation module and a container resource allocation/monitoring module;
one end of the scheduler controller is connected with a user side, and the other end of the scheduler controller is connected with one end of the distributed cloud workflow scheduling module; the other end of the distributed cloud workflow scheduling module is connected with the cloud workflow state database module;
the scheduler controller is used for receiving the cloud workflow sent by the user side and distributing the received cloud workflow to the distributed cloud workflow scheduling module; the distributed cloud workflow scheduling module is used for dynamically scheduling and allocating the received cloud workflow;
the distributed cloud workflow scheduling module comprises a plurality of schedulers;
the scheduler is respectively connected with the scheduler controller and the cloud workflow state database module;
the scheduler is used for releasing the ready cloud workflow tasks according to the dependency relationship among the cloud workflow tasks and distributing computing resources for the ready cloud workflow tasks;
the keep-alive signal module, the scheduler pressure evaluation module and the container resource allocation/monitoring module are all connected with the scheduler controller and the scheduler;
the container resource allocation/monitoring module is used for establishing a scheduler according to the control signal of the scheduler controller; the keep-alive signal module is used for sending keep-alive signals to the scheduler controller at regular time; the scheduler controller is used for judging the state of the scheduler according to the keep-alive signals; when the scheduler controller does not receive the keep-alive signal of the scheduler in a set time period, determining that the scheduler is dead, and at the moment, killing the scheduler by the scheduler controller, informing a container resource allocation/monitoring module to newly establish a scheduler, and then reallocating the cloud workflow to the newly established scheduler;
the scheduler pressure evaluation module is used for acquiring scheduler pressure; when the pressure of the scheduler is greater than a first set pressure threshold value, the scheduler controller informs a container resource allocation/monitoring module to establish a new scheduler;
when the pressure of each scheduler in the distributed cloud workflow scheduling module is greater than a first set pressure threshold, the scheduler controller informs the container resource allocation/monitoring module to newly establish a scheduler, and then the scheduler controller allocates the cloud workflow according to the pressure of each scheduler;
when a scheduler with the pressure smaller than a second set pressure threshold value exists in the distributed cloud workflow scheduling module, the scheduler controller kills the scheduler with the pressure smaller than the second set pressure threshold value, and reclaims and redistributes the cloud workflow scheduled by the scheduler with the pressure smaller than the second set pressure threshold value.
2. The cloud workflow distributed scheduling system of claim 1 wherein each of the schedulers comprises: the system comprises a monitoring unit, a storage unit, a cloud workflow analysis unit, a resource monitoring unit, a resource allocation unit, an updating unit and a resource calculation unit;
the monitoring unit, the cloud workflow analysis unit, the resource monitoring unit, the resource allocation unit and the updating unit are all connected with the storage unit; the resource calculation unit is connected with the resource allocation unit; the resource monitoring unit is connected with the cloud workflow state database module; the storage unit includes: a cloud workflow buffer area, a task buffer area, a cloud workflow task execution state linked list and a ready task pool;
the monitoring unit is used for receiving the cloud workflow file sent by the scheduler controller, numbering the cloud workflow in the received cloud workflow file by adopting a snowflake algorithm, and storing the numbered cloud workflow in a cloud workflow cache region of the storage unit; the cloud workflow analyzing unit reads the cloud workflow in the cloud workflow cache region, analyzes the cloud workflow file and stores the analyzed cloud workflow file into the task cache region; the monitoring unit is used for receiving a cloud workflow task execution result fed back by the scheduler controller, storing the cloud workflow task execution result into the cloud workflow task execution state linked list, and meanwhile updating information stored in the cloud workflow state database module according to the cloud workflow task execution result; the updating unit is used for updating the executable task in the cloud workflow cache region to the ready task pool according to the task state in the cloud workflow task execution state linked list; the resource computing unit is used for computing the total computing resources needed by the cloud workflow tasks in the ready task pool, sending the total computing resources to the resource allocation unit, receiving the total computing resources which can be provided by the cloud platform and are returned by the resource allocation unit, and transmitting the allocation scheme result to the resource allocation unit after allocating the computing resources to each cloud workflow task according to the priority of the cloud workflow tasks.
3. The cloud workflow distributed scheduling system of claim 1 wherein a redis database is stored in the cloud workflow status database module.
4. The cloud workflow distributed scheduling system of claim 1 wherein the scheduler controller is further configured to store the cloud workflow received from the client when the number of schedulers in the system is zero and to assign the cached cloud workflow to a first scheduler after the first scheduler is started and registered.
5. The cloud workflow distributed scheduling system of claim 1 wherein when a scheduler of the distributed cloud workflow scheduling module crashes or the scheduler controller detects that the scheduler dies, the cloud workflow scheduling execution path is:
when the pressure of the scheduler is smaller than a second set pressure threshold value, the scheduler controller allocates the cloud workflow to be allocated to the scheduler; if the pressure of the scheduler is greater than the first set pressure threshold, the scheduler controller informs the container resource allocation/monitoring module to newly establish a scheduler, and the scheduler controller allocates the cloud workflow to the newly established scheduler.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110033497.7A CN112698931B (en) 2021-01-12 2021-01-12 Distributed scheduling system for cloud workflow

Publications (2)

Publication Number Publication Date
CN112698931A CN112698931A (en) 2021-04-23
CN112698931B true CN112698931B (en) 2022-11-11

Family

ID=75513942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110033497.7A Active CN112698931B (en) 2021-01-12 2021-01-12 Distributed scheduling system for cloud workflow

Country Status (1)

Country Link
CN (1) CN112698931B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434268A (en) * 2021-06-09 2021-09-24 北方工业大学 Workflow distributed scheduling management system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271256A (en) * 2018-09-27 2019-01-25 浪潮软件股份有限公司 A kind of cloud resource management and monitoring system and method based on distributed deployment
CN110737485A (en) * 2019-09-29 2020-01-31 武汉海昌信息技术有限公司 workflow configuration system and method based on cloud architecture
CN111026890A (en) * 2019-11-28 2020-04-17 天脉聚源(杭州)传媒科技有限公司 Picture data storage method, system, device and storage medium based on index table

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107547596B (en) * 2016-06-27 2022-01-25 中兴通讯股份有限公司 Cloud platform control method and device based on Docker
CN106878389B (en) * 2017-01-04 2020-02-07 北京百度网讯科技有限公司 Method and device for resource scheduling in cloud system
CN106850589B (en) * 2017-01-11 2020-08-18 杨立群 Method for managing and controlling operation of cloud computing terminal and cloud server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant