CN112559159A - Task scheduling method based on distributed deployment


Info

Publication number
CN112559159A
Authority
CN
China
Prior art keywords
execution
job
scheduling
task
host
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110000842.7A
Other languages
Chinese (zh)
Inventor
吴新学
翁庄明
彭本
林冬霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Sinobest Software Technology Co ltd
Original Assignee
Guangzhou Sinobest Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Sinobest Software Technology Co ltd filed Critical Guangzhou Sinobest Software Technology Co ltd
Priority to CN202110000842.7A
Publication of CN112559159A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a task scheduling method based on distributed deployment, used for the distributed scheduling and execution of data processing tasks in a big data platform. Its implementation steps are: define the execution priority, start time and execution period of each data processing job; according to the scheduling configuration information and historical scheduling information of the jobs, transmit the jobs to a priority job queue in execution order; according to the time range of the data a job processes, evenly decompose the job into several subtasks covering equal time ranges, and place the decomposed subtasks in a task queue; distribute the subtasks in the task queue to execution nodes that are idle and in good health for execution; and periodically query and check task execution results, summarizing the execution results of the subtasks per job to form the job execution result. The invention thus provides a task scheduling method based on distributed deployment that can process big data tasks in a distributed environment in an efficient, scalable and monitorable manner.

Description

Task scheduling method based on distributed deployment
Technical Field
The invention relates to the technical field of data processing, in particular to a task scheduling method based on distributed deployment.
Background
In the course of managing and applying big data, the data must be processed, and the corresponding processing tasks must be scheduled.
Because big data involves massive data volumes and its business indicators are becoming increasingly complex, ensuring the efficiency and stability of big data processing tasks has become a major challenge for task scheduling.
Current task scheduling technology mainly comprises single-machine timed scheduling programs and distributed task scheduling systems.
Because big data platforms mostly adopt a distributed architecture, a single-node scheduling mode is ill-suited to the complex task scenarios of a big data cluster and cannot scale flexibly to large data volumes and high-concurrency scenarios.
However, the existing distributed task scheduling system has the following disadvantages:
1. The tasks of a workflow can be executed only on a single working node and cannot be spread across multiple nodes of the cluster, so scheduling capability is limited and resource usage is unbalanced.
2. The scheduling host cannot be scaled horizontally, so a single point of failure exists. While the tasks of a big data cluster are running, a failure of a host in the cluster requires manual intervention, and until the fault is repaired the scheduling of big data tasks is interrupted, which may affect business progress.
3. The scheduling information of the cluster, the running status of tasks and the health of nodes are difficult to monitor and analyze.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a task scheduling method based on distributed deployment that solves the problems of the prior art, namely that workflow tasks cannot be executed on multiple nodes, the scheduling host cannot be scaled out and the scheduling state cannot be monitored, and that can process big data tasks in a distributed environment in an efficient, scalable and monitorable manner.
The technical idea of the invention is as follows: the coordination host acts as the coordinator of the distributed architecture and is responsible for monitoring the health of the scheduling hosts and the execution hosts and for completing the election of the highly available scheduling host; the scheduling host acts as the master node of the distributed architecture and is responsible for receiving data processing jobs and scheduling tasks onto the execution nodes; the execution host acts as a slave node of the distributed architecture and is responsible for receiving and executing the tasks distributed by the scheduling host, feeding the execution result back to the scheduling host once execution finishes; at least 2 scheduling hosts are established, of which one serves as the main instance that actually runs and provides task scheduling while one or more others serve as standby instances; according to the health of the scheduling hosts, the coordination host can elect an available standby instance and switch it to be the main instance; and the scheduling host can reschedule tasks whose last dispatch or execution failed.
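By way of illustration only, the sketch below shows how the scheduling-host election described above could be built on the coordination hosts using Apache Curator's LeaderLatch recipe; the patent does not name a client library, and the connection string, latch path and class name are assumptions.

    import java.net.InetAddress;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class SchedulerElection {
        public static void main(String[] args) throws Exception {
            // Connect to the coordination hosts (the ZooKeeper ensemble); addresses are placeholders.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Every scheduling host joins the same latch path; ZooKeeper elects exactly one leader.
            LeaderLatch latch = new LeaderLatch(client, "/scheduler/leader",
                    InetAddress.getLocalHost().getHostName());
            latch.addListener(new LeaderLatchListener() {
                @Override public void isLeader() {
                    // This instance is now the main scheduling instance and may start dispatching.
                    System.out.println("promoted to main scheduling instance");
                }
                @Override public void notLeader() {
                    // This instance is (or has become) a standby instance and must stop dispatching.
                    System.out.println("running as standby scheduling instance");
                }
            });
            latch.start();
            Thread.currentThread().join();  // keep the process alive; leadership changes arrive via the listener
        }
    }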
In order to achieve the purpose, the invention adopts the technical scheme that: a task scheduling method based on distributed deployment is provided, which is used for performing distributed scheduling and execution on data processing tasks in a big data platform.
The data processing tasks of the big data platform are data processing jobs, such as data acquisition, cleaning and transformation, created on the big data platform.
The task scheduling method based on distributed deployment comprises the following steps:
S1, scheduling and configuring data processing jobs according to user requirements, and defining the execution priority, start time and execution period of each data processing job (an illustrative data-model sketch follows these steps);
S2, according to the scheduling configuration information and the historical scheduling information of the jobs, transmitting the jobs to a priority job queue in execution order;
S3, evenly decomposing each job transmitted to the priority job queue into several subtasks covering equal time ranges, according to the time range of the data it processes, and placing the decomposed subtasks in the task queue;
S4, acquiring the busy status and health status of each execution node, distributing the subtasks in the task queue to execution nodes that are idle and in good health for execution, and, if no execution node currently meets these conditions, waiting until an idle execution node is released before distributing;
S5, periodically querying and checking task execution results, and summarizing the execution results of the subtasks per job to form the job execution result.
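For steps S1 to S5 above, a minimal Java data model of the scheduling configuration defined in S1 (priority, start time, execution period) and of the data time range used for decomposition in S3 might look as follows; the class and field names are illustrative, not taken from the patent.

    import java.time.Duration;
    import java.time.Instant;

    // Illustrative data model only: the patent names the attributes (priority, start time,
    // execution period, data time range) but not the classes or fields used below.
    public class JobConfig {
        public enum Priority { HIGHEST, HIGH, NORMAL, LOW, LOWEST }

        public final String jobId;
        public final Priority priority;          // execution priority (S1)
        public final Instant startTime;          // trigger time; "now or earlier" is treated as an instant job
        public final Duration executionPeriod;   // repeat interval for periodically executed jobs
        public final Instant dataRangeStart;     // time range of the data the job processes (used in S3)
        public final Instant dataRangeEnd;

        public JobConfig(String jobId, Priority priority, Instant startTime, Duration executionPeriod,
                         Instant dataRangeStart, Instant dataRangeEnd) {
            this.jobId = jobId;
            this.priority = priority;
            this.startTime = startTime;
            this.executionPeriod = executionPeriod;
            this.dataRangeStart = dataRangeStart;
            this.dataRangeEnd = dataRangeEnd;
        }

        // An instant job is one whose configured start time is not later than "now".
        public boolean isInstant() {
            return !startTime.isAfter(Instant.now());
        }
    }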
Further, the step S2 includes the following steps:
S21, judging, from the scheduling configuration information of the job, whether it is an instant job or a timed job; if it is an instant job, performing priority ordering immediately, and if it is a timed job, waiting until the job's execution time arrives before performing priority ordering;
S22, judging, from the historical scheduling information of the job, whether its last dispatch or last execution failed; if so, placing it at the head of the execution order and transmitting it to the priority job queue first, and if not, transmitting the jobs to the priority job queue in order of priority from high to low according to their scheduling configuration information.
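A minimal sketch of the ordering rule in S21 and S22, assuming the priority job queue is backed by java.util.concurrent.PriorityBlockingQueue and reusing the illustrative JobConfig above: jobs whose last dispatch or execution failed come first, and the rest are ordered by execution priority from high to low.

    import java.util.Comparator;
    import java.util.concurrent.PriorityBlockingQueue;

    // Hypothetical wrapper pairing a job with its historical scheduling information.
    class QueuedJob {
        final JobConfig config;
        final boolean failedLastTime;   // last dispatch failed OR last execution failed
        QueuedJob(JobConfig config, boolean failedLastTime) {
            this.config = config;
            this.failedLastTime = failedLastTime;
        }
    }

    class PriorityJobQueue {
        // Jobs that failed last time sort to the head; the rest sort by priority, highest first
        // (HIGHEST has the lowest enum ordinal, so natural enum order already means high-to-low).
        private final PriorityBlockingQueue<QueuedJob> queue = new PriorityBlockingQueue<>(
                11,
                Comparator.comparing((QueuedJob j) -> !j.failedLastTime)
                          .thenComparing(j -> j.config.priority));

        public void submit(QueuedJob job) { queue.put(job); }

        public QueuedJob take() throws InterruptedException { return queue.take(); }
    }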
Further, the step S5 includes the following steps:
S51, periodically querying and checking task execution results; if the execution result of a task has not been fed back for a long time, dispatching a stop-execution instruction to the corresponding execution node and marking the execution result of the job as failed, so that when the job is scheduled again its historical scheduling information records the last execution as failed.
And S52, after all the related subtasks decomposed by the single job are executed, summarizing the execution results of all the subtasks by taking the job as a unit to form a job execution result.
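Steps S51 and S52 could be realised along the lines of the sketch below: a periodic checker marks tasks whose result has not been fed back within a timeout as failed and issues a stop instruction, and a job is reported successful only when every one of its subtasks succeeded. The 30-minute timeout, the TaskState enum and the stopTask() hook are assumptions rather than values given in the patent.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class ResultMonitor {
        enum TaskState { RUNNING, SUCCESS, FAILED }

        static class TaskRecord {
            final String jobId;
            volatile TaskState state = TaskState.RUNNING;
            volatile Instant lastFeedback = Instant.now();
            TaskRecord(String jobId) { this.jobId = jobId; }
        }

        private final Map<String, TaskRecord> tasks = new ConcurrentHashMap<>();
        private final Duration timeout = Duration.ofMinutes(30);  // assumed "too long" threshold
        private final ScheduledExecutorService checker = Executors.newSingleThreadScheduledExecutor();

        void register(String taskId, String jobId) { tasks.put(taskId, new TaskRecord(jobId)); }

        void start() {
            checker.scheduleAtFixedRate(this::check, 1, 1, TimeUnit.MINUTES);  // periodic check (S51)
        }

        private void check() {
            for (Map.Entry<String, TaskRecord> e : tasks.entrySet()) {
                TaskRecord r = e.getValue();
                // No feedback for longer than the timeout: stop the task and fail it.
                if (r.state == TaskState.RUNNING
                        && Duration.between(r.lastFeedback, Instant.now()).compareTo(timeout) > 0) {
                    stopTask(e.getKey());        // dispatch a stop-execution instruction to the node
                    r.state = TaskState.FAILED;  // the owning job will be recorded as "last execution failed"
                }
            }
        }

        // S52: the job result is a success only if every one of its subtasks succeeded.
        boolean jobSucceeded(String jobId) {
            return tasks.values().stream()
                    .filter(r -> r.jobId.equals(jobId))
                    .allMatch(r -> r.state == TaskState.SUCCESS);
        }

        private void stopTask(String taskId) { /* placeholder: send the stop instruction to the Worker */ }
    }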
In step S1, the execution priority of the job includes, from high to low: highest, high, normal, low, lowest.
In step S2, the historical scheduling information records the job's past scheduling, including whether its last dispatch failed and whether its last execution failed.
In step S21, an instant job is a job whose scheduling configuration information defines the start time as immediate execution; a timed job is a job whose scheduling configuration information defines the start time as some other specified moment at which execution is triggered.
In step S4, the busy status of the execution node indicates whether the number of tasks currently executed by the execution node reaches a threshold, and if the number of tasks currently executed by the execution node reaches the threshold, the Worker is busy, and if the number of tasks does not reach the threshold or no task is currently executed, the Worker is idle.
In step S4, the health status of the executing node refers to whether the executing node is operating normally, if the executing node is operating normally without failure, the executing node is in a healthy good status, otherwise, the executing node is in a healthy abnormal status.
Further, in step S4, if the waiting time reaches a set threshold, all subtasks belonging to the current job are marked as dispatch failures, and when the job is rescheduled its historical scheduling information records the last dispatch as failed.
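The selection rule of step S4 (only idle execution nodes in good health; otherwise wait) might be expressed as in the sketch below; picking the least-loaded eligible Worker is an extra assumption, since the patent only requires that the node be idle and healthy.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Hypothetical snapshot of a Worker as reported by the coordination hosts.
    class WorkerStatus {
        final String workerId;
        final int runningTasks;          // tasks currently being executed
        final int concurrencyThreshold;  // busy when runningTasks reaches this value
        final boolean healthy;           // operating normally, no fault
        WorkerStatus(String workerId, int runningTasks, int concurrencyThreshold, boolean healthy) {
            this.workerId = workerId;
            this.runningTasks = runningTasks;
            this.concurrencyThreshold = concurrencyThreshold;
            this.healthy = healthy;
        }
        boolean idle() { return runningTasks < concurrencyThreshold; }
    }

    class WorkerSelector {
        // An empty result means no eligible node exists yet and the task keeps waiting (S4).
        Optional<WorkerStatus> select(List<WorkerStatus> workers) {
            return workers.stream()
                    .filter(w -> w.healthy && w.idle())
                    .min(Comparator.comparingInt(w -> w.runningTasks));
        }
    }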
In step S5, the execution result of a task includes its execution time, execution speed, task traffic and task running status; the execution result of a job includes its execution time, execution speed, job traffic and job running status.
Further, the task traffic is the amount of data processed by the task, and the job traffic is the amount of data processed by the job.
Further, the task running status indicates whether the task executed successfully or failed, and the job running status indicates whether the job executed successfully or failed.
On the basis of the above scheme, the health states of the scheduling hosts and execution hosts monitored by the coordination host include the CPU utilization of the host where each instance runs, JVM heap memory usage, the JVM thread count and GC time.
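These metrics map naturally onto the standard java.lang.management API, as the hypothetical sketch below shows; the patent does not specify how the metrics are collected, and the system load average is used here only as a stand-in for CPU utilization.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class HealthReport {
        public static String collect() {
            // System load average as a proxy for CPU utilization of the host where the instance runs.
            double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            int threads = ManagementFactory.getThreadMXBean().getThreadCount();
            long gcTimeMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcTimeMs += gc.getCollectionTime();  // cumulative GC time in milliseconds
            }
            return String.format("load=%.2f heapUsedMB=%d heapMaxMB=%d threads=%d gcTimeMs=%d",
                    load, heap.getUsed() >> 20, heap.getMax() >> 20, threads, gcTimeMs);
        }
    }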
On the basis of the scheme, the health state of the host machine monitored by the coordinating host machine, the scheduling information of the scheduling host machine and the job execution result of the scheduling host machine are uniformly transmitted to the big data platform, and the big data platform provides visual information display.
Further, the scheduling information of the scheduling host includes the dispatch status and queue status of jobs, i.e. the cumulative number of jobs already dispatched and the number of jobs still queued.
The task scheduling method based on distributed deployment has the advantages that: based on the distributed cluster environment, the task scheduling host can create 2 or more than 2 instances, establish a master-slave mechanism and prevent single-point failure; the data processing operation is decomposed into a plurality of subtasks which are distributed to a plurality of execution nodes to be executed concurrently, so that the operation execution efficiency is improved; the execution node can perform horizontal expansion according to the scale of data volume, the efficiency of data processing and the stability requirement, and has strong scalability; scheduling according to task priority can better meet the requirements of service attributes; the dispatching information of the cluster, the running condition of the tasks and the health state of the nodes can be monitored and analyzed, the system automatically completes the main-standby switching of the dispatching host and the reasonable allocation of the tasks among a plurality of executing nodes according to the monitoring and analyzing results, and the reliability of the task dispatching of the system is improved.
Drawings
The invention has the following drawings:
the drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a diagram of a distributed cluster deployment architecture for a distributed deployment-based task scheduling method according to the present invention;
FIG. 2 is a flowchart of a task scheduling method based on distributed deployment according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings. The detailed description, while indicating exemplary embodiments of the invention, is given by way of illustration only, in which various details of embodiments of the invention are included to assist understanding. Accordingly, it will be appreciated by those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment discloses a task scheduling method based on distributed deployment, which is used for realizing distributed scheduling and execution of data processing tasks in a big data platform. The task scheduling scenario of a data processing job in a big data platform is specifically taken as an embodiment for detailed description, so as to facilitate understanding for those skilled in the art.
Example 1
As shown in fig. 1, the distributed cluster deployment architecture of the task scheduling method based on distributed deployment in this embodiment comprises: 3 distributed coordination hosts (Zookeeper), 1 management center host (Menter), 2 dispatching center hosts (Ganger) and 3 execution engine working nodes (Worker); wherein:
The 3 Zookeeper hosts have a preset Zookeeper package installed and are responsible for coordinating the orderly work of the Ganger instances and Worker instances in the distributed cluster, monitoring the running state of the instances and the health of the hosts where they run, and completing the election of the highly available Ganger instance.
The 1 Menter host has a preset Menter package installed and serves as the comprehensive management center for task scheduling on the big data platform; it is responsible for managing all hosts of the distributed cluster and for creating Ganger instances and Worker instances, and it visually displays the running state of the instances and the health of their hosts as line charts.
The 2 Ganger hosts are used as carriers for Ganger instance operation, namely scheduling hosts.
The Worker working node is used as a carrier for running a Worker instance.
The Ganger instance refers to a virtual role which is created in Menter and can play a role in dispatching task scheduling, and is bound with a Ganger host.
The Worker instance refers to a virtual role which is created in the Menter and can play a role in task execution, and is bound with a Worker work node.
One of the 2 Ganger instances bound with the 2 Ganger hosts is used as a main instance which actually runs and provides task scheduling, and the other one is used as a standby instance, and the standby instance is automatically switched to the main instance under the condition that the health state of the main instance is abnormal.
The Ganger instance consists of a scheduling engine and a dispatching engine, wherein the scheduling engine is responsible for receiving the job created by the Menter, scheduling the job, decomposing the job to be executed immediately into subtasks and transmitting the subtasks to the dispatching engine, and the dispatching engine distributes the tasks to various Worker instances for execution.
The scheduling engine is composed of a job coordinator, a timed task scheduler and a job manager, wherein,
the job coordinator is used for receiving and judging whether the job created by the Menter is an instant job or a timing job, and deciding whether to transmit the job to a priority job queue or a timing task scheduler.
The timing task scheduler is used for receiving and monitoring whether the timing task is due and needs to be executed immediately, if so, the job is transmitted to the priority job queue, otherwise, the timing job is stored continuously until the timing job is due.
The job manager is used for receiving jobs sequentially transmitted according to priorities in the priority job queue, averagely decomposing the jobs into a plurality of subtasks in the same time range according to the time range of data to be processed of the jobs, and enabling the decomposed subtasks to enter the task queue to wait for the dispatching engine to dispatch and execute the subtasks.
The priority is defined by the job at creation of Menter, and comprises from high to low: highest, high, normal, low, lowest.
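As an illustration of the job manager's decomposition rule, the sketch below splits a job's data time range into one subtask per calendar month, matching the monthly split used in the worked example of Example 2; the patent itself only requires subtasks of equal time ranges, so the month granularity is an assumption.

    import java.time.YearMonth;
    import java.util.ArrayList;
    import java.util.List;

    public class JobSplitter {
        // Splits a job's data time range into one subtask per calendar month.
        public static List<String> splitByMonth(String jobId, YearMonth from, YearMonth to) {
            List<String> subtasks = new ArrayList<>();
            for (YearMonth m = from; !m.isAfter(to); m = m.plusMonths(1)) {
                subtasks.add(jobId + ":" + m);  // each subtask processes the data of a single month
            }
            return subtasks;
        }
    }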
Example 2
Based on the distributed cluster of the task scheduling method based on distributed deployment, which is set up in the foregoing embodiment 1, the embodiment performs distributed scheduling and execution on a data processing task created by a big data platform. As shown in fig. 2, in this embodiment, the task scheduling method based on distributed deployment includes the following steps:
S1, scheduling and configuring data processing jobs in the Menter according to user requirements, and defining the execution priority, start time and execution period of each data processing job;
S21, the job coordinator receives the job created by the Menter and judges, from its scheduling configuration information, whether it is an instant job or a timed job; if it is an instant job, priority ordering is performed immediately, and if it is a timed job, priority ordering is performed once the job's execution time arrives;
S22, judging, from the historical scheduling information of the job, whether its last dispatch or last execution failed; if so, it is placed at the head of the execution order and transmitted to the priority job queue first, and if not, the jobs are transmitted to the priority job queue in order of priority from high to low according to their scheduling configuration information.
S3, the jobs transmitted to the priority job queue enter the job manager in order of priority, i.e. jobs with higher priority come first and jobs with lower priority come later; the job manager evenly decomposes each job into several subtasks covering equal time ranges, according to the time range of the data the job processes, and the decomposed subtasks enter the task queue to wait for the dispatching engine to distribute and execute them.
S41, the dispatching engine acquires the running states and health states of all Worker instances from the Zookeeper;
S42, the dispatching engine distributes the tasks in the task queue to idle and healthy Workers for execution; if no Worker currently meets these conditions, it waits until an idle Worker is released and then distributes the tasks;
S51, the Worker is queried periodically to check task execution results; if the execution result of a task has not been fed back for a long time, a stop-execution instruction is dispatched to the corresponding execution node and the execution result of the job is marked as failed, so that when the job is rescheduled its historical scheduling information records the last execution as failed;
S52, after a task has been executed, the Worker feeds the task execution result back to the Ganger;
And S53, after all the related subtasks decomposed by the single job are executed, the Ganger summarizes the execution results of all the subtasks by taking the job as a unit to form a job execution result and feeds the job execution result back to the Menter.
For ease of understanding, on the basis of Example 1, suppose the Menter creates 4 jobs as follows:
Data acquisition job a: collect the attendance data of students for the past 12 months from the attendance system database into the big data platform database; priority highest, executed immediately;
Data acquisition job b: collect the score data of students for the past 12 months from the score management system database into the big data platform database; priority normal, executed immediately, and its last execution failed;
Data acquisition job c: collect the score data of students for the past 24 months from the score management system database into the big data platform database; priority normal, executed immediately;
Data conversion job d: convert the height unit in the student information management system database from meters to centimeters; priority normal, executed one week later.
The threshold for the number of concurrent tasks executed by the Worker instance is 3.
After the job coordinator in the Ganger instance's scheduling engine receives the 4 jobs, jobs a, b and c, being instant jobs, are placed directly into the priority job queue for priority ordering; job d is a timed job, so it is first handed to the timing task scheduler and only enters the priority job queue for ordering when its execution time arrives one week later.
Because job b is a job whose last execution failed, it is placed at the head of the queue, ahead of jobs a and c; b is transferred from the priority job queue to the job manager first, and a and c are then transferred from the priority job queue to the job manager in order of priority from high to low. The job manager splits job b into 24 subtasks by month, each of which extracts the score result data of one of the past 1 to 24 months.
At this point, if all 3 Workers are idle and healthy, i.e. the number of tasks each of the 3 Workers is concurrently executing does not exceed 3 and each is operating normally without faults, the dispatching engine distributes the 24 subtasks of b's decomposition evenly across the 3 Workers for execution; if only 2 Workers, or only 1, meet these conditions, the subtasks are distributed to the Workers that do. After a Worker finishes a task, it feeds the task's execution time, execution speed, task traffic and running status back to the Ganger. After the Ganger has gathered the execution results of the 24 subtasks, it feeds the final job execution result, per job, back to the Menter.
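Continuing this scenario, a hypothetical usage of the JobSplitter sketch from Example 1 makes the arithmetic explicit: 24 monthly subtasks spread evenly over 3 Workers gives 8 subtasks per Worker, of which at most 3 run concurrently under the stated threshold; the concrete year values below are placeholders.

    import java.time.YearMonth;
    import java.util.List;

    public class ScenarioB {
        public static void main(String[] args) {
            // 24 monthly subtasks for job b; the concrete years are placeholders.
            List<String> subtasks = JobSplitter.splitByMonth("job-b",
                    YearMonth.of(2019, 1), YearMonth.of(2020, 12));
            // Even distribution: 24 / 3 = 8 subtasks per Worker; with a concurrency threshold of 3,
            // each Worker executes at most 3 of its 8 subtasks at a time and queues the rest.
            System.out.println(subtasks.size() + " subtasks created");
        }
    }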
The health status of each instance, the scheduling information of the Ganger instances and the job execution results fed back by the Ganger can all be viewed in the Menter as line charts.
In conclusion, the task scheduling method based on distributed deployment can decompose the big data processing job into a plurality of subtasks to be distributed to a plurality of nodes for execution, and improves the task processing efficiency; the execution node can be expanded transversely, and can be well adapted to a task processing scene with high concurrency; the dispatching host can prevent single point of failure by establishing a main/standby mechanism; scheduling according to task priority can guarantee the timeliness requirement of the service; the task dispatching condition, the task execution condition and the instance running condition can be monitored and analyzed, so that a user can be helped to comprehensively master the task scheduling condition, and problems can be found and processed in time.
Those not described in detail in this specification are within the skill of the art.
The above-described embodiments are merely preferred embodiments of the present invention, which is not intended to limit the present invention in any way. Those skilled in the art can make many changes, modifications, and equivalents to the embodiments of the invention without departing from the scope of the invention as set forth in the claims below. Therefore, equivalent variations made according to the idea of the present invention should be covered within the protection scope of the present invention without departing from the contents of the technical solution of the present invention.

Claims (10)

1. A task scheduling method based on distributed deployment is characterized in that the method is applied to scheduling hosts and execution nodes in a distributed cluster environment, the distributed cluster environment further comprises at least 3 distributed coordination hosts, and the coordination hosts serve as coordinators of a distributed architecture and are responsible for monitoring the health states of the scheduling hosts and the execution hosts and completing election of high-availability scheduling hosts; the scheduling host is used as a main node of the distributed architecture and is responsible for receiving data processing operation and performing task scheduling on the execution node; the execution host is used as a slave node of the distributed architecture and is responsible for receiving and executing the tasks distributed by the scheduling host, and the execution result is fed back to the scheduling host after the execution is finished; at least 2 scheduling hosts are established, wherein one host serves as a main instance which actually runs and provides task scheduling, and one or more hosts serve as standby instances; the coordination host can select a high-availability standby instance to be switched into a main instance according to the health state of the scheduling host; the scheduling host can re-schedule tasks which are failed to be dispatched or executed last time, and the task scheduling method based on distributed deployment is realized by a coordination host, the scheduling host and an execution node in a distributed architecture, and comprises the following steps:
S1, scheduling and configuring data processing jobs according to user requirements, and defining the execution priority, start time and execution period of each data processing job;
S21, judging, from the scheduling configuration information of the job, whether it is an instant job or a timed job; if it is an instant job, performing priority ordering immediately, and if it is a timed job, waiting until the job's execution time arrives before performing priority ordering;
S22, judging, from the historical scheduling information of the job, whether its last dispatch or last execution failed; if so, placing it at the head of the execution order and transmitting it to the priority job queue first, and if not, transmitting the jobs to the priority job queue in order of priority from high to low according to their scheduling configuration information;
S3, evenly decomposing each job transmitted to the priority job queue into several subtasks covering equal time ranges, according to the time range of the data it processes, the decomposed subtasks entering the task queue;
S4, acquiring the busy status and health status of each execution node, distributing the subtasks in the task queue to execution nodes that are idle and in good health for execution, and, if no execution node currently meets these conditions, waiting until an idle execution node is released before distributing;
S51, periodically querying and checking task execution results; if the execution result of a task has not been fed back for a long time, dispatching a stop-execution instruction to the corresponding execution node and marking the execution result of the job as failed, so that when the job is scheduled again its historical scheduling information records the last execution as failed;
and S52, after all the subtasks into which a single job was decomposed have been executed, summarizing the execution results of the subtasks per job to form the job execution result.
2. The task scheduling method based on distributed deployment according to claim 1, wherein in step S2, the historical scheduling information records the job's past scheduling, including whether its last dispatch failed and whether its last execution failed.
3. The task scheduling method based on distributed deployment of claim 1, wherein in step S4, the busy status of the execution node means whether the number of tasks currently being executed by the execution node reaches a threshold, and if the threshold is reached, the Worker is in a busy state, and if the threshold is not reached or no task is currently executed, the Worker is in an idle state.
4. The task scheduling method based on distributed deployment of claim 1, wherein in step S4, the health status of the execution node refers to whether the execution node is operating normally, if the execution node is operating normally without failure, the execution node is in a healthy good status, otherwise, the execution node is in a healthy abnormal status.
5. The task scheduling method based on distributed deployment according to claim 1, wherein in step S4, if the waiting time reaches a set threshold, all subtasks belonging to the current job are marked as dispatch failures, and when the job is rescheduled its historical scheduling information records the last dispatch as failed.
6. The task scheduling method based on distributed deployment according to claim 1, wherein in step S5, the execution result of the task includes execution time, execution speed, task traffic, task running condition of the task; the execution result of the job comprises the execution time, the execution speed, the job flow and the job running condition of the job.
7. The task scheduling method based on distributed deployment according to claim 6, wherein the task running condition is that the task is executed successfully or unsuccessfully; the job running condition refers to that the job is executed successfully or unsuccessfully.
8. The task scheduling method based on distributed deployment of claim 1, wherein the health status of the scheduling host and the execution host monitored by the coordinating host comprises the CPU usage rate, JVM heap memory usage, JVM threads and GC time of the host where the instance is located.
9. The task scheduling method based on distributed deployment according to claim 1, wherein the health status of the host machine monitored by the coordinating host machine, the scheduling information of the scheduling host machine, and the job execution result of the scheduling host machine are uniformly transmitted to the big data platform, and the big data platform provides visual information display.
10. The task scheduling method based on distributed deployment according to claim 9, wherein the scheduling information of the scheduling host includes the dispatch status and queue status of jobs, i.e. the cumulative number of jobs already dispatched and the number of jobs still queued.
CN202110000842.7A 2021-01-05 2021-01-05 Task scheduling method based on distributed deployment Pending CN112559159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110000842.7A CN112559159A (en) 2021-01-05 2021-01-05 Task scheduling method based on distributed deployment

Publications (1)

Publication Number Publication Date
CN112559159A true CN112559159A (en) 2021-03-26

Family

ID=75035137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110000842.7A Pending CN112559159A (en) 2021-01-05 2021-01-05 Task scheduling method based on distributed deployment

Country Status (1)

Country Link
CN (1) CN112559159A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108260160A (en) * 2016-12-29 2018-07-06 普天信息技术有限公司 Dispatch the method and system of user
US20180300174A1 (en) * 2017-04-17 2018-10-18 Microsoft Technology Licensing, Llc Efficient queue management for cluster scheduling
CN108762896A (en) * 2018-03-26 2018-11-06 福建星瑞格软件有限公司 One kind being based on Hadoop cluster tasks dispatching method and computer equipment
CN110247954A (en) * 2019-05-15 2019-09-17 南京苏宁软件技术有限公司 A kind of dispatching method and system of distributed task scheduling
CN111708627A (en) * 2020-06-22 2020-09-25 中国平安财产保险股份有限公司 Task scheduling method and device based on distributed scheduling framework

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220427A (en) * 2021-04-15 2021-08-06 远景智能国际私人投资有限公司 Task scheduling method and device, computer equipment and storage medium
CN113326111A (en) * 2021-05-11 2021-08-31 山东浪潮科学研究院有限公司 Distributed monitoring and scheduling method for quantum computer cluster
CN113269631A (en) * 2021-05-28 2021-08-17 中国工商银行股份有限公司 Job processing method, device and system
CN113836219A (en) * 2021-08-10 2021-12-24 浙江中控技术股份有限公司 Distributed data transfer scheduling system and method
CN113741872A (en) * 2021-09-03 2021-12-03 上海新炬网络信息技术股份有限公司 Software application automatic publishing method based on job scheduling
CN113741872B (en) * 2021-09-03 2024-04-23 上海新炬网络信息技术股份有限公司 Automatic software application publishing method based on job scheduling
CN113821322A (en) * 2021-09-10 2021-12-21 浙江数新网络有限公司 Loosely-coupled distributed workflow coordination system and method
CN113986514A (en) * 2021-12-24 2022-01-28 飞狐信息技术(天津)有限公司 Task flow control method and device based on database deployment
CN113986514B (en) * 2021-12-24 2022-04-26 飞狐信息技术(天津)有限公司 Task flow control method and device based on database deployment

Similar Documents

Publication Publication Date Title
CN112559159A (en) Task scheduling method based on distributed deployment
CN107291547B (en) Task scheduling processing method, device and system
US6732139B1 (en) Method to distribute programs using remote java objects
CN110427252B (en) Task scheduling method, device and storage medium based on task dependency relationship
EP2503733B1 (en) Data collecting method, data collecting apparatus and network management device
US20110161959A1 (en) Batch Job Flow Management
CN107992362A (en) The method, apparatus and system of automated performance testing
CN106557369A (en) A kind of management method and system of multithreading
CN105159769A (en) Distributed job scheduling method suitable for heterogeneous computational capability cluster
CN111651865B (en) Event centralized emission type simulation execution method and system for parallel discrete events
CN101694633A (en) Equipment, method and system for dispatching of computer operation
CN111651864B (en) Event centralized emission type multi-heterogeneous time queue optimization simulation execution method and system
CN112579267A (en) Decentralized big data job flow scheduling method and device
CN112181621A (en) Task scheduling system, method, equipment and storage medium
EP3274828B1 (en) Methods and nodes for scheduling data processing
US20080256544A1 (en) Stateless task dispatch utility
CN116010064A (en) DAG job scheduling and cluster management method, system and device
CN111240819A (en) Dispatching task issuing system and method
CN113821322A (en) Loosely-coupled distributed workflow coordination system and method
CN111522630A (en) Method and system for executing planned tasks based on batch dispatching center
KR100590764B1 (en) Method for mass data processing through scheduler in multi processor system
CN113010277A (en) Multi-condition triggering automatic operation system and method based on automatic operation and maintenance
WO2021208240A1 (en) Pull mode and push mode combined resource management and job scheduling method and system, and medium
CN110908791B (en) Scheduling method, scheduling device and scheduling system
US8402465B2 (en) System tool placement in a multiprocessor computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20210326)