CN111209101B

CN111209101B - Big data calculation task multi-dependency scheduling system

Info

Publication number: CN111209101B
Application number: CN202010008151.7A
Authority: CN
Inventors: 黄胜
Original assignee: Shenzhen Coship Electronics Co Ltd
Current assignee: Shenzhen Coship Electronics Co Ltd
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2023-05-02
Anticipated expiration: 2040-01-06
Also published as: CN111209101A

Abstract

The invention discloses a multi-dependency scheduling system for big data computing tasks, comprising a user side, a Web visualization module, a task template generation module, an actual task generation module, a task dependency solving module, a scheduling optimization calculation module, an actual task scheduling module and a big data computing platform. Through simple configuration of task parameters, the invention automatically handles complex task dependency relationships, allocates cluster computing resources reasonably, and effectively tracks the computing process and its results. The invention greatly simplifies the scheduling management of big data computing tasks with complex dependency relationships, improves the utilization of cluster computing resources, strengthens state management of executing tasks, and reduces both the difficulty of use and the likelihood of task errors.

Description

Big data calculation task multi-dependency scheduling system

Technical Field

The invention belongs to the technical field of computers, and relates to a big data computing task multi-dependency scheduling system.

Background

With the rapid development of big data technology, storing and computing over massive amounts of offline data is no longer difficult. The most mainstream solution is the Hadoop distributed system, whose core is the distributed file system HDFS and the unified resource management and scheduling system YARN, supplemented by the Spark in-memory computing engine. However, as computing processes and the dependency relationships between them grow more complex, current big data scheduling systems face several difficulties: how to simplify the management of increasingly complex computing dependencies, how to accurately track the state of the task scheduling process, and how to complete the scheduling of big data computing tasks accurately and efficiently.

Many approaches to scheduling big data computing tasks exist at present: the program Crontab and the class library Quartz, which are oriented toward single-machine execution; the open-source distributed scheduling systems Oozie and Azkaban; and scheduling software that other companies have built by wrapping open-source components. However, they have the following problems: 1. they cannot handle dependencies at all, or can handle only simple sequential dependencies, so implementing complex dependencies requires substantial development effort to wrap and modify the software, and the result is still limited by the capabilities of that software; 2. when many tasks are scheduled densely, cluster resources cannot be allocated reasonably, so cluster computing resources become skewed and congestion arises during scheduling; 3. the mode of operation is unfriendly, the learning cost is high, and development and scheduling efficiency is low; 4. there is no task resource management function, so task resources must be managed entirely by the user. Therefore, there is a need for a big data computing task multi-dependency scheduling system.

Disclosure of Invention

To overcome the defects of the prior art, a big data computing task multi-dependency scheduling system is provided.

The invention is realized by the following scheme:

The system comprises a user side, a Web visualization module, a task template generation module, an actual task generation module, a task dependency solving module, a scheduling optimization calculation module, an actual task scheduling module and a big data computing platform.

The Web visualization module provides a simple, easy-to-understand Web interface for task management; it supports task state management and cluster resource management, and is used to create tasks and to fill in or modify task parameters.

The task template generation module verifies and stores the task parameters filled in or modified in the Web visualization module and generates a task template.

The actual task generation module checks the task templates according to the set execution time, generates an actual task from each matching task template, organizes the task parameters into an execution command that can be submitted directly to the computing cluster, and stores that command.

The task dependency solving module performs bidirectional dependency resolution on the actual tasks.

The scheduling optimization calculation module adjusts the execution order of the tasks to be scheduled whose dependencies have been resolved.

The actual task scheduling module receives the execution order submitted by the scheduling optimization calculation module and submits the tasks to the big data computing platform to run. According to the result returned by the big data computing platform, it decides whether an actual task goes to the task dependency solving module or back to the scheduling optimization calculation module, and it returns the execution result to the Web visualization module.

The task template generation module supports setting a task priority to adjust the task execution order, and supports setting a tolerance delay time.

The tolerance delay time is used for run timeout warnings and for post-run result evaluation.

The bidirectional dependency resolution comprises dependency resolution for each newly generated task and batch dependency resolution for each completed task.

The invention has the following beneficial effects:

1. The big data computing task multi-dependency scheduling system provides a quick way to submit big data computing tasks with a low learning cost. It resolves task dependencies semi-automatically, so the user only needs to consider the dependencies of the current task, which simplifies establishing dependency relationships and reduces the likelihood of errors caused by complex dependencies.

2. The dependency resolution in the system can handle complex multi-dependency relationships, and tasks are abstracted into templates, which enables template-based management of task resources.

3. Through the scheduling optimization calculation module, the system can handle tasks to which the user has assigned different priorities and different degrees of urgency. This strengthens the user's control over the task execution order while automatically balancing resources, so that computing resources are fully utilized and do not become skewed.

4. By judging before execution whether a task has exceeded its preset time, the system can issue a timely reminder about the dependency resolution process; by analyzing historical data after execution, it estimates the task's next execution time in order to optimize the scheduling order, further optimizing the task execution order.

Drawings

FIG. 1 is a block flow diagram of the big data computing task multi-dependency scheduling system of the present invention.

Detailed Description

The invention is further illustrated below in connection with specific examples:

The Web visualization module provides a simple, easy-to-understand Web interface for task management; it supports task state management and cluster resource management, and is used to create tasks and to fill in or modify task parameters.

The Web visualization module is the user-friendly operation interface of the scheduling system and manages tasks as a whole. Its functions include, but are not limited to, creating, modifying, deleting, starting and stopping tasks; viewing and modifying real-time task states; querying and compiling statistics on historical tasks; and viewing the cluster load state.

The task template generation module verifies and stores the task parameters filled in or modified in the Web visualization module and generates a task template. It supports setting a task priority to adjust the task execution order, and supports setting a tolerance delay time, which is used for run timeout warnings and for post-run result evaluation. The module works together with the Web visualization module: the task parameters filled in or modified there are verified and stored, and the corresponding task template is generated for use by the subsequent actual task generation module.

A task template mainly comprises the task basic information, the task execution file, the task execution period, the task execution parameters and the task dependency template. The task basic information includes the task name, the creator, the creation time and similar items. For the task execution file, the integrity of the executable file is checked and the file is stored in HDFS. The task execution period comprises the task execution time, the execution period, the task priority and the tolerance delay time; it supports task periods at various time intervals, a task priority for adjusting the task execution order, and a tolerance delay time for run timeout warnings and post-run result evaluation.

The task execution parameters are the additional parameters required by the task, and they may contain variables. Variables filled in by the scheduling system, such as the task execution time, the time interval and the submitting user, are supported, as are custom variables passed in by the user.

The task dependency template takes the form "dependent task name + time dependency expression". The dependent task name is selected from the tasks already configured; the time dependency expression sets the dependency relationship between periodic tasks; and a periodic task may also be set to depend on itself at a different execution time.
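As an illustration only, a task template and its dependency template might be modeled as below. The patent does not define the syntax of the time dependency expression, so this sketch assumes the simplest possible form, a fixed time offset relative to the execution time of the depending task; the class names, field names and default values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class DependencyTemplate:
    """One dependency entry: dependent task name + time dependency expression."""
    task_name: str
    # Assumed form of the time dependency expression: an offset from the
    # execution time of the depending task (timedelta(days=-1) = previous day).
    offset: timedelta = timedelta(0)

    def resolve(self, execution_time: datetime) -> Tuple[str, datetime]:
        """Return the concrete (task name, execution time) that must complete."""
        return (self.task_name, execution_time + self.offset)


@dataclass
class TaskTemplate:
    """Hypothetical subset of the template contents described in the text."""
    name: str
    period: timedelta
    priority: int = 0
    tolerance_delay: timedelta = timedelta(hours=1)
    dependencies: List[DependencyTemplate] = field(default_factory=list)

    def concrete_dependencies(self, execution_time: datetime) -> List[Tuple[str, datetime]]:
        """Expand every dependency template for one concrete execution time."""
        return [dep.resolve(execution_time) for dep in self.dependencies]


daily_report = TaskTemplate(
    name="daily_report",
    period=timedelta(days=1),
    dependencies=[
        DependencyTemplate("clean_logs"),  # same-day run of another task
        DependencyTemplate("daily_report", timedelta(days=-1)),  # self-dependency
    ],
)
deps = daily_report.concrete_dependencies(datetime(2020, 1, 6))
```

With this representation the user describes only the direct dependencies of the current task; the concrete task instances to wait for are derived from the execution time.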

The actual task generation module checks the task templates according to the set execution time, generates an actual task from each matching task template, organizes the task parameters into an execution command that can be submitted directly to the computing cluster, and stores that command. Its main role is to turn the templates produced by the task template generation module into the corresponding actual tasks at the set execution time.

The specific flow of the actual task generation module is as follows:

(1) The task templates are checked against the actual time to obtain the templates whose execution time has arrived;

(2) If the task execution parameters contain variables, variable substitution is performed on them; if the task has dependency relationships, all of its dependencies are read according to the actual time;

(3) The actual execution command of the task is assembled, and the task's other information, its execution command, its dependency relationships and so on are stored as the actual task;

(4) The actual task generation module runs periodically: after one pass completes, it waits for the next checking time and then checks the templates again, forming a loop.
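Steps (1) to (3) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary layout of a template, the `${exec_time}` and `${user}` variable names, and the choice of `spark-submit` as the launcher are all assumptions made for the example. Variable substitution uses Python's standard `string.Template`.

```python
import shlex
from datetime import datetime
from string import Template


def due_templates(templates, now):
    """Step (1): keep the templates whose next execution time has arrived."""
    return [t for t in templates if t["next_run"] <= now]


def build_command(executable, raw_params, variables):
    """Steps (2)-(3): substitute variables, then assemble the execution command."""
    args = [Template(p).substitute(variables) for p in raw_params]
    # 'spark-submit <application> [arguments]' is an assumed launcher form.
    return " ".join(shlex.quote(part) for part in ["spark-submit", executable, *args])


def generate_actual_task(template, now, user):
    """Step (3): store command, dependencies and other information as the actual task."""
    variables = {
        "exec_time": now.strftime("%Y-%m-%d"),  # system-filled variable (assumed name)
        "user": user,                           # system-filled variable (assumed name)
        **template.get("custom_vars", {}),      # user-defined variables
    }
    return {
        "name": template["name"],
        "exec_time": now,
        "command": build_command(template["executable"], template["params"], variables),
        "dependencies": list(template.get("dependencies", [])),
    }


template = {
    "name": "daily_report",
    "executable": "hdfs:///jobs/report.jar",  # hypothetical path
    "params": ["--date", "${exec_time}", "--owner", "${user}"],
    "next_run": datetime(2020, 1, 6),
}
now = datetime(2020, 1, 6, 0, 5)
tasks = [generate_actual_task(t, now, "alice") for t in due_templates([template], now)]
```

Step (4), the periodic re-check, would wrap this in a loop that sleeps until the next checking time; it is omitted to keep the sketch short.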

The task dependency solving module performs bidirectional dependency resolution on the actual tasks, that is, dependency resolution for each newly generated task and batch dependency resolution for each completed task. Dependencies are thus resolved from two directions: top-down, every newly generated task has its dependencies resolved; bottom-up, every completed task triggers batch resolution of the tasks that depend on it. This bidirectional approach keeps dependency checking efficient, so dependencies can be resolved easily even when tasks are numerous and their dependency relationships are complex.

The task dependency solving module works in the following steps:

(1) Top-down dependency resolution is performed first: after the actual task generation module generates an actual task, the task dependency solving module matches all of the task's dependencies against the completed tasks to determine which dependencies are already resolved;

(2) It is judged whether all dependencies are resolved; if so, the task is delivered to the scheduling optimization calculation module; otherwise the task waits and its dependencies are resolved again later;

(3) Bottom-up dependency resolution is performed after a task executes successfully: among all tasks with unresolved dependencies, every task that depends on the successful task has that dependency resolved in a batch, and the operation in (2) is then applied to those tasks.
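The two directions can be sketched with a pair of lookup tables. This is one possible way to realize the steps above, assuming each actual task has a hashable identifier such as a (name, execution time) pair; the class and attribute names are hypothetical.

```python
from collections import defaultdict


class DependencyResolver:
    """Minimal sketch of bidirectional dependency resolution."""

    def __init__(self):
        self.completed = set()           # ids of successfully executed tasks
        self.unresolved = {}             # task id -> dependency ids still missing
        self.waiters = defaultdict(set)  # dependency id -> task ids waiting on it
        self.ready = []                  # tasks handed to scheduling optimization

    def add_task(self, task_id, dependencies):
        """Top-down: match a new task's dependencies against completed tasks."""
        missing = set(dependencies) - self.completed
        if not missing:
            self.ready.append(task_id)   # all dependencies already resolved
            return
        self.unresolved[task_id] = missing
        for dep in missing:
            self.waiters[dep].add(task_id)

    def task_succeeded(self, task_id):
        """Bottom-up: batch-resolve every waiting task that depends on task_id."""
        self.completed.add(task_id)
        for waiter in sorted(self.waiters.pop(task_id, ())):
            missing = self.unresolved[waiter]
            missing.discard(task_id)
            if not missing:              # step (2): fully resolved, hand it on
                del self.unresolved[waiter]
                self.ready.append(waiter)


resolver = DependencyResolver()
resolver.add_task("extract", [])
resolver.add_task("report", ["extract", "transform"])
resolver.add_task("transform", ["extract"])
resolver.task_succeeded("extract")    # releases "transform" only
resolver.task_succeeded("transform")  # releases "report"
```

Because each completed task touches only the tasks registered as waiting on it, the cost of a completion is proportional to the number of direct dependents rather than to the total number of pending tasks.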

The scheduling optimization calculation module adjusts the execution order of the tasks to be scheduled whose dependencies have been resolved. Taking into account the task priority, the degree of time urgency, the historical task execution time, the number of tasks currently being scheduled, the estimated resource occupation of each task, the remaining cluster resources and so on, it selects a suitable number of tasks and an execution order, and passes the tasks in that order to the actual task scheduling module for scheduling.

The specific flow of the scheduling optimization calculation module is as follows:

(1) For all tasks to be scheduled whose dependencies have been resolved, the urgency of each task is calculated first; the tasks are then sorted by priority, with urgency as the secondary sort key, to obtain a basic execution order;

(2) After the remaining cluster computing resources are obtained, the computing resources are balanced: tasks for which resources are insufficient are skipped, and tasks are submitted to the actual task scheduling module in order until the remaining resources fall below the minimum task submission parameter;

(3) After tasks are submitted, the cluster resources are checked periodically to judge whether more tasks can be submitted; if so, the tasks to be scheduled are pulled again and re-sorted, forming a loop.
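Steps (1) and (2) can be sketched as a sort followed by a greedy pass over the remaining resources. The patent does not give a formula for urgency or a unit for resources, so this sketch assumes urgency is the slack left before the task's deadline after subtracting its expected runtime, and measures resources in CPU cores; all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PendingTask:
    name: str
    priority: int                # assumption: larger value = higher priority
    deadline: datetime           # execution time + tolerance delay time
    expected_runtime: timedelta  # e.g. estimated from historical executions
    cores: int                   # estimated resource occupation


def slack(task, now):
    """Assumed urgency measure: time to spare before the deadline (less = more urgent)."""
    return (task.deadline - now) - task.expected_runtime


def select_tasks(pending, free_cores, min_cores, now):
    """Sort by priority, then urgency; submit greedily while resources last."""
    ordered = sorted(pending, key=lambda t: (-t.priority, slack(t, now)))
    selected = []
    for task in ordered:
        if free_cores < min_cores:   # below the minimum task submission parameter
            break
        if task.cores > free_cores:  # insufficient resources: skip this task
            continue
        selected.append(task)
        free_cores -= task.cores
    return selected


now = datetime(2020, 1, 6)
hour, half = timedelta(hours=1), timedelta(minutes=30)
pending = [
    PendingTask("A", 1, now + 2 * hour, half, cores=4),
    PendingTask("B", 2, now + 5 * hour, hour, cores=8),
    PendingTask("C", 2, now + hour, half, cores=2),
    PendingTask("D", 1, now + hour, half, cores=1),
]
chosen = [t.name for t in select_tasks(pending, free_cores=8, min_cores=1, now=now)]
```

Here task B has high priority but needs more cores than remain after C is placed, so it is skipped in this round and smaller tasks fill the cluster; B would be reconsidered on the next periodic check described in step (3).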

The actual task scheduling module receives the execution order submitted by the scheduling optimization calculation module and submits the tasks to the big data computing platform to run. According to the result returned by the big data computing platform, it decides whether an actual task goes to the task dependency solving module or back to the scheduling optimization calculation module, and it returns the execution result to the Web visualization module. The module is responsible for running scheduled tasks, monitoring task execution states, managing task execution logs, retrying failed tasks and so on.

The actual task scheduling module works in the following steps:

(1) The actual task scheduling module receives a task submitted by the scheduling optimization calculation module and submits it to the big data computing platform to run;

(2) The run result is judged: if the task succeeded, it goes to the task dependency solving module for batch dependency resolution; if it failed, whether it enters the retry flow is decided according to whether the task is configured to retry;

(3) If the task needs to be retried, the state of the failed task is modified on a timer according to the retry interval, and the task is put back into the to-be-scheduled queue of the scheduling optimization calculation module.
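The routing decision in steps (2) and (3) can be sketched as below. The state names, the `max_retries` field and the callback parameters are assumptions for illustration; in particular, the timer that waits for the retry interval before requeueing is left out, and `requeue` is called immediately.

```python
from dataclasses import dataclass


@dataclass
class ActualTask:
    name: str
    max_retries: int = 0   # assumed retry setting; 0 means "do not retry"
    attempts: int = 0
    state: str = "PENDING"


def run_task(task, submit, on_success, requeue):
    """Submit a task once and route it according to the run result.

    submit(task) stands in for running on the big data computing platform and
    returns True on success; on_success hands the task to batch dependency
    resolution; requeue puts it back in the to-be-scheduled queue.
    """
    task.attempts += 1
    task.state = "RUNNING"
    if submit(task):
        task.state = "SUCCESS"
        on_success(task)
    elif task.attempts <= task.max_retries:
        task.state = "WAITING_RETRY"
        requeue(task)      # a real scheduler would wait for the retry interval first
    else:
        task.state = "FAILED"
    return task.state


outcomes = iter([False, True])  # first attempt fails, second succeeds
resolved, queue = [], []
job = ActualTask("daily_report", max_retries=1)
first = run_task(job, lambda t: next(outcomes), resolved.append, queue.append)
second = run_task(job, lambda t: next(outcomes), resolved.append, queue.append)
```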

The specific working process of the invention comprises the following steps:

First, a task is created through the Web visualization module and its task parameters are filled in. Whether the task is to be executed directly is judged according to the relevant rules of the task template generation module: if so, the task is delivered to the scheduling optimization calculation module; if not, it is submitted to the task template generation module.

Second, the task template is generated. After the task parameters are filled in, the task enters the task template generation module. If an abnormality is detected while the template is being generated, the failure reason is reported and passed to the Web visualization module so that the task can be modified; if no abnormality is detected, the task enters the actual task generation module.

Third, the actual task is generated. If an abnormality is detected while the actual task is being generated, the failure reason is reported and passed to the Web visualization module so that the task can be modified; if no abnormality is detected, the task enters the task dependency solving module.

Fourth, task dependencies are resolved. If the tolerance delay time is found to have been reached during dependency resolution, the Web visualization module is notified that the task has timed out, but the task continues resolving its dependencies until all of them are resolved, and then proceeds to the next step.

Fifth, scheduling is optimized. The scheduling optimization calculation module receives the tasks submitted for direct execution and the tasks submitted by the task dependency solving module, optimizes the scheduling order, and then proceeds to the next step.

Sixth, the task is actually scheduled. The actual task scheduling module submits the task to the big data computing platform for execution and returns the execution result to the Web visualization module.

Seventh, the process ends.
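The two uses of the tolerance delay time in this process, the timeout warning raised during dependency resolution and the result evaluation after execution, can be sketched as two small checks. The function names, the result labels, and the assumption that the tolerance window is measured from the task's scheduled execution time are all illustrative choices, not taken from the patent.

```python
from datetime import datetime, timedelta


def needs_timeout_warning(exec_time, tolerance, now, already_warned=False):
    """True when a task still waiting on dependencies should trigger a warning.

    The task is not cancelled; it keeps waiting for its dependencies.
    """
    return (not already_warned) and now >= exec_time + tolerance


def evaluate_run(exec_time, tolerance, finished_at):
    """Post-run evaluation: did the task finish inside its tolerance window?"""
    return "ON_TIME" if finished_at <= exec_time + tolerance else "OVERDUE"


scheduled = datetime(2020, 1, 6, 2, 0)
tolerance = timedelta(hours=1)
warn_early = needs_timeout_warning(scheduled, tolerance, datetime(2020, 1, 6, 2, 30))
warn_late = needs_timeout_warning(scheduled, tolerance, datetime(2020, 1, 6, 3, 15))
verdict = evaluate_run(scheduled, tolerance, datetime(2020, 1, 6, 3, 40))
```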

The key points of the big data computing task multi-dependency scheduling system of the invention are: a process of establishing dependency relationships between templates by abstracting tasks into templates and then setting time dependency expressions; a bidirectional dependency resolution method, in which a task's dependencies are scanned and resolved when the task is newly created, and all dependencies on a task are resolved when that task completes; a semi-automatic dependency handling technique, in which dependency processing is carried out automatically so that the user only needs to attend to the dependencies of the current task; a scheduling optimization process that dynamically adjusts the task execution order according to priority, task urgency, remaining cluster resources, historical task execution statistics and the like; and the use of the overdue time to provide task dependency warnings before execution and an evaluation of the execution result after execution.

While the invention has been described and illustrated in considerable detail, it should be understood that modifications and equivalents to the above-described embodiments will become apparent to those skilled in the art, and that such modifications and improvements may be made without departing from the spirit of the invention.

Claims

1. A big data computing task multi-dependency scheduling system, characterized in that: the scheduling system comprises a user side, a Web visualization module, a task template generation module, an actual task generation module, a task dependency solving module, a scheduling optimization calculation module, an actual task scheduling module and a big data computing platform;

2. The big data computing task multi-dependency scheduling system of claim 1, wherein: the task template generation module supports setting a task priority to adjust the task execution order, and supports setting a tolerance delay time.

3. The big data computing task multi-dependency scheduling system of claim 2, wherein: the tolerance delay time is used for run timeout warnings and for post-run result evaluation.

4. The big data computing task multi-dependency scheduling system of claim 1, wherein: the bidirectional dependency resolution comprises dependency resolution for each newly generated task and batch dependency resolution for each completed task.