CN111209101B - Big data calculation task multi-dependency scheduling system - Google Patents

Big data calculation task multi-dependency scheduling system

Info

Publication number
CN111209101B
Authority
CN
China
Prior art keywords
task
module
dependency
big data
actual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010008151.7A
Other languages
Chinese (zh)
Other versions
CN111209101A (en)
Inventor
黄胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Coship Electronics Co Ltd
Original Assignee
Shenzhen Coship Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Coship Electronics Co Ltd
Priority to CN202010008151.7A
Publication of CN111209101A
Application granted
Publication of CN111209101B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/48Indexing scheme relating to G06F9/48
    • G06F2209/485Resource constraint
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5021Priority
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-dependency scheduling system for big data computing tasks. The system comprises a user side, a Web visualization module, a task template generation module, an actual task generation module, a task dependency solving module, a scheduling optimization calculation module, an actual task scheduling module and a big data computing platform. Through simple configuration of task parameters, the invention automatically handles complex task dependency relationships, allocates cluster computing resources reasonably and tracks the computing process and results effectively. It greatly simplifies the scheduling management of big data computing tasks with complex dependency relationships, improves the utilization of cluster computing resources, strengthens state management of executing tasks, and reduces both the difficulty of use and the likelihood of task errors.

Description

Big data calculation task multi-dependency scheduling system
Technical Field
The invention belongs to the technical field of computers, and relates to a big data computing task multi-dependency scheduling system.
Background
With the rapid development of big data technology, storing and computing over massive amounts of offline data is no longer difficult. The most mainstream solution is the Hadoop distributed system, whose core is the distributed file system HDFS and the unified resource management and scheduling system YARN, supplemented by the Spark in-memory computing engine. However, as the computations themselves and the dependency relationships between different computations grow more complex, current big data scheduling systems face several difficulties: how to simplify the management of increasingly complex computation dependencies, how to accurately track the state of the task scheduling process, and how to complete the scheduling of big data computing tasks accurately and efficiently.
At present there are many ways to schedule big data computing tasks: the Crontab program and the Quartz class library, which are oriented toward single-machine execution; the open-source distributed scheduling systems Oozie and Azkaban; and scheduling software that other companies have built by wrapping open-source components. However, the following problems exist: 1. they cannot handle dependency relationships, or can only handle simple sequential ones, so implementing complex dependencies requires wrapping and modifying the software at a large development cost, and the resulting functionality is still limited by the software itself; 2. when tasks are scheduled intensively, cluster resources cannot be allocated reasonably, so cluster computing resources become skewed and congestion arises during scheduling; 3. the operating mode is unfriendly, the learning cost is high, and development and scheduling efficiency is low; 4. there is no task resource management function, so task resources must be managed entirely by the user. Therefore, there is a need for a multi-dependency scheduling system for big data computing tasks.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a big data computing task multi-dependency scheduling system.
The invention is realized by the following scheme:
the system comprises a user side, a Web visualization module, a task template generation module, an actual task generation module, a task dependency solving module, a scheduling optimization calculation module, an actual task scheduling module and a big data computing platform;
the Web visualization module is used for providing a simple, easy-to-understand task management Web interface that supports task state management and cluster resource management, and for creating tasks and filling in or modifying task parameters;
the task template generation module is used for verifying and storing the task parameters filled in or modified in the Web visualization module and generating a task template;
the actual task generation module is used for checking the task templates according to the set execution time, generating an actual task from the task template, organizing the task parameters into an execution command that can be submitted directly to the computing cluster, and storing the execution command;
the task dependency solving module is used for performing bidirectional dependency resolution on the actual tasks;
the scheduling optimization calculation module is used for adjusting the execution order of the tasks to be scheduled whose dependencies have been resolved;
the actual task scheduling module is used for receiving the execution sequence submitted by the scheduling optimization calculation module, submitting it to the big data computing platform for execution, deciding from the result returned by the big data computing platform whether the actual task goes to the task dependency solving module or back to the scheduling optimization calculation module, and returning the execution result to the Web visualization module.
The task template generation module supports setting a task priority to adjust the task execution order, and supports setting a tolerance delay time.
The tolerance delay time is used for run-timeout warnings and for post-run result evaluation.
The bidirectional dependency resolution comprises dependency resolution for each newly generated task and batch dependency resolution for each completed task.
The invention has the following beneficial effects:
1. The big data computing task multi-dependency scheduling system provides a quick way to submit big data computing tasks with a low learning cost. Task dependencies are resolved semi-automatically, so a user only needs to consider the dependencies of the current task, which simplifies establishing dependency relationships and reduces the likelihood of errors caused by complex dependencies.
2. Dependency resolution in the system can handle complex multi-dependency relationships, and tasks are abstracted into templates, so that task resources are managed as templates.
3. Through the scheduling optimization calculation module, the system can handle tasks with the different priorities and degrees of urgency set by the user. This strengthens the user's control over the task execution order while automatically balancing resources, so that computing resources are fully used without becoming skewed.
4. Before execution, the system detects when a task has exceeded its preset time and issues a timely reminder during dependency resolution; after execution, it analyzes historical data to estimate how long the task's next run will take and uses the estimate to further optimize the scheduling order.
Drawings
FIG. 1 is a block flow diagram of the big data computing task multi-dependency scheduling system of the present invention.
Detailed Description
The invention is further illustrated below in connection with specific examples:
The system comprises a user side, a Web visualization module, a task template generation module, an actual task generation module, a task dependency solving module, a scheduling optimization calculation module, an actual task scheduling module and a big data computing platform.
The Web visualization module provides a simple, easy-to-understand task management Web interface that supports task state management and cluster resource management, and is used for creating tasks and filling in or modifying task parameters.
The Web visualization module is the user-friendly operating interface of the scheduling system and manages tasks as a whole. Its functions include, but are not limited to, creating, modifying, deleting, starting and stopping tasks; viewing and modifying real-time task states; querying and compiling statistics on historical tasks; and viewing the cluster load state.
The task template generation module verifies and stores the task parameters filled in or modified in the Web visualization module and generates a task template. It supports setting a task priority to adjust the task execution order and setting a tolerance delay time, which is used for run-timeout warnings and post-run result evaluation. The module works together with the Web visualization module: the parameters entered there are checked and stored, and the resulting task template is used by the subsequent actual task generation module.
The main content of a task template comprises task basic information, a task execution file, a task execution period, task execution parameters and a task dependency template. The task basic information comprises the task name, creator, creation time and similar items. For the task execution file, the integrity of the executable file is checked and the file is stored in HDFS. The task execution period comprises the task execution time, execution period, task priority and tolerance delay time; it supports task periods at various time intervals, task priorities for adjusting the execution order, and a tolerance delay time for run-timeout warnings and post-run result evaluation.
The task execution parameters are the additional parameters required by the task, and they support passing in variables. Variables filled in by the scheduling system, such as the task execution time, time interval and submitting user, can be passed in, and custom variables are also supported.
The task dependency template takes the form of a dependent task name plus a time dependency expression. The dependent task name is selected from the tasks already configured, the time dependency expression sets the dependency relationships between periodic tasks, and a periodic task can be set to depend on its own instances at other execution times.
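As a rough illustration of the template contents described above, the following Python sketch models a task template and its verification step. The class names, field names, the `period` format and the `"-1d"` style time dependency expression are all assumptions made for this example; the patent does not specify a concrete schema or expression syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DependencyRule:
    task_name: str   # name of a task that is already configured
    time_expr: str   # hypothetical time dependency expression, e.g. "-1d"


@dataclass
class TaskTemplate:
    name: str
    creator: str
    executable: str                 # path of the executable, e.g. in HDFS
    period: str                     # execution period; the format is assumed
    priority: int = 0
    tolerance_delay_s: int = 0
    params: Dict[str, str] = field(default_factory=dict)
    depends_on: List[DependencyRule] = field(default_factory=list)

    def validate(self, known_tasks: List[str]) -> List[str]:
        """Return a list of problems; an empty list means the template is valid."""
        problems = []
        if not self.name:
            problems.append("task name is required")
        if not self.executable:
            problems.append("executable file is required")
        if self.tolerance_delay_s < 0:
            problems.append("tolerance delay must not be negative")
        for rule in self.depends_on:
            # A periodic task may depend on itself at another execution time.
            if rule.task_name != self.name and rule.task_name not in known_tasks:
                problems.append(f"unknown dependency: {rule.task_name}")
        return problems
```

Returning a list of problems rather than raising on the first one fits the described flow, in which failure reasons are reported back to the Web interface so the user can correct the task.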
The actual task generation module checks the task templates according to the set execution time, generates an actual task from each matching task template, organizes the task parameters into an execution command that can be submitted directly to the computing cluster, and stores the execution command. Its main role is to turn the templates produced by the task template generation module into the corresponding actual tasks at the set execution time.
The specific flow of the actual task generation module is as follows:
(1) Check the task templates against the actual time to obtain the templates whose execution time has arrived;
(2) If the task execution parameters contain variables, perform variable replacement; if the task has dependency relationships, read all of its dependency relationships according to the actual time;
(3) Assemble the actual execution command of the task, then store the command together with the task's dependency relationships and other information as the actual task;
(4) The actual task generation module runs periodically: after this process completes, it waits for the next check time and then checks the templates again, forming a cycle.
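Steps (2) and (3) above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the template is a plain dict, the `${name}` placeholder syntax (from the standard `string.Template` class), the variable names `exec_date` and `exec_time`, the `name@date` task identifier and the `run_report.sh` executable are all assumptions. For simplicity every dependency is resolved to the same execution date, whereas the patent allows arbitrary time dependency expressions.

```python
from datetime import datetime
from string import Template


def generate_actual_task(template, run_time, extra_vars=None):
    """Turn a task template (a plain dict here) into a stored actual task."""
    # Variables filled in by the scheduler; the names are illustrative.
    variables = {
        "exec_date": run_time.strftime("%Y%m%d"),
        "exec_time": run_time.strftime("%H:%M:%S"),
    }
    variables.update(extra_vars or {})  # custom variables

    # Step (2): replace ${name} placeholders; a missing variable raises KeyError.
    args = [
        f"--{key}={Template(value).substitute(variables)}"
        for key, value in template["params"].items()
    ]
    # Step (3): assemble the command and keep it with the dependencies.
    return {
        "task_id": f"{template['name']}@{variables['exec_date']}",
        "command": " ".join([template["executable"], *args]),
        "depends_on": [f"{dep}@{variables['exec_date']}" for dep in template["depends_on"]],
        "status": "WAITING_DEPENDENCY",
    }


template = {
    "name": "daily_report",
    "executable": "run_report.sh",  # hypothetical executable
    "params": {"date": "${exec_date}", "user": "${submit_user}"},
    "depends_on": ["ingest"],
}
task = generate_actual_task(template, datetime(2020, 1, 6, 2, 0), {"submit_user": "alice"})
print(task["command"])  # run_report.sh --date=20200106 --user=alice
```

Using `substitute` rather than `safe_substitute` makes an undefined variable fail loudly at generation time, which matches the described behavior of reporting abnormalities before a task reaches the cluster.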
The task dependency solving module performs bidirectional dependency resolution on the actual tasks, which comprises dependency resolution for each newly generated task and batch dependency resolution for each completed task. In other words, dependencies are resolved from two directions: top-down, every newly generated task has its dependencies resolved, and bottom-up, every completed task triggers batch dependency resolution. Resolving in both directions keeps dependency checking efficient, so that dependencies can be resolved readily even when there are many tasks with complex dependency relationships.
The specific flow of the task dependency solving module is as follows:
(1) Top-down dependency resolution is performed first: after the actual task generation module generates an actual task, the task dependency solving module matches all of the task's dependencies against the completed tasks to confirm which dependencies are resolved;
(2) Judge whether all dependencies are resolved; if so, hand the task to the scheduling optimization calculation module; otherwise wait and resolve the dependencies again later;
(3) Bottom-up dependency resolution is performed after a task executes successfully: batch dependency resolution is applied to all tasks whose dependencies are not yet resolved, so that every task depending on the successful task has that dependency resolved, and the operation in (2) is then performed on those tasks.
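The two directions of resolution can be sketched as follows. The class and method names are hypothetical, and the in-memory reverse index (`waiters`) is an assumption chosen for the example; the patent does not say how pending dependencies are stored or looked up.

```python
from collections import defaultdict


class DependencyResolver:
    def __init__(self):
        self.completed = set()           # ids of successfully finished tasks
        self.pending = {}                # task id -> set of unresolved dependency ids
        self.waiters = defaultdict(set)  # dependency id -> ids of tasks waiting on it
        self.ready = []                  # tasks handed to scheduling optimization

    def on_task_generated(self, task_id, dependencies):
        """Top-down: match a new task's dependencies against completed tasks."""
        unresolved = set(dependencies) - self.completed
        if not unresolved:
            self.ready.append(task_id)
            return
        self.pending[task_id] = unresolved
        for dep in unresolved:
            self.waiters[dep].add(task_id)

    def on_task_succeeded(self, task_id):
        """Bottom-up: batch-resolve this dependency for every waiting task."""
        self.completed.add(task_id)
        for waiter in sorted(self.waiters.pop(task_id, ())):
            remaining = self.pending[waiter]
            remaining.discard(task_id)
            if not remaining:
                del self.pending[waiter]
                self.ready.append(waiter)
```

With the reverse index, a completed task touches only the tasks that actually wait on it, so the bottom-up pass does not have to rescan every pending task.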
The scheduling optimization calculation module adjusts the execution order of the tasks to be scheduled whose dependencies have been resolved. It weighs task priority, time urgency, historical task execution time, the number of tasks currently being scheduled, the estimated resource occupation of each task, the remaining cluster resources and similar factors, selects a suitable number of tasks and an execution order, and passes the tasks in that order to the actual task scheduling module.
The specific flow of the scheduling optimization calculation module is as follows:
(1) For all tasks to be scheduled whose dependencies have been resolved, first calculate the urgency of each task, then sort the tasks by priority and, within the same priority, by urgency to obtain a basic execution order;
(2) After obtaining the remaining cluster computing resources, balance the computing resources: skip tasks for which resources are insufficient and submit the others to the actual task scheduling module in order until the remaining resources fall below the minimum task submission parameter;
(3) After submitting tasks, check the cluster resources periodically to judge whether further tasks can be submitted; if so, pull the tasks to be scheduled again and re-sort them, entering a loop.
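Steps (1) and (2) can be illustrated with the sketch below. The patent does not give a formula for urgency or say which resources are balanced, so this example assumes that urgency is the slack between a task's deadline and its estimated run time, that a larger `priority` value means more important, and that memory is the only resource considered. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadyTask:
    task_id: str
    priority: int          # larger means more important (assumed convention)
    deadline_s: float      # seconds until the tolerance delay expires
    est_runtime_s: float   # estimate taken from historical runs
    est_memory_gb: float   # estimated resource occupation


def urgency(task: ReadyTask) -> float:
    # Assumed definition: slack left after the estimated run time.
    # A smaller value means the task is more urgent.
    return task.deadline_s - task.est_runtime_s


def select_tasks(tasks: List[ReadyTask], free_memory_gb: float,
                 min_submit_gb: float) -> List[ReadyTask]:
    # Step (1): sort by priority, then by urgency within the same priority.
    ordered = sorted(tasks, key=lambda t: (-t.priority, urgency(t)))
    chosen = []
    for task in ordered:
        # Step (2): stop once remaining resources fall below the minimum.
        if free_memory_gb < min_submit_gb:
            break
        if task.est_memory_gb > free_memory_gb:
            continue  # skip a task that does not fit; smaller ones may still run
        chosen.append(task)
        free_memory_gb -= task.est_memory_gb
    return chosen
```

Step (3) would call `select_tasks` again on a timer with a fresh reading of the free cluster resources, which is how the loop described above could be realized.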
The actual task scheduling module receives the execution sequence submitted by the scheduling optimization calculation module, submits it to the big data computing platform for execution, decides from the result returned by the big data computing platform whether the actual task goes to the task dependency solving module or back to the scheduling optimization calculation module, and returns the execution result to the Web visualization module. The actual task scheduling module is responsible for running scheduled tasks, monitoring task execution states, managing task execution logs, retrying failed tasks and so on.
The specific flow of the actual task scheduling module is as follows:
(1) The actual task scheduling module receives the tasks submitted by the scheduling optimization calculation module and submits them to the big data computing platform for execution;
(2) Judge the running result: if the task succeeded, go to the task dependency solving module for batch dependency resolution; if it failed, decide whether to enter the retry flow according to whether the task is configured to retry;
(3) If the task needs to be retried, a timer modifies the state of the failed task once the retry interval has elapsed and puts it back into the to-be-scheduled queue of the scheduling optimization calculation module.
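The result handling and retry flow in steps (2) and (3) might look like the following sketch. The state names, the `max_retries` and `retry_interval_s` fields, and the string routing values are assumptions for illustration; the patent only states that a failed task is reset into the to-be-scheduled queue after the retry interval.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunState:
    task_id: str
    max_retries: int = 0
    retry_interval_s: float = 60.0
    attempts: int = 0
    status: str = "RUNNING"
    next_retry_at: float = 0.0


def on_run_finished(state: RunState, succeeded: bool, now: float) -> str:
    """Step (2): route the task according to the running result."""
    state.attempts += 1
    if succeeded:
        state.status = "SUCCESS"
        return "dependency_resolution"   # triggers bottom-up batch resolution
    if state.attempts <= state.max_retries:
        state.status = "WAIT_RETRY"
        state.next_retry_at = now + state.retry_interval_s
        return "retry"
    state.status = "FAILED"
    return "report_failure"


def requeue_due_retries(states: List[RunState], now: float,
                        to_schedule: List[str]) -> None:
    """Step (3): called on a timer; move due retries back to the queue."""
    for state in states:
        if state.status == "WAIT_RETRY" and now >= state.next_retry_at:
            state.status = "TO_SCHEDULE"
            to_schedule.append(state.task_id)
```

Putting a retried task back into the scheduling optimization queue, rather than resubmitting it directly, means it competes again under the current priorities and free resources.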
The specific working process of the invention comprises the following steps:
First, a task is created through the Web visualization module and its task parameters are filled in. According to the rules of the task template generation module, it is judged whether the task is to be executed directly; if so, it is handed to the scheduling optimization calculation module; if not, it is submitted to the task template generation module.
Second, the task template is generated. After the task parameters are filled in, the flow enters the task template generation module. If an abnormality is detected while generating the task template, the reason for the failure is reported and the task is returned to the Web visualization module for modification; if no abnormality is detected, the flow enters the actual task generation module.
Third, the actual task is generated. If an abnormality is detected while generating the actual task, the reason for the failure is reported and the task is returned to the Web visualization module for modification; if no abnormality is detected, the flow enters the task dependency solving module.
Fourth, task dependencies are resolved. If the tolerance delay time is reached during dependency resolution, the Web visualization module is notified that the task has timed out, but dependency resolution continues until all dependencies are resolved, and the flow then proceeds to the next step.
Fifth, scheduling is optimized. The scheduling optimization calculation module receives the directly executed tasks and the tasks submitted by the task dependency solving module, optimizes the scheduling order, and the flow proceeds to the next step.
Sixth, the task is scheduled. The actual task scheduling module submits the task to the big data computing platform for execution and returns the execution result to the Web visualization module.
Seventh, the flow ends.
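The timeout handling in the fourth step, and the post-run evaluation that uses the same tolerance delay time, can be sketched as two small checks. The function names and the "late"/"on_time" labels are assumptions; the patent does not define how the post-run evaluation is expressed.

```python
from datetime import datetime, timedelta


def should_warn(scheduled_at: datetime, now: datetime,
                tolerance: timedelta, already_warned: bool) -> bool:
    """Before execution: warn once when the task has waited past its tolerance."""
    return not already_warned and now - scheduled_at > tolerance


def evaluate_run(scheduled_at: datetime, finished_at: datetime,
                 tolerance: timedelta) -> str:
    """After execution: classify the run against the same tolerance."""
    return "late" if finished_at - scheduled_at > tolerance else "on_time"
```

Note that `should_warn` only reports the timeout. Consistent with the fourth step, the caller keeps resolving dependencies after the warning instead of cancelling the task.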
The key techniques of the big data computing task multi-dependency scheduling system of the invention are: establishing dependency relationships between templates by abstracting tasks into templates and setting time dependency expressions; a bidirectional way of resolving task dependencies, in which a newly created task's dependencies are scanned and resolved, and when a task completes, that dependency is resolved for all tasks depending on it; semi-automatic dependency handling, in which dependency processing is carried out automatically so that the user only needs to attend to the dependencies of the current task; a scheduling optimization process that dynamically adjusts the task execution order according to priority, task urgency, remaining cluster resources, historical task execution statistics and the like; and, based on the tolerance delay time, a dependency timeout warning before task execution and an evaluation of the execution result after task execution.
While the invention has been described and illustrated in considerable detail, it should be understood that modifications and equivalents to the above-described embodiments will become apparent to those skilled in the art, and that such modifications and improvements may be made without departing from the spirit of the invention.

Claims (4)

1. A big data calculation task multi-dependency scheduling system, characterized in that: the scheduling system comprises a user side, a Web visualization module, a task template generation module, an actual task generation module, a task dependency solving module, a scheduling optimization calculation module, an actual task scheduling module and a big data computing platform;
the Web visualization module is used for providing a task management Web interface which is simple and easy to understand, supporting task state management, supporting cluster resource management, and creating tasks, filling or modifying task parameters;
the task template generation module is used for verifying and storing the task parameters filled in or modified in the Web visualization module and generating a task template;
the actual task generating module is used for checking the task template according to the set execution time, generating an actual task by using the task template, organizing task parameters into an execution command which can be directly submitted to a computing cluster and storing the execution command;
the task dependency solving module is used for performing bidirectional dependency resolution on the actual tasks;
the scheduling optimization calculation module is used for adjusting the execution sequence of tasks to be scheduled, which solves the dependency relationship;
the actual task scheduling module is used for receiving the execution sequence submitted by the scheduling optimization module, submitting the execution sequence to the big data computing platform for operation, judging that the actual task goes to the task dependency solving module or the scheduling optimization computing module according to the operation result of the big data computing platform, and returning the execution result to the Web visualization module.
2. The big data computing task multi-dependency scheduling system of claim 1, wherein: the task template generation module supports setting task priority and adjusting task execution sequence, and supports setting tolerance delay time.
3. A big data computing task multi-dependency scheduling system in accordance with claim 2, wherein: the delay time is used to evaluate the run-time-out warning and the post-run result evaluation.
4. The big data computing task multi-dependency scheduling system of claim 1, wherein: the bi-directional dependency resolution includes a generated new task dependency resolution and a batch dependency resolution for each completed task.
CN202010008151.7A 2020-01-06 2020-01-06 Big data calculation task multi-dependency scheduling system Active CN111209101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010008151.7A CN111209101B (en) 2020-01-06 2020-01-06 Big data calculation task multi-dependency scheduling system

Publications (2)

Publication Number Publication Date
CN111209101A CN111209101A (en) 2020-05-29
CN111209101B true CN111209101B (en) 2023-05-02

Family

ID=70789541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010008151.7A Active CN111209101B (en) 2020-01-06 2020-01-06 Big data calculation task multi-dependency scheduling system

Country Status (1)

Country Link
CN (1) CN111209101B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559486A (en) * 2020-11-11 2021-03-26 国网江苏省电力有限公司信息通信分公司 Data center unified task scheduling management system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109375996A (en) * 2018-09-27 2019-02-22 安徽省鼎众金融信息咨询服务有限公司 A kind of support dependence managerial role scheduling system
CN109669767A (en) * 2018-11-30 2019-04-23 河海大学 A kind of task encapsulation and dispatching method and system towards polymorphic type Context-dependent
CN109684053A (en) * 2018-11-05 2019-04-26 广东岭南通股份有限公司 The method for scheduling task and system of big data

Also Published As

Publication number Publication date
CN111209101A (en) 2020-05-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant