CN117742907A - Scheduling method and device for data processing task - Google Patents

Scheduling method and device for data processing task

Info

Publication number
CN117742907A
CN117742907A (application CN202311613202A)
Authority
CN
China
Prior art keywords
task
data processing
job
components
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311613202.9A
Other languages
Chinese (zh)
Inventor
郑卓源
刘佳
甘俊杰
叶惠明
张超武
谢时焘
王立
张国彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Guangfa Bank Co Ltd
Original Assignee
China Guangfa Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Guangfa Bank Co Ltd filed Critical China Guangfa Bank Co Ltd
Priority to CN202311613202.9A priority Critical patent/CN117742907A/en
Publication of CN117742907A publication Critical patent/CN117742907A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval; Database Structures and File System Structures Therefor (AREA)

Abstract

The application discloses a method and an apparatus for scheduling data processing tasks. The method comprises: acquiring job information of a plurality of data processing jobs, wherein the job information comprises the plurality of pipelines corresponding to each data processing job and directed acyclic graph information of the plurality of task components in each pipeline; for each data processing job, sequentially sending its task components to the corresponding task queue according to its job information; and, for each task queue, determining the priority order of the task components in the task queue, determining the target concurrency number of the task execution node corresponding to the task queue, and having the task execution node send a plurality of task components in the task queue in parallel to a database service cluster for data processing according to the priority order and the target concurrency number. The method and the apparatus address the technical problem that the related art lacks an efficient scheduling scheme for large numbers of data processing tasks, which results in low overall processing efficiency.

Description

Scheduling method and device for data processing task
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for scheduling a data processing task.
Background
As the componentization, workflow, and visual development capabilities of big data platforms have steadily improved, data development sub-applications have been progressively rolled out across research and development teams, and, as big data application scenarios continue to deepen, the number of data processing pipelines that must be executed every day has grown sharply, with the number of data processing components involved greater still. A stable, efficient, high-throughput task scheduling scheme is therefore needed for this large volume of data processing tasks.
No effective solution to the above problems has yet been proposed.
Disclosure of Invention
The embodiments of the present application provide a method and an apparatus for scheduling data processing tasks, which at least solve the technical problem that the related art lacks an efficient scheduling scheme for large numbers of data processing tasks, resulting in low overall processing efficiency.
According to one aspect of the embodiments of the present application, a method for scheduling data processing tasks is provided, comprising: acquiring job information of a plurality of data processing jobs, wherein each data processing job corresponds to a plurality of pipelines, each pipeline comprises a plurality of task components, and the job information comprises directed acyclic graph information between the pipelines and the task components; for each data processing job, sequentially sending the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job; and, for each task queue, determining the priority order of the task components in the task queue, determining the target concurrency number of the task execution node corresponding to the task queue, and sending, via the task execution node, a plurality of task components in the task queue in parallel to a database service cluster for data processing according to the priority order and the target concurrency number.
Optionally, acquiring the job information of the plurality of data processing jobs comprises: polling a first instance node in a load-balancing manner, and acquiring the job information of the plurality of data processing jobs in the first instance node through a plurality of independent task distribution nodes, respectively.
Optionally, the job information further comprises tenant information corresponding to the data processing job, and sequentially sending the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job comprises: for each data processing job, determining the tenant type according to the tenant information corresponding to the data processing job; when the tenant type is an independent tenant, sequentially sending the plurality of task components corresponding to the data processing job to the dedicated task queue corresponding to the data processing job; and when the tenant type is a non-independent tenant, sequentially sending the plurality of task components corresponding to the data processing job to a public task queue.
Optionally, sequentially sending the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job comprises: for each data processing job, determining the upstream-downstream relationships among the plurality of pipelines corresponding to the data processing job and the upstream-downstream relationships among the plurality of task components in each pipeline according to the directed acyclic graph information of the data processing job; for each pipeline, starting task-queue distribution of the task components in the pipeline only when all task components in the pipeline's upstream pipelines have finished data processing; and, for each task component, starting task-queue distribution of that task component only when the task components upstream of it have finished data processing.
Optionally, determining the priority order of the task components in the task queue comprises: acquiring priority information of each task component in the task queue, wherein the priority information comprises a priority level and a priority numeric identifier, the priority levels comprising high, medium, and low; when the priority levels are the same, the larger the priority numeric identifier, the higher the priority, and the priority numeric identifier increases as the waiting time of the task component in the task queue grows; determining the priority order of the task components in the task queue according to the priority information; or, in response to a priority-order adjustment instruction from a target object, determining the adjusted priority order of the task components in the task queue.
Optionally, determining the target concurrency number of the task execution node corresponding to the task queue comprises: determining the maximum overall concurrency number of the plurality of task execution nodes according to the amount of computing resources available for data processing in the database service cluster; determining the current actual overall concurrency number of the plurality of task execution nodes; and, for the task execution node corresponding to the task queue, determining the maximum concurrency number of the task execution node according to the amount of schedulable resources of the task execution node, and determining the target concurrency number of the task execution node according to the maximum overall concurrency number and the actual overall concurrency number of the plurality of task execution nodes, wherein the target concurrency number does not exceed the maximum concurrency number.
Optionally, after the task execution node sends the plurality of task components in the task queue in parallel to the database service cluster for data processing according to the priority order and the target concurrency number, the task execution node receives a plurality of data processing results fed back by the database service cluster, and classifies and stores each data processing result in the target database according to the data processing job corresponding to that result.
Optionally, when the first instance node fails, a second instance node associated with the first instance node is determined and polling continues on the second instance node, wherein all incomplete data processing jobs in the first instance node are transferred to the second instance node.
According to another aspect of the embodiments of the present application, an apparatus for scheduling data processing tasks is further provided, comprising: an acquisition module, configured to acquire job information of a plurality of data processing jobs, wherein each data processing job corresponds to a plurality of pipelines, each pipeline comprises a plurality of task components, and the job information comprises directed acyclic graph information between the pipelines and the task components; a distribution module, configured, for each data processing job, to sequentially send the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job; and an execution module, configured, for each task queue, to determine the priority order of the task components in the task queue, determine the target concurrency number of the task execution node corresponding to the task queue, and send, via the task execution node, a plurality of task components in the task queue in parallel to the database service cluster for data processing according to the priority order and the target concurrency number.
According to another aspect of the embodiments of the present application, a non-volatile storage medium is further provided. The non-volatile storage medium comprises a stored computer program, and the device on which the non-volatile storage medium resides executes the above scheduling method for data processing tasks by running the computer program.
In the embodiments of the present application, job information of a plurality of data processing jobs is first acquired, wherein the job information comprises the plurality of pipelines corresponding to each data processing job and the directed acyclic graph information of the plurality of task components in each pipeline; then, for each data processing job, the corresponding task components are sequentially sent to the corresponding task queue according to the corresponding job information; and, for each task queue, the priority order of the task components in the task queue is determined, the target concurrency number of the task execution node corresponding to the task queue is determined, and the task execution node sends a plurality of task components in the task queue in parallel to a database service cluster for data processing according to the priority order and the target concurrency number. By refining data processing jobs to component granularity and sending the components to queues according to the upstream-downstream relationships between pipelines and between components, the integrity of the task processing results can be guaranteed; by consuming the queued components in priority order and dynamically adjusting the queue concurrency based on global resource usage, task scheduling efficiency can be greatly improved. This effectively solves the technical problem that the related art lacks an efficient scheduling scheme for large numbers of data processing tasks, resulting in low overall processing efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application without unduly limiting it. In the drawings:
FIG. 1 is a schematic diagram of an alternative data processing task scheduling system according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative data processing task scheduling method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative pipeline creation interface according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a directed acyclic graph of an alternative data processing job according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an alternative data processing task scheduling device according to an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. The described embodiments are clearly only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and the accompanying drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the present application, an optional data processing task scheduling system is first provided. As shown in FIG. 1, the system mainly comprises: a plurality of task distribution nodes 11 (1 to n), a plurality of task queues 12 (1 to m), and a plurality of task execution nodes 13 (1 to m). All task distribution nodes and task execution nodes are based on a microservice architecture, deployed as multiple instances, and registered with a registration center. The task queues can be divided into dedicated task queues for independent tenants and public task queues for non-independent tenants, and the database service cluster typically comprises MySQL clusters, Redis clusters, Hive clusters, HBase clusters, a Spark computation engine, and the like.
In this system, a task distribution node acquires a data processing task from an instance node on the user side and sends the task, in the form of task components, to the corresponding task queue; a task execution node consumes tasks based on the priority of the task components in the task queue, so that a plurality of task components can be sent in parallel to the database service cluster for data processing at the same time.
On the basis of the above data processing task scheduling system, the embodiments of the present application provide a method for scheduling data processing tasks. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and, although a logical order is shown in the flowcharts, in some cases the illustrated or described steps may be performed in an order different from that herein.
Fig. 2 is a flow chart of an alternative method for scheduling data processing tasks according to an embodiment of the present application, as shown in fig. 2, the method includes steps S202-S206, where:
step S202, job information of a plurality of data processing jobs is obtained, wherein each data processing job corresponds to a plurality of pipelines, each pipeline comprises a plurality of task components, and the job information comprises directed acyclic graph information between the pipeline and the task components.
In general, when a user creates a data processing job at an instance node, the following approach may be used: a required task component is selected from a preset task component library and dragged into a canvas, wherein task components can be divided by processing logic into management components, access components, processing components, and delivery components, and can support functions such as file inspection, data upload and download, data processing, data verification, encryption and decryption, code-value conversion, and cleaning and standardization; the processing logic of each task component is defined by configuring SQL statements, setting metadata, custom parameters, and the like; the execution order among the task components is set through connecting lines to complete the creation of a pipeline; and, after a plurality of pipelines are created, the execution order among the pipelines is set through connecting lines to complete the creation of the data processing job, yielding its complete directed acyclic graph.
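The two-level structure described above can be sketched as a small data model: a job holds a DAG of pipelines, and each pipeline holds a DAG of task components. All class and field names here are illustrative assumptions for the example; the patent does not define concrete data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TaskComponent:
    name: str
    upstream: List[str] = field(default_factory=list)  # names of upstream components

@dataclass
class Pipeline:
    name: str
    components: Dict[str, TaskComponent] = field(default_factory=dict)
    upstream: List[str] = field(default_factory=list)  # names of upstream pipelines

@dataclass
class DataProcessingJob:
    name: str
    pipelines: Dict[str, Pipeline] = field(default_factory=dict)

# Example: component "b" runs after "a" inside pipeline 1, and
# pipeline 2 runs after pipeline 1.
p1 = Pipeline("pipeline1", {"a": TaskComponent("a"),
                            "b": TaskComponent("b", upstream=["a"])})
p2 = Pipeline("pipeline2", upstream=["pipeline1"])
job = DataProcessingJob("job1", {"pipeline1": p1, "pipeline2": p2})
```

The `upstream` lists together encode the complete directed acyclic graph of the job, which is what the job information carries to the task distribution nodes.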
FIG. 3 shows a schematic diagram of an interface for creating a pipeline in the canvas, and FIG. 4 shows the directed acyclic graph of a completed data processing job, wherein, as an example, only the directed acyclic graph of the task components in pipeline 1 is shown in detail; the component relationships in the other pipelines have been determined but are not shown in detail.
In the task scheduling process, the job information of the plurality of data processing jobs may be acquired as follows: a first instance node is polled in a load-balancing manner, and the job information of the plurality of data processing jobs in the first instance node is acquired through a plurality of independent task distribution nodes, respectively. Each task distribution node is responsible for distributing one data processing job and can monitor the processing state of that job.
Optionally, when the first instance node fails, a second instance node associated with the first instance node may be determined, and polling of the second instance node in a load-balancing manner may continue, wherein all incomplete data processing jobs in the first instance node are transferred to the second instance node, ensuring that tasks are not interrupted.
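The failover step can be sketched as moving the failed node's unfinished jobs wholesale to its associated backup before polling resumes there. The mapping structure and function name are illustrative assumptions, not details from the patent.

```python
def failover(pending_jobs, failed_node, backup_node):
    """Transfer all of `failed_node`'s pending jobs to `backup_node`.

    `pending_jobs` maps an instance-node name to the list of data
    processing jobs not yet completed on that node.
    """
    moved = pending_jobs.pop(failed_node, [])          # drain the failed node
    pending_jobs.setdefault(backup_node, []).extend(moved)
    return pending_jobs

# Example: instance1 fails while holding two unfinished jobs.
pending = {"instance1": ["job1", "job2"], "instance2": ["job3"]}
failover(pending, "instance1", "instance2")
```

After the transfer, the distribution nodes simply continue their load-balanced polling against the surviving node, so no job is lost or left orphaned.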
Step S204, for each data processing job, sequentially sending the task components corresponding to the data processing job to the task queues corresponding to the data processing job according to the job information corresponding to the data processing job.
As an optional implementation, the acquired job information further includes the tenant information corresponding to the data processing job. When each task distribution node distributes the data processing job it has acquired, the tenant type can be determined according to the tenant information corresponding to the data processing job; when the tenant type is an independent tenant, the plurality of task components corresponding to the data processing job are sequentially sent to the dedicated task queue corresponding to the data processing job; and when the tenant type is a non-independent tenant, the plurality of task components corresponding to the data processing job are sequentially sent to a public task queue.
A dedicated task queue only receives task components of data processing jobs from its corresponding independent tenant, whereas the public task queue can receive task components of data processing jobs from a plurality of different non-independent tenants.
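The tenant-based routing above can be sketched in a few lines. The `Job` class, its fields, and the queue layout (one dedicated list per independent tenant plus one shared list) are illustrative assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Job:
    tenant_id: str
    tenant_type: str                      # "independent" or "shared"
    components: List[str] = field(default_factory=list)

def route_job(job, queues):
    """Append the job's components, in order, to the proper task queue."""
    if job.tenant_type == "independent":
        key = job.tenant_id               # dedicated queue per independent tenant
    else:
        key = "public"                    # one shared queue for everyone else
    queues.setdefault(key, []).extend(job.components)
    return key

queues = {}
route_job(Job("tenantA", "independent", ["c1", "c2"]), queues)
route_job(Job("tenantB", "shared", ["c3"]), queues)
route_job(Job("tenantC", "shared", ["c4"]), queues)
```

Components from different non-independent tenants interleave in the public queue, while an independent tenant's queue contains only its own work.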
Optionally, to ensure that data processing jobs are handled in an orderly and complete manner, when each task distribution node distributes the data processing job it has acquired, the upstream-downstream relationships among the plurality of pipelines corresponding to the data processing job and the upstream-downstream relationships among the plurality of task components in each pipeline can be determined according to the directed acyclic graph information of the data processing job; once an upstream task component finishes data processing, the downstream task component is automatically triggered into the running state.
Specifically, for each pipeline, task-queue distribution of the task components in the pipeline starts only when all task components in the pipeline's upstream pipelines have finished data processing; and, for each task component, task-queue distribution of that component starts only when the task components upstream of it have finished data processing.
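The gating rule above amounts to a standard DAG-readiness check: a component may be queued only once every one of its upstream components has completed. The graph representation (component name mapped to its upstream names) is an illustrative assumption.

```python
def ready_components(graph, done):
    """Return the not-yet-done components whose upstreams are all done."""
    return sorted(c for c, ups in graph.items()
                  if c not in done and all(u in done for u in ups))

# Example pipeline: "b" and "c" both depend on "a"; "d" depends on both.
pipeline = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
```

Calling `ready_components` after each completion yields the dispatch waves: first `a` alone, then `b` and `c` together, and `d` only after both of its upstreams finish, which is exactly the ordered, complete processing the distribution nodes enforce.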
Step S206: for each task queue, the priority order of the task components in the task queue is determined, the target concurrency number of the task execution node corresponding to the task queue is determined, and the task execution node sends the task components in the task queue in parallel to the database service cluster for data processing according to the priority order and the target concurrency number.
In general, during pipeline creation, priorities may be preset for the task components, with a priority level and a priority numeric identifier together serving as the priority information. The priority level may be high, medium, or low, and, when the priority levels are the same, the larger the priority numeric identifier, the higher the priority. For example, if the priority information of three task components is high-01, high-99, and low-99 respectively, their priority order is high-99, then high-01, then low-99.
Considering that some task components with lower priority may remain queued for a long time without being consumed, the embodiments of the present application also introduce a mechanism for automatically adjusting the priority of task components based on their waiting time: the priority numeric identifier of each task component automatically grows as its waiting time increases, while its priority level remains unchanged.
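The ordering rule and the aging mechanism above can be sketched together: the level dominates, a larger numeric identifier wins within a level, and waiting raises only the number, never the level. The tick-based aging step is an assumption for illustration.

```python
LEVEL_RANK = {"high": 0, "medium": 1, "low": 2}

def priority_key(level, number):
    # Smaller tuples sort first: level dominates, and within a level a
    # larger numeric identifier means higher priority, so negate it.
    return (LEVEL_RANK[level], -number)

def aged(number, waited_ticks):
    # Waiting raises the numeric identifier; the level is left unchanged.
    return number + waited_ticks

# The example from the text: high-01, high-99, and low-99.
tasks = [("t1", "high", 1), ("t2", "high", 99), ("t3", "low", 99)]
ordered = sorted(tasks, key=lambda t: priority_key(t[1], t[2]))
```

Sorting with this key reproduces the order given in the text (high-99, high-01, low-99), and aging lets a long-waiting high-01 component catch up to high-99 without ever jumping the level boundary.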
Optionally, when determining the priority order of the task components in a task queue, the priority information of each task component in the task queue may be acquired, and the priority order of the task components in the task queue is then determined according to that priority information.
Optionally, the adjusted priority order of the task components in the task queue can also be determined in response to a priority-order adjustment instruction from a target object; that is, a user can manually adjust the priority order of the task components in real time to ensure that urgent tasks are executed first.
Task execution nodes are generally provided with independent thread pools and can consume a plurality of task components in parallel. To improve task scheduling efficiency, the concurrency number of each task execution node can be dynamically adjusted according to the usage state of system resources, so that system resources are fully utilized and the rate at which each task execution node consumes task components is improved.
As an optional implementation, when determining the target concurrency number of the task execution node corresponding to a task queue, the maximum overall concurrency number of the plurality of task execution nodes is determined according to the amount of computing resources available for data processing in the database service cluster; the current actual overall concurrency number of the plurality of task execution nodes is determined; and, for the task execution node corresponding to the task queue, the maximum concurrency number of the task execution node is determined according to the amount of schedulable resources of that node, and the target concurrency number of the task execution node is determined according to the maximum overall concurrency number and the actual overall concurrency number of the plurality of task execution nodes, wherein the target concurrency number does not exceed the maximum concurrency number.
For example, suppose the maximum overall concurrency number of the system is determined to be 100 and the maximum concurrency numbers of task execution nodes A and B are both 70. When task execution nodes A and B run simultaneously and the actual concurrency number of task execution node B is 40, the idle system resources are distributed evenly; that is, the target concurrency number of task execution node A is determined to be 55 and the target concurrency number of task execution node B is determined to be 45.
In some scenarios, task execution node A exclusively processes the data processing jobs of tenant A and is configured with a minimum concurrency threshold of 50; task execution node B exclusively processes the data processing jobs of tenant B and is configured with a minimum concurrency threshold of 35; and the data processing jobs of tenant A are more important than those of tenant B. When task execution nodes A and B run simultaneously, assuming the current actual concurrency number of node A is 50 and that of node B is 40, the target concurrency number of node B can be set to its minimum threshold of 35 and all remaining system resources allocated to node A, i.e., the target concurrency number of node A is determined to be 65, ensuring that important tasks are executed first while system resources remain fully utilized.
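The second scenario above can be sketched as a simple allocation policy: every node is guaranteed its minimum concurrency threshold, and the remaining overall capacity is handed out in importance order, capped by each node's own maximum. The function and field names are illustrative assumptions.

```python
def allocate_concurrency(max_overall, nodes):
    """Compute target concurrency numbers under an overall cap.

    `nodes` is ordered most-important first; each entry needs
    'name', 'min' (minimum threshold) and 'max' (per-node maximum).
    """
    # Start every node at its guaranteed minimum threshold.
    targets = {n["name"]: n["min"] for n in nodes}
    spare = max_overall - sum(targets.values())
    # Hand the spare capacity out in importance order, per-node capped.
    for n in nodes:
        extra = min(spare, n["max"] - targets[n["name"]])
        targets[n["name"]] += extra
        spare -= extra
    return targets

# Scenario from the text: overall cap 100; node A (min 50, max 70) is
# more important than node B (min 35, max 70).
targets = allocate_concurrency(100, [
    {"name": "A", "min": 50, "max": 70},
    {"name": "B", "min": 35, "max": 70},
])
```

With these inputs the sketch reproduces the outcome in the text: node B drops to its threshold of 35 and node A receives the remaining 65, never exceeding either node's maximum.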
Optionally, after the task execution node sends the task components in the task queue in parallel to the database service cluster for data processing, it may receive a plurality of data processing results fed back by the database service cluster and classify and store each data processing result in the target database according to the data processing job corresponding to that result, facilitating subsequent queries.
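The classification step above is essentially a group-by over the feedback stream: results are bucketed by their originating job before being written to the target database. The result fields (`job_id`, `payload`) are illustrative assumptions.

```python
from collections import defaultdict

def group_results_by_job(results):
    """Group cluster feedback by originating data processing job."""
    by_job = defaultdict(list)
    for r in results:
        by_job[r["job_id"]].append(r["payload"])
    return dict(by_job)

# Example: three results fed back by the cluster for two jobs.
grouped = group_results_by_job([
    {"job_id": "job1", "payload": "r1"},
    {"job_id": "job2", "payload": "r2"},
    {"job_id": "job1", "payload": "r3"},
])
```

Each bucket can then be stored under its job's entry in the target database, so later queries retrieve a job's results as one coherent set.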
In the embodiments of the present application, job information of a plurality of data processing jobs is first acquired, wherein the job information comprises the plurality of pipelines corresponding to each data processing job and the directed acyclic graph information of the plurality of task components in each pipeline; then, for each data processing job, the corresponding task components are sequentially sent to the corresponding task queue according to the corresponding job information; and, for each task queue, the priority order of the task components in the task queue is determined, the target concurrency number of the corresponding task execution node is determined, and that task execution node sends a plurality of task components in the task queue in parallel to a database service cluster for data processing according to the priority order and the target concurrency number. By refining data processing jobs to component granularity and sending components to queues according to the upstream-downstream relationships between pipelines and between components, the integrity of the task processing results can be guaranteed; by consuming the queued components in priority order and dynamically adjusting queue concurrency based on global resource usage, task scheduling efficiency can be greatly improved. This effectively solves the technical problem that the related art lacks an efficient scheduling scheme for large numbers of data processing tasks, resulting in low overall processing efficiency.
Example 2
According to an embodiment of the present application, a scheduling apparatus for data processing tasks is further provided to implement the scheduling method for data processing tasks of Embodiment 1. As shown in FIG. 5, the scheduling apparatus comprises at least an acquisition module 51, a distribution module 52, and an execution module 53, wherein:
the obtaining module 51 is configured to obtain job information of a plurality of data processing jobs, where each data processing job corresponds to a plurality of pipelines, each pipeline includes a plurality of task components, and the job information includes directed acyclic graph information between the pipeline and the task components.
In general, when a user creates a data processing job at an instance node, the following approach may be used: a required task component is selected from a preset task component library and dragged into a canvas, wherein task components can be divided by processing logic into management components, access components, processing components, and delivery components, and can support functions such as file inspection, data upload and download, data processing, data verification, encryption and decryption, code-value conversion, and cleaning and standardization; the processing logic of each task component is defined by configuring SQL statements, setting metadata, custom parameters, and the like; the execution order among the task components is set through connecting lines to complete the creation of a pipeline; and, after a plurality of pipelines are created, the execution order among the pipelines is set through connecting lines to complete the creation of the data processing job, yielding its complete directed acyclic graph.
In the task scheduling process, the obtaining module may acquire the job information of the plurality of data processing jobs in the following manner: the first instance node is polled in a load-balanced manner, and the job information of the plurality of data processing jobs in the first instance node is acquired through a plurality of independent task distribution nodes. Each task distribution node is responsible for distributing one data processing job and can monitor the processing state of that job.
Optionally, when the first instance node fails, the obtaining module may determine a second instance node associated with the first instance node and continue polling the second instance node in a load-balanced manner, where all incomplete data processing jobs in the first instance node are transferred to the second instance node, ensuring that tasks are not interrupted.
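The polling-with-failover behavior can be sketched as follows. This is an assumption-laden illustration: the `fetch` callable, the error type used to signal node failure, and the idea of trying associated nodes in order are all simplifications of the load-balanced polling described above:

```python
def poll_jobs(primary, backups, fetch):
    """Poll the primary instance node; on failure, fail over to an
    associated backup node, to which unfinished jobs are assumed to
    have been transferred. `fetch(node)` raises ConnectionError when
    the node is unreachable."""
    nodes = [primary] + list(backups)
    for node in nodes:
        try:
            return fetch(node)
        except ConnectionError:
            continue  # try the next associated instance node
    raise RuntimeError("no reachable instance node")

# Simulated fetch: node "A" is down, its associated node "B" holds the jobs
def fetch(node):
    if node == "A":
        raise ConnectionError
    return [f"job from {node}"]

print(poll_jobs("A", ["B"], fetch))  # -> ['job from B']
```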
The distribution module 52 is configured to sequentially send, for each data processing job, a plurality of task components corresponding to the data processing job to a task queue corresponding to the data processing job according to job information corresponding to the data processing job.
As an optional implementation manner, the job information acquired by the acquisition module further includes tenant information corresponding to the data processing jobs, and the distribution module can determine the tenant type according to the tenant information corresponding to the data processing jobs when distributing each data processing job; when the tenant type is an independent tenant, sequentially sending a plurality of task components corresponding to the data processing job to an exclusive task queue corresponding to the data processing job; and when the tenant type is a non-independent tenant, sequentially sending the plurality of task components corresponding to the data processing job to a public task queue.
The dedicated task queue only receives task components of data processing jobs corresponding to independent tenants, and the common task queue can receive task components of data processing jobs of a plurality of different non-independent tenants.
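The tenant-based routing rule can be sketched in a few lines. The queue naming scheme and the `tenant_type` values below are illustrative assumptions, not terms from the patent:

```python
from collections import defaultdict, deque

queues = defaultdict(deque)  # queue name -> FIFO of task components
PUBLIC_QUEUE = "public"

def dispatch(job_id, components, tenant_type):
    """Route a job's task components, in order, to its queue.
    Independent tenants get an exclusive per-job queue; task
    components of non-independent tenants share the public queue."""
    target = f"exclusive-{job_id}" if tenant_type == "independent" else PUBLIC_QUEUE
    for comp in components:
        queues[target].append(comp)
    return target

q1 = dispatch("jobA", ["c1", "c2"], "independent")  # exclusive queue
q2 = dispatch("jobB", ["c3"], "shared")             # shared public queue
q3 = dispatch("jobC", ["c4"], "shared")             # same public queue
```

Note that the public queue interleaves components from different non-independent tenants, which is why the priority and concurrency controls described below matter.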
Optionally, in order to ensure that data processing jobs are processed in order and in full, when distributing each data processing job, the distribution module may determine the upstream-downstream relationships between the plurality of pipelines corresponding to the job, and between the plurality of task components in each pipeline, according to the directed acyclic graph information of the job; after an upstream task component finishes data processing, the downstream task component is automatically triggered into the running state.
Specifically, for each pipeline, task queue distribution of the task components in that pipeline begins only when all task components in its upstream pipelines have finished data processing; and for each task component, task queue distribution of that component proceeds only when its upstream task components have finished data processing.
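The gating rule above—release a component to its task queue only once everything upstream has finished—can be sketched as a readiness check over the dependency graph. The function and variable names are hypothetical:

```python
def ready_components(components, done):
    """Return components whose upstream dependencies have all finished.
    `components` maps a component name to the list of its upstream
    component names; `done` is the set of finished components."""
    return [name for name, ups in components.items()
            if name not in done and all(u in done for u in ups)]

deps = {"extract": [], "clean": ["extract"], "load": ["clean"]}
assert ready_components(deps, set()) == ["extract"]
# Only after "extract" completes is "clean" released to the task queue:
assert ready_components(deps, {"extract"}) == ["clean"]
```

The same check applies one level up: a pipeline's components become eligible for distribution only when every component of its upstream pipelines appears in `done`.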
The execution module 53 is configured to determine, for each task queue, a priority order of each task component in the task queue, determine a target concurrency number of task execution nodes corresponding to the task queue, and send, by the task execution nodes, the plurality of task components in the task queue to the database service cluster in parallel according to the priority order and the target concurrency number for data processing.
In general, during pipeline creation, priorities may be preset for the task components. The priority information may combine a priority level with a priority number identifier, where the priority level may be high, medium, or low; when the priority levels are the same, the larger the priority number identifier, the higher the priority. For example, if the priority information of three task components is high-01, high-99, and low-99, their priority order is high-99, high-01, low-99.
Considering that some lower-priority task components may remain queued for a long time without being consumed, the embodiment of the application also introduces a mechanism that automatically adjusts a task component's priority based on its waiting time: the priority number identifier of each task component automatically increases as its waiting time grows, while its priority level remains unchanged.
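The two rules above—sort by level, then by numeric identifier, with the identifier aging upward over time—can be expressed as a sort key. The aging rate (one point per minute of waiting) is an assumption for illustration; the patent does not specify one:

```python
LEVEL_RANK = {"high": 0, "medium": 1, "low": 2}

def effective_priority(level, number, waited_seconds, step=60):
    """Sort key: the lower the tuple, the earlier the component is
    consumed. The numeric identifier grows with waiting time (one
    point per `step` seconds is an assumed aging rate); the level
    itself never changes. The identifier is negated so that a larger
    value sorts first within the same level."""
    aged = number + waited_seconds // step
    return (LEVEL_RANK[level], -aged)

# The three components from the example, none having waited yet:
tasks = [("t1", "high", 1, 0), ("t2", "high", 99, 0), ("t3", "low", 99, 0)]
order = sorted(tasks, key=lambda t: effective_priority(t[1], t[2], t[3]))
print([t[0] for t in order])  # -> ['t2', 't1', 't3'], i.e. high-99, high-01, low-99
```

Because the first tuple element is the level, aging raises a component within its level but never promotes a low-level component above a high-level one, matching the rule that the priority level remains unchanged.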
Optionally, the execution module may obtain priority information of each task component in the task queue when determining the priority order of each task component in the task queue; and determining the priority order of each task component in the task queue according to the priority information.
Optionally, the execution module may also determine the adjusted priority order of each task component in the task queue in response to a priority order adjustment instruction from the target object; that is, the user may manually adjust the priority order of the task components in real time to ensure that urgent tasks are executed first.
Task execution nodes are usually provided with independent thread pools and can consume a plurality of task components in parallel. To improve task scheduling efficiency, the execution module may dynamically adjust the concurrency number of each task execution node according to the usage state of system resources, so that system resources are fully utilized and the rate at which each task execution node consumes task components is improved.
As an optional implementation, when determining the target concurrency number of the task execution node corresponding to a task queue, the execution module may determine the maximum overall concurrency number of the plurality of task execution nodes according to the amount of computing resources available for data processing in the database service cluster, and determine the current actual overall concurrency number of the plurality of task execution nodes. Then, for the task execution node corresponding to the task queue, the execution module determines the maximum concurrency number of that node according to its schedulable resource amount, and determines the node's target concurrency number according to the maximum overall concurrency number and the actual overall concurrency number of the plurality of task execution nodes, where the target concurrency number does not exceed the node's maximum concurrency number.
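A minimal sketch of this calculation follows. The patent does not give concrete formulas, so the resource-to-slot divisions and the "remaining headroom, capped at the node maximum" rule below are assumptions:

```python
def target_concurrency(cluster_resources, per_task_cost,
                       actual_overall, node_schedulable, per_slot_cost):
    """Assumed interpretation of the concurrency rule:
    - the maximum overall concurrency derives from cluster compute
      resources divided by an assumed per-task cost;
    - a node's own maximum derives from its schedulable resources;
    - the node's target is the remaining overall headroom, never
      negative and never above the node's own maximum."""
    max_overall = cluster_resources // per_task_cost      # cluster-wide slot cap
    node_max = node_schedulable // per_slot_cost          # this node's cap
    headroom = max(max_overall - actual_overall, 0)       # unused overall slots
    return min(headroom, node_max)

# 64 units of cluster compute at 2 per task -> 32 overall slots;
# 20 already running leaves 12, but the node itself can only hold 8.
print(target_concurrency(64, 2, 20, 16, 2))  # -> 8
```

Recomputing this as queue depths and resource usage change is what lets the scheduler grow a node's concurrency when the cluster is idle and shrink it when the cluster is saturated.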
Optionally, the scheduling device for data processing tasks further includes a receiving module configured to, after the task execution node sends the plurality of task components in the task queue in parallel to the database service cluster for data processing, receive the plurality of data processing results fed back by the database service cluster and store each data processing result, categorized by its corresponding data processing job, in a target database to facilitate subsequent queries.
It should be noted that each module in the scheduling device for data processing tasks in the embodiment of the present application corresponds one-to-one to an implementation step of the scheduling method for data processing tasks in embodiment 1. Since these steps have been described in detail in embodiment 1, reference may be made to embodiment 1 for details not shown in this embodiment; they are not repeated here.
Embodiment 3
According to an embodiment of the present application, there is further provided a nonvolatile storage medium that includes a stored computer program, where a device in which the nonvolatile storage medium is located executes the scheduling method of the data processing task in embodiment 1 by running the computer program.
Specifically, the device in which the nonvolatile storage medium resides performs the following steps by running the computer program: acquiring job information of a plurality of data processing jobs, where each data processing job corresponds to a plurality of pipelines, each pipeline includes a plurality of task components, and the job information includes directed acyclic graph information between the pipelines and the task components; for each data processing job, sequentially sending the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job; and for each task queue, determining the priority order of each task component in the task queue, determining the target concurrency number of the task execution node corresponding to the task queue, and sending, by the task execution node, the plurality of task components in the task queue in parallel to the database service cluster for data processing according to the priority order and the target concurrency number.
According to an embodiment of the present application, there is further provided a processor, configured to execute a computer program, where the computer program executes the scheduling method of the data processing task in embodiment 1.
Specifically, the computer program, when run, performs the following steps: acquiring job information of a plurality of data processing jobs, where each data processing job corresponds to a plurality of pipelines, each pipeline includes a plurality of task components, and the job information includes directed acyclic graph information between the pipelines and the task components; for each data processing job, sequentially sending the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job; and for each task queue, determining the priority order of each task component in the task queue, determining the target concurrency number of the task execution node corresponding to the task queue, and sending, by the task execution node, the plurality of task components in the task queue in parallel to the database service cluster for data processing according to the priority order and the target concurrency number.
According to an embodiment of the present application, there is also provided an electronic device including: a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the scheduling method of the data processing task in embodiment 1 by the computer program.
In particular, the processor is configured to implement the following steps through execution of the computer program: acquiring job information of a plurality of data processing jobs, where each data processing job corresponds to a plurality of pipelines, each pipeline includes a plurality of task components, and the job information includes directed acyclic graph information between the pipelines and the task components; for each data processing job, sequentially sending the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job; and for each task queue, determining the priority order of each task component in the task queue, determining the target concurrency number of the task execution node corresponding to the task queue, and sending, by the task execution node, the plurality of task components in the task queue in parallel to the database service cluster for data processing according to the priority order and the target concurrency number.
The foregoing embodiment numbers are merely for the purpose of description and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, each embodiment is described with its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of units may be a logical function division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those of ordinary skill in the art may make several modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations shall also fall within the scope of protection of the present application.

Claims (10)

1. A scheduling method for data processing tasks, characterized by comprising the following steps:
acquiring job information of a plurality of data processing jobs, wherein each data processing job corresponds to a plurality of pipelines, each pipeline comprises a plurality of task components, and the job information comprises directed acyclic graph information between the pipeline and the task components;
for each data processing job, sequentially sending a plurality of task components corresponding to the data processing job to a task queue corresponding to the data processing job according to job information corresponding to the data processing job;
for each task queue, determining the priority order of each task component in the task queue, determining the target concurrency number of the task execution node corresponding to the task queue, and sending, by the task execution node, the plurality of task components in the task queue in parallel to a database service cluster for data processing according to the priority order and the target concurrency number.
2. The method of claim 1, wherein acquiring job information for a plurality of data processing jobs comprises:
and polling the first instance node in a load balancing mode, and respectively acquiring the job information of a plurality of data processing jobs in the first instance node through a plurality of independent task distribution nodes.
3. The method of claim 1, wherein the job information further includes tenant information corresponding to the data processing job, and sequentially sending the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job, including:
for each data processing job, determining a tenant type according to tenant information corresponding to the data processing job;
when the tenant type is an independent tenant, sequentially sending a plurality of task components corresponding to the data processing job to an exclusive task queue corresponding to the data processing job;
and when the tenant type is a non-independent tenant, sequentially sending the task components corresponding to the data processing job to a public task queue.
4. The method of claim 3, wherein sequentially sending the plurality of task components corresponding to the data processing job to the task queue corresponding to the data processing job according to the job information corresponding to the data processing job, comprises:
for each data processing job, determining an upstream-downstream relationship between a plurality of pipelines corresponding to the data processing job and an upstream-downstream relationship between a plurality of task components in each pipeline according to the directed acyclic graph information of the data processing job;
for each pipeline, only when all task components in an upstream pipeline of the pipeline finish data processing, starting task queue distribution on the task components in the pipeline;
for each task component, only when the upstream task component of the task component completes data processing, task queue distribution on the task component is started again.
5. The method of claim 1, wherein determining the order of priority of the task components in the task queue comprises:
the method comprises the steps of obtaining priority information of each task component in the task queue, wherein the priority information comprises priority levels and priority digital identifications, and the priority levels comprise: high, medium and low, when the priority levels are the same, the larger the priority digital mark is, the higher the priority is, and the priority digital mark is larger along with the increase of the waiting time length of the task component in the task queue; determining the priority order of each task component in the task queue according to the priority information; or alternatively, the first and second heat exchangers may be,
and determining the adjusted priority order of each task component in the task queue in response to the priority order adjustment instruction of the target object.
6. The method of claim 1, wherein determining a target concurrency number of task execution nodes corresponding to the task queues comprises:
determining the maximum overall concurrency quantity of a plurality of task execution nodes according to the calculation resource quantity for data processing in the database service cluster;
determining the current actual overall concurrency quantity of the plurality of task execution nodes;
for the task execution node corresponding to the task queue, determining a maximum concurrency amount of the task execution node according to the schedulable resource amount of the task execution node, and determining the target concurrency amount of the task execution node according to the maximum overall concurrency amount and the actual overall concurrency amount of the plurality of task execution nodes, wherein the target concurrency amount does not exceed the maximum concurrency amount.
7. The method of claim 1, wherein after the plurality of task components in the task queue are concurrently sent to a database service cluster for data processing by the task execution node in accordance with the priority order and the target concurrency number, the method further comprises:
and receiving a plurality of data processing results fed back by the database service cluster, and classifying and storing each data processing result into a target database according to the data processing operation corresponding to the data processing result.
8. The method according to claim 2, wherein the method further comprises:
and when the first instance node fails, determining a second instance node associated with the first instance node, and continuing to poll the second instance node, wherein incomplete data processing jobs in the first instance node are all transferred to the second instance node.
9. A scheduling apparatus for data processing tasks, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring job information of a plurality of data processing jobs, each data processing job corresponds to a plurality of pipelines, each pipeline comprises a plurality of task components, and the job information comprises directed acyclic graph information between the pipeline and the task components;
the distribution module is used for sequentially sending a plurality of task components corresponding to the data processing operation to a task queue corresponding to the data processing operation according to the operation information corresponding to the data processing operation for each data processing operation;
and the execution module is used for, for each task queue, determining the priority order of each task component in the task queue, determining the target concurrency number of the task execution node corresponding to the task queue, and sending, by the task execution node, the plurality of task components in the task queue in parallel to a database service cluster for data processing according to the priority order and the target concurrency number.
10. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored computer program, wherein a device in which the non-volatile storage medium is located performs the scheduling method of the data processing task according to any one of claims 1 to 8 by running the computer program.
CN202311613202.9A 2023-11-28 2023-11-28 Scheduling method and device for data processing task Pending CN117742907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311613202.9A CN117742907A (en) 2023-11-28 2023-11-28 Scheduling method and device for data processing task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311613202.9A CN117742907A (en) 2023-11-28 2023-11-28 Scheduling method and device for data processing task

Publications (1)

Publication Number Publication Date
CN117742907A true CN117742907A (en) 2024-03-22

Family

ID=90255319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311613202.9A Pending CN117742907A (en) 2023-11-28 2023-11-28 Scheduling method and device for data processing task

Country Status (1)

Country Link
CN (1) CN117742907A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination