CN115202876A - Task processing method and system based on ETL server and electronic equipment - Google Patents


Info

Publication number
CN115202876A
CN115202876A
Authority
CN
China
Prior art keywords
processed
task
subtasks
target
subtask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210809383.1A
Other languages
Chinese (zh)
Inventor
肖识战
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Medical Lijie Shanghai Information Technology Co ltd
Original Assignee
Medical Lijie Shanghai Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Medical Lijie Shanghai Information Technology Co ltd filed Critical Medical Lijie Shanghai Information Technology Co ltd
Priority to CN202210809383.1A
Publication of CN115202876A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The application provides a task processing method, a task processing system, and an electronic device based on an ETL server, relating to the technical field of data computation. The method comprises: obtaining a task processing request for a target task, and retrieving the user requirement and the task processing flow chart of the target task based on the request; identifying a plurality of task nodes in the task processing flow chart, and determining the execution order and interrelations of a plurality of to-be-processed subtasks based on the positions of the task nodes; grading the to-be-processed subtasks according to their execution order and interrelations to obtain a grading result; predicting the execution time and execution cost of each to-be-processed subtask based on a load prediction model, and determining a target optimization scheme in combination with the grading result and the user requirement; and distributing the to-be-processed subtasks to a plurality of ETL servers for task processing based on the target optimization scheme, thereby improving the processing efficiency of the target task.

Description

Task processing method and system based on ETL server and electronic equipment
Technical Field
The invention relates to the technical field of data calculation, in particular to a task processing method and system based on an ETL server and electronic equipment.
Background
A conventional ETL tool performs its work on the local ETL machine and can access multiple local or remote source systems. At runtime, the ETL tool connects over the network to a database, which may be hosted on a remote machine, extracts the data to the local machine, transforms the data locally, and loads the transformed data into a target database located on another remote machine in the network. In a large enterprise, there may be hundreds of such ETL tools running jobs in parallel. ETL, however, involves a large number of scheduled task jobs, and effectively managing these schedules to improve ETL execution efficiency is key to improving overall data processing capability.
When the source or target data resides in some other location (e.g., a cloud system), it may take a significant amount of time to extract the data to a local machine and then load the converted data back to the cloud system. This increases completion time and also increases the load of ETL jobs executed on the local machine, which may lead to network timeouts, job crashes due to insufficient memory, ETL jobs hanging or running forever, network congestion, and the like. If the efficiency of multi-task ETL scheduling is instead improved by adding hardware in exchange for greater processing capacity, there are the further problems of increased hardware overhead and increased cost.
Therefore, a task processing method, a task processing system, and an electronic device based on an ETL server are provided.
Disclosure of Invention
The specification provides a task processing method, a task processing system, and an electronic device based on an ETL server. A grading result is determined from a task processing flow chart, a target optimization scheme is determined in combination with the user requirement, and the to-be-processed subtasks are distributed to a plurality of ETL servers for task processing based on the target optimization scheme, so as to improve the processing efficiency of the target task.
The task processing method based on the ETL server adopts the following technical scheme. The method comprises the following steps:
acquiring a task processing request of a target task, and calling a user requirement and a task processing flow chart of the target task based on the target task request;
identifying a plurality of task nodes of the task processing flow chart, and determining the execution sequence and the mutual relation of a plurality of subtasks to be processed based on the positions of the task nodes;
grading the subtasks to be processed according to the execution sequence and the mutual relation of the subtasks to be processed to obtain a grading result;
predicting the execution time and the execution cost of each subtask to be processed based on a load prediction model, and determining a target optimization scheme by combining the grading result of the subtask to be processed and the user requirement;
and distributing the to-be-processed subtasks to a plurality of ETL servers for task processing based on the target optimization scheme, so that the processing efficiency of the target tasks is improved.
Optionally, the determining the execution sequence and the interrelation of the multiple to-be-processed subtasks based on the positions of the task nodes includes:
wherein one to-be-processed subtask comprises at least one task node;
and determining the execution sequence of the subtasks to be processed based on the sequence of the task nodes.
Optionally, the step of grading the to-be-processed subtasks according to their execution sequence and interrelations to obtain a grading result includes:
according to the execution sequence of the to-be-processed subtasks, the main priorities of the to-be-processed subtasks are reduced in turn;
and according to the magnitude of the interrelations of the to-be-processed subtasks, the secondary priorities of the to-be-processed subtasks having the same main priority are reduced in turn.
Optionally, the predicting, based on the load prediction model, the execution time and the execution cost of each to-be-processed subtask, and determining, by combining the classification result of the to-be-processed subtask and the user requirement, a target optimization scheme, includes:
predicting the predicted execution time and the predicted execution cost of each to-be-processed subtask based on a load prediction model;
determining a plurality of task optimization schemes based on a load prediction model and combined with the predicted execution time, the predicted execution cost and the classification result of the subtasks to be processed, wherein the task optimization schemes comprise a minimum time scheme, a minimum cost scheme, a minimum time scheme within the maximum cost and a minimum cost scheme within the maximum time;
determining the target optimization plan from the plurality of task optimization plans based on the user demand.
Optionally, the task nodes include an output task node and an input task node;
if the to-be-processed subtask only comprises an output task node, the to-be-processed subtask is an initial subtask;
and if the to-be-processed subtask only comprises the input task node, the to-be-processed subtask is the final subtask.
Optionally, the starting subtask includes a plurality of contents to be processed;
based on the source of the content to be processed, searching the ETL server in the same network as the source of the content to be processed as a first ETL server;
and preferentially assigning the content to be processed to the first ETL server for processing.
The task processing system based on the ETL server adopts the following technical scheme. The system comprises:
the acquisition module is used for acquiring a task processing request of a target task and calling a user requirement and a task processing flow chart of the target task based on the target task request;
the identification module is used for identifying a plurality of task nodes of the task processing flow chart and determining the execution sequence and the mutual relation of a plurality of subtasks to be processed based on the positions of the task nodes;
the classification module is used for classifying the subtasks to be processed according to the execution sequence and the mutual relation of the subtasks to be processed to obtain a classification result;
the scheme determining module is used for predicting the execution time and the execution cost of each subtask to be processed based on a load prediction model, and determining a target optimization scheme by combining the grading result of the subtask to be processed and the user requirement;
and the distribution module is used for distributing the subtasks to be processed to a plurality of ETL servers for task processing based on the target optimization scheme, so that the processing efficiency of the target tasks is improved.
Optionally, the identification module includes:
the identification submodule is used for identifying the task processing flow chart and acquiring a plurality of task nodes;
the association submodule is used for determining the execution sequence and the mutual relation of a plurality of subtasks to be processed based on the positions of the task nodes;
the association submodule includes:
the subtask construction unit is used for enabling the subtask to be processed to comprise at least one task node;
and the sequence determining unit is used for determining the execution sequence of the subtasks to be processed based on the sequence of the task nodes.
Optionally, the grading module includes:
the main priority judging submodule is used for sequentially reducing the main priority of the subtasks to be processed according to the execution sequence of the subtasks to be processed;
and the secondary priority judging submodule is used for sequentially reducing the secondary priority of the to-be-processed subtasks with the same main priority according to the size of the mutual relation of the to-be-processed subtasks.
Optionally, the scheme determining module includes:
the prediction sub-module is used for predicting the predicted execution time and the predicted execution cost of each to-be-processed subtask based on a load prediction model;
the scheme summarizing submodule is used for determining a plurality of task optimization schemes based on a load prediction model in combination with the predicted execution time, the predicted execution cost and the grading result of the subtasks to be processed, wherein the task optimization schemes comprise a minimum time scheme, a minimum cost scheme, a minimum time scheme within the maximum cost and a minimum cost scheme within the maximum time;
and the scheme determining submodule is used for determining the target optimization scheme from the plurality of task optimization schemes based on the user requirements.
Optionally, the task nodes include an output task node and an input task node;
if the to-be-processed subtask only comprises an output task node, the to-be-processed subtask is an initial subtask;
and if the to-be-processed subtask only comprises the input task node, the to-be-processed subtask is the final subtask.
Optionally, the starting subtask includes a plurality of contents to be processed;
based on the source of the content to be processed, searching the ETL server in the same network as the source of the content to be processed as a first ETL server;
and preferentially assigning the content to be processed to the first ETL server for processing.
The present specification also provides an electronic device, wherein the electronic device includes:
a processor; and
a memory storing computer-executable instructions that, when executed, cause the processor to perform any of the methods described above.
The present specification also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement any of the methods described above.
According to the method, a task processing request for a target task is obtained, and the user requirement and the task processing flow chart of the target task are retrieved based on the request; a plurality of task nodes in the task processing flow chart are identified, and the execution order and interrelations of a plurality of to-be-processed subtasks are determined based on the positions of the task nodes; the to-be-processed subtasks are graded according to their execution order and interrelations to obtain a grading result; the execution time and execution cost of each to-be-processed subtask are predicted based on a load prediction model, and a target optimization scheme is determined in combination with the grading result and the user requirement; and the to-be-processed subtasks are distributed to a plurality of ETL servers for task processing based on the target optimization scheme, so that data extraction efficiency is improved, the processing efficiency of the target task is improved, and the risk of network congestion is reduced.
Drawings
Fig. 1 is a schematic diagram of a task processing method based on an ETL server according to an embodiment of the present disclosure;
Fig. 2 is a task processing flow chart of a task processing method based on an ETL server according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a task processing system based on an ETL server according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification;
Fig. 5 is a schematic diagram of a computer-readable medium provided in an embodiment of the present specification.
Detailed Description
The following description is presented to disclose the invention so as to enable any person skilled in the art to practice the invention. The preferred embodiments described below are by way of example only, and other obvious variations will occur to those skilled in the art. The basic principles of the invention, as defined in the following description, may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept to those skilled in the art. The same reference numerals denote the same or similar elements, components, or parts in the drawings, and thus their repetitive description will be omitted.
Features, structures, characteristics or other details described in a particular embodiment do not preclude the fact that the features, structures, characteristics or other details may be combined in a suitable manner in one or more other embodiments in accordance with the technical idea of the invention.
In describing particular embodiments, the present invention has been described with reference to features, structures, characteristics or other details that are within the purview of one skilled in the art to provide a thorough understanding of the embodiments. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific features, structures, characteristics, or other details.
The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The term "and/or" includes any and all combinations of one or more of the associated listed items.
Fig. 1 is a schematic diagram of a task processing method based on an ETL server according to an embodiment of the present disclosure, where the method includes:
s1, acquiring a task processing request of a target task, and calling a user requirement and a task processing flow chart of the target task based on the target task request;
s2, identifying a plurality of task nodes of the task processing flow chart, and determining the execution sequence and the mutual relation of a plurality of subtasks to be processed based on the positions of the task nodes;
s3, grading the subtasks to be processed according to the execution sequence and the mutual relation of the subtasks to be processed to obtain a grading result;
s4, predicting the execution time and the execution cost of each to-be-processed subtask based on a load prediction model, and determining a target optimization scheme by combining the grading result of the to-be-processed subtask and the user requirement;
and S5, distributing the subtasks to be processed to a plurality of ETL servers for task processing based on the target optimization scheme, so that the processing efficiency of the target tasks is improved.
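Steps S1 to S5 can be sketched as a thin driver that wires the stages together. All function names below are hypothetical stand-ins for the modules described later, not the patent's actual implementation; each stage is injected as a callable so the trivial stubs can be swapped for real logic:

```python
# Thin driver wiring steps S1-S5 together. Each stage is injected as a
# callable; the lambdas below are trivial hypothetical stand-ins.

def process_task(request, fetch, identify, grade, optimize, dispatch):
    requirement, flow_chart = fetch(request)            # S1: request -> inputs
    subtasks, order, relations = identify(flow_chart)   # S2: nodes -> subtasks
    grading = grade(order, relations)                   # S3: priority grading
    plan = optimize(subtasks, grading, requirement)     # S4: pick a scheme
    return dispatch(subtasks, plan)                     # S5: send to servers

result = process_task(
    "target-task",
    fetch=lambda r: ("minimum time", {"A": ["B"], "B": []}),
    identify=lambda fc: (list(fc), list(fc), {}),
    grade=lambda order, rel: {n: i for i, n in enumerate(order)},
    optimize=lambda st, g, req: {"scheme": req,
                                 "assign": {n: "etl-1" for n in st}},
    dispatch=lambda st, plan: plan["assign"],
)
```

With these stubs, `result` maps every subtask to the single server `etl-1`; in a real system each stage would implement the corresponding step described below.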
In the process of extracting data from one or more databases or non-database systems (source data sources), converting the extracted data according to business logic, and loading the converted data into one or more databases or non-database systems (target data sources), the amount of data to be queried is huge. If the ETL servers allocated to the different subtasks are unreasonable, some ETL servers may run continuously while others are barely used, which reduces data processing efficiency and prolongs task processing time. Therefore, a target optimization scheme is obtained based on the user requirement and the grading result, and ETL server usage is allocated reasonably so as to optimize the task processing process.
Specifically, in step S1, a task processing request of a target task is obtained, and a task processing flowchart of a user requirement and the target task is called based on the target task request.
The task processing flow chart is used for representing the task processing flow. In one embodiment of the present specification, the task processing flow chart is a data flow chart, that is, from the perspective of data transfer and processing, the logical functions of the system, the logical flow direction of data inside the system and the logical transformation process are expressed in a graphic manner. The data flow diagram defines data flow and transmission direction, a target data source and a source data source, wherein the source data source is used for storing original data to be processed, and the target data source is used for storing data after conversion is completed.
In another embodiment of the present specification, the task processing flowchart may also be in other forms such as a basic flowchart, and based on task processing steps in the task processing flowchart, the task processing flowchart is adjusted to be a data flow diagram, so as to facilitate later data processing.
The user requirements may be pre-configured or individually configured based on a single target task request.
In an embodiment of the present specification, the task processing flowchart obtained is shown in fig. 2, and the target task is to obtain, from the respective clinic data, the numbers of male patient visits and female patient visits over the last year.
S2, identifying a plurality of task nodes of the task processing flow chart, and determining the execution sequence and the mutual relation of a plurality of subtasks to be processed based on the positions of the task nodes;
Based on the task processing flow chart, the target task is decomposed into a plurality of working segments, or the target task is decomposed into a plurality of to-be-processed subtasks at the task nodes by minimum cutting, so as to optimize the number of ETL servers allocated. The task processing flow chart comprises a plurality of task nodes, and the task nodes comprise output task nodes and input task nodes. One to-be-processed subtask includes at least one task node.
If a to-be-processed subtask comprises only an output task node, it is a starting subtask; if it comprises only an input task node, it is a final subtask. A starting subtask indicates a source data source that needs to be called; a final subtask loads its result into a target data source.
In one embodiment of the present description, the target task includes at least one starting subtask and at least one final subtask. As shown in fig. 2, which is a task processing flowchart, A1 represents medical clinic data, A2 represents surgical clinic data, A3 represents pediatric clinic data, B1 represents a first-type formatting of the extracted data, B2 represents physical examination center data, B3 represents a second-type formatting of the extracted data, C1 represents collective screening of the extracted data, D1 represents loading the number of male patient visits in the last year, and D2 represents loading the number of female patient visits in the last year. A1, A2, A3, and B2 have only output task nodes, so the to-be-processed subtasks A1, A2, A3, and B2 are starting subtasks; D1 and D2 have only input task nodes, so the to-be-processed subtasks D1 and D2 are final subtasks.
The to-be-processed subtasks are determined based on the task nodes, and the execution order of the subtasks is determined according to the data transfer direction between adjacent subtasks. The interrelations of the to-be-processed subtasks are determined based on the input task nodes of the same subtask.
In an embodiment of the present specification, the execution order of the to-be-processed subtasks is A, B, C, D in sequence. A1, A2, and B1 are interrelated; A3 and B3 are interrelated; B1, B2, and B3 are related to C1; and C1 is related to both D1 and D2.
In another embodiment of the present specification, the target task is divided into a plurality of working segments: A1-B1-C1-D1; A1-B1-C1-D2; A2-B1-C1-D1; A2-B1-C1-D2; B2-C1-D1; B2-C1-D2; A3-B3-C1-D1; and A3-B3-C1-D2. The interrelations determined from these working segments are the same as in the preceding embodiment.
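The decomposition above can be reproduced with a small sketch: the flow chart of fig. 2 is modelled as a directed graph, starting and final subtasks are read off the node types, and the working segments are enumerated as start-to-final paths. The graph literal encodes fig. 2 as described in the text; the helper functions are illustrative assumptions:

```python
# Fig. 2 modelled as a directed graph: subtask -> downstream subtasks.
FLOW = {
    "A1": ["B1"], "A2": ["B1"], "A3": ["B3"], "B2": ["C1"],
    "B1": ["C1"], "B3": ["C1"], "C1": ["D1", "D2"], "D1": [], "D2": [],
}

def start_and_final(dag):
    """Starting subtasks have no predecessors (only output nodes);
    final subtasks have no successors (only input nodes)."""
    targets = {s for succs in dag.values() for s in succs}
    starts = sorted(n for n in dag if n not in targets)
    finals = sorted(n for n, succs in dag.items() if not succs)
    return starts, finals

def segments(dag, node):
    """Enumerate working segments: every path from node to a final subtask."""
    if not dag[node]:
        return [[node]]
    return [[node] + rest for nxt in dag[node] for rest in segments(dag, nxt)]

starts, finals = start_and_final(FLOW)
all_segments = [seg for s in starts for seg in segments(FLOW, s)]
```

For fig. 2 this yields the four starting subtasks A1, A2, A3, B2, the two final subtasks D1, D2, and exactly the eight working segments listed in the text.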
S3, grading the subtasks to be processed according to the execution sequence and the mutual relation of the subtasks to be processed to obtain a grading result;
According to the execution order of the to-be-processed subtasks, the main priorities of the to-be-processed subtasks are reduced in turn; that is, the earlier a to-be-processed subtask's execution order, the higher its main priority.
According to the magnitude of the interrelations of the to-be-processed subtasks, the secondary priorities of the to-be-processed subtasks having the same main priority are reduced in turn; that is, among to-be-processed subtasks with the same main priority, the greater a subtask's interrelation, the higher its secondary priority.
In an embodiment of the present specification, based on the execution order of the to-be-processed subtasks, the main priority order is determined to be A > B > C > D. Based on the interrelations of the to-be-processed subtasks, among A1, A2, and A3 the secondary priority is A1 = A2 = A3; among B1, B2, and B3 the secondary priority is B1 > B3 > B2; and between D1 and D2 the secondary priority is D1 = D2.
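The grading of the fig. 2 example can be sketched as follows. Main priority follows the execution stage (A > B > C > D); secondary priority ranks subtasks inside a stage by how many subtasks feed into them. Encoding a grade as a comparable tuple (smaller means higher priority) is an assumption of this sketch, not the patent's encoding:

```python
# Grading sketch for the fig. 2 example.
STAGES = ["A", "B", "C", "D"]

# Predecessor counts read off fig. 2 (e.g. B1 receives A1 and A2).
PREDECESSORS = {"A1": 0, "A2": 0, "A3": 0, "B1": 2, "B2": 0, "B3": 1,
                "C1": 3, "D1": 1, "D2": 1}

def grade(subtasks):
    result = {}
    for name in subtasks:
        main = STAGES.index(name[0])        # earlier stage -> higher priority
        result[name] = (main, -PREDECESSORS[name])  # more relations -> higher
    return result

grading = grade(PREDECESSORS)
ranked = sorted(PREDECESSORS, key=lambda n: grading[n])
```

This reproduces the grading in the text: A1 = A2 = A3, B1 > B3 > B2, and D1 = D2.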
S4, predicting the execution time and the execution cost of each to-be-processed subtask based on a load prediction model, and determining a target optimization scheme by combining the grading result of the to-be-processed subtask and the user requirement;
A load prediction model is created. Based on the collected historical processing tasks of the ETL servers and the load information recorded in those historical tasks, the model is periodically trained so as to predict the execution time and execution cost required by each ETL server to execute a to-be-processed subtask.
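Such a load prediction model can be approximated by any regressor trained on historical (load features, observed time) pairs. The sketch below uses a plain least-squares fit of execution time against a single feature (data volume); both the single feature and the linear form are deliberate simplifications of the periodically retrained model the text describes, and the history values are hypothetical:

```python
# Least-squares sketch of the load prediction model.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical history: (rows processed, seconds taken) on one server.
history = [(100, 12.0), (200, 22.0), (400, 42.0)]
a, b = fit_linear([h[0] for h in history], [h[1] for h in history])

def predict_time(rows):
    """Predicted execution time for a subtask of the given data volume."""
    return a * rows + b
```

Retraining then amounts to re-running the fit whenever new historical load records are collected.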
Determining a plurality of task optimization schemes based on a load prediction model and combined with the predicted execution time, the predicted execution cost and the classification result of the subtasks to be processed, wherein the task optimization schemes comprise a minimum time scheme, a minimum cost scheme, a minimum time scheme within the maximum cost and a minimum cost scheme within the maximum time;
In an embodiment of the present specification, the minimum time scheme refers to the scheme with the shortest execution time. Specifically, given M tasks Task = {T_1, T_2, T_3, ..., T_M} that need to be executed and N available resource nodes Worker = {W_1, W_2, W_3, ..., W_N} (note that, in general, N < M), the M tasks are distributed to the N available resource nodes for processing. Suppose each task T_j has processing time Time(i, j) on node W_i; the minimum time algorithm assigns each task, through a suitable allocation, to the resource with the shortest execution time, thereby ensuring the shortest total execution time, i.e., Time(i, j) → min.
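The rule Time(i, j) → min can be sketched greedily: each task T_j goes to the worker W_i with the smallest predicted Time(i, j). A true makespan optimum would require a search over whole assignments; this greedy form is the simplification implied by the formula, and the time matrix below is hypothetical:

```python
# Greedy sketch of the minimum time scheme.

def min_time_assign(time):
    """time[i][j] = predicted time of task j on worker i.
    Returns a map: task index -> chosen worker index."""
    n_workers, n_tasks = len(time), len(time[0])
    return {j: min(range(n_workers), key=lambda i: time[i][j])
            for j in range(n_tasks)}

# Hypothetical matrix: 2 workers (rows) x 3 tasks (columns).
TIME = [[3, 9, 2],
        [5, 4, 6]]
plan = min_time_assign(TIME)
```

Here tasks 0 and 2 land on worker 0 and task 1 on worker 1, each at its per-task minimum time.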
In an embodiment of the present specification, the minimum cost scheme refers to the following: given g resource nodes (Workers) and n task ids with n ≤ g, each node is assigned one id (the assigned ids are all different). Each node pays a different change cost for executing a given id, and an optimal id assignment scheme must be found so that the sum of the overall costs is minimized.
Specifically, as shown in Table 1, 4 workers (H_1 to H_4) and 3 task ids are given, together with a value matrix A whose entries give the cost of assigning id_i to H_j. (Table 1 appears only as an image in the original publication; its entries are not reproduced here.)
For example, if H_1 is assigned id_1, the value 4 is retained but a change cost of 3 is paid.
Each of H_1 to H_4 must be assigned one of id_1 to id_3 or id_4 (a new id); the goal is to minimize the overall change cost.
The optimal allocation in the example is:
H_1 <- id_2;
H_2 <- new id;
H_3 <- id_3;
H_4 <- id_1;
and the corresponding change cost = 8 (4 + 1 + 2 + 1).
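A minimum-cost assignment of this kind can be sketched by brute force over the ways to give the n existing ids to n of the g workers, charging leftover workers a fixed new-id cost. The cost matrix and new-id cost below are hypothetical (the original value table is only available as an image); for matrices of realistic size the Hungarian algorithm would replace the brute force:

```python
from itertools import permutations

# Brute-force sketch of the minimum cost scheme: g workers, n ids, n <= g.
NEW_ID_COST = 1  # assumed fixed cost of issuing a new id

def min_cost_assign(cost):
    """cost[w][i] = change cost of giving id i to worker w.
    Returns (best total cost, tuple of chosen workers, one per id)."""
    g, n = len(cost), len(cost[0])
    best, best_plan = float("inf"), None
    for perm in permutations(range(g), n):   # worker chosen for each id
        total = sum(cost[w][i] for i, w in enumerate(perm))
        total += (g - n) * NEW_ID_COST       # leftover workers get new ids
        if total < best:
            best, best_plan = total, perm
    return best, best_plan

COST = [[4, 2, 5],   # H_1
        [6, 7, 8],   # H_2
        [3, 5, 2],   # H_3
        [1, 4, 6]]   # H_4
best_cost, plan = min_cost_assign(COST)
```

With this hypothetical matrix the optimum gives id_1 to H_4, id_2 to H_1, id_3 to H_3, and a new id to H_2, for a total change cost of 6.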
In an embodiment of the present specification, the minimum time scheme within the maximum cost refers to the optimal execution path obtained by keeping, among the plurality of predicted execution plans, those whose predicted execution cost is less than or equal to the set maximum cost, and then selecting the plan with the shortest execution time.
In one embodiment of the present specification, the minimum cost scheme within the maximum time refers to the optimal execution path obtained by keeping, among the plurality of predicted execution plans, those whose predicted execution time is less than or equal to the set maximum time, and then selecting the plan with the lowest execution cost. The target optimization scheme is determined from the plurality of task optimization schemes based on the user requirement.
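Both constrained schemes follow the same filter-then-minimize shape: restrict the candidate plans by the budgeted dimension, then minimize the other. The plan tuples below are hypothetical illustrations:

```python
# Selection sketch for the constrained schemes.
# Each plan is (name, predicted_time, predicted_cost).

def min_time_within_cost(plans, max_cost):
    """Keep plans within the cost budget, then take the fastest."""
    feasible = [p for p in plans if p[2] <= max_cost]
    return min(feasible, key=lambda p: p[1]) if feasible else None

def min_cost_within_time(plans, max_time):
    """Keep plans within the time budget, then take the cheapest."""
    feasible = [p for p in plans if p[1] <= max_time]
    return min(feasible, key=lambda p: p[2]) if feasible else None

PLANS = [("fast-costly", 10, 90), ("balanced", 25, 40), ("slow-cheap", 60, 15)]
```

For example, with a cost budget of 50 the fast-costly plan is excluded and the balanced plan wins on time; with no binding budget the unconstrained minimum-time or minimum-cost plan is returned.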
In one embodiment of the present description, the task optimization schemes further include an optimal utilization scheme. When the target optimization scheme is the optimal utilization scheme, the corresponding user requirement is to maximize the computing power of the hardware on the local network and the remote network; in this case, the target optimization scheme spreads work across a plurality of ETL servers by providing parallel input/output and conversion.
In one embodiment of the present specification, a data source index is obtained by collecting data source metrics from all participating data sources based on a DMA (direct memory access) process; the data source index includes data source mappings, data table statistics, and the like.
In another embodiment of the present description, the workload is distributed to a plurality of different ETL servers based on the time- and cost-based computation criteria described above. First, a centralized data source registry is maintained that stores metadata about the data sources (including source and target data sources) and the tables and files they store.
A "virtual" local data source table name ("virtual table name") and a new data source are created, and the virtual table name is mapped to one or more tables or table partitions on the remote system. The term "table" is used herein to also include "table partitions". For example, a table may be partitioned into 4 partitions that reside on different source systems.
For data extraction and processing involving all four partitions, the candidate execution plan may generate additional source-system stages and a "partition aggregation" stage to account for all distributed table partitions and to combine the loaded partitions for the subsequent processing stages in the task.
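The virtual-table mapping and the per-source plus aggregation stages described above can be sketched as follows; the registry structure, table names, and server names are hypothetical.

```python
# Virtual local table name -> remote table partitions on different sources.
registry = {
    "v_orders": [
        {"server": "etl-a", "table": "orders_p1"},
        {"server": "etl-b", "table": "orders_p2"},
        {"server": "etl-b", "table": "orders_p3"},
        {"server": "etl-c", "table": "orders_p4"},
    ],
}

def plan_stages(virtual_table):
    """One extraction stage per partition, then a combining stage."""
    parts = registry[virtual_table]
    stages = [("extract", p["server"], p["table"]) for p in parts]
    # "Partition aggregation" merges the loaded partitions for later stages.
    stages.append(("partition_aggregation", virtual_table, len(parts)))
    return stages
```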
In one embodiment of the present description, the contents of the data source or workload map are automatically generated based on the location and server load information obtained by the servers previously participating in the ETL. The automatically generated data source or workload map may also be manually updated by the user to account for environmental changes and/or operational preferences. Based on the data source/workload mapping table, the data source and connection information in the original job may be replaced to redirect the workload to a specified ETL server for processing. Of course, the ETL server, the source data source machine, and the target data source machine may also be manually added to the data source registry.
ETL jobs provide connectivity, data manipulation functionality, and highly scalable processing. Generally, in an ETL job, data is extracted from a data source, transformed, and then loaded into a target data store. In these embodiments, the computing power of hardware on local and remote machines is maximized by providing parallel I/O and transformation through parallel ETL server processing based on the target optimization scheme.
In another embodiment of the present specification, the task optimization schemes further include an optimal balancing scheme. When the target optimization scheme is the optimal balancing scheme, the corresponding user requirement is to incorporate ETL servers located in separate sub-networks of the local network as well as ETL servers in the remote network. In one embodiment of the present description, in a distributed network or an isolated local sub-network, certain data may only be accessed by a particular ETL server, so the balancing optimization stops at the boundaries of the computing environment defined by the accessibility of the underlying data.
Specifically, the starting subtask includes a plurality of contents to be processed, which are all or part of the to-be-processed data acquired from one source data source. Based on the source of each content to be processed, an ETL server located in the same network as that content is found and taken as the first ETL server, and the first ETL server is preferentially assigned to process the content, so as to reduce data movement and redundancy and thereby balance the workload of the ETL servers. In an embodiment of the present specification, the ETL server retrieves content to be processed that resides in the same local network as itself, which avoids inconsistent data processing across remote networks and reduces data processing risk.
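A minimal sketch of this locality-first allocation, assuming simple dictionaries for content and server records (all field names are illustrative, not from the specification):

```python
def assign(contents, servers):
    """Assign each content to an ETL server, preferring the same network.

    contents: [{"id", "network"}]; servers: [{"name", "network", "load"}].
    A server in the content's own network (the "first ETL server") is
    preferred; otherwise the least-loaded server anywhere is used.
    """
    assignment = {}
    for c in contents:
        local = [s for s in servers if s["network"] == c["network"]]
        pool = local if local else servers
        chosen = min(pool, key=lambda s: s["load"])
        chosen["load"] += 1  # crude balancing of server workload
        assignment[c["id"]] = chosen["name"]
    return assignment
```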
And S5, distributing the subtasks to be processed to a plurality of ETL servers for task processing based on the target optimization scheme, and improving the processing efficiency of the target tasks.
Through distributed balanced optimization, the entire ETL job is divided into one or more optimized job segments based on the accessibility of the data, the minimum total job execution cost, and/or the minimum total job execution time. These job segments are submitted to their corresponding ETL servers through a distributed work orchestration mechanism, and the ETL job is executed as a whole, improving the processing efficiency of the target task.
In one embodiment of the present description, an "ETL network stage" is constructed based on an ETL server as a network stage that is either an input stage (a stage with one output link but no input link) or an output stage (a stage with one input link but no output link). For the output stage, the ETL network stage sends data and control signals to the ETL servers of other network stages; for the input stage, the ETL network stage receives data and control signals from the ETL servers of other network stages.
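The input/output classification above can be sketched with a small helper; the function name and integer link counts are an assumption for illustration.

```python
def classify_stage(in_links: int, out_links: int) -> str:
    """Classify an ETL network stage by its link counts."""
    if in_links == 0 and out_links == 1:
        # Receives data/control from remote stages and feeds the segment.
        return "input"
    if in_links == 1 and out_links == 0:
        # Takes the segment's data and sends it to remote stages.
        return "output"
    return "internal"  # ordinary processing stage within a job segment
```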
In one embodiment of the present description, the ETL network stages used for data communication between job segments on different servers may be implemented as custom operators in a parallel framework. Network conditions are monitored by periodically running a network traffic monitoring agent process that collects network performance indicators.
Fig. 3 is a schematic structural diagram of an ETL server-based task processing system provided in an embodiment of the present specification, where the system includes:
an obtaining module 301, configured to obtain a task processing request of a target task, and invoke a user requirement and a task processing flowchart of the target task based on the target task request;
an identifying module 302, configured to identify a plurality of task nodes of the task processing flowchart, and determine an execution sequence and a correlation of a plurality of to-be-processed subtasks based on positions of the task nodes;
a grading module 303, configured to grade the to-be-processed subtasks according to the execution order and the correlation of the to-be-processed subtasks, so as to obtain a grading result;
a scheme determining module 304, configured to predict an execution time and an execution cost of each to-be-processed subtask based on a load prediction model, and determine a target optimization scheme according to a classification result of the to-be-processed subtask and the user requirement;
the allocating module 305 is configured to allocate the to-be-processed subtasks to multiple ETL servers for task processing based on the target optimization scheme, so as to improve processing efficiency of the target task.
Optionally, the identifying module 302 includes:
the identification submodule is used for identifying the task processing flow chart and acquiring a plurality of task nodes;
the association submodule is used for determining the execution sequence and the mutual relation of a plurality of subtasks to be processed based on the positions of the task nodes;
the association submodule includes:
the subtask construction unit is used for constructing each to-be-processed subtask to include at least one task node;
and the sequence determining unit is used for determining the execution sequence of the subtasks to be processed based on the sequence of the task nodes.
Optionally, the grading module 303 includes:
the main priority judging submodule is used for sequentially reducing the main priority of the subtasks to be processed according to the execution sequence of the subtasks to be processed;
and the sub-priority judging sub-module is used for sequentially reducing the sub-priorities of the to-be-processed subtasks with the same main priority according to the magnitude of the correlation of the to-be-processed subtasks.
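The two-level grading performed by these submodules can be sketched as follows, assuming each subtask carries an execution-order index and a correlation score (both field names are illustrative):

```python
def grade(subtasks):
    """Grade subtasks: main priority by execution order, sub-priority
    by descending correlation among subtasks sharing a main priority.

    subtasks: [{"name", "order", "correlation"}]
    Returns [(name, main_priority, sub_priority)], 0 = highest priority.
    """
    result = []
    # Main priority decreases as execution order increases.
    for main, order in enumerate(sorted({t["order"] for t in subtasks})):
        peers = [t for t in subtasks if t["order"] == order]
        # Within a main priority, higher correlation ranks first.
        peers.sort(key=lambda t: -t["correlation"])
        for sub, t in enumerate(peers):
            result.append((t["name"], main, sub))
    return result
```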
Optionally, the scheme determining module 304 includes:
the prediction submodule is used for predicting the predicted execution time and the predicted execution cost of each to-be-processed subtask based on a load prediction model;
the scheme summarizing submodule is used for determining a plurality of task optimization schemes based on a load prediction model in combination with the predicted execution time, the predicted execution cost and the grading result of the subtasks to be processed, wherein the task optimization schemes comprise a minimum time scheme, a minimum cost scheme, a minimum time scheme within the maximum cost and a minimum cost scheme within the maximum time;
and the scheme determining submodule is used for determining the target optimization scheme from the plurality of task optimization schemes based on the user requirements.
Optionally, the task nodes include an output task node and an input task node;
if the to-be-processed subtask only comprises an output task node, the to-be-processed subtask is an initial subtask;
and if the to-be-processed subtask only comprises the input task node, the to-be-processed subtask is the final subtask.
Optionally, the starting subtask includes a plurality of contents to be processed;
based on the source of the content to be processed, searching the ETL server in the same network with the content to be processed as a first ETL server;
and preferentially distributing the first ETL server to process the content to be processed.
The functions of the system in the embodiment of the present invention have been described in the above method embodiments, so that details that are not described in the embodiment of the present invention can be referred to the relevant descriptions in the foregoing embodiments, and are not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
In the following, embodiments of the electronic device of the present invention are described, which may be regarded as an implementation in physical form for the above-described embodiments of the method and apparatus of the present invention. The details described in the embodiments of the electronic device of the invention are to be regarded as supplementary for the embodiments of the method or the apparatus described above; for details which are not disclosed in embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.
Fig. 4 is a block diagram of an exemplary embodiment of an electronic device in accordance with the present invention. The computer device shown in fig. 4 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
As shown in fig. 4, the computer apparatus 400 of the exemplary embodiment is in the form of a general purpose data processing apparatus. The components of computer device 400 may include, but are not limited to: at least one processor 410, at least one memory 420, a network interface 430, a display unit 440, an input component 450, and the like.
The memory 420 stores a computer readable program, which may be a code of a source program or a read-only program. The program may be executed by the processor unit 410 such that the processor unit 410 performs the steps of the various embodiments of the present invention. For example, the processor 410 may perform the steps shown in FIG. 1.
The memory 420 may include readable media in the form of volatile memory units, such as random access memory units (RAM) and/or cache memory units, and may further include read-only memory units (ROM). The memory 420 may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Also included is a bus (not shown) that may be representative of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The computer device 400 may also communicate with one or more external devices (e.g., keyboard, display, network device, bluetooth device, etc.), enable a user to interact with the computer device 400 via the external devices, and/or enable the computer device 400 to communicate with one or more other data processing devices (e.g., router, modem, etc.). Such communication may occur through network interface 430 and may also occur through a network adapter and one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet. The network adapter may communicate with other modules of the computer device 400 over the bus. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in the computer device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
FIG. 5 is a schematic diagram of one computer-readable medium embodiment of the present invention. As shown in fig. 5, the computer program may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, enables the computer-readable medium to implement the above-described method of the invention, namely: acquiring a task processing request of a target task, and retrieving the user requirement and the task processing flowchart of the target task; identifying a plurality of task nodes of the task processing flowchart, and determining the execution order and interrelation of a plurality of to-be-processed subtasks based on the positions of the task nodes; grading the to-be-processed subtasks to obtain a grading result; predicting the execution time and execution cost of each to-be-processed subtask based on a load prediction model, and determining a target optimization scheme in combination with the grading result and the user requirement; and distributing the to-be-processed subtasks to a plurality of ETL servers for task processing based on the target optimization scheme.
Through the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described in the present invention may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a data processing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, the present invention can be implemented as a method, apparatus, electronic device, or computer-readable medium that executes a computer program. Some or all of the functions of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).
While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently tied to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement it. The invention is not limited to the specific embodiments disclosed; rather, it covers all modifications, changes and equivalents that come within its spirit and scope.

Claims (9)

1. A task processing method based on an ETL server, characterized by comprising the following steps:
acquiring a task processing request of a target task, and calling a user requirement and a task processing flow chart of the target task based on the target task request;
identifying a plurality of task nodes of the task processing flow chart, and determining the execution sequence and the mutual relation of a plurality of subtasks to be processed based on the positions of the task nodes;
grading the subtasks to be processed according to the execution sequence and the mutual relation of the subtasks to be processed to obtain a grading result;
predicting the execution time and the execution cost of each subtask to be processed based on a load prediction model, and determining a target optimization scheme by combining the grading result of the subtask to be processed and the user requirement;
and distributing the subtasks to be processed to a plurality of ETL servers for task processing based on the target optimization scheme, so that the processing efficiency of the target tasks is improved.
2. The method of claim 1, wherein determining an execution order and interrelationship of a plurality of pending subtasks based on the location of the task node comprises:
one of the to-be-processed subtasks includes at least one task node;
and determining the execution sequence of the subtasks to be processed based on the sequence of the task nodes.
3. The method of claim 1, wherein said ranking said sub-tasks to be processed according to their execution order and interrelationships to obtain a ranking result comprises:
according to the execution sequence of the subtasks to be processed, the main priorities of the subtasks to be processed are reduced in sequence;
and sequentially reducing the sub-priorities of the to-be-processed subtasks having the same main priority according to the magnitude of the correlation of the to-be-processed subtasks.
4. The method of claim 1, wherein the predicting the execution time and the execution cost of each of the pending subtasks based on the load prediction model, and the determining the target optimization scenario in combination with the ranking results of the pending subtasks and the user requirements, comprises:
predicting the predicted execution time and the predicted execution cost of each to-be-processed subtask based on a load prediction model;
determining a plurality of task optimization schemes based on a load prediction model and combined with the predicted execution time, the predicted execution cost and the classification result of the subtasks to be processed, wherein the task optimization schemes comprise a minimum time scheme, a minimum cost scheme, a minimum time scheme within the maximum cost and a minimum cost scheme within the maximum time;
determining the target optimization scheme from the plurality of task optimization schemes based on the user requirement.
5. The method of claim 2,
the task nodes comprise an output task node and an input task node;
if the to-be-processed subtask only includes an output task node, the to-be-processed subtask is an initial subtask;
and if the to-be-processed subtask only comprises the input task node, the to-be-processed subtask is the final subtask.
6. The method of claim 5,
the starting subtask comprises a plurality of contents to be processed;
based on the source of the content to be processed, searching the ETL server in the same network with the content to be processed as a first ETL server;
and preferentially distributing the first ETL server to process the content to be processed.
7. An ETL server-based task processing system, comprising:
the acquisition module is used for acquiring a task processing request of a target task and calling a user requirement and a task processing flow chart of the target task based on the target task request;
the identification module is used for identifying a plurality of task nodes of the task processing flow chart and determining the execution sequence and the mutual relation of a plurality of subtasks to be processed based on the positions of the task nodes;
the classification module is used for classifying the subtasks to be processed according to the execution sequence and the mutual relation of the subtasks to be processed to obtain a classification result;
the scheme determining module is used for predicting the execution time and the execution cost of each to-be-processed subtask based on a load prediction model, and determining a target optimization scheme by combining the grading result of the to-be-processed subtask and the user requirement;
and the distribution module is used for distributing the subtasks to be processed to a plurality of ETL servers for task processing based on the target optimization scheme, so that the processing efficiency of the target tasks is improved.
8. An electronic device, wherein the electronic device comprises:
a processor; and the number of the first and second groups,
a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-6.
9. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-6.
CN202210809383.1A 2022-07-11 2022-07-11 Task processing method and system based on ETL server and electronic equipment Pending CN115202876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210809383.1A CN115202876A (en) 2022-07-11 2022-07-11 Task processing method and system based on ETL server and electronic equipment

Publications (1)

Publication Number Publication Date
CN115202876A true CN115202876A (en) 2022-10-18

Family

ID=83579910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210809383.1A Pending CN115202876A (en) 2022-07-11 2022-07-11 Task processing method and system based on ETL server and electronic equipment

Country Status (1)

Country Link
CN (1) CN115202876A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116069464A (en) * 2022-12-19 2023-05-05 深圳计算科学研究院 Optimization method and device based on distributed storage call data execution
CN116069464B (en) * 2022-12-19 2024-01-16 深圳计算科学研究院 Optimization method and device based on distributed storage call data execution
CN117056068A (en) * 2023-08-08 2023-11-14 杭州观远数据有限公司 JobEngine task splitting method in ETL
CN117056068B (en) * 2023-08-08 2024-03-19 杭州观远数据有限公司 JobEngine task splitting method in ETL

Similar Documents

Publication Publication Date Title
CN110727512B (en) Cluster resource scheduling method, device, equipment and storage medium
US10664308B2 (en) Job distribution within a grid environment using mega-host groupings of execution hosts
US8752059B2 (en) Computer data processing capacity planning using dependency relationships from a configuration management database
CN115202876A (en) Task processing method and system based on ETL server and electronic equipment
US20200174844A1 (en) System and method for resource partitioning in distributed computing
US9417919B2 (en) Computer cluster with objective-based resource sharing
US20070180451A1 (en) System and method for meta-scheduling
CN104239144A (en) Multilevel distributed task processing system
JP5121936B2 (en) RESOURCE ALLOCATION DEVICE, RESOURCE ALLOCATION PROGRAM, RECORDING MEDIUM, AND RESOURCE ALLOCATION METHOD
CN113342477B (en) Container group deployment method, device, equipment and storage medium
KR101471749B1 (en) Virtual machine allcoation of cloud service for fuzzy logic driven virtual machine resource evaluation apparatus and method
CN104123182A (en) Map Reduce task data-center-across scheduling system and method based on master-slave framework
US20140201371A1 (en) Balancing the allocation of virtual machines in cloud systems
US20160062929A1 (en) Master device, slave device and computing methods thereof for a cluster computing system
CN112463390A (en) Distributed task scheduling method and device, terminal equipment and storage medium
Mika et al. Modelling and solving grid resource allocation problem with network resources for workflow applications
CN111459641A (en) Cross-machine-room task scheduling and task processing method and device
Blythe et al. Planning for workflow construction and maintenance on the grid
JP5043166B2 (en) Computer system, data search method, and database management computer
Cucinotta et al. Optimum VM placement for NFV infrastructures
CN113703945B (en) Micro service cluster scheduling method, device, equipment and storage medium
CN110633142B (en) Block chain consensus method, management node, electronic device, and storage medium
JP2018181123A (en) Resource allocation control system, resource allocation control method, and program
CN113886086A (en) Cloud platform computing resource allocation method, system, terminal and storage medium
CN113296907A (en) Task scheduling processing method and system based on cluster and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination