CN116048759A - Data processing method, device, computer and storage medium for data stream

Data processing method, device, computer and storage medium for data stream

Info

Publication number
CN116048759A
Authority
CN
China
Prior art keywords
node
data
nodes
task
processor
Prior art date
Legal status
Pending
Application number
CN202310031867.2A
Other languages
Chinese (zh)
Inventor
王梅
李粤平
罗秋明
Current Assignee
Shenzhen Polytechnic
Original Assignee
Shenzhen Polytechnic
Priority date
Filing date
Publication date
Application filed by Shenzhen Polytechnic
Priority to CN202310031867.2A
Publication of CN116048759A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5038: Allocation of resources considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 9/505: Allocation of resources considering the load
    • G06F 2209/00: Indexing scheme relating to G06F 9/00
    • G06F 2209/50: Indexing scheme relating to G06F 9/50
    • G06F 2209/5018: Thread allocation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a data processing method for a data stream, which comprises the following steps: according to an embodiment of the invention, an operating system obtains, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their volumes; threads are allocated to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and the threads are scheduled according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume. The embodiment of the invention provides support for the data-flow execution mode at the operating-system level, greatly improving both efficiency and the room for optimization.

Description

Data processing method, device, computer and storage medium for data stream
Technical Field
The invention belongs to the technical field of data processing, and in particular relates to a data processing method, apparatus, computer and storage medium for a data stream.
Background
Processor development has shifted from simply raising clock speeds toward multi-core designs, and large-scale distributed systems have become increasingly common. Traditional programming uses sequentially executed commands, in which data is largely "static" and is accessed by a continuing stream of operations, so such programs do not support multi-core processors and large distributed systems particularly well. Data-flow programming, by contrast, treats data as the driving force and defines well-specified input and output connections between operations. Rather than executing commands in sequence, an operation runs as soon as its data is ready, i.e., as soon as its inputs become valid; data-flow programs are therefore inherently parallel and run well on multi-core processors and large distributed systems.
In today's massively parallel applications, data-flow computation outperforms the mainstream control-flow execution mode in both the programming model and the execution model. In the current processor environment, which is still control-flow, the data-flow execution mode can be implemented at the application level; for example, the internal execution engine of TensorFlow processes tasks in a data-flow manner. There are also specialized libraries (e.g., Taskflow) that implement the data-flow execution mode on top of existing control-flow processors, control-flow operating systems, and control-flow programming languages.
However, without support at the operating-system level, both efficiency and the room for optimization remain significantly limited.
Disclosure of Invention
To solve the above technical problem, an embodiment of the present invention provides a data processing method for a data stream, comprising:
obtaining, by an operating system, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their data volumes;
allocating threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and
scheduling the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume.
Further, the allocating threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG comprises:
sorting the ready nodes of the node tasks in the dependency DAG in descending order of edge count, and assigning node tasks to threads starting from the first-ranked online ready node.
Further, the scheduling the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume comprises the following steps:
counting the task data volume on each processor to obtain the total task data volume on that processor;
pre-scheduling tasks onto the processors one by one according to a preset scheduling algorithm;
calculating a total delay estimate for all processors from the pre-scheduling results and the total task volume; and
evaluating the various pre-scheduling results, and binding the thread of the ready node task to the pre-scheduled processor with the smallest total delay estimate for data processing.
Further, the calculating a total delay estimate for all processors from the pre-scheduling results and the total task volume comprises:
calculating, for the edges of the node task in the dependency DAG, the data-transfer time estimate Tedge = the sum of the times to copy all input data from the NUMA nodes of the predecessor nodes to the NUMA node of the processor;
obtaining the total data capacity of the node tasks on each processor core and that core's share of last-level cache capacity, yielding the ratio k of total data capacity to cache capacity; and
calculating, for each processor core, the total delay estimate Td = Tedge + total data capacity × k × x, where x is an empirical value.
Further, the nodes that remain without assigned threads are offline ready nodes and offline immediate successor nodes, and the method further comprises:
tracking, by the operating system, the ready state of online blocked nodes according to the precedence dependencies in the PCB, wherein the ready state of offline immediate successor nodes is tracked with the support of user code or a user-space runtime library.
A data processing apparatus for a data stream, comprising:
an acquisition module, configured to obtain, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their data volumes;
a processing module, configured to allocate threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and
an execution module, configured to schedule the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume.
Further, the processing module is further configured to sort the ready nodes of the node tasks in the dependency DAG in descending order of edge count and assign node tasks to threads starting from the first-ranked online ready node.
Further, the execution module comprises:
a first acquisition sub-module, configured to count the task data volume on each processor to obtain the total task data volume on that processor;
a first processing sub-module, configured to pre-schedule tasks onto the processors one by one according to a preset scheduling algorithm;
a second processing sub-module, configured to calculate a total delay estimate for all processors from the pre-scheduling results and the total task volume; and
a first execution sub-module, configured to evaluate the various pre-scheduling results and bind the thread of the ready node task to the pre-scheduled processor with the smallest total delay estimate for data processing.
Further, the first execution sub-module comprises:
a second acquisition sub-module, configured to calculate, for the edges of the node task in the dependency DAG, the data-transfer time estimate Tedge = the sum of the times to copy all input data from the NUMA nodes of the predecessor nodes to the NUMA node of the processor;
a third acquisition sub-module, configured to obtain the total data capacity of the node tasks on each processor core and that core's share of last-level cache capacity, yielding the ratio k of total data capacity to cache capacity; and
a second execution sub-module, configured to calculate, for each processor core, the total delay estimate Td = Tedge + total data capacity × k × x, where x is an empirical value.
Further, the nodes that remain without assigned threads are offline ready nodes and offline immediate successor nodes, and the execution module is further configured to track the ready state of online blocked nodes according to the precedence dependencies in the PCB, wherein the ready state of offline immediate successor nodes is tracked with the support of user code or a user-space runtime library.
A computer device comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the data processing method for a data stream described above.
A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method for a data stream described above.
According to the embodiment of the invention, an operating system obtains, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their volumes; threads are allocated to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and the threads are scheduled according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume. The embodiment of the invention provides support for the data-flow execution mode at the operating-system level, greatly improving both efficiency and the room for optimization.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention; for a person skilled in the art, other drawings can be obtained from them without inventive effort.
Fig. 1 is a flow chart of a data processing method of a data stream according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data flow according to an embodiment of the present invention;
FIG. 3 is a basic block diagram of a data processing apparatus for data flow according to an embodiment of the present invention;
fig. 4 is a basic structural block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
Some of the flows described in the specification, claims, and drawings of the present invention include operations that occur in a particular order, but it should be understood that these operations may be executed out of the order in which they appear, or in parallel. Sequence numbers such as 101 and 102 merely distinguish different operations and do not by themselves impose any execution order. The flows may also include more or fewer operations, which may be executed sequentially or in parallel. It should be noted that the terms "first" and "second" herein distinguish different messages, devices, modules, etc.; they do not denote a sequence, nor do they require that "first" and "second" be of different types.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the present invention.
Referring to fig. 1, fig. 1 shows a data processing method for a data stream according to an embodiment of the present invention. As shown in fig. 1, the method specifically comprises the following steps:
S1, an operating system obtains, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their data volumes;
data stream programming is a high performance parallel programming model that solves the problem of efficient utilization of multi-core processors. The data flow programming is obviously different from the traditional programming language, the data flow programming is executed in a data driving mode, the data to be processed is distributed to each core, the calculation and the communication of the data are separated, and the potential parallelism in the flow program is fully mined by utilizing the parallel characteristic of software flow through task scheduling and distribution, so that the load among the cores is balanced. In the data flow paradigm, a static instance of a data flow program is described in terms of its structure as a directed graph DAG. As shown in fig. 2, the nodes in the figure represent the calculation units, and the edges represent the data transmission paths. And transmitting data between adjacent nodes through edges, calculating node consumption data, and outputting the generated data to an input-output sequence as the input of a next calculation unit.
It should be noted that, in the embodiment of the present invention, the data-flow task manages the overall data-flow computation as a directed acyclic graph (DAG). Data-flow tasks are executed with threads as carriers, and the information in the process control block (PCB) is extended: fields are added for the precedence dependencies among data-flow tasks, for recording the data size (in bytes) of each output edge, and for recording the stack frame length required by a task within its stack, along with a data-readiness count for each data-flow task node and a data-flow task activation flag.
S2, allocating threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG;
Specifically, step S2 sorts the ready nodes of the node tasks in the dependency DAG in descending order of edge count and assigns node tasks to threads starting from the first-ranked online ready node, as sketched below.
The operating system counts, for each processor core, the node tasks it is running, and sums the data sizes of the output edges required by these tasks and the stack frame lengths required by the tasks to obtain the total data capacity. The total data capacity on each core is tracked as the basis for scheduling new tasks. Counting the node tasks of the currently processed data from the end of the directed DAG and counting the total data capacity for executing node tasks on all processor cores comprises:
step one, searching the directed DAG for the current ready nodes associated with the target node holding the data of the current process; and
step two, summing the data sizes of the edges between the target node and the current ready nodes together with the required stack frame lengths to obtain the total data capacity on each processor core, as sketched after this list.
S3, scheduling the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume.
Specifically, step S3 comprises the following steps:
step one, counting the task data volume on each processor to obtain the total task data volume on that processor;
step two, pre-scheduling tasks onto the processors one by one according to a preset scheduling algorithm;
step three, calculating a total delay estimate for all processors from the pre-scheduling results and the total task volume;
in practical application, the third step includes: calculating the data transmission time estimation value edge of the node task on the dependency relationship DAG graph=the sum of the time when all input data are copied from the NUMA node where the predecessor node is located to the NUMA node where the processor is located; acquiring total data capacity and final cache capacity share of node tasks on each processor core to obtain a ratio k of the total data capacity to the cache; the total delay estimate td=ridge+total data capacity x k x is calculated for each processing core, where x is an empirical value.
Where there are multiple cores, the total data capacity and last level cache capacity share of the node tasks on each processor core need to be split equally.
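Reusing the df_node/df_edge shapes sketched in the fig. 2 discussion above, the Tedge term could be estimated as below; numa_copy_cost() and src_numa_of() are hypothetical lookups standing in for the kernel's NUMA-distance statistics, not functions named by the patent:

    /* Hypothetical helpers (assumptions, not from the patent): */
    double numa_copy_cost(int src_numa, int dst_numa); /* seconds per byte */
    int src_numa_of(const struct df_node *producer);   /* NUMA node holding the data */

    /* Tedge: total time to copy every input of node n to the candidate
     * core's NUMA node dst_numa. */
    double tedge_estimate(const struct df_node *n, int dst_numa)
    {
        double t = 0.0;
        for (int i = 0; i < n->in_count; i++) {
            const struct df_edge *e = n->in[i];
            t += (double)e->data_size
                 * numa_copy_cost(src_numa_of(e->src), dst_numa);
        }
        return t;
    }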
In one embodiment of the invention, the following are obtained from the control block (PCB) of the node task: the predecessor relationships and the corresponding data sizes, the differences in communication cost among processor cores, the last-level cache capacity of each processor core, and the total data capacity, i.e., the sum of the data sizes of the output edges of the node task's predecessor nodes, the data size of the node's own output edges, and the node task's stack frame length. All processor cores are traversed to obtain the total data capacity and the last-level cache capacity of the node tasks on each core, yielding the ratio k of total data capacity to cache capacity.
One embodiment of the present invention adds a structure member to the thread control block task_struct { }, e.g., in the Linux kernel:

    struct pre_suc {
        int pre_count;                   /* number of predecessor nodes */
        struct task_struct **pre_nodes;  /* pointer array of predecessor nodes */
        int suc_count;                   /* number of successor nodes */
        struct task_struct **suc_nodes;  /* pointer array of successor nodes */
    };
The stack frame length required by the node task is added to task_struct:

    int frame_size;          /* stack frame length required by the node task */
The node's data-readiness count and an activation flag are added to task_struct:

    int data_ready_count;    /* number of predecessor data items already ready */
    int activated;           /* set to 1 when data_ready_count == pre_count */
In the operating-system kernel, a per-CPU count of data-flow task data overhead is added:

    int current_size[NR_CPUS];   /* one counter per CPU core */

current_size[n] records the sum of the data overhead of all data-flow tasks on the processor core numbered n, including the sums of the output-edge data sizes and the stack frame lengths.
Taking the data-flow task shown in fig. 2 as an example:
After task a finishes its computation, the task_struct information of task c is updated: the count data_ready_count of ready predecessor data is incremented by one, and if data_ready_count == pre_count, the task is activated by setting activated = 1. The same operation is performed for task f.
If task c is activated at this point, scheduling is performed using the data newly added by this patent. An example of a possible scheduling scheme is as follows.
Assume task c is pre-scheduled to processor i; then calculate: the data-transfer time estimate Tedge on the DAG, and the total data capacity on processor core i including task c's data, yielding the ratio k of total data capacity to cache capacity. The total delay estimate is Td = Tedge + c's total data capacity × k × x, where x is an empirical value obtained statistically (e.g., 1 μs/kB). All processor cores are traversed and the calculation is completed one by one; the core m with the lowest Td is selected, and task c is scheduled to run on core m.
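The traversal just described could be sketched as the following loop; tedge_for(), capacity_with_task() and cache_share() are hypothetical helpers standing in for the per-core statistics (current_size[], last-level cache capacity) described above, and x is the empirical constant (e.g., 1 μs/kB):

    #include <stddef.h>

    /* Hypothetical helpers (assumptions, not from the patent): */
    double tedge_for(int cpu);                        /* Tedge if the task runs on cpu */
    double capacity_with_task(int cpu, size_t bytes); /* current_size[cpu] + task data */
    double cache_share(int cpu);                      /* last-level cache share of cpu */

    /* Pick the core with the smallest total delay estimate Td. */
    int pick_core(size_t task_bytes, int ncpus, double x)
    {
        int best = 0;
        double best_td = -1.0;
        for (int cpu = 0; cpu < ncpus; cpu++) {
            double cap = capacity_with_task(cpu, task_bytes);
            double k   = cap / cache_share(cpu);      /* data-to-cache ratio */
            double td  = tedge_for(cpu) + cap * k * x;
            if (best_td < 0 || td < best_td) {
                best_td = td;
                best = cpu;
            }
        }
        return best;   /* schedule the task to this core */
    }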
Step four, evaluating the various pre-scheduling results, and binding the thread of the ready node task to the pre-scheduled processor with the smallest total delay estimate for data processing.
In the embodiment of the invention, the nodes that remain without assigned threads are offline ready nodes and offline immediate successor nodes, wherein the ready state of online blocked nodes is tracked by the operating system according to the precedence dependencies in the PCB, and the ready state of offline immediate successor nodes is tracked with the support of user code or a user-space runtime library.
According to the embodiment of the invention, an operating system obtains, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their volumes; threads are allocated to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and the threads are scheduled according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume. The embodiment of the invention provides support for the data-flow execution mode at the operating-system level, greatly improving both efficiency and the room for optimization.
As shown in fig. 3, in order to solve the above problem, an embodiment of the present invention further provides a data processing apparatus for a data stream, comprising an acquisition module 2100, a processing module 2200, and an execution module 2300. The acquisition module 2100 is configured to obtain, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their data volumes; the processing module 2200 is configured to allocate threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and the execution module 2300 is configured to schedule the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume.
In some embodiments, the processing module is further configured to sort the ready nodes of the node tasks in the dependency DAG in descending order of edge count and assign node tasks to threads starting from the first-ranked online ready node.
In some embodiments, the execution module comprises: a first acquisition sub-module, configured to count the task data volume on each processor to obtain the total task data volume on that processor; a first processing sub-module, configured to pre-schedule tasks onto the processors one by one according to a preset scheduling algorithm; a second processing sub-module, configured to calculate a total delay estimate for all processors from the pre-scheduling results and the total task volume; and a first execution sub-module, configured to evaluate the various pre-scheduling results and bind the thread of the ready node task to the pre-scheduled processor with the smallest total delay estimate for data processing.
In some embodiments, the first execution sub-module comprises: a second acquisition sub-module, configured to calculate, for the edges of the node task in the dependency DAG, the data-transfer time estimate Tedge = the sum of the times to copy all input data from the NUMA nodes of the predecessor nodes to the NUMA node of the processor; a third acquisition sub-module, configured to obtain the total data capacity of the node tasks on each processor core and that core's share of last-level cache capacity, yielding the ratio k of total data capacity to cache capacity; and a second execution sub-module, configured to calculate, for each processor core, the total delay estimate Td = Tedge + total data capacity × k × x, where x is an empirical value.
In some embodiments, the nodes that remain without assigned threads are offline ready nodes and offline immediate successor nodes, and the execution module is further configured to track the ready state of online blocked nodes according to the precedence dependencies in the PCB, wherein the ready state of offline immediate successor nodes is tracked with the support of user code or a user-space runtime library.
The data processing apparatus for a data stream obtains, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their volumes; allocates threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and schedules the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume. The embodiment of the invention provides support for the data-flow execution mode at the operating-system level, greatly improving both efficiency and the room for optimization.
In order to solve the above technical problems, an embodiment of the present invention further provides a computer device. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to this embodiment.
As shown in fig. 4, the computer device comprises a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database may store a sequence of control information, and the computer-readable instructions, when executed by the processor, cause the processor to implement a data processing method for a data stream. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, cause the processor to perform the data processing method for a data stream. The network interface of the computer device is used for communicating with a terminal. Those skilled in the art will appreciate that the structure shown in fig. 4 is only a block diagram and does not limit the computer device to which the present solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor in this embodiment is configured to execute the specific contents of the acquisition module 2100, the processing module 2200, and the execution module 2300 in fig. 3, and the memory stores the program codes and the various types of data required to execute these modules. The network interface is used for data transmission with a user terminal or a server. The memory in this embodiment stores the program codes and data required to execute all the sub-modules of the data processing method for a data stream, and the server can call the program codes and data to execute the functions of all the sub-modules.
The embodiment of the invention provides a computer device in which an operating system obtains, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their data volumes; threads are allocated to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and the threads are scheduled according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume. The embodiment of the invention provides support for the data-flow execution mode at the operating-system level, greatly improving both efficiency and the room for optimization.
The present invention also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method for a data stream in any of the embodiments described above.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a computer-readable storage medium and, when executed, can include the flows of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages that are not necessarily completed at the same moment but may be executed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or in alternation with other steps or with at least part of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make modifications and improvements without departing from the principles of the present invention, and such modifications and improvements shall also fall within the scope of protection of the present invention.

Claims (10)

1. A data processing method for a data stream, comprising:
obtaining, by an operating system, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their data volumes;
allocating threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and
scheduling the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume.
2. The data processing method according to claim 1, wherein the allocating threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG comprises:
sorting the ready nodes of the node tasks in the dependency DAG in descending order of edge count, and assigning node tasks to threads starting from the first-ranked online ready node.
3. The data processing method according to claim 1, wherein the scheduling the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume comprises:
counting the task data volume on each processor to obtain the total task data volume on that processor;
pre-scheduling tasks onto the processors one by one according to a preset scheduling algorithm;
calculating a total delay estimate for all processors from the pre-scheduling results and the total task volume; and
evaluating the various pre-scheduling results, and binding the thread of the ready node task to the pre-scheduled processor with the smallest total delay estimate for data processing.
4. The data processing method according to claim 3, wherein the calculating a total delay estimate for all processors from the pre-scheduling results and the total task volume comprises:
calculating, for the edges of the node task in the dependency DAG, the data-transfer time estimate Tedge = the sum of the times to copy all input data from the NUMA nodes of the predecessor nodes to the NUMA node of the processor;
obtaining the total data capacity of the node tasks on each processor core and that core's share of last-level cache capacity, yielding the ratio k of total data capacity to cache capacity; and
calculating, for each processor core, the total delay estimate Td = Tedge + total data capacity × k × x, where x is an empirical value.
5. The data processing method according to claim 1, wherein the nodes that remain without assigned threads are offline ready nodes and offline immediate successor nodes, the method further comprising:
tracking, by the operating system, the ready state of online blocked nodes according to the precedence dependencies in the PCB, wherein the ready state of offline immediate successor nodes is tracked with the support of user code or a user-space runtime library.
6. A data processing apparatus for a data stream, comprising:
an acquisition module, configured to obtain, from the process control block (PCB) of a program, a dependency DAG of node tasks together with their data traffic, wherein the nodes of the dependency DAG represent node tasks and the edges connecting those nodes represent the data transfers between node tasks and their data volumes;
a processing module, configured to allocate threads to node tasks according to the ready nodes, the immediate successor nodes, and the total number of system threads in the dependency DAG; and
an execution module, configured to schedule the threads according to the system load of the current processors, the communication relationships among node tasks, and the traffic volume.
7. The data processing apparatus according to claim 6, wherein
the processing module is further configured to sort the ready nodes of the node tasks in the dependency DAG in descending order of edge count and assign node tasks to threads starting from the first-ranked online ready node.
8. The data processing apparatus according to claim 6, wherein the execution module comprises:
a first acquisition sub-module, configured to count the task data volume on each processor to obtain the total task data volume on that processor;
a first processing sub-module, configured to pre-schedule tasks onto the processors one by one according to a preset scheduling algorithm;
a second processing sub-module, configured to calculate a total delay estimate for all processors from the pre-scheduling results and the total task volume; and
a first execution sub-module, configured to evaluate the various pre-scheduling results and bind the thread of the ready node task to the pre-scheduled processor with the smallest total delay estimate for data processing.
9. A computer device comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the data processing method for a data stream according to any one of claims 1 to 5.
10. A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method for a data stream according to any one of claims 1 to 5.
CN202310031867.2A 2023-01-10 2023-01-10 Data processing method, device, computer and storage medium for data stream Pending CN116048759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310031867.2A CN116048759A (en) 2023-01-10 2023-01-10 Data processing method, device, computer and storage medium for data stream


Publications (1)

Publication Number Publication Date
CN116048759A true CN116048759A (en) 2023-05-02

Family

ID=86121503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310031867.2A Pending CN116048759A (en) 2023-01-10 2023-01-10 Data processing method, device, computer and storage medium for data stream

Country Status (1)

Country Link
CN (1) CN116048759A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421052A (en) * 2023-11-02 2024-01-19 深圳大学 Hardware automatic execution method, system, equipment and medium for data stream task



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination