CN111756802A - Method and system for scheduling data stream tasks on NUMA platform

Method and system for scheduling data stream tasks on NUMA platform

Info

Publication number
CN111756802A
Authority
CN
China
Prior art keywords
task
numa
data
node
scheduling
Prior art date
Legal status
Granted
Application number
CN202010456848.0A
Other languages
Chinese (zh)
Other versions
CN111756802B (en)
Inventor
都政
沙士豪
温志伟
舒继武
罗秋明
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202010456848.0A priority Critical patent/CN111756802B/en
Publication of CN111756802A publication Critical patent/CN111756802A/en
Application granted granted Critical
Publication of CN111756802B publication Critical patent/CN111756802B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/70 Admission control; Resource allocation
    • H04L 47/78 Architectures of resource allocation
    • H04L 47/783 Distributed allocation of resources, e.g. bandwidth brokers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Bus Control (AREA)

Abstract

The invention discloses a method and a system for scheduling data stream tasks on a NUMA platform. The method comprises the following steps: marking the dataflow graph and the data stream tasks according to the states of the data stream during the computation run; recording node information of the NUMA platform and the memory access bandwidth between node memories; allocating the initial task of the data stream to an idle processor core of any NUMA node; and selecting, from all currently idle processor cores, the processor core with the minimum data transmission time cost, and scheduling the newly ready task to run on that core. Based on the data storage characteristics of the NUMA platform and the characteristics of the data stream, the invention provides a dynamic scheduling method that selects a suitable NUMA node as the running node of each ready data stream task, so that the time consumed by inter-node data transmission is minimized and the overall computation execution efficiency is improved.

Description

Method and system for scheduling data stream tasks on NUMA platform
Technical Field
The invention relates to the technical field of multitask data processing, in particular to a method and a system for scheduling data stream tasks on a NUMA platform.
Background
At present, processor development has shifted from simply increasing execution speed toward multi-core designs, and large-scale distributed systems are increasingly common. Conventional programs are structured around sequentially executed commands; in this mode data is usually static and accessed repeatedly, so such programs are not particularly well supported by multi-core processors and large distributed systems. Data flow programming, by contrast, emphasizes data as the driving force and explicitly defines the connections between inputs and outputs. Instead of a command mode, the relevant operation executes as soon as its data are ready and its inputs are valid; data flow programming is therefore inherently parallel and can run well on multi-core processors and in large-scale distributed systems. NUMA employs a distributed memory model, except that the processors in all nodes can access all of the system's physical memory. However, in the prior art, the time a processor needs to access memory within its own node may be much less than the time it takes to access memory in some remote node.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defect of low computation execution efficiency caused by long memory access times on a NUMA platform, thereby providing a method and a system for scheduling data stream tasks on a NUMA platform.
In order to achieve the purpose, the invention provides the following technical scheme:
in a first aspect, an embodiment of the present invention provides a method for scheduling a data stream task on a NUMA platform, including the following steps:
marking the data flow graph and the data flow task according to the state of the data flow in the calculation operation;
recording node information of the NUMA platform and memory access bandwidth of a memory between nodes;
allocating an initial task of the data stream to an idle processor core of any NUMA node;
and selecting, from all currently idle processor cores, the processor core with the minimum data transmission time cost, and scheduling the newly ready task to run on that processor core.
In one embodiment, the states of the data flow during a computation run include: a completed task is marked as state F; a running task is marked as state R; a task whose inputs are not yet ready is marked as state U. When the data flow computation starts, the initial task serves as the input of the entire dataflow graph and is in state R while all other dataflow tasks are in state U; as the computation advances, tasks in state R run to completion, making the input data of subsequent dataflow tasks ready so that those tasks enter state R.
In one embodiment, scheduling is required each time a data flow task transitions from the U state to the R state.
In one embodiment, each task records its own estimated computation time and the required storage sizes of its n input data, D_size = (D_0, D_1, ..., D_{n-1}); the value of n differs between tasks, with n = 0 for the initial task and n > 0 for all other nodes.
In an embodiment, the step of recording node information of the NUMA platform and memory access bandwidth of a memory between nodes includes:
marking each processor core of the NUMA platform as C_{i,j}, where i denotes the number of the NUMA node on which the processor core resides and j denotes the number given to the core within that node;

binding threads and data to different NUMA nodes and testing to obtain the cross-node memory access bandwidth;

obtaining a communication matrix M_cross recording the cross-access bandwidths of the NUMA platform with k nodes, where element B_{i,j} of the communication matrix records the memory access bandwidth when a processor core on NUMA node i accesses the memory on node j, the communication matrix being

$$M_{cross} = \begin{pmatrix} B_{0,0} & \cdots & B_{0,k-1} \\ \vdots & \ddots & \vdots \\ B_{k-1,0} & \cdots & B_{k-1,k-1} \end{pmatrix}$$
In an embodiment, the projected busy time T_busy is recorded for each of the processor cores on each node; T_busy = 0 characterizes the current processor core as idle.
In one embodiment, the data transmission time cost of a candidate processor core C_{c,k} is calculated as follows:

according to the set of NUMA node positions of the n predecessor tasks, A = (N_0, N_1, ..., N_{n-1}), and the amounts of data required by the current task, D_size = (D_0, D_1, ..., D_{n-1}), the data transmission time cost is calculated as:

$$TC = \sum_{i=0}^{n-1} \frac{D_i}{B_{C,N_i}}$$

where D_i is the amount of data required from the i-th predecessor task and B_{C,N_i} is the memory access bandwidth when a processor core on NUMA node C accesses the memory of node N_i.
In one embodiment, after all input data have been copied to the node of the corresponding processor core according to the scheduling result, the predicted busy time of the current processor core is updated with the computation duration of the data flow task, T_busy = Workload, which characterizes the current processor as busy.
In a second aspect, an embodiment of the present invention provides a system for scheduling a data stream task on a NUMA platform, including:
the task marking module is used for marking the data flow graph and the data flow task according to the state of the data flow in the calculation operation;
the node information and memory access bandwidth recording module is used for recording the node information of the NUMA platform and the memory access bandwidth of the memory between the nodes;
the initial task allocation module is used for allocating the initial tasks of the data stream to idle processor cores of any NUMA node;
and the task scheduling module is used for selecting, from all currently idle processor cores, the processor core with the minimum data transmission time cost, and scheduling the newly ready task to run on that processor core.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores computer instructions, and the computer instructions are configured to cause the computer to execute the method for scheduling a data stream task on a NUMA platform according to the first aspect of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer device, including: the scheduling method comprises a memory and a processor, wherein the memory and the processor are communicatively connected with each other, the memory stores computer instructions, and the processor executes the computer instructions to execute the scheduling method of the data stream task on the NUMA platform according to the first aspect of the embodiment of the present invention.
The technical scheme of the invention has the following advantages:
in order to complete the scheduling of data flow tasks, the method first labels the dataflow graph and the data flow tasks according to the states of the data flow during the computation run; it then records node information of the NUMA platform and the memory access bandwidth between node memories; the initial task of the data stream is allocated to an idle processor core of any NUMA node; and each newly ready task is scheduled to run on the processor core, among all currently idle cores, with the minimum data transmission time cost. Based on the data storage characteristics of the NUMA platform combined with the characteristics of the data flow, the invention provides a dynamic scheduling method that selects a suitable NUMA node as the running node of each ready data flow task, so that the time consumed by inter-node data transmission is minimized and the overall computation execution efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic diagram of the time taken by a processor to access local and non-local memory at different node distances in a NUMA architecture according to an embodiment of the present invention;
FIG. 2 is a flowchart of a specific example of a method for scheduling data stream tasks on a NUMA platform according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a dataflow graph in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a NUMA system having 18 cores in an embodiment of the invention;
FIG. 5 is a block diagram of a specific example of a scheduling system for dataflow tasks on a NUMA platform according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a computer device in an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
Data flow programming is a high-performance parallel programming model that addresses the efficient utilization of multi-core processors. It differs markedly from traditional programming languages: execution is driven by data, the data to be processed is distributed to the individual cores, and computation is separated from communication. Through task scheduling and allocation, the parallel nature of software pipelining is exploited to uncover the potential parallelism in a stream program and to balance load among the cores. In the data flow model, the static structure of a data flow program is described as a directed graph in which nodes represent computing units and edges represent data transmission paths; data is transmitted between adjacent nodes along the edges, nodes consume data to perform computation, and the generated data is output to an input-output sequence as the input of the next computing unit.
Non-uniform memory access (NUMA) is an architectural model of how memory is accessed by multiple CPUs: a computer memory design for multiprocessor systems in which memory access time depends on the location of the memory relative to the processor. In the NUMA architecture, a physical CPU (generally comprising multiple logical CPUs, i.e. multiple cores) together with a group of memory slots constitutes a node; that is, a physical CPU and a block of memory form a node. Each CPU can access the memory under its own node and can also access the memory of other nodes.
NUMA employs a distributed memory model, except that the processors in all nodes can access all of the system's physical memory. However, the time a processor needs to access memory within its own node may be much less than the time it takes to access memory in some remote node. Under NUMA, a processor accesses its own local memory faster than non-local memory (memory local to another processor, or memory shared between processors), and the time taken to access non-local memory is also related to the distance between nodes; as shown in FIG. 1, the shorter the distance, the less time it takes.
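As a hypothetical numeric illustration of this effect (the bandwidth figures below are assumptions for this sketch, not measurements from the patent), copying the same task input over a remote link takes several times longer than a local copy:

```python
# Assumed bandwidths for illustration only: 40 GB/s local, 10 GB/s remote.
local_bw, remote_bw = 40e9, 10e9        # bytes per second
data = 8 * 2**20                        # an 8 MiB task input
print(f"local:  {data / local_bw * 1e3:.2f} ms")   # ~0.21 ms
print(f"remote: {data / remote_bw * 1e3:.2f} ms")  # ~0.84 ms
```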
The embodiment of the present invention provides a method for scheduling data stream tasks on a NUMA platform, as shown in FIG. 2. The method targets the data storage characteristics of the NUMA platform when data stream tasks are computed by multiple processors on that platform, and includes:
step S1: and marking the data flow graph and the data flow task according to the state of the data flow in the calculation operation.
A complete data stream computation is composed of multiple data stream tasks and data inputs. Each data stream task must wait for several pieces of preceding input data to become ready; once all of a task's preceding data are ready, the task is ready for execution. When a data stream task executes, it outputs new data, which serves as preceding input data for subsequent tasks.
In the embodiment of the present invention, in order to complete the scheduling of data flow tasks, the dataflow graph and the data flow tasks are labeled during the data flow computation run. Data flow tasks are labeled into three categories: a completed task is marked with type F; a running task is marked with type R; a task that is not yet ready is marked with type U. At the start of a computation run all tasks are in the U state, except for the initial task(s), which are in the R state. As the dataflow computation advances, R tasks run to completion, making the input data of subsequent dataflow tasks ready so that they enter the R state. Each task records its own estimated computation time Workload and the required storage sizes of its n input data, D_size = (D_0, D_1, ..., D_{n-1}); the value of n differs between tasks, with n = 0 for the initial task and n > 0 for all other nodes. The initial task, as the input of the entire dataflow graph, is ready by default.
In the dataflow graph shown in FIG. 3, each node represents a task; nodes A, B and C represent 3 tasks. The small squares represent data passed between dataflow tasks. Once the input data of a task are ready, execution of the task begins: the task is assigned to a computing unit, i.e. a CPU, for execution. R/F/U denote the states of the data flow tasks. As can be seen from FIG. 3, task C lacks only one datum, the one produced by task A, before it is ready. When task A ends, its state changes from R to F and it generates data for the subsequent task. At that point all the preceding input data of task C are ready, its state changes from U to R, and task scheduling begins.
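The F/R/U bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not code from the patent: the Task fields mirror the state marks, the Workload estimate, and the D_size vector described above, while the graph shape, workload values, and data sizes for the FIG. 3 example are invented.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class State(Enum):
    F = "finished"    # task has completed
    R = "running"     # task is ready / being executed
    U = "unready"     # some preceding input data is still missing

@dataclass(eq=False)  # identity-based hashing so tasks can be used as dict keys
class Task:
    name: str
    workload: float                  # estimated computation time
    d_size: List[int]                # storage sizes of the n input data
    preds: List["Task"] = field(default_factory=list)  # predecessor tasks
    state: State = State.U

    @property
    def n(self) -> int:
        return len(self.d_size)      # n == 0 identifies an initial task

    def inputs_ready(self) -> bool:
        # a task becomes ready once every predecessor has finished
        return all(p.state is State.F for p in self.preds)

# The three-task graph of FIG. 3 (shape assumed): A and B feed C.
a = Task("A", workload=2.0, d_size=[])
b = Task("B", workload=3.0, d_size=[])
c = Task("C", workload=1.5, d_size=[4096, 8192], preds=[a, b])
a.state = b.state = State.R      # initial tasks are ready by default

a.state = State.F                # A finishes and emits its output
assert not c.inputs_ready()      # C still waits on B
b.state = State.F
assert c.inputs_ready()          # U -> R transition: schedule C now
```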
Step S2: and recording node information of the NUMA platform and memory access bandwidth of memories between the nodes.
In the embodiment of the invention, when a data stream computation runs on a NUMA platform, each data stream task, once ready, is allocated to a NUMA node and executed by a CPU belonging to that node; owing to the nature of data stream computation, data is transferred between different NUMA nodes.
In order to complete the scheduling of data stream tasks, the embodiment of the invention records node information of the NUMA platform and the memory access bandwidth between node memories. Each processor core is marked as C_{i,j}, where i denotes the number of the NUMA node on which the core resides and j denotes the core's number within that node.
The cross-access bandwidth is obtained by binding threads and data to different NUMA nodes and testing. For a NUMA system with k nodes it is recorded as a matrix M_cross, whose element B_{i,j} records the memory access bandwidth when a processor core on NUMA node i accesses the memory on node j:

$$M_{cross} = \begin{pmatrix} B_{0,0} & \cdots & B_{0,k-1} \\ \vdots & \ddots & \vdots \\ B_{k-1,0} & \cdots & B_{k-1,k-1} \end{pmatrix}$$
for several processor cores on each node, each records its predicted busy time TbusyWhen T isbusy0 means that the current processor core is idle.
As shown in FIG. 4, consider a NUMA platform with 9 nodes, each node having 2 processor cores and its own memory space, which together form an 18-core NUMA system. The nodes are numbered 0-8 and the processor cores are C_{i,j} (i = 0-8, j = 0-1). The node communication matrix provided by a typical existing system can be represented by the following table, in which each value is the access distance between two nodes; the larger the distance, the slower data transmission between the nodes.
[Table: inter-node access-distance matrix of the 18-core NUMA system; original values not reproduced]
The embodiment of the invention uses bandwidth rather than distance when calculating the priority. Based on actual tests or the hardware documentation, the communication matrix of cross-access bandwidths is shown in the following table; each value is the access bandwidth between two nodes, and the larger the bandwidth, the faster data transmission between the nodes (and, in general, the closer they are).
[Table: cross-access bandwidth matrix M_cross of the 18-core NUMA system; original values not reproduced]
Step S3: the initial task of the data flow is assigned to an idle processor core of any NUMA node, i.e. to a processor core whose current state is T_busy = 0.
Step S4: selecting, from all currently idle processor cores, the processor core with the minimum data transmission time cost, and scheduling the newly ready task to run on that processor core.
Since the access bandwidth of data differs between different NUMA nodes (the farther apart the nodes, the smaller the access bandwidth), different task scheduling decisions cause the time spent transmitting data between nodes to differ. According to the characteristics of the NUMA platform, the embodiment of the invention selects a suitable NUMA node as the running node of each ready data flow task, so that the time consumed by data transmission between nodes is minimized.
After the initial task has been allocated to an idle processor core of some NUMA node, each newly ready task selects, from all currently idle processor cores (T_busy = 0), the core with the minimum data transmission cost TC (transfer cost) and is scheduled to run on that core.
For a candidate processor core C_{c,k}, the embodiment of the invention calculates the data transmission time cost TC as follows: according to the set of NUMA node positions of the task's n predecessor tasks, A = (N_0, N_1, ..., N_{n-1}), and the amounts of data required by the current task, D_size = (D_0, D_1, ..., D_{n-1}), the data transmission time cost is

$$TC = \sum_{i=0}^{n-1} \frac{D_i}{B_{C,N_i}}$$

where D_i is the amount of data required from the i-th predecessor task and B_{C,N_i} is the memory access bandwidth when a processor core on NUMA node C accesses the memory of node N_i. In other words, the cost is the sum, over all inputs, of the required data volume divided by the corresponding access bandwidth. The newly ready task is then scheduled to the processor core with the minimum time cost, so that the time consumed by inter-node data transmission is minimized and the overall computation execution efficiency is improved.
In embodiments of the present invention, scheduling is required each time a task transitions from the U state to the R state. According to the scheduling result, after all input data have been copied to the local node (to account for the delay introduced by copying the data), the core's predicted busy time is updated with the task's computation time, T_busy = Workload, indicating that the processor is busy.
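Putting steps S3-S4 together with the sketches above gives the following hedged outline of the scheduling step. transfer_cost implements the TC sum from the formula above; placement (which NUMA node holds each predecessor's output) is a bookkeeping structure I have assumed for the sketch, not a name from the patent.

```python
def transfer_cost(core, task, placement, m_cross):
    """TC for running `task` on `core`: the sum over inputs of D_i / B_{C,N_i}."""
    c, _ = core                      # core is a (node, index) pair
    return sum(d / m_cross[c][placement[p]]
               for p, d in zip(task.preds, task.d_size))

def schedule(task, placement, m_cross):
    """Place a newly ready (U -> R) task on the idle core with minimum TC."""
    # Sketch assumption: at least one core is idle; a full scheduler would queue.
    best = min(idle_cores(),
               key=lambda core: transfer_cost(core, task, placement, m_cross))
    t_busy[best] = task.workload     # core stays busy for the Workload time
    placement[task] = best[0]        # the task's output will reside on this node
    task.state = State.R
    return best

# Hypothetical usage: A's output lives on node 0, B's on node 4.
placement = {a: 0, b: 4}
print(schedule(c, placement, m_cross))  # -> (4, 0): B's larger input makes node 4 cheapest
```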
The scheduling method for data stream tasks on a NUMA platform provided by the embodiment of the invention is a dynamic scheduling method based on the data storage characteristics of the NUMA platform combined with the characteristics of the data stream; by selecting a suitable NUMA node as the running node of each ready data stream task, it minimizes the time consumed by inter-node data transmission and improves the overall computation execution efficiency.
Example 2
An embodiment of the present invention provides a system for scheduling a data stream task on a NUMA platform, as shown in fig. 5, including:
the task marking module 1 is used for marking the data flow graph and the data flow task according to the state of the data flow in the calculation operation; this module executes the method described in step S1 in embodiment 1, and is not described herein again.
The node information and memory access bandwidth recording module 2 is used for recording the node information of the NUMA platform and the memory access bandwidth of the memory between the nodes; this module executes the method described in step S2 in embodiment 1, and is not described herein again.
An initial task allocation module 3, configured to allocate an initial task of a data stream to an idle processor core of any NUMA node; this module executes the method described in step S3 in embodiment 1, and is not described herein again.
And the task scheduling module 4 is used for selecting the processor core with the minimum data transmission time cost from all the idle processor cores at present, and scheduling the newly ready task to the processor core with the minimum transmission time cost for operation. This module executes the method described in step S4 in embodiment 1, and is not described herein again.
The scheduling system of the data stream task on the NUMA platform provided by the embodiment of the invention provides a dynamic scheduling method by combining the characteristics of the data stream according to the data storage characteristics of the NUMA platform, and the appropriate NUMA node is selected as the operation node of the ready data stream task, so that the time consumed by data transmission between the nodes is minimized, and the overall computation execution efficiency is improved.
Example 3
An embodiment of the present invention provides a computer device, as shown in fig. 6, the device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in another manner, and fig. 6 takes the connection by the bus as an example.
The processor 51 may be a Central Processing Unit (CPU). The Processor 51 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the corresponding program instructions/modules in the embodiments of the present invention. The processor 51 executes various functional applications and data processing of the processor by running non-transitory software programs, instructions and modules stored in the memory 52, namely, implementing the scheduling method of the data stream task on the NUMA platform in the above method embodiment.
The memory 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 51, and the like. Further, the memory 52 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 52 may optionally include memory located remotely from the processor 51, and these remote memories may be connected to the processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 52 and when executed by the processor 51, perform the scheduling method of dataflow tasks on a NUMA platform in embodiment 1.
The details of the computer device can be understood by referring to the corresponding related descriptions and effects in embodiment 1, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program that can be stored in a computer-readable storage medium and that when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications of the invention may be made without departing from the spirit or scope of the invention.

Claims (11)

1. A method for scheduling data stream tasks on a NUMA platform is characterized by comprising the following steps:
marking the data flow graph and the data flow task according to the state of the data flow in the calculation operation;
recording node information of the NUMA platform and memory access bandwidth of a memory between nodes;
allocating an initial task of the data stream to an idle processor core of any NUMA node;
and selecting, from all currently idle processor cores, the processor core with the minimum data transmission time cost, and scheduling the newly ready task to run on that processor core.
2. The method for scheduling data flow tasks on a NUMA platform according to claim 1, wherein the states of the data flow during a computation run include: a completed task is marked as state F; a running task is marked as state R; a task whose inputs are not yet ready is marked as state U; when the data flow computation starts, the initial task serves as the input of the entire dataflow graph and is in state R while all other dataflow tasks are in state U, and as the computation advances, tasks in state R run to completion, making the input data of subsequent dataflow tasks ready so that they enter state R.
3. The method of claim 2, wherein scheduling is required each time a dataflow task transitions from U state to R state.
4. A method for scheduling data flow tasks on a NUMA platform according to claim 1, wherein each task records its own estimated computation time and the required storage sizes of its n input data, D_size = (D_0, D_1, ..., D_{n-1}); the value of n differs between tasks, with n = 0 for the initial task and n > 0 for all other nodes.
5. The method for scheduling data stream tasks on a NUMA platform according to claim 4, wherein the step of recording node information of the NUMA platform and memory access bandwidth of a memory between nodes includes:
marking each processor core of the NUMA platform as C_{i,j}, wherein i denotes the number of the NUMA node on which the processor core resides and j denotes the number given to the core within that node;

binding threads and data to different NUMA nodes and testing to obtain the cross-node memory access bandwidth;

obtaining a communication matrix M_cross recording the cross-access bandwidths of the NUMA platform with k nodes, wherein element B_{i,j} of the communication matrix records the memory access bandwidth when a processor core on NUMA node i accesses the memory on node j, the communication matrix being

$$M_{cross} = \begin{pmatrix} B_{0,0} & \cdots & B_{0,k-1} \\ \vdots & \ddots & \vdots \\ B_{k-1,0} & \cdots & B_{k-1,k-1} \end{pmatrix}$$
6. The method for scheduling data flow tasks on a NUMA platform according to claim 5, wherein the predicted busy time T_busy of each processor core on each node is recorded, and T_busy = 0 characterizes the current processor core as idle.
7. The method for scheduling data stream tasks on a NUMA platform as claimed in claim 5, wherein the data transmission time cost of a candidate processor core C_{c,k} is calculated as follows:

according to the set of NUMA node positions of the n predecessor tasks, A = (N_0, N_1, ..., N_{n-1}), and the amounts of data required by the current task, D_size = (D_0, D_1, ..., D_{n-1}), the data transmission time cost is calculated as:

$$TC = \sum_{i=0}^{n-1} \frac{D_i}{B_{C,N_i}}$$

where D_i is the amount of data required from the i-th predecessor task and B_{C,N_i} is the memory access bandwidth when a processor core on NUMA node C accesses the memory of node N_i.
8. The method for scheduling data stream tasks on a NUMA platform as claimed in claim 7, wherein after all input data have been copied to the node of the corresponding processor core according to the scheduling result, the predicted busy time of the current processor core is updated with the computation duration of the data flow task, T_busy = Workload, which characterizes the current processor as busy.
9. A system for scheduling data streaming tasks on a NUMA platform, comprising:
the task marking module is used for marking the data flow graph and the data flow task according to the state of the data flow in the calculation operation;
the node information and memory access bandwidth recording module is used for recording the node information of the NUMA platform and the memory access bandwidth of the memory between the nodes;
the initial task allocation module is used for allocating the initial tasks of the data stream to idle processor cores of any NUMA node;
and the task scheduling module is used for selecting, from all currently idle processor cores, the processor core with the minimum data transmission time cost, and scheduling the newly ready task to run on that processor core.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of scheduling data flow tasks on a NUMA platform of any one of claims 1 to 8.
11. A computer device, comprising: a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory storing computer instructions, the processor executing the computer instructions to perform the method of scheduling data flow tasks on a NUMA platform as recited in any one of claims 1 to 8.
CN202010456848.0A 2020-05-26 2020-05-26 Method and system for scheduling data stream tasks on NUMA platform Active CN111756802B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010456848.0A CN111756802B (en) 2020-05-26 2020-05-26 Method and system for scheduling data stream tasks on NUMA platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010456848.0A CN111756802B (en) 2020-05-26 2020-05-26 Method and system for scheduling data stream tasks on NUMA platform

Publications (2)

Publication Number Publication Date
CN111756802A true CN111756802A (en) 2020-10-09
CN111756802B CN111756802B (en) 2021-09-03

Family

ID=72674561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010456848.0A Active CN111756802B (en) 2020-05-26 2020-05-26 Method and system for scheduling data stream tasks on NUMA platform

Country Status (1)

Country Link
CN (1) CN111756802B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231099A (en) * 2020-10-14 2021-01-15 北京中科网威信息技术有限公司 Memory access method and device of processor
WO2023050712A1 (en) * 2021-09-30 2023-04-06 苏州浪潮智能科技有限公司 Task scheduling method for deep learning service, and related apparatus

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158927A (en) * 2007-10-25 2008-04-09 中国科学院计算技术研究所 EMS memory sharing system, device and method
CN102520915A (en) * 2011-11-25 2012-06-27 华为技术有限公司 Method and device for threading serial program in nonuniform memory access system
CN102831012A (en) * 2011-06-16 2012-12-19 日立(中国)研究开发有限公司 Task scheduling device and task scheduling method in multimode distributive system
CN103729248A (en) * 2012-10-16 2014-04-16 华为技术有限公司 Method and device for determining tasks to be migrated based on cache perception
CN105389211A (en) * 2015-10-22 2016-03-09 北京航空航天大学 Memory allocation method and delay perception-memory allocation apparatus suitable for memory access delay balance among multiple nodes in NUMA construction
CN105760220A (en) * 2016-01-29 2016-07-13 湖南大学 Task and data scheduling method and device based on hybrid memory
CN106095576A (en) * 2016-06-14 2016-11-09 上海交通大学 Under virtualization multi-core environment, nonuniformity I/O accesses resources of virtual machine moving method
US20180165120A1 (en) * 2015-08-26 2018-06-14 Netapp, Inc. Migration between cpu cores
CN109388490A (en) * 2017-08-07 2019-02-26 杭州华为数字技术有限公司 A kind of memory allocation method and server
US20190079805A1 (en) * 2017-09-08 2019-03-14 Fujitsu Limited Execution node selection method and information processing apparatus
CN109491785A (en) * 2018-10-24 2019-03-19 龙芯中科技术有限公司 Internal storage access dispatching method, device and equipment

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158927A (en) * 2007-10-25 2008-04-09 中国科学院计算技术研究所 EMS memory sharing system, device and method
CN102831012A (en) * 2011-06-16 2012-12-19 日立(中国)研究开发有限公司 Task scheduling device and task scheduling method in multimode distributive system
CN102520915A (en) * 2011-11-25 2012-06-27 华为技术有限公司 Method and device for threading serial program in nonuniform memory access system
CN103729248A (en) * 2012-10-16 2014-04-16 华为技术有限公司 Method and device for determining tasks to be migrated based on cache perception
US20180165120A1 (en) * 2015-08-26 2018-06-14 Netapp, Inc. Migration between cpu cores
CN105389211A (en) * 2015-10-22 2016-03-09 北京航空航天大学 Memory allocation method and delay perception-memory allocation apparatus suitable for memory access delay balance among multiple nodes in NUMA construction
CN105760220A (en) * 2016-01-29 2016-07-13 湖南大学 Task and data scheduling method and device based on hybrid memory
CN106095576A (en) * 2016-06-14 2016-11-09 上海交通大学 Under virtualization multi-core environment, nonuniformity I/O accesses resources of virtual machine moving method
CN109388490A (en) * 2017-08-07 2019-02-26 杭州华为数字技术有限公司 A kind of memory allocation method and server
US20190079805A1 (en) * 2017-09-08 2019-03-14 Fujitsu Limited Execution node selection method and information processing apparatus
CN109491785A (en) * 2018-10-24 2019-03-19 龙芯中科技术有限公司 Internal storage access dispatching method, device and equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231099A (en) * 2020-10-14 2021-01-15 北京中科网威信息技术有限公司 Memory access method and device of processor
WO2023050712A1 (en) * 2021-09-30 2023-04-06 苏州浪潮智能科技有限公司 Task scheduling method for deep learning service, and related apparatus

Also Published As

Publication number Publication date
CN111756802B (en) 2021-09-03

Similar Documents

Publication Publication Date Title
US9665404B2 (en) Optimization of map-reduce shuffle performance through shuffler I/O pipeline actions and planning
WO2021254135A1 (en) Task execution method and storage device
US10241880B2 (en) Efficient validation/verification of coherency and snoop filtering mechanisms in computing systems
CN113918101B (en) Method, system, equipment and storage medium for writing data cache
CN103970520A (en) Resource management method and device in MapReduce framework and framework system with device
CN111190735B (en) On-chip CPU/GPU pipelining calculation method based on Linux and computer system
CN111756802B (en) Method and system for scheduling data stream tasks on NUMA platform
EP3662376B1 (en) Reconfigurable cache architecture and methods for cache coherency
CN111309805B (en) Data reading and writing method and device for database
CN105988856B (en) Interpreter memory access optimization method and device
US20210255793A1 (en) System and method for managing conversion of low-locality data into high-locality data
WO2020008392A2 (en) Predicting execution time of memory bandwidth intensive batch jobs
CN105740249B (en) Processing method and system in parallel scheduling process of big data job
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
CN108763421B (en) Data searching method and system based on logic circuit
CN115878333A (en) Method, device and equipment for judging consistency between process groups
US20220067872A1 (en) Graphics processing unit including delegator and operating method thereof
CN117093335A (en) Task scheduling method and device for distributed storage system
KR20220142059A (en) In-memory Decoding Cache and Its Management Scheme for Accelerating Deep Learning Batching Process
US20240211302A1 (en) Dynamic provisioning of portions of a data processing array for spatial and temporal sharing
US11989581B2 (en) Software managed memory hierarchy
US20230004855A1 (en) Co-operative and adaptive machine learning execution engines
EP3985507A1 (en) Electronic device and method with scheduling
CN116680296A (en) Large-scale graph data processing system based on single machine
JP2021117577A (en) Information processing device, information processing method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant