CN113190358A - Job distribution method and device, electronic equipment and readable storage medium

Info

Publication number: CN113190358A
Application number: CN202110574633.3A
Authority: CN
Inventors: 苏勇; 李斌; 万伟; 刘耀华
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: Dawning Information Industry Beijing Co Ltd
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-07-30

Abstract

The application provides a job distribution method, a job distribution device, electronic equipment and a readable storage medium, and relates to the technical field of computers. According to the method, the communication delay among the computing nodes is obtained, and then the jobs are sequentially distributed to the corresponding computing nodes in ascending order of communication delay, so that the communication delay among the computing nodes executing the jobs is as small as possible, which can improve the computing efficiency and the computing performance of the computing cluster.

Description

Job distribution method and device, electronic equipment and readable storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a job assignment method, an apparatus, an electronic device, and a readable storage medium.

Background

The high-performance computing cluster has strong computing power and can provide a large amount of computing services for users, and the cluster management system can schedule jobs according to the requirements of the users, allocate computing resources and provide appropriate computing services. The high-performance computing cluster usually has a large number of computing nodes, and the high-performance interconnection network is responsible for orderly connecting the computing nodes together to provide high-efficiency communication service, so that the independent computing nodes coordinate to communicate to form an organic whole, and strong computing power and storage resources are provided.

The main function of the high-performance interconnection network is to transmit messages among a large number of computing nodes. The existing job allocation mode randomly selects some computing nodes to execute jobs; however, if message transmission among the selected computing nodes is not timely, the computing efficiency is low, and the computing capacity of the high-performance computing cluster is greatly affected.

Disclosure of Invention

An object of the embodiments of the present application is to provide a job allocation method, an apparatus, an electronic device, and a readable storage medium, so as to address the problem that the job allocation manner in the prior art causes low computational efficiency, which affects the computational capability of a high-performance computing cluster.

In a first aspect, an embodiment of the present application provides a job allocation method, where the method includes: acquiring the number of computing nodes required for executing a target job; acquiring communication delay among all computing nodes in a computing cluster, wherein the communication delay is determined according to link information of communication paths among the computing nodes; selecting target computing nodes for executing the target job in ascending order of communication delay until the number of selected target computing nodes reaches the required number of computing nodes; and distributing the target job to the target computing nodes for execution.

In the implementation process, the communication delay among the computing nodes is obtained, and then the jobs are sequentially distributed to the corresponding computing nodes according to the sequence of the communication delay from small to large, so that the communication delay among the computing nodes executing the jobs is as small as possible, the computing efficiency can be improved, and the computing performance of the computing cluster is improved.

Optionally, the communication delay between the computing nodes is obtained by:

acquiring communication paths among all computing nodes in a network structure formed by the computing clusters;

analyzing link information on the communication path, wherein the link information comprises information of each forwarding device on the communication path, link types between devices connected with each other and link lengths;

and acquiring communication delay among the computing nodes according to the link information.

In the implementation process, the communication delay is obtained according to the link information on the communication path between each computing node, so that the actual link deployment condition can be considered, and more accurate communication delay is obtained.

Optionally, the obtaining the communication delay between the computing nodes according to the link information includes:

determining the equipment forwarding delay corresponding to each forwarding equipment according to the information of each forwarding equipment;

determining link delays between respective devices based on the link types and the link lengths;

and determining communication delay among the computing nodes according to the equipment forwarding delay and the link delay.

In the implementation process, the communication delay is determined according to the device forwarding delay and the link delay, so that the device forwarding delay can be taken into account to obtain more accurate communication delay, and further, a computing node with the minimum communication delay can be selected for the job to improve the computing efficiency.

Optionally, the determining link delays between the devices based on the link types and the link lengths includes:

acquiring corresponding link processing delay and signal processing rate according to the link type;

acquiring corresponding transmission delay according to the signal processing rate and the link length;

and determining the link delay among the devices according to the link processing delay and the transmission delay.

In the implementation process, when the link delay is obtained, the link processing delay and the transmission delay are considered, so that more accurate link delay can be obtained.
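
The three optional steps above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the names `LinkProfile`, `LINK_PROFILES` and `link_delay_us` are hypothetical, the signal processing rate is assumed to be expressed as a delay per meter, and the link processing delay values are placeholders. The per-meter figures are the 0.005 us/m and 0.00625 us/m quoted in the detailed description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkProfile:
    processing_delay_us: float  # fixed link processing delay (placeholder value)
    delay_per_meter_us: float   # signal processing rate, expressed as delay per meter


# Hypothetical lookup table keyed by link type.
LINK_PROFILES = {
    "fiber": LinkProfile(processing_delay_us=0.01, delay_per_meter_us=0.005),
    "copper": LinkProfile(processing_delay_us=0.01, delay_per_meter_us=0.00625),
}


def link_delay_us(link_type: str, length_m: float) -> float:
    """Link delay = link processing delay + transmission delay."""
    profile = LINK_PROFILES[link_type]
    transmission_delay_us = profile.delay_per_meter_us * length_m
    return profile.processing_delay_us + transmission_delay_us
```

Keeping the per-type parameters in one table means that only the link length has to be read per link, and a changed length needs no new table entry.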

Optionally, the obtaining a communication path between each computing node in a network structure formed by the computing clusters includes:

traversing a network structure formed by the computing cluster, analyzing and obtaining equipment information of each equipment and connection information among each equipment in the network structure, wherein each equipment comprises forwarding equipment and a computing node;

and acquiring communication paths among the computing nodes according to the equipment information of each equipment and the connection information among the equipment.

In the implementation process, more accurate and comprehensive communication paths can be obtained by analyzing the device information and the connection information of each device in the network structure.

Optionally, when a plurality of target jobs are performed, selecting target computing nodes for executing the target jobs according to the order of the communication delays from small to large includes:

acquiring the execution priority corresponding to each target job;

and sequentially selecting corresponding target computing nodes for each target job according to the sequence of the execution priority from large to small, wherein when a target computing node corresponding to one target job is selected each time, the computing nodes which are not currently selected as the target computing nodes are selected according to the sequence of the communication delay from small to large.

In the implementation process, the target computing nodes are selected for the jobs in turn based on the execution priority of the jobs, so that the jobs with high execution priority can be executed by the target computing nodes with smaller communication delay, and the computing efficiency of the jobs sensitive to the communication delay can be ensured as much as possible.
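
A possible reading of this priority-ordered selection is sketched below. The function names `pick_nodes` and `assign_jobs`, the tuple layout of a job, and the pairwise delay table are assumptions made for illustration only.

```python
def pick_nodes(pair_delays, available, count):
    """Pick up to `count` nodes from `available`, visiting node pairs by ascending delay."""
    chosen = []
    for (a, b), _delay in sorted(pair_delays.items(), key=lambda kv: kv[1]):
        # A pair is usable only if neither node was taken by an earlier job.
        if a not in available or b not in available:
            continue
        for node in (a, b):
            if node not in chosen and len(chosen) < count:
                chosen.append(node)
        if len(chosen) == count:
            break
    return chosen


def assign_jobs(jobs, pair_delays, nodes):
    """jobs: iterable of (name, execution_priority, node_count) tuples."""
    available = set(nodes)
    assignment = {}
    # Jobs with a higher execution priority choose first.
    for name, _priority, count in sorted(jobs, key=lambda j: j[1], reverse=True):
        chosen = pick_nodes(pair_delays, available, count)
        assignment[name] = chosen
        available -= set(chosen)
    return assignment
```

With two 2-node jobs and four nodes, the higher-priority job receives the pair with the smallest delay, and the other job chooses among the nodes that remain.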

Optionally, the obtaining of the execution priority corresponding to each target job includes:

acquiring the job type of each target job;

and determining the execution priority corresponding to each target job according to the job type.

In the implementation process, the execution priority is determined according to the job type, so that different types of jobs can be executed by computing nodes with an appropriate communication delay, and a plurality of jobs can be distributed more reasonably.
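
One simple way to derive the execution priority from the job type is a lookup table, sketched below. The job type names and priority values are invented for illustration and do not come from the application; the only property taken from the text is that a larger priority is served first and therefore receives the lower-delay computing nodes.

```python
# Hypothetical job types and priorities (larger value = selected earlier).
PRIORITY_BY_JOB_TYPE = {
    "tightly_coupled": 3,          # frequent inter-node messages, delay-sensitive
    "loosely_coupled": 2,
    "independent_tasks": 1,        # little or no inter-node communication
}
DEFAULT_PRIORITY = 0  # assumed fallback for unknown job types


def execution_priority(job_type: str) -> int:
    """Return the execution priority associated with a job type."""
    return PRIORITY_BY_JOB_TYPE.get(job_type, DEFAULT_PRIORITY)
```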

In a second aspect, an embodiment of the present application provides a job distribution apparatus, including:

the node number acquisition module is used for acquiring the number of computing nodes required for executing the target job;

a communication delay obtaining module, configured to obtain a communication delay between each computing node in a computing cluster, where the communication delay is determined according to link information of a communication path between each computing node;

the node selection module is used for selecting target computing nodes for executing the target job according to the sequence of the communication delay from small to large until the number of the selected target computing nodes reaches the number of the computing nodes;

and the job distribution module is used for distributing the target job to the target computing node for execution.

Optionally, the communication delay obtaining module is configured to obtain a communication path between each computing node in a network structure formed by the computing clusters; analyzing link information on the communication path, wherein the link information comprises information of each forwarding device on the communication path, link types between devices connected with each other and link lengths; and acquiring communication delay among the computing nodes according to the link information.

Optionally, the communication delay obtaining module is configured to determine, according to the information of each forwarding device, a device forwarding delay corresponding to each forwarding device; determining link delays between respective devices based on the link types and the link lengths; and determining communication delay among the computing nodes according to the equipment forwarding delay and the link delay.

Optionally, the communication delay obtaining module is configured to obtain a corresponding link processing delay and a corresponding signal processing rate according to the link type; acquiring corresponding transmission delay according to the signal processing rate and the link length; and determining the link delay among the devices according to the link processing delay and the transmission delay.

Optionally, the communication delay obtaining module is configured to traverse a network structure formed by the computing cluster, and analyze and obtain device information of each device and connection information between the devices in the network structure, where each device includes a forwarding device and a computing node; and acquiring communication paths among the computing nodes according to the equipment information of each equipment and the connection information among the equipment.

Optionally, when a plurality of target jobs are available, the node selection module is configured to obtain an execution priority corresponding to each target job; and sequentially selecting corresponding target computing nodes for each target job according to the sequence of the execution priority from large to small, wherein when a target computing node corresponding to one target job is selected each time, the computing nodes which are not currently selected as the target computing nodes are selected according to the sequence of the communication delay from small to large.

Optionally, the node selection module is configured to obtain a job type of each target job; and determining the execution priority corresponding to each target job according to the job type.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps in the method as provided in the first aspect are executed.

In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps in the method as provided in the first aspect.

Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a schematic structural diagram of an electronic device for executing a job assignment method according to an embodiment of the present application;

Fig. 2 is a flowchart of a job allocation method according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a network structure according to an embodiment of the present application;

Fig. 4 is a schematic diagram illustrating parsing of each network device according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a data center plane layout using a fat-tree topology according to an embodiment of the present application;

Fig. 6 is a graph comparing test results of the computing performance of jobs executed on computing nodes selected by the job allocation method of the present application and by an existing job allocation method, according to an embodiment of the present application;

Fig. 7 is a block diagram of a job assigning apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

The embodiment of the application provides a job allocation method, which is characterized in that communication delay among computing nodes is obtained, and then jobs are sequentially allocated to corresponding computing nodes according to the sequence of the communication delay from small to large, so that the communication delay among the computing nodes executing the jobs is as small as possible, the communication delay can be effectively reduced, and the computing performance is improved.

Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device for executing a job assignment method according to an embodiment of the present application, where the electronic device may include: at least one processor 110, such as a CPU, at least one communication interface 120, at least one memory 130, and at least one communication bus 140. Wherein the communication bus 140 is used for realizing direct connection communication of these components. The communication interface 120 of the device in the embodiment of the present application is used for performing signaling or data communication with other node devices. The memory 130 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). Memory 130 may optionally be at least one memory device located remotely from the aforementioned processor. The memory 130 stores computer readable instructions, and when the computer readable instructions are executed by the processor 110, the electronic device executes the method process shown in fig. 2, for example, the memory 130 may be configured to store communication delays between the respective computing nodes, and the processor 110 may be configured to, when performing job assignment, obtain the communication delays between the respective computing nodes from the memory 130, and then select some computing nodes with smaller communication delays from the obtained communication delays to execute the job.

It will be appreciated that the configuration shown in fig. 1 is merely illustrative and that the electronic device may also include more or fewer components than shown in fig. 1 or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.

Referring to fig. 2, fig. 2 is a flowchart of a job allocation method according to an embodiment of the present application, where the method includes the following steps:

Step S110: the number of computing nodes required to execute the target job is acquired.

In the present application, the job scheduling system may execute the job assignment method, and the job scheduling system may be run in an electronic device, and the electronic device may refer to a server, a terminal, and other devices having a certain data processing capability. The job scheduling system may be a scheduling system such as SLURM, PBS, or the like, and a user may submit a job request through the job scheduling system, where the job request carries a job to be executed, and the job scheduling system may select a suitable computing node to execute the job according to a specific scheduling rule and a demand of the user for the job. For convenience of description, the following description will use the job scheduling system as an execution subject to describe a specific implementation process of the job assignment method in the present application.

In some embodiments, the job scheduling system may provide an interactive display interface for a user, through which the user may submit a job request. The job request submitted by the user may include the target job to be executed and the number of computing nodes required to execute it. The target job may be any one of the jobs to be executed; for example, there may be a plurality of jobs to be executed, such as weather forecast, earthquake prediction, and vehicle detection, and every job may be assigned in the same manner.

The computing nodes refer to physical machines with certain data processing capacity in a computing cluster, such as terminal devices, servers and the like, and the computing nodes can be used for executing jobs, the computing nodes communicate with each other through forwarding devices, the forwarding devices include communication devices such as routers, switches, gateways and the like, and the job scheduling system can also communicate with the computing nodes through the forwarding devices. It will be appreciated that these forwarding devices and compute nodes form a compute cluster, such as a large-scale fat-tree network structure, with the uppermost core layer switch being in communication with the job scheduling system and the lowermost compute node being connected to an access layer switch. After receiving a job request submitted by a user, the job scheduling system may first obtain the computing nodes executing the job, and then send the job to the computing nodes through forwarding devices such as a core layer switch and an access layer switch, so that the computing nodes execute the job, and the computing nodes may transmit a computing result obtained after the job is executed to the job scheduling system through the forwarding devices.

In some embodiments, the job scheduling system may also automatically determine the number of computing nodes required to execute the job according to information about the job. For example, the job scheduling system may split the job into a plurality of sub-jobs according to information such as the job type and the CPU and memory resources required to execute the job. For a weather prediction job, splitting may yield 30 sub-jobs such as water volume calculation, illumination calculation, and temperature calculation; the number of computing nodes required for executing the job is then 30, with each computing node executing one sub-job. Alternatively, the number of computing nodes may be determined from the CPU and memory resources required for executing the job: if the CPU and memory resources of one computing node are known and the job request submitted by the user carries the total CPU and memory resources required, the number of required computing nodes can be obtained by calculation. If the CPU or memory resources differ between computing nodes, the target computing nodes for executing the target job may additionally be selected according to the CPU or memory resources of each computing node, so that the resources of the selected computing nodes better match the resources required by the target job.
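
The resource-based calculation can be sketched as follows, assuming that all computing nodes have identical CPU and memory resources. The function name, the parameter names, and the choice to take the larger of the two per-resource counts are assumptions; the application only states that the number can be obtained by calculation.

```python
import math


def required_node_count(total_cpus, total_mem_gb, cpus_per_node, mem_gb_per_node):
    """Smallest number of identical nodes that covers both CPU and memory demand."""
    nodes_for_cpu = math.ceil(total_cpus / cpus_per_node)
    nodes_for_mem = math.ceil(total_mem_gb / mem_gb_per_node)
    # The scarcer resource decides how many nodes are needed.
    return max(nodes_for_cpu, nodes_for_mem)
```

For instance, a job requesting 960 CPU cores and 2048 GB of memory on nodes with 32 cores and 128 GB each needs 30 nodes, because the CPU demand is the limiting resource.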

Step S120: communication delay among all computing nodes in the computing cluster is obtained, and the communication delay is determined according to link information of communication paths among all the computing nodes.

In the actual wiring of a computing cluster, the type or length of the cable may differ between devices. The wiring is planned according to the site environment, and short optical fiber cables are used as far as possible; for example, a large-scale data center may simultaneously contain optical fiber cables of 20 meters, 30 meters, 50 meters, or 100 meters, and the longer the distance, the higher the communication delay. Therefore, in order to improve the computing performance of the computing cluster, computing nodes with small communication delay between them are selected as far as possible to execute a job when the job is distributed.

The communication delay can be understood as the length of time required for communication between two computing nodes: the longer this time is, the lower the communication efficiency between the two computing nodes, and the shorter this time is, the higher the communication efficiency between the two computing nodes.

The link information of the communication path between the computing nodes may include information such as the type of forwarding devices passing through the communication path, the link type (e.g., fiber optic cable or copper cable), and the link length, and the communication delay between each two computing nodes may be determined through the link information.

In some embodiments, the communication delay between the computing nodes may be pre-calculated and pre-stored in the job scheduling system, and since the network structure of the computing cluster is not changed in general, the communication delay between the computing nodes is not changed, so that the job scheduling system may directly read the stored communication delay between the computing nodes each time a job is allocated. Of course, when the network structure of the computing cluster is changed or the link information between the computing nodes is changed, the communication delay between the computing nodes may be recalculated and then the stored communication delay may be updated.
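
A minimal sketch of this store-and-update behavior is given below. It assumes that the topology can be represented as a JSON-serializable structure and that a change is detected by comparing a hash of that structure; the class name `DelayCache` and the `compute_delays` callback are hypothetical.

```python
import hashlib
import json


class DelayCache:
    """Stores pairwise delays and recomputes them only when the topology changes."""

    def __init__(self, compute_delays):
        self._compute_delays = compute_delays  # callable: topology -> delay table
        self._fingerprint = None
        self._delays = None

    @staticmethod
    def _fingerprint_of(topology):
        # Canonical JSON makes the hash independent of dictionary key order.
        blob = json.dumps(topology, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get(self, topology):
        fingerprint = self._fingerprint_of(topology)
        if fingerprint != self._fingerprint:
            # Network structure or link information changed: recalculate and store.
            self._delays = self._compute_delays(topology)
            self._fingerprint = fingerprint
        return self._delays
```

Repeated job allocations against an unchanged topology then read the stored table, and a changed link length or device triggers exactly one recalculation.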

Step S130: target computing nodes for executing the target job are selected in ascending order of communication delay until the number of selected target computing nodes reaches the required number of computing nodes.

In order to improve the computing performance, when selecting a target computing node for executing a target job, the target computing node is selected according to the sequence of communication delays from small to large, for example, the communication delays between the computing nodes may be sorted according to the sequence of communication delays from small to large, each communication delay corresponds to two computing nodes, and then the computing nodes may be selected as the target computing nodes according to the sequence of communication delays from small to large until the number of the selected target computing nodes reaches the required number of computing nodes.
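
The example in the preceding paragraph can be sketched as a greedy walk over node pairs. The function name `select_target_nodes` and the dictionary layout of the delay table are assumptions made for illustration.

```python
def select_target_nodes(pair_delays, node_count):
    """Return up to `node_count` nodes, visiting node pairs by ascending delay.

    pair_delays maps a (node_a, node_b) tuple to the communication delay
    between those two computing nodes.
    """
    selected = []
    for (node_a, node_b), _delay in sorted(pair_delays.items(), key=lambda kv: kv[1]):
        for node in (node_a, node_b):
            if len(selected) == node_count:
                return selected
            if node not in selected:
                selected.append(node)
    return selected
```

This follows the description literally: each sorted delay contributes its two nodes until enough nodes are collected. It does not by itself guarantee that every pair among the selected nodes has a small delay, so a production scheduler might add a further check.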

Step S140: the target job is distributed to the target computing nodes for execution.

After the target computing nodes are selected, the job scheduling system can distribute the target jobs to the target computing nodes, the target computing nodes execute the target jobs, after the target computing nodes execute the target jobs, the execution results can be transmitted to the job scheduling system, and the job scheduling system can present the execution results to a user, so that the user can know the execution condition of the target jobs.

In some embodiments, in order to improve the computing efficiency, the communication delay between the computing nodes may be calculated in advance and stored in the job scheduling system, and updated when the job scheduling system detects that the network deployment of the computing cluster has changed. However, if the job scheduling system does not update the stored communication delay in time after the network deployment changes, the target computing nodes selected based on it may not be optimal. Therefore, so that the computing nodes with the minimum communication delay can be selected more accurately each time, the communication delay between the computing nodes may instead be obtained again each time the job assignment method is executed.

Whether the communication delay is acquired in advance and then stored or acquired every time when the job is distributed, the manner of acquiring the communication delay among the computing nodes can be as follows: the method comprises the steps of obtaining communication paths among all computing nodes in a network structure formed by computing clusters, analyzing link information on the communication paths, wherein the link information comprises information of all forwarding devices on the communication paths, link types and link lengths among devices connected with each other, and then obtaining communication delay among all the computing nodes according to the link information.

The communication path refers to the path through which communication between two computing nodes passes. As shown in fig. 3, the communication path between the computing node N1 and the computing node N2 is N1->SW1->N2, and the communication path between the computing node N1 and the computing node N3 is N1->SW1->SW5->SW2->N3. The communication paths between all computing nodes can be obtained by traversing and analyzing the network structure formed by the computing cluster.
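
As an illustration of how such paths might be derived from connection information, the sketch below runs a breadth-first search over an adjacency list. The adjacency list is a hypothetical fragment that is merely consistent with the two paths quoted from fig. 3, and the function name is an assumption; the application does not prescribe a particular traversal algorithm.

```python
from collections import deque

# Hypothetical fragment of the network structure (not the full fig. 3).
TOPOLOGY = {
    "N1": ["SW1"],
    "N2": ["SW1"],
    "N3": ["SW2"],
    "SW1": ["N1", "N2", "SW5"],
    "SW2": ["N3", "SW5"],
    "SW5": ["SW1", "SW2"],
}


def communication_path(topology, source, target):
    """Breadth-first search returning a fewest-hop path, or None if unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in topology[path[-1]]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

In this fragment every computing node has a single uplink, so a path never passes through another computing node; a topology with multi-homed nodes would need an explicit rule that only forwarding devices may appear in the middle of a path.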

After the communication paths between the computing nodes are obtained, the communication paths may be parsed to obtain the corresponding link information, such as the device information of the three forwarding devices SW1, SW5 and SW2 on the communication path N1->SW1->SW5->SW2->N3, and the link connection information, such as the link type and link length, between each two adjacent devices on the communication path, for example the link type and link length between N1 and SW1 and those between SW1 and SW5. The link type and the link length can be obtained from a port register, which stores the link information between each two devices connected to each other.

The link type and the link length affect the communication delay differently. For example, the signal transmission rate of an optical fiber cable, expressed as delay per unit length, is usually 0.005 us/m; the slower the signal propagates and the longer the link, the higher the communication delay. Since the propagation speed of a copper cable is about 80% of that of an optical fiber cable, the signal transmission rate of a copper cable can be taken as 0.00625 us/m. The link delay for transmitting information over a link can therefore be obtained from the signal transmission rate of the link and the link length. In addition, after receiving the information, each forwarding device needs to perform operations such as encapsulating and forwarding it, which also takes a certain amount of time; different types of forwarding devices (such as switches and routers, or switches or routers of different models) process information at different rates and therefore introduce different delays.
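
The relation between the two per-meter figures, and the way a link length turns them into a delay, can be checked with a few lines of arithmetic. The constant and function names below are illustrative assumptions.

```python
FIBER_DELAY_PER_METER_US = 0.005  # figure quoted in the description
COPPER_RELATIVE_SPEED = 0.8       # copper propagates at about 80% of fiber speed

# A slower medium takes proportionally longer per meter: 0.005 / 0.8.
COPPER_DELAY_PER_METER_US = FIBER_DELAY_PER_METER_US / COPPER_RELATIVE_SPEED


def propagation_delay_us(delay_per_meter_us: float, length_m: float) -> float:
    """Delay contributed by the cable itself: per-meter delay times length."""
    return delay_per_meter_us * length_m
```

Under these figures a 100-meter optical fiber link contributes about 0.5 us, and a copper link of the same length about 0.625 us.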

Therefore, the communication delay among the computing nodes can be comprehensively calculated according to the link information, so that the actual network deployment condition (including the deployment of the forwarding equipment and the deployment of the line) can be considered, and more accurate communication delay can be calculated.

In some embodiments, when obtaining the communication delay between the computing nodes, the device forwarding delay corresponding to each forwarding device may be determined according to the information of each forwarding device, the link delay between each device may be determined based on the link type and the link length, and then the communication delay between the computing nodes may be determined according to the device forwarding delay and the link delay.

The information of each forwarding device may include information such as the type and model of the forwarding device, and forwarding devices of different types or models have different device forwarding delays; for example, for a switch of the HDR 200 Gbps model, the device forwarding delay is 0.09 us. The job scheduling system may store in advance the device forwarding delays corresponding to forwarding devices of different types or models, so that it can directly look up the device forwarding delay of each forwarding device according to the information of that device; for example, for the communication path N1->SW1->SW5->SW2->N3, the device forwarding delays of switches SW1, SW5, and SW2 may be obtained.

In order to quickly acquire the link delay, the job scheduling system may also store in advance the link delay corresponding to each link type and link length, so that the corresponding link delay can be looked up directly. For example, suppose the stored link delay for an optical fiber link of 10 meters is 0.05 us; if the link between N1 and SW1 is an optical fiber with a length of 10 meters, the lookup gives a link delay of 0.05 us. The link delay between SW1 and SW5, between SW5 and SW2, and between SW2 and N3 can be looked up in the same way.

After the device forwarding delays and the link delays are obtained, they may be added to obtain the communication delay between the computing nodes. For example, the communication delay between N1 and N3 = the device forwarding delay of SW1 + the device forwarding delay of SW5 + the device forwarding delay of SW2 + the link delay of link N1-SW1 + the link delay of link SW1-SW5 + the link delay of link SW5-SW2 + the link delay of link SW2-N3.
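The additive rule above can be illustrated with a minimal Python sketch. The table names, the function name, and the 20 m entry are assumptions made for illustration; only the 0.09 us switch delay and the 0.05 us delay of a 10 m optical fiber come from the examples in the text.

```python
# Hypothetical pre-stored lookup tables (all delays in microseconds).
# 0.09 us for an HDR 200 Gbps switch and 0.05 us for 10 m of fiber are the
# examples from the text; the 20 m entry is an assumed value.
DEVICE_FORWARDING_DELAY_US = {"HDR200": 0.09}
LINK_DELAY_US = {("fiber", 10): 0.05, ("fiber", 20): 0.10}


def communication_delay(switch_models, links):
    """Add the forwarding delay of every switch and the delay of every link."""
    forwarding = sum(DEVICE_FORWARDING_DELAY_US[model] for model in switch_models)
    link = sum(LINK_DELAY_US[(kind, length)] for kind, length in links)
    return forwarding + link


# Path N1 -> SW1 -> SW5 -> SW2 -> N3: three switches and four links.
path_switches = ["HDR200", "HDR200", "HDR200"]
path_links = [("fiber", 10), ("fiber", 20), ("fiber", 20), ("fiber", 10)]
delay_n1_n3 = communication_delay(path_switches, path_links)
print(f"N1 <-> N3: {delay_n1_n3:.2f} us")
```

Keeping both tables keyed by device model and by (link type, link length) mirrors the pre-stored lookups described above, so no per-path measurement is needed.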

It can be appreciated that, to facilitate network deployment, the link type and link length between each computing node and its access layer switch are fixed in the network structure of a large data center; that is, the link delay between each computing node and its access layer switch is fixed. Therefore, to reduce the amount of computation when computing the communication delay of the communication path N1->SW1->SW5->SW2->N3, the link delays of the two links N1->SW1 and SW2->N3 need not be computed. Likewise, if all access layer switches are of the same type or model, the device forwarding delays of SW1 and SW2 need not be computed. The communication delay may then be simplified to the device forwarding delay of SW5 + the link delay of link SW1-SW5 + the link delay of link SW5-SW2. In this way, the communication delay between the computing nodes can be obtained.

In some embodiments, in the network operation process, a deployed network line may have a fault, and at this time, a situation such as a change of a link length may be involved, so that, in order to obtain a more accurate link delay and reduce the storage amount of the job scheduling system, the job scheduling system may store link processing rates corresponding to respective link types, so that corresponding link processing rates may be directly obtained according to the link types, and then link delays between respective devices may be determined according to the link processing rates and the link lengths.

For example, if the obtained link type is optical fiber, the link processing rate found for optical fiber is 0.005 us/m, and the obtained link length is 20 m, then the link delay = link processing rate × link length = 0.005 us/m × 20 m = 0.1 us.

In some embodiments, different links may also introduce a certain delay when processing information. For example, an optical fiber must perform photoelectric conversion when transmitting information, and the processing delay of photoelectric conversion is generally 0.1 us. The link processing delay corresponding to each link type may therefore also be considered when acquiring the link delay. One implementation is as follows: acquire the corresponding link processing delay and signal processing rate according to the link type, acquire the corresponding transmission delay according to the signal processing rate and the link length, and determine the link delay between the devices according to the link processing delay and the transmission delay.

The job scheduling system may store the link processing delay corresponding to each link type. For example, the link processing delay of optical fiber, which performs photoelectric conversion, is 0.1 us, whereas a copper cable needs no photoelectric conversion, so its link processing delay may be 0. The transmission delay may be pre-stored in the job scheduling system so that it can be obtained directly by lookup. Alternatively, the corresponding signal processing rate may first be obtained according to the link type, and the signal processing rate may then be multiplied by the link length to calculate the transmission delay. The link processing delay and the transmission delay may then be added to obtain the link delay, or the average of the two may be taken as the link delay.
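The two ways of combining the link processing delay and the transmission delay can be sketched as follows. This is a minimal illustration, not the stored data of an actual scheduler: the dictionary and function names are hypothetical, and the per-meter value for copper is an assumption (the text gives 0.005 us/m only for optical fiber).

```python
# Link processing delay per link type, in microseconds (values from the text).
LINK_PROCESSING_DELAY_US = {"fiber": 0.1, "copper": 0.0}
# Per-meter delay in us/m. The fiber value is from the text; the copper
# value is an assumption used only to make the sketch complete.
PER_METER_DELAY_US = {"fiber": 0.005, "copper": 0.005}


def link_delay(link_type, length_m, average=False):
    """Combine the link processing delay with the length-dependent delay."""
    processing = LINK_PROCESSING_DELAY_US[link_type]
    transmission = PER_METER_DELAY_US[link_type] * length_m
    if average:
        # Alternative mentioned in the text: take the mean of the two parts.
        return (processing + transmission) / 2
    return processing + transmission


fiber_20m = link_delay("fiber", 20)
copper_5m = link_delay("copper", 5)
print(f"20 m fiber: {fiber_20m:.3f} us, 5 m copper: {copper_5m:.3f} us")
```

Storing a per-meter value per link type instead of one delay per (type, length) pair keeps the table small and still covers a link whose length changes after a fault.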

A calculation formula for calculating the obtained link delay is given below:

wherein K represents the number of links on the communication path (for example, the communication path N1->SW1->SW5->SW2->N3 includes four links, i.e., K = 4), Latency represents the link delay, T_sw_i represents the device forwarding delay of the ith forwarding device (for example, of the 3 switches on the communication path), S_i represents the signal transmission rate of the ith link, L_i represents the link length of the ith link, and D_i represents the link processing delay of the ith link.
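Read together with the additive rule described earlier, the symbols defined above suggest an expression of the following form. This is a reconstruction under stated assumptions rather than the formula as filed: the upper limit K-1 of the first sum assumes that a path of K links passes through K-1 forwarding devices (as in the example with 4 links and 3 switches), S_i is treated as a per-unit-length delay (for example, in us/m), and Latency is read as the total delay along the communication path.

```latex
\mathrm{Latency} \;=\; \sum_{i=1}^{K-1} T_{sw_i} \;+\; \sum_{i=1}^{K} \left( S_i \cdot L_i + D_i \right)
```

If the averaging variant is used instead, the term in parentheses would become (S_i · L_i + D_i) / 2.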

The following is a description of a process of obtaining a communication path by analyzing a network structure formed by a computation cluster.

The manner of obtaining the communication path between each computing node may be: traversing a network structure formed by the computing cluster, analyzing and obtaining equipment information of each equipment and connection information among each equipment in the network structure, wherein each equipment comprises forwarding equipment and computing nodes, and then obtaining communication paths among the computing nodes according to the equipment information of each equipment and the connection information among each equipment.

Specifically, the job scheduling system may analyze the network structure formed by the computing cluster to construct the network topology of the computing cluster. In the topology discovery process, a head node (the node where the scheduler is located) is first designated; the network is then traversed to find all network devices, each device is identified as either a switch or a computing node, and the identified device information is added to the device array. The specific process is shown in fig. 4. One specific implementation is as follows. First, the node information of the scheduler is detected through the function check_topop->device_head() and taken as the head node. Then, starting from the head node, each device is discovered through check_topop and the whole network is traversed with for_all_devices(check_topop, device), which returns the relevant information of each device (device name, GUID, and the like). Next, all ports of each node are traversed with for (i = start_port; i <= end_port; i++), and all neighbors of the node are added to the connection relation list: status = sm_setup_node(head_topop, &predefttopology, device, port, path, cable_type, cable_length). The link type connected to each port is detected and returned in cable_type, and the link length connected to each port is detected and returned in cable_length.
Then, device information is established through the function setup_device(), which calls the function device_Create(), and each device is identified as a switch or a computing node according to its device information. Next, the function discover_device_port() is called to assign coordinates and a name to each discovered device, forming a Fabric device that is added to the Fabric topology. Finally, the function build_device_array() is called to construct the device array, and all network devices are mapped into it to complete topology discovery.
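The discovery steps above (start at a head node, walk every port, record each neighbor with its cable type and length, then build the device array) can be sketched in Python. This is only an illustration of the traversal idea: the FABRIC wiring table, the discover function, and the rule that names beginning with "SW" are switches are all hypothetical and do not correspond to the functions named in the text.

```python
from collections import deque

# Hypothetical wiring: device -> list of (port, peer, cable_type, cable_length_m).
FABRIC = {
    "N1": [(1, "SW1", "fiber", 10)],
    "SW1": [(1, "N1", "fiber", 10), (2, "SW5", "fiber", 20)],
    "SW5": [(1, "SW1", "fiber", 20), (2, "SW2", "fiber", 20)],
    "SW2": [(1, "SW5", "fiber", 20), (2, "N3", "fiber", 10)],
    "N3": [(1, "SW2", "fiber", 10)],
}


def discover(head):
    """Breadth-first traversal from the head node over every port."""
    device_array, connections = [], []
    seen, queue = {head}, deque([head])
    while queue:
        device = queue.popleft()
        # Assumed naming rule: devices whose name starts with "SW" are switches.
        kind = "switch" if device.startswith("SW") else "node"
        device_array.append((device, kind))
        for port, peer, cable_type, cable_length in FABRIC[device]:
            connections.append((device, port, peer, cable_type, cable_length))
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return device_array, connections


devices, connection_list = discover("N1")
for name, kind in devices:
    print(name, kind)
```

The returned connection list carries the same kinds of fields (port, peer device, link type, link length) that the connection relation list is described as holding.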

In the process of constructing the network topology, the connection relationships between the devices may be sorted out based on the obtained device array. The network structure is then traversed and a connection relation list is constructed for each switch, taking the switch as the basic unit. The list includes the name of the switch, the type of the opposite-end device, the port quality, the link type, the link length, and the like, for use by the job scheduling system in topology analysis. The constructed connection relation list is shown in table 1 below.

TABLE 1

It can be understood that the device information of each device may include information such as a name and a type of the device, and the connection information between each device may be shown by the information in the above table, including information such as a name of the opposite device, a type of the opposite device, a connected port, a link type, and a link length.

After obtaining the information in the table, the job scheduling system may analyze it to obtain the connection relationship between the devices, and further obtain the communication path between every two computing nodes. Fig. 5 is a schematic plan view of a data center that adopts a fat-tree topology. Owing to site limitations, the cluster is divided into several computing cabinet groups (participating groups); each access layer switch is located inside its corresponding computing cabinet group, and the core switches (Core SW) are placed together inside a switch cabinet. In actual deployment, the cabling distance from each computing cabinet group to the core switch differs, and, considering cost and performance, the shortest suitable cable is chosen wherever possible. A real data center may therefore contain optical fiber cables (AOC) of 20 meters, 30 meters, or 100 meters. As shown in fig. 5, the length of the optical fiber link connecting each computing cabinet group to the core switch varies with the actual deployment and cabling distance: the fiber to a group close to the core switch is only 10 meters, while the fiber to a group far from it is 50 meters. The job scheduling system can thus identify the computing nodes with the smallest link distance and, according to the computing requirements, allocate the computing nodes with the best communication locality, which improves computing efficiency as far as possible.

The job scheduling system analyzes the network structure according to the connection relation list and traverses the whole network. Suppose communication from one computing node to another requires K links; the communication path between the two computing nodes can then be obtained by combining the link information of the K links. If the K links are all of the same link type, the transmission delay may be calculated by first obtaining the total link distance and then multiplying it by the signal transmission rate. The total link distance is calculated as follows:

$$D = \sum_{i=1}^{K} L_i$$

wherein D represents the total link distance, K represents the number of links, and L_i represents the link length of the ith link.

In some embodiments, in order to facilitate calculating the link delay, the link distance between each calculation node may be obtained by traversing all communication paths according to the above link distance calculation formula, and a link distance matrix is constructed and stored, as shown in table 2 below.
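Building such a link distance matrix can be sketched as follows. The topology, the link lengths, and the function names are hypothetical, and the depth-first search assumes a tree-like structure in which exactly one path connects any two devices; a network with several candidate paths would need an explicit path-selection rule.

```python
from itertools import combinations

# Hypothetical two-level topology; link lengths in meters.
LINKS = {
    ("N1", "SW1"): 10,
    ("N2", "SW1"): 10,
    ("N3", "SW2"): 10,
    ("SW1", "SW5"): 20,
    ("SW2", "SW5"): 50,
}


def neighbors(device):
    """Yield (peer, link_length) for every link attached to the device."""
    for (a, b), length in LINKS.items():
        if a == device:
            yield b, length
        elif b == device:
            yield a, length


def path_distance(src, dst, seen=None):
    """Depth-first search returning the summed link length D, or None."""
    if src == dst:
        return 0
    seen = (seen or set()) | {src}
    for peer, length in neighbors(src):
        if peer not in seen:
            rest = path_distance(peer, dst, seen)
            if rest is not None:
                return length + rest
    return None


nodes = ["N1", "N2", "N3"]
distance = {n: {n: 0} for n in nodes}
for a, b in combinations(nodes, 2):
    distance[a][b] = distance[b][a] = path_distance(a, b)

for n in nodes:
    print(n, [distance[n][m] for m in nodes])
```

Because the matrix is symmetric, each pair is computed once and mirrored, which halves the traversal work.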

TABLE 2

Thus, when calculating the transmission delay, the link length of each link can be obtained by looking up table 2, for example by looking up the entry that corresponds to the devices at the two ends of the link. The communication delay between computing nodes, or between any two devices, can then be calculated quickly in the manner described above. To make the communication delay itself quick to obtain, a communication delay matrix may also be constructed, as shown in table 3 below, which lists the communication delay between the computing nodes.

TABLE 3

Thus, when selecting target computing nodes, table 3 may be traversed and computing nodes selected in ascending order of communication delay, excluding 0, because 0 is the communication delay between a computing node and itself. For example, if the smallest communication delay currently in table 3 is 1.4, the computing nodes with a communication delay of 1.4, namely 0x0001 and 0x0002, are selected first as target computing nodes; the computing nodes corresponding to a communication delay of 1.6 are selected next, and so on until the number of selected target computing nodes reaches the required number of computing nodes.
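The selection step can be sketched as follows, assuming the communication delay matrix is held as a nested dictionary. The function name and the matrix values are illustrative: only the node identifiers 0x0001 and 0x0002 and the delays 1.4 and 1.6 appear in the text, and the remaining entries are made up.

```python
def select_target_nodes(delay, required, excluded=frozenset()):
    """Pick nodes by walking node pairs in ascending order of delay."""
    pairs = sorted(
        (delay[a][b], a, b)
        for a in delay
        for b in delay[a]
        # a < b visits each pair once and skips the zero diagonal.
        if a < b and a not in excluded and b not in excluded
    )
    selected = []
    for _, a, b in pairs:
        for node in (a, b):
            if node not in selected and len(selected) < required:
                selected.append(node)
        if len(selected) == required:
            break
    return selected


# Illustrative communication delay matrix (microseconds).
DELAY_US = {
    "0x0001": {"0x0001": 0.0, "0x0002": 1.4, "0x0003": 1.6, "0x0004": 1.8},
    "0x0002": {"0x0001": 1.4, "0x0002": 0.0, "0x0003": 1.6, "0x0004": 1.8},
    "0x0003": {"0x0001": 1.6, "0x0002": 1.6, "0x0003": 0.0, "0x0004": 1.8},
    "0x0004": {"0x0001": 1.8, "0x0002": 1.8, "0x0003": 1.8, "0x0004": 0.0},
}

targets = select_target_nodes(DELAY_US, 3)
print(targets)
```

The optional excluded argument is an added convenience for leaving out nodes that are already taken; it is not described in the text.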

In this way, the target computing nodes selected to execute the target job have low communication delay between them, because the computing nodes with the lowest communication delay are allocated first. This improves job execution efficiency and the computing performance of the computing cluster.

In some embodiments, when a plurality of target jobs are received, the job scheduling system may, in order to allocate them optimally, set an execution priority for each job in advance, allocating jobs with a high execution priority first and jobs with a low execution priority later. Therefore, when selecting the target computing nodes for executing the jobs, the execution priority corresponding to each target job may be obtained first, and the target computing nodes for each target job are then selected in turn in descending order of execution priority. Each time the target computing nodes for one target job are selected, they are chosen, in ascending order of communication delay, from the computing nodes that have not yet been selected as target computing nodes.

When the execution priority is configured for each job in advance, it may be set according to how sensitive the job is to communication delay. Jobs such as weather forecasting, earthquake prediction, AI applications, and 5G applications are sensitive to communication delay, so a high execution priority may be set for them; such jobs are then allocated first and given, as far as possible, the computing resources with the smallest communication delay, so as to obtain the best computing performance. The job scheduling system may pre-store the execution priority corresponding to each job (for example, by marking each job with an execution priority in advance), and after obtaining a plurality of target jobs it may look up the execution priority corresponding to each target job.

Alternatively, in some embodiments, the job scheduling system may obtain the execution priority of each target job by first obtaining the job type of each target job and then determining the corresponding execution priority according to the job type: job types that are more sensitive to communication delay receive a higher execution priority. It can be understood that jobs of different job types may have the same or different execution priorities. The job scheduling system may also store the execution priority corresponding to each job type, so that it can first determine the job type of each target job and then look up the corresponding execution priority according to the job type.

When allocating a plurality of target jobs, the target jobs may be sorted by execution priority. For example, suppose there are 3 target jobs, with target job 1 having the highest execution priority, target job 2 the second highest, and target job 3 the lowest. Target job 1 is allocated first. If the computing cluster currently contains 400 idle computing nodes that are all able to execute jobs, and target job 1 requires 40 computing nodes, then 40 computing nodes are selected from these 400 in ascending order of communication delay as the target computing nodes for target job 1, and target job 1 is allocated to these 40 computing nodes. Target job 2 is allocated next: if target job 2 requires 30 computing nodes and 360 computing nodes remain unselected, 30 computing nodes are selected from the remaining 360 in ascending order of communication delay as the target computing nodes for target job 2. The computing nodes for target job 3 are then selected from the remaining 330 computing nodes, again in ascending order of communication delay. Assigning jobs in this manner gives the computing nodes with low communication delay to the jobs with high execution priority, ensuring the execution efficiency of those jobs.
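The worked example above can be sketched in Python. As a simplifying assumption, the 400 idle computing nodes are taken to be already ranked in ascending order of communication delay, so each job simply receives the next block of the ranking; the function name, the node names, the numeric priorities, and the 20 nodes requested by target job 3 are all hypothetical.

```python
def assign_jobs(jobs, ranked_nodes):
    """Serve jobs in descending priority from a delay-ranked node list."""
    free = list(ranked_nodes)
    assignment = {}
    for name, priority, required in sorted(jobs, key=lambda job: -job[1]):
        if required > len(free):
            # Assumption: a job that cannot be satisfied stays pending.
            continue
        assignment[name] = free[:required]
        free = free[required:]
    return assignment, free


# 400 idle nodes, assumed pre-sorted by ascending communication delay.
ranked = [f"node{i:03d}" for i in range(400)]
# (job name, execution priority, required number of computing nodes)
jobs = [("job2", 2, 30), ("job1", 3, 40), ("job3", 1, 20)]

assignment, remaining = assign_jobs(jobs, ranked)
print({name: len(nodes) for name, nodes in assignment.items()}, len(remaining))
```

Because communication delay is defined between pairs of nodes, a real scheduler would re-rank the remaining nodes after each allocation; the single pre-sorted list here only keeps the sketch short.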

For jobs with the same execution priority and the same type, computing nodes with consistent communication distances may be selected as far as possible. This keeps the average communication delay consistent and avoids computation results being returned out of step because of differing communication distances, which would reduce communication efficiency. For example, suppose jobs 1-5 all have the same execution priority, but jobs 1 and 4 are communication-intensive jobs while jobs 2, 3, and 5 are computation-intensive jobs. Because the two types of jobs are equally sensitive to communication delay, jobs of different types that share the same execution priority may additionally be tagged with an urgent priority; for example, a communication-intensive job is generally more urgent than a computation-intensive job and therefore has a higher urgent priority. Thus, when the job scheduling system performs job assignment, it can assign computing nodes with low communication delay first to job 1 and job 4, and then to job 2, job 3, and job 5. Between job 1 and job 4, the assignment order may be decided arbitrarily, or by a more detailed priority setting based on other conditions. If job 1 requires 30 computing nodes and job 4 requires 40, then 30 target computing nodes can be selected for job 1 in ascending order of communication delay, and 40 target computing nodes can then be selected for job 4, again in ascending order of communication delay, from the remaining unselected computing nodes. In this way the communication delays between the target computing nodes selected for jobs of the same type differ as little as possible, so that the computation results are returned as synchronously as possible after the computing nodes execute the jobs in parallel.
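The tie-breaking rule for jobs 1-5 can be sketched as a two-level sort key. The URGENCY scale, the priority value 5, and the function name are assumptions for illustration; the text states only that communication-intensive jobs are generally more urgent than computation-intensive ones.

```python
# Assumed urgency scale: a larger number means more urgent.
URGENCY = {"communication-intensive": 2, "computation-intensive": 1}

jobs = [
    {"name": "job1", "priority": 5, "type": "communication-intensive"},
    {"name": "job2", "priority": 5, "type": "computation-intensive"},
    {"name": "job3", "priority": 5, "type": "computation-intensive"},
    {"name": "job4", "priority": 5, "type": "communication-intensive"},
    {"name": "job5", "priority": 5, "type": "computation-intensive"},
]


def allocation_order(job_list):
    """Order by execution priority first, then by urgent priority."""
    return sorted(job_list, key=lambda j: (-j["priority"], -URGENCY[j["type"]]))


order = [job["name"] for job in allocation_order(jobs)]
print(order)
```

Python's sort is stable, so jobs that tie on both keys keep their submission order, which is one simple way to realize the arbitrary choice between job 1 and job 4.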

To demonstrate the performance improvement achieved by the job allocation method, the method provided by the embodiments of the present application was applied in the high-performance interconnection network of an actual supercomputing center for a comparison test. The test ran the molecular dynamics software GROMACS at a scale of 240 computing nodes, on computing nodes connected by AOC optical fibers of different lengths. The test results are shown in table 4 below and in fig. 6, where the ordinate of fig. 6 is the test time in seconds. Compared with the existing job distribution method (common scheduling), the test time with the present job distribution method (scheduling of the invention) shows a clear performance improvement, and the shorter the optical fiber communication distance, the higher the speedup, which demonstrates the performance improvement brought by allocating jobs according to communication delay.

TABLE 4

Referring to fig. 7, fig. 7 is a block diagram of a job distribution apparatus 200 according to an embodiment of the present application; the apparatus 200 may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus 200 corresponds to the above-mentioned embodiment of the method of fig. 2, and can perform various steps related to the embodiment of the method of fig. 2, and the specific functions of the apparatus 200 can be referred to the above description, and the detailed description is appropriately omitted here to avoid redundancy.

Optionally, the apparatus 200 comprises:

a node number obtaining module 210, configured to obtain the number of computing nodes required to execute the target job;

a communication delay obtaining module 220, configured to obtain a communication delay between each computing node in the computing cluster, where the communication delay is determined according to link information of a communication path between each computing node;

a node selecting module 230, configured to select target computing nodes for executing the target job according to the sequence of the communication delays from small to large until the number of the selected target computing nodes reaches the number of computing nodes;

a job assigning module 240, configured to assign the target job to the target computing node for execution.

Optionally, the communication delay obtaining module 220 is configured to obtain a communication path between each computing node in a network structure formed by the computing clusters; analyzing link information on the communication path, wherein the link information comprises information of each forwarding device on the communication path, link types between devices connected with each other and link lengths; and acquiring communication delay among the computing nodes according to the link information.

Optionally, the communication delay obtaining module 220 is configured to determine, according to the information of each forwarding device, a device forwarding delay corresponding to each forwarding device; determining link delays between respective devices based on the link types and the link lengths; and determining communication delay among the computing nodes according to the equipment forwarding delay and the link delay.

Optionally, the communication delay obtaining module 220 is configured to obtain a corresponding link processing delay and a corresponding signal processing rate according to the link type; acquiring corresponding transmission delay according to the signal processing rate and the link length; and determining the link delay among the devices according to the link processing delay and the transmission delay.

Optionally, the communication delay obtaining module 220 is configured to traverse a network structure formed by the computing cluster, and analyze and obtain device information of each device and connection information between the devices in the network structure, where each device includes a forwarding device and a computing node; and acquiring communication paths among the computing nodes according to the equipment information of each equipment and the connection information among the equipment.

Optionally, when a plurality of target jobs are available, the node selecting module 230 is configured to obtain an execution priority corresponding to each target job; and sequentially selecting corresponding target computing nodes for each target job according to the sequence of the execution priority from large to small, wherein when a target computing node corresponding to one target job is selected each time, the computing nodes which are not currently selected as the target computing nodes are selected according to the sequence of the communication delay from small to large.

Optionally, the node selecting module 230 is configured to obtain a job type of each target job; and determining the execution priority corresponding to each target job according to the job type.

It should be noted that, for the convenience and brevity of description, the specific working procedure of the above-described apparatus may refer to the corresponding procedure in the foregoing method embodiment, and the description is not repeated herein.

Embodiments of the present application provide a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the method processes performed by an electronic device in the method embodiment shown in fig. 2.

The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments, for example, comprising: acquiring the number of computing nodes required for executing the target job; acquiring communication delay among all computing nodes in a computing cluster, wherein the communication delay is determined according to link information of communication paths among the computing nodes; selecting target computing nodes for executing the target job according to the sequence of the communication delay from small to large until the number of the selected target computing nodes reaches the number of the computing nodes; and distributing the target job to the target computing nodes for execution.

In summary, the embodiments of the present application provide a job allocation method, an apparatus, an electronic device, and a readable storage medium, where communication delays between computing nodes are obtained, and then jobs are sequentially allocated to corresponding computing nodes according to a sequence from small to large of the communication delays, so that the communication delays between the computing nodes executing the jobs are as small as possible, that is, the computing efficiency is improved, and the computing performance of a computing cluster is improved.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for job assignment, the method comprising:

acquiring the number of computing nodes required for executing the target job;

acquiring communication delay among all computing nodes in a computing cluster, wherein the communication delay is determined according to link information of communication paths among the computing nodes;

selecting target computing nodes for executing the target job according to the sequence of the communication delay from small to large until the number of the selected target computing nodes reaches the number of the computing nodes;

and distributing the target job to the target computing nodes for execution.

2. The method of claim 1, wherein the communication delay between each computing node is obtained by:

3. The method of claim 2, wherein the obtaining communication delay between each computing node according to the link information comprises:

4. The method of claim 3, wherein determining the link delay between the devices based on the link type and the link length comprises:

5. The method of claim 2, wherein obtaining communication paths between computing nodes in a network fabric formed by the computing clusters comprises:

6. The method according to any one of claims 1 to 5, wherein when there are a plurality of target jobs, selecting target computing nodes for executing the target jobs according to the sequence of the communication delays from small to large comprises:

acquiring the execution priority corresponding to each target job;

7. The method of claim 6, wherein the obtaining the execution priority corresponding to each target job comprises:

acquiring the job type of each target job;

8. A job distribution apparatus, characterized in that the apparatus comprises:

9. An electronic device comprising a processor and a memory, the memory storing computer readable instructions that, when executed by the processor, perform the method of any of claims 1-7.

10. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.