CN101663649A - Dynamically rerouting node traffic on a parallel computer system - Google Patents

Publication number
CN101663649A
Authority
CN
China
Prior art keywords
node
network
computer system
parallel computer
hint bits
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200880012450A
Other languages
Chinese (zh)
Other versions
CN101663649B (en)
Inventor
A. Peters
A. Sidelnik
D. Darrington
P. J. McCarthy
B. A. Swartz
B. E. Smith
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN101663649A
Application granted
Publication of CN101663649B
Expired - Fee Related
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/22 Alternate routing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/203 Failover techniques using migration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356 Indirect interconnection networks
    • G06F15/17368 Indirect interconnection networks non hierarchical topologies
    • G06F15/17381 Two dimensional, e.g. mesh, torus
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/02 Topology update or discovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/28 Routing or path finding of packets in data switching networks using route fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/34 Source routing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2038 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with a single idle spare processing component
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2051 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant in regular structures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks

Abstract

A method and apparatus for dynamically rerouting node processes on the compute nodes of a massively parallel computer system using hint bits to route around failed nodes or congested networks without restarting applications executing on the system. When a node has a failure or there are indications that it may fail, the application software on the system is suspended while the data on the failed node is moved to a backup node. The torus network traffic is routed around the failed node and traffic for the failed node is rerouted to the backup node. The application can then resume operation without restarting from the beginning.

Description

Dynamically rerouting node traffic on a parallel computer system
Technical field
Briefly, the present invention relates to fault recovery in parallel computer systems, and more specifically to using hint bits to dynamically reroute node traffic on the compute nodes of a massively parallel computer system without restarting applications executing on the system.
Background
Efficient fault recovery is important for reducing the down time and repair cost of complex computer systems. On a parallel computer system with a large number of compute nodes, the failure of a single component can force most or all of the computer to stop executing while the repair is made. Restarting an application may waste a large amount of the processing time spent before the failure.
A massively parallel computer system is a parallel computer system with a large number of interconnected compute nodes. International Business Machines Corporation (IBM) has developed a family of such massively parallel computers under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. A Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets, with 32 node boards in each rack.
The Blue Gene/L supercomputer communicates over several communication networks. The 65,536 compute nodes are arranged into both a logical tree network and a three-dimensional torus network. The logical tree network connects the compute nodes in a tree structure so that each node communicates with a parent and with one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its closest 6 neighbors in a section of the computer. Because the compute nodes are arranged in torus and tree networks that require communication with adjacent nodes, a hardware failure of a single node can bring a large portion of the system to a standstill until the faulty hardware is repaired. For example, a single node failure could render a complete section of the torus network inoperable, where a section of the torus network in the Blue Gene/L system is half a rack, or 512 nodes. Further, all the hardware assigned to the partition containing the failure may also need to stop executing until the failure is corrected.
On prior-art massively parallel computer systems, a single node failure during execution often requires that the software application be restarted from the beginning or from a saved checkpoint. When a failure event occurs, it would be advantageous to move the processing of the failed node to another node so that the application can resume on the backup hardware with minimal delay, increasing overall system efficiency. Without a way to recover more effectively from a failed or failing node, parallel computer systems will continue to waste potential computer processing time, which increases operating costs.
Summary of the invention
An apparatus and method are described for using hint bits to dynamically reroute node traffic on the compute nodes of a massively parallel computer system, so as to route around failed nodes or congested networks without restarting applications executing on the system. When a node has a failure, or there are indications that it may fail, the application software on the system is suspended while the data on the failed node is moved to a backup node. The torus network traffic is routed around the failed node, and traffic for the failed node is rerouted to the backup node. Similarly, network traffic can be routed around a congested network.
The examples and disclosure are directed to the Blue Gene architecture, but they extend to any parallel computer system with multiple processors arranged in a network structure where the node hardware handles cut-through traffic from other nodes.
The foregoing and other features and advantages will be apparent from the following more particular description, as illustrated in the accompanying drawings.
Brief description of the drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of a massively parallel computer system;
Fig. 2 is a block diagram showing the input and output connections of a compute node in a massively parallel computer system;
Fig. 3 is a block diagram of a compute node in a massively parallel computer system;
Fig. 4 is a block diagram of the torus network hardware of a compute node in a massively parallel computer system;
Fig. 5 is a block diagram of a torus network data packet in a massively parallel computer system;
Fig. 6 is a block diagram representing a portion of a massively parallel computer system to illustrate an example;
Fig. 7 is another block diagram representing a portion of a massively parallel computer system to illustrate another example;
Fig. 8 is a flow diagram of a method for monitoring nodes and networks to build a problem list in a parallel computer system; and
Fig. 9 is a flow diagram of a method for using hint bits to dynamically reroute node processes in a parallel computer system.
Detailed description
The disclosure and claims herein are directed to an apparatus and method for using hint bits to dynamically reroute node traffic on the compute nodes of a massively parallel computer system without restarting applications executing on the system. When a node has a failure, or there are indications that it may fail, the application software on the system is suspended while the data on the failed node is moved to a backup node. The torus network traffic is routed around the failed node, and traffic for the failed node is rerouted to the backup node. The examples are described with respect to the Blue Gene/L massively parallel computer developed by International Business Machines Corporation (IBM).
Fig. 1 shows a block diagram that represents a massively parallel computer system 100 such as the Blue Gene/L computer system. The Blue Gene/L system is a scalable system in which the maximum number of compute nodes is 65,536. Each node 110 has an application specific integrated circuit (ASIC) 112, also called a Blue Gene/L compute chip 112. The compute chip incorporates two processors or central processor units (CPUs) and is mounted on a node daughter card 114. The node also typically has 512 megabytes of local memory (not shown). A node board 120 accommodates 32 node daughter cards 114, each having a node 110. Thus, each node board has 32 nodes, with 2 processors for each node and the associated memory for each processor. A rack 130 is a housing that contains 32 node boards 120. Each of the node boards 120 connects into a midplane printed circuit board 132 with a midplane connector 134. The midplane 132 is inside the rack and is not shown in Fig. 1. The full Blue Gene/L computer system would be housed in 64 racks 130 or cabinets, with 32 node boards 120 in each. The full system would then have 65,536 nodes and 131,072 CPUs (64 racks × 32 node boards × 32 nodes × 2 CPUs).
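The system totals quoted above follow directly from the packaging hierarchy. As a quick check, the arithmetic can be written out as below; the constant and function names are illustrative only and are not part of the disclosure.

```python
# Packaging figures taken from the description of Fig. 1.
RACKS = 64
NODE_BOARDS_PER_RACK = 32
NODES_PER_NODE_BOARD = 32
CPUS_PER_NODE = 2


def system_totals(racks=RACKS,
                  boards_per_rack=NODE_BOARDS_PER_RACK,
                  nodes_per_board=NODES_PER_NODE_BOARD,
                  cpus_per_node=CPUS_PER_NODE):
    """Return (total nodes, total CPUs) for the given packaging hierarchy."""
    nodes = racks * boards_per_rack * nodes_per_board
    return nodes, nodes * cpus_per_node


if __name__ == "__main__":
    nodes, cpus = system_totals()
    print(f"{nodes:,} nodes, {cpus:,} CPUs")
```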
The Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to 1024 compute nodes 110 is handled by each I/O node that has an I/O processor 170 connected to the service node 140. The I/O nodes have no local storage. The I/O nodes are connected to the compute nodes through the logical tree network and also have functional wide area network capabilities through a functional network (not shown). The functional network is connected to an I/O processor (or Blue Gene/L link chip) 170 located on a node board 120 that handles communication from the service node 140 to a number of nodes. The Blue Gene/L system has one or more I/O processors 170 on an I/O board (not shown) connected to the node board 120. The I/O processors can be configured to communicate with 8, 32 or 64 nodes. The connections to the I/O nodes are similar to the connections to the compute nodes, except that the I/O nodes are not connected to the torus network.
Referring again to Fig. 1, the computer system 100 includes a service node 140 that handles the loading of the nodes with software and controls the operation of the whole system. The service node 140 is typically a minicomputer system, such as an IBM pSeries server running Linux, with a control console (not shown). The service node 140 is connected to the racks 130 of compute nodes 110 with a control system network 150. The control system network provides control, test, and bring-up infrastructure for the Blue Gene/L system. The control system network 150 includes various network interfaces that provide the necessary communication for the massively parallel computer system. The network interfaces are described further below.
The service node 140 manages the control system network 150 dedicated to system management. The control system network 150 includes a private 100-Mb/s Ethernet connected to an Ido chip 180, located on a node board 120, that handles communication from the service node 140 to a number of nodes. This network is sometimes referred to as the JTAG network because it communicates using the JTAG protocol. All control, test, and bring-up of the compute nodes 110 on the node board 120 is governed through the JTAG port communicating with the service node. In addition, the service node 140 includes a node/network monitor 142 that maintains a problem list 144 indicating nodes that have failed, nodes that may be failing, and network links to avoid. The node/network monitor comprises software in the service node 140, but it may be assisted by operating system software executing on the nodes of the system.
The Blue Gene/L supercomputer communicates over several communication networks. Fig. 2 is a block diagram that shows the I/O connections of a compute node on the Blue Gene/L computer system. The 65,536 compute nodes and 1024 I/O processors 170 are arranged into both a logical tree network and a logical three-dimensional torus network. The torus network logically connects the compute nodes in a lattice-like structure that allows each compute node 110 to communicate with its closest 6 neighbors. In Fig. 2, the torus network is illustrated by the X+, X-, Y+, Y-, Z+ and Z- network connections that connect the node to its six respective adjacent nodes. The tree network is represented in Fig. 2 by the tree 0, tree 1 and tree 2 connections. Other communication networks connected to the node include a JTAG network and a global interrupt network. The JTAG network provides communication for testing and control from the service node 140 over the control system network 150 shown in Fig. 1. The global interrupt network is used to implement software barriers that synchronize similar processes on the compute nodes so that they move to a different phase of processing upon completing some task. The global interrupt network can thus be used to start, stop and pause an application running on a partition of nodes. Further, there are clock and power signals to each compute node 110.
The Blue Gene/L torus interconnect connects each node to its six nearest neighbors (X+, X-, Y+, Y-, Z+, Z-) in a logical 3D Cartesian array. The connections to the six neighbors are made at the node level and at the midplane level. Each midplane is an 8 × 8 × 8 array of nodes. Each of the six faces (X+, X-, Y+, Y-, Z+, Z-) of the node array in the midplane is 8 × 8 = 64 nodes in size. Each torus network signal from the 64 nodes on each of the six faces is communicated through link cards (not shown) connected to the midplane to the corresponding nodes in adjacent midplanes. When the midplane is used in a partition with a depth of one midplane in any dimension, the signals of each face may also be routed back to the inputs of the same midplane on the opposite face.
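To make the six-neighbor arrangement concrete, the following is a minimal sketch of how the neighbors of a node could be computed in an 8 × 8 × 8 midplane whose faces wrap back onto the opposite face (the one-midplane-deep case described above). The function name, the `(x, y, z)` coordinate convention and the use of modular arithmetic are assumptions made for illustration; the text does not specify how coordinates are assigned.

```python
def torus_neighbors(coord, dims=(8, 8, 8)):
    """Return the six torus neighbors of coord, keyed by direction name.

    Assumes zero-based (x, y, z) coordinates and wraparound in every
    dimension, i.e. a partition that is one midplane deep on each axis.
    """
    x, y, z = coord
    dx, dy, dz = dims
    return {
        "X+": ((x + 1) % dx, y, z),
        "X-": ((x - 1) % dx, y, z),
        "Y+": (x, (y + 1) % dy, z),
        "Y-": (x, (y - 1) % dy, z),
        "Z+": (x, y, (z + 1) % dz),
        "Z-": (x, y, (z - 1) % dz),
    }


if __name__ == "__main__":
    # A corner node still has six neighbors because each face wraps around.
    for direction, neighbor in torus_neighbors((0, 0, 7)).items():
        print(direction, neighbor)
```

A node on a face of the array therefore reaches the opposite face in one hop, which is what allows traffic to go "the other way around" a ring when a node in between must be avoided.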
Fig. 3 illustrates a block diagram of a compute node 110 in the Blue Gene/L computer system according to the prior art. The compute node 110 has a node compute chip 112 that has two processors 310A, 310B. Each processor 310A, 310B has a processing core 312. The processors are connected to a level three memory cache (L3 cache) 320 and to a static random access memory (SRAM) memory bank 330. Data from the L3 cache 320 is loaded to a bank of double data rate (DDR) synchronous dynamic random access memory (SDRAM) 340 by means of a DDR memory controller 350.
Referring again to Fig. 3, the SRAM memory 330 is connected to a JTAG interface 360 through which communication leaves the compute chip 112 to reach an Ido chip 180. The service node communicates with the compute node through the Ido chip 180 over an Ethernet link that is part of the control system network 150 (described above with reference to Fig. 1). In the Blue Gene/L system there is one Ido chip per node board 120, and others on boards in each midplane 132 (Fig. 1). The Ido chips receive commands from the service node using raw UDP packets over a trusted private 100 Mbit/s Ethernet control network. The Ido chips support a variety of serial protocols for communication with the compute nodes. The JTAG protocol is used for reading and writing from the service node 140 (Fig. 1) to any address of the SRAMs 330 in the compute nodes 110, and is used for the system initialization and booting process.
The node compute chip 112 illustrated in Fig. 3 further includes network hardware 390. The network hardware 390 includes hardware for the torus 392, tree 394 and global interrupt 396 networks. These Blue Gene/L networks are used by a compute node 110 to communicate with the other nodes in the system, as described briefly above. The network hardware 390 allows the compute node to receive and pass along data packets over the torus network. The network hardware 390 handles network data traffic independently, so the compute node's processors are not burdened by the amount of data flowing on the torus network. Network data that passes through a node on its way to another node is called "cut-through" traffic.
Fig. 4 illustrates a block diagram of the torus network hardware 392 introduced in Fig. 3. The torus network hardware 392 consists of three major units: a processor interface 410, a torus sender 420 and a torus receiver 430. The processor interface 410 consists of processor injection 412 and processor reception 414 FIFOs (queues in which access follows the first-in, first-out rule). Access to these FIFOs is through the two floating point unit (FPU) registers (not shown) of the processors (310A, 310B in Fig. 3); that is, data is loaded into the FIFOs from a pair of FPU registers via 128-bit memory-mapped stores, and data is read from the FIFOs into the FPU registers via 128-bit loads. There are a total of eight injection FIFOs organized into two groups: two high-priority FIFOs (for inter-node operating system messages) and six normal-priority FIFOs, which are sufficient for nearest-neighbor connectivity. Packets in all FIFOs can go out in any direction on the torus network. In the processor reception FIFOs 414 there are again two groups of FIFOs. Each group contains seven FIFOs: one high-priority FIFO and one dedicated to each of the six incoming directions. More specifically, there is a dedicated bus between each receiver and its corresponding reception FIFO. For storage, all torus FIFOs use static random access memory (SRAM) protected by error checking and correction (ECC), and all internal data paths are checked for parity.
The torus network hardware 392 described above directs variable-size data packets across the various links of the torus network. Fig. 5 illustrates an example of a torus network packet 510. Each packet 510 in the Blue Gene/L system is n × 32 bytes, where n = 1 to 8 "chunks". Messages, such as those conforming to the Message Passing Interface (MPI), may consist of many packets that are constructed, sent and received by software running on one or both of the associated Blue Gene/L processors 310A, 310B in Fig. 3. The first eight bytes of each packet are the packet header 512. The packet header 512 contains link-level protocol information (for example, a sequence number); routing information, including the destination; the virtual channel and size; and a byte-wide cyclic redundancy check (CRC) 514 that detects corruption of the header data during transmission. The packet header 512 also includes the hint bits 516 described further below.
Referring again to Fig. 5, a number of data bytes 518 follow the packet header 512. In addition, a 24-bit CRC is appended to each packet, along with a one-byte valid indicator 520. The valid indicator is necessary because packets can begin to be forwarded before they have been completely received. This CRC permits each packet to be checked as it is sent over each link. A time-out mechanism is used for retransmission of corrupted packets. Because the header CRC is included in the full packet CRC, the use of the eight-bit packet header CRC is an optimization that permits early detection of packet header errors.
As introduced above, the header 512 includes six "hint" bits 516. The hint bits 516 indicate the directions in which the packet may be routed in the three dimensions of the torus network. The hint bits are defined in XYZ order as follows: X+, X-, Y+, Y-, Z+, Z-. For example, hint bits of 100100 mean that the packet can be routed in the X+ and Y- directions. Because a set bit indicates which direction to route the packet in that dimension, either the X+ or the X- hint bit can be set, but not both. The default is for all hint bits to be unset, or 0, to indicate that the packet can be sent in any direction.
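The hint-bit convention just described can be illustrated with a small encoder and decoder. This is only a sketch: it represents the six bits as a string in the stated order X+, X-, Y+, Y-, Z+, Z-, and the function names and the string representation are assumptions for illustration, not the actual header layout used by the hardware.

```python
HINT_ORDER = ("X+", "X-", "Y+", "Y-", "Z+", "Z-")


def encode_hints(directions):
    """Build the six-character hint string for a collection of directions."""
    chosen = set(directions)
    unknown = chosen - set(HINT_ORDER)
    if unknown:
        raise ValueError(f"unknown directions: {sorted(unknown)}")
    for dim in "XYZ":
        # Only one direction may be hinted per dimension.
        if dim + "+" in chosen and dim + "-" in chosen:
            raise ValueError(f"both {dim}+ and {dim}- are set")
    return "".join("1" if d in chosen else "0" for d in HINT_ORDER)


def decode_hints(bits):
    """Return the list of hinted directions encoded in a hint string."""
    if len(bits) != len(HINT_ORDER) or set(bits) - {"0", "1"}:
        raise ValueError("hint bits must be six characters of 0 or 1")
    directions = [d for d, b in zip(HINT_ORDER, bits) if b == "1"]
    encode_hints(directions)  # reuse the one-direction-per-dimension check
    return directions


if __name__ == "__main__":
    print(encode_hints(["X+", "Y-"]))  # the example from the text
    print(decode_hints("000000"))      # default: no direction forced
```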
In a torus network, there is typically a dimension order in which data flows between nodes. The dimension order in the examples herein is assumed to be XYZ, but other orders could also be used. A dimension order of XYZ means that data first flows from a node in the X dimension, then through nodes in the Y dimension, and then through nodes in the Z dimension. The X, Y and Z hint bits are used for routing in the X, Y and Z dimensions respectively.
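A simplified model of XYZ dimension-order routing with per-dimension direction hints is sketched below. It is an illustration under stated assumptions rather than the hardware algorithm: coordinates are zero-based tuples, `hints` is a hypothetical mapping from a dimension name to +1 or -1, and when no hint is given the sketch simply picks the direction with the fewest hops.

```python
def route_xyz(src, dst, dims=(8, 8, 8), hints=None):
    """Return the list of nodes visited from src to dst in X, then Y, then Z.

    hints maps "X", "Y" or "Z" to +1 or -1 to force a direction in that
    dimension; dimensions without a hint use the shorter way around.
    """
    hints = hints or {}
    pos = list(src)
    path = [tuple(pos)]
    for axis, name in enumerate("XYZ"):
        size = dims[axis]
        if pos[axis] == dst[axis]:
            continue  # nothing to do in this dimension
        forward_hops = (dst[axis] - pos[axis]) % size
        step = hints.get(name)
        if step is None:
            step = 1 if forward_hops <= size - forward_hops else -1
        while pos[axis] != dst[axis]:
            pos[axis] = (pos[axis] + step) % size
            path.append(tuple(pos))
    return path


if __name__ == "__main__":
    grid = (3, 3, 1)
    print(route_xyz((0, 0, 0), (1, 1, 0), dims=grid))
    # Forcing Y- makes the packet take the long way around the Y ring.
    print(route_xyz((0, 0, 0), (1, 1, 0), dims=grid, hints={"Y": -1}))
```

Forcing a direction lengthens the path but changes which intermediate nodes the packet cuts through, which is the property the rerouting examples rely on.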
Each node maintains a set of software-configurable registers (not shown) that control the torus functions. For example, one set of registers contains the coordinates of the node's neighbors. A hint bit is set to 0 when a packet leaves a node in a direction such that it will arrive at its destination in that dimension, as determined by the neighbor coordinate registers. The hint bits appear early in the header so that arbitration can be efficiently pipelined. The hint bits can be initialized by either software or hardware; if done by hardware, a set of two registers per dimension is used to determine the appropriate directions. These registers can be configured to provide minimal-hop routing. Routing is accomplished entirely by examining the hint bits and virtual channels; that is, there are no routing tables. Packets may be routed either dynamically or in deterministic dimension order (xyz). That is, a packet can follow the least congested path based on other traffic, or it can be routed on a fixed path. Besides point-to-point packets, a bit in the header may be set that causes a packet to be broadcast down any Cartesian dimension and deposited at each node. Software can set the hint bits appropriately so that "dead" nodes or links are avoided, as described further below. Full connectivity can be maintained when there are up to three faulty nodes, provided they are not collinear.
Fig. 6 shows a block diagram representing a portion 600 of the massively parallel computer system shown in Fig. 1 to illustrate an example of dynamically rerouting node traffic. The portion 600 of the parallel computer system illustrates nine nodes labeled node 1 610 through node 9 612. To simplify the example, the nodes in Fig. 6 are shown only in the X and Y dimensions, but it is understood that the computer system may also have nodes arranged in the Z dimension. The X and Y dimensions are as indicated by the XY axis 614. For this example, assume an application is executing on node 1 610 through node 8 622. When a failure or potential failure is detected on node 5 618, the application is paused or suspended, and the network is quiesced by waiting until all network traffic has cleared out of the FIFOs. The application on the failed node 618 is then moved to a spare node (node 9 612). Each node that needs to move data through the failed node is then updated to avoid it by sending the updated problem list (144 in Fig. 1) to all nodes, or at least to the affected nodes.
Referring again to Fig. 6, the nodes use the updated problem list to ensure that data is routed around the affected node or network. The appropriate hint bits are then set in the data packets sent from each node so that the packets are routed around the failed node. In the example shown in Fig. 6, data packets from node 2 620 will have the hint bit set for X- to direct the packets to travel in the X- direction to node 8 and thus avoid the failed node. Similarly, data packets from node 8 622 will have the hint bit set for X+ to direct the packets to travel in the X+ direction to node 2. Further, data packets from node 4 624 will have the hint bit set for Y+ to direct the packets to travel in the Y+ direction to node 6 and thus avoid the failed node, and data packets from node 6 626 will have the hint bit set for Y- to direct the packets to travel in the Y- direction to node 4.
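The choice of hint bit in this example amounts to picking the direction around a single torus ring that reaches the destination without passing through the failed node. A minimal sketch of that choice follows. The ring positions are hypothetical zero-based indices along one dimension and do not correspond to the node labels in Fig. 6; the function name and the preference for "+" when both directions are safe are also assumptions.

```python
def pick_direction(src, dst, failed, size):
    """Return "+" or "-" for a direction along one ring that avoids failed.

    Returns None if no movement is needed or if the failed position is the
    source or destination itself, in which case a hint bit cannot help.
    """
    if src == dst or failed in (src, dst):
        return None
    for sign, step in (("+", 1), ("-", -1)):
        pos = (src + step) % size
        # Walk the ring until we reach the destination or hit the failed node.
        while pos != dst and pos != failed:
            pos = (pos + step) % size
        if pos == dst:
            return sign
    return None


if __name__ == "__main__":
    # Three positions on a ring with the middle one failed: the two healthy
    # nodes reach each other by going the opposite way around.
    print(pick_direction(src=0, dst=2, failed=1, size=3))
    print(pick_direction(src=2, dst=0, failed=1, size=3))
```

On a ring with a single failed position, one of the two directions always works when the source and destination are both healthy, which mirrors the way the two nodes on either side of the failed node use opposite hint bits.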
The part 700 that Fig. 7 illustrates expression large-scale concurrent computational system shown in Figure 1 is used for dynamically re-routing the calcspar of another example of node traffic with illustration.How this example illustration uses the prompting position at non-adjacent node.In addition, a part 700 illustrations of concurrent computational system as at Fig. 6 at above-described nine nodes that are labeled as node 1 610 to node 9 612.In this example, on node 8 622, detect fault or incipient fault.Application is suspended and makes network to mourn in silence, and the application on the malfunctioning node 618 is moved into secondary node (node 9 612) then.As mentioned, upgrade each node that may need then, to avoid malfunctioning node by sending the problem list (144 among Fig. 1) that has upgraded to affected node by the malfunctioning node mobile data.The prompting position is set then to guarantee that walking around malfunctioning node comes the route data.In the example of Fig. 7, will be from the packet of node 1 610 less than the prompting position that is provided with at directions X, this be since malfunctioning node not in that this side up.But node 1 610 will be provided with Y+ prompting position, advance on the Y+ direction with direct packets.When the grouping from node 1610 arrived node 7 628 and begins to advance in the Y dimension, it will march to node 9 612 and therefore avoid malfunctioning node 8622 as being guided in of Y+ prompting position indication that is provided with on the Y+ direction.
As introduced above, hint bits can also be used to route dynamically around a congested network. As an example, consider the network 710 illustrated between node 8 622 and node 5 618 in Fig. 7. If the network monitor (142 in Fig. 1) flags network 710 as congested, hint bits are used to route dynamically around this network in the same way as described above for routing around node 8 622. Alternatively, a node may be overburdened by pass-through traffic. For example, if the node/network monitor determines that node 8 622 is overloaded because of traffic passing through it, the processing on node 8 is dynamically rerouted to an available swap node to relieve the pass-through traffic loading the process or application executing on node 8.
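Treating a congested link the same way as a failed node can be sketched as a single path check. The helper names `link` and `path_is_clear`, and the representation of a link as an unordered pair of node numbers, are assumptions made for illustration; the node numbers below only loosely follow the figure.

```python
def link(a, b):
    """An undirected link between two nodes, usable as a set member."""
    return frozenset((a, b))


def path_is_clear(path, bad_nodes, bad_links):
    """True if no intermediate node and no traversed link is on the problem list."""
    if any(node in bad_nodes for node in path[1:-1]):
        return False
    return all(link(a, b) not in bad_links for a, b in zip(path, path[1:]))


congested = {link(8, 5)}  # hypothetical entry for a congested network
through_link = path_is_clear([8, 5, 2], set(), congested)
around_link = path_is_clear([8, 2], set(), congested)
```

A path that crosses the flagged link is rejected, whereas a path that reaches the same destination over a different link (for instance a torus wrap-around connection) is accepted.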
Fig. 8 shows a method 800 for monitoring nodes in a parallel computer system in order to dynamically reroute the processing of a failed node. The method is performed by software on the service node, but may require software and/or hardware on the nodes to collect the needed information. First, monitor the networks (step 810) and log network hot spots in the problem list (step 820). Next, monitor the nodes (step 830) and log failed nodes, or nodes that may fail, in the problem list (step 840). The method is then done.
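The four steps of method 800 can be sketched as one pass that fills a problem list. Everything specific here is an assumption for illustration: the `ProblemList` class, the utilization threshold of 0.9, and the status strings `"failed"` and `"failing"` do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemList:
    """Hypothetical container for the nodes and networks to avoid."""
    links: set = field(default_factory=set)
    nodes: set = field(default_factory=set)


def build_problem_list(link_utilization, node_health, hot_threshold=0.9):
    """One monitoring pass over sampled link and node data."""
    problems = ProblemList()
    # Steps 810 and 820: monitor the networks and log hot spots.
    for link_id, utilization in link_utilization.items():
        if utilization >= hot_threshold:
            problems.links.add(link_id)
    # Steps 830 and 840: monitor the nodes and log failed or failing nodes.
    for node_id, status in node_health.items():
        if status in ("failed", "failing"):
            problems.nodes.add(node_id)
    return problems


problems = build_problem_list(
    {("n8", "n5"): 0.95, ("n1", "n2"): 0.20},
    {"n8": "failing", "n1": "ok"},
)
```

In a real system the samples would come from hardware counters or health checks on the nodes; here they are passed in as plain dictionaries so that the sketch stays self-contained.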
Fig. 9 shows a method 900 for dynamically rerouting the processing of a failed node in a parallel computer system. The method is preferably performed by software and/or hardware on each node of the parallel computer system. First, detect an updated problem list, sent by the network monitor on the service node, that contains nodes or networks to avoid (step 910). Next, pause the application executing on the partition of the parallel system that has the failed node (step 920). Then quiesce the network by waiting until the torus network hardware FIFOs have finished sending their messages (step 930). Then locate a swap node or an alternate path for the network (step 940) and migrate the processing of the failed node to the swap node (step 950). Next, notify the nodes that would send network traffic through the failed node or network to use hint bits to avoid that node and to route network traffic around the failed node or congested network (step 960). The application can then resume from the point at which it was paused (step 970). The method is then done.
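The ordering of steps 920 through 970 can be sketched as a small driver function. This is a toy model under stated assumptions: the function name, the integer node identifiers, the counter that stands in for draining the hardware FIFOs, and the log strings are all hypothetical, and step 910 (detecting the updated problem list) is assumed to have already happened when the function is called.

```python
def handle_problem_list_update(app_nodes, spare_nodes, failed_node, pending_messages):
    """Return the new node assignment and a log of the actions taken."""
    log = ["pause application"]                       # step 920
    while pending_messages > 0:                       # step 930: let FIFOs drain
        pending_messages -= 1
    log.append("network quiesced")
    if not spare_nodes:
        raise RuntimeError("no swap node available")
    swap = spare_nodes[0]                             # step 940: locate swap node
    new_nodes = [swap if n == failed_node else n for n in app_nodes]  # step 950
    log.append(f"migrate node {failed_node} to node {swap}")
    log.append(f"notify nodes to set hint bits avoiding node {failed_node}")  # step 960
    log.append("resume application")                  # step 970
    return new_nodes, log


new_nodes, log = handle_problem_list_update(
    app_nodes=[1, 2, 3, 4, 5, 6, 7, 8],
    spare_nodes=[9],
    failed_node=8,
    pending_messages=3,
)
```

The point of the sketch is the sequence: the application is paused and the network drained before any processing moves, and the application resumes only after the affected nodes have been told which node to avoid.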
The disclosure herein includes a method and apparatus for using hint bits to dynamically reroute node traffic on the compute nodes of a massively parallel computer system without restarting the applications executing on the system. Dynamically rerouting node traffic can significantly reduce the amount of downtime, thereby increasing the efficiency of the computer system. Those skilled in the art will appreciate that the method can be implemented in computer software.
Those skilled in the art will appreciate that many modifications are possible within the scope of the claims. Thus, while the disclosure has been particularly shown and described above, those skilled in the art will understand that these and other changes in form and detail may be made therein without departing from the spirit and scope of the claims.

Claims (17)

1. A parallel computer system comprising:
A plurality of nodes connected by one or more networks;
A node/network monitoring mechanism that monitors the nodes and networks of the parallel computer system and creates a problem list of nodes and networks; and
A node that uses hint bits to dynamically route data packets over the one or more networks to avoid the nodes and networks in the problem list.
2. The parallel computer system of claim 1, wherein the hint bits are a plurality of binary values that indicate a preferred direction for directing traffic on a torus network.
3. The parallel computer system of claim 1 or 2, wherein the hint bits are included in a header of the data packets sent over the one or more networks.
4. The parallel computer system of any preceding claim, wherein the node uses the hint bits to dynamically route data packets by pausing an application executing on the node in order to update the problem list, and then resuming the application from the point at which it was paused.
5. The parallel computer system of any preceding claim, wherein the parallel computer system is a massively parallel computer system having nodes interconnected by a three-dimensional torus network.
6. A computer-implemented method for using hint bits to dynamically reroute node processing on compute nodes connected by one or more networks in a parallel computer system, without restarting an application executing on the parallel computer system, the method comprising the steps of:
Monitoring the nodes and networks for problems and identifying problem nodes and networks in a problem list;
Detecting when the problem list is updated;
Pausing execution of the nodes executing the application;
Setting at least one of the hint bits to avoid a node or network in the problem list; and
Notifying all nodes in the application to resume execution.
7. The computer-implemented method of claim 6, wherein the hint bits are a plurality of binary values that indicate a preferred direction for directing traffic on a torus network.
8. The computer-implemented method of claim 6 or 7, wherein the hint bits are included in a header of the data packets sent over the one or more networks.
9. The computer-implemented method of any of claims 6 to 8, wherein the parallel computer system is a massively parallel computer system having nodes interconnected by a three-dimensional torus network.
10. The computer-implemented method of any of claims 6 to 9, further comprising the step of:
Moving the processing of at least one failed node to at least one backup node.
11. The computer-implemented method of any of claims 6 to 10, wherein the step of detecting an updated problem list comprises: detecting a congested network of a node and setting at least one hint bit so that traffic is routed around the congested node.
12. A computer-readable program product for execution on a parallel computer system having a plurality of nodes connected by one or more networks, the computer-readable program product comprising:
A node/network monitoring mechanism that monitors the nodes and networks of the parallel computer system and creates a problem list of nodes and networks;
A node routing mechanism that uses hint bits to dynamically route data packets over the one or more networks to avoid the nodes and networks in the problem list; and
Computer storage media with computer program instructions operable to cause a computer to execute the node/network monitoring mechanism and the node routing mechanism.
13. The program product of claim 12, wherein the hint bits are a plurality of binary values that indicate a preferred direction for directing traffic on a torus network.
14. The program product of claim 12 or 13, wherein the hint bits are included in a header of the data packets sent over the one or more networks.
15. The program product of any of claims 12 to 14, wherein the node uses the hint bits to dynamically route data packets by pausing an application executing on the node in order to update the problem list, and then resuming the application from the point at which it was paused.
16. The program product of any of claims 12 to 15, wherein the parallel computer system is a massively parallel computer system having nodes interconnected by a three-dimensional torus network.
17. A computer program comprising program code means adapted to perform all the steps of claims 6 to 11 when said program is run on a computer.
CN2008800124505A 2007-04-18 2008-03-20 Dynamically rerouting node traffic on a parallel computer system Expired - Fee Related CN101663649B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US11/736,811 US7644254B2 (en) 2007-04-18 2007-04-18 Routing data packets with hint bit for each six orthogonal directions in three dimensional torus computer system set to avoid nodes in problem list
US11/736,811 2007-04-18
PCT/EP2008/053377 WO2008128836A2 (en) 2007-04-18 2008-03-20 Dynamically rerouting node traffic on a parallel computer system

Publications (2)

Publication Number Publication Date
CN101663649A true CN101663649A (en) 2010-03-03
CN101663649B CN101663649B (en) 2012-07-18

Family

ID=39739647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008800124505A Expired - Fee Related CN101663649B (en) 2007-04-18 2008-03-20 Dynamically rerouting node traffic on a parallel computer system

Country Status (7)

Country Link
US (1) US7644254B2 (en)
EP (1) EP2156291A2 (en)
JP (1) JP5285690B2 (en)
KR (1) KR20090122209A (en)
CN (1) CN101663649B (en)
TW (1) TW200907702A (en)
WO (1) WO2008128836A2 (en)

Also Published As

Publication number Publication date
CN101663649B (en) 2012-07-18
TW200907702A (en) 2009-02-16
US7644254B2 (en) 2010-01-05
JP2010525433A (en) 2010-07-22
KR20090122209A (en) 2009-11-26
US20080263386A1 (en) 2008-10-23
WO2008128836A2 (en) 2008-10-30
JP5285690B2 (en) 2013-09-11
EP2156291A2 (en) 2010-02-24
WO2008128836A3 (en) 2008-12-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120718
