US20080259798A1 - Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics - Google Patents
- Publication number
- US20080259798A1 (application US 11/737,511)
- Authority
- US
- United States
- Prior art keywords
- shared memory
- frames
- circuitry
- ports
- congestion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/302—Route determination based on requested QoS
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/11—Identifying congestion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/12—Avoiding congestion; Recovering from congestion
- H04L47/125—Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/50—Overload detection or protection within a single switching element
- H04L49/505—Corrective measures
- H04L49/508—Head of Line Blocking Avoidance
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/10—Packet switching elements characterised by the switching fabric construction
- H04L49/103—Packet switching elements characterised by the switching fabric construction using a shared central buffer; using a shared memory
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/20—Support for services
- H04L49/205—Quality of Service based
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/25—Routing or path finding in a switch fabric
- H04L49/253—Routing or path finding in a switch fabric using establishment or release of connections between ports
- H04L49/254—Centralised controller, i.e. arbitration or scheduling
Definitions
- the present invention relates to switch and multi-hop switch fabric architectures and, in particular, to flow and congestion control techniques in such architectures.
- the goal of scalable switch fabric architectures is to interconnect N switches in such a way so as to achieve as close to N times the transmission bandwidth that could be achieved with only one of the switches.
- Such techniques include flow control and congestion management which attempt to ensure efficient use of fabric bandwidth without latency spikes or packet loss.
- Conventional architectures such as, for example, those commonly used with Fibre Channel and InfiniBand protocols, are typically constructed to rely on a credit-based, input/output queued switch architecture that results in stiff flow-control which can significantly limit the bandwidth of the overall fabric.
- common architectures for fabrics in Ethernet switches often rely on statistical packet drop with very large buffers to achieve near full bandwidth operation. This has the disadvantage of penalizing applications which are highly sensitive to loss or jitter and results in a high manufacturing cost basis of the switch for off-chip memories, etc.
- a shared memory switch which includes a plurality of ports configured to receive and transmit frames of data, frame classification circuitry configured to classify the frames into a plurality of traffic classes, and frame memory configured to store the frames.
- the frame memory includes a plurality of shared memory partitions. Each of the shared memory partitions corresponds to one or more of the traffic classes, and has a plurality of counters associated therewith.
- the plurality of counters includes at least one per port memory usage counter for each of the plurality of ports, and at least one aggregate memory usage counter.
- the counters associated with each of the shared memory partitions are independent of the counters associated with others of the shared memory partitions.
- Congestion management circuitry implements congestion management policies for each of the partitions independently with reference to the counters associated with each of the partitions.
- a shared memory switch for use in a single-chip fabric or a multi-chip fabric.
- the switch includes a plurality of ingress ports and egress ports configured to receive and transmit frames of data, frame memory configured to store the frames, and rate limiting circuitry configured to pause individual ones of the ingress ports in response to usage of the frame memory by the individual ingress ports.
- the rate limiting circuitry is further configured to pause the individual ingress ports in response to congestion notification information corresponding to one or more of the egress ports or other ones of the shared memory switches in the multi-chip fabric downstream from the shared memory switch.
- a shared memory switch for use in a multi-chip fabric which includes a plurality of shared memory switches.
- the switch includes a plurality of ports configured to receive and transmit frames of data, frame classification circuitry configured to classify the frames into a plurality of traffic classes, frame memory configured to store the frames, congestion management circuitry configured to generate and transmit class-specific pause frames to selected ones of the ports in response to states of at least some of the plurality of counters, and egress scheduling circuitry configured to facilitate transmission of the frames.
- the egress circuitry is further configured to pause transmission of selected ones of the frames corresponding to specific ones of the traffic classes in response to downstream congestion.
- the congestion management circuitry and the egress scheduling circuitry enable implementation of a congestion management domain encompassing the plurality of shared memory switches.
- a shared memory switch for use in a multi-chip fabric which includes a plurality of shared memory switches.
- the shared memory switch includes a plurality of ingress ports and egress ports configured to receive and transmit frames of data. Each of the frames includes one or more segments.
- the switch further includes frame memory configured to store the frames and a scheduler configured to allocate and de-allocate space in the frame memory for storage of the frames on a segment-by-segment basis.
- the scheduler is further configured to generate memory allocation status information on a segment-by-segment basis.
- a low latency status channel communicates the memory allocation status information on a segment-by-segment basis.
- a frame processing pipeline provides frame-level processing of the frames in parallel with the scheduler and frame memory and with reference to headers associated with the frames.
- the frame processing pipeline is further configured to maintain port information for each frame identifying one of the ingress ports on which the frame was received and at least one of the egress ports on which the frame is to be transmitted.
- the frame processing pipeline is further configured to receive the memory allocation status information from the scheduler via the status channel and to correlate the memory allocation status information with the port information.
- Congestion management circuitry is configured to effect at least one policy with reference to the correlated memory allocation status and port information.
- a shared memory switch for use in a multi-chip fabric which includes a plurality of shared memory switches.
- the shared memory switch includes a plurality of ingress ports and egress ports configured to receive and transmit frames of data and frame memory configured to store the frames.
- Congestion management circuitry is configured to detect congestion associated with a particular one of the egress ports, identify a flow with reference to a frame directed to the particular egress port, and generate a first congestion notification message directed to an upstream one of the shared memory switches in the multi-chip fabric from which the flow originated.
- the congestion management circuitry is further configured to pause a particular one of the ingress ports for a period of time in response to a second congestion notification message received from a downstream one of the shared memory switches in the multi-chip fabric, and automatically unpause the particular ingress port without a subsequent congestion notification message from the downstream shared memory switch.
- the congestion management circuitry is further configured to exponentially increase the period of time in response to a third congestion notification message from the downstream shared memory switch.
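The timed pause with automatic unpause and exponentially growing pause period described above can be sketched as follows. This is only an illustrative model, not the claimed circuitry: the class name `TimedPause`, the doubling factor, the upper bound `max_pause_s`, and the use of seconds as the time unit are all assumptions made for the example.

```python
class TimedPause:
    """Hypothetical model of one ingress port's timed pause state."""

    def __init__(self, base_pause_s: float, max_pause_s: float):
        self.max_pause_s = max_pause_s      # assumed cap, not from the source
        self.next_pause_s = base_pause_s    # period used for the next notification
        self.paused_until = None            # time at which the port unpauses itself

    def on_notification(self, now_s: float) -> float:
        """Pause the port for the current period, then grow the period."""
        duration = self.next_pause_s
        self.paused_until = now_s + duration
        # Exponential increase (assumed factor of 2), bounded by the cap.
        self.next_pause_s = min(self.max_pause_s, duration * 2)
        return duration

    def is_paused(self, now_s: float) -> bool:
        """The port unpauses on its own once the period has elapsed."""
        return self.paused_until is not None and now_s < self.paused_until
```

No unpause message is modeled: the pause simply expires, which is what makes the scheme tolerant of lost notifications.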
- a shared memory switch for use in a multi-chip fabric comprising a plurality of shared memory switches.
- the shared memory switch includes a plurality of ingress ports and egress ports configured to receive and transmit frames of data and frame memory configured to store the frames.
- Congestion management circuitry is configured to detect congestion associated with a particular one of the egress ports, generate a first multi-cast congestion notification message directed to a subset of ingress ports of the shared memory switches upstream in the multi-chip fabric and associated with a first flow directed to the particular egress port and an associated priority to thereby facilitate pausing of the first flow, and generate a second multi-cast congestion notification message directed to the subset of ingress ports to thereby facilitate unpausing of the first flow.
- the congestion management circuitry is further configured to, in response to a third multi-cast congestion notification message received from a downstream one of the shared memory switches, pause a second flow associated with a particular one of the ingress ports and directed to a particular egress port associated with the downstream shared memory switch and an associated priority, and unpause the second flow in response to a fourth multi-cast congestion notification message from the downstream shared memory switch.
- rate limiting circuitry for use in a shared memory switch having a plurality of input ports for receiving frames of data.
- the rate limiting circuitry includes token bucket circuitry implementing a token bucket for each input port.
- the token bucket circuitry for each port is configured to add tokens to the corresponding token bucket at a specified rate, and to remove tokens from the corresponding token bucket in response to receipt of frames on the corresponding input port.
- the rate limiting circuitry further includes pause circuitry configured to enable a pause function for the corresponding input port in response to crossing of a minimum threshold associated with the corresponding token bucket, and to disable the pause function in response to crossing of a pause-off threshold associated with the corresponding token bucket and above the minimum threshold.
- FIG. 1 is a block diagram illustrating operation of a congestion management architecture according to a specific embodiment of the invention.
- FIG. 1A is a block diagram of an example of a shared memory architecture in which embodiments of the invention may be implemented.
- FIG. 2 is a block diagram illustrating operation of an ingress rate flow control technique according to a specific embodiment of the invention.
- FIG. 3 is a block diagram illustrating operation of a stateless congestion management technique according to a specific embodiment of the invention.
- FIG. 4 is a block diagram illustrating operation of a congestion management technique in a VOQ fabric according to a specific embodiment of the invention.
- a shared memory switch which employs partitions of the shared memory to implement multiple, independent virtual congestion domains. As will be described, this approach allows congestion to be handled for different classes of traffic independently.
- Specific embodiments of the invention will be described with reference to an Ethernet switch implementation which may be employed in multi-chip architectures such as, for example, Clos architectures, spanning trees, fat trees, etc. Examples of architectures in which embodiments of the present invention may be implemented are described in U.S. patent application Ser. No. 11/208,451 for SHARED-MEMORY SWITCH FABRIC ARCHITECTURE filed on Aug. 18, 2005 (Attorney Docket No. FULCP011), the entire disclosure of which is incorporated herein by reference for all purposes. However, it should be noted that embodiments of the present invention are not limited to the foregoing and may be implemented in a wide variety of architectures.
- FIG. 1 shows a portion of a shared-memory Ethernet switch 100.
- the diagram has been simplified to better illustrate important aspects of the invention. For example, only one ingress port and one egress port are shown in FIG. 1. However, it will be understood that such a switch typically has many ports, e.g., 16, 24, or 36, each of which will have associated instances of at least some of the circuitry shown in FIG. 1. Therefore, the scope of the invention should not be limited with reference to such simplifications.
- Ethernet Port Logic (EPL) 102 receives an ingress frame which is classified by frame classifier 104 to determine how the frame will be treated in the switch, e.g., quality of service (QoS) and destination port.
- Different classes of traffic might include, for example, storage traffic, inter-processor communication traffic, LAN traffic, etc.
- Congestion control 108 implements a policer which limits bandwidth by dropping or marking frames which exceed configured rate thresholds for the particular traffic class.
- Congestion control 108 also implements a rate limiter which handles bandwidth throttling, causing input ports to be paused if they exceed certain rate thresholds, e.g., using Ethernet “pause” and “pause off” frames.
- the rate limiter included in congestion control 108 implements a pause-pacing function which enables “lossless” rate limiting for some classes of traffic. That is, for such classes of traffic, frame transfer is generally paused rather than allowing the frame to enter the port and then discarding it.
- congestion control 108 integrates two different functions, i.e., it can cause discard of forward going packets, thereby decreasing the ingress rate through the policing function (e.g., red, yellow, green marking), and it can facilitate lossless link-level flow control, i.e., a pause pacing function in the backward direction.
- a modified token bucket is employed by the rate limiter to measure the input rate and then translate that into the link-level pause-pacing function to the input. This includes class-based pauses in which pausing can be done on a link for specific classes of traffic.
- Congestion control 108 looks at the ingress rate as defined by the token buckets and uses it to either police the flows (i.e., mark frames red, yellow, or green), or to rate limit the flows via a pause pacing function.
- congestion control 108 also interprets multi-hop congestion notification messages which enables it to replace or “proxy” similar functionality in a network interface card (NIC) to which it is linked, i.e., if there isn't logic in the NIC capable of facilitating a rate limiting function, or the NIC does it inefficiently, the pause pacing function may be introduced in the switch as a proxy. This enables the implementation of such functionalities with legacy NICs.
- a switch architecture includes a shared memory 152, a scheduler 154, and a frame processing pipeline 156 as described in U.S. patent application Ser. No. 11/208,451 incorporated herein by reference above.
- a packet is streamed into shared memory 152 without the possibility of blocking through a system of crossbars 158 and 160, while the packet headers are copied into frame processing pipeline 156.
- Scheduler 154 allocates pointers to memory and associates them with port logic as the packets are coming in.
- a status channel 162 goes from scheduler 154 to frame processing pipeline 156 , and communicates the status of each segment of memory as it is being allocated to each port.
- Frame processing pipeline 156 maintains state identifying the ingress port on which each packet arrives and the egress port or ports to which the packet is directed.
- Such an architecture enables the communication of memory allocation information from the scheduler to the frame processing pipeline with extremely low latency, i.e., for each memory segment rather than each packet or frame which might include, for example, dozens of segments.
- because congestion management policies are based on the status of what memory is actually allocated in the system, and because such an architecture enables updating the status of memory allocation on a segment-by-segment basis rather than a packet-by-packet basis, flow control, i.e., the implementation of congestion management policies, may be effected and enforced much more quickly and richly than conventional approaches allow.
- the very low latency information transfer between the switch element datapath and the frame processing pipeline is leveraged to enable rapid flow control responses within a chip and, according to some embodiments, in a multi-chip fabric, i.e., the latency of flow control loops in which one chip can communicate congestion information to upstream chips in the fabric is greatly reduced.
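The correlation of per-segment memory status with per-frame port information can be pictured with a small software model. The sketch below is an assumption-laden illustration only: the names `FramePipeline`, `on_header`, and `on_segment_status`, and the idea that each status event carries a frame identifier, are hypothetical and are not taken from the source.

```python
from collections import Counter


class FramePipeline:
    """Hypothetical model: correlate segment status events with port info."""

    def __init__(self):
        self.port_info = {}        # frame_id -> (ingress_port, egress_ports)
        self.rx_usage = Counter()  # segments held, keyed by ingress port
        self.tx_usage = Counter()  # segments held, keyed by egress port

    def on_header(self, frame_id, ingress_port, egress_ports):
        """Record port information when the frame header reaches the pipeline."""
        self.port_info[frame_id] = (ingress_port, tuple(egress_ports))

    def on_segment_status(self, frame_id, allocated: bool):
        """Apply one status-channel event: a segment allocated or de-allocated."""
        delta = 1 if allocated else -1
        ingress, egresses = self.port_info[frame_id]
        self.rx_usage[ingress] += delta
        for egress in egresses:
            self.tx_usage[egress] += delta
```

Because the counters move once per segment rather than once per frame, a policy check reading `rx_usage` or `tx_usage` sees memory pressure while a long frame is still arriving.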
- frames stored in shared memory 110 are retrieved for transmission by scheduler 112 which is followed by another rate limiting mechanism in egress shaper 114 .
- the egress shaper 114 uses the output of classifier 104 together with the mapping table 116 to determine the bandwidth allocated to a particular bandwidth sharing group. According to a specific embodiment, egress shaper 114 performs this function with reference to bandwidth sharing groups (discussed below) to which the various traffic classes are mapped by mapping function 116 . Frames exceeding their QoS rates are marked by the policer in congestion control 108 with reference to the configuration stored in the policer.
- a set of counters and “watermarks” monitor how frame memory 110 is used.
- the counters and watermarks are used for a variety of purposes, including: to enable packet discard, i.e., the policing function which results in the dropping of packets because queues are full; to enable pause frame generation, i.e., link level flow control which uses a pause frame to tell the immediately upstream link partner to stop sending packets on a particular link; and to enable congestion notification frame generation, i.e., frames indicating congestion which can potentially traverse multiple hops to any upstream port in a multi-chip congestion domain. Two different modes of congestion notification are described below.
- the first is a uni-cast approach in which egress frames are statistically sampled and, when an egress port is found to be congested, the source and destination addresses of the frame are switched in a congestion notification frame which is then transmitted upstream to the source of the congestion. The source then interprets that information to slow down the corresponding flow (see the description of SCN and BCN below).
- the second is a multi-cast approach in which the congestion notification message is sent back to all input ports (see the description of VCN below). In both cases, a layer 2 address tells the frame where to go, and it's tagged so that when it gets to its destination, a compliant device can filter and interpret it properly.
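The uni-cast mode can be illustrated with a short sketch in which a sampled egress frame yields a notification whose addresses are the frame's addresses swapped. The `Frame` and `CongestionNotification` types, the `sample_prob` parameter, and the function name `maybe_notify` are hypothetical names introduced for this example; the source does not specify a sampling probability or a frame layout.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    src: str  # layer 2 source address
    dst: str  # layer 2 destination address


@dataclass(frozen=True)
class CongestionNotification:
    src: str
    dst: str
    congested_port: int


def maybe_notify(frame: Frame, egress_port: int, port_congested: bool,
                 sample_prob: float, rng: random.Random
                 ) -> Optional[CongestionNotification]:
    """Statistically sample egress frames; notify only for congested ports."""
    if not port_congested:
        return None
    if rng.random() >= sample_prob:
        return None  # frame was not selected by the sampler
    # Swap the addresses so the notification travels back to the frame's source.
    return CongestionNotification(src=frame.dst, dst=frame.src,
                                  congested_port=egress_port)
```

Sampling keeps the notification rate proportional to the offending traffic, so heavier flows are more likely to be told to slow down.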
- these features enable policy enforcement with regard to memory usage for different traffic classes.
- the policing and rate limiting functions of congestion control 108 are enforced.
- exceeding some of the watermarks may be reflected in the CM state generated by CM block 118 which may result in generation of congestion notification frames by congestion notification block 120 .
- These congestion notification frames are sent to link partners, e.g., neighboring switches in the switch fabric, i.e., from which the frames exceeding the threshold were transmitted, for use in determining rate adjustments (e.g., by rate adjustment block 122 ) to be applied by the local rate limiting function (e.g., rate limiter 108 ).
- frame memory 110 is implemented with multiple shared memory partitions 124 which enable mapping of different traffic classes into different partitions, and the application of sets of watermarks accordingly.
- the combination of multiple shared memory partitions, the implementation of the egress scheduler, and the use of class-based pause enables end-to-end partitioning of traffic in multiple virtual congestion domains which, in turn, enables the application of independent congestion management policies for different classes of traffic.
- This enables a switch fabric in which frames in different partitions do not interfere with each other on the ingress ports, in the shared memory, or on the egress ports.
- policies can be implemented in which LAN traffic can be allowed to be lossy (i.e., dropped frames permitted), but storage traffic, which cannot tolerate dropped frames and is latency-sensitive, can be handled in a lossless manner, and each type of traffic can be sub-divided into different priorities irrespective of the other type of traffic.
- a rate limiter which employs a token bucket to measure input rates and then translate those rates into a pause pacing function to the input using “pause” and “pause-off” frames, e.g., as defined by the IEEE Ethernet specification. This may be applied to a link as a whole or for specific classes of traffic on a link.
- a congestion control algorithm is enabled to adjust the rate at which tokens are added to the token bucket.
- Ingress frames received by Ethernet port 102 are classified in one of a plurality of traffic classes, i.e., by classifier 104 .
- Rate meter 202 in congestion control 108 monitors the traffic rates for the respective classes and provides its output to both policer 204 and rate limiter 206 .
- Policer 204 uses the information provided by rate meter 202 to implement the policing function described above.
- Rate limiter 206 uses the information provided by rate meter 202 in conjunction with congestion notification information from other downstream switches in the congestion domain to implement the pausing function described above.
- rate limiter 206 introduces pause frames into the upstream datapath which are communicated to the upstream link partner, e.g., represented by Ethernet port 208 .
- Port 208 may be inside or outside of a congestion domain which may be defined by a multi-chip switch fabric such as, for example, a Clos architecture or spanning tree.
- rate limiter 206 implements two different forms of link level, lossless rate limiting, one based on configured link level rates, and the other based on congestion notification messages at the congestion domain level. That is, rate limiter 206 allows one to specify a fixed desired link level rate thus creating a local loop which enables local rate limiting.
- the congestion notification information received by congestion control block 108 enables end-to-end or multi-hop congestion control in the congestion domain.
- the congestion notification information is derived from congestion notification frames indicating congestion in downstream switches in the fabric which is determined to have resulted from frames originating from the switches to which the congestion notification frames are sent. It should be noted that these frames may be generated according to any of a wide variety of public or proprietary congestion notification algorithms.
- congestion notification messages may also be employed to enable link-level pause at the ingress boundary of a single switch or multi-hop fabric. And by spreading congestion from a congestion point to the periphery of a switch fabric, the amount of head-of-the-line blocking is greatly reduced even if the ultimate source and sink of data frames are not included in the congestion control domain. It should be noted that implementation of such an approach outside of the switch fabric, e.g., in a network interface controller (NIC), is difficult in that there might be thousands of simultaneous flows which would need to be monitored and this is extremely expensive to implement in silicon.
- the classification of layer 2 traffic at the edges of the switch fabric followed by the monitoring of traffic rates at that level of granularity enables an optimization which, while accepting some amount of head-of-the-line blocking, does not require devices outside of the congestion domain defined by the switch fabric to implement any corresponding algorithms. As mentioned above, this enables the use of legacy NICs.
- pause frames might be independently generated and transmitted to the link partners for one or more ports when the shared memory becomes full (not shown).
- “lossless” rate control is implemented in congestion control 108 using one or more token buckets, e.g., one for the link as a whole, and/or one for each class of traffic.
- the token buckets are implemented as part of rate meter 202 . Tokens are added to each bucket at a specified rate. Each time a frame is received, some number of tokens corresponding to the length of the frame (e.g., number of bytes) are removed from the appropriate bucket(s). When the number of tokens in a bucket reaches or drops below zero, the pause function for the link or the specific class is enabled, e.g., a pause frame is sent to the upstream link partner.
- the pause frame sent may be for the entire link or just for a particular class of traffic on that link, i.e., class-based pause.
- When the number of tokens in a bucket reaches some threshold above zero, a pause-off frame is sent to the link partner.
- the level of the pause-off threshold for each bucket introduces hysteresis and may be empirically determined as a balance between jitter and consumption of bandwidth by pause function frames.
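The token bucket behavior described above, with pause asserted at or below zero tokens and released only at a higher pause-off threshold, can be sketched as follows. This is a minimal software model under stated assumptions: the class name `PauseTokenBucket`, the `capacity` limit, and the choice of bytes and seconds as units are illustrative and not taken from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PauseTokenBucket:
    rate_bytes_per_s: float     # token fill rate (assumed unit: bytes per second)
    capacity: float             # assumed upper bound on stored tokens
    pause_off_threshold: float  # tokens (> 0) required before pause is lifted
    tokens: float = 0.0
    paused: bool = False

    def refill(self, elapsed_s: float) -> Optional[str]:
        """Add tokens for the elapsed time; report a pause-off event if due."""
        self.tokens = min(self.capacity,
                          self.tokens + self.rate_bytes_per_s * elapsed_s)
        if self.paused and self.tokens >= self.pause_off_threshold:
            self.paused = False
            return "pause-off"
        return None

    def on_frame(self, length_bytes: int) -> Optional[str]:
        """Remove tokens for a received frame; report a pause event if due."""
        self.tokens -= length_bytes
        if not self.paused and self.tokens <= 0:
            self.paused = True
            return "pause"
        return None
```

The gap between zero and `pause_off_threshold` is the hysteresis: a larger gap means fewer pause and pause-off frames on the link, at the cost of longer stalls and therefore more jitter.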
- the rate at which tokens are introduced into the token bucket(s) associated with congestion control 108 is adjusted in response to the output of rate meter 202 and congestion notification information derived from frames received from downstream link partners.
- These congestion notification frames may include information such as, for example, the level of the downstream congestion, whether the congestion is increasing or decreasing, etc.
- rate adjustment 210 decreases the token rate(s) relative to the actual traffic rate(s) measured by rate meter 202 which, according to a specific embodiment, employs exponentially weighted moving averages to measure traffic rates.
- rate adjustment 210 may filter out multiple congestion notification messages that come from downstream link partners and arrive more frequently than once every minimum round trip delay of the network, thus preventing over constriction of any particular source of congestion.
- rate limiting algorithms typically employ a multiplicative decrease (or an additive increase) to converge to a new rate.
- a metering function is implemented in which the multiplicative decrease starts from the current rate (in a time averaging sense) rather than from the predefined (and often high) line rate (as with conventional algorithms).
- rate increases are generated with respect to a previously stored acceptable rate in order to ensure a fast recovery to the full rate. That is, if the measured rate is used for rate increases, the new measured rate would be a function of the previous measured rate. This time dependency would then slow down the recovery.
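The rate adjustment ideas above (an exponentially weighted moving average of the measured rate, a multiplicative decrease that starts from the measured rate, filtering of notifications that arrive faster than the minimum round trip delay, and recovery computed from a stored rate) can be combined in one sketch. All names and constants here are assumptions for illustration, and treating the configured rate as the "previously stored acceptable rate" is one possible reading of the text, not a confirmed detail.

```python
class RateAdjuster:
    """Hypothetical model of the token rate adjustment for one bucket."""

    def __init__(self, configured_rate: float, min_rtt_s: float,
                 decrease_factor: float = 0.5, increase_fraction: float = 0.25,
                 ewma_alpha: float = 0.25):
        self.stored_rate = configured_rate  # assumed "stored acceptable rate"
        self.token_rate = configured_rate   # rate at which tokens are added
        self.measured_rate = 0.0            # EWMA of observed traffic rate
        self.min_rtt_s = min_rtt_s
        self.decrease_factor = decrease_factor
        self.increase_fraction = increase_fraction
        self.ewma_alpha = ewma_alpha
        self.last_cn_time = None

    def observe(self, sample_rate: float) -> None:
        """Fold a new rate sample into the moving average."""
        self.measured_rate += self.ewma_alpha * (sample_rate - self.measured_rate)

    def on_congestion_notification(self, now_s: float) -> bool:
        """Decrease from the measured rate; ignore bursts within one RTT."""
        if self.last_cn_time is not None and now_s - self.last_cn_time < self.min_rtt_s:
            return False  # filtered to avoid over-constricting the source
        self.last_cn_time = now_s
        self.token_rate = self.decrease_factor * self.measured_rate
        return True

    def recover(self) -> None:
        """Step back up by a fixed fraction of the stored rate."""
        step = self.increase_fraction * self.stored_rate
        self.token_rate = min(self.stored_rate, self.token_rate + step)
```

Because each recovery step is a fraction of the stored rate rather than of the current (reduced) rate, the number of steps needed to return to full rate does not depend on how deep the decrease was.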
- frame memory 110 includes multiple shared memory partitions 124 . Every ingress frame is mapped based on its traffic class to one of memory partitions 124 .
- Congestion management block 118 monitors multiple private counters (associated with frame memory 110) for each partition 124 (i.e., at least one for each port) which count the frames stored in that partition from each of the corresponding ports. This is represented by private memory blocks 126.
- Congestion management block 118 also monitors an aggregate receive (rx) counter (associated with frame memory 110) for each partition 124 which counts frames from all of the ports, but only when the watermarks associated with one or more of the private receive counters are exceeded. That is, specific frames are not registered by the aggregate counter as using memory beyond the private memory allocated to the corresponding port unless the watermark for that port has been exceeded. This is represented by shared memory block 128.
- the congestion management block 118 also keeps track of per transmission port per traffic class memory usage in transmit/traffic class (tx/tc) counters. When an aggregate receive counter exceeds its shared memory watermark, action is taken only for the ports that have private counters above their respective private memory watermarks. By tracking usage of the different memory partitions in this way, congestion management policies may be implemented independently for the different traffic classes on a per port basis.
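The interplay between per-port private counters and the aggregate counter for one partition can be modeled roughly as below. This sketch makes several assumptions that are not in the source: the class name `PartitionCounters`, a single private watermark shared by all ports, and counting in abstract memory units rather than frames.

```python
class PartitionCounters:
    """Hypothetical counters for one shared memory partition."""

    def __init__(self, num_ports: int, private_watermark: int,
                 shared_watermark: int):
        self.private_watermark = private_watermark
        self.shared_watermark = shared_watermark
        self.rx = [0] * num_ports  # per-port usage of this partition
        self.shared = 0            # usage beyond the ports' private allocations

    def store(self, port: int, units: int = 1) -> None:
        """Charge memory to a port; overflow past private memory is shared."""
        for _ in range(units):
            self.rx[port] += 1
            if self.rx[port] > self.private_watermark:
                self.shared += 1

    def release(self, port: int, units: int = 1) -> None:
        """Return memory, crediting the shared pool first."""
        for _ in range(units):
            if self.rx[port] > self.private_watermark:
                self.shared -= 1
            self.rx[port] -= 1

    def ports_to_act_on(self) -> list:
        """When shared memory is over its watermark, pick only the hogs."""
        if self.shared <= self.shared_watermark:
            return []
        return [p for p, used in enumerate(self.rx)
                if used > self.private_watermark]
```

One instance would exist per partition, so a lossless storage partition and a lossy LAN partition each see only their own counters and can be policed independently.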
- each counter associated with a partition 124 has multiple watermarks. Depending on the set of watermarks exceeded, an incoming frame will be assigned to some priority, dropped, or marked. If the frame is not dropped, the level of service provided to the frame depends on the assigned priority.
- these watermarks are used to facilitate pause, congestion notification, and packet discard.
- an rx pause watermark and an rx hog watermark are both per port watermarks, the first of which results in a pause frame being sent back out that link when that port is using more memory than it is allowed to, and the second of which results in a packet being discarded when that port is using more memory than it is allowed to.
- a tx hog watermark is a per port watermark which will drop a packet based on a tx port being full.
- a tx congestion notification watermark is a per port watermark which results in a congestion frame being sent back to the source address.
- a sum over all ports of the memory usage represented by the “per port” watermarks can be much greater than total memory.
- the sum of the memory usage represented by the “per port private” watermarks must be less than the total memory, i.e., shared memory is the remaining portion of the total memory. In the case of rx ports, the private memory is there to minimize head-of-the-line blocking for pause.
- When a pause is executed in response to a global watermark, only the ports exceeding their private watermarks are paused rather than all input ports, because ports that have not exceeded their private watermarks are not actually contributing to congestion.
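The selective pause behavior described above can be sketched in a few lines. This is only an illustrative sketch, not the patented circuitry: the function name `ports_to_pause`, the dictionary-based counters, and the use of memory segments as the unit are all assumptions made for the example.

```python
def ports_to_pause(aggregate_rx, shared_watermark, private_rx, private_watermarks):
    """Return the ports to pause for one shared memory partition.

    aggregate_rx       -- hypothetical aggregate rx counter (segments)
    shared_watermark   -- hypothetical shared (global) watermark (segments)
    private_rx         -- {port: private rx counter}
    private_watermarks -- {port: private rx watermark}
    """
    # No action at all unless the aggregate counter is over its watermark.
    if aggregate_rx <= shared_watermark:
        return []
    # Pause only the ports that are over their own private allocation.
    return sorted(
        port for port, used in private_rx.items()
        if used > private_watermarks[port]
    )


counters = {0: 30, 1: 5, 2: 50}
watermarks = {0: 20, 1: 20, 2: 40}
print(ports_to_pause(120, 100, counters, watermarks))  # ports 0 and 2
```

Port 1 stays unpaused in this example because it is still within its private allocation, which is the head-of-the-line blocking benefit the text attributes to private memory.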
- each port's receive counter has an rx watermark
- each port's tx/tc counter has a tx/tc watermark
- the aggregate of the tx/tc counters is compared with the tx watermark. That is, for each shared memory partition, there is an aggregate rx counter, per port rx counters, and per port per traffic class tx counters.
- the purpose of the rx watermark is to prevent excessive usage of the shared memory partition by traffic from the corresponding port. When the rx watermark is exceeded, the frame is either dropped or paused depending, for example, on the traffic class.
- the ability to pause on a per port basis is advantageous in that, if a port is not contributing to congestion, it is undesirable to pause it.
- the pause is implemented similarly to the pause function associated with the rate limiter described above, e.g., generation of an Ethernet pause frame.
- the purpose of the tx and tx/tc watermarks is to prevent congestion of the corresponding port by frames transmitted out of the shared memory. When these watermarks are exceeded, frames are dropped. Having both the rx and tx watermarks active allows the transmission of frames between any pair of ports that are not congested independent of congestion conditions for other ports.
- the tx watermark is compared against the aggregate of the tx/tc counters, and is used for applications in which it is not important to distinguish between the traffic classes and hence we do not need to reserve memory per traffic class.
- The watermarks may be configured as appropriate for a particular application. That is, watermark levels may be adjusted or removed entirely in different combinations depending on the particular implementation. For example, if certain ports require no memory reservations per tx or per tx/tc, the tx and tx/tc watermarks may be turned off. Or, if class-based tx memory reservation is not needed, the tx/tc watermark may be turned off. This allows the system designer to allocate private memory in the memory partitions only as needed.
- A frame is eligible to be dropped only when the rx private, tx private, and tx/tc private allocations have all been exceeded. This ensures that private memory remains reserved for each rx, tx, and tx/tc allocation. It also means that the total memory used in the system is the sum of rx private, rx shared, and max(tx private, sum(tx/tc private)), which the user should ensure does not exceed the total memory of the switch.
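The drop condition and the memory budget stated above can be written out as a small sketch. The function names and argument layout are hypothetical, and the budget formula follows one reading of the text: a single rx private total, a single rx shared amount, and one tx private value compared against the sum of the tx/tc private values.

```python
def drop_eligible(rx_used, rx_private, tx_used, tx_private, txtc_used, txtc_private):
    """A frame may be dropped only if all three private allocations are exceeded."""
    return (
        rx_used > rx_private
        and tx_used > tx_private
        and txtc_used > txtc_private
    )


def worst_case_memory(rx_private_total, rx_shared, tx_private, txtc_private):
    """Sum of rx private, rx shared, and max(tx private, sum(tx/tc private))."""
    return rx_private_total + rx_shared + max(tx_private, sum(txtc_private))


# Hypothetical configuration check against a 1024-segment memory.
budget = worst_case_memory(400, 300, 200, [80, 80, 80])
print(budget, budget <= 1024)
```

A configuration tool could run a check of this kind before committing watermark settings, since the text leaves it to the user to keep the total within the switch memory.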
- congestion notification block 120 generates congestion notification frames in response to the CM state generated by congestion management block 118 .
- CM block 118 generates the CM state with reference to the tx and tx/tc watermarks, i.e., the indicators of congestion at local egress ports.
- the pause capability on ingress ports described above may be further enhanced if egress scheduler 112 also has a pause capability and, in particular, supports class-based pause.
- the tx/tc watermark triggers the class-based pause frame generation.
- the egress scheduler 112 is the block that determines when frames are transmitted. When a pause frame is received by a switch, the egress scheduler stops the traffic going out on the corresponding port.
- The combination of class-based pause, shared memory partitions, and bandwidth sharing groups in the egress scheduler enables a converged fabric in which best-in-class congestion management disciplines may be implemented such that the different traffic types converged in the fabric do not get in each other's way.
- a plurality of counters are employed in conjunction with a plurality of ingress watermarks and a plurality of egress watermarks to monitor and control memory usage by the various ports and traffic classes.
- Each memory partition has an aggregate ingress counter which tracks the number of segments of the memory consumed by that partition.
- Each memory partition also has a private ingress counter for each port which tracks the number of memory segments consumed by that port.
- a private ingress watermark associated with each private ingress counter defines the private memory allocated to the corresponding port within the memory partition. When an ingress port's private ingress counter is below this watermark, the port is not subject to memory usage based pausing or dropping for that memory partition.
- a “hog” ingress watermark is also associated with each private ingress counter which prevents the corresponding port from consuming too much memory. If a received frame will result in the hog ingress watermark being exceeded, the frame is dropped only if the corresponding private ingress watermark is also exceeded.
- a global ingress watermark associated with the aggregate ingress counter defines the total number of segments over all ports allocated to the corresponding memory partition (not including the private memory allocations for each port). Thus, the total memory usage for a memory partition over all ports will not be allowed to exceed this watermark and the sum of the private ingress watermarks for that partition. If a received frame will result in the global ingress watermark being exceeded, the frame is dropped only if the private ingress watermark for the port on which the frame was received is also exceeded.
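The interaction of the private, hog, and global ingress watermarks can be illustrated with a short admission check. This is a sketch under stated assumptions: the name `admit` and its parameters are hypothetical, counts are in segments, and each watermark is tested against the count that would result from accepting the frame.

```python
def admit(frame_segments, port_used, aggregate_used,
          private_wm, hog_wm, global_wm):
    """Return True if the frame is stored, False if it is dropped."""
    new_port = port_used + frame_segments
    new_aggregate = aggregate_used + frame_segments

    # Within the port's private allocation: never dropped for memory usage.
    if new_port <= private_wm:
        return True
    # Over the private allocation and over the per-port hog limit: drop.
    if new_port > hog_wm:
        return False
    # Over the private allocation and over the partition's global limit: drop.
    if new_aggregate > global_wm:
        return False
    return True


# A port inside its private allocation is admitted even when the
# partition as a whole is over its global watermark.
print(admit(2, 5, 50, private_wm=10, hog_wm=20, global_wm=40))
```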
- a set of pause watermarks is provided relating to global memory usage and another set relating to private memory usage. These watermarks are used by congestion management circuitry to generate pause “on” and pause “off” frames on a per port and/or a per traffic class basis.
- each memory partition also has a private egress counter for each port which tracks the number of segments currently in the memory partition intended to be transmitted out on that port.
- a private egress watermark is associated with each private egress counter for each traffic class which represents the amount of memory allocated for that traffic class.
- Multiple “hog” egress watermarks are also associated with each port to prevent a single port from consuming too much memory. The different hog egress watermarks correspond to the different traffic priorities.
- mapping function 116 maps traffic classes identified by frame classifier 104 into bandwidth sharing groups among which the egress bandwidth is allocated. For example, 8 traffic classes might be mapped into two bandwidth sharing groups, each having 4 of the classes and each of which is allocated 50% of the egress bandwidth. That is, each group of 4 classes could only consume 50% of the available egress bandwidth. This could be effected, for example, using a deficit-weighted-round-robin algorithm to schedule traffic as between bandwidth groups (assuming the groups have equal priority). However, within each group there is a strict prioritization according to traffic class such that a higher class within a group could potentially starve out lower priority traffic.
- bandwidth sharing groups may also be prioritized with respect to each other. This could enable, for example, creation of a strict high priority bandwidth group which could starve all lower priority groups, and/or a strict low priority bandwidth group which could only consume bandwidth if none of the other bandwidth groups have traffic to send. Examples of bandwidth sharing groups which might be important in a typical application include inter-processor traffic, LAN traffic, storage traffic, and web traffic.
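The scheduling described above, deficit-weighted round robin between bandwidth sharing groups with strict priority inside each group, can be sketched as follows. The class and function names are hypothetical, frames are represented only by their lengths in bytes, and the groups are assumed to have equal priority (the strict inter-group priorities mentioned above are not modeled).

```python
from collections import deque


class BandwidthGroup:
    def __init__(self, quantum, num_classes):
        self.quantum = quantum  # bytes of credit added per round
        self.deficit = 0        # unused credit carried between rounds
        # One queue per traffic class; index 0 is the highest priority.
        self.queues = [deque() for _ in range(num_classes)]

    def head(self):
        """Return (class index, queue) of the highest-priority backlog, or None."""
        for tc, queue in enumerate(self.queues):
            if queue:
                return tc, queue
        return None


def dwrr_round(groups):
    """Run one DWRR round; return the (group, class, length) frames sent."""
    sent = []
    for gi, group in enumerate(groups):
        if group.head() is None:
            group.deficit = 0  # idle groups do not accumulate credit
            continue
        group.deficit += group.quantum
        while True:
            head = group.head()
            if head is None:
                group.deficit = 0
                break
            tc, queue = head
            if queue[0] > group.deficit:
                break  # not enough credit for the next frame
            length = queue.popleft()
            group.deficit -= length
            sent.append((gi, tc, length))
    return sent


g0 = BandwidthGroup(quantum=1500, num_classes=4)
g0.queues[0].append(1000)
g0.queues[1].extend([500, 500])
g1 = BandwidthGroup(quantum=1500, num_classes=4)
g1.queues[0].extend([1500, 1500])
print(dwrr_round([g0, g1]))
```

With equal quanta each group receives roughly half of the egress bytes per round, while inside a group the higher class is always served first and can therefore starve lower classes, as the text notes.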
- Embodiments of the invention have been described which implement a combination of class-based pause, shared memory partitions, and an egress scheduling algorithm which allows bandwidth groups and priorities within the bandwidth groups.
- This combination enables virtual switching from a congestion management perspective, thus enabling a new class of performance for an Ethernet switch or any other protocol used to implement a converged fabric. That is, virtual domains are enabled for independent treatment of different types of traffic all the way through the switch, and therefore all the way through a multi-chip fabric based on such switches.
- the operation of a single switch or multi-chip fabric may simultaneously reflect the radically different best practices recommended by various industry segments for their different types of traffic.
- embodiments including an upper-bound limitation for specific classes of traffic facilitate desirable functionalities in systems having different types of traffic, e.g., converged fabrics. For example, in systems which carry storage traffic there are almost always large frames being transferred as a result of long backup operations. If there is no limitation on this type of relatively low priority traffic, it could interfere with higher priority traffic, e.g., inter-processor traffic, and defeat high-speed features such as “cut-through” in which frames are passed through a switch without being stored in frame memory.
- The effect of the upper-bound limitation is to pause frames of a specific class of traffic when the upper bound for that class has been reached, regardless of whether there are any frames currently in the switch. This, in turn, reduces the statistical likelihood that a high priority frame will be delayed by the presence of a low priority but large frame which preceded it into the switch. That is, implementing such a “non-work-preserving” scheduler reduces the probability that there will be packets on the line ahead of a given packet and thereby improves overall performance for latency-sensitive traffic.
- multiple shared-memory switches designed according to the invention implement a “stateless” congestion notification (SCN) scheme in a multi-chip fabric.
- This approach includes elements similar to conventional backward congestion notification (BCN) schemes, except that the rate limiters in upstream switches toggle between 0% (pause) and 100% (pause off). That is, a rate limiter goes to 0% when a congestion notification message is received from a downstream congestion point, and returns to 100% automatically after a random period of time.
- the random period is a function of the level of congestion and the randomness is intended to reduce oscillations in sender rates due to synchronized reception of congestion notification frames.
- this pause is effected by removing some number of tokens from the token bucket on which the rate limiter is based. That is, the number of tokens in the bucket is set to a negative number such that the specified period of time is required to bring the number of tokens high enough to generate a pause-off frame.
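The relationship between the desired pause period and the negative token count can be made concrete with a little arithmetic. The sketch below assumes a constant fill rate in tokens per second and a pause-off threshold expressed in tokens; both function names and all parameter names are hypothetical.

```python
def tokens_for_pause(pause_seconds, fill_rate, pause_off_threshold):
    """Token level to force so that refilling takes pause_seconds to reach
    the pause-off threshold (the result is typically negative)."""
    return pause_off_threshold - fill_rate * pause_seconds


def seconds_until_pause_off(tokens, fill_rate, pause_off_threshold):
    """Time needed for the bucket to refill up to the pause-off threshold."""
    return max(0.0, (pause_off_threshold - tokens) / fill_rate)


level = tokens_for_pause(pause_seconds=3, fill_rate=1000, pause_off_threshold=1500)
print(level, seconds_until_pause_off(level, 1000, 1500))
```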
- an SCN system in the contextual example of a proprietary tag switched network 300 may be understood with reference to the flow diagram of FIG. 3 . It should be noted that such a system may be implemented in a wide variety of networks and that the proprietary tag switched network is merely one example.
- Frames are sampled ( 302 ) when there is congestion detected in an egress queue 304 in a switch in the network. According to a specific embodiment, random sampling is used thus obviating the need for flow state storage.
- a congestion notification frame 308 is generated ( 310 ) and transmitted back through network 300 to an upstream flow control 312 associated with the corresponding flow, e.g., the congestion management block in a remote switch from which the frame was received.
- Flow control 312 then pauses the input of the corresponding switch for a random amount of time depending on the level of congestion indicated in the congestion notification frame. Even though on a packet-by-packet basis one doesn't know which flow is genuinely causing the congestion, it is a statistical property of this approach that the sources of congestion will ultimately be adequately flow controlled. According to one approach, a random value is picked from an interval which is a function of the level of congestion, i.e. if we have twice the congestion the random time will be somewhere between R_MIN and 2*R_MAX where R_MAX is the maximum random value for half the congestion.
- the specified time period during which the upstream switches are paused may be automatically extended, e.g., an exponential back off algorithm may be applied to the negative number of tokens in the bucket.
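The random pause period and its extension by exponential back off might be sketched as below. This is an illustration only: the text does not specify the distribution, the scaling, or the back off factor, so the uniform draw, the linear scaling of the upper bound with congestion level, the doubling per repeated notification, and the cap are all assumptions, as are the names `pause_period` and `backed_off_tokens`.

```python
import random


def pause_period(congestion_level, r_min, r_max, rng):
    """Pick a random pause time from an interval that widens with congestion.

    Assumption: the upper bound scales linearly, so twice the congestion
    gives an interval of [r_min, 2 * r_max].
    """
    upper = max(r_min, congestion_level * r_max)
    return rng.uniform(r_min, upper)


def backed_off_tokens(base_deficit, repeats, max_deficit):
    """Double a (negative) token deficit per repeated notification, capped."""
    return max(base_deficit * (2 ** repeats), -max_deficit)


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(pause_period(2, r_min=1.0, r_max=10.0, rng=rng))
print(backed_off_tokens(-100, repeats=3, max_deficit=1000))
```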
- SCN rate limited packets do not need to be tagged to indicate to the downstream congestion point that rate limiting is still in effect upstream.
- SCN enables a quick recovery from congestion in that it does not require several cycles of congestion notification frames to recover to full rate, i.e., when the random time period expires the source will resume at full rate.
- SCN has the advantage of compatibility with proprietary tag switched networks. That is, BCN does not work if frames are modified as they are in such networks. However, according to this embodiment of the invention, congestion notification frames are sent back using the source identifier and will therefore work even if the frame experiences modifications in the network. Finally, because only the flows going through the congested queue will have their frames sampled and flow controlled, flows not contributing to the congestion are not impacted.
- multiple switches designed in accordance with the present invention may be used to implement congestion notification in a virtual output queue (VOQ) fabric, referred to herein as virtual congestion notification or VCN.
- a fabric of Ethernet switches interconnects a plurality of line cards (e.g., telecom line cards) each having an on-board network processing unit (NPU) and per flow queuing.
- this is merely an example.
- It is not necessary that each ingress port have an NPU, only that the ingress port has a classification function and a scheduling function (which may be implemented in an NPU).
- This classification function classifies the ingress flows by egress port and priority.
- the scheduling function can respond to a VCN message by flow controlling the particular queue that goes to that output port/priority. And the device can continue to schedule other queues.
- Let M be the number of ports on each line card, P be the number of ports in the overall system, and Q be the number of priorities.
- Each line card then has M*(P−1)*Q flows, and the overall system has P²*Q flows.
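As a quick worked example of these flow counts, with M ports per line card, P ports in the system, and Q priorities, a line card has M*(P−1)*Q flows and the system has P²*Q flows. The values of M, P, and Q below are arbitrary illustrations, not figures from the text.

```python
def flows_per_line_card(m, p, q):
    """M ports on the card, each with a flow to the other P-1 ports at Q priorities."""
    return m * (p - 1) * q


def flows_in_system(p, q):
    """P ingress ports times P egress ports times Q priorities."""
    return p * p * q


M, P, Q = 24, 240, 8  # hypothetical sizes
print(flows_per_line_card(M, P, Q))  # 45888
print(flows_in_system(P, Q))         # 460800
```

The gap between the two numbers is why per-flow state is kept only at the line cards for their own flows rather than anywhere that would need the system-wide count.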
- a multicast congestion message is sent back to all the input ports. Because it is known which queue and priority is congested, the upstream switches can pause only the particular flow which is causing the congestion until the congestion is resolved. With this solution, there is no head-of-the-line blocking.
- The ability of a shared memory switch to multicast these congestion notification frames at full rate is essential to the performance of the VCN scheme.
- Each queue in VOQ fabric 400 (e.g., queue 402) has two associated watermarks, Xon and Xoff, i.e., transmission on and transmission off. When one of these watermarks is crossed, a congestion notification frame is generated ( 404 ).
- the congestion notification frame 405 is an Ethernet frame with a configurable multicast address and encapsulates the queue level and an Xon/Xoff state that identifies which of the watermarks was crossed.
- the multicast address is configured per queue and allows only a known set of reaction points (e.g., 406 ) that use the queue to be targeted due to congestion at the queue. This has the advantage of limiting the bandwidth usage associated with congestion notification because a general broadcast of congestion notification frames is eliminated.
- the congestion notification frame is then sent to the set of reaction points.
- reaction points may correspond, for example, to flow control blocks in upstream switches.
- the system does not need to statistically sample frames because the reaction points will only pause the particular flow which goes to the congested egress port and priority.
- the reaction points use the information in the congestion notification frame to reduce or enhance their respective flow control (e.g., 408 ) depending on the level of congestion.
- These reductions and/or enhancements may be implemented according to various embodiments of the invention described herein. For example, according to specific embodiments, only the flows that use the congested queue are paused or unpaused at the reaction points. Flows that do not use the congested queue remain unaffected.
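The reaction-point behavior described above can be sketched as follows. The field names, the string multicast addresses, and the idea of a local table mapping each configured multicast address to an (egress port, priority) queue are assumptions made for illustration; the text specifies only that the frame carries a configurable multicast address, the queue level, and an Xon/Xoff state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CongestionNotification:
    multicast_addr: str  # configured per congested queue
    queue_level: int     # fill level of the congested queue
    xoff: bool           # True: Xoff watermark crossed; False: Xon


@dataclass
class ReactionPoint:
    # Hypothetical local table: multicast address -> (egress port, priority).
    subscriptions: dict
    paused: set = field(default_factory=set)

    def on_notification(self, frame):
        flow = self.subscriptions.get(frame.multicast_addr)
        if flow is None:
            return  # this reaction point does not use the congested queue
        if frame.xoff:
            self.paused.add(flow)
        else:
            self.paused.discard(flow)

    def may_send(self, egress_port, priority):
        return (egress_port, priority) not in self.paused


rp = ReactionPoint({"01:00:5e:00:00:07": (7, 2)})
rp.on_notification(CongestionNotification("01:00:5e:00:00:07", 90, xoff=True))
print(rp.may_send(7, 2), rp.may_send(7, 1), rp.may_send(3, 2))
```

Only the flow toward port 7 at priority 2 is held back in this example; other queues at the same reaction point continue to be scheduled, which is how the scheme avoids head-of-the-line blocking.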
- Reaction points are implemented with large buffer capacities to prevent queue buildup in the network, leading to lower overall latency in the network. Storing per flow state information at reaction points is feasible in that this information need only be stored locally for the flows using the associated ingress ports. That is, because reaction points do not need to “know” the entire flow state of the network, the VCN approach described herein provides a scalable flow state storage solution.
- embodiments of the invention may be implemented in processes and circuits which, in turn, may be represented (without limitation) in software (object code or machine code), in varying stages of compilation, as one or more netlists, in a simulation language, in a hardware description language, by a set of semiconductor processing masks, and as partially or completely realized semiconductor devices.
- Examples include hardware description languages (e.g., Verilog, VHDL), simulatable representations (e.g., SPICE netlist), semiconductor processes (e.g., CMOS, GaAs, SiGe, etc.), and device types (e.g., packet switches).
- Embodiments of the invention are described herein with reference to switching devices, and specifically with reference to packet or frame switching devices. According to such embodiments and as described above, some or all of the functionalities described may be implemented in the hardware of highly-integrated semiconductor devices, e.g., 1-Gigabit and 10-Gigabit Ethernet switches, IP routers, DSL aggregators, switch fabric interface chips, and similar devices.
Description
- The present invention relates to switch and multi-hop switch fabric architectures and, in particular, to flow and congestion control techniques in such architectures.
- The goal of scalable switch fabric architectures is to interconnect N switches in such a way as to achieve as close as possible to N times the transmission bandwidth achievable with a single switch. Techniques for pursuing this goal include flow control and congestion management, which attempt to ensure efficient use of fabric bandwidth without latency spikes or packet loss. Conventional architectures such as, for example, those commonly used with Fibre Channel and InfiniBand protocols, typically rely on a credit-based, input/output queued switch architecture that results in stiff flow control which can significantly limit the bandwidth of the overall fabric. Alternatively, common architectures for fabrics in Ethernet switches often rely on statistical packet drop with very large buffers to achieve near full bandwidth operation. This has the disadvantage of penalizing applications which are highly sensitive to loss or jitter, and it results in a high manufacturing cost for the switch due to off-chip memories, etc.
- Converged fabrics, i.e., switch fabrics which attempt to integrate different classes of traffic having often radically different priority and bandwidth requirements, exacerbate the problems associated with flow control and congestion management. Existing solutions have difficulty integrating such disparate types of traffic while efficiently using available fabric bandwidth.
- According to various embodiments of the present invention, a shared memory switch is provided which includes a plurality of ports configured to receive and transmit frames of data, frame classification circuitry configured to classify the frames into a plurality of traffic classes, and frame memory configured to store the frames. The frame memory includes a plurality of shared memory partitions. Each of the shared memory partitions corresponds to one or more of the traffic classes, and has a plurality of counters associated therewith. The plurality of counters includes at least one per port memory usage counter for each of the plurality of ports, and at least one aggregate memory usage counter. The counters associated with each of the shared memory partitions are independent of the counters associated with others of the shared memory partitions. Congestion management circuitry implements congestion management policies for each of the partitions independently with reference to the counters associated with each of the partitions.
- According to another set of embodiments, a shared memory switch for use in a single-chip fabric or a multi-chip fabric is provided. The switch includes a plurality of ingress ports and egress ports configured to receive and transmit frames of data, frame memory configured to store the frames, and rate limiting circuitry configured to pause individual ones of the ingress ports in response to usage of the frame memory by the individual ingress ports. The rate limiting circuitry is further configured to pause the individual ingress ports in response to congestion notification information corresponding to one or more of the egress ports or other ones of the shared memory switches in the multi-chip fabric downstream from the shared memory switch.
- According to yet another set of embodiments, a shared memory switch for use in a multi-chip fabric which includes a plurality of shared memory switches is provided. The switch includes a plurality of ports configured to receive and transmit frames of data, frame classification circuitry configured to classify the frames into a plurality of traffic classes, frame memory configured to store the frames, congestion management circuitry configured to generate and transmit class-specific pause frames to selected ones of the ports in response to states of at least some of the plurality of counters, and egress scheduling circuitry configured to facilitate transmission of the frames. The egress circuitry is further configured to pause transmission of selected ones of the frames corresponding to specific ones of the traffic classes in response to downstream congestion. Together, the congestion management circuitry and the egress scheduling circuitry enable implementation of a congestion management domain encompassing the plurality of shared memory switches.
- According to still another set of embodiments, a shared memory switch for use in a multi-chip fabric which includes a plurality of shared memory switches is provided. The shared memory switch includes a plurality of ingress ports and egress ports configured to receive and transmit frames of data. Each of the frames includes one or more segments. The switch further includes frame memory configured to store the frames and a scheduler configured to allocate and de-allocate space in the frame memory for storage of the frames on a segment-by-segment basis. The scheduler is further configured to generate memory allocation status information on a segment-by-segment basis. A low latency status channel communicates the memory allocation status information on a segment-by-segment basis. A frame processing pipeline provides frame-level processing of the frames in parallel with the scheduler and frame memory and with reference to headers associated with the frames. The frame processing pipeline is further configured to maintain port information for each frame identifying one of the ingress ports on which the frame was received and at least one of the egress ports on which the frame is to be transmitted. The frame processing pipeline is further configured to receive the memory allocation status information from the scheduler via the status channel and to correlate the memory allocation status information with the port information. Congestion management circuitry is configured to effect at least one policy with reference to the correlated memory allocation status and port information.
- According to a further set of embodiments, a shared memory switch for use in a multi-chip fabric which includes a plurality of shared memory switches is provided. The shared memory switch includes a plurality of ingress ports and egress ports configured to receive and transmit frames of data and frame memory configured to store the frames. Congestion management circuitry is configured to detect congestion associated with a particular one of the egress ports, identify a flow with reference to a frame directed to the particular egress port, and generate a first congestion notification message directed to an upstream one of the shared memory switches in the multi-chip fabric from which the flow originated. The congestion management circuitry is further configured to pause a particular one of the ingress ports for a period of time in response to a second congestion notification message received from a downstream one of the shared memory switches in the multi-chip fabric, and automatically unpause the particular ingress port without a subsequent congestion notification message from the downstream shared memory switch. The congestion management circuitry is further configured to exponentially increase the period of time in response to a third congestion notification message from the downstream shared memory switch.
- According to a still further set of embodiments, a shared memory switch for use in a multi-chip fabric comprising a plurality of shared memory switches is provided. The shared memory switch includes a plurality of ingress ports and egress ports configured to receive and transmit frames of data and frame memory configured to store the frames. Congestion management circuitry is configured to detect congestion associated with a particular one of the egress ports, generate a first multi-cast congestion notification message directed to a subset of ingress ports of the shared memory switches upstream in the multi-chip fabric and associated with a first flow directed to the particular egress port and an associated priority to thereby facilitate pausing of the first flow, and generate a second multi-cast congestion notification message directed to the subset of ingress ports to thereby facilitate unpausing of the first flow. The congestion management circuitry is further configured to, in response to a third multi-cast congestion notification message received from a downstream one of the shared memory switches, pause a second flow associated with a particular one of the ingress ports and directed to a particular egress port associated with the downstream shared memory switch and an associated priority, and unpause the second flow in response to a fourth multi-cast congestion notification message from the downstream shared memory switch.
- According to yet a further set of embodiments, rate limiting circuitry is provided for use in a shared memory switch having a plurality of input ports for receiving frames of data. The rate limiting circuitry includes token bucket circuitry implementing a token bucket for each input port. The token bucket circuitry for each port is configured to add tokens to the corresponding token bucket at a specified rate, and to remove tokens from the corresponding token bucket in response to receipt of frames on the corresponding input port. The rate limiting circuitry further includes pause circuitry configured to enable a pause function for the corresponding input port in response to crossing of a minimum threshold associated with the corresponding token bucket, and to disable the pause function in response to crossing of a pause-off threshold associated with the corresponding token bucket and above the minimum threshold.
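The token bucket with separate pause and pause-off thresholds described above amounts to a hysteresis loop, which the following sketch illustrates. The class name, the event strings, and the choice to start with a full bucket are assumptions for the example; tokens are counted in bytes and time in seconds.

```python
class PauseTokenBucket:
    """Per-port token bucket with pause hysteresis (hypothetical sketch)."""

    def __init__(self, rate, capacity, min_threshold, pause_off_threshold):
        if pause_off_threshold <= min_threshold:
            raise ValueError("pause-off threshold must be above the minimum threshold")
        self.rate = rate
        self.capacity = capacity
        self.min_threshold = min_threshold
        self.pause_off_threshold = pause_off_threshold
        self.tokens = capacity  # assumption: bucket starts full
        self.paused = False

    def receive(self, frame_bytes):
        """Remove tokens for a received frame; return 'pause' if newly paused."""
        self.tokens -= frame_bytes
        if not self.paused and self.tokens < self.min_threshold:
            self.paused = True
            return "pause"
        return None

    def tick(self, seconds):
        """Add tokens for elapsed time; return 'pause_off' if newly unpaused."""
        self.tokens = min(self.capacity, self.tokens + self.rate * seconds)
        if self.paused and self.tokens >= self.pause_off_threshold:
            self.paused = False
            return "pause_off"
        return None


bucket = PauseTokenBucket(rate=1000, capacity=3000,
                          min_threshold=0, pause_off_threshold=1500)
events = [bucket.receive(2000), bucket.receive(1500), bucket.tick(1), bucket.tick(1)]
print(events)
```

Keeping the pause-off threshold above the minimum threshold prevents the port from toggling between pause and pause off on every frame when the token count hovers near a single threshold.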
- A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
- FIG. 1 is a block diagram illustrating operation of a congestion management architecture according to a specific embodiment of the invention.
- FIG. 1A is a block diagram of an example of a shared memory architecture in which embodiments of the invention may be implemented.
- FIG. 2 is a block diagram illustrating operation of an ingress rate flow control technique according to a specific embodiment of the invention.
- FIG. 3 is a block diagram illustrating operation of a stateless congestion management technique according to a specific embodiment of the invention.
- FIG. 4 is a block diagram illustrating operation of a congestion management technique in a VOQ fabric according to a specific embodiment of the invention.
- Reference will now be made in detail to specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention.
- According to various embodiments of the present invention, a shared memory switch is provided which employs partitions of the shared memory to implement multiple, independent virtual congestion domains. As will be described, this approach allows congestion to be handled for different classes of traffic independently. Specific embodiments of the invention will be described with reference to an Ethernet switch implementation which may be employed in multi-chip architectures such as, for example, Clos architectures, spanning trees, fat trees, etc. Examples of architectures in which embodiments of the present invention may be implemented are described in U.S. patent application Ser. No. 11/208,451 for SHARED-MEMORY SWITCH FABRIC ARCHITECTURE filed on Aug. 18, 2005 (Attorney Docket No. FULCP011), the entire disclosure of which is incorporated herein by reference for all purposes. However, it should be noted that embodiments of the present invention are not limited to the foregoing and may be implemented in a wide variety of architectures.
- As will be understood, latency is key in congestion control algorithms because there are flow control loops in which congestion information must be generated and sent back to sources which then react by changing the source scheduling of frames. It should be noted that the terms “frame” and “packet” are used interchangeably herein. Such algorithms do not work if the loop time is too long. Being able to implement congestion control in an ultra-low latency switch is therefore highly beneficial. As will be described, embodiments of the invention provide a congestion management architecture that may be implemented in such devices and in multi-chip architectures based on such devices.
- A specific embodiment of the invention will now be described with reference to
FIG. 1, which shows a portion of a shared-memory Ethernet switch 100. It should be noted that the diagram has been simplified to better illustrate important aspects of the invention. For example, only one ingress port and one egress port are shown in FIG. 1. However, it will be understood that such a switch typically has many ports, e.g., 16, 24, or 36, each of which will have associated instances of at least some of the circuitry shown in FIG. 1. Therefore, the scope of the invention should not be limited with reference to such simplifications. - Referring now to
FIG. 1, Ethernet Port Logic (EPL) 102 receives an ingress frame which is classified by frame classifier 104 to determine how the frame will be treated in the switch, e.g., quality of service (QoS) and destination port. Different classes of traffic might include, for example, storage traffic, inter-processor communication traffic, LAN traffic, etc. As will be discussed, several congestion management mechanisms are affected by the classification. Congestion control 108 implements a policer which limits bandwidth by dropping or marking frames which exceed configured rate thresholds for the particular traffic class. Congestion control 108 also implements a rate limiter which handles bandwidth throttling, causing input ports to be paused if they exceed certain rate thresholds, e.g., using Ethernet “pause” and “pause off” frames. - Conventional packet discard approaches to congestion management are inappropriate in many applications, e.g., data centers, because it takes too long using the TCP/IP protocol to retransmit discarded packets. Therefore, according to various embodiments, the rate limiter included in
congestion control 108 implements a pause-pacing function which enables “lossless” rate limiting for some classes of traffic. That is, for such classes of traffic, frame transfer is generally paused rather than allowing the frame to enter the port and then discarding it. Thus, congestion control 108 integrates two different functions, i.e., it can cause discard of forward going packets, thereby decreasing the ingress rate through the policing function (e.g., red, yellow, green marking), and it can facilitate lossless link-level flow control, i.e., a pause pacing function in the backward direction. According to a specific embodiment and as will be described, a modified token bucket is employed by the rate limiter to measure the input rate and then translate that into the link-level pause-pacing function to the input. This includes class-based pauses in which pausing can be done on a link for specific classes of traffic. Congestion control 108 looks at the ingress rate as defined by the token buckets and uses it to either police the flows (i.e., mark frames red, yellow, or green), or to rate limit the flows via a pause pacing function. In addition, according to some embodiments, congestion control 108 also interprets multi-hop congestion notification messages, which enables it to replace or “proxy” similar functionality in a network interface card (NIC) to which it is linked, i.e., if there isn't logic in the NIC capable of facilitating a rate limiting function, or the NIC does it inefficiently, the pause pacing function may be introduced in the switch as a proxy. This enables the implementation of such functionalities with legacy NICs. - According to a specific embodiment illustrated in
FIG. 1A, a switch architecture includes a shared memory 152, a scheduler 154, and a frame processing pipeline 156 as described in U.S. patent application Ser. No. 11/208,451 incorporated herein by reference above. According to this embodiment, a packet is streamed into shared memory 152 without the possibility of blocking through a system of crossbars frame processing pipeline 156. Scheduler 154 allocates pointers to memory and associates them with port logic as the packets are coming in. A status channel 162 goes from scheduler 154 to frame processing pipeline 156, and communicates the status of each segment of memory as it is being allocated to each port. Frame processing pipeline 156 maintains state on what ingress port each packet is arriving, and the egress port or ports to which the packet is directed. Such an architecture enables the communication of memory allocation information from the scheduler to the frame processing pipeline with extremely low latency, i.e., for each memory segment rather than each packet or frame which might include, for example, dozens of segments. As congestion management policies are based on the status of what memory is actually allocated in the system, and because such an architecture enables updating the status of memory allocation on a segment-by-segment basis rather than a packet-by-packet basis, flow control, i.e., the implementation of congestion management policies, may be effected and enforced much more quickly and richly than conventional approaches allow. That is, the very low latency information transfer between the switch element datapath and the frame processing pipeline is leveraged to enable rapid flow control responses within a chip and, according to some embodiments, in a multi-chip fabric, i.e., the latency of flow control loops in which one chip can communicate congestion information to upstream chips in the fabric is greatly reduced. - Referring once again to
FIG. 1, frames stored in shared memory 110 are retrieved for transmission by scheduler 112 which is followed by another rate limiting mechanism in egress shaper 114. The egress shaper 114 uses the output of classifier 104 together with the mapping table 116 to determine the bandwidth allocated to a particular bandwidth sharing group. According to a specific embodiment, egress shaper 114 performs this function with reference to bandwidth sharing groups (discussed below) to which the various traffic classes are mapped by mapping function 116. Frames exceeding their QoS rates are marked by the policer in congestion control 108 with reference to the configuration stored in the policer. - A set of counters and “watermarks” monitor how
frame memory 110 is used. The counters and watermarks are used for a variety of purposes, including: to enable packet discard, i.e., the policing function which results in the dropping of packets because queues are full; to enable pause frame generation, i.e., link level flow control which uses a pause frame to tell the immediately upstream link partner to stop sending packets on a particular link; and to enable congestion notification frame generation, i.e., frames indicating congestion which can potentially traverse multiple hops to any upstream port in a multi-chip congestion domain. Two different modes of congestion notification are described below. The first is a uni-cast approach in which egress frames are statistically sampled and, when an egress port is found to be congested, the source and destination addresses of the frame are switched in a congestion notification frame which is then transmitted upstream to the source of the congestion. The source then interprets that information to slow down the corresponding flow (see the description of SCN and BCN below). The second is a multi-cast approach in which the congestion notification message is sent back to all input ports (see the description of VCN below). In both cases, a layer 2 address tells the frame where to go, and it's tagged so that when it gets to its destination, a compliant device can filter and interpret it properly. According to specific embodiments, and as discussed herein, these features enable policy enforcement with regard to memory usage for different traffic classes. When certain thresholds defined by some of these watermarks are exceeded, the policing and rate limiting functions of congestion control 106 are enforced. In addition, exceeding some of the watermarks may be reflected in the CM state generated by CM block 118 which may result in generation of congestion notification frames by congestion notification block 120.
These congestion notification frames are sent to link partners, e.g., neighboring switches in the switch fabric, i.e., from which the frames exceeding the threshold were transmitted, for use in determining rate adjustments (e.g., by rate adjustment block 122) to be applied by the local rate limiting function (e.g., rate limiter 108). - According to a specific embodiment,
frame memory 110 is implemented with multiple shared memory partitions 124 which enable mapping of different traffic classes into different partitions, and the application of sets of watermarks accordingly. - The combination of multiple shared memory partitions, the implementation of the egress scheduler, and the use of class-based pause enables end-to-end partitioning of traffic in multiple virtual congestion domains which, in turn, enables the application of independent congestion management policies for different classes of traffic. This, in turn, enables a switch fabric in which frames in different partitions do not interfere with each other on the ingress ports, in the shared memory, or on the egress ports. For example, policies can be implemented in which LAN traffic can be allowed to be lossy (i.e., dropped frames permitted), but storage traffic, which cannot tolerate dropped frames and is latency-sensitive, can be handled in a lossless manner, and each type of traffic can be sub-divided into different priorities irrespective of the other type of traffic.
- As mentioned above and according to a specific embodiment, a rate limiter is provided which employs a token bucket to measure input rates and then translate those rates into a pause pacing function to the input using “pause” and “pause-off” frames, e.g., as defined by the IEEE Ethernet specification. This may be applied to a link as a whole or for specific classes of traffic on a link. The combination of these two features in the rate limiter enables “pause with rate control.” In addition to the rate limiting function, a congestion control algorithm is enabled to adjust the rate at which tokens are added to the token bucket.
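The “pause with rate control” behavior can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions: the class name `PauseBucket`, the bucket `capacity` cap, and the `pause_off_level` threshold are hypothetical names and parameters chosen for illustration, and tokens are assumed to correspond to bytes. The bucket requests a pause when its token count reaches or drops below zero, and requests a pause-off once refilling brings the count back up to a threshold above zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PauseBucket:
    """Token bucket that yields pause / pause-off decisions (illustrative only)."""
    rate: float              # tokens (assumed bytes) added per second
    capacity: float          # assumed upper bound on stored tokens
    pause_off_level: float   # hypothetical threshold above zero for pause-off
    tokens: float = 0.0
    paused: bool = False

    def refill(self, elapsed_s: float) -> Optional[str]:
        """Add tokens for the elapsed time; report 'pause-off' when the sender may resume."""
        self.tokens = min(self.capacity, self.tokens + self.rate * elapsed_s)
        if self.paused and self.tokens >= self.pause_off_level:
            self.paused = False
            return "pause-off"
        return None

    def on_frame(self, length_bytes: int) -> Optional[str]:
        """Charge a received frame; report 'pause' when the bucket is exhausted."""
        self.tokens -= length_bytes
        if not self.paused and self.tokens <= 0:
            self.paused = True
            return "pause"
        return None
```

One such bucket could be kept for the link as a whole and, where class-based pause is wanted, one per traffic class; the gap between zero and `pause_off_level` is what provides hysteresis.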
- The operation of a specific implementation will now be described with reference to
FIG. 2. Ingress frames received by Ethernet port 102 are classified in one of a plurality of traffic classes, i.e., by classifier 104. Rate meter 202 in congestion control 108 monitors the traffic rates for the respective classes and provides its output to both policer 204 and rate limiter 206. Policer 204 uses the information provided by rate meter 202 to implement the policing function described above. Rate limiter 206 uses the information provided by rate meter 202 in conjunction with congestion notification information from other downstream switches in the congestion domain to implement the pausing function described above. That is, when traffic rates are exceeded by some classes of traffic, rate limiter 206 introduces pause frames into the upstream datapath which are communicated to the upstream link partner, e.g., represented by Ethernet port 208. Port 208 may be inside or outside of a congestion domain which may be defined by a multi-chip switch fabric such as, for example, a Clos architecture or spanning tree. - According to a specific embodiment,
rate limiter 206 implements two different forms of link level, lossless rate limiting, one based on configured link level rates, and the other based on congestion notification messages at the congestion domain level. That is, rate limiter 206 allows one to specify a fixed desired link level rate thus creating a local loop which enables local rate limiting. By comparison, the congestion notification information received by congestion control block 108 enables end-to-end or multi-hop congestion control in the congestion domain. According to specific embodiments, the congestion notification information is derived from congestion notification frames indicating congestion in downstream switches in the fabric which is determined to have resulted from frames originating from the switches to which the congestion notification frames are sent. It should be noted that these frames may be generated according to any of a wide variety of public or proprietary congestion notification algorithms. - Thus, according to specific embodiments of the invention, congestion notification messages may also be employed to enable link-level pause at the ingress boundary of a single switch or multi-hop fabric. And by spreading congestion from a congestion point to the periphery of a switch fabric, the amount of head-of-the-line blocking is greatly reduced even if the ultimate source and sink of data frames are not included in the congestion control domain. It should be noted that implementation of such an approach outside of the switch fabric, e.g., in a network interface controller (NIC), is difficult in that there might be thousands of simultaneous flows which would need to be monitored and this is extremely expensive to implement in silicon. By contrast, and according to various embodiments of the invention, the classification of
layer 2 traffic at the edges of the switch fabric followed by the monitoring of traffic rates at that level of granularity enables an optimization which, while accepting some amount of head-of-the-line blocking, does not require devices outside of the congestion domain defined by the switch fabric to implement any corresponding algorithms. As mentioned above, this enables the use of legacy NICs. - It should also be noted that the techniques described herein may be implemented in conjunction or in parallel with a variety of conventional approaches to congestion management. For example, pause frames might be independently generated and transmitted to the link partners for one or more ports when the shared memory becomes full (not shown).
- According to a specific embodiment, “lossless” rate control is implemented in
congestion control 108 using one or more token buckets, e.g., one for the link as a whole, and/or one for each class of traffic. According to one embodiment, the token buckets are implemented as part of rate meter 202. Tokens are added to each bucket at a specified rate. Each time a frame is received, some number of tokens corresponding to the length of the frame (e.g., number of bytes) are removed from the appropriate bucket(s). When the number of tokens in a bucket reaches or drops below zero, the pause function for the link or the specific class is enabled, e.g., a pause frame is sent to the upstream link partner. Depending on the bucket, the pause frame sent may be for the entire link or just for a particular class of traffic on that link, i.e., class-based pause. When the number of tokens in a bucket reaches some threshold above zero, a pause-off frame is sent to the link partner. The level of the pause-off threshold for each bucket introduces hysteresis and may be empirically determined as a balance between jitter and consumption of bandwidth by pause function frames. - According to a specific embodiment, the rate at which tokens are introduced into the token bucket(s) associated with
congestion control 108 is adjusted in response to the output of rate meter 202 and congestion notification information derived from frames received from downstream link partners. These congestion notification frames may include information such as, for example, the level of the downstream congestion, whether the congestion is increasing or decreasing, etc. According to one set of embodiments, for rate decreases, rate adjustment 210 decreases the token rate(s) relative to the actual traffic rate(s) measured by rate meter 202 which, according to a specific embodiment, employs exponentially weighted moving averages to measure traffic rates. According to another set of embodiments, rate adjustment 210 may filter out multiple congestion notification messages that come from downstream link partners and arrive more frequently than once every minimum round trip delay of the network, thus preventing over-constriction of any particular source of congestion. - By using the actual traffic rate(s) measured by
rate meter 202, the convergence time associated with the rate limiting algorithm of the described embodiment is greatly reduced in comparison with conventional rate limiters. That is, rate limiting algorithms typically employ a multiplicative decrease (or an additive increase) to converge to a new rate. According to a specific embodiment of the present invention, a metering function is implemented in which the multiplicative decrease starts from the current rate (in a time averaging sense) rather than from the predefined (and often high) line rate (as with conventional algorithms). Such an approach converges much more quickly than conventional approaches. - By contrast, and according to specific embodiments of the invention, rate increases are generated with respect to a previously stored acceptable rate in order to ensure a fast recovery to the full rate. That is, if the measured rate is used for rate increases, the new measured rate would be a function of the previous measured rate. This time dependency would then slow down the recovery.
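The asymmetry between decreases and increases can be sketched in Python as follows. This is a hedged illustration, not the rate adjustment logic of the embodiment: the class name `RateAdjuster`, the smoothing weight `alpha`, the `decrease_factor`, and the additive `increase_step` are all assumptions introduced here. The only properties taken from the description are that the measured rate is an exponentially weighted moving average, that decreases are computed from that measured rate, and that increases are computed from a stored rate rather than from a new measurement.

```python
class RateAdjuster:
    """Illustrative token-rate adjustment; parameter names and values are assumptions."""

    def __init__(self, line_rate, alpha=0.25, decrease_factor=0.5, increase_step=100.0):
        self.line_rate = line_rate
        self.alpha = alpha                      # EWMA smoothing weight (assumed)
        self.decrease_factor = decrease_factor  # multiplicative decrease (assumed)
        self.increase_step = increase_step      # additive increase (assumed)
        self.measured_rate = 0.0                # EWMA of observed traffic rate
        self.token_rate = line_rate             # stored rate currently applied to the bucket

    def observe(self, sample_rate):
        """Fold a new rate sample into the exponentially weighted moving average."""
        self.measured_rate += self.alpha * (sample_rate - self.measured_rate)
        return self.measured_rate

    def decrease(self):
        """Decrease starting from the measured rate rather than from the line rate."""
        self.token_rate = min(self.token_rate, self.decrease_factor * self.measured_rate)
        return self.token_rate

    def increase(self):
        """Increase relative to the stored rate, independent of the measured rate."""
        self.token_rate = min(self.line_rate, self.token_rate + self.increase_step)
        return self.token_rate
```

Starting the decrease from the measured rate means a source already sending well below line rate is constrained after a single step instead of after several halvings of the line rate.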
- Referring once again to
FIG. 1, frame memory 110 includes multiple shared memory partitions 124. Every ingress frame is mapped based on its traffic class to one of memory partitions 124. Congestion management block 118 monitors multiple private counters (associated with frame memory 110) for each partition 124 (i.e., at least one for each port) which count the frames stored in that partition from each of the corresponding ports. This is represented by private memory blocks 126. -
Congestion management block 118 also monitors an aggregate receive (rx) counter (associated with frame memory 110) for each partition 124 which counts frames from all of the ports, but only when the watermarks associated with one or more of the private receive counters are exceeded. That is, specific frames are not registered by the aggregate counter as using memory beyond the private memory allocated to the corresponding port unless the watermark for that port has been exceeded. This is represented by shared memory block 128. The congestion management block 118 also keeps track of per transmission port per traffic class memory usage in transmit/traffic class (tx/tc) counters. When an aggregate receive counter exceeds its shared memory watermark, action is taken only for the ports that have private counters above their respective private memory watermarks. By tracking usage of the different memory partitions in this way, congestion management policies may be implemented independently for the different traffic classes on a per port basis. - According to specific embodiments, each counter associated with a
partition 124 has multiple watermarks. Depending on the set of watermarks exceeded, an incoming frame will be assigned to some priority, dropped, or marked. If the frame is not dropped, the level of service provided to the frame depends on the assigned priority. - According to a specific embodiment, these watermarks are used to facilitate pause, congestion notification, and packet discard. According to this embodiment, there are three types of watermarks: “per port” watermarks (rx or tx); “per port private” watermarks (which change how the shared memory is interpreted); and “global” watermarks (which span multiple ports and have actions on multiple ports). For example, an rx pause watermark and an rx hog watermark are both per port watermarks, the first of which results in pause frame being sent back out that link when that port is using more memory than it's allowed to, and the second of which results in discarding a packet when that port is using more memory than it's allowed to. Similarly, a tx hog watermark is a per port watermark which will drop a packet based on a tx port being full. A tx congestion notification watermark is a per port watermark which results in a congestion frame being sent back to the source address. A sum over all ports of the memory usage represented by the “per port” watermarks can be much greater than total memory. By contrast, the sum of the memory usage represented by the “per port private” watermarks must be less than the total memory, i.e., shared memory is the remaining portion of the total memory. In the case of rx ports, the private memory is there to minimize head-of-the-line blocking for pause. That is, if a pause is executed in response to a global watermark, instead of pausing all input ports, only ports exceeding their private watermarks will be paused, as those not exceeding theirs aren't actually contributing to congestion. 
According to some embodiments, there may also be private watermarks per priority or class for rx and tx to avoid starvation of lower priority classes of traffic.
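The selective pause described above can be sketched as a small Python function. The function and argument names are hypothetical, and memory usage is assumed to be counted in segments; the sketch only shows that a global pause watermark being exceeded pauses the ports that are above their own private watermarks and leaves the others alone.

```python
def ports_to_pause(shared_used, global_pause_wm, per_port_used, private_wm):
    """Return the rx ports to pause when the global pause watermark is exceeded.

    shared_used     -- segments in use in the shared portion of the partition
    global_pause_wm -- global pause watermark (hypothetical name)
    per_port_used   -- dict: port -> segments in use by that port
    private_wm      -- dict: port -> private watermark for that port
    """
    if shared_used <= global_pause_wm:
        return []
    # Ports still within their private allocation are not contributing
    # to the shared-memory congestion, so they are left unpaused.
    return sorted(p for p, used in per_port_used.items() if used > private_wm[p])
```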
- According to one embodiment, each port's receive counter has an rx watermark, and each port's tx/tc counter has a tx/tc watermark, the aggregate of the tx/tc counters is compared with the tx watermark. That is, for each shared memory partition, there is an aggregate rx counter, per port rx counters, and per port per traffic class tx counters. The purpose of the rx watermark is to prevent excessive usage of the shared memory partition by traffic from the corresponding port. When the rx watermark is exceeded, the frame is either dropped or paused depending, for example, on the traffic class. The ability to pause on a per port basis is advantageous in that, if a port is not contributing to congestion, it is undesirable to pause it. According to a specific embodiment, the pause is implemented similarly to the pause function associated with the rate limiter described above, e.g., generation of an Ethernet pause frame.
- The purpose of the tx and tx/tc watermarks is to prevent congestion of the corresponding port by frames transmitted out of the shared memory. When these watermarks are exceeded, frames are dropped. Having both the rx and tx watermarks active allows the transmission of frames between any pair of ports that are not congested independent of congestion conditions for other ports. The tx watermark is compared against the aggregate of the tx/tc counters, and is used for applications in which it is not important to distinguish between the traffic classes and hence we do not need to reserve memory per traffic class.
- According to some embodiments, the watermarks may be configured as appropriate for a particular application. That is, watermark levels may be adjusted or removed entirely in different combinations depending on the particular implementation. For example, if certain ports require no memory reservations per tx or per tx/tc, the tx and tx/tc watermarks may be turned off. Or, if class-based tx memory reservation was not needed, the tx/tc watermark may be turned off. This allows the system designer to only allocate private memory in the memory partitions as needed.
- According to a specific embodiment, the drop condition of a frame is that rx private, tx private and tx/tc private allocations must all be exceeded before a frame is eligible for being dropped. This ensures that the private memory is reserved for each rx, tx, tx/tc. Also this means that the total memory used in the system is the sum of the rx private, rx shared, max (tx private, sum(tx/tc private)) which the user should ensure does not exceed the total memory of the switch.
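A short Python sketch may make the drop condition and the memory budget concrete. It rests on assumptions that are not stated in the description: each quantity is treated as a system-wide total in memory segments, and the function names are hypothetical.

```python
def drop_eligible(rx_over_private, tx_over_private, txtc_over_private):
    """A frame is eligible for dropping only if all three private allocations are exceeded."""
    return rx_over_private and tx_over_private and txtc_over_private


def memory_budget(rx_private, rx_shared, tx_private, txtc_private):
    """Total memory implied by a configuration, in segments (assumed unit).

    txtc_private is a list of per-traffic-class private allocations.
    """
    return rx_private + rx_shared + max(tx_private, sum(txtc_private))
```

A configuration check would then compare `memory_budget(...)` against the total memory of the switch and reject settings that exceed it.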
- As mentioned above,
congestion notification block 120 generates congestion notification frames in response to the CM state generated by congestion management block 118. CM block 118 generates the CM state with reference to the tx and tx/tc watermarks, i.e., the indicators of congestion at local egress ports. - Referring once again to
FIG. 1 and according to some embodiments, the pause capability on ingress ports described above may be further enhanced if egress scheduler 112 also has a pause capability and, in particular, supports class-based pause. According to one such embodiment, the tx/tc watermark triggers the class-based pause frame generation. The egress scheduler 112 is the block that determines when frames are transmitted. When a pause frame is received by a switch, the egress scheduler stops the traffic going out on the corresponding port. The combination of class-based pause, shared memory partitions, and bandwidth sharing groups in the egress scheduler enables a converged fabric in which best-in-class congestion management disciplines may be implemented such that the various different traffic types which are converged in the fabric don't get in each other's way. - According to specific embodiments, a plurality of counters are employed in conjunction with a plurality of ingress watermarks and a plurality of egress watermarks to monitor and control memory usage by the various ports and traffic classes. Each memory partition has an aggregate ingress counter which tracks the number of segments of the memory consumed by that partition. Each memory partition also has a private ingress counter for each port which tracks the number of memory segments consumed by that port.
- A private ingress watermark associated with each private ingress counter defines the private memory allocated to the corresponding port within the memory partition. When an ingress port's private ingress counter is below this watermark, the port is not subject to memory usage based pausing or dropping for that memory partition. A “hog” ingress watermark is also associated with each private ingress counter which prevents the corresponding port from consuming too much memory. If a received frame will result in the hog ingress watermark being exceeded, the frame is dropped only if the corresponding private ingress watermark is also exceeded.
- A global ingress watermark associated with the aggregate ingress counter defines the total number of segments over all ports allocated to the corresponding memory partition (not including the private memory allocations for each port). Thus, the total memory usage for a memory partition over all ports will not be allowed to exceed this watermark and the sum of the private ingress watermarks for that partition. If a received frame will result in the global ingress watermark being exceeded, the frame is dropped only if the private ingress watermark for the port on which the frame was received is also exceeded.
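The interaction of the private, hog, and global ingress watermarks can be sketched in Python for a single memory partition. This is an illustrative model only: the class name `Partition`, the use of one private and one hog watermark shared by all ports, and the treatment of the aggregate counter as counting only segments beyond each port's private allocation are assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Ingress accounting for one memory partition (illustrative only)."""
    private_wm: int   # private allocation per port, in segments
    hog_wm: int       # per-port upper bound, in segments
    global_wm: int    # shared pool size, excluding private allocations
    per_port: dict = field(default_factory=dict)  # port -> segments in use
    shared: int = 0   # segments in use beyond the private allocations

    def admit(self, port, segments):
        """Return True and update counters if the frame is accepted, else False (drop)."""
        used = self.per_port.get(port, 0)
        new_used = used + segments
        # Part of this frame that spills past the port's private allocation.
        spill = max(0, new_used - self.private_wm) - max(0, used - self.private_wm)
        over_private = new_used > self.private_wm
        over_hog = new_used > self.hog_wm
        over_global = self.shared + spill > self.global_wm
        # Hog and global watermarks only cause a drop when the private
        # watermark for the receiving port is also exceeded.
        if over_private and (over_hog or over_global):
            return False
        self.per_port[port] = new_used
        self.shared += spill
        return True
```

For lossless traffic classes the same comparison could trigger a pause instead of a drop; the sketch shows only the drop decision.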
- According to a specific embodiment, a set of pause watermarks is provided relating to global memory usage and another set relating to private memory usage. These watermarks are used by congestion management circuitry to generate pause “on” and pause “off” frames on a per port and/or a per traffic class basis.
- According to specific embodiments, each memory partition also has a private egress counter for each port which tracks the number of segments currently in the memory partition intended to be transmitted out on that port. A private egress watermark is associated with each private egress counter for each traffic class which represents the amount of memory allocated for that traffic class. Multiple “hog” egress watermarks are also associated with each port to prevent a single port from consuming too much memory. The different hog egress watermarks correspond to the different traffic priorities.
- As mentioned above,
mapping function 116 maps traffic classes identified by frame classifier 104 into bandwidth sharing groups among which the egress bandwidth is allocated. For example, 8 traffic classes might be mapped into two bandwidth sharing groups, each having 4 of the classes and each of which is allocated 50% of the egress bandwidth. That is, each group of 4 classes could only consume 50% of the available egress bandwidth. This could be effected, for example, using a deficit-weighted-round-robin algorithm to schedule traffic as between bandwidth groups (assuming the groups have equal priority). However, within each group there is a strict prioritization according to traffic class such that a higher class within a group could potentially starve out lower priority traffic. - According to a specific embodiment, bandwidth sharing groups may also be prioritized with respect to each other. This could enable, for example, creation of a strict high priority bandwidth group which could starve all lower priority groups, and/or a strict low priority bandwidth group which could only consume bandwidth if none of the other bandwidth groups have traffic to send. Examples of bandwidth sharing groups which might be important in a typical application include inter-processor traffic, LAN traffic, storage traffic, and web traffic.
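The equal-priority case can be sketched in Python as a deficit round robin across bandwidth sharing groups with strict priority inside each group. The names `Group` and `schedule`, the `quantum` parameter, and the representation of a frame by its length in bytes are assumptions for illustration; prioritization between groups is not modeled.

```python
from collections import deque


class Group:
    """One bandwidth sharing group; class index 0 is the highest priority (assumed)."""

    def __init__(self, quantum, num_classes):
        self.quantum = quantum    # bytes credited per round; sets the bandwidth share
        self.deficit = 0
        self.queues = [deque() for _ in range(num_classes)]

    def head_class(self):
        """Strict priority: the first non-empty class queue wins."""
        for cls, q in enumerate(self.queues):
            if q:
                return cls
        return None


def schedule(groups, rounds):
    """Run deficit round robin over the groups; return (group, class, length) per frame sent."""
    sent = []
    for _ in range(rounds):
        for gid, g in enumerate(groups):
            cls = g.head_class()
            if cls is None:
                g.deficit = 0      # idle groups do not accumulate credit
                continue
            g.deficit += g.quantum
            while cls is not None and g.queues[cls][0] <= g.deficit:
                length = g.queues[cls].popleft()
                g.deficit -= length
                sent.append((gid, cls, length))
                cls = g.head_class()
            if cls is None:
                g.deficit = 0
    return sent
```

With equal quanta each backlogged group receives roughly half of the egress bandwidth, while a lower class within a group is served only when the higher classes in that group are empty.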
- Embodiments of the invention have been described which implement a combination of class-based pause, shared memory partitions, and an egress scheduling algorithm which allows bandwidth groups and priorities within the bandwidth groups. This combination enables virtual switching from a congestion management perspective, thus enabling a new class of performance for an Ethernet switch or any other protocol used to implement a converged fabric. That is, virtual domains are enabled for independent treatment of different types of traffic all the way through the switch, and therefore all the way through a multi-chip fabric based on such switches. Thus, the operation of a single switch or multi-chip fabric may simultaneously reflect the radically different best practices recommended by various industry segments for their different types of traffic.
- In addition, embodiments including an upper-bound limitation for specific classes of traffic (e.g., enabled using hog and/or shared memory watermarks, or by limiting bandwidth usage with an egress shaping mechanism) facilitate desirable functionalities in systems having different types of traffic, e.g., converged fabrics. For example, in systems which carry storage traffic there are almost always large frames being transferred as a result of long backup operations. If there is no limitation on this type of relatively low priority traffic, it could interfere with higher priority traffic, e.g., inter-processor traffic, and defeat high-speed features such as “cut-through” in which frames are passed through a switch without being stored in frame memory. The effect of the upper-bound limitation is to pause frames of a specific class of traffic when the upper-bound for that class has been reached, regardless of whether there are any frames currently in the switch. This, in turn, reduces the statistical likelihood that a high priority frame will be delayed by the presence of a low priority, but large frame which preceded it into the switch. That is, implementing such a “non-work-preserving” scheduler reduces the probability that there will be packets on the line ahead of a packet and thereby improve the overall performance with regard to latency-sensitive traffic.
- According to a particular embodiment, multiple shared-memory switches designed according to the invention implement a “stateless” congestion notification (SCN) scheme in a multi-chip fabric. This approach includes elements similar to conventional backward congestion notification (BCN) schemes except that the rate limiters in upstream switches toggle between 0%, i.e., pause, and 100%, i.e., pause off, i.e., go to 0% when a congestion notification message is received from a downstream congestion point, and back to 100% automatically after a random period of time. According to a specific embodiment, the random period is a function of the level of congestion and the randomness is intended to reduce oscillations in sender rates due to synchronized reception of congestion notification frames. According to a specific embodiment, this pause is effected by removing some number of tokens from the token bucket on which the rate limiter is based. That is, the number of tokens in the bucket is set to a negative number such that the specified period of time is required to bring the number of tokens high enough to generate a pause-off frame.
- The operation of an SCN system in the contextual example of a proprietary tag switched network 300 may be understood with reference to the flow diagram of FIG. 3. It should be noted that such a system may be implemented in a wide variety of networks and that the proprietary tag switched network is merely one example. Frames are sampled (302) when congestion is detected in an egress queue 304 in a switch in the network. According to a specific embodiment, random sampling is used, thus obviating the need for flow state storage. Based on the sampled frame 306, a congestion notification frame 308 is generated (310) and transmitted back through network 300 to an upstream flow control 312 associated with the corresponding flow, e.g., the congestion management block in a remote switch from which the frame was received. Flow control 312 then pauses the input of the corresponding switch for a random amount of time depending on the level of congestion indicated in the congestion notification frame. Even though on a packet-by-packet basis one does not know which flow is genuinely causing the congestion, it is a statistical property of this approach that the sources of congestion will ultimately be adequately flow controlled. According to one approach, a random value is picked from an interval which is a function of the level of congestion, i.e., if there is twice the congestion, the random time will be somewhere between R_MIN and 2*R_MAX, where R_MAX is the maximum random value for half the congestion. If additional congestion notification messages are received from the same congestion point, the specified time period during which the upstream switches are paused may be automatically extended, e.g., an exponential back off algorithm may be applied to the negative number of tokens in the bucket. As will be understood, the foregoing approach allows multi-hop congestion management in that the congestion notification messages can propagate from the congestion point out to the edge of the fabric. However, it is also possible that, while accounting for multiple messages in order to calculate the exponential timer, one also filters out multiple messages received within a minimum round trip delay of the network to prevent overstimulation based on packets in flight.
- Depending on the particular implementation, there may be several advantages of this approach as compared to BCN. For example, with SCN, rate-limited packets do not need to be tagged to indicate to the downstream congestion point that rate limiting is still in effect upstream. In addition, SCN enables a quick recovery from congestion in that it does not require several cycles of congestion notification frames to recover to full rate, i.e., when the random time period expires the source will resume at full rate.
- And unlike BCN solutions, SCN has the advantage of compatibility with proprietary tag switched networks. That is, BCN does not work if frames are modified as they are in such networks. However, according to this embodiment of the invention, congestion notification frames are sent back using the source identifier and will therefore work even if the frame experiences modifications in the network. Finally, because only the flows going through the congested queue will have their frames sampled and flow controlled, flows not contributing to the congestion are not impacted.
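The reaction to SCN congestion notification frames described above (a random pause drawn from an interval that scales with the congestion level, exponential back off for repeated messages from the same congestion point, and filtering of messages that arrive within a minimum round trip delay) can be sketched as below. All names, the doubling factor, and the linear scaling of the upper bound are assumptions made for illustration; the specification does not fix them.

```python
import random


class ScnReactionPoint:
    """Chooses pause durations for congestion notifications (hypothetical sketch)."""

    def __init__(self, r_min: float, r_max: float, min_rtt: float, rng=None):
        self.r_min = r_min        # lower end of the pause interval, in seconds
        self.r_max = r_max        # upper end of the interval at congestion level 1.0
        self.min_rtt = min_rtt    # messages closer together than this are ignored
        self.rng = rng or random.Random()
        self.accepted = {}        # congestion point id -> accepted message count
        self.last_seen = {}       # congestion point id -> time of last accepted message

    def on_notification(self, cp_id, level: float, now: float):
        """Return a pause duration in seconds, or None if the message is filtered."""
        last = self.last_seen.get(cp_id)
        if last is not None and now - last < self.min_rtt:
            # Likely triggered by frames already in flight; do not react again.
            return None
        self.last_seen[cp_id] = now
        repeats = self.accepted.get(cp_id, 0)
        self.accepted[cp_id] = repeats + 1
        # Upper bound scales with the level (level 2.0 gives 2 * r_max) and
        # doubles for each earlier accepted message from the same congestion point.
        # This sketch never resets the repeat count; a reset policy is not modeled.
        upper = max(self.r_min, level * self.r_max) * (2 ** repeats)
        return self.rng.uniform(self.r_min, upper)
```

The returned duration could then be handed to a rate limiter that pauses for that long. Because every reaction point draws its own random value, senders that received notifications at the same moment are unlikely to resume at the same moment.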
- According to specific embodiments, multiple switches designed in accordance with the present invention may be used to implement congestion notification in a virtual output queue (VOQ) fabric, referred to herein as virtual congestion notification or VCN. In an example of one embodiment, a fabric of Ethernet switches interconnects a plurality of line cards (e.g., telecom line cards) each having an on-board network processing unit (NPU) and per flow queuing. It should be noted that this is merely an example. In general, it is not required that each ingress port have an NPU, only that each ingress port have a classification function and a scheduling function (which may be implemented in an NPU). The classification function classifies the ingress flows by egress port and priority. The scheduling function can respond to a VCN message by flow controlling the particular queue that goes to that output port/priority, while the device continues to schedule the other queues.
- Returning to the example, let M be the number of ports on each line card, P be the number of ports in the overall system, and Q be the number of priorities. Assume for simplicity that Q is constant across all ports. Then each line card has M*(P−1)*Q flows, and the overall system has P²*Q flows. When an output port in the fabric is congested, a multicast congestion message is sent back to all the input ports. Because it is known which queue and priority are congested, the upstream switches can pause only the particular flow which is causing the congestion until the congestion is resolved. With this solution, there is no head-of-the-line blocking. According to specific embodiments, the ability of a shared memory switch to multicast these congestion notification frames at full rate is essential to the performance of the VCN scheme.
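As a worked example of the flow counts given above, the short sketch below evaluates both expressions for one set of values. The values M = 4, P = 64 and Q = 8 are chosen purely for illustration and do not appear in the specification, and the function names are hypothetical.

```python
def flows_per_line_card(m: int, p: int, q: int) -> int:
    """M ingress ports, each with a queue per other port and per priority."""
    return m * (p - 1) * q


def flows_in_system(p: int, q: int) -> int:
    """System-wide figure as stated in the text: P squared times Q."""
    return p * p * q


# Illustrative values only (assumed, not from the specification).
M, P, Q = 4, 64, 8
per_card = flows_per_line_card(M, P, Q)   # 4 * 63 * 8 = 2016
system = flows_in_system(P, Q)            # 64 * 64 * 8 = 32768
```

Each line card therefore stores state for about two thousand flows rather than for the tens of thousands in the whole system, which is the basis of the scalability argument made for reaction points.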
- The operation of an example of a VCN system implemented as an Ethernet-based, multicast, multi-hop, flow control algorithm for supporting VOQ fabrics may be understood with reference to the flow diagram of FIG. 4. Each queue in VOQ fabric 400, e.g., queue 402, has two associated watermarks referred to herein as Xon and Xoff (i.e., transmission on and transmission off). When one of these watermarks is crossed, a congestion notification frame is generated (404). The congestion notification frame 405 is an Ethernet frame with a configurable multicast address and encapsulates the queue level and an Xon/Xoff state that identifies which of the watermarks was crossed. The multicast address is configured per queue and allows only a known set of reaction points (e.g., 406) that use the queue to be targeted due to congestion at the queue. This has the advantage of limiting the bandwidth usage associated with congestion notification because a general broadcast of congestion notification frames is eliminated. The congestion notification frame is then sent to the set of reaction points. These reaction points may correspond, for example, to flow control blocks in upstream switches. According to this embodiment, though not required for VCN, the system does not need to statistically sample frames because the reaction points will only pause the particular flow which goes to the congested egress port and priority.
- The reaction points use the information in the congestion notification frame to reduce or enhance their respective flow control (e.g., 408) depending on the level of congestion. These reductions and/or enhancements may be implemented according to various embodiments of the invention described herein. For example, according to specific embodiments, only the flows that use the congested queue are paused or unpaused at the reaction points. Flows that do not use the congested queue remain unaffected.
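The Xon/Xoff mechanism described above can be sketched as a queue that emits a notification when a watermark is crossed and a reaction point that pauses only the matching flow. This is a minimal model under assumptions: the class names, the string form of the multicast address, and the requirement that Xon lie below Xoff (giving hysteresis) are hypothetical choices, not details taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CongestionNotification:
    multicast_addr: str   # per-queue multicast address (hypothetical string form)
    queue_level: int      # queue occupancy when the watermark was crossed
    xoff: bool            # True: Xoff crossed (pause); False: Xon crossed (resume)


class WatermarkedQueue:
    """Congestion point: tracks one queue's level against its two watermarks."""

    def __init__(self, multicast_addr: str, xon: int, xoff: int):
        if xon >= xoff:
            raise ValueError("this sketch assumes Xon lies below Xoff")
        self.multicast_addr = multicast_addr
        self.xon = xon
        self.xoff = xoff
        self.xoff_sent = False

    def update(self, level: int) -> Optional[CongestionNotification]:
        """Record a new occupancy; return a notification if a watermark was crossed."""
        if not self.xoff_sent and level >= self.xoff:
            self.xoff_sent = True
            return CongestionNotification(self.multicast_addr, level, True)
        if self.xoff_sent and level <= self.xon:
            self.xoff_sent = False
            return CongestionNotification(self.multicast_addr, level, False)
        return None


class ReactionPoint:
    """Upstream flow control: pauses only flows that use the congested queue."""

    def __init__(self):
        self.flows = {}       # multicast address -> (egress_port, priority)
        self.paused = set()   # flows currently paused

    def subscribe(self, multicast_addr: str, egress_port: int, priority: int) -> None:
        self.flows[multicast_addr] = (egress_port, priority)

    def on_notification(self, note: CongestionNotification) -> None:
        flow = self.flows.get(note.multicast_addr)
        if flow is None:
            return            # this reaction point does not use that queue
        if note.xoff:
            self.paused.add(flow)
        else:
            self.paused.discard(flow)

    def may_send(self, egress_port: int, priority: int) -> bool:
        return (egress_port, priority) not in self.paused
```

In use, a queue configured with `xon=10` and `xoff=40` produces an Xoff notification when its level reaches 45 and an Xon notification once it drains to 8. A reaction point subscribed to that queue's address stops sending to the affected egress port and priority in between, while its other flows are untouched.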
- According to specific embodiments, reaction points are implemented with large buffer capacities to prevent queue buildup in the network, and thus lead to lower overall latency in the network. And storing per flow state information at reaction points is feasible in that this information need only be locally stored for the flows using the associated ingress ports. That is, because reaction points do not need to “know” the entire flow state of the network, the VCN approach described herein provides a scalable flow state storage solution.
- While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. For example, to promote understanding, embodiments have been described herein in which various functionalities have been described as being logically distinct from other functionalities. It will be understood, however, that such functionalities may be logically grouped or integrated in a variety of ways without departing from the scope of the invention.
- Moreover, the functionalities described herein may be implemented in a wide variety of contexts using a wide variety of technologies without departing from the scope of the invention. That is, embodiments of the invention may be implemented in processes and circuits which, in turn, may be represented (without limitation) in software (object code or machine code), in varying stages of compilation, as one or more netlists, in a simulation language, in a hardware description language, by a set of semiconductor processing masks, and as partially or completely realized semiconductor devices. The various alternatives for each of the foregoing as understood by those of skill in the art are also within the scope of the invention. For example, the various types of computer-readable media, software languages (e.g., Verilog, VHDL), simulatable representations (e.g., SPICE netlist), semiconductor processes (e.g., CMOS, GaAs, SiGe, etc.), and device types (e.g., packet switches) suitable for designing and manufacturing the processes and circuits described herein are within the scope of the invention.
- Embodiments of the invention are described herein with reference to switching devices, and specifically with reference to packet or frame switching devices. According to such embodiments and as described above, some or all of the functionalities described may be implemented in the hardware of highly-integrated semiconductor devices, e.g., 1-Gigabit and 10-Gigabit Ethernet switches, IP routers, DSL aggregators, switch fabric interface chips, and similar devices.
- Finally, although various advantages, aspects, and objects of the present invention have been discussed herein with reference to various embodiments, it will be understood that the scope of the invention should not be limited by reference to such advantages, aspects, and objects. Rather, the scope of the invention should be determined with reference to the appended claims.
Claims (29)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/737,511 US7916718B2 (en) | 2007-04-19 | 2007-04-19 | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
PCT/US2008/059668 WO2008130841A1 (en) | 2007-04-19 | 2008-04-08 | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
US12/986,031 US8467342B2 (en) | 2007-04-19 | 2011-01-06 | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/737,511 US7916718B2 (en) | 2007-04-19 | 2007-04-19 | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/986,031 Continuation US8467342B2 (en) | 2007-04-19 | 2011-01-06 | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
Publications (2)
Publication Number | Publication Date |
---|---|
US20080259798A1 (en) | 2008-10-23 |
US7916718B2 (en) | 2011-03-29 |
Family ID: 39872060
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/737,511 Active 2028-08-22 US7916718B2 (en) | 2007-04-19 | 2007-04-19 | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
US12/986,031 Active 2027-12-16 US8467342B2 (en) | 2007-04-19 | 2011-01-06 | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
Country Status (2)
Country | Link |
---|---|
US (2) | US7916718B2 (en) |
WO (1) | WO2008130841A1 (en) |
Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US46496A (en) * | 1865-02-21 | 1865-02-21 | photo-l | |
US136229A (en) * | 1873-02-25 | Improvement in revivifying charcoal used in rectifying spirits | ||
US5864539A (en) * | 1996-05-06 | 1999-01-26 | Bay Networks, Inc. | Method and apparatus for a rate-based congestion control in a shared memory switch |
US5894481A (en) * | 1996-09-11 | 1999-04-13 | Mcdata Corporation | Fiber channel switch employing distributed queuing |
US6009078A (en) * | 1996-02-09 | 1999-12-28 | Nec Corporation | ATM switch device capable of favorably controlling traffic congestion |
US6160813A (en) * | 1997-03-21 | 2000-12-12 | Brocade Communications Systems, Inc. | Fibre channel switching system and method |
US6289021B1 (en) * | 1997-01-24 | 2001-09-11 | Interactic Holdings, Llc | Scaleable low-latency switch for usage in an interconnect structure |
US6456590B1 (en) * | 1998-02-13 | 2002-09-24 | Texas Instruments Incorporated | Static and dynamic flow control using virtual input queueing for shared memory ethernet switches |
US6467011B2 (en) * | 1999-03-19 | 2002-10-15 | Times N Systems, Inc. | Shared memory apparatus and method for multiprocessor systems |
US20030088694A1 (en) * | 2001-11-02 | 2003-05-08 | Internet Machines Corporation | Multicasting method and switch |
US6594234B1 (en) * | 2001-05-31 | 2003-07-15 | Fujitsu Network Communications, Inc. | System and method for scheduling traffic for different classes of service |
US20030135579A1 (en) * | 2001-12-13 | 2003-07-17 | Man-Soo Han | Adaptive buffer partitioning method for shared buffer switch and switch therefor |
US6625159B1 (en) * | 1998-11-30 | 2003-09-23 | Hewlett-Packard Development Company, L.P. | Nonblocking and fair queuing switching method and shared memory packet switch |
US6657962B1 (en) * | 2000-04-10 | 2003-12-02 | International Business Machines Corporation | Method and system for managing congestion in a network |
US6678277B1 (en) * | 1999-11-09 | 2004-01-13 | 3Com Corporation | Efficient means to provide back pressure without head of line blocking in a virtual output queued forwarding system |
US6735679B1 (en) * | 1998-07-08 | 2004-05-11 | Broadcom Corporation | Apparatus and method for optimizing access to memory |
US20040151184A1 (en) * | 2002-12-13 | 2004-08-05 | Zarlink Semiconductor V.N. Inc. | Class-based rate control using multi-threshold leaky bucket |
US20040179476A1 (en) * | 2003-03-10 | 2004-09-16 | Sung-Ha Kim | Apparatus and method for controlling a traffic switching operation based on a service class in an ethernet-based network |
US20060182112A1 (en) * | 2000-06-19 | 2006-08-17 | Broadcom Corporation | Switch fabric with memory management unit for improved flow control |
US7099275B2 (en) * | 2001-09-21 | 2006-08-29 | Slt Logic Llc | Programmable multi-service queue scheduler |
US7120117B1 (en) * | 2000-08-29 | 2006-10-10 | Broadcom Corporation | Starvation free flow control in a shared memory switching device |
US7263066B1 (en) * | 2001-12-14 | 2007-08-28 | Applied Micro Circuits Corporation | Switch fabric backplane flow management using credit-based flow control |
US7394808B2 (en) * | 2004-05-24 | 2008-07-01 | Nortel Networks Limited | Method and apparatus for implementing scheduling algorithms in a network element |
Family Cites Families (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4680701A (en) * | 1984-04-11 | 1987-07-14 | Texas Instruments Incorporated | Asynchronous high speed processor having high speed memories with domino circuits contained therein |
GB8711991D0 (en) * | 1987-05-21 | 1987-06-24 | British Aerospace | Asynchronous communication systems |
US4912348A (en) * | 1988-12-09 | 1990-03-27 | Idaho Research Foundation | Method for designing pass transistor asynchronous sequential circuits |
US5752070A (en) * | 1990-03-19 | 1998-05-12 | California Institute Of Technology | Asynchronous processors |
US5121003A (en) | 1990-10-10 | 1992-06-09 | Hal Computer Systems, Inc. | Zero overhead self-timed iterative logic |
US5434520A (en) * | 1991-04-12 | 1995-07-18 | Hewlett-Packard Company | Clocking systems and methods for pipelined self-timed dynamic logic circuits |
US5367638A (en) * | 1991-12-23 | 1994-11-22 | U.S. Philips Corporation | Digital data processing circuit with control of data flow by control of the supply voltage |
DE4214981A1 (en) * | 1992-05-06 | 1993-11-11 | Siemens Ag | Asynchronous logic circuit for 2-phase operation |
DE69430352T2 (en) * | 1993-10-21 | 2003-01-30 | Sun Microsystems Inc | Counterflow pipeline |
US5440182A (en) * | 1993-10-22 | 1995-08-08 | The Board Of Trustees Of The Leland Stanford Junior University | Dynamic logic interconnect speed-up circuit |
US6152613A (en) * | 1994-07-08 | 2000-11-28 | California Institute Of Technology | Circuit implementations for asynchronous processors |
US5642501A (en) * | 1994-07-26 | 1997-06-24 | Novell, Inc. | Computer method and apparatus for asynchronous ordered operations |
US5732233A (en) * | 1995-01-23 | 1998-03-24 | International Business Machines Corporation | High speed pipeline method and apparatus |
EP0787327B1 (en) * | 1995-08-23 | 2002-06-12 | Koninklijke Philips Electronics N.V. | Data processing system comprising an asynchronously controlled pipeline |
CN1209247A (en) * | 1996-01-03 | 1999-02-24 | 索尼电子有限公司 | Copy protection recording and playback system |
GB2310738B (en) * | 1996-02-29 | 2000-02-16 | Advanced Risc Mach Ltd | Dynamic logic pipeline control |
US5889979A (en) | 1996-05-24 | 1999-03-30 | Hewlett-Packard, Co. | Transparent data-triggered pipeline latch |
WO1999004334A1 (en) * | 1997-07-16 | 1999-01-28 | California Institute Of Technology | Improved devices and methods for asynchronous processing |
US5920899A (en) * | 1997-09-02 | 1999-07-06 | Acorn Networks, Inc. | Asynchronous pipeline whose stages generate output request before latching data |
US6038656A (en) * | 1997-09-12 | 2000-03-14 | California Institute Of Technology | Pipelined completion for asynchronous communication |
US6502180B1 (en) * | 1997-09-12 | 2002-12-31 | California Institute Of Technology | Asynchronous circuits with pipelined completion process |
US6301655B1 (en) * | 1997-09-15 | 2001-10-09 | California Institute Of Technology | Exception processing in asynchronous processor |
US5949259A (en) * | 1997-11-19 | 1999-09-07 | Atmel Corporation | Zero-delay slew-rate controlled output buffer |
US5973512A (en) * | 1997-12-02 | 1999-10-26 | National Semiconductor Corporation | CMOS output buffer having load independent slewing |
US20020136229A1 (en) | 2001-01-09 | 2002-09-26 | Lucent Technologies, Inc. | Non-blocking crossbar and method of operation thereof |
US7054312B2 (en) | 2001-08-17 | 2006-05-30 | Mcdata Corporation | Multi-rate shared memory architecture for frame storage and switching |
WO2003043272A1 (en) | 2001-11-13 | 2003-05-22 | Transwitch Corporation | Overcoming access latency inefficiency in memories for packet switched networks |
US7283557B2 (en) * | 2002-01-25 | 2007-10-16 | Fulcrum Microsystems, Inc. | Asynchronous crossbar with deterministic or arbitrated control |
US6950959B2 (en) * | 2002-02-12 | 2005-09-27 | Fulcrum Microsystems Inc. | Techniques for facilitating conversion between asynchronous and synchronous domains |
EP1647030B1 (en) * | 2003-07-14 | 2009-12-16 | Fulcrum Microsystems Inc. | Asynchronous static random access memory |
US7814280B2 (en) * | 2005-01-12 | 2010-10-12 | Fulcrum Microsystems Inc. | Shared-memory switch fabric architecture |
US7916718B2 (en) * | 2007-04-19 | 2011-03-29 | Fulcrum Microsystems, Inc. | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
Filing date | Office | Application number | Publication | Status |
---|---|---|---|---|
2007-04-19 | US | US11/737,511 | US7916718B2 | Active |
2008-04-08 | WO | PCT/US2008/059668 | WO2008130841A1 | Application Filing |
2011-01-06 | US | US12/986,031 | US8467342B2 | Active |
Patent Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US136229A (en) * | 1873-02-25 | | | Improvement in revivifying charcoal used in rectifying spirits |
US46496A (en) * | 1865-02-21 | 1865-02-21 | photo-l | |
US6009078A (en) * | 1996-02-09 | 1999-12-28 | Nec Corporation | ATM switch device capable of favorably controlling traffic congestion |
US5864539A (en) * | 1996-05-06 | 1999-01-26 | Bay Networks, Inc. | Method and apparatus for a rate-based congestion control in a shared memory switch |
US5894481A (en) * | 1996-09-11 | 1999-04-13 | Mcdata Corporation | Fiber channel switch employing distributed queuing |
US6289021B1 (en) * | 1997-01-24 | 2001-09-11 | Interactic Holdings, Llc | Scaleable low-latency switch for usage in an interconnect structure |
US6160813A (en) * | 1997-03-21 | 2000-12-12 | Brocade Communications Systems, Inc. | Fibre channel switching system and method |
US6456590B1 (en) * | 1998-02-13 | 2002-09-24 | Texas Instruments Incorporated | Static and dynamic flow control using virtual input queueing for shared memory ethernet switches |
US6735679B1 (en) * | 1998-07-08 | 2004-05-11 | Broadcom Corporation | Apparatus and method for optimizing access to memory |
US6625159B1 (en) * | 1998-11-30 | 2003-09-23 | Hewlett-Packard Development Company, L.P. | Nonblocking and fair queuing switching method and shared memory packet switch |
US6467011B2 (en) * | 1999-03-19 | 2002-10-15 | Times N Systems, Inc. | Shared memory apparatus and method for multiprocessor systems |
US6678277B1 (en) * | 1999-11-09 | 2004-01-13 | 3Com Corporation | Efficient means to provide back pressure without head of line blocking in a virtual output queued forwarding system |
US6657962B1 (en) * | 2000-04-10 | 2003-12-02 | International Business Machines Corporation | Method and system for managing congestion in a network |
US20060182112A1 (en) * | 2000-06-19 | 2006-08-17 | Broadcom Corporation | Switch fabric with memory management unit for improved flow control |
US7120117B1 (en) * | 2000-08-29 | 2006-10-10 | Broadcom Corporation | Starvation free flow control in a shared memory switching device |
US6594234B1 (en) * | 2001-05-31 | 2003-07-15 | Fujitsu Network Communications, Inc. | System and method for scheduling traffic for different classes of service |
US7099275B2 (en) * | 2001-09-21 | 2006-08-29 | Slt Logic Llc | Programmable multi-service queue scheduler |
US20030088694A1 (en) * | 2001-11-02 | 2003-05-08 | Internet Machines Corporation | Multicasting method and switch |
US20030135579A1 (en) * | 2001-12-13 | 2003-07-17 | Man-Soo Han | Adaptive buffer partitioning method for shared buffer switch and switch therefor |
US7263066B1 (en) * | 2001-12-14 | 2007-08-28 | Applied Micro Circuits Corporation | Switch fabric backplane flow management using credit-based flow control |
US20040151184A1 (en) * | 2002-12-13 | 2004-08-05 | Zarlink Semiconductor V.N. Inc. | Class-based rate control using multi-threshold leaky bucket |
US20040179476A1 (en) * | 2003-03-10 | 2004-09-16 | Sung-Ha Kim | Apparatus and method for controlling a traffic switching operation based on a service class in an ethernet-based network |
US7394808B2 (en) * | 2004-05-24 | 2008-07-01 | Nortel Networks Limited | Method and apparatus for implementing scheduling algorithms in a network element |
Cited By (163)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8160094B2 (en) | 2004-10-22 | 2012-04-17 | Cisco Technology, Inc. | Fibre channel over ethernet |
US7830793B2 (en) | 2004-10-22 | 2010-11-09 | Cisco Technology, Inc. | Network device architecture for consolidating input/output and reducing latency |
US7969971B2 (en) | 2004-10-22 | 2011-06-28 | Cisco Technology, Inc. | Ethernet extension for the data center |
US8238347B2 (en) | 2004-10-22 | 2012-08-07 | Cisco Technology, Inc. | Fibre channel over ethernet |
US9246834B2 (en) | 2004-10-22 | 2016-01-26 | Cisco Technology, Inc. | Fibre channel over ethernet |
US8842694B2 (en) | 2004-10-22 | 2014-09-23 | Cisco Technology, Inc. | Fibre Channel over Ethernet |
US8532099B2 (en) | 2004-10-22 | 2013-09-10 | Cisco Technology, Inc. | Forwarding table reduction and multipath network forwarding |
US7801125B2 (en) | 2004-10-22 | 2010-09-21 | Cisco Technology, Inc. | Forwarding table reduction and multipath network forwarding |
US20060098589A1 (en) * | 2004-10-22 | 2006-05-11 | Cisco Technology, Inc. | Forwarding table reduction and multipath network forwarding |
US8565231B2 (en) | 2004-10-22 | 2013-10-22 | Cisco Technology, Inc. | Ethernet extension for the data center |
US8792352B2 (en) | 2005-10-11 | 2014-07-29 | Cisco Technology, Inc. | Methods and devices for backward congestion notification |
US20070201499A1 (en) * | 2006-02-24 | 2007-08-30 | Texas Instruments Incorporated | Device, system and/or method for managing packet congestion in a packet switching network |
US7724754B2 (en) * | 2006-02-24 | 2010-05-25 | Texas Instruments Incorporated | Device, system and/or method for managing packet congestion in a packet switching network |
US8259720B2 (en) | 2007-02-02 | 2012-09-04 | Cisco Technology, Inc. | Triple-tier anycast addressing |
US8743738B2 (en) | 2007-02-02 | 2014-06-03 | Cisco Technology, Inc. | Triple-tier anycast addressing |
US8467342B2 (en) | 2007-04-19 | 2013-06-18 | Intel Corporation | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
US20110164496A1 (en) * | 2007-04-19 | 2011-07-07 | Fulcrum Microsystems Inc. | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics |
US20090010162A1 (en) * | 2007-07-05 | 2009-01-08 | Cisco Technology, Inc. | Flexible and hierarchical dynamic buffer allocation |
US8149710B2 (en) * | 2007-07-05 | 2012-04-03 | Cisco Technology, Inc. | Flexible and hierarchical dynamic buffer allocation |
US20140347997A1 (en) * | 2007-08-21 | 2014-11-27 | Cisco Technology, Inc. | Backward congestion notification |
US8121038B2 (en) | 2007-08-21 | 2012-02-21 | Cisco Technology, Inc. | Backward congestion notification |
US8804529B2 (en) | 2007-08-21 | 2014-08-12 | Cisco Technology, Inc. | Backward congestion notification |
US8929372B2 (en) | 2007-10-30 | 2015-01-06 | Contextream Ltd. | Grid router |
US20090109968A1 (en) * | 2007-10-30 | 2009-04-30 | Ariel Noy | Grid router |
US20090219818A1 (en) * | 2008-03-03 | 2009-09-03 | Masahiko Tsuchiya | Node device, packet switch device, communication system and method of communicating packet data |
US8467295B2 (en) * | 2008-08-21 | 2013-06-18 | Contextream Ltd. | System and methods for distributed quality of service enforcement |
US20100046368A1 (en) * | 2008-08-21 | 2010-02-25 | Gideon Kaempfer | System and methods for distributed quality of service enforcement |
US20130194929A1 (en) * | 2008-08-21 | 2013-08-01 | Contextream Ltd. | System and methods for distributed quality of service enforcement |
US9344369B2 (en) * | 2008-08-21 | 2016-05-17 | Hewlett Packard Enterprise Development Lp | System and methods for distributed quality of service enforcement |
US8154996B2 (en) * | 2008-09-11 | 2012-04-10 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with multi-staged queues |
US8964556B2 (en) | 2008-09-11 | 2015-02-24 | Juniper Networks, Inc. | Methods and apparatus for flow-controllable multi-staged queues |
US8218442B2 (en) * | 2008-09-11 | 2012-07-10 | Juniper Networks, Inc. | Methods and apparatus for flow-controllable multi-staged queues |
US8811163B2 (en) | 2008-09-11 | 2014-08-19 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with multi-staged queues |
US8213308B2 (en) | 2008-09-11 | 2012-07-03 | Juniper Networks, Inc. | Methods and apparatus for defining a flow control signal related to a transmit queue |
US10931589B2 (en) | 2008-09-11 | 2021-02-23 | Juniper Networks, Inc. | Methods and apparatus for flow-controllable multi-staged queues |
US9876725B2 (en) | 2008-09-11 | 2018-01-23 | Juniper Networks, Inc. | Methods and apparatus for flow-controllable multi-staged queues |
US20100061238A1 (en) * | 2008-09-11 | 2010-03-11 | Avanindra Godbole | Methods and apparatus for flow control associated with multi-staged queues |
US20100061390A1 (en) * | 2008-09-11 | 2010-03-11 | Avanindra Godbole | Methods and apparatus for defining a flow control signal related to a transmit queue |
US20100061239A1 (en) * | 2008-09-11 | 2010-03-11 | Avanindra Godbole | Methods and apparatus for flow-controllable multi-staged queues |
US8593970B2 (en) | 2008-09-11 | 2013-11-26 | Juniper Networks, Inc. | Methods and apparatus for defining a flow control signal related to a transmit queue |
US8325749B2 (en) | 2008-12-24 | 2012-12-04 | Juniper Networks, Inc. | Methods and apparatus for transmission of groups of cells via a switch fabric |
US9077466B2 (en) | 2008-12-24 | 2015-07-07 | Juniper Networks, Inc. | Methods and apparatus for transmission of groups of cells via a switch fabric |
US8717889B2 (en) | 2008-12-29 | 2014-05-06 | Juniper Networks, Inc. | Flow-control in a switch fabric |
US20100165843A1 (en) * | 2008-12-29 | 2010-07-01 | Thomas Philip A | Flow-control in a switch fabric |
US8254255B2 (en) | 2008-12-29 | 2012-08-28 | Juniper Networks, Inc. | Flow-control in a switch fabric |
US20110154132A1 (en) * | 2009-12-23 | 2011-06-23 | Gunes Aybay | Methods and apparatus for tracking data flow based on flow state values |
US9264321B2 (en) | 2009-12-23 | 2016-02-16 | Juniper Networks, Inc. | Methods and apparatus for tracking data flow based on flow state values |
US9967167B2 (en) | 2009-12-23 | 2018-05-08 | Juniper Networks, Inc. | Methods and apparatus for tracking data flow based on flow state values |
US11323350B2 (en) | 2009-12-23 | 2022-05-03 | Juniper Networks, Inc. | Methods and apparatus for tracking data flow based on flow state values |
US10554528B2 (en) | 2009-12-23 | 2020-02-04 | Juniper Networks, Inc. | Methods and apparatus for tracking data flow based on flow state values |
US8379516B2 (en) | 2009-12-24 | 2013-02-19 | Contextream Ltd. | Grid routing apparatus and method |
US20110158082A1 (en) * | 2009-12-24 | 2011-06-30 | Contextream Ltd. | Grid routing apparatus and method |
US11398991B1 (en) | 2010-04-30 | 2022-07-26 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
US9602439B2 (en) | 2010-04-30 | 2017-03-21 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
US10560381B1 (en) | 2010-04-30 | 2020-02-11 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
US9036481B1 (en) * | 2010-05-05 | 2015-05-19 | Marvell International Ltd. | Method and apparatus for adaptive packet load balancing |
US20110296434A1 (en) * | 2010-05-25 | 2011-12-01 | International Business Machines Corporation | Techniques for Dynamically Sharing a Fabric to Facilitate Off-Chip Communication for Multiple On-Chip Units |
US8346988B2 (en) * | 2010-05-25 | 2013-01-01 | International Business Machines Corporation | Techniques for dynamically sharing a fabric to facilitate off-chip communication for multiple on-chip units |
US9705827B2 (en) | 2010-06-22 | 2017-07-11 | Juniper Networks, Inc. | Methods and apparatus for virtual channel flow control associated with a switch fabric |
US9065773B2 (en) | 2010-06-22 | 2015-06-23 | Juniper Networks, Inc. | Methods and apparatus for virtual channel flow control associated with a switch fabric |
US8553710B1 (en) | 2010-08-18 | 2013-10-08 | Juniper Networks, Inc. | Fibre channel credit-based link flow control overlay onto fibre channel over ethernet |
US20170257328A1 (en) * | 2010-12-01 | 2017-09-07 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
US20120140626A1 (en) * | 2010-12-01 | 2012-06-07 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
CN105323185A (en) * | 2010-12-01 | 2016-02-10 | 瞻博网络公司 | Methods and apparatus for flow control associated with switch fabric |
CN102487358A (en) * | 2010-12-01 | 2012-06-06 | 丛林网络公司 | Methods and apparatus for flow control associated with switch fabric |
US11711319B2 (en) | 2010-12-01 | 2023-07-25 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
EP3297234A1 (en) * | 2010-12-01 | 2018-03-21 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
US9660940B2 (en) * | 2010-12-01 | 2017-05-23 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
US10616143B2 (en) | 2010-12-01 | 2020-04-07 | Juniper Networks, Inc. | Methods and apparatus for flow control associated with a switch fabric |
US20160014636A1 (en) * | 2010-12-07 | 2016-01-14 | Siemens Aktiengesellschaft | Congestion Notification Element and Method for Congestion Control |
US9872200B2 (en) * | 2010-12-07 | 2018-01-16 | Siemens Aktiengesellschaft | Congestion notification element and method for congestion control |
US20140317225A1 (en) * | 2011-01-03 | 2014-10-23 | Planetary Data LLC | Community internet drive |
US9800464B2 (en) * | 2011-01-03 | 2017-10-24 | Planetary Data LLC | Community internet drive |
US11863380B2 (en) | 2011-01-03 | 2024-01-02 | Planetary Data LLC | Community internet drive |
US11218367B2 (en) * | 2011-01-03 | 2022-01-04 | Planetary Data LLC | Community internet drive |
US10177978B2 (en) * | 2011-01-03 | 2019-01-08 | Planetary Data LLC | Community internet drive |
US8804715B2 (en) * | 2011-02-02 | 2014-08-12 | Fujitsu Limited | Communication processing apparatus and address learning method |
US20120195308A1 (en) * | 2011-02-02 | 2012-08-02 | Fujitsu Limited | Communication processing apparatus and address learning method |
US20130028078A1 (en) * | 2011-02-16 | 2013-01-31 | Sony Corporation | Transmission terminal and transmission method |
US9032089B2 (en) | 2011-03-09 | 2015-05-12 | Juniper Networks, Inc. | Methods and apparatus for path selection within a network based on flow duration |
US9716661B2 (en) | 2011-03-09 | 2017-07-25 | Juniper Networks, Inc. | Methods and apparatus for path selection within a network based on flow duration |
US8711697B1 (en) * | 2011-06-22 | 2014-04-29 | Marvell International Ltd. | Method and apparatus for prioritizing data transfer |
US20130051494A1 (en) * | 2011-08-23 | 2013-02-28 | Oracle International Corporation | Method and system for responder side cut through of received data |
US9021123B2 (en) * | 2011-08-23 | 2015-04-28 | Oracle International Corporation | Method and system for responder side cut through of received data |
US9118597B2 (en) | 2011-08-23 | 2015-08-25 | Oracle International Corporation | Method and system for requester virtual cut through |
US8879579B2 (en) | 2011-08-23 | 2014-11-04 | Oracle International Corporation | Method and system for requester virtual cut through |
US10698858B1 (en) | 2011-09-30 | 2020-06-30 | EMC IP Holding Company LLC | Multiprocessor messaging system |
US9760526B1 (en) * | 2011-09-30 | 2017-09-12 | EMC IP Holding Company LLC | Multiprocessor messaging system |
US8811183B1 (en) | 2011-10-04 | 2014-08-19 | Juniper Networks, Inc. | Methods and apparatus for multi-path flow control within a multi-stage switch fabric |
US9426085B1 (en) | 2011-10-04 | 2016-08-23 | Juniper Networks, Inc. | Methods and apparatus for multi-path flow control within a multi-stage switch fabric |
US8930489B2 (en) * | 2011-10-11 | 2015-01-06 | Rackspace US, Inc. | Distributed rate limiting of handling requests |
US20130091241A1 (en) * | 2011-10-11 | 2013-04-11 | David Goetz | Distributed Rate Limiting Of Handling Requests |
WO2013101794A1 (en) * | 2011-12-30 | 2013-07-04 | Arteris SAS | Link between chips using virtual channels and credit based flow control |
US8824295B2 (en) | 2011-12-30 | 2014-09-02 | Qualcomm Technologies, Inc. | Link between chips using virtual channels and credit based flow control |
US20130250762A1 (en) * | 2012-03-22 | 2013-09-26 | Avaya, Inc. | Method and apparatus for Lossless Behavior For Multiple Ports Sharing a Buffer Pool |
US8867360B2 (en) * | 2012-03-22 | 2014-10-21 | Avaya Inc. | Method and apparatus for lossless behavior for multiple ports sharing a buffer pool |
CN103326954A (en) * | 2012-03-23 | 2013-09-25 | 美国博通公司 | Reducing headroom |
US9438527B2 (en) * | 2012-05-24 | 2016-09-06 | Marvell World Trade Ltd. | Flexible queues in a network switch |
US9887929B2 (en) | 2012-05-24 | 2018-02-06 | Marvell World Trade Ltd. | Flexible queues in a network switch |
US20130315054A1 (en) * | 2012-05-24 | 2013-11-28 | Marvell World Trade Ltd. | Flexible queues in a network switch |
US9231869B2 (en) | 2012-06-21 | 2016-01-05 | Microsoft Technology Licensing, Llc | Ensuring predictable and quantifiable networking performance |
US8804523B2 (en) | 2012-06-21 | 2014-08-12 | Microsoft Corporation | Ensuring predictable and quantifiable networking performance |
US9537773B2 (en) | 2012-06-21 | 2017-01-03 | Microsoft Technology Licensing, Llc | Ensuring predictable and quantifiable networking performance |
US10447594B2 (en) | 2012-06-21 | 2019-10-15 | Microsoft Technology Licensing, Llc | Ensuring predictable and quantifiable networking performance |
WO2013191927A1 (en) * | 2012-06-21 | 2013-12-27 | Microsoft Corporation | Ensuring predictable and quantifiable networking performance |
CN104396200A (en) * | 2012-06-21 | 2015-03-04 | 微软公司 | Ensuring predictable and quantifiable networking performance |
US9369923B2 (en) * | 2012-09-05 | 2016-06-14 | Thales | Transmission method in an ad hoc multi-hop IP network |
US20140071826A1 (en) * | 2012-09-05 | 2014-03-13 | Thales | Transmission method in an ad hoc multi-hop ip network |
DE112013006417B4 (en) | 2013-01-14 | 2023-04-27 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Low latency lossless switch fabric for use in a data center |
US9014005B2 (en) * | 2013-01-14 | 2015-04-21 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Low-latency lossless switch fabric for use in a data center |
US9270600B2 (en) | 2013-01-14 | 2016-02-23 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Low-latency lossless switch fabric for use in a data center |
CN105229976A (en) * | 2013-01-14 | 2016-01-06 | 联想企业解决方案(新加坡)有限公司 | Low-latency lossless switching fabric for data center |
US20140198638A1 (en) * | 2013-01-14 | 2014-07-17 | International Business Machines Corporation | Low-latency lossless switch fabric for use in a data center |
US9356868B2 (en) * | 2013-08-23 | 2016-05-31 | Broadcom Corporation | Congestion detection and management at congestion-tree roots |
US20150055478A1 (en) * | 2013-08-23 | 2015-02-26 | Broadcom Corporation | Congestion detection and management at congestion-tree roots |
US8891376B1 (en) | 2013-10-07 | 2014-11-18 | International Business Machines Corporation | Quantized Congestion Notification—defense mode choice extension for the alternate priority of congestion points |
US20160226783A1 (en) * | 2013-10-14 | 2016-08-04 | Google Inc. | Pacing enhanced packet forwarding/switching and congestion avoidance |
US9843526B2 (en) * | 2013-10-14 | 2017-12-12 | Google Llc | Pacing enhanced packet forwarding/switching and congestion avoidance |
US10498656B2 (en) | 2014-02-24 | 2019-12-03 | Avago Technologies International Sales Pte. Limited | End to end flow control |
EP2911352B1 (en) * | 2014-02-24 | 2019-10-30 | Avago Technologies International Sales Pte. Limited | Method and network system for end-to-end flow control |
US10027590B2 (en) | 2014-02-24 | 2018-07-17 | Avago Technologies General Ip (Singapore) Pte. Ltd. | End to end flow control |
US20150281405A1 (en) * | 2014-03-31 | 2015-10-01 | Metaswitch Networks Limited | Spanning tree protocol |
US10892936B2 (en) * | 2014-03-31 | 2021-01-12 | Metaswitch Networks Limited | Spanning tree protocol |
US10469404B1 (en) * | 2014-05-12 | 2019-11-05 | Google Llc | Network multi-level rate limiter |
US9762502B1 (en) | 2014-05-12 | 2017-09-12 | Google Inc. | Method and system for validating rate-limiter determination made by untrusted software |
US9755978B1 (en) | 2014-05-12 | 2017-09-05 | Google Inc. | Method and system for enforcing multiple rate limits with limited on-chip buffering |
US9654483B1 (en) * | 2014-12-23 | 2017-05-16 | Amazon Technologies, Inc. | Network communication rate limiter |
US20160285771A1 (en) * | 2015-03-23 | 2016-09-29 | Cisco Technology, Inc. | Technique for achieving low latency in data center network environments |
US9559968B2 (en) * | 2015-03-23 | 2017-01-31 | Cisco Technology, Inc. | Technique for achieving low latency in data center network environments |
US10025609B2 (en) * | 2015-04-23 | 2018-07-17 | International Business Machines Corporation | Virtual machine (VM)-to-VM flow control for overlay networks |
US20160314012A1 (en) * | 2015-04-23 | 2016-10-27 | International Business Machines Corporation | Virtual machine (vm)-to-vm flow control for overlay networks |
US10698718B2 (en) | 2015-04-23 | 2020-06-30 | International Business Machines Corporation | Virtual machine (VM)-to-VM flow control using congestion status messages for overlay networks |
US9923965B2 (en) | 2015-06-05 | 2018-03-20 | International Business Machines Corporation | Storage mirroring over wide area network circuits with dynamic on-demand capacity |
US10581680B2 (en) | 2015-11-25 | 2020-03-03 | International Business Machines Corporation | Dynamic configuration of network features |
US10608952B2 (en) * | 2015-11-25 | 2020-03-31 | International Business Machines Corporation | Configuring resources to exploit elastic network capability |
US20170149688A1 (en) * | 2015-11-25 | 2017-05-25 | International Business Machines Corporation | Configuring resources to exploit elastic network capability |
US9923784B2 (en) | 2015-11-25 | 2018-03-20 | International Business Machines Corporation | Data transfer using flexible dynamic elastic network service provider relationships |
US10216441B2 (en) | 2015-11-25 | 2019-02-26 | International Business Machines Corporation | Dynamic quality of service for storage I/O port allocation |
US10177993B2 (en) | 2015-11-25 | 2019-01-08 | International Business Machines Corporation | Event-based data transfer scheduling using elastic network optimization criteria |
US10057327B2 (en) | 2015-11-25 | 2018-08-21 | International Business Machines Corporation | Controlled transfer of data over an elastic network |
US9923839B2 (en) * | 2015-11-25 | 2018-03-20 | International Business Machines Corporation | Configuring resources to exploit elastic network capability |
US10630590B2 (en) * | 2016-07-14 | 2020-04-21 | Mellanox Technologies Tlv Ltd. | Credit loop deadlock detection and recovery in arbitrary topology networks |
US20180019947A1 (en) * | 2016-07-14 | 2018-01-18 | Mellanox Technologies Tlv Ltd. | Credit Loop Deadlock Detection and Recovery in Arbitrary Topology Networks |
US11070474B1 (en) * | 2018-10-22 | 2021-07-20 | Juniper Networks, Inc. | Selective load balancing for spraying over fabric paths |
US20220217090A1 (en) * | 2019-05-23 | 2022-07-07 | Hewlett Packard Enterprise Development Lp | System and method for facilitating data-driven intelligent network with endpoint congestion detection and control |
US11575609B2 (en) * | 2019-07-19 | 2023-02-07 | Intel Corporation | Techniques for congestion management in a network |
US20190386924A1 (en) * | 2019-07-19 | 2019-12-19 | Intel Corporation | Techniques for congestion management in a network |
US11411925B2 (en) | 2019-12-31 | 2022-08-09 | Oracle International Corporation | Methods, systems, and computer readable media for implementing indirect general packet radio service (GPRS) tunneling protocol (GTP) firewall filtering using diameter agent and signal transfer point (STP) |
US11553342B2 (en) | 2020-07-14 | 2023-01-10 | Oracle International Corporation | Methods, systems, and computer readable media for mitigating 5G roaming security attacks using security edge protection proxy (SEPP) |
US11751056B2 (en) | 2020-08-31 | 2023-09-05 | Oracle International Corporation | Methods, systems, and computer readable media for 5G user equipment (UE) historical mobility tracking and security screening using mobility patterns |
US11832172B2 (en) | 2020-09-25 | 2023-11-28 | Oracle International Corporation | Methods, systems, and computer readable media for mitigating spoofing attacks on security edge protection proxy (SEPP) inter-public land mobile network (inter-PLMN) forwarding interface |
US11825310B2 (en) | 2020-09-25 | 2023-11-21 | Oracle International Corporation | Methods, systems, and computer readable media for mitigating 5G roaming spoofing attacks |
US11622255B2 (en) | 2020-10-21 | 2023-04-04 | Oracle International Corporation | Methods, systems, and computer readable media for validating a session management function (SMF) registration request |
US11528251B2 (en) * | 2020-11-06 | 2022-12-13 | Oracle International Corporation | Methods, systems, and computer readable media for ingress message rate limiting |
US11770694B2 (en) | 2020-11-16 | 2023-09-26 | Oracle International Corporation | Methods, systems, and computer readable media for validating location update messages |
US11818570B2 (en) | 2020-12-15 | 2023-11-14 | Oracle International Corporation | Methods, systems, and computer readable media for message validation in fifth generation (5G) communications networks |
US11812271B2 (en) | 2020-12-17 | 2023-11-07 | Oracle International Corporation | Methods, systems, and computer readable media for mitigating 5G roaming attacks for internet of things (IoT) devices based on expected user equipment (UE) behavior patterns |
US11700510B2 (en) | 2021-02-12 | 2023-07-11 | Oracle International Corporation | Methods, systems, and computer readable media for short message delivery status report validation |
US11516671B2 (en) | 2021-02-25 | 2022-11-29 | Oracle International Corporation | Methods, systems, and computer readable media for mitigating location tracking and denial of service (DoS) attacks that utilize access and mobility management function (AMF) location service |
US11689912B2 (en) | 2021-05-12 | 2023-06-27 | Oracle International Corporation | Methods, systems, and computer readable media for conducting a velocity check for outbound subscribers roaming to neighboring countries |
US20230047454A1 (en) * | 2021-08-10 | 2023-02-16 | Mellanox Technologies, Ltd. | Ethernet pause aggregation for a relay device |
US11888753B2 (en) * | 2021-08-10 | 2024-01-30 | Mellanox Technologies, Ltd. | Ethernet pause aggregation for a relay device |
CN113676423A (en) * | 2021-08-13 | 2021-11-19 | 北京东土军悦科技有限公司 | Port flow control method and device, exchange chip and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2008130841A1 (en) | 2008-10-30 |
US8467342B2 (en) | 2013-06-18 |
US7916718B2 (en) | 2011-03-29 |
US20110164496A1 (en) | 2011-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7916718B2 (en) | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics | |
US11916782B2 (en) | System and method for facilitating global fairness in a network | |
US8520522B1 (en) | Transmit-buffer management for priority-based flow control | |
US7596627B2 (en) | Methods and apparatus for network congestion control | |
US8467295B2 (en) | System and methods for distributed quality of service enforcement | |
US7903552B2 (en) | Directional and priority based flow control mechanism between nodes | |
CN111788803B (en) | Flow management in a network | |
Ahammed et al. | Analyzing the performance of active queue management algorithms | |
US20150103667A1 (en) | Detection of root and victim network congestion | |
US20180234343A1 (en) | Evading congestion spreading for victim flows | |
US10728156B2 (en) | Scalable, low latency, deep buffered switch architecture | |
US20050068798A1 (en) | Committed access rate (CAR) system architecture | |
US7286552B1 (en) | Method and apparatus for providing quality of service across a switched backplane for multicast packets | |
US7408876B1 (en) | Method and apparatus for providing quality of service across a switched backplane between egress queue managers | |
US7599292B1 (en) | Method and apparatus for providing quality of service across a switched backplane between egress and ingress queue managers | |
Wadekar | Enhanced ethernet for data center: Reliable, channelized and robust | |
Cisco | QC: Quality of Service Overview | |
Cisco | Policing and Shaping Overview | |
US20240056385A1 (en) | Switch device for facilitating switching in data-driven intelligent network | |
ME et al. | Active Network based Queue Management for providing QoS using IXP 2400 Network Processor | |
Lengyel et al. | Simulation of differentiated services in network simulator | |
Siew et al. | Congestion control based on flow-state-dependent dynamic priority scheduling | |
Habib et al. | Unresponsive flow detection and control using the differentiated services framework | |
Selvam et al. | Processor Based Active Queue Management for Providing QoS in Multimedia Application | |
Sztrik | SIMULATION OF DIFFERENTIATED SERVICES IN NETWORK SIMULATOR | |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: FULCRUM MICROSYSTEMS, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LOH, ZHI-HERN;DAVIES, MICHAEL;CUMMINGS, URI;SIGNING DATES FROM 20070530 TO 20070601;REEL/FRAME:019532/0943 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
FEPP | Fee payment procedure | Free format text: PAT HOLDER NO LONGER CLAIMS SMALL ENTITY STATUS, ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: STOL); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
AS | Assignment | Owner name: INTEL CORPORATION, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:FULCRUM MICROSYSTEMS, INC.;REEL/FRAME:028251/0449 Effective date: 20120209 |
FPAY | Fee payment | Year of fee payment: 4 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 12 |