Switching Fabrics with Small Internal Buffers
for Scalable Non-Blocking Interconnects
Computer Architecture and VLSI Systems (CARV) Laboratory,
Institute of Computer Science (ICS), FORTH, Heraklion, Crete, Greece
© copyright 2002-07 by FORTH, Crete, Greece
Buffered switching fabrics (bibliography) are an efficient method for building packet switches with a very large number of ports. To keep the cost down, buffers inside the fabric should be small, with large buffers being necessary only at the input ports. We research flow control, congestion management, and scheduling methods to make such buffered switching fabrics operate as efficiently as possible. Our focus is on the Benes network, which is the lowest-cost topology that is known to yield operation free of internal blocking.
1. Switching Fabrics with Small Internal Buffers
Available in PDF (1.6 MBytes); © Copyright 2007 FORTH.
ABSTRACT: This thesis considers buffered multistage interconnection networks (fabrics), and investigates methods to reduce their buffer size requirements. Our contribution is a novel flow and congestion control scheme that achieves performance close to that of per-flow queueing while requiring much less buffer space than what per-flow queues would need. The new scheme utilizes a request-grant pre-approval phase, as many contemporary bufferless networks do, but its operation is much simpler and its performance is remarkably better. Traditionally, the role of requests in bufferless networks is to reserve an available time slot on each link along a packet's route, where these time slots are contiguous in time along the path, so as to guarantee non-conflicting packet transmission. These requirements impose a very heavy toll on the scheduling unit of such bufferless fabrics. By contrast, our requests do not reserve links for a specific time duration, but instead only reserve space in the buffers at their entry points; effectively, the scheduling decisions that concern different links are decoupled among themselves, leading to a much simpler admission process. The proposed scheduling subsystem comprises independent single-resource schedulers, operating in a pipeline; they operate asynchronously to each other. In this thesis we show that the reservation of buffers in front of critical network links --links that are unable to carry the potential aggregate demand-- eliminates congestion, in the sense that traffic flows seamlessly through the network: it neither gets dropped, nor is excessively blocked waiting for downstream buffers to become available.
First, we apply request-grant scheduling to a single-stage switch, with small, shared output queues, which serves as a model for the more challenging multistage case. We demonstrate that, in principle, a very small number of fabric buffers suffices to reach high performance levels: with 12-cell buffer space per output, performance is better than in buffered crossbars, which consume N cells of buffer space per output, where N is the number of ports. In this single-stage setting, we study the impact of input contention on scheduler performance, and the related synchronization phenomena. During this work, we have introduced a novel scheduling scheme for buffered crossbar switches that makes buffer size independent of the round-trip-time between the linecards and the switch.
We then proceed to the multistage case. Our main motivation and our primary benchmark is an example next-generation fabric challenge: a 1024x1024, 3-stage, non-blocking Clos/Benes fabric, running with no internal speedup, made of 96 single-chip 32x32 buffered crosssbar switching elements (3 stages of 32 switch chips each). To eliminate congestion in the fabric, we carefully apply our request-grant scheduling protocol. We demonstrate that it is feasible to implement all schedulers centrally, in a single chip. Besides congestion elimination, our scheduler can guarantee 100 percent in-order delivery, using very small reorder buffers, which can easily fit in on-chip memory. Simulation results indicate very good delay performance, and throughput that exceeds 95% under unbalanced traffic. Most prominent is the result that, under hot-spot traffic, with almost all output ports being congested, the non-congested outputs experience negligible delay degradation. The proposed system can directly operate on variable-size packets, eliminating the padding overhead and the associated internal speed-up. We also discuss a possible distributed version of the scheduling subsystem. Our scheme is appropriate to deal with heavy congestion; in systems that need to provide very low latency under (uncongested) light traffic, one would apply this scheme when the load exceeds a given threshold.
Lastly, we consider some blocking network topologies, like the banyan. In a banyan network, besides output ports, internal links can cause congestion as well. We show a fully distributed scheduler for this network, that eliminates congestion from both internal and output-port links
- Preprint of September 2005 in PDF (300 KBytes); 6 pages - © Copyright 2005 by IEEE.
Unbuffered crossbars or switching fabrics
contain no internal buffers,
and function using only input (VOQ) and possibly output queues.
Schedulers for such switches are complex,
and introduce increased delay at medium loads,
because they have to admit at most one cell
per input and per output, during each time slot.
Buffered crossbars, on the other hand,
contain sufficient internal buffering (N2 buffers)
to allow independent schedulers to concurrently forward packets
to the same output from any number of inputs.
These architectures represent the two extremes
in a range of solutions, which we examine here;
although intermediate points in this range
are of reduced practical interest for crossbars,
they are nevertheless quite interesting for switching fabrics,
and they may be of interest for optical switches.
We find that
tolerating two cells per-output per time-slot,
using small buffers inside the switch or fabric,
sufficies for independent and efficient scheduling.
First, we introduce a novel ``request-grant'' credit protocol,
enabling N inputs to share a small switch buffer.
Then, we apply this protocol
to a switch with N such buffers, one per output,
and we consider the resulting scheduling problem.
looks like unbuffered crossbar schedulers,
but it is much simpler because it comprises independent schedulers
that can be pipelined.
We show that individual buffer sizes do not need to grow,
neither with switch size nor with propagation delay.
Through simulations, we study performance as a function of
the number of cells allowed per-output per-time-slot.
For one cell, the switch performs very close to
the iSLIP unbuffered crossbar with one iteration.
For more cells, performance improves quickly;
for 12 cells, packet delay under (smooth) uniform load
is practically as low as ideal output queueing.
Under unbalanced load, throughput is superior to buffered crossbars,
due to better buffer sharing.
N. Chrysos, M. Katevenis: "Scheduling in Switches with Small Internal Buffers: Extended Version", FORTH-ICS, Crete, Greece, September 2005
- Available in PDF (420 KBytes); 12 pages - © Copyright 2005 by FORTH.
This extended version describes the results of the above Globecom 2005 paper at greater depth. It also discusses the behavior of the switch under bursty traffic (unbalanced transients), and demonstrates the performance gains of using grant rate control (grant throttling). The results presented in this paper further substantiate that buffers as small as 12 cells per output suffice for excellent performance.
[Previous, outdated version: July 2005, 7 pages, in pdf (250 KB)].
- Preprint of January 2006 in PDF (460 KBytes) - © Copyright 2006 by IEEE.
Three-stage non-blocking switching fabrics
are the next step in scaling current crossbar switches
to many hundreds or few thousands of ports.
Congestion (output contention) management is the central
open problem --without it, performance suffers heavily
under real-world traffic patterns.
Centralized schedulers for bufferless crossbars
manage output contention
but are not scalable to high valencies and to multi-stage fabrics.
Distributed scheduling, as used in buffered crossbars,
is scalable but has never been scaled beyond crossbars.
We combine ideas from central and distributed schedulers,
from request-grant protocols and from credit-based flow control,
to propose a novel, practical architecture for scheduling
in non-blocking buffered switching fabrics.
The new architecture relies on multiple, independent,
single-resource schedulers, operating in a pipeline.
(i) does not need internal speedup;
(ii) directly operates on variable-size packets
or multi-packet segments;
(iii) isolates well-behaved from congested flows;
(iv) provides delays that successfully compete
against output queueing;
(v) provides 95 % or better throughput
under unbalanced traffic;
(vi) provides weighted max-min fairness;
(vii) resequences cells or segments using very small buffers;
and (viii) can be realistically implemented
for a 1024x1024 reference fabric
made out of 32x32 buffered crossbar switch elements
at 10 Gbps line rate.
This paper carefully studies the many intricacies
of the problem and the solution,
and provides performance simulation results.
[Previous, outdated version: August 2005, 15 pages, in pdf (460 KB)].
- Preprint of April 2006 in PDF (430 KBytes); 8 pages - © Copyright 2006 by IEEE.
We consider a switch with small output queues,
shared among the input VOQ linecards.
This has been shown to be a useful abstract model
for realistic buffered switching fabrics.
Cells are being scheduled by a central control unit, comprising
independent, single resource schedulers, working in pipeline.
This unit allocates output buffer credits to the requesting VOQs.
We show how particular unbalanced transient VOQ states,
produced by bursty traffic, affect credit reservations:
when some input temporarily constitutes a bottleneck,
too many credits may get reserved for it at once,
leading to poor overall performance.
We propose a threshold grant throttling method
to control these credit accumulations.
Then, we show how, under such grant throttling,
typical round-robin credit schedulers can get synchronized,
thus deteriorating performance.
To avoid scheduler synchronization,
we propose modified round-robin disciplines.
Simulations under both smooth and bursty traffic
demonstrate the effectiveness of the combined method:
using only a 12-cell buffers per-output,
for any switch size, N, and independent of
the number of cells in transit between the linecards and the fabric,
the performance achieved
is very close to that of pure output queueing.
We also discuss the operation of the
independent input and output schedulers inside the control unit,
their relation with PIM-like schedulers,
and their relation with buffered crossbar schedulers.
2. Benes Fabrics with Internal Backpressure
(Similar to the Communications Magazine, Jan. 2005, paper)
- Available in PDF (180 KBytes) or Postscript (290 KBytes) or gzip'ed Postscript (85 KBytes) format; © Copyright 2003 IEEE.
- The slides of the presentation are also available, in PDF format (190 KBytes).
ABSTRACT: Multistage buffered switching fabrics are the most efficient method for scaling packet switches to very large numbers of ports. The Benes network is the lowest-cost switching fabric known to yield operation free of internal blocking. Backpressure inside a switching fabric can limit the use of expensive off-chip buffer memory to just virtual-output queues (VOQ) in front of the input stage. This paper extends the known backpressure architectures to the Benes network. To achieve this, we had to successfully combine per-flow backpressure, multipath routing (inverse multiplexing), and cell resequencing. We present a flow merging scheme that is needed to bring the cost of backpressure down to O(N) per switching element. We prove freedom from deadlock for a wide class of multipath cell distribution algorithms. Using a cell-time-accurate simulator, we verify operation free of internal blocking, we evaluate various cell distribution and resequencing methods, we compare performance to that of ideal output queueing and the iSLIP crossbar scheduling algorithm, and we show that the delay of well-behaved flows remains unaffected by the presence of congested traffic to oversubscribed output ports.
Available in PDF (150 KBytes) or Postscript (450 KBytes) or gzip'ed Postscript (125 KBytes) format; © Copyright 2003 FORTH.
This is a longer version of the above HPSR-2003 paper. In addition to the more detailed presentation, it also contains the proof of freedom from deadlock.
Available in PDF (500 KBytes); © Copyright 2002 FORTH.
This is the full-length version of our research;
it covers in more details
the topics covered by the above papers.
In addition, this thesis uses simplified models of the Benes fabric to analytically show that cell distribution does not result in throughput limitations, and to identify the points of the fabric where contention is resolved. This latter point may be the key to a future design of appropriate distributed scheduling algorithms that provide weighted max-min fairness for the overall system.
Part of this work was supported by an IBM PhD Fellowship to Nikos Chrysos, 2004-2006. Vasilios Siris, Dionisios Pnevmatikatos, and Paraskevi Fragopoulou helped us shape our ideas; we thank them.
Older (outdated) papers:
Available in PDF (260 KBytes) or Postscript (490 KBytes) or gzip'ed Postscript (135 KBytes) format; © Copyright 2002 FORTH.
[Superseded by the July 2003 version, above].
Available in Postscript (1.1 MBytes) or gzip'ed Postscript (200 KBytes) format; © Copyright 2002 FORTH.
[Superseded by TR-316, above].
[This (unfinished) paper presented our August 1994 work on how to use multilane backpressure in multipath (Benes) fabrics and how to merge flows so as to keep the number of lanes manageable, while guaranteeing freedom from deadlock and maintaining non-blocking operation.]
© Copyright 2002-2007 by FORTH or the IEEE:
These papers are protected by copyright. Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the FORTH or IEEE copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Foundation for Research & Technology -- Hellas (FORTH) or the IEEE. To copy otherwise, in whole or in part, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.
|Up to Packet Switch Architecture R&D at CARV-ICS-FORTH||Last updated: Apr. 2007, by M. Katevenis.|