CA2239426A1 - Shared memory system - Google Patents

Shared memory system

Info

Publication number
CA2239426A1
CA2239426A1
Authority
CA
Canada
Prior art keywords
memory
bus
shared memory
processing unit
bank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA 2239426
Other languages
French (fr)
Inventor
Robert J. Patenaude
Stephen P. Nordstrom
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Canada Inc
Original Assignee
Newbridge Networks Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Newbridge Networks Corp filed Critical Newbridge Networks Corp
Priority to CA 2239426 priority Critical patent/CA2239426A1/en
Publication of CA2239426A1 publication Critical patent/CA2239426A1/en
Abandoned legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 - Information transfer, e.g. on bus
    • G06F 13/40 - Bus structure
    • G06F 13/4004 - Coupling between buses
    • G06F 13/4022 - Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 - Handling requests for interconnection or transfer
    • G06F 13/16 - Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1605 - Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F 13/1652 - Handling requests for interconnection or transfer for access to memory bus based on arbitration in a multiprocessor architecture
    • G06F 13/1663 - Access to shared memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Dram (AREA)

Abstract

A shared memory system, comprising a plurality of memory banks and a plurality of processing units. Each processing unit has memory address and data buses, and generates a command signal for requesting a shared memory access. The invention employs at least one bus switching fabric comprising at least one pass gate bus switch. The bus switching fabric has a negligible propagation delay and is used to connect the memory address and data buses of each processing unit to each memory bank. A shared memory controller receives the memory access request signal associated with each processing unit. The shared memory controller controls the bus switching fabric and each memory bank in order to allow each memory bank to be concurrently and asynchronously accessed by different processing units.

Description

SHARED MEMORY SYSTEM
Field of Invention

The invention relates to the art of shared memory systems, wherein multiple processing devices access a common memory.
Background of Invention

As processing systems increase in complexity, it often becomes necessary to divide tasks amongst multiple processing units such as microprocessors and application specific integrated circuits (ASICs). A typical way in which processing units communicate and/or share data is through the use of a common, shared pool of memory.
The shared memory of a processing system typically represents a performance bottleneck because of the contention between processing units which need to access this shared resource. The shared memory bandwidth and access latency become a determining factor in overall system performance. The shared memory bandwidth is the total amount of data transferred per time unit (e.g. second). Latency is the amount of time a given processing unit has to wait from the time it requests a shared memory access to the time when that access begins. One of the methods to improve shared memory performance is to employ multiple memory banks which are accessible concurrently by all processing units.
Since the shared memory system is important to overall system performance, it is desirable to operate it at the highest possible frequency, and with the widest data bus possible. Operating at high frequencies imposes serious design restrictions in terms of the delays presented by the switching mechanisms which enable the various processing units to access the shared memory.
There are various shared memory architectures and related control schemes known in the art. For example, United States patent nos. 5,142,638 and 5,247,637 disclose multiple-bank shared memory schemes which employ multiplexers to connect the processing units to the shared memory. Multiplexers, however, introduce a data path propagation delay which ultimately restricts the maximum frequency of operation of the memory bus, thereby limiting the bandwidth. This can be partly compensated for by using clocked registers in the data path; however, this comes at the expense of increased latency for the processing units.
Published Japanese patent application no. 10027131 connects the processing units to the shared memory via high-impedance output buffers. High-impedance output buffers face the problem of bus contention. Since these output buffers have a given turn-on and turn-off time, at clock speeds of over 50 MHz it typically takes two (2) clock cycles to switch connectivity of a memory bank from one processing unit to another. This limits bandwidth and increases latency. Furthermore, these devices are unidirectional, so a pair of such devices is needed per bi-directional line and the memory controller needs to control which device of the pair is active depending on the direction of the data flow. Like multiplexers, the high-impedance output buffers also introduce propagation delays which further limit bandwidth.
Finally, WIPO publication no. WO 96/30842 connects processing units to the shared memory via a one-cycle delayed bus switching unit. The bus switching unit described in WO 96/30842 introduces a one-cycle delay in the data path, which adds latency to the requesting processing unit.
The propagation delay disadvantages of the buffers and multiplexers described above can be reduced by using higher-performance logic families such as ECL, for example; however, these have many disadvantages which make them impractical. ECL circuitry consumes relatively high power, which results in system heat management issues, is costly, and has low part density, making it impractical in many applications due to area restrictions and the maximum physical length of printed circuit board wires.
A typical trade-off in multiple-bank memory schemes is that having more banks improves the shared memory bandwidth, but the switching stages required introduce additive delays and/or capacitance which limit the maximum operating frequency of the system. The delays consist of two elements: the data path delays and the switch control delays. Typically the switch controls all change in parallel and are therefore not additive. However, the data path delay is an aggregation of the propagation delay of each component in the switching mechanism or matrix. Thus it is desirable to optimize the data path delay of the switching matrix in order to increase the shared memory bandwidth and reduce latency.
Summary of Invention

Broadly speaking, the invention provides a shared memory system, comprising a plurality of memory banks and a plurality of processing units.
Each processing unit has a memory address bus and a memory data bus, and generates a command signal for requesting a shared memory access. The invention employs at least one bus switching fabric comprising at least one pass gate bus switch. The bus switching fabric is used to connect the memory address bus and the memory data bus of each processing unit to each memory bank. A shared memory controller receives the memory access request signal associated with each processing unit. The shared memory controller controls the bus switching fabric and each memory bank in order to allow each memory bank to be concurrently and asynchronously accessed by different processing units.
The pass-gate bus switches used to construct the bus switching fabric have a negligible propagation delay and a switching time comparable to multiplexers. Moreover, the pass-gate bus switches are inherently bi-directional. The negligible propagation delay maximizes bandwidth and minimizes latency, which results in higher system performance. The pass-gate bus switches are compatible with common logic families, low-cost, and available in high-density packages.
Furthermore, the preferred embodiment employs a pass gate bus switch to select between the row address and the column address in the case where dynamic random access memory (DRAM) is employed for the memory banks. The relatively small delay of the bus switch allows the row address to be latched one cycle earlier than if a typical multiplexer scheme is used, depending upon the clock frequency.
In addition, the preferred embodiment comprises an adder and a second bus switching fabric for connecting the memory data bus of each processing unit from a secondary side of the switching fabric referred to above to the adder. The processing units signal the shared memory controller when an adder access is requested, and the shared memory controller includes means for controlling the bus switching fabrics, the adder and the memory banks in order to load data from a given processing unit to the adder as a first operand, load the contents of one of the memory banks at an address specified by the given processing unit into the adder as a second operand, and write an adder result back into the specified address, provided the adder is not busy.
Brief Description of Drawings

The foregoing and other aspects of the invention will become more apparent from the following description of the preferred embodiment thereof and the accompanying drawings which illustrate, by way of example, the principles of the invention. In the drawings:
Figure 1 is a diagram of a pass-gate bus switching unit;
Figure 2 is a diagram of a 4x2 bus switching fabric constructed out of a plurality of the bus switching units shown in Figure 1;
Figure 3 is a diagram of a 4x4 bus switching fabric constructed out of a plurality of the bus switching units shown in Figure 1;
Figure 4 is a system block diagram of a shared memory system according to the preferred embodiment comprising multiple memory banks;
Figure 5 is a system block diagram of a shared memory controller shown in Figure 4, according to the preferred embodiment;
Figure 6 is a flowchart illustrating logic employed by the shared memory controller shown in Figure 5;
Figure 7 is a diagram illustrating one memory bank (shown in Fig. 5) in greater detail;
Figure 8 is a system block diagram of a shared memory system, according to an alternative embodiment, which employs memory devices that support internal bank interleaving; and

Figure 9 is a timing diagram showing the timing of memory access signals in instances where the memory banks support and do not support internal bank interleaving.
Detailed Description of Preferred Embodiments

Figure 1 illustrates a pass gate bus switch 10 which is a basic building block of a shared memory system 20 according to the preferred embodiment shown in Figure 4. The bus switch 10 has four ports 12 and two control signals 13 and 14. The Enable control signal 13 makes the bus switch active. If the Enable signal 13 is disabled, there is no connection between any of the ports 12. If the Enable signal 13 is active, then the connectivity of the ports 12 depends on the Exchange control signal 14. In one state of signal 14, port A is connected to port C and port B is connected to port D.
In the other state of signal 14, port A is connected to port D and port B is connected to port C. Ports A and B are referred to as the "primary side" of the bus switch 10, while ports C and D
are referred to as the "secondary side" of the bus switch 10. Internally, the connection in the bus switch 10 is made through pass gates such as FET transistors (not shown), which provides a very low impedance connection between the ports as described above.
For all practical purposes this connection looks like a low-valued resistor, such as 5 ohms, thereby making the connection bidirectional and introducing relatively small delay.
This delay, which is typically quoted at about 0.25 ns, is dependent on the total capacitance of the data line.
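For illustration, the connectivity rules just described can be captured in a short behavioural sketch; the Python model and the name BusSwitch are assumptions made for this example only, and it models connectivity rather than electrical behaviour.

```python
# Behavioural sketch of the 2x2 pass-gate bus switch 10 (Figure 1).
# Illustrative names only; connectivity is modelled, not timing or impedance.

class BusSwitch:
    """Four ports: A, B (primary side) and C, D (secondary side)."""

    def __init__(self):
        self.enable = False    # Enable control signal 13
        self.exchange = False  # Exchange control signal 14

    def connections(self):
        """Return the primary-to-secondary port pairs currently connected."""
        if not self.enable:
            return []                        # no connection between any ports
        if not self.exchange:
            return [("A", "C"), ("B", "D")]  # straight-through state
        return [("A", "D"), ("B", "C")]      # exchanged state

switch = BusSwitch()
switch.enable = True
switch.exchange = True
print(switch.connections())  # [('A', 'D'), ('B', 'C')]
```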
The bus switch 10 described above is commercially available, for example from Quality Semiconductor Inc. (part number QS34X383Q3) or Pericom Semiconductor Corp. (part number P15C34X383) of the United States. The bus switches are offered in a high-density package which includes 16 of the bus switches shown in Figure 1, with their control signals grouped by 4. Thus one of these parts can switch 16-bit wide busses.
In the case where wider busses are required, more of these parts are simply connected in parallel and their control lines tied together.
A number of bus switches 10 can be used to generate a bus switching fabric (BSF) of any desired size and width. The size of the bus switching fabric is defined by the number of primary and secondary ports it has. The width is the number of lines in each port. A non-blocking bus switching fabric 16 of size 4x2 is shown in Figure 2. The bus switching fabric primary side comprises ports (12) P1-P4, the secondary side comprises ports (12) S1 and S2, and requires four control signals (18) C1-C4, each of which consists of the Enable signal 13 and the Exchange signal 14.
Bus switches BS3 and BS4 can be a simpler form of bus switch 10 which has 2 primary ports and 1 secondary port. The width of the bus switching fabric shown in Figure 2 is 1, but the structure need only be replicated n times to generate a bus switching fabric of width n, wherein the Cx control signals 18 from each replica are tied together. A
non-blocking bus switching fabric 16' of size 4x4 is shown in Figure 3, with the same nomenclature convention used to describe the 4x2 bus switching fabric of Figure 2. By expanding vertically and horizontally, those skilled in the art will appreciate that a bus switching fabric of any size can be constructed.
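A connectivity-level sketch of such a fabric, assuming only the size, width and non-blocking behaviour stated above, is given below; the class name and the routing-table representation are illustrative, and the internal arrangement of bus switches 10 is not reproduced.

```python
# Connectivity-level sketch of a non-blocking bus switching fabric (Figures 2 and 3).
# Only the mapping of primary ports to secondary ports is modelled.

class BusSwitchingFabric:
    def __init__(self, primaries, secondaries, width):
        self.primaries = primaries      # e.g. ["P1", "P2", "P3", "P4"]
        self.secondaries = secondaries  # e.g. ["S1", "S2"]
        self.width = width              # bus width; all bit slices share control lines
        self.route = {}                 # secondary port -> primary port

    def connect(self, primary, secondary):
        # Non-blocking: any free primary port can reach any free secondary port.
        if secondary in self.route:
            raise ValueError(f"{secondary} already in use")
        if primary in self.route.values():
            raise ValueError(f"{primary} already in use")
        self.route[secondary] = primary

    def disconnect(self, secondary):
        self.route.pop(secondary, None)

fabric = BusSwitchingFabric(["P1", "P2", "P3", "P4"], ["S1", "S2"], width=32)
fabric.connect("P3", "S1")   # e.g. one processing unit to memory bank A
fabric.connect("P1", "S2")   # concurrently, another processing unit to memory bank B
print(fabric.route)          # {'S1': 'P3', 'S2': 'P1'}
```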
The preferred shared memory system 20 is shown in Figure 4. The system comprises multiple processing units 22 that have access to a shared memory 24 which is divided into memory banks 24A and 24B. In the preferred embodiment, each memory bank is a single port synchronous dynamic random access memory (SDRAM).
In the illustrated embodiment, there are four processing units 22. Each processing unit 22 has a local controller 26 (LC) which generates a command bus 28.
Each processing unit has its address, data and byte enable busses, shown as ref. no. 30, connected to the primary side of a 4x2 bus switch fabric 34 (BSF) either directly or through local registers 36 (LR). The secondary side of the bus switching fabric 34 is connected to the memory banks 24A and 24B. Each memory bank is also connected to an adder circuit 38 through a 2x1 bus switching fabric 40. A shared memory controller (SMC) 42 controls the bus switching fabrics 34 and 40 through control signals 18', the memory banks 24A and 24B through control signals 33, and the adder 38 through control signals 39. The shared memory controller 42 interfaces to each processing unit 22 via its respective command bus 28. Each processing unit 22 optionally includes an isolated local memory 23 (LM).
The command bus 28 (shown in expanded form in Figure 5) comprises several signals, such as Request, Acknowledge, Read/Write, Bank Select, Adder Access, and Burst. All signals originate from the local controller 26 except for Acknowledge, which originates from the shared memory controller 42. The Request signal indicates that the corresponding processing unit 22 requests access to the shared memory 24. The Acknowledge signal indicates that the requested access has been performed. The Read/Write signal indicates whether the requested access is a read or a write.
The Bank Select signal indicates which memory bank 24A or 24B the requested access corresponds to. The Adder Access signal indicates whether or not the access is an "auto-adder"
access. Finally, the Burst signal indicates whether the access is a single access or a burst (multiple words) access.
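The command bus signals listed above can be summarized in a small record; the field names below are illustrative only.

```python
# Sketch of the command bus 28 signals exchanged between a local controller 26
# and the shared memory controller 42. Field names are illustrative.

from dataclasses import dataclass

@dataclass
class CommandBus:
    request: bool = False       # processing unit requests a shared memory access
    acknowledge: bool = False   # driven by the shared memory controller when done
    read: bool = True           # Read/Write: True = read, False = write
    bank_select: int = 0        # which memory bank (e.g. 0 = 24A, 1 = 24B)
    adder_access: bool = False  # True for an "auto-adder" access
    burst: bool = False         # True for a burst (multiple word) access

# Example: a processing unit posts a single write to bank 24B.
cmd = CommandBus(request=True, read=False, bank_select=1)
print(cmd)
```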
The local controller 26 determines if a given processing unit memory access is destined to the shared memory 24 based on the address bus controlled by that processing unit. If it is, then the local controller 26 generates the appropriate signals on the command bus 28 to the shared memory controller 42. The local controller 26 prevents further access to the shared memory 24 by that processing unit until the Acknowledge signal is received from the shared memory controller 42.
The adder 38 is optional but enhances performance and ensures data integrity when processing units perform add operations to the shared memory 24. An adder access write request by a processing unit 22 is signalled via its local controller 26 to the shared memory controller 42 by utilizing an additional address line.
Specifically, the address bus includes a most significant bit (MSB) which signals whether to invoke the automatic adder (e.g., if the MSB is asserted) or directly access the shared memory for the current request (e.g., if the MSB is not asserted), and the remaining lesser significant bits of the address bus identify the memory location for this access request.
When an adder access write is requested by the local controller 26, the shared memory controller 42 controls the bus switching fabrics 34 and 40 to load the write data from the processing unit 22 as the first operand into the adder 38, reads the contents of the appropriate memory bank 24A or 24B at the address designated by the processing unit, loads this value as the second operand into the adder, and then finally writes the result of the addition back into the same location in the appropriate memory bank 24A or 24B. This performs a quick, atomic, read-add-write operation while minimizing shared memory bus utilization and processing unit waiting time. The adder control signals 39 include Load and Output Enable. Load loads the adder 38 with the operand present on bus 31 (which consists only of the data lines of bus 30), and Output Enable outputs the sum of the two operands currently in the adder 38.
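A minimal functional sketch of this auto-adder access follows, assuming for illustration a 24-bit address bus whose most significant bit selects adder access; the actual address width is not specified here, and no bus-level timing is shown.

```python
# Functional sketch of the "auto-adder" write described above.
# ADDR_BITS is an assumed width chosen only for illustration.

ADDR_BITS = 24
ADDER_FLAG = 1 << (ADDR_BITS - 1)   # MSB asserted: invoke the automatic adder

def shared_memory_write(memory, address, write_data):
    location = address & (ADDER_FLAG - 1)        # lesser significant address bits
    if address & ADDER_FLAG:
        # Atomic read-add-write: operand 1 is the write data from the processing
        # unit, operand 2 is the current contents of the addressed location.
        memory[location] = memory[location] + write_data
    else:
        # Direct shared memory write.
        memory[location] = write_data

bank = {0x100: 7}
shared_memory_write(bank, ADDER_FLAG | 0x100, 5)  # adder access write
print(bank[0x100])  # 12
```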
Each processing unit 22 may optionally be connected to local register 36 which provides enhanced performance for write accesses. The local register 36 holds the address and byte enable data for the requested shared memory access, as well as the data word in the case of a write. For write accesses, the processing unit can post its address, data, and byte enable signals 30 into the corresponding local register 36, and proceed with local processing immediately. The corresponding local controller 26 holds back subsequent write accesses to the shared memory 24 until the previous one is completed.
In the case where the processing unit 22 has a multiplexed address and data bus, the local register 36 is necessary to retain the address during the access. In the case of a read access, the local register 36 is used as a transparent buffer from the shared memory system to the processing unit. This introduces delay, which can be traded off against the advantage of the posted write if desired; otherwise the local register 36 can be omitted.
The shared memory controller 42 handles the arbitration between the processing units 22 and controls the bus switching fabric 34 to direct the appropriate processing unit to the desired memory bank 24A or 24B. The shared memory controller 42 also controls each memory bank 24A and 24B, as well as the adder 38, and the bus switching fabric 40 required for the adder 38. Any applicable methods of ensuring memory consistency can be implemented as part of the shared memory controller 42.
Furthermore the arbitration algorithm can be as simple or complex as required.
The shared memory controller 42 connects the processing units 22 to the memory banks 24A or 24B on a clock cycle-by-cycle basis, depending on which processing units request access, which accesses are currently in progress, and which memory banks are currently free. In this fashion, two processing units wanting to access different memory banks can request and start their accesses at the same time.
In addition, the processing units can access the various memory banks asynchronously. For example, if there is an access in progress to one memory bank 24A, another processing unit requesting access to memory bank 24B can begin its access immediately. This allows the quickest possible access from processing unit to memory bank.
The internal architecture of the shared memory controller 42 is shown in Figure 5. An arbiter/dispatcher section 46 makes all the arbitration decisions based on the priority of processing unit requests, the busy status of the memory banks 24A and 24B, and the busy status of the adder 38. The arbiter/dispatcher 46 controls the bus switching fabrics 34 and 40 accordingly and instructs memory bank controllers 48A and 48B (collectively 48) to perform the arbitrated accesses. Thus, the arbiter/dispatcher 46 multiplexes certain of the command bus signals 28 (Read/Write, Burst, Adder Access) from the "winning"
processing unit to the appropriate memory bank controllers 48. Since the memory 24 comprises SDRAM devices, the arbiter/dispatcher 46 also ensures that a Refresh signal is supplied at appropriate intervals.
The flow chart of Figure 6 illustrates the logic employed by the arbiter/dispatcher 46. At a first step 46A, the arbiter/dispatcher 46 filters out those accesses which are presently pending from those which are currently being serviced. At steps 46B - 46D, the pending memory access requests are prioritized for each memory bank 24A and 24B. In the preferred embodiment, a simple priority level is assigned to each processing unit 22.
Once the pending shared memory requests from the processing units 22 have been prioritized, the arbiter/dispatcher 46 determines at step 46E for each memory bank 24A and 24B, whether a given memory bank is busy. If so, control is passed back to the initial step 46A and the shared memory request of the "winning"
processing unit is placed back with the pending requests. However, if the given memory bank is not busy, then at steps 46F and 46G, the arbiter/dispatcher determines (based on a timer) if the given memory bank requires a Refresh, and, if so, signals the corresponding memory controller 48 accordingly. Thereafter control is passed to the initial step 46A as described previously. At step 46H, if no memory access request is pending for the given memory bank then control is passed back to the initial step 46A as described previously.
At steps 46I and 46J, control is passed back to the initial step 46A as described previously if an adder access has been requested by the winning processing unit but the adder 38 is busy. Otherwise, at step 46K, the arbiter/dispatcher 46 controls the bus switching fabrics 34 and 40 to connect the bus 30 of the winning processing unit 22 to the requested memory bank 24A or 24B. The arbiter/dispatcher 46 also multiplexes the R/W, Burst and Adder Access signals of the command bus 28 associated with the winning processing unit to the corresponding memory bank controller 48A or 48B, and asserts a Start signal thereto.
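The following is a simplified per-clock-cycle sketch of this arbitration logic for a single memory bank; the data structures and names are illustrative, and the real arbiter/dispatcher 46 is a hardware state machine rather than software.

```python
# Simplified sketch of the arbiter/dispatcher 46 decision of Figure 6 for one
# memory bank on one clock cycle. Illustrative data structures only.

def arbitrate_bank(pending, in_service, bank, adder, priority):
    """Return the request to dispatch for this bank on this cycle, or None."""
    # Step 46A: consider only requests not already being serviced.
    candidates = [r for r in pending
                  if r not in in_service and r["bank"] == bank["id"]]
    if not candidates:
        return None                      # step 46H: nothing pending for this bank
    if bank["busy"]:
        return None                      # step 46E: leave the requests pending
    if bank["refresh_due"]:
        bank["command"] = "refresh"      # steps 46F/46G: refresh takes precedence
        return None
    # Steps 46B-46D: pick the highest-priority requesting processing unit.
    winner = min(candidates, key=lambda r: priority[r["unit"]])
    if winner["adder_access"] and adder["busy"]:
        return None                      # steps 46I/46J: adder busy, retry later
    bank["command"] = "start"            # step 46K: connect winner and start access
    return winner

pending = [{"unit": 2, "bank": 0, "adder_access": False},
           {"unit": 0, "bank": 0, "adder_access": False}]
bank_a = {"id": 0, "busy": False, "refresh_due": False, "command": None}
print(arbitrate_bank(pending, [], bank_a, {"busy": False},
                     priority={0: 0, 1: 1, 2: 2}))
```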
Each memory bank controller 48A and 48B is directly connected to the corresponding SDRAM control inputs of the corresponding memory bank 24A or 24B.
The memory bank controllers are finite state machines which perform the following SDRAM commands: single read, single write, burst read, burst write, refresh, and adder access. The timing and execution of these commands, except for the previously described adder access, depends on the type of SDRAM employed, the particulars of which will be specified in the SDRAM data sheets. The memory bank controller also controls the row and column address select of its corresponding memory bank, as described in greater detail below.
The memory bank controllers 48 perform the accesses upon receiving the Start signal from the arbiter/dispatcher 46. The controllers 48 indicate that the access is complete by asserting a Done signal, which the arbiter/dispatcher 46 translates as the Acknowledge signal back to the appropriate processing unit. The Done signal is asserted early so that the Acknowledge signal reaches the local controller 26 with timing such that the data/address contained in the local register is replaced by information from a pending transaction as soon as the previous information is no longer needed. The arbiter/dispatcher 46 uses the Done signal, delayed appropriately, to know which memory bank is busy.
The adder control signals of controllers 48 are logically OR'd together and connected to the adder 38 since there is only one such device. The arbiter/dispatcher 46 ensures that only one memory bank controller 48 is accessing the adder 38 at any given point in time.
In the preferred embodiment, the shared memory controller 42 is implemented in hardware, preferably with an ASIC using well known programming tools. Those skilled in the art will appreciate that each iteration of the Fig. 6 flowchart may take place during one clock cycle, so that the control signals 18' to the bus switching fabric 34 may be applied during that clock cycle, or the very next clock cycle.
Figure 7 illustrates memory bank 24A or 24B in greater detail. Because an SDRAM device 60 is used as the memory element, it is necessary to latch in the row address separately from (i.e. on a different clock cycle than) the column address.
Therefore, a multiplexing function is needed to select between the row and column addresses externally. In the preferred embodiment, a 2x1 pass gate bus switch 62, under the control of the shared memory controller 42, is used to select between the row and column addresses. (With SDRAM, as opposed to DRAM, the circuit is complicated by the special function auto-precharge (AP) pin, which is used as address line no. 10 when the row address is enabled and as the AP function when the column address is enabled.) Referring additionally to Fig. 4, it will be seen that the address bus starts from the local register 36 and the signals must propagate through the bus switching fabric 34 as well as the bus switching fabric 62. When the shared memory controller 42 starts a memory access, it switches the bus switching fabric 34 to connect the local register 36 to the memory bank. A delay is incurred from this switching. If a conventional multiplexer is used to implement the DRAM address multiplexing function, then there is another propagation delay incurred through the conventional multiplexer which, at high bus frequencies, e.g. at or over 50 MHz, will cause a cumulative signal delay greater than the clock period and hence force a one-cycle delay before the access starts.
However, when the bus switching fabric 62 is used to implement the DRAM address multiplexing function there is no cycle penalty due to its low propagation delay.
Accordingly, the bus switching fabric 62 enables the memory access to begin on the same clock cycle as the shared memory controller grants shared memory access to a particular processing unit.
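As an illustration of the row/column address multiplexing performed by bus switch 62, the sketch below assumes a 12-bit row address and a 9-bit column address; this split is chosen only for the example and depends in practice on the SDRAM device used.

```python
# Sketch of the DRAM address multiplexing function implemented by bus switch 62.
# ROW_BITS and COL_BITS are assumed widths chosen for illustration.

ROW_BITS, COL_BITS = 12, 9

def split_address(addr):
    """Split a flat shared memory address into its row and column parts."""
    column = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    return row, column

def address_mux(addr, select_row):
    """Model the 2x1 pass gate switch: present the row or column to the SDRAM."""
    row, column = split_address(addr)
    return row if select_row else column

addr = 0x12345
print(address_mux(addr, select_row=True))   # row phase (RAS)
print(address_mux(addr, select_row=False))  # column phase (CAS)
```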
Figure 8 illustrates an alternative embodiment wherein each memory bank 24A or 24B supports internal bank interleaving (e.g. SDRAM). The shared memory controller 42 also controls these memory banks using the pass-gate bus switches and the SDRAM control signals. The alternative embodiment is similar to the non-interleaving preferred embodiment shown in Figure 4, except that the system now includes two 4x2 bus switching fabrics 50 and 52, one for the data and byte enable buses from the four processing units, and one for the address buses of the processing units.
Controlling the address buses separately from the data and byte enable buses enables the shared memory controller 42 to interleave memory access commands in successive clock cycles, whereby two (or more) processing units can access the same memory bank more efficiently in terms of clock cycles needed.
With reference to Figure 9(a), a non-interleaved memory access consists of many command cycles, some of which are "wait" cycles. The specific commands executed over consecutive clock cycles in the case of SDRAM having four internal banks are:
RAS (row address select) - clocks in the row address from a processing unit to a particular SDRAM, which selects the appropriate one of four internal banks based on the two most significant address bits;
CAS (column address select) - clocks in the column address from the processing unit;
WAIT - waits for the SDRAM to complete the previous command;
DEAC - deactivates a given SDRAM internal bank.

Internal bank interleaving utilizes these wait cycles to interleave in the access to a different internal bank. Figure 9(b) illustrates an example where two processing units simultaneously access the same memory bank but different internal banks thereof. In the alternative embodiment, the shared memory controller 42 thus interleaves both accesses in order to economize on the total number of cycles.
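A toy sketch of this interleaving is shown below; the command sequence is assumed for illustration (real SDRAM timing depends on the device), and the wait requirements of the second access are ignored for simplicity.

```python
# Toy sketch of internal bank interleaving (Figure 9). The WAIT slots of one
# access are used to issue the commands of a second access to a different
# internal bank of the same memory bank.

ACCESS = ["RAS", "CAS", "WAIT", "WAIT", "DATA", "DEAC"]  # assumed sequence

def interleave(seq_a, seq_b):
    """Merge two command sequences; B's commands fill A's WAIT cycles."""
    timeline, b = [], [c for c in seq_b if c != "WAIT"]
    for cmd in seq_a:
        if cmd == "WAIT" and b:
            timeline.append("B:" + b.pop(0))     # reuse A's wait cycle for B
        else:
            timeline.append("A:" + cmd)
    timeline += ["B:" + c for c in b]            # any B commands that did not fit
    return timeline

# Two accesses to different internal banks complete in 8 cycles instead of 12.
print(interleave(ACCESS, ACCESS))
```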
Since the SDRAM command cycles need address information from a different processing unit than the one whose read data is being provided or output by the SDRAM, the bus switching fabric 50 for the data and byte enable buses and the bus switching fabric 52 for the address buses need to be separately controlled.
One advantage of using the bus switches 10 for memory bank selection is that there is negligible data path delay per element, which makes them especially advantageous for a large switch matrix. The only significant delays are the switch control signals, but these are all applied in parallel, and are similar to the delays incurred by the other methods (e.g. high impedance drivers, multiplexers). Thus more processing units and more memory banks can be implemented, without the typical operating speed or latency trade-off discussed previously.
The simplest memory mapping of the memory banks is a consecutive memory map. That is to say that as shared memory addresses increment, the full size of the first memory bank is addressed, then the full size of the second memory bank, and so on until the end of the memory map. This is accomplished by using the most significant address bits as Bank Selects in the command bus 28. The goal is to optimize the accesses of different processing units such that they are accessing different banks most of the time, to allow concurrent accesses. In some processing systems this can be accomplished by carefully controlling the memory access algorithms.
However, in some processing systems it is impossible to statistically control the memory location of accesses. In such cases, maximizing concurrent hits on different banks is done by using low-order address bits as Bank Selects. The lowest address bits are better, with the only constraint being the maximum burst access size in the case of a system that supports bursting. This is because bursts cannot cross memory bank boundaries without considerably complicating the shared memory controller. This results in a memory map that interleaves between memory banks every number of addresses the size of a burst length. This method distributes concurrent accesses as evenly as possible over time, which is important in real-time systems.
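The two bank-select mappings can be illustrated as follows, assuming two memory banks, a burst length of four words, and a total memory size chosen only for the example.

```python
# Sketch of the consecutive and interleaved bank-select mappings described above.
# NUM_BANKS, BURST_LEN and TOTAL_WORDS are assumed illustrative values.

NUM_BANKS = 2
BURST_LEN = 4            # maximum burst size in words
TOTAL_WORDS = 1 << 20    # assumed shared memory size for the consecutive map

def bank_consecutive(addr):
    """Consecutive map: the most significant address bits select the bank."""
    return addr // (TOTAL_WORDS // NUM_BANKS)

def bank_interleaved(addr):
    """Interleaved map: low-order bits just above the burst offset select the bank."""
    return (addr // BURST_LEN) % NUM_BANKS

# Consecutive bursts alternate banks, spreading accesses evenly over time.
print([bank_interleaved(a) for a in range(0, 24, 4)])  # [0, 1, 0, 1, 0, 1]
```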
It will be seen from the foregoing that the preferred embodiment provides the following advantages in instances where DRAM devices are used for the memory bank devices:
• The low propagation delay of addresses through the bus switching fabrics allows the memory cycle to begin on the same or the very next cycle as the shared memory controller makes the decision to start an access from a given processing unit to a given memory bank. Prior art memory bank switching systems necessitated at least one cycle of delay.

• The low data propagation delay allows the system clock to be increased to over 50 MHz without a corresponding increase in latency in both the read and write cases.

• Burst accesses can occur with a data word every clock cycle at over 50 MHz, instead of every other cycle as in prior art designs with many delay elements in the data path.

• In the event memories having internal banks are used, the addresses from multiple processing units can be switched with zero-cycle overhead to maximize the internal bank interleaving bandwidth.

• The fact that the bus switches are bidirectional results in no bandwidth being wasted due to high-impedance bus turnaround. For example, accesses from different processing units to a given memory bank can occur back-to-back, thereby optimizing the bandwidth.
Those skilled in the art will appreciate that a variety of modifications and variations may be made to the embodiments disclosed herein without departing from the spirit and scope of the invention.

Claims (10)

Claims
1. A shared memory system, comprising:
a plurality of memory banks;
a plurality of processing units, each processing unit generating a signal for requesting a memory access, each processing unit having a memory address bus and a memory data bus;
at least one bus switching fabric comprising at least one pass gate bus switch, said at least one bus switching fabric for connecting the memory address and data buses of each processing unit to each of the memory banks; and a shared memory controller which receives the memory access request signal associated with each processing unit and comprises means for controlling the at least one bus switching fabric and each memory bank in order to allow each memory bank to be concurrently accessed by different processing units.
2. The shared memory system according to claim 1, wherein the shared memory controller includes means for arbitrating simultaneous memory access requests to the same memory bank by different processing units.
3. The shared memory system according to claim 2, wherein the shared memory controller includes means for signalling an acknowledgement to the processing units when a memory access request has been completed.
4. The shared memory system according to claim 2, wherein each memory bank is a single port memory device.
5. The shared memory system according to claim 2, wherein the at least one bus switching fabric comprises one bus switching fabric for connecting the memory address and data busses of each processing unit to the memory banks.
6. The shared memory system according to claim 1, wherein:
each of the memory banks supports internal bank interleaving, the at least one bus switching fabric comprises a first bus switching fabric for connecting the memory address bus of each processing unit to each memory bank and a second bus switching fabric for connecting the memory data bus of each processing unit to each memory bank, and the shared memory controller includes means for controlling the first and second bus switching fabrics and each memory bank in order to interleave access to different internal banks of a given memory bank by different processing units.
7. The shared memory system according to claim 1, wherein the shared memory controller comprises an arbiter/dispatcher connected to a plurality of memory bank controllers, the number of memory bank controllers being equal to the number of memory banks, wherein each memory bank controller generates memory cycle signals to control access to the corresponding memory bank, and wherein the arbiter/dispatcher arbitrates amongst simultaneous processing unit requests to access a given memory bank and signals the corresponding memory bank controller to start a memory cycle in respect of a winning processing unit.
8. The shared memory system according to claim 7, wherein the memory cycle begins on the same or the very next clock cycle that the arbiter/dispatcher signals the memory bank controller to start a memory cycle.
9. The shared memory system according to claim 1, further comprising:
an adder; and a bus switching fabric for connecting the memory data bus of each processing unit from a secondary side of the at least one switching fabric to the adder;
wherein the processing units signal the shared memory controller when an adder access is requested, and the shared memory controller includes means for controlling the bus switching fabrics, the adder and the memory banks in order to load data from a given processing unit to the adder as a first operand, load the contents of one of said memory banks at an address specified by the given processing unit into the adder as a second operand, and write an adder result back into the specified address, provided the adder is not busy.
10. The shared memory system according to claim 2, wherein a given memory bank is a single port dynamic random access memory (DRAM) device and comprises a pass gate bus switching fabric under the control of the shared memory controller for carrying out a DRAM address multiplexing function.
CA 2239426 1998-06-03 1998-06-03 Shared memory system Abandoned CA2239426A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA 2239426 CA2239426A1 (en) 1998-06-03 1998-06-03 Shared memory system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA 2239426 CA2239426A1 (en) 1998-06-03 1998-06-03 Shared memory system

Publications (1)

Publication Number Publication Date
CA2239426A1 true CA2239426A1 (en) 1999-12-03

Family

ID=29275820

Family Applications (1)

Application Number Title Priority Date Filing Date
CA 2239426 Abandoned CA2239426A1 (en) 1998-06-03 1998-06-03 Shared memory system

Country Status (1)

Country Link
CA (1) CA2239426A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004107184A2 (en) * 2003-05-27 2004-12-09 Intel Corporation A method and apparatus to improve multi-cpu system performance for accesses to memory
WO2004107184A3 (en) * 2003-05-27 2005-01-27 Intel Corp A method and apparatus to improve multi-cpu system performance for accesses to memory
GB2416055A (en) * 2003-05-27 2006-01-11 Intel Corp A method and apparatus to improve multi-CPU system performance for accesses to memory
GB2416055B (en) * 2003-05-27 2007-03-21 Intel Corp A method and apparatus to improve multi-CPU system performance for accesses to memory
US7404047B2 (en) 2003-05-27 2008-07-22 Intel Corporation Method and apparatus to improve multi-CPU system performance for accesses to memory
US8699277B2 (en) 2011-11-16 2014-04-15 Qualcomm Incorporated Memory configured to provide simultaneous read/write access to multiple banks
WO2013075013A1 (en) * 2011-11-16 2013-05-23 Qualcomm Incorporated Memory configured to provide simultaneous read/write access to multiple banks
EP3082048A1 (en) * 2011-11-16 2016-10-19 Qualcomm Incorporated Memory configured to provide simultaneous read/write access to multiple banks
EP2725497A1 (en) * 2012-10-23 2014-04-30 Analog Devices, Inc. Memory arbitration circuit and method
CN103778085A (en) * 2012-10-23 2014-05-07 美国亚德诺半导体公司 Memory architecture
US9201828B2 (en) 2012-10-23 2015-12-01 Analog Devices, Inc. Memory interconnect network architecture for vector processor
US9342306B2 (en) 2012-10-23 2016-05-17 Analog Devices Global Predicate counter
CN103778085B (en) * 2012-10-23 2018-05-22 美国亚德诺半导体公司 memory architecture

Similar Documents

Publication Publication Date Title
US5729709A (en) Memory controller with burst addressing circuit
US5815167A (en) Method and apparatus for providing concurrent access by a plurality of agents to a shared memory
US5398211A (en) Structure and method for providing prioritized arbitration in a dual port memory
US7136958B2 (en) Multiple processor system and method including multiple memory hub modules
US6532525B1 (en) Method and apparatus for accessing memory
US5608896A (en) Time skewing arrangement for operating memory devices in synchronism with a data processor
US6393531B1 (en) Queue based data control mechanism for queue based memory controller
EP0479428B1 (en) Data processing apparatus for dynamically setting timings in a dynamic memory system
US5265212A (en) Sharing of bus access among multiple state machines with minimal wait time and prioritization of like cycle types
US6408367B2 (en) Data path architecture and arbitration scheme for providing access to a shared system resource
US6282603B1 (en) Memory with pipelined accessed and priority precharge
US5623638A (en) Memory control unit with programmable edge generator to minimize delay periods for critical DRAM timing parameters
US20030063502A1 (en) Distributed write data drivers for burst access memories
US5907857A (en) Refresh-ahead and burst refresh preemption technique for managing DRAM in computer system
JP3039557B2 (en) Storage device
US6502173B1 (en) System for accessing memory and method therefore
US5901298A (en) Method for utilizing a single multiplex address bus between DRAM, SRAM and ROM
CA2239426A1 (en) Shared memory system
KR100869938B1 (en) Embedded memory access method and system for application specific integrated circuits
US5627968A (en) Data transfer apparatus which allows data to be transferred between data devices without accessing a shared memory
EP1588276B1 (en) Processor array
KR20050081498A (en) Memory bank interleaving method and apparatus in the multi-layer bus system
JPH04229488A (en) Virtual multi-port ram structure
US6499087B1 (en) Synchronous memory sharing based on cycle stealing
JP2534321B2 (en) Data transfer control method and apparatus

Legal Events

Date Code Title Description
FZDE Discontinued
FZDE Discontinued

Effective date: 20040603