GB2286700A - Dynamically reconfigurable memory processor


Info

Publication number
GB2286700A
GB2286700A
Authority
GB
United Kingdom
Prior art keywords
memory
processor
processors
group
memory devices
Prior art date
Legal status
Granted
Application number
GB9421571A
Other versions
GB2286700B (en)
GB9421571D0 (en)
Inventor
Kenneth Wayne Iobst
David Robert Resnick
Kenneth Roy Wallgren
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority claimed from GB9118071A (published as GB2252185B)
Publication of GB9421571D0
Publication of GB2286700A
Application granted
Publication of GB2286700B
Anticipated expiration
Status: Expired - Fee Related

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57 - Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575 - Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8015 - One dimensional arrays, e.g. rings, linear arrays, buses
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 - Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1012 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using codes or arrangements adapted for a specific type of error
    • G06F11/1016 - Error in accessing a memory location, i.e. addressing error

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Multi Processors (AREA)
  • Image Processing (AREA)

Abstract

A reconfigurable memory processor comprises a plurality of memory devices (50, 52, 54, 56); a plurality of first processors (58, 60, 62, 64) associated with said memory devices, respectively; first selector means (66) connecting the outputs of said memory devices with the inputs of said first processors, whereby an input to each first processor comprises an output from one of said memory devices; second selector means (68, 70, 72, 74) connecting the output of each of said first processors with the input of the memory device associated with said first processor, the output of each memory device further being connected with said second selector means, said second selector means comprising a plurality of multiplexers connected with said plurality of memory devices, respectively; decoder means (78) for controlling said second selector means to select as an input to said memory devices one of said memory device and first processor outputs; and a plurality of said memory devices and said first processors are arranged in a group, said group including a single first selector means and a single decoder, whereby the plurality of first processors is effectively reduced to a single processor and the amount of memory available to the single processor is increased by a factor of the number of memory devices.

Description

DYNAMICALLY RECONFIGURABLE MEMORY PROCESSOR

This application is divided from GB-A-2 252 185, which concerns apparatus for processing data from memory and from other processors.
Research on a Parallel SIMD Simulation Workbench (PASSWORK) has demonstrated that multiple instruction multiple data (MIMD) vector machines can simulate at nearly full speed the global routing and bit-serial operations of commercially available single instruction multiple data (SIMD) machines.
Hardware gather/scatter and vector register corner turning are key to this kind of high performance SIMD computing on vector machines, as disclosed in the pending Iobst U.S. patent application Serial No. 533,233, titled Apparatus for Performing a Bit Serial Orthogonal Transformation Instruction. In a direct comparison between vector machines and SIMD machines, the only other significant limits to SIMD performance are memory bandwidth and the multiple logical operations required for certain kinds of arithmetic, i.e. full add on a vector machine or tallies across the processors on a SIMD machine.
Results of this research suggest that a good way to support both MIMD and SIMD computations on the same shared memory machine is to fold SIMD into conventional machines rather than design a completely new machine.
Even greater SIMD performance on conventional machines may be possible if processors and memories are integrated onto the same chip. More specifically, if one were to design a new kind of memory chip (a process-in-memory chip or PIM) that associates a single-bit processor with each column of a standard random access memory (RAM) integrated circuit (IC), the increase in SIMD performance might be several orders of magnitude. It should also be noted that this increase in performance should be possible without significant increases in electrical power, cooling and/or space requirements.
This basic idea breaks the von Neumann bottleneck between a central processing unit (CPU) and memory by directly computing in the memory, and allows a natural evolution from a conventional computing environment to a mixed MIMD/SIMD computing environment. Applications in this mixed computing environment are just now beginning to be explored.
A PIM chip combines memory and computation on the same integrated circuit, maximizing instruction/data bandwidth between processors and memories by eliminating most of the need for input/output across data pins. The chip contains multiple single-bit computational processors that are all driven in parallel, and encompasses processor counts from a few to possibly thousands on each chip. The chips are then put together into groups or systems of memory banks that enhance or replace existing memory subsystems in computers from personal computers to supercomputers.
According to an aspect of the invention, there is provided a dynamically reconfigurable memory processor, comprising (a) a plurality of memory devices, each having an input and an output; (b) a plurality of first processors associated with said memory devices, respectively, each of said processors having an input and an output; (c) first selector means connecting the outputs of said memory devices with the inputs of said first processors, whereby an input to each first processor comprises an output from one of said memory devices; (d) second selector means connecting the output of each of said first processors with the input of the memory device associated with said first processor, the output of each memory device further being connected with said second selector means, said second selector means comprising a plurality of multiplexers connected with said plurality of memory devices, respectively; (e) decoder means for controlling said second selector means to select as an input to said memory devices one of said memory device and first processor outputs; and (f) a plurality of said memory devices and said first processors are arranged in a group, said group including a single first selector means and a single decoder, which are operable to reconfigure said group of memory devices and first processors between a first mode of operation wherein a single memory device is available to any number of said plurality of first processors and a second mode of operation wherein any number of said plurality of memory devices in said group is available to a single processor, whereby the plurality of first processors is effectively reduced to a single processor and the amount of memory available to the single processor is increased by a factor of the number of memory devices.
In further embodiments, the memory processor further comprises a network for implementing a generalized parallel prefix mathematical function across an arbitrary associative operator, including (a) means defining a plurality of successive levels of communication, a first level being zero; (b) means defining a plurality of successive groups of second processors within each of said levels, each group comprising 2^l second processors where l is the level number; (c) each second processor within a group having associated therewith a single input comprising an output from a preceding group, whereby a sequence of instructions is issued corresponding to the levels from zero through level l to compute a parallel prefix of 2^l values; and (d) the inputs in level one and subsequent levels being associated with a single second processor per group that has received all of the previous inputs.
Preferably, said groups within a level are arranged in sequential pairs, with one group of each pair sending data to the other group of said pair to define a mathematical operation of the parallel prefix.
Conveniently, the output from a last group of a level of groups can selectively drive the inputs of the first group of all levels.
In a preferred embodiment, the memory processor further comprises a plurality of networks wherein the output from the last group of a level of groups of one network can selectively drive the inputs of the first group of all levels of another network.
Preferably, the memory processor further comprises a means for detecting system errors at a memory chip level, comprising (a) means for detecting parity errors on multibit interfaces coming on to the chip and means for retaining the state thereof; (b) means for detecting errors of the memory array row decoder circuitry and means for retaining the state thereof; and (c) means for detecting and correcting single bit memory errors and means for detecting double bit memory errors and retaining the state thereof.
Conveniently, the memory processor further comprises means for subdividing a row of memory devices into correction subgroups, each of which comprises a plurality of columns, alternate columns being connected with separate error detecting correction circuits.
In preferred embodiments, the memory processor further comprises means for reading said error states from the chip and simultaneously clearing the error states.
Preferably, the memory processor further comprises means for separately maintaining the single bit error state and the multibit error state for maintenance purposes.
Other objects and advantages of the invention will become apparent from a study of the following specification when viewed in the light of the accompanying drawings, in which: Fig. 1 is a block diagram of a PIM chip.
Fig. 2 is a schematic view of a bit-serial processor of the PIM chip of Fig. 1; Fig. 3 is a diagram illustrating the global-or/parallel prefix network of the PIM chip of Fig. 1; and Fig. 4 is a block diagram of a reconfigurable memory processor for column reduction of the memory array embodying the present invention.
Referring first to Fig. 1, the architecture of a process-in-memory (PIM) circuit will be described.
The basic components of the circuit are bit-serial processors 2 with an attached local memory 4. The local memory can move one bit to or from a respective bit-serial processor during each clock cycle through error correction circuit (ECC) logic 10.
(Thus the clock rate for a PIM design is set by the memory access plus the ECC time.) Alternatively, the memory can do an external read or write during each clock cycle, after being processed through the ECC logic. There is also added logic to provide communication paths between processor elements on chip and between chips.
The memory associated with a bit-serial processor is viewed as a memory column one bit wide.
The columns are catenated together forming a memory array 6. A set of bit serial processors are similarly catenated together and are normally viewed as sitting functionally below the memory array.
This means that a single row address to the memory array will take or provide one bit to each of the bit-serial processors, all in parallel. All memory accesses, internal and external references and both read and write operations are parallel operations.
This means that during a PIM instruction, the column address bits are unused. The normal column decoders and selectors for external references are moved to allow for the difference in chip architecture and for ECC processing and the resultant change in timing. The memory array also includes an extra check column 8 as will be developed in greater detail below.
Arranged between the memory array 6 and the processors 2 is an error detection and correction circuit 10 including a row decode checker 12 which will be discussed in greater detail below.
An R register 14 is provided between the error detection and correction circuit 10 and the processors 2 to implement pipelining to overlap the loading and storing of memory data with the processing of other data.
The PIM chip can perform in two modes: as a normal read/write memory or for computation (PIM mode). Capability is added through computational processors 2 and added control lines 16 to have the processors compute a result in place of a memory access cycle.
When the chip is used for computation, an address is presented to the row decoder 18 from the chip pins. As a result, a row of data is fetched from the memory. The data is error corrected and latched into the R register at the end of the clock cycle/beginning of the next clock cycle. In the next clock cycle, the processors use the data as part of the computational sequence under control of the external control and command lines 16. If a computed result is to be stored into memory from the processors, the memory load cycle is replaced with a store cycle. Error correction data is added to the store data on its way to the memory array.
When the chip is being used for normal writing, data is first read from memory 4, error corrected, and then merged with the write data from a write decoder 20 before being placed into the R register 14. The contents of the R register, with the new data, are then routed back through the error correction logic on the way to memory. This is required because the number of bits coming onto the chip through the write port is less than the amount of data written to memory. This merge pass allows proper error correction information to be regenerated for the words being written.
When used for normal reads, a row of data is taken from memory, error corrected and placed into the R register. In the next clock cycle, column address bits choose the proper subset of bits to be sent off chip from the read selector 22.
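To make this sequencing concrete, the following Python sketch models the normal read and write paths just described. It is an illustration only, not the patent's implementation: the ECC step is a stub, and the function names and row geometry are assumptions (the four-bit external width follows the data interface mentioned later in the text).

```python
# Hypothetical sketch of the external read/write sequence described above.
# The memory is modeled as 2048 rows of 312 bits; ECC is stubbed to identity.

ROW_BITS = 312          # 256 data columns plus check columns, as in the text
EXT_WIDTH = 4           # four-bit external data interface assumed from the text

memory = [[0] * ROW_BITS for _ in range(2048)]
r_register = [0] * ROW_BITS

def ecc_correct(row):
    return list(row)    # placeholder for the SECDED correction logic

def external_read(row_addr, col_addr):
    global r_register
    # Cycle 1: fetch the whole row, correct it, latch it into the R register.
    r_register = ecc_correct(memory[row_addr])
    # Cycle 2: the column address selects the bits driven off chip.
    base = col_addr * EXT_WIDTH
    return r_register[base:base + EXT_WIDTH]

def external_write(row_addr, col_addr, data_bits):
    global r_register
    # Cycle 1: read-merge - the old row is corrected and the new bits merged in.
    r_register = ecc_correct(memory[row_addr])
    base = col_addr * EXT_WIDTH
    r_register[base:base + EXT_WIDTH] = data_bits
    # Cycle 2: checkbits would be regenerated here; the whole row is written back.
    memory[row_addr] = list(r_register)
```

The merge pass in external_write mirrors the reason given above: fewer bits arrive at the pins than are written to the row, so the check information must be regenerated over the merged word.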
In the illustrated embodiment, there are 256 processors, which, when SECDED checkbyte columns are added, give a total of 312 columns in the memory array. Each column is expected to be 2K bits tall.
Thus, the memory will contain 2048 x 312 = 638,976 (624K) bits. There is no requirement that the memory array physically be built in this configuration, as others will work as well.
Each processor on a PIM chip is a bit-serial computation unit. All processors are identical and are controlled in parallel; that is, all processors perform the same operation, all at the same time, all on different data. The processors thus implement a SIMD computation architecture.
Referring now to Fig. 2, a one bit-serial processor will be described in greater detail. The processor includes several multiplexers 24, 26, 27, 28, 30, 31, 32, 33, 34, 36, 37 feeding a fixed function arithmetic logic unit (ALU) 38 including means to conditionally propagate the results of computations to other processors or to memory.
The ALU 38 takes three input signals called A, B, and C and computes three fixed functional results of the three inputs. The results are Sum (A ⊕ B ⊕ C), Carry (A·B + A·C + B·C) and String Compare (C + (A ⊕ B)). Using the capabilities of the multiplexers, a full set of logic operations can be implemented from the Carry function. For example, by blocking the C input (force C = 0) the AND of A and B can be computed, and by forcing the C input (make C = 1) the OR of A and B can be computed.
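As a quick check on these functions, here is a one-bit Python model (an illustrative sketch, not part of the specification; the names are invented) that evaluates the three fixed results and shows how forcing the C input yields AND and OR from the Carry output:

```python
# The three fixed ALU functions as given above, for one-bit inputs.
def alu(a, b, c):
    total = a ^ b ^ c                       # Sum: A xor B xor C
    carry = (a & b) | (a & c) | (b & c)     # Carry: majority of A, B, C
    string_compare = c | (a ^ b)            # String Compare: C or (A xor B)
    return total, carry, string_compare

# Deriving logic operations through the input multiplexers:
a, b = 1, 0
_, and_ab, _ = alu(a, b, 0)   # forcing C = 0 makes Carry = A and B
_, or_ab, _ = alu(a, b, 1)    # forcing C = 1 makes Carry = A or B
assert and_ab == (a & b) and or_ab == (a | b)
```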
Several multiplexers choose the data paths and functions within the processor. Data sources that drive the multiplexers come from memory, other processors via internal communication networks, or internally generated and saved results.
There are three primary multiplexers 24, 26, 28 which feed the A, B, and C inputs of the ALU. Each of the multiplexers is controlled by separate control/command lines. In the drawing, control lines are shown as Fn where n is a number from 0 to 20. All control lines originate off chip. Each of the multiplexers 24, 26, 28 is driven by three separate control lines. Two of the lines are decoded to select one of four inputs while the third control line inverts the state of the selected signal. The first multiplexer 24 can select, under control of the control lines, the previous output of the multiplexer 24 from the last clock cycle (this state saved by a flip-flop 40 associated with the multiplexer 24), the data just being read from memory, either the Sum or Carry results from the ALU where the selection between these two signals is made by another multiplexer 30 driven by another control/command line, and logic zero. Any of these signals can be routed to the A input of the ALU, possibly inverted, on any clock cycle.
The second multiplexer 26 has the same data inputs as the first multiplexer 24 except that the first input is from a second level multiplexer 27 which selects from various communications paths or returns some previously calculated results. The control lines are separate from the control lines to the first multiplexer though they serve identical functions. Just as for the first multiplexer, data sent to the ALU can be inverted as required.
The third multiplexer 28 can select from the previous output of the third multiplexer from the last clock cycle (this state saved by a flip-flop 42 associated with the third multiplexer), the same communication multiplexer 27 that feeds the second multiplexer 26, either the Carry or String Compare result from the ALU where the selection between these two signals is made by another multiplexer 32 driven by another control/command line, and logic zero. The selected datum, possibly inverted, is sent to the ALU under control of three separate control lines.
Any SIMD machine needs a mechanism to have some processors not perform particular operations. The mechanism chosen for PIM is that of conditional storage: instead of inhibiting some processors from performing a command, all processors perform the command but some do not store the result(s) of the computation. To perform this kind of conditional control, three flip-flops 35 are added to the processor along with multiplexers 31, 33, 36 and 37. On any cycle the multiplexer can choose any of the three or can choose a logic zero.
Just as in the previous multiplexers, the state of the selected input can be inverted. Thus, for example, selecting the logic zero as input can force the output to logic one by causing the inverted signal/command to be active.
The SIMD instruction sequence being executed loads the old data from memory into the flip-flop associated with the A multiplexer and routes the computed result from the ALU through the B multiplexer. If the multiplexer 33 that is fed by the multiplexer 36 is outputting a logic one, B data is gated to the memory store path; otherwise, the data from the A multiplexer is gated.
Data is loaded into the store enable flip-flops 35, in general, from data loaded from memory through the multiplexer 26 or from the ALU as a computed result through multiplexers 26 or 28. A command line chooses one result or the other through another multiplexer 34 and further command lines choose which (if any) of the store enable bits 35 is to load.
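The net effect of the store-enable mechanism can be summarized in a few lines of Python. This sketch is hypothetical (the function and variable names are invented for illustration); it shows all processors computing while only the enabled ones commit their results:

```python
# Hypothetical model of conditional storage: every processor computes,
# but the store-enable bit chooses between the new result and the old data.
def conditional_store(old_bits, computed_bits, store_enable):
    # Per processor: gate the computed B-path result if enabled,
    # otherwise copy back the old A-path data fetched from memory.
    return [new if en else old
            for old, new, en in zip(old_bits, computed_bits, store_enable)]

old      = [0, 1, 0, 1]
computed = [1, 1, 1, 1]
enable   = [1, 0, 0, 1]          # only processors 0 and 3 store the result
assert conditional_store(old, computed, enable) == [1, 1, 0, 1]
```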
Data can be routed from each processor to networks that provide communication between the processors on and off the PIM chip. There are two different networks, called the Global-Or network (GOR) and the Parallel Prefix network (PPN). GOR serves to communicate in a Many-to-One or One-to-Many fashion while PPN serves to allow Many-to-Many communication.
Data sent to GOR is gated with one of the store enable bits 35. This allows a particular processor to drive the GOR network by having that processor's store enable bit be a logic one while the other processors have a logic zero enable bit.
Alternatively, all processors on chip can drive the GOR network and provide the global-or of all processors back to individual processors or to a higher level of off chip control. The on chip global-or across all processors is performed through the multilevel OR gate 49.
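Functionally, the GOR network reduces the gated processor outputs with OR and can broadcast the single result back, as in this hedged Python sketch (names invented for illustration):

```python
# Illustrative model of the Global-Or network: each processor's datum is
# gated by its store-enable bit and all gated bits are ORed together; the
# single result can then be broadcast back to every processor.
def global_or(data_bits, enable_bits):
    result = 0
    for datum, enable in zip(data_bits, enable_bits):
        result |= datum & enable       # gate each driver with its enable bit
    return [result] * len(data_bits)   # one-to-many broadcast of the OR

# Many-to-one: a single enabled processor drives the network.
assert global_or([0, 1, 0, 0], [0, 1, 0, 0]) == [1, 1, 1, 1]
# With all processors enabled the network returns the global OR of all.
assert global_or([0, 0, 0, 0], [1, 1, 1, 1]) == [0, 0, 0, 0]
```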
Data from both the GOR and PPN networks are selected by multiplexer 27 controlled by separate command lines. This data can be selected by either (or both) of the second and third multiplexers 26, 28.
The parallel prefix network will be described with reference to Fig. 3. This network derives its name from the mathematical function called scan or parallel prefix. The network of Fig. 3 implements this function in a way that allows for a great deal of parallelism to speed up parallel prefix across any associative operator.
The prefix operation over addition is called scan and is defined as Xi = Xi-1 + Yi for i = 1 to n, with X0 = 0; that is:

X1 = Y1
X2 = X1 + Y2
X3 = X2 + Y3
X4 = X3 + Y4

Note the chaining of the operations. When stated this way, each result depends on all previous results. But the equations can be expanded to:

X1 = Y1
X2 = Y1 + Y2
X3 = Y1 + Y2 + Y3
X4 = Y1 + Y2 + Y3 + Y4

Each processor starts with a single data item Y1 through Yn. The PPN allows the processor holding the copy of Y2 to send its data to the processor holding Y1 and at the same time allows the processor holding Y4 to send its data to the processor holding Y3, etc. Each processor will perform the required operation on the data (addition, in this example) and will then make the partial result available for further computation, in parallel with other similar operations, until all processors have a result - X1 in processor one, X2 in processor two, etc.
By implementing this network in hardware and then using it for general processor communication, two benefits are obtained. First, the network allows some functions to be done in a parallel manner that would otherwise be forced into serial execution; second, the network can be implemented very efficiently in silicon, taking little chip routing space for the amount of parallelism achieved.
The network is implemented at all logarithmic levels across the processors. The first level allows processors to send data one processor to the left while receiving data from the processor on its right. The next level allows specific processors to send data to the next two processors on the left.
Succeeding levels double the number of processors receiving the data while cutting in half the number of processors sending data. All processors receive data from all levels. Control lines whose state is controlled by an executing program running externally choose the required level. All processors select the same level.
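The following Python sketch illustrates this logarithmic-level idea. It uses the standard doubling formulation of parallel prefix, in which level l combines values 2^l positions apart; the group wiring described above differs in detail, but the same scan is computed in log2(n) communication levels. All names are illustrative.

```python
# A log-step inclusive scan in the spirit of the PPN levels described above.
# At level l each processor combines its value with the value held 2**l
# positions away, so n results need only log2(n) communication levels.
import operator

def parallel_prefix(values, op=operator.add):
    x = list(values)
    level = 0
    while (1 << level) < len(x):
        stride = 1 << level
        # All processors act simultaneously; model that with a snapshot.
        snapshot = list(x)
        for i in range(stride, len(x)):
            x[i] = op(snapshot[i - stride], snapshot[i])
        level += 1
    return x

ys = [3, 1, 4, 1, 5, 9, 2, 6]
assert parallel_prefix(ys) == [3, 4, 8, 9, 14, 23, 25, 31]
```

Because the operator argument is any associative operation, the same levels serve addition, maximum, logical OR, and so on.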
There are some extensions from a base implementation of PPN. Thus, the connections required to make a level complete are implemented. That is, for example, at level 0 the even numbered processors can send data to the processor on their left even though that is not required by the PPN function. In addition, another level 0 is added to the PPN network which implements data motion in the reverse direction, i.e. to the right. Furthermore, multiplexers 46 and 48 are added to the end of the move data right and left connections that enable communication to be done in an extended mode or in a circular mode. In the circular mode the last processor on chip drives the first processor (and the first drives the last for data moving in the other direction). In the extended mode, the end processors receive data from off chip. This lets communications networks be built that are larger than one chip.
Because of the number of processors and limits set by a maximum practical chip size, the amount of memory available to each processor is limited. Also there will be programs and algorithms that will not be able to make full use of the available number of processors. An attempt to solve both problems at the same time is referred to as column reduction, which will now be described in connection with Fig. 4.
Processors are grouped together so that the formerly private memory for each processor is shared among the group. Additional control lines, which serve as additional address lines, route requested data from a particular memory column to all processors within the group. Each processor within the group thus computes on the same data (remember that all processors, whether part of the group or not, perform the same function). When data is to be stored, the processor that corresponds with the address of the data to be stored is enabled to send the newly computed result to memory, while the processors within the group that do not correspond to the store address copy back the old data that was previously fetched from the store address.
More particularly, a plurality of memory devices 50, 52, 54, 56 have processors 58, 60, 62, 64 associated therewith, respectively. A first selector 66 connects the outputs of the memory devices with the inputs of the processors so that each processor receives as an input the output from one of the memories. A plurality of multiplexers 68, 70, 72, 74 connect the outputs of each processor with the input of the memory device associated therewith. The output of each memory is also connected with the associated multiplexer through a feedback line 76. A decoder 78 controls the multiplexers 68, 70, 72, 74 to select as an input to the memories one of the memory and processor outputs. Thus, the plurality of processors is effectively reduced to a single processor and the amount of memory available to the single processor is increased by a factor of the number of memories.
A plurality of the memory devices and processors can be arranged in a group which includes a single selector and a single decoder.
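A small Python model may help visualize this reconfiguration. It is a sketch under assumptions (the class and method names are invented): the fetch path models selector 66 broadcasting one column to every processor in the group, and the store path models decoder 78 enabling only the addressed memory's multiplexer, with the others selecting the feedback line 76 and so keeping their old contents.

```python
# Hypothetical sketch of column reduction (Fig. 4).
class ColumnReducedGroup:
    def __init__(self, n_columns, depth):
        self.mem = [[0] * depth for _ in range(n_columns)]
        self.n = n_columns

    def fetch(self, column, row):
        # Selector 66: the addressed column drives every processor,
        # so all processors in the group compute on the same data.
        return [self.mem[column][row]] * self.n

    def store(self, column, row, results):
        # Decoder 78 and multiplexers 68-74: only the memory whose address
        # matches takes its processor's result; the rest select the
        # feedback path and rewrite their own old data unchanged.
        for c in range(self.n):
            if c == column:
                self.mem[c][row] = results[c]

group = ColumnReducedGroup(n_columns=4, depth=2048)
inputs = group.fetch(column=2, row=7)        # same bit to all 4 processors
group.store(column=2, row=7, results=[1, 1, 1, 1])
assert group.mem[2][7] == 1 and group.mem[0][7] == 0
```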
The implementation discussed above could be replaced with logic that routes all the memory from a processor group to one processor and routes the result from that processor back to the right store address. This implementation, while functionally correct, introduces extra timing skew into the logic path and would greatly complicate implementation of the conditional storage of data discussed above.
Replacing normal, external, error correction is a set of internal SECDED blocks that correct all data being read from memory (including external reads) and generate checkbytes for data being written. Because the groups are interleaved, a double-bit error in adjacent columns will be seen as two single recoverable errors instead of as a double-bit unrecoverable error. As a trade-off, interleaved 72-bit groups could be considered. A memory group would be 144 columns (144 = 2 x (64 + 8)). There would be two memory groups (instead of the four proposed groups) for a total of 288 columns instead of 312.
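For readers unfamiliar with SECDED, this self-contained Python sketch shows the mechanism on an 8-bit word (a Hamming code plus an overall parity bit). It is illustrative only and is not the chip's actual code layout; the on-chip logic applies the same principle over the wider interleaved groups.

```python
# Illustrative SECDED on 8-bit words: Hamming check bits at power-of-two
# positions plus an overall parity bit for double-error detection.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # non-power-of-two positions
CHECK_POS = [1, 2, 4, 8]

def secded_encode(byte):
    code = [0] * 13                        # code[0] is the overall parity bit
    for i, pos in enumerate(DATA_POS):
        code[pos] = (byte >> i) & 1
    for c in CHECK_POS:                    # each check bit makes the parity
        for pos in range(1, 13):           # over positions containing it even
            if pos != c and (pos & c):
                code[c] ^= code[pos]
    for pos in range(1, 13):
        code[0] ^= code[pos]               # overall parity over the word
    return code

def secded_decode(code):
    code = list(code)
    syndrome = 0
    for pos in range(1, 13):
        if code[pos]:
            syndrome ^= pos                # XOR of positions of set bits
    overall = 0
    for bit in code:
        overall ^= bit                     # zero for a valid codeword
    if syndrome and not overall:
        status = 'double-bit error'        # detected but not correctable
    elif syndrome or overall:
        status = 'corrected single-bit error'
        if syndrome:
            code[syndrome] ^= 1            # the syndrome names the flipped bit
    else:
        status = 'ok'
    byte = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return byte, status

word = secded_encode(0xA7)
word[6] ^= 1                               # inject a single-bit error
assert secded_decode(word) == (0xA7, 'corrected single-bit error')
word[9] ^= 1                               # second error: now uncorrectable
assert secded_decode(word)[1] == 'double-bit error'
```

Interleaving places the two bits of a physically adjacent column pair in different code groups, which is why such a double error decodes as two correctable single errors.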
There is also some other on-chip error detecting logic. The parity of received data and the parity of received addresses are separately checked on receipt, as is the parity of the SIMD command. The parity of read data from the chip is sent along with the data.
There is also an accessed row parity check. The parity of the row portion of the received address is compared to the contents of a special memory column whose contents are the parity of the row actually accessed. Any error detected by any parity or SECDED failure is set into chip status registers.
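A minimal sketch of that row-parity comparison follows, with invented names and the caveat that it only catches decoder faults that change the parity of the row index:

```python
# Sketch of the accessed-row parity check: a special memory column stores
# the parity of each row's own index; on access it is compared with the
# parity of the row address received at the pins.
def addr_parity(addr):
    p = 0
    while addr:
        p ^= addr & 1
        addr >>= 1
    return p

ROWS = 2048
row_parity_column = [addr_parity(r) for r in range(ROWS)]  # written at init

def check_row_decode(received_addr, actually_accessed_row):
    # A decoder fault selects the wrong row; the stored parity then
    # disagrees with the parity of the received address (for faults
    # that change the address parity).
    return addr_parity(received_addr) == row_parity_column[actually_accessed_row]

assert check_row_decode(5, 5)          # a correct decode passes
assert not check_row_decode(5, 4)      # a one-row decoder slip is caught
```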
Chip status may be ascertained through the normal read path or may be accessed through the chip maintenance port.
External read and write timing is affected by the error correction logic. On a read operation, data is read from memory, error corrected, and then put into the R register. The first two address bits are resolved on the way into this register. On a second cycle the addressing selection is completed and data is driven off the part. The addressing and data paths are such that the 64 data columns of an interleaved SECDED group drive one data bit on and off chip.
For external writes, the word at the read address is read, error corrected and then merged with the four write bits into the R register. On the next clock cycle, checkbits are generated from the data held in the register and the whole 312 bits are written. There are registers that hold the external address valid from the second memory cycle so that data and address at the chip pins need only be valid for one clock period.
The last two paragraphs point out that a PIM chip presents a synchronous interface to the outside world. In the case of reading, data becomes valid after the second clock edge from the clock that starts the read operation. At least at the chip level, a new read cycle can be started every clock cycle except that if there is a data error it is desirable to write the corrected data back to memory which would then take another clock cycle. In the write case, the chip is busy for two clock cycles even though data does not need to be valid for both cycles. Of course there is nothing here that should be taken to imply that the PIM chip clock has the same clock rate as that of the remainder of the computer system.
In addition, the PIM chip has several error detection mechanisms. They include: Data Parity Detect and Generate. A fifth bit accompanies the four bit data interface on both reads and writes.
Address Parity. A parity bit is checked for every received address, whether for an external read or write or for a PIM mode reference.
Command parity. A parity bit is checked on every SIMD command.
Row parity. A special column is added to the memory array whose contents are the parity of the referenced row. This bit is compared to the parity of the received row address.
Nothing changes here for column reduction mode.
All these errors along with single-bit and multiple-bit errors detected by the SECDED logic are put into PIM status flip-flops. These may be read through the normal memory access lines or may be read through the chip maintenance port.
The maintenance port is to be JTAG/IEEE 1149.1.
In addition to chip status, some chip test information will be accessed through this port.
There are various bits buried in the chip for control of some of the data paths and to implement some diagnostic features that would otherwise be very difficult (or impossible) to test. Control bits are provided for turning off checkbyte generation. This allows checking the SECDED logic.
What is done is to force the write checkbytes to the same value as would be generated on an all zero data word. Control bits also allow for inverting the compare within the row parity logic. Any PIM reference should then set the row parity error status bit. Other bits provide for PPN data routing.
In summary, the method for detecting system errors at the memory chip level includes the steps of detecting parity errors on multibit interfaces coming on to the chip and retaining the state of each of the detected parity errors. The errors of the memory array row decoder circuitry are next detected and the state of the errors is retained.
Single bit memory errors are detected and corrected and double bit memory errors are detected and the states thereof are retained.
A row of memory devices is subdivided into correction subgroups, each of which comprises a plurality of columns, the alternate columns being connected with separate error detection correction circuits. The error states from the chip are then read and simultaneously cleared. The single bit error state and the multibit error state are separately maintained for maintenance purposes.
PIM mode execution is very similar to ordinary read/write control in that the R/W line is used to distinguish whether the memory reference is a read or a write. In the PIM read mode, the address lines are used for control and the data lines are used to return status/control information to the CPU (one bit per PIM data line). In the PIM write mode, the data lines are used for PIM control and the address lines are used to specify row select across the processors.
While in accordance with the provisions of the patent statute the preferred forms and embodiments have been illustrated and described, it will be apparent to those of ordinary skill in the art that various changes and modifications may be made without deviating from the inventive concepts set forth above.

Claims (9)

  1. A dynamically reconfigurable memory processor, comprising (a) a plurality of memory devices, each having an input and an output; (b) a plurality of first processors associated with said memory devices, respectively, each of said processors having an input and an output; (c) first selector means connecting the outputs of said memory devices with the inputs of said first processors, whereby an input to each first processor comprises an output from one of said memory devices; (d) second selector means connecting the output of each of said first processors with the input of the memory device associated with said first processor, the output of each memory device further being connected with said second selector means, said second selector means comprising a plurality of multiplexers connected with said plurality of memory devices, respectively; (e) decoder means for controlling said second selector means to select as an input to said memory devices one of said memory device and first processor outputs; and (f) a plurality of said memory devices and said first processors are arranged in a group, said group including a single first selector means and a single decoder, which are operable to reconfigure said group of memory devices and first processors between a first mode of operation wherein a single memory device is available to any number of said plurality of first processors and a second mode of operation wherein any number of said plurality of memory devices in said group is available to a single processor, whereby the plurality of first processors is effectively reduced to a single processor and the amount of memory available to the single processor is increased by a factor of the number of memory devices.
  2. A reconfigurable memory processor according to claim 1 and further comprising a network for implementing a generalized parallel prefix mathematical function across an arbitrary associative operator, including (a) means defining a plurality of successive levels of communication, a first level being zero; (b) means defining a plurality of successive groups of second processors within each of said levels, each group comprising 2^l second processors where l is the level number; (c) each second processor within a group having associated therewith a single input comprising an output from a preceding group, whereby a sequence of instructions is issued corresponding to the levels from zero through level l to compute a parallel prefix of 2^l values; and (d) the inputs in level one and subsequent levels being associated with a single second processor per group that has received all of the previous inputs.
  3. A reconfigurable memory processor according to claim 2, wherein said groups within a level are arranged in sequential pairs, with one group of each pair sending data to the other group of said pair to define a mathematical operation of the parallel prefix.
  4. A reconfigurable memory processor according to claim 2 or 3 wherein the output from a last group of a level of groups can selectively drive the inputs of the first group of all levels.
  5. A reconfigurable memory processor according to any of claims 2 to 4, and further comprising a plurality of networks wherein the output from the last group of a level of groups of one network can selectively drive the inputs of the first group of all levels of another network.
  6. A reconfigurable memory processor according to any preceding claim, further comprising a means for detecting system errors at a memory chip level, comprising (a) means for detecting parity errors on multibit interfaces coming on to the chip and means for retaining the state thereof; (b) means for detecting errors of the memory array row decoder circuitry and means for retaining the state thereof; and (c) means for detecting and correcting single bit memory errors and means for detecting double bit memory errors and retaining the state thereof.
  7. A reconfigurable memory processor according to claim 6, further comprising means for subdividing a row of memory devices into correction subgroups, each of which comprises a plurality of columns, the alternate columns being connected with separate error detecting correction circuits.
  8. A reconfigurable memory processor according to claim 6 or 7, further comprising means for reading said error states from the chip and simultaneously clearing the error states.
  9. A reconfigurable memory processor according to any of claims 6 to 8, further comprising means for separately maintaining the single bit error state and the multibit error state for maintenance purposes.
    10. A reconfigurable memory processor substantially as herein described and/or illustrated with reference to the accompanying drawings.
    Amendments to the claims have been filed as follows
    1. A dynamically reconfigurable memory processor comprising: (a) a plurality of memory devices, each having an input and an output; (b) a plurality of first processors associated with said memory devices, respectively, each of said processors having an input and an output; (c) first selector means connecting the outputs of said memory devices with the inputs of said first processors, whereby an input to each first processor comprises an output from one of said memory devices; (d) second selector means connecting the output of each of said first processors with the input of the memory device associated with said first processor, the output of each memory device further being connected with said second selector means, (e) means for controlling said second selector means to select as an input to said memory devices one of said memory device and first processor outputs, whereby the plurality of first processors is effectively reduced to a single processor and the amount of memory available to the single processor is increased by a factor of the number of memory devices.
    2. A reconfigurable memory processor according to claim 1 further comprising a network for implementing a generalized parallel prefix mathematical function across an arbitrary associative operator, including (a) means defining a plurality of successive levels of communication, a first level being zero; (b) means defining a plurality of successive groups of second processors within each of said levels, each group comprising 2^l second processors where l is the level number; (c) each second processor within a group having associated therewith a single input comprising an output from a preceding group, whereby a sequence of instructions is issued corresponding to the levels from zero through level l to compute a parallel prefix of 2^l values; and (d) the inputs in level one and subsequent levels being associated with a single second processor per group that has received all of the previous inputs.
    3. A reconfigurable memory processor according to claim 2 wherein said groups within a level are arranged in sequential pairs, with one group of each pair sending data to the other group of said pair to define a mathematical operation of the parallel prefix.
    4. A reconfigurable memory processor according to claim 2 or 3 wherein the output from a last group of a level of groups can selectively drive the inputs of the first group of all levels.
    5. A reconfigurable memory processor according to any of claims 2 to 4 further comprising a plurality of networks wherein the output from the last group of a level of groups of one network can selectively drive the inputs of the first group of all levels of another network.
    6. A reconfigurable memory processor according to any preceding claim further comprising a means for detecting system errors at a memory chip level, comprising: (a) means for detecting parity errors on multibit interfaces coming on to the chip and means for retaining the state thereof; (b) means for detecting errors of the memory array row decoder circuitry and means for retaining the state thereof; and (c) means for detecting and correcting single bit memory errors and means for detecting double bit memory errors and retaining the state thereof.
    7. A reconfigurable memory processor according to claim 6 further comprising means for subdividing a row of memory devices into correction subgroups, each of which comprises a plurality of columns, alternate columns being connected with separate error detecting correction circuits.
    8. A reconfigurable memory processor according to claims 6 or 7 further comprising means for reading said error states from the chip and simultaneously clearing the error states.
GB9421571A 1991-01-18 1991-08-21 Dynamically reconfigurable memory processor Expired - Fee Related GB2286700B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US64363391A 1991-01-18 1991-01-18
GB9118071A GB2252185B (en) 1991-01-18 1991-08-21 Apparatus for processing data from memory and from other processors

Publications (3)

Publication Number Publication Date
GB9421571D0 GB9421571D0 (en) 1994-12-14
GB2286700A (en) 1995-08-23
GB2286700B GB2286700B (en) 1995-10-18

Family

ID=26299433

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9421571A Expired - Fee Related GB2286700B (en) 1991-01-18 1991-08-21 Dynamically reconfigurable memory processor

Country Status (1)

Country Link
GB (1) GB2286700B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2399899A (en) * 2003-03-27 2004-09-29 Micron Technology Inc Active memory with three control units
GB2399899B (en) * 2003-03-27 2005-06-22 Micron Technology Inc Active memory command engine and method
US7181593B2 (en) 2003-03-27 2007-02-20 Micron Technology, Inc. Active memory command engine and method
US7404066B2 (en) 2003-03-27 2008-07-22 Micron Technology, Inc. Active memory command engine and method
US7793075B2 (en) 2003-03-27 2010-09-07 Micron Technology, Inc. Active memory command engine and method
US8195920B2 (en) 2003-03-27 2012-06-05 Micron Technology, Inc. Active memory command engine and method
US9032185B2 (en) 2003-03-27 2015-05-12 Micron Technology, Inc. Active memory command engine and method

Also Published As

Publication number Publication date
GB2286700B (en) 1995-10-18
GB9421571D0 (en) 1994-12-14


Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 19960118