WO2021138189A1 - Processor for configurable parallel computations - Google Patents

Processor for configurable parallel computations Download PDF

Info

Publication number
WO2021138189A1
WO2021138189A1 PCT/US2020/066823 US2020066823W
Authority
WO
WIPO (PCT)
Prior art keywords
processor
data stream
stream
circuits
configurable
Prior art date
Application number
PCT/US2020/066823
Other languages
French (fr)
Inventor
Wensheng Hua
Original Assignee
Star Ally International Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Star Ally International Limited filed Critical Star Ally International Limited
Priority to JP2022539692A priority Critical patent/JP2023508503A/en
Priority to EP20908751.9A priority patent/EP4085354A4/en
Priority to KR1020227025841A priority patent/KR20220139304A/en
Priority to CN202080090121.3A priority patent/CN115280297A/en
Publication of WO2021138189A1 publication Critical patent/WO2021138189A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/10Program control for peripheral devices
    • G06F13/12Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor
    • G06F13/124Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor where hardware is a sequential transfer control unit, e.g. microprocessor, peripheral processor or state-machine
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/40Bus structure
    • G06F13/4004Coupling between buses
    • G06F13/4022Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7885Runtime interface, e.g. data exchange, runtime control
    • G06F15/7889Reconfigurable logic implemented as a co-processor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/82Architectures of general purpose stored program computers data or demand driven
    • G06F15/825Dataflow computers
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to processor architecture.
  • the present invention relates to architecture of a processor having numerous processing units and data paths that are configurable and reconfigurable to allow parallel computing and data forwarding operations to be carried out in the processing units.
  • the operations of a single neuron may be implemented as a series of add, multiply and compare instructions, with each instruction being required to fetch operands from registers or memory, perform the operation in an arithmetic-logic unit (ALU), and write the result or results of the operations back to registers or memory.
  • ALU arithmetic-logic unit
  • the set of instructions, or the execution sequence of instructions may vary with data or the application.
  • these operations may be repeated hundreds of millions of times, enormous efficiencies can be attained in a processor with an appropriate architecture.
  • a processor includes (i) a plurality of configurable processors interconnected by modular interconnection fabric circuits that are configurable to partition the configurable processors into one or more groups, for parallel execution, and to interconnect the configurable processors in any order for pipelined operations.
  • each configurable processor may include (i) a control circuit; (ii) a plurality of configurable arithmetic logic circuits; and (iii) configurable interconnection fabric circuits for interconnecting the configurable arithmetic logic circuits.
  • each configurable arithmetic logic circuit may include (i) a plurality of arithmetic or logic operator circuits; and (ii) a configurable interconnection fabric circuit.
  • each configurable interconnection fabric circuit may include (i) a Benes network and (ii) a plurality of configurable first-in-first-out (FIFO) registers.
  • Figure 1 shows processor 100 that includes a 4 x 4 array of stream processing units (SPU) 101-1, 101-2, 101-3, ..., and 101-16, according to one embodiment of the present invention.
  • SPU stream processing units
  • Figure 3(b) shows an enable signal generated by each operator to signal that its output data stream is ready for processing by the next operator.
  • Figure 4 shows a generalized, representative implementation 400 of any of PLF units 102-1, 102-2, 102-3, and 102-4 and PLF subunit 202, according to one embodiment of the present invention.
  • FIG. 1 shows a processor 100 that includes, for example, a 4 x 4 array of stream processing units (SPU) 101-1, 101-2, 101-3, ..., and 101-16, according to one embodiment of the present invention.
  • SPU stream processing units
  • the SPUs are interconnected among themselves by configurable pipeline fabric (PLF) 102 that allows computational results from a given SPU to be provided or “streamed” to another SPU.
  • PLF configurable pipeline fabric
  • the 4 x 4 array of SPUs in processor 100 may be configured at run time into one or more groups of SPUs, with each group of SPUs configured as pipeline stages for a pipelined computational task.
  • PLF 102 is shown to include PLF units 102-1, 102-2, 102-3 and 102-4, each of which may be configured to provide data paths among the four SPUs in one of four quadrants of the 4 x 4 array.
  • PLF units 102-1, 102-2, 102-3 and 102-4 may also be interconnected by suitably configuring PLF unit 102-5, thereby allowing computational results from any of SPUs 101-1, 101-2, 101-3, ..., and 101-16 to be forwarded to any other one of SPUs 101-1, 101-2, 101-3, ..., and 101-16.
  • the PLF units of processor 100 may be organized in a hierarchical manner.
  • a host CPU (not shown) configures and reconfigures processor 100 over global bus 104 in real time during an operation.
  • Interrupt bus 105 is provided to allow each SPU to raise an interrupt to the host CPU to indicate task completion or any of numerous exceptional conditions.
  • Input data buses 106-1 and 106-2 stream input data into processor 100.
  • processor 100 may serve as a digital baseband circuit that processes in real time digitized samples from a radio frequency (RF) front-end circuit.
  • RF radio frequency
  • the input data samples received into processor 100 at input data buses 106-1 and 106-2 are in-phase and quadrature components of a signal received at an antenna, after signal processing at the RF front-end circuit.
  • the received signal includes the navigation signals transmitted from numerous positioning satellites.
  • FIG. 2 shows SPU 200 in one implementation of an SPU in processor 100, according to one embodiment of the present invention.
  • SPU 200 includes a 2 x 4 array of arithmetic and logic units, each referred to herein as an “arithmetic pipeline complex” (APC) to highlight that (i) each APC is reconfigurable via a set of configuration registers for any of numerous arithmetic and logic operations; and (ii) the APCs may be configurable in any of numerous manners to stream results from any APC to another APC in SPU 200.
  • APC arithmetic pipeline complex
  • APCs 201-1, 201-2, ..., 201-8 in the 2 x 4 array of APCs in SPU 200 are provided data paths among themselves on PLF subunit 202, which is an extension from its corresponding PLF unit 102-1, 102-2, 102-3 or 102-4.
  • SPU 200 includes control unit 203, which executes a small set of instructions from instruction memory 204, which is loaded by host CPU over global bus 104.
  • Internal processor bus 209 is accessible by the host CPU over global bus 104 during a configuration phase, and by control unit 203 during a computation phase. Switching between the configuration and computation phases is achieved by an enable signal asserted from the host CPU. When the enable signal is de-asserted, any clock signal to an APC (and, hence, any data valid signal to any operator within the APC) is gated off to save power.
  • Any SPU may be disabled by the host CPU by gating off the power supply signals to the SPU. In some embodiments, power supply signals to an APC may also be gated. Likewise, any PLF may also be gated off, when appropriate, to save power.
  • The enable signal to an APC may be memory-mapped to allow it to be accessed over internal processor bus 209.
  • the host CPU or SPU 200 may control enabling the APCs in the proper order, e.g., enabling the APCs in the reverse order of the data flow in the pipeline, such that all the APCs are ready for data processing when the first APC in the data flow is enabled.
  • Multiplexer 205 switches control of internal processor bus 209 between the host CPU and control unit 203.
  • SPU 200 includes memory blocks 207-1, 207-2, 207-3 and 207-4, which are accessible over internal processor bus 209 by the host CPU or SPU 200, and by APCs 201-1, 201-2, ..., 201-8 over internal data buses during the computation phase.
  • Switches 208-1, 208-2, 208-3 and 208-4 each switch access to memory blocks 207-1, 207-2, 207-3 and 207-4 between internal processor bus 209 and a corresponding one of internal data buses 210-1, 210-2, 210-3 and 210-4.
  • the host CPU may configure any element in SPU 200 by writing into configuration registers over global bus 104, which is extended into internal processor bus 209 by multiplexer 205 at this time.
  • control unit 203 may control operation of SPU 200 over internal processor bus 209, including one or more clock signals that allow APCs 201-1, 201-2, ..., 201-8 to operate synchronously with each other.
  • APCs 201-1, 201-2, ..., 201-8 may raise an interrupt on interrupt bus 211, which is received into SPU 200 for service.
  • The SPU may forward the interrupt signals and its own interrupt signals to the host CPU over interrupt bus 105.
  • Scratch memory 206 is provided to support instruction execution in control unit 203, such as for storing intermediate results, flags and interrupts. Switching between the configuration phase and the computation phase is controlled by the host CPU.
  • memory blocks 207-1, 207-2, 207-3 and 207-4 are accessed by control unit 203 using a local address space, which may be mapped into an allocated part of a global address space of processor 100.
  • Configuration registers of APCs 201-1, 201-2, ..., 201-8 are also likewise accessible from both the local address space and the global address space.
  • APCs 201-1, 201-2, ..., 201-8 and memory blocks 207-1, 207-2, 207-3 and 207-4 may also be directly accessed by the host CPU over global bus 104.
  • Control unit 203 may be a microprocessor of a type referred to by those of ordinary skill in the art as a minimal instruction set computer (MISC) processor, which operates under supervision of the host CPU.
  • control unit 203 manages lower level resources (e.g., APC 201-1, 201-2, 201-3 and 201-4) by servicing certain interrupts and by configuring locally configuration registers in the resources, thereby reducing the supervisory requirements of these resources on the host CPU.
  • the resources may operate without participation by control unit 203, i.e., the host CPU may directly service the interrupts and the configuration registers.
  • the host CPU may control the entire data processing pipeline directly.
  • FIG. 3(a) shows APC 300 in one implementation of one of APCs 201-1, 201-2, 201-3 and 201-4 of Figure 2, according to one embodiment of the present invention.
  • APC 300 includes representative operator units 301-1, 301-2, 301-3, and 301-4.
  • Each operator unit may include one or more arithmetic or logic circuits (e.g., adders, multipliers, shifters, suitable combinational logic circuits, suitable sequential logic circuits, or combinations thereof).
  • APC PLF 302 allows creation of data paths 303 among the operators in any suitable manner by the host CPU over internal processor bus 209.
  • APC PLF 302 and operators 301-1, 301-2, 301-3 and 301-4 are each configurable over internal processor bus 209 by both the host CPU and control unit 203, such that the operators may be organized to operate on the data stream in a pipeline fashion.
  • valid signal 401 is generated by each operator to signal that, when asserted, its output data stream (402) is valid for processing by the next operator.
  • An operator in the pipeline may be configured to generate an interrupt signal upon detecting the falling edge of valid signal 401 to indicate that processing of its input data stream is complete.
  • the interrupt signal may be serviced by control unit 203 or the host CPU.
  • Data into and out of APC 300 are provided over data paths in PLF subunit 202 of Figure 2. Some operators may be configured to access an associated memory block (i.e., memory blocks 207-1, 207-2, 207-3 or 207-4).
  • one operator may read data from the associated memory block and write the data onto its output data stream into the pipeline.
  • One operator may read data from its input data stream in the pipeline and write the data into the associated memory block. In either case, the address of the memory location is provided to the operator in its input data stream.
  • One or more buffer operators may be provided in an APC.
  • a buffer operator may be configured to read or write from a local buffer (e.g., a FIFO buffer).
  • congestion occurs at a buffer operator
  • the buffer operator may assert a pause signal to pause the current pipeline.
  • the pause signal disables all related APCs until the congestion subsides.
  • the buffer operator then resets the pause signal to resume the pipeline operation.
  • FIG. 4 shows a generalized, representative implementation 400 of any of PLF units 102-1, 102-2, 102-3, and 102-4 and PLF subunit 202, according to one embodiment of the present invention.
  • PLF implementation 400 includes Benes network 401, which receives n M-bit input data streams 403-1, 403-2, ..., 403-n and provides n M-bit output data streams 404-1, 404-2, ..., 404-n.
  • Benes network 401 is a non-blocking n x n Benes network that can be configured to allow the input data streams to be mapped and routed to the output data streams in any desired permutation programmed into its configuration register.
  • Output data streams 404-1, 404-2, ..., 404-n are then each provided to a corresponding configurable first-in-first-out (FIFO) register in FIFO registers 402, so that the FIFO output data streams 405-1, 405-2, ..., 405-n are properly aligned in time for their respective receiving units according to their respective configuration registers.
  • Control buses 410 and 411 represent the configuration signals into the configuration registers of Benes network 401 and FIFO registers 402, respectively.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)

Abstract

A flexible processor includes (i) numerous configurable processors interconnected by modular interconnection fabric circuits that are configurable to partition the configurable processors into one or more groups, for parallel execution, and to interconnect the configurable processors in any order for pipelined operations. Each configurable processor may include (i) a control circuit; (ii) numerous configurable arithmetic logic circuits; and (iii) configurable interconnection fabric circuits for interconnecting the configurable arithmetic logic circuits.

Description

PROCESSOR FOR CONFIGURABLE PARALLEL COMPUTATIONS
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to processor architecture. In particular, the present invention relates to architecture of a processor having numerous processing units and data paths that are configurable and reconfigurable to allow parallel computing and data forwarding operations to be carried out in the processing units.
2. Discussion of the Related Art
Many applications (e.g., signal processing, navigation, matrix inversion, machine learning, large data set searches) require enormous amounts of repetitive computation steps that are best carried out by numerous processors operating in parallel. Current microprocessors, whether the conventional “central processing units” (CPUs) that power desktop or mobile computers, or the more numerically-oriented conventional “graphics processing units” (GPUs), are ill-suited for such tasks. A CPU or GPU, even if provided numerous cores, is inflexible in its hardware configuration. For example, signal processing applications often require sets of large numbers of repetitive floating-point arithmetic operations (e.g., add and multiply). As implemented in a conventional CPU or GPU, the operations of a single neuron in a neural network, for example, may be implemented as a series of add, multiply and compare instructions, with each instruction being required to fetch operands from registers or memory, perform the operation in an arithmetic-logic unit (ALU), and write the result or results of the operations back to registers or memory. Although the nature of such operations is well-known, the set of instructions, or the execution sequence of instructions, may vary with data or the application. Thus, because of the manner in which memory, register files and ALUs are organized in a conventional CPU or GPU, it is difficult to achieve a high degree of parallel processing and streamlining of data flow without the flexibility of reconfiguring the data paths that shuttle operands between memory, register files and ALUs. In many applications, as these operations may be repeated hundreds of millions of times, enormous efficiencies can be attained in a processor with an appropriate architecture.
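For illustration only, the instruction-by-instruction style described above can be sketched in Python; the neuron function below is a hypothetical example (not taken from the patent) in which every multiply, add and compare step corresponds to an instruction that would fetch operands and write back a result on a conventional CPU or GPU:

```python
# Hypothetical sketch: one neuron evaluated as a sequence of multiply, add
# and compare steps, mirroring the fetch / execute / write-back pattern a
# conventional CPU or GPU follows for each instruction.

def neuron(inputs, weights, threshold=0.0):
    accumulator = 0.0                                 # running sum held in a "register"
    for x, w in zip(inputs, weights):
        product = x * w                               # multiply instruction
        accumulator = accumulator + product           # add instruction
    return 1.0 if accumulator > threshold else 0.0    # compare instruction

print(neuron([0.5, -1.0, 2.0], [0.2, 0.4, 0.1]))      # prints 0.0
```

A reconfigurable dataflow fabric, by contrast, would keep the multiply, add and compare operators wired together and stream operands through them without per-instruction fetch and write-back.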
SUMMARY
According to one embodiment of the present invention, a processor includes (i) a plurality of configurable processors interconnected by modular interconnection fabric circuits that are configurable to partition the configurable processors into one or more groups, for parallel execution, and to interconnect the configurable processors in any order for pipelined operations.
According to one embodiment, each configurable processor may include (i) a control circuit; (ii) a plurality of configurable arithmetic logic circuits; and (iii) configurable interconnection fabric circuits for interconnecting the configurable arithmetic logic circuits.
According to one embodiment of the present invention, each configurable arithmetic logic circuit may include (i) a plurality of arithmetic or logic operator circuits; and (ii) a configurable interconnection fabric circuit.
According to one embodiment of the present invention, each configurable interconnection fabric circuit may include (i) a Benes network and (ii) a plurality of configurable first-in-first-out (FIFO) registers.
The present invention is better understood upon consideration of the detailed description below with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 shows processor 100 that includes a 4 x 4 array of stream processing units (SPU) 101-1, 101-2, 101-3, ..., and 101-16, according to one embodiment of the present invention.
Figure 2 shows SPU 200 in one implementation of an SPU in processor 100 of Figure 1, according to one embodiment of the present invention.
Figure 3(a) shows APC 300 in one implementation of one of APCs 201-1, 201-2, 201-3 and 201-4 of Figure 2, according to one embodiment of the present invention.
Figure 3(b) shows an enable signal generated by each operator to signal that its output data stream is ready for processing by the next operator.
Figure 4 shows a generalized, representative implementation 400 of any of PLF units 102-1, 102-2, 102-3, and 102-4 and PLF subunit 202, according to one embodiment of the present invention.
To facilitate cross-referencing between figures, like elements in the figures are provided like reference numerals.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Figure 1 shows a processor 100 that includes, for example, a 4 x 4 array of stream processing units (SPU) 101-1, 101-2, 101-3, ..., and 101-16, according to one embodiment of the present invention. Of course, the 4 x 4 array is selected for illustrative purposes in this detailed description. A practical implementation may have any number of SPUs. The SPUs are interconnected among themselves by configurable pipeline fabric (PLF) 102 that allows computational results from a given SPU to be provided or “streamed” to another SPU. With this arrangement, the 4 x 4 array of SPUs in processor 100 may be configured at run time into one or more groups of SPUs, with each group of SPUs configured as pipeline stages for a pipelined computational task.
In the embodiment shown in Figure 1, PLF 102 is shown to include PLF units 102-1, 102-2, 102-3 and 102-4, each of which may be configured to provide data paths among the four SPUs in one of four quadrants of the 4 x 4 array. PLF units 102-1, 102-2, 102-3 and 102-4 may also be interconnected by suitably configuring PLF unit 102-5, thereby allowing computational results from any of SPUs 101-1, 101-2, 101-3, ..., and 101-16 to be forwarded to any other one of SPUs 101-1, 101-2, 101-3, ..., and 101-16. In one embodiment, the PLF units of processor 100 may be organized in a hierarchical manner. (The organization shown in Figure 1 may be considered a 2-level hierarchy, with PLF units 102-1, 102-2, 102-3 and 102-4 forming one level and PLF unit 102-5 being a second level.) In this embodiment, a host CPU (not shown) configures and reconfigures processor 100 over global bus 104 in real time during an operation. Interrupt bus 105 is provided to allow each SPU to raise an interrupt to the host CPU to indicate task completion or any of numerous exceptional conditions. Input data buses 106-1 and 106-2 stream input data into processor 100.
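As a rough illustration of the two-level PLF hierarchy (a software model only; the quadrant assignment and the plf_route helper below are assumptions made for the example, not details from the patent), a stream between two SPUs stays inside a single quadrant-level PLF unit when both SPUs share a quadrant, and otherwise crosses through PLF unit 102-5:

```python
# Hypothetical model of the 2-level PLF hierarchy of Figure 1: sixteen SPUs in
# four quadrants, each quadrant served by one of PLF units 102-1..102-4, with
# PLF unit 102-5 bridging the quadrants.  The SPU-to-quadrant mapping is an
# assumption made for illustration.

QUADRANT_OF = {spu: (spu - 1) // 4 for spu in range(1, 17)}

def plf_route(src_spu, dst_spu):
    """Return the PLF units an output stream would traverse (illustrative only)."""
    src_q, dst_q = QUADRANT_OF[src_spu], QUADRANT_OF[dst_spu]
    if src_q == dst_q:
        return [f"PLF 102-{src_q + 1}"]                       # intra-quadrant path
    return [f"PLF 102-{src_q + 1}", "PLF 102-5", f"PLF 102-{dst_q + 1}"]

print(plf_route(1, 3))    # ['PLF 102-1']
print(plf_route(1, 16))   # ['PLF 102-1', 'PLF 102-5', 'PLF 102-4']
```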
In one satellite positioning application, processor 100 may serve as a digital baseband circuit that processes in real time digitized samples from a radio frequency (RF) front-end circuit. In that application, the input data samples received into processor 100 at input data buses 106-1 and 106-2 are in-phase and quadrature components of a signal received at an antenna, after signal processing at the RF front-end circuit. The received signal includes the navigation signals transmitted from numerous positioning satellites.
Figure 2 shows SPU 200 in one implementation of an SPU in processor 100, according to one embodiment of the present invention. As shown in Figure 2, SPU 200 includes a 2 x 4 array of arithmetic and logic units, each referred to herein as an “arithmetic pipeline complex” (APC) to highlight that (i) each APC is reconfigurable via a set of configuration registers for any of numerous arithmetic and logic operations; and (ii) the APCs may be configurable in any of numerous manners to stream results from any APC to another APC in SPU 200. As shown in Figure 2, APCs 201-1, 201-2, ..., 201-8 in the 2 x 4 array of APCs in SPU 200 are provided data paths among themselves on PLF subunit 202, which is an extension from its corresponding PLF unit 102-1, 102-2, 102-3 or 102-4.
As shown in Figure 2, SPU 200 includes control unit 203, which executes a small set of instructions from instruction memory 204, which is loaded by the host CPU over global bus 104. Internal processor bus 209 is accessible by the host CPU over global bus 104 during a configuration phase, and by control unit 203 during a computation phase. Switching between the configuration and computation phases is achieved by an enable signal asserted from the host CPU. When the enable signal is de-asserted, any clock signal to an APC (and, hence, any data valid signal to any operator within the APC) is gated off to save power. Any SPU may be disabled by the host CPU by gating off the power supply signals to the SPU. In some embodiments, power supply signals to an APC may also be gated. Likewise, any PLF may also be gated off, when appropriate, to save power.
The enable signal to an APC may be memory-mapped to allow it to be accessed over internal processor bus 209. Through this arrangement, when multiple APCs are configured in a pipeline, the host CPU or SPU 200, as appropriate, may control enabling the APCs in the proper order, e.g., enabling the APCs in the reverse order of the data flow in the pipeline, such that all the APCs are ready for data processing when the first APC in the data flow is enabled.
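A minimal sketch of the reverse-order enabling policy follows; the write_enable helper is a hypothetical stand-in for the memory-mapped register write over internal processor bus 209:

```python
# Hypothetical sketch: enable a configured pipeline of APCs in the reverse
# order of data flow, so that every downstream APC is ready before its
# upstream producer starts.

def write_enable(apc_id, value):
    # Stand-in for a memory-mapped configuration-register write.
    print(f"APC {apc_id}: enable <- {value}")

def enable_pipeline(apcs_in_dataflow_order):
    for apc_id in reversed(apcs_in_dataflow_order):
        write_enable(apc_id, 1)

# APC 201-1 feeds 201-2, which feeds 201-3; 201-3 is enabled first, 201-1 last.
enable_pipeline(["201-1", "201-2", "201-3"])
```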
Multiplexer 205 switches control of internal processor bus 209 between the host CPU and control unit 203. SPU 200 includes memory blocks 207-1, 207-2, 207-3 and 207-4, which are accessible over internal processor bus 209 by the host CPU or SPU 200, and by APCs 201-1, 201-2, ..., 201-8 over internal data buses during the computation phase. Switches 208-1, 208-2, 208-3 and 208-4 each switch access to memory blocks 207-1, 207-2, 207-3 and 207-4 between internal processor bus 209 and a corresponding one of internal data buses 210-1, 210-2, 210-3 and 210-4. During the configuration phase, the host CPU may configure any element in SPU 200 by writing into configuration registers over global bus 104, which is extended into internal processor bus 209 by multiplexer 205 at this time. During the computation phase, control unit 203 may control operation of SPU 200 over internal processor bus 209, including one or more clock signals that allow APCs 201-1, 201-2, ..., 201-8 to operate synchronously with each other. At appropriate times, one or more of APCs 201-1, 201-2, ..., 201-8 may raise an interrupt on interrupt bus 211, which is received into SPU 200 for service. The SPU may forward the interrupt signals and its own interrupt signals to the host CPU over interrupt bus 105. Scratch memory 206 is provided to support instruction execution in control unit 203, such as for storing intermediate results, flags and interrupts. Switching between the configuration phase and the computation phase is controlled by the host CPU. In one embodiment, memory blocks 207-1, 207-2, 207-3 and 207-4 are accessed by control unit 203 using a local address space, which may be mapped into an allocated part of a global address space of processor 100. Configuration registers of APCs 201-1, 201-2, ..., 201-8 are likewise accessible from both the local address space and the global address space. APCs 201-1, 201-2, ..., 201-8 and memory blocks 207-1, 207-2, 207-3 and 207-4 may also be directly accessed by the host CPU over global bus 104. By setting multiplexer 205 through a memory-mapped register, the host CPU can connect and allocate internal processor bus 209 to become part of global bus 104.
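The local-to-global address mapping can be pictured as a fixed per-SPU window in the processor's global address space; the sketch below is a hypothetical model whose base addresses are invented for illustration:

```python
# Hypothetical model: each SPU's local address space is mapped into an
# allocated window of processor 100's global address space, so the host CPU
# (over global bus 104) and control unit 203 (locally) reach the same
# memory blocks and configuration registers.

SPU_GLOBAL_BASE = {          # invented base addresses, for illustration only
    "SPU 101-1": 0x0010_0000,
    "SPU 101-2": 0x0020_0000,
}

def local_to_global(spu, local_addr):
    return SPU_GLOBAL_BASE[spu] + local_addr

print(hex(local_to_global("SPU 101-1", 0x0400)))   # 0x100400
print(hex(local_to_global("SPU 101-2", 0x0400)))   # 0x200400
```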
Control unit 203 may be a microprocessor of a type referred to by those of ordinary skill in the art as a minimal instruction set computer (MISC) processor, which operates under supervision of the host CPU. In one embodiment, control unit 203 manages lower-level resources (e.g., APCs 201-1, 201-2, 201-3 and 201-4) by servicing certain interrupts and by locally configuring configuration registers in the resources, thereby reducing the supervisory requirements of these resources on the host CPU. In one embodiment, the resources may operate without participation by control unit 203, i.e., the host CPU may directly service the interrupts and the configuration registers. Furthermore, when a configured data processing pipeline requires participation by multiple SPUs, the host CPU may control the entire data processing pipeline directly.
Figure 3(a) shows APC 300 in one implementation of one of APCs 201-1, 201-2, 201-3 and 201-4 of Figure 2, according to one embodiment of the present invention. As shown in Figure 3(a), for illustrative purposes only, APC 300 includes representative operator units 301-1, 301-2, 301-3, and 301-4. Each operator unit may include one or more arithmetic or logic circuits (e.g., adders, multipliers, shifters, suitable combinational logic circuits, suitable sequential logic circuits, or combinations thereof). APC PLF 302 allows creation of data paths 303 among the operators in any suitable manner by the host CPU over internal processor bus 209. APC PLF 302 and operators 301-1, 301-2, 301-3 and 301-4 are each configurable over internal processor bus 209 by both the host CPU and control unit 203, such that the operators may be organized to operate on the data stream in a pipeline fashion.
Within a configured pipeline, the output data stream of each operator is provided as the input data stream for the next operator. As shown in Figure 3(b), valid signal 401 is generated by each operator to signal that, when asserted, its output data stream (402) is valid for processing by the next operator. An operator in the pipeline may be configured to generate an interrupt signal upon detecting the falling edge of valid signal 401 to indicate that processing of its input data stream is complete. The interrupt signal may be serviced by control unit 203 or the host CPU. Data into and out of APC 300 are provided over data paths in PLF subunit 202 of Figure 2. Some operators may be configured to access an associated memory block (i.e., memory block 207-1, 207-2, 207-3 or 207-4). For example, one operator may read data from the associated memory block and write the data onto its output data stream into the pipeline. Another operator may read data from its input data stream in the pipeline and write the data into the associated memory block. In either case, the address of the memory location is provided to the operator in its input data stream.
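One way to picture the configured pipeline is as a chain of stream transformers in which a memory operator sources data from, or sinks data into, its associated memory block using addresses carried on its input stream. The Python generators below are a software analogy with invented operator names, not the hardware mechanism:

```python
# Hypothetical software analogy of an APC pipeline: each operator consumes its
# input data stream and yields its output data stream; memory operators use
# addresses carried on a stream to access the associated memory block.

memory_block = {0: 10, 1: 20, 2: 30}          # stands in for memory block 207-x

def memory_read_operator(address_stream):
    for addr in address_stream:               # address arrives on the input stream
        yield memory_block[addr]              # data leaves on the output stream

def multiply_operator(stream, factor=3):
    for value in stream:
        yield value * factor

def memory_write_operator(addr_value_stream):
    for addr, value in addr_value_stream:     # address and data on the input stream
        memory_block[addr] = value

# Read addresses 0..2, scale by 3, and write the results to addresses 100..102.
scaled = multiply_operator(memory_read_operator(iter([0, 1, 2])))
memory_write_operator((100 + i, v) for i, v in enumerate(scaled))
print(memory_block)   # {0: 10, 1: 20, 2: 30, 100: 30, 101: 60, 102: 90}
```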
One or more buffer operators may be provided in an APC. A buffer operator may be configured to read or write from a local buffer (e.g., a FIFO buffer). When congestion occurs at a buffer operator, the buffer operator may assert a pause signal to pause the current pipeline. The pause signal disables all related APCs until the congestion subsides. The buffer operator then resets the pause signal to resume the pipeline operation.
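The pause behavior resembles conventional backpressure: a bounded buffer asserts pause when it fills and clears it once space frees up. A minimal sketch under that reading, with an invented BufferOperator class:

```python
# Hypothetical backpressure sketch: a buffer operator with a bounded FIFO
# asserts its pause signal when full and clears it once congestion subsides.

from collections import deque

class BufferOperator:
    def __init__(self, depth):
        self.fifo = deque()
        self.depth = depth
        self.pause = False                 # pause signal to the related APCs

    def push(self, value):
        if len(self.fifo) >= self.depth:
            self.pause = True              # congestion: hold off upstream APCs
            return False
        self.fifo.append(value)
        return True

    def pop(self):
        value = self.fifo.popleft()
        if len(self.fifo) < self.depth:
            self.pause = False             # congestion subsided: resume pipeline
        return value

buf = BufferOperator(depth=2)
for v in (1, 2, 3):
    print(v, "accepted" if buf.push(v) else "rejected, pause asserted")
buf.pop()
print("pause:", buf.pause)                 # False: the pipeline may resume
```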
Figure 4 shows a generalized, representative implementation 400 of any of PLF units 102-1, 102-2, 102-3, and 102-4 and PLF subunit 202, according to one embodiment of the present invention. As shown in Figure 4, PLF implementation 400 includes Benes network 401, which receives n M-bit input data streams 403-1, 403-2, ..., 403-n and provides n M-bit output data streams 404-1, 404-2, ..., 404-n. Benes network 401 is a non-blocking n x n Benes network that can be configured to allow the input data streams to be mapped and routed to the output data streams in any desired permutation programmed into its configuration register. Output data streams 404-1, 404-2, ..., 404-n are then each provided to a corresponding configurable first-in-first-out (FIFO) register in FIFO registers 402, so that the FIFO output data streams 405-1, 405-2, ..., 405-n are properly aligned in time for their respective receiving units according to their respective configuration registers. Control buses 410 and 411 represent the configuration signals into the configuration registers of Benes network 401 and FIFO registers 402, respectively.
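Functionally, implementation 400 applies a programmed permutation to the n input streams and then delays each permuted stream by a configurable number of cycles so the streams arrive time-aligned at their receivers. The behavioral model below is a sketch of that function only (it does not model the internal switch stages of the Benes network, and the class name is invented):

```python
# Hypothetical behavioral model of PLF implementation 400: a programmed
# permutation (the Benes network routing) followed by per-output configurable
# delays (the FIFO registers), stepped one clock cycle at a time.

from collections import deque

class PlfModel:
    def __init__(self, permutation, delays):
        # permutation[i]: index of the input stream routed to output i
        # delays[i]:      cycles by which output i is delayed for alignment
        self.permutation = permutation
        self.fifos = [deque([None] * d) for d in delays]

    def step(self, inputs):
        outputs = []
        for out_idx, in_idx in enumerate(self.permutation):
            fifo = self.fifos[out_idx]
            fifo.append(inputs[in_idx])     # route the selected input into the FIFO
            outputs.append(fifo.popleft())  # emit the delayed, aligned sample
        return outputs

plf = PlfModel(permutation=[2, 0, 1], delays=[0, 1, 2])
for cycle in range(3):
    print(plf.step([f"a{cycle}", f"b{cycle}", f"c{cycle}"]))
# ['c0', None, None]
# ['c1', 'a0', None]
# ['c2', 'a1', 'b0']
```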
The above detailed description is provided to illustrate specific embodiments of the present invention and is not intended to be limiting. Numerous modifications and variations within the scope of the invention are possible. The present invention is set forth in the accompanying claims.

Claims

CLAIMS I claim:
1. A processor receiving a system input data stream, comprising: a plurality of stream processors each receiving an input data stream and providing an output data stream, wherein the input data stream of a selected one of the stream processors comprises the system input data stream; a plurality of configurable interconnection circuits each configurable to route the output data stream of one of the stream processors as the input data stream of another one of the stream processors; and a global bus providing access to and being accessible by the stream processors and the configurable interconnection circuits.
2. The processor of Claim 1, wherein the configurable interconnection circuits are hierarchically organized.
3. The processor of Claim 1, wherein the stream processors are included in a system that further comprises a host processor that configures the stream processors and the configurable interconnection circuits over the global bus.
4. The processor of Claim 3, wherein the host processor provides an enable signal in each stream processor that initiates a computational phase in the stream processor.
5. The processor of Claim 4 wherein, when the enable signal of the stream processor is de-asserted, selected circuits in the stream processor are power-gated to conserve power.
6. The processor of Claim 3, further comprising an interrupt bus which allows each stream processor to raise an interrupt to the host processor.
7. The processor of Claim 6, wherein each stream processor comprises: a plurality of arithmetic logic circuits each receiving an input data stream and providing an output data stream, wherein the input data stream of one of the arithmetic logic circuits comprises the input data stream of the stream processor and wherein the output data stream of another one of the arithmetic logic circuits comprises the output data stream of the stream processor; a plurality of configurable interconnection circuits, wherein each configurable interconnection circuit is configurable to route the output data stream of one of the arithmetic logic circuits as the input data stream of another one of the arithmetic logic circuits; a processor bus providing access to or accessible from the arithmetic logic circuits; and a control processor providing and receiving control and configuration signals to and from the arithmetic logic circuits over the processor bus.
8. The processor of Claim 7, wherein the control processor processes selected interrupts on the interrupt bus.
9. The processor of Claim 7, wherein each stream processor further comprises a plurality of memory circuits each accessible directly from one or more of the arithmetic logic circuits of the stream processor and over the processor bus.
10. The processor of Claim 7, wherein each arithmetic logic circuit or configurable interconnection circuit comprises a plurality of configuration registers accessible by the host processor over the global bus or the control processor on the processor bus for storing values of control parameters of the arithmetic logic circuit or configurable interconnection circuit.
11. The processor of Claim 10, wherein the stream processor further comprises an instruction memory accessible by the control processor, the instruction memory storing instructions executable by the control processor.
12. The processor of Claim 11, wherein the instruction memory is accessible by the host processor to store instructions for execution by the control processor over the global bus.
13. The processor of Claim 10, further comprising a processor bus multiplexer which is configurable by the host processor to connect a portion of the global bus to the processor bus.
14. The processor of Claim 7, wherein each arithmetic logic circuit receives an enable signal from the host processor or the control processor and wherein, when the enable signal is de-asserted, clock signals associated with the arithmetic logic circuit are gated off, thereby suspending operations within the arithmetic logic circuit.
15. The processor of Claim 7, wherein each arithmetic logic circuit comprises: a plurality of operator circuits each receiving an input data stream and providing an output data stream; and a configurable interconnection circuit configurable to route (i) the input data stream of the arithmetic logic circuit as the input data stream of one of the operator circuits; (ii) the output data stream of any of the operator circuits as the input data stream of any other one of the operator circuits; and (iii) the output data stream of one of the operator circuits as the output data stream of the arithmetic logic circuit.
16. The processor of Claim 15, wherein each operator circuit comprises one or more arithmetic circuits or logic circuits.
17. The processor of Claim 16, wherein each arithmetic circuit comprises one or more of: an adder, a multiplier, or a divider.
18. The processor of Claim 16, wherein the logic circuits each comprise one or more of shifters, combinational logic circuits, sequential logic circuits, and any combination thereof.
19. The processor of Claim 15, wherein each operator circuit provides a valid signal to indicate validity of its output data stream.
20. The processor of Claim 15, wherein at least one operator circuit comprises a memory operator.
21. The processor of Claim 15, wherein at least one operator circuit comprises a buffer operator.
22. The processor of Claim 1, wherein each configurable interconnection circuit comprises a non-blocking network receiving one or more input data streams and providing one or more output data streams.
23. The processor of Claim 22, wherein the non-blocking network comprises an N x N Benes network.
24. The processor of Claim 22, wherein the configurable interconnection circuit further comprises a plurality of first-in-first-out memories each receiving a selected one of the output data streams of the non-blocking network to provide a delayed output data stream corresponding to the selected output data stream of the non-blocking network delayed by a configurable delay value.
25. The processor of Claim 1, wherein the processor serves as a digital baseband circuit that processes in real time digitized samples from a radio frequency (RF) front-end circuit.
26. The processor of Claim 25, wherein the input data stream of the processor comprises in-phase and quadrature components of a signal received at an antenna, after signal processing at the RF front-end circuit.
27. The processor of Claim 26, wherein the received signal includes navigation signals transmitted from numerous positioning satellites.
PCT/US2020/066823 2019-12-30 2020-12-23 Processor for configurable parallel computations WO2021138189A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2022539692A JP2023508503A (en) 2019-12-30 2020-12-23 Processor for configurable parallel computing
EP20908751.9A EP4085354A4 (en) 2019-12-30 2020-12-23 Processor for configurable parallel computations
KR1020227025841A KR20220139304A (en) 2019-12-30 2020-12-23 Processor for Configurable Parallel Computing
CN202080090121.3A CN115280297A (en) 2019-12-30 2020-12-23 Processor for configurable parallel computing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962954952P 2019-12-30 2019-12-30
US62/954,952 2019-12-30

Publications (1)

Publication Number Publication Date
WO2021138189A1 true WO2021138189A1 (en) 2021-07-08

Family

ID=76547640

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/066823 WO2021138189A1 (en) 2019-12-30 2020-12-23 Processor for configurable parallel computations

Country Status (6)

Country Link
US (2) US11789896B2 (en)
EP (1) EP4085354A4 (en)
JP (1) JP2023508503A (en)
KR (1) KR20220139304A (en)
CN (1) CN115280297A (en)
WO (1) WO2021138189A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030039262A1 (en) * 2001-07-24 2003-02-27 Leopard Logic Inc. Hierarchical mux based integrated circuit interconnect architecture for scalability and automatic generation
US20040098562A1 (en) * 2002-11-15 2004-05-20 Anderson Adrian John Configurable processor architecture
US20040225790A1 (en) * 2000-09-29 2004-11-11 Varghese George Selective interrupt delivery to multiple processors having independent operating systems
US20080133899A1 (en) * 2006-12-04 2008-06-05 Samsung Electronics Co., Ltd. Context switching method, medium, and system for reconfigurable processors
US20090179794A1 (en) * 2006-05-08 2009-07-16 Nxp B.V. Gps rf front end and related method of providing a position fix, storage medium and apparatus for the same
US7600143B1 (en) * 2004-08-19 2009-10-06 Unisys Corporation Method and apparatus for variable delay data transfer
US20110051670A1 (en) * 2006-07-07 2011-03-03 Broadcom Corporation Integrated blocker filtering rf front end
US7982497B1 (en) * 2010-06-21 2011-07-19 Xilinx, Inc. Multiplexer-based interconnection network
US20110314233A1 (en) * 2010-06-22 2011-12-22 Sap Ag Multi-core query processing using asynchronous buffers
US20120191967A1 (en) * 2009-01-21 2012-07-26 Shanghai Xin Hao Micro Electronics Co. Ltd. Configurable data processing system and method

Family Cites Families (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5056000A (en) * 1988-06-21 1991-10-08 International Parallel Machines, Inc. Synchronized parallel processing with shared memory
US5594866A (en) * 1989-01-18 1997-01-14 Intel Corporation Message routing in a multi-processor computer system with alternate edge strobe regeneration
US5680400A (en) * 1995-05-31 1997-10-21 Unisys Corporation System for high-speed transfer of a continuous data stream between hosts using multiple parallel communication links
IT1288076B1 (en) * 1996-05-30 1998-09-10 Antonio Esposito ELECTRONIC NUMERICAL MULTIPROCESSOR PARALLEL MULTIPROCESSOR WITH REDUNDANCY OF COUPLED PROCESSORS
AU2271201A (en) * 1999-12-14 2001-06-25 General Instrument Corporation Hardware filtering of input packet identifiers for an mpeg re-multiplexer
WO2002013418A1 (en) * 2000-08-03 2002-02-14 Morphics Technology, Inc. Flexible tdma system architecture
US7840777B2 (en) * 2001-05-04 2010-11-23 Ascenium Corporation Method and apparatus for directing a computational array to execute a plurality of successive computational array instructions at runtime
US7159099B2 (en) * 2002-06-28 2007-01-02 Motorola, Inc. Streaming vector processor with reconfigurable interconnection switch
US20060168637A1 (en) * 2005-01-25 2006-07-27 Collaboration Properties, Inc. Multiple-channel codec and transcoder environment for gateway, MCU, broadcast and video storage applications
US20070113229A1 (en) * 2005-11-16 2007-05-17 Alcatel Thread aware distributed software system for a multi-processor
CN101359284B (en) * 2006-02-06 2011-05-11 威盛电子股份有限公司 Multiplication accumulate unit for treating plurality of different data and method thereof
US8145650B2 (en) * 2006-08-18 2012-03-27 Stanley Hyduke Network of single-word processors for searching predefined data in transmission packets and databases
EP2056212B1 (en) * 2006-08-23 2013-04-10 NEC Corporation Mixed mode parallel processor system and method
JP2009134391A (en) * 2007-11-29 2009-06-18 Renesas Technology Corp Stream processor, stream processing method, and data processing system
US7958341B1 (en) * 2008-07-07 2011-06-07 Ovics Processing stream instruction in IC of mesh connected matrix of processors containing pipeline coupled switch transferring messages over consecutive cycles from one link to another link or memory
KR101275698B1 (en) * 2008-11-28 2013-06-17 상하이 신하오 (브레이브칩스) 마이크로 일렉트로닉스 코. 엘티디. Data processing method and device
US8612711B1 (en) * 2009-09-21 2013-12-17 Tilera Corporation Memory-mapped data transfers
CN101739242B (en) * 2009-11-27 2013-07-31 深圳中微电科技有限公司 Stream data processing method and stream processor
US9558247B2 (en) * 2010-08-31 2017-01-31 Samsung Electronics Co., Ltd. Storage device and stream filtering method thereof
WO2014190263A2 (en) * 2013-05-24 2014-11-27 Coherent Logix, Incorporated Memory-network processor with programmable optimizations
US9985996B2 (en) * 2013-09-09 2018-05-29 Avago Technologies General Ip (Singapore) Pte. Ltd. Decoupling audio-video (AV) traffic processing from non-AV traffic processing
US10404624B2 (en) * 2013-09-17 2019-09-03 Avago Technologies International Sales Pte. Limited Lossless switching of traffic in a network device
US9712442B2 (en) * 2013-09-24 2017-07-18 Broadcom Corporation Efficient memory bandwidth utilization in a network device
EP3400688B1 (en) * 2016-01-04 2020-05-20 Gray Research LLC Massively parallel computer, accelerated computing clusters, and two dimensional router and interconnection network for field programmable gate arrays, and applications
CN109032668B (en) * 2017-06-09 2023-09-19 超威半导体公司 Stream processor with high bandwidth and low power vector register file
US10936597B2 (en) * 2017-11-21 2021-03-02 Gto Llc Systems and methods for generating customized filtered-and-partitioned market-data feeds
US10824467B2 (en) * 2018-08-07 2020-11-03 Arm Limited Data processing system with protected mode of operation for processing protected content

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040225790A1 (en) * 2000-09-29 2004-11-11 Varghese George Selective interrupt delivery to multiple processors having independent operating systems
US20030039262A1 (en) * 2001-07-24 2003-02-27 Leopard Logic Inc. Hierarchical mux based integrated circuit interconnect architecture for scalability and automatic generation
US20040098562A1 (en) * 2002-11-15 2004-05-20 Anderson Adrian John Configurable processor architecture
US7600143B1 (en) * 2004-08-19 2009-10-06 Unisys Corporation Method and apparatus for variable delay data transfer
US20090179794A1 (en) * 2006-05-08 2009-07-16 Nxp B.V. Gps rf front end and related method of providing a position fix, storage medium and apparatus for the same
US20110051670A1 (en) * 2006-07-07 2011-03-03 Broadcom Corporation Integrated blocker filtering rf front end
US20080133899A1 (en) * 2006-12-04 2008-06-05 Samsung Electronics Co., Ltd. Context switching method, medium, and system for reconfigurable processors
US20120191967A1 (en) * 2009-01-21 2012-07-26 Shanghai Xin Hao Micro Electronics Co. Ltd. Configurable data processing system and method
US7982497B1 (en) * 2010-06-21 2011-07-19 Xilinx, Inc. Multiplexer-based interconnection network
US20110314233A1 (en) * 2010-06-22 2011-12-22 Sap Ag Multi-core query processing using asynchronous buffers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4085354A4 *

Also Published As

Publication number Publication date
KR20220139304A (en) 2022-10-14
US11789896B2 (en) 2023-10-17
JP2023508503A (en) 2023-03-02
EP4085354A4 (en) 2024-03-13
EP4085354A1 (en) 2022-11-09
US20210200710A1 (en) 2021-07-01
US20230418780A1 (en) 2023-12-28
CN115280297A (en) 2022-11-01

Similar Documents

Publication Publication Date Title
US11915057B2 (en) Computational partition for a multi-threaded, self-scheduling reconfigurable computing fabric
US9760373B2 (en) Functional unit having tree structure to support vector sorting algorithm and other algorithms
US7533244B2 (en) Network-on-chip dataflow architecture
US10102001B2 (en) Parallel slice processor shadowing states of hardware threads across execution slices
US20050289327A1 (en) Reconfigurable processor and semiconductor device
Kartashev et al. A multicomputer system with dynamic architecture
US7734896B2 (en) Enhanced processor element structure in a reconfigurable integrated circuit device
EP1535189B1 (en) Programmable pipeline fabric utilizing partially global configuration buses
EP4283481A1 (en) Reconfigurable processor and configuration method
US11789896B2 (en) Processor for configurable parallel computations
CN112074810B (en) Parallel processing apparatus
JP7131115B2 (en) DATA PROCESSING APPARATUS, DATA PROCESSING METHOD, AND PROGRAM
Ferreira et al. A low cost and adaptable routing network for reconfigurable systems
US7043710B2 (en) Method for early evaluation in micropipeline processors
Feng et al. Design and evaluation of a novel reconfigurable ALU based on FPGA
Einstein Mercury Computer Systems' modular heterogeneous RACE (R) multicomputer
US20090113083A1 (en) Means of control for reconfigurable computers
JPH05324694A (en) Reconstitutable parallel processor
WO2021014017A1 (en) A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture
Ferreira et al. Reducing interconnection cost in coarse-grained dynamic computing through multistage network
RU2292075C1 (en) Synergetic computing system
WO2024134679A1 (en) Polymorphic computing fabric for static dataflow execution of computation operations represented as dataflow graphs (dfgs)
Kasim et al. HDL Based Design for High Bandwidth Application
JPWO2021138189A5 (en)
US20130046955A1 (en) Local Computation Logic Embedded in a Register File to Accelerate Programs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20908751

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022539692

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020908751

Country of ref document: EP

Effective date: 20220801