CN113407483B - Dynamic reconfigurable processor for data intensive application - Google Patents


Info

Publication number
CN113407483B
Authority
CN
China
Prior art keywords: data, register, configuration, time, operator
Legal status: Active
Application number
CN202110703118.0A
Other languages
Chinese (zh)
Other versions
CN113407483A (en)
Inventor
Liu Dajiang (刘大江)
Zhu Rong (朱蓉)
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202110703118.0A priority Critical patent/CN113407483B/en
Publication of CN113407483A publication Critical patent/CN113407483A/en
Application granted granted Critical
Publication of CN113407483B publication Critical patent/CN113407483B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023 Two dimensional arrays, e.g. mesh, torus
    • G06F15/78 Architectures comprising a single central processing unit
    • G06F15/7867 Architectures with reconfigurable architecture
    • G06F15/7871 Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Logic Circuits (AREA)
  • Multi Processors (AREA)

Abstract

The application provides a dynamically reconfigurable processor for data-intensive applications. The processor comprises a processing unit array, an on-chip multi-bank scratchpad memory (SPM), and a configuration memory. The processing unit array is formed by m×n processing units (PEs) arranged as a two-dimensional array, where m and n are positive integers; PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a crossbar selection matrix unit. The proposed architecture lets reusable data flow efficiently within the processing unit array, avoids repeated accesses to the same storage location, reduces the volume of memory accesses at the source, and greatly improves the loop-pipelining performance of the dynamically reconfigurable processor.

Description

Dynamic reconfigurable processor for data intensive application
Technical Field
The application relates to the technical field of integrated circuits, and in particular to dynamically reconfigurable processors.
Background
With the development of cloud computing, big data, the Internet of Things, and related technologies, intelligent terminal devices are proliferating ever faster, and the demand for high-performance chips is increasingly urgent. The dynamically reconfigurable processor is a new processor architecture whose energy efficiency approaches that of an application-specific integrated circuit (ASIC) without sacrificing too much programming flexibility, making it one of the ideal architectures for accelerating data-intensive applications. Unlike a conventional general-purpose processor (GPP), a dynamically reconfigurable processor incurs no latency or energy overhead for instruction fetch and decode; unlike an ASIC, it can reconfigure circuit functionality dynamically at run time, offering better flexibility; unlike a field-programmable gate array (FPGA), it uses coarse-grained configuration, which reduces configuration-information overhead and yields higher computational energy efficiency.
A typical dynamically reconfigurable processor generally consists of a processing unit array, a data memory, and a configuration memory. The processing unit array is made up of multiple processing elements (PEs); the functionality of the whole array is defined by configuring the connectivity and operation mode of each PE. The configuration is derived mainly from the mapping produced by a specific compilation algorithm. Modulo-scheduled loop software pipelining is one of the most common mapping-optimization methods in compilation; it improves the parallel execution performance of an application by minimizing the initiation interval (II) of loop iterations. However, high computational parallelism causes large amounts of data to be accessed in parallel between the data memory and the processing unit array. To relieve this parallel data-access pressure, existing reconfigurable processors typically use an on-chip multi-bank scratchpad memory (SPM) to supply data to the processing unit array in parallel. A 4×4 processing unit array is typically paired with a 4-bank SPM, in which each row of the array can access each bank in parallel, but different PEs within the same row can only access data serially because they share a data bus. To further enhance parallel data-access capability, HReA [L. Liu, Z. Li, C. Yang, C. Deng, S. Yin and S. Wei, "HReA: An Energy-Efficient Embedded Dynamically Reconfigurable Fabric for 13-Dwarfs Processing," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 3, pp. 381-385, March 2018] provides a 4×4 processing unit array with a 16-bank SPM. In that architecture all PEs can access every bank of the SPM in parallel, greatly enhancing parallel data-access capability, but this also increases the difficulty of bank management and increases chip area and power consumption.
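The bank-conflict cost that motivates these designs can be illustrated with a small sketch (the word-interleaved bank mapping and the function name are our own illustrative assumptions, not part of the patent):

```python
from collections import Counter

def extra_cycles(addresses, num_banks=4):
    """Extra stall cycles when the given word addresses are issued in the
    same clock cycle to a word-interleaved multi-bank SPM: each bank can
    service one request per cycle, so surplus requests to a bank stall."""
    per_bank = Counter(addr % num_banks for addr in addresses)
    return sum(count - 1 for count in per_bank.values())

# Four accesses spread over four banks proceed conflict-free...
assert extra_cycles([0, 1, 2, 3]) == 0
# ...but three accesses all hitting bank 0 cost two extra cycles.
assert extra_cycles([0, 4, 8]) == 2
# A 16-bank SPM (as in HReA) removes this particular conflict.
assert extra_cycles([0, 4, 8], num_banks=16) == 0
```

The last assertion shows why adding banks helps, and why it is not free: every extra bank adds area, power, and management complexity, which is the trade-off discussed above.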
To make full use of the limited SPM bandwidth, the literature [S. Yin, X. Yao, T. Lu, D. Liu, J. Gu, L. Liu, and S. Wei, "Conflict-free loop mapping for coarse-grained reconfigurable architecture with multibank memory," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, pp. 2471-2485, 2017] proposes a conflict-free loop-mapping method from the compilation perspective. The method first schedules the memory-access operators in the DFG to different time steps to reduce the amount of parallel data access, then organizes the storage locations of data in the SPM through a memory-partitioning algorithm, and finally reduces data-access conflicts. However, as the initiation interval of the loop pipeline becomes smaller and smaller, multiple memory operators have to execute simultaneously, and ultimately performance must be sacrificed to guarantee conflict-free data access. The above work allows the original memory operators of an application to run conflict-free from the architectural or compilation point of view, but it does not fundamentally reduce the number of memory operators, which limits the extent to which memory conflicts can be optimized.
From the perspective of data-intensive applications, there are many data-reuse opportunities in their loop kernels, for example in stencil computation. Although a loop kernel may contain many memory accesses, some of them actually read the same data. If the same data is fetched once and used multiple times, access conflicts can be reduced by eliminating memory-access operators. The unavoidable question is then how to route such reusable data between PEs.
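The arithmetic of this reuse is easy to sketch. For a hypothetical 3-point stencil kernel that references x[i], x[i+1], and x[i+2] in every iteration (the window size and function names are illustrative, not taken from the patent):

```python
def loads_without_reuse(iterations, refs_per_iter=3):
    """Every array reference issues its own load operator."""
    return iterations * refs_per_iter

def loads_with_reuse(iterations, window=3):
    """Fetch each element exactly once: the first iteration primes the
    reuse window; each later iteration loads only the one new element."""
    return window + (iterations - 1)

assert loads_without_reuse(100) == 300   # 3 memory operators per iteration
assert loads_with_reuse(100) == 102      # ~1 memory operator per iteration
```

Cutting the load count roughly threefold directly shrinks the pool of memory operators that can collide in any given cycle, which is the effect the following paragraphs build on.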
In a conventional processing unit array, only a single-channel network connects the processing units: data can be transferred between PEs only by entering through the two multiplexer (MUX) inputs of a PE and leaving through its output register, and a PE performing such a routing operation can do nothing else, which wastes resources. If instead the data is not routed onward but retained in the PE's local register file (LRF) for multiple cycles, the mapping of operators onto the processing unit array becomes very constrained at compile time: an operator that needs the data in a given PE's LRF can only be mapped onto that PE. Therefore, how to design an architecture that routes data between PEs efficiently, makes full use of PE resources, and keeps operator mapping flexible at compile time is a technical problem that those skilled in the field need to solve. Solving the efficient data-routing problem makes it possible to fully exploit reusable data, thereby reducing memory-access operators and access conflicts and ultimately improving the execution performance of the dynamically reconfigurable processor.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the related art to some extent.
Therefore, a first object of the present application is to provide a dynamic reconfigurable processor for data-intensive applications, so as to improve the parallel data access capability of the reconfigurable processor.
A second object of the application is to propose a non-transitory computer readable storage medium.
To achieve the above objects, an embodiment of the first aspect of the present application provides a dynamically reconfigurable processor for data-intensive applications. The processor includes a processing unit array, an on-chip multi-bank scratchpad memory (SPM), and a configuration memory. The processing unit array is formed by m×n processing units (PEs) arranged as a two-dimensional array, where m and n are positive integers; PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a crossbar selection matrix unit.
Optionally, in an embodiment of the present application, each PE includes a functional unit (FU) for performing various fixed-point operations, a local register file (RF), an output register, and a configuration register. The FU has two multiplexers at its inputs for accessing data from different sources; the RF is divided into r individual registers, where r is a positive integer, and each register selects data from the FU or from the previous register. The information in the configuration register of each PE comes from the configuration memory; the configuration register is connected to the components inside the PE, and the distributed configuration stream configures the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
Optionally, in one embodiment of the application, the processing units PE are interconnected by a two-channel network comprising a result network for transferring the computation results of the functional unit FU and a value network for transferring the values of the local register file RF.
Optionally, in one embodiment of the application, when a value fetched from memory by a memory-access operator needs to be referenced multiple times within a short period, the value is distributed over the value network to the other processing units PE that need to reference it.
Optionally, in one embodiment of the application, a serial-shift data path is added among the internal registers of the local register file RF.
Optionally, in one embodiment of the present application, the method comprises the following steps:
Step 1: convert the application pseudocode into the original data dependence graph. Since data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, remove the two memory operators L1 and L2... specifically, remove the two operators L2 and L3 and add two new reuse dependence edges (L1, *) and (L1, +) to obtain a new data dependence graph. (L1, *) indicates that the data fetched by memory operator L1 is transmitted through the value network to the multiplication operator "*" for consumption; (L1, +) indicates that the data fetched by memory operator L1 is transmitted through the value network to the addition operator "+" for consumption. A compilation scheme is obtained with a compilation tool and a configuration information stream is generated, yielding a placement result with II = 1: memory operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the store operator S1 on PE4.
Step 2: at time t, driven by the configuration stream, after the L1 operator finishes fetching its data, the data is placed in the last register of PE1.
Step 3: at time t+1, driven by the configuration stream, the time-t L1 data is transferred out through multiplexer Mb of PE1, passes through multiplexers Ma and M1 of PE2, and reaches the first register R1 of PE2; after the L1 operator at time t+1 finishes fetching its data, the data is placed in the output register of PE1 and simultaneously in the last register of PE1.
Step 4: at time t+2, driven by the configuration stream, the time-t L1 data is transferred out through multiplexer Mb of PE2, passes through multiplexers Ma and M1 of PE3, and reaches the first register R1 of PE3. At the same time, the time-t L1 data is also passed to the second input port of the FU of PE2 through the multiplexers Md and Mf in front of that FU; meanwhile, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2. Driven by the configuration stream, the FU performs the multiplication "*" operation, and the result is stored in the output register of PE2.
Step 5: at time t+3, driven by the configuration stream, the time-t L1 data is transferred through the output port of the first register R1 of PE3 and the multiplexer M2 to the second register R2 of PE3; at the same time, the FU of PE2 performs the multiplication operation, and the result is stored in the output register of PE2.
Step 6: at time t+4, driven by the configuration stream, the time-t L1 data reaches the second input port of the FU of PE3 through the multiplexers Md and Mf in front of that FU; at the same time, the data in the output register of PE2 is transmitted to the first input port through the multiplexer Me in front of the FU of PE3. Driven by the configuration stream, the FU performs the addition "+" operation, and the result is stored in the output register of PE3.
Step 7: at time t+5, driven by the configuration stream, the data in the output register of PE3 is transferred through the multiplexer Me to the first input port of the FU of PE4, and the data is stored in the bank through the bus.
To achieve the above objects, an embodiment of the second aspect of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the reusable-data transfer method according to the embodiment of the first aspect of the present application.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a dynamic reconfigurable processor for data-intensive applications according to an embodiment of the present application;
fig. 2 is a schematic diagram of an implementation of the architecture according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
A dynamically reconfigurable processor for data-intensive applications according to embodiments of the present application is described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a dynamically reconfigurable processor for data-intensive applications according to an embodiment of the present application.
As shown in fig. 1, to achieve the above objects, an embodiment of the present application proposes a dynamically reconfigurable processor for data-intensive applications. The processor includes a processing unit array, an on-chip multi-bank scratchpad memory (SPM), and a configuration memory. The processing unit array is formed by m×n processing units (PEs) arranged as a two-dimensional array, where m and n are positive integers; PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a crossbar selection matrix unit.
In one embodiment of the present application, each PE further comprises a functional unit (FU) for performing various fixed-point operations, a local register file (RF), an output register, and a configuration register. The FU has two multiplexers at its inputs for accessing data from different sources. The RF is divided into r individual registers, where r is a positive integer, and each register selects data originating from the FU or from the previous register. The information in the configuration register of each PE comes from the configuration memory, which is connected to the components inside the PE; the distributed configuration stream configures the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
In one embodiment of the application, further, the processing units PE are interconnected by a two-channel network comprising a result network for transferring the computation results of the functional unit FU and a value network for transferring the values of the local register file RF.
In one embodiment of the application, further, when a value fetched from memory by a memory-access operator needs to be referenced multiple times within a short period, the value is distributed over the value network to the other processing units PE that need to reference it.
In one embodiment of the application, further, a serial-shift data path is added among the internal registers of the local register file RF.
Specifically, in one embodiment of the present application, the FU may perform various fixed-point operations, including addition, subtraction, multiplication, and logical operations. The FU inputs have two 6-to-1 multiplexers (Me, Mf) that can access data from different sources, e.g., the FUs of neighboring PEs, the individual local registers of the PE's own register file RF, and memory. The FU output has three destinations: memory, the output register, and the individual registers of the register file RF.
The RF is divided into r individual registers (R1, R2, ..., Rr, where r is a positive integer), each preceded by a 2-to-1 multiplexer (M1, M2, ..., Mr) that can select data from the FU or from the previous register (the "previous register" of the first register R1 is a register of an adjacent RF). The output port of each register is connected to its successor register (the last register Rr has no successor) and to three r-to-1 multiplexers (Mb, Mc, Md), through which the data in any one register can be selected and sent to the local FU for computation. The data in a register can also be selected and sent to the first register (R1) of an adjacent RF through the multiplexer Mb. Meanwhile, the first register R1 of an RF selects data from one of the adjacent RFs through the 4-to-1 multiplexer Ma.
The information in the configuration register of each PE comes from the configuration memory, which is connected to the components inside the PE; the distributed configuration stream configures the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
In one embodiment of the application, specifically, the interconnection between processing units changes from the original single-channel network to a two-channel network: one channel transfers the FU computation results (the result network) and the other transfers RF values (the value network). The result network passes the FU's computation result through the output register to a neighboring PE, or through the bus to memory. The value network passes the values in the RF through a multiplexer, flexibly configuring the output direction of the data, and transfers the data to the first register of an adjacent RF or to other local registers.
When a value fetched from memory by one memory-access operator needs to be referenced multiple times within a short period, the value can be distributed over the constructed value network to the other PEs that need to reference it, which strengthens the data-reuse capability of the reconfigurable computing array. The two-channel interconnection network can reduce the number of memory-access operators in the data flow graph at the source, thereby reducing access conflicts in the data memory.
In one embodiment of the present application, specifically, regarding the register interconnection within the processing unit: while the two-channel inter-PE interconnection network design enhances the flexibility of data transfer, in pipelined mode the correctness of the computed function is guaranteed only when the required time (RT), the clock cycle at which the data should arrive, equals the arrival time (AT), the clock cycle at which the data actually arrives. The AT of reused data depends on the Manhattan distance between the data-producer PE and the data-consumer PE, and it is difficult to guarantee during compilation that this Manhattan distance matches the RT. We therefore add a serial-shift data path to the RF-internal registers, i.e., the registers inside an RF are connected head to tail, so that in pipelined mode the reused data can still be held inside the RF of the same PE for multiple clock cycles.
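A minimal behavioral model of this serial-shift path (the class and method names are our own, not from the patent) shows how a value is retained inside one RF for r clock cycles before leaving the chain:

```python
class ShiftRegisterFile:
    """RF in shift-register mode: on each clock tick the data moves one
    register along the chain R1 -> R2 -> ... -> Rr."""

    def __init__(self, r):
        self.regs = [None] * r  # R1 .. Rr, initially empty

    def tick(self, incoming):
        """Shift in a new value; return the value shifted out of Rr."""
        leaving = self.regs[-1]
        self.regs = [incoming] + self.regs[:-1]
        return leaving

# With r = 2, a value entering on one tick exits two ticks later,
# i.e. it is held inside the same PE's RF for r cycles.
rf = ShiftRegisterFile(2)
assert rf.tick(10) is None
assert rf.tick(20) is None
assert rf.tick(30) == 10
```

This is exactly the mechanism that lets the compiler delay reused data without dedicating a PE to routing, as the text explains.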
After adding the serial-shift data path, each register can select data from the FU or from the previous register, since a 2-to-1 multiplexer precedes each register. The RF can now operate in either normal mode or shift-register mode. In normal mode, a register holds data for the next time step. In shift-register mode, the registers of the processing units form a register chain whose length can be selected by multiplexer, so the number of clock cycles the data spends in transit is flexibly configurable. Assuming the Manhattan distance between the data producer and consumer is M1 and the number of registers inside each RF is r, the AT tuning range of the data is [M1+1, r×(M1+1)], which greatly enhances the ability of data to arrive synchronously. This inter-PE register interconnection structure therefore provides a hardware basis for data synchronization and a flexibility guarantee for subsequent compilation mapping.
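The tuning-range formula above can be checked numerically with a one-line sketch (the helper name is ours):

```python
def at_range(m1, r):
    """Arrival-time (AT) tuning range [M1+1, r*(M1+1)] for reused data,
    where m1 is the producer-consumer Manhattan distance and r is the
    number of registers inside each RF."""
    base = m1 + 1
    return (base, r * base)

# With r = 2 (as in the Fig. 2 example architecture), data sent to an
# adjacent PE (M1 = 1) can be made to arrive after 2 to 4 clock cycles.
assert at_range(1, 2) == (2, 4)
assert at_range(3, 4) == (4, 16)
```

The widened upper bound is what lets the compiler match AT to RT without constraining operator placement.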
The technical effects of the application are as follows: the embodiments provide a dynamically reconfigurable processor for data-intensive applications whose architecture addresses the shortcoming that, in a conventional PE interconnection network, only the functional units (FUs) are connected through output registers and there is no direct interconnection channel between the register files of the PEs. By adding an interconnection function to the register file (RF) of each PE in the processing unit array, reusable data can flow efficiently within the array, repeated accesses to the same storage location are avoided, the volume of memory accesses is reduced at the source, and the loop-pipelining performance of the dynamically reconfigurable processor is greatly improved.
As shown in fig. 2, in one embodiment of the present application, the method further includes the following steps:
Step 1: convert the application pseudocode into the original data dependence graph. Observing that data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, remove the two memory operators L2 and L3 and add two new reuse dependence edges (L1, *) and (L1, +) to obtain a new data dependence graph. (L1, *) indicates that the data fetched by memory operator L1 is transmitted through the value network to the multiplication operator "*" for consumption; (L1, +) indicates that the data fetched by memory operator L1 is transmitted through the value network to the addition operator "+" for consumption. A compilation scheme is obtained with a compilation tool and a configuration information stream is generated, yielding a placement result with II = 1: memory operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the store operator S1 on PE4.
Step 2: at time t, driven by the configuration stream, after the L1 operator finishes fetching its data, the data is placed in the last register of PE1.
Step 3: at time t+1, driven by the configuration stream, the time-t L1 data is transferred out through multiplexer Mb of PE1, passes through multiplexers Ma and M1 of PE2, and reaches the first register R1 of PE2; after the L1 operator at time t+1 finishes fetching its data, the data is placed in the output register of PE1 and simultaneously in the last register of PE1.
Step 4: at time t+2, driven by the configuration stream, the time-t L1 data is transferred out through multiplexer Mb of PE2, passes through multiplexers Ma and M1 of PE3, and reaches the first register R1 of PE3. At the same time, the time-t L1 data is also passed to the second input port of the FU of PE2 through the multiplexers Md and Mf in front of that FU; meanwhile, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2. Driven by the configuration stream, the FU performs the multiplication "*" operation, and the result is stored in the output register of PE2.
Step 5: at time t+3, driven by the configuration stream, the time-t L1 data is transferred through the output port of the first register R1 of PE3 and the multiplexer M2 to the second register R2 of PE3; at the same time, the FU of PE2 performs the multiplication operation, and the result is stored in the output register of PE2.
Step 6: at time t+4, driven by the configuration stream, the time-t L1 data reaches the second input port of the FU of PE3 through the multiplexers Md and Mf in front of that FU; at the same time, the data in the output register of PE2 is transmitted to the first input port through the multiplexer Me in front of the FU of PE3. Driven by the configuration stream, the FU performs the addition "+" operation, and the result is stored in the output register of PE3.
Step 7: at time t+5, driven by the configuration stream, the data in the output register of PE3 is transferred through the multiplexer Me to the first input port of the FU of PE4, and the data is stored in the bank through the bus.
In one embodiment of the application, specifically, after 6 clock cycles one complete iteration of the loop pipeline has been executed. Since II = 1, the processing unit array executes with the same configuration information in every clock cycle.
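This throughput claim can be sketched with the standard software-pipelining cycle count, using the 6-cycle iteration latency of the example above (the function name is ours):

```python
def pipeline_cycles(iterations, depth=6, ii=1):
    """Total cycles for a modulo-scheduled loop: `depth` cycles for the
    first iteration to drain, plus one new iteration started every
    `ii` cycles thereafter."""
    return depth + (iterations - 1) * ii

assert pipeline_cycles(1) == 6      # a single iteration spans t .. t+5
assert pipeline_cycles(100) == 105  # with II = 1, ~one iteration per cycle
```

With II = 1 the fill cost of the pipeline is amortized quickly, which is why minimizing the initiation interval is the central goal of the modulo-scheduling flow described in the Background.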
In one embodiment of the present application, specifically, FIG. 2(a) is an example loop pseudocode; (b) is the original DDG obtained from (a); (c) is the data-reuse DDG obtained from (b); (d) is the example application architecture (m=2, n=2, r=2); and (e) illustrates the transmission of the data fetched by operator L1 at time t on the example architecture.
In order to implement the above embodiments, the present application further proposes a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the reusable-data transfer method according to the embodiment of the first aspect of the present application.
Although the application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and is not intended to limit the scope of the application. The scope of the application is defined by the appended claims and may include various modifications, alterations and equivalents without departing from the scope and spirit of the application.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Additional implementations are included within the scope of the preferred embodiments of the present application, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.
Logic and/or steps represented in the schematic diagrams or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following well-known techniques may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (5)

1. A dynamic reconfigurable processor for data intensive applications, wherein the dynamic reconfigurable processor comprises a processing unit array, an on-chip multi-bank scratchpad memory and a configuration memory, the processing unit array is composed of m x n processing units PE in the form of a two-dimensional array, m and n being positive integers, wherein the PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a cross-selection matrix unit;
each PE comprises a functional unit FU, a local register file RF, an output register and a configuration register; the functional unit FU is configured to perform various fixed-point operations, and two multiplexers at the input of the functional unit FU are configured to select data from different sources; the local register file RF is divided into r independent registers, r being a positive integer, and each register selects data either from the functional unit FU or from the previous register; the information in the configuration register of each processing unit PE comes from the configuration memory, the configuration memory is connected to every component inside the processing unit PE, and the distributed configuration stream configures the selection signal of each multiplexer, the function of the functional unit FU, and the read/write enables of the registers;
wherein the method for delivering reusable data based on the dynamically reconfigurable processor comprises the following steps:
step 1, converting the application pseudocode into the original data dependency graph; since the data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, removing the two operators L2 and L3 and adding two new reuse dependence edges (L1, *) and (L1, +) to obtain a new data dependency graph, wherein (L1, *) indicates that the data fetched by operator L1 is transferred through the numerical network to the multiplication operator "*" for consumption, and (L1, +) indicates that the data fetched by operator L1 is transferred through the numerical network to the addition operator "+" for consumption; obtaining a compilation scheme through a compilation tool, generating a configuration-information stream, and obtaining a placement result with II = 1: the memory-access operator L1 is placed on PE1, the multiplication operator on PE2, the addition operator on PE3, and the memory-store operator S1 on PE4;
step 2, at time t, driven by the configuration stream, after operator L1 finishes fetching the data, the data is placed in the last register of PE1;
step 3, at time t+1, driven by the configuration stream, the data fetched by L1 at time t is transferred out through the multiplexer Mb of PE1, passes through the multiplexers Ma and M1 of PE2, and reaches the first register R1 of PE2; at time t+1, after operator L1 finishes fetching its data, that data is placed in the output register of PE1 and simultaneously in the last register of PE1;
step 4, at time t+2, driven by the configuration stream, the data fetched by L1 at time t is transferred out through the multiplexer Mb of PE2, passes through the multiplexers Ma and M1 of PE3, and reaches the first register R1 of PE3; at the same time, driven by the configuration stream, the data fetched by L1 at time t is also passed to the second input port of the FU through the multiplexers Md and Mf in front of the FU of PE2; meanwhile, driven by the configuration stream, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2, the FU performs the multiplication operation, and the result is stored in the output register of PE2;
step 5, at time t+3, driven by the configuration stream, the data fetched by L1 at time t is transferred through the output port of the first register R1 of PE3 and the multiplexer M2 to the second register R2 of PE3; meanwhile, PE2 performs the multiplication operation, and the result is stored in the output register of PE2;
step 6, at time t+4, driven by the configuration stream, the data fetched by L1 at time t reaches the second input port of the FU through the multiplexers Md and Mf in front of the FU of PE3; meanwhile, the data in the output register of PE2 is transferred to the first input port of the FU through the multiplexer Me in front of the FU of PE3; driven by the configuration stream, the FU performs the addition "+" operation, and the result is stored in the output register of PE3;
step 7, at time t+5, driven by the configuration stream, the data in the output register of PE3 is transferred through the multiplexer Me to the first input port of the FU of PE4, and the data is stored in a Bank via the bus.
2. The dynamically reconfigurable processor according to claim 1, wherein the processing units PE are interconnected by a two-channel network comprising a result network and a numerical network, the result network carrying the transfer of the computation results of the functional units FU, and the numerical network carrying the value transfers of the local register files RF.
3. The dynamically reconfigurable processor according to claim 2, wherein, when a value obtained from the memory by one memory-access operator needs to be referenced multiple times within a short time, the value is distributed through the numerical network to the other processing units PE that require it.
4. A dynamically reconfigurable processor according to claim 2 or 3, wherein serially-shifted data channels are added between the internal registers of the local register file RF.
5. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a method of delivering reusable data as claimed in claim 1.
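For illustration only, the behavioural sketch below replays one iteration of the step 2 to step 7 schedule of claim 1, again assuming a loop body of the form y = x[i] * x[i+1] + x[i+2]; the PE and multiplexer names appear only in comments, the timing labels are taken from the claim, and the simulation is a simplified behavioural model, not the hardware:

```python
# Behavioural sketch (assumption: loop body y = x[i] * x[i+1] + x[i+2]).
# a, b, c stand for x[i], x[i+1], x[i+2]; hardware names live in comments.

def one_iteration(a, b, c):
    trace = []
    trace.append(("t",   "PE1: L1 fetches data into last RF register", c))  # step 2
    trace.append(("t+1", "PE2: R1 <- data via numerical network", c))       # step 3
    prod = a * b                                    # step 4: FU of PE2, "*"
    trace.append(("t+2", "PE2: output register <- a*b", prod))
    trace.append(("t+3", "PE3: R2 <- R1 via M2", c))                        # step 5
    total = prod + c                                # step 6: FU of PE3, "+"
    trace.append(("t+4", "PE3: output register <- a*b + c", total))
    trace.append(("t+5", "PE4: store result to Bank via bus", total))       # step 7
    return total, trace

result, trace = one_iteration(2, 3, 4)   # result == 2*3 + 4 == 10
```

The trace makes explicit that the value fetched by L1 rides the RF chain for several cycles and is consumed at two different delays, which is exactly what the reuse edges (L1, *) and (L1, +) encode.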
CN202110703118.0A 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application Active CN113407483B (en)


Publications (2)

Publication Number Publication Date
CN113407483A CN113407483A (en) 2021-09-17
CN113407483B true CN113407483B (en) 2023-12-12

