CN113407483B - Dynamic reconfigurable processor for data intensive application - Google Patents


Info

Publication number
CN113407483B
Authority
CN
China
Prior art keywords: data, register, configuration, time, operator
Legal status: Active
Application number
CN202110703118.0A
Other languages
Chinese (zh)
Other versions
CN113407483A (en)
Inventor
Liu Dajiang (刘大江)
Zhu Rong (朱蓉)
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202110703118.0A priority Critical patent/CN113407483B/en
Publication of CN113407483A publication Critical patent/CN113407483A/en
Application granted granted Critical
Publication of CN113407483B publication Critical patent/CN113407483B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023 Two dimensional arrays, e.g. mesh, torus
    • G06F15/78 Architectures comprising a single central processing unit
    • G06F15/7867 Architectures with reconfigurable architecture
    • G06F15/7871 Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Logic Circuits (AREA)
  • Multi Processors (AREA)

Abstract

The application provides a dynamically reconfigurable processor for data-intensive applications. The processor comprises a processing unit array, an on-chip multi-bank scratchpad memory (SPM), and a configuration memory. The processing unit array is formed by m×n processing units (PEs) arranged as a two-dimensional array, where m and n are positive integers; PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a crossbar selection matrix unit. The proposed architecture lets reusable data flow efficiently within the processing unit array, avoids repeated accesses to the same storage location, reduces the volume of memory accesses at the source, and greatly improves the loop-pipelining performance of the dynamically reconfigurable processor.

Description

Dynamic reconfigurable processor for data intensive application
Technical Field
The application relates to the technical field of integrated circuits, and in particular to dynamically reconfigurable processors.
Background
With the development of cloud computing, big data, the Internet of Things, and related technologies, intelligent terminal devices are proliferating ever faster, and the demand for high-performance chips is increasingly urgent. The dynamically reconfigurable processor is a new processor architecture whose energy efficiency approaches that of an application-specific integrated circuit (ASIC) without sacrificing too much programming flexibility, making it one of the ideal architectures for accelerating data-intensive applications. Unlike a conventional general-purpose processor (GPP), a dynamically reconfigurable processor incurs no latency or energy overhead for instruction fetch and decode; unlike an ASIC, it can reconfigure circuit functionality dynamically at run time, offering better flexibility; unlike a field-programmable gate array (FPGA), it uses coarse-grained configuration, which reduces configuration-information overhead and yields higher computational energy efficiency.
A typical dynamically reconfigurable processor generally consists of a processing unit array, a data memory, and a configuration memory. The processing unit array is made up of multiple processing elements (PEs); the functionality of the whole array is defined by configuring the connectivity and operation mode of each PE. The configuration is derived mainly from the mapping produced by a specific compilation algorithm. Modulo-scheduled loop software pipelining is one of the most common mapping-optimization methods in compilation; it improves the parallel execution performance of an application by minimizing the initiation interval (II) of loop iterations. However, high computational parallelism causes large amounts of data to be accessed in parallel between the data memory and the processing unit array. To relieve this parallel data-access pressure, existing reconfigurable processors typically use an on-chip multi-bank scratchpad memory (SPM) to supply data to the processing unit array in parallel. A 4×4 processing unit array is typically paired with a 4-bank SPM, in which each row of the array can access each bank in parallel, but different PEs within the same row can only access data serially because they share a data bus. To further enhance parallel data-access capability, HReA [L. Liu, Z. Li, C. Yang, C. Deng, S. Yin and S. Wei, "HReA: An Energy-Efficient Embedded Dynamically Reconfigurable Fabric for 13-Dwarfs Processing," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 3, pp. 381-385, March 2018] provides a 4×4 processing unit array with a 16-bank SPM. In that architecture all PEs can access every bank of the SPM in parallel, greatly enhancing parallel data-access capability, but this also increases the difficulty of bank management and increases chip area and power consumption.
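The bank-conflict cost that motivates these designs can be illustrated with a small sketch (the word-interleaved bank mapping and the function name are our own illustrative assumptions, not part of the patent):

```python
from collections import Counter

def extra_cycles(addresses, num_banks=4):
    """Extra stall cycles when the given word addresses are issued in the
    same clock cycle to a word-interleaved multi-bank SPM: each bank can
    service one request per cycle, so surplus requests to a bank stall."""
    per_bank = Counter(addr % num_banks for addr in addresses)
    return sum(count - 1 for count in per_bank.values())

# Four accesses spread over four banks proceed conflict-free...
assert extra_cycles([0, 1, 2, 3]) == 0
# ...but three accesses all hitting bank 0 cost two extra cycles.
assert extra_cycles([0, 4, 8]) == 2
# A 16-bank SPM (as in HReA) removes this particular conflict.
assert extra_cycles([0, 4, 8], num_banks=16) == 0
```

The last assertion shows why adding banks helps, and why it is not free: every extra bank adds area, power, and management complexity, which is the trade-off discussed above.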
To make full use of the limited SPM bandwidth, the literature [S. Yin, X. Yao, T. Lu, D. Liu, J. Gu, L. Liu, and S. Wei, "Conflict-free loop mapping for coarse-grained reconfigurable architecture with multibank memory," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, pp. 2471-2485, 2017] proposes a conflict-free loop-mapping method from the compilation perspective. The method first schedules the memory-access operators in the DFG to different time steps to reduce the amount of parallel data access, then organizes the storage locations of data in the SPM through a memory-partitioning algorithm, and finally reduces data-access conflicts. However, as the initiation interval of the loop pipeline becomes smaller and smaller, multiple memory operators have to execute simultaneously, and ultimately performance must be sacrificed to guarantee conflict-free data access. The above work allows the original memory operators of an application to run conflict-free from the architectural or compilation point of view, but it does not fundamentally reduce the number of memory operators, which limits the extent to which memory conflicts can be optimized.
From the perspective of data-intensive applications, there are many data-reuse opportunities in their loop kernels, for example in stencil computation. Although a loop kernel may contain many memory accesses, some of them actually read the same data. If the same data is fetched once and used multiple times, access conflicts can be reduced by eliminating memory-access operators. The unavoidable question is then how to route such reusable data between PEs.
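The arithmetic of this reuse is easy to sketch. For a hypothetical 3-point stencil kernel that references x[i], x[i+1], and x[i+2] in every iteration (the window size and function names are illustrative, not taken from the patent):

```python
def loads_without_reuse(iterations, refs_per_iter=3):
    """Every array reference issues its own load operator."""
    return iterations * refs_per_iter

def loads_with_reuse(iterations, window=3):
    """Fetch each element exactly once: the first iteration primes the
    reuse window; each later iteration loads only the one new element."""
    return window + (iterations - 1)

assert loads_without_reuse(100) == 300   # 3 memory operators per iteration
assert loads_with_reuse(100) == 102      # ~1 memory operator per iteration
```

Cutting the load count roughly threefold directly shrinks the pool of memory operators that can collide in any given cycle, which is the effect the following paragraphs build on.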
In a conventional processing unit array, only a single-channel network connects the processing units: data can be transferred between PEs only by entering through the two multiplexer (MUX) inputs of a PE and leaving through its output register, and a PE performing such a routing operation can do nothing else, which wastes resources. If instead the data is not routed onward but retained in the PE's local register file (LRF) for multiple cycles, the mapping of operators onto the processing unit array becomes very constrained at compile time: an operator that needs the data in a given PE's LRF can only be mapped onto that PE. Therefore, how to design an architecture that routes data between PEs efficiently, makes full use of PE resources, and keeps operator mapping flexible at compile time is a technical problem that those skilled in the field need to solve. Solving the efficient data-routing problem makes it possible to fully exploit reusable data, thereby reducing memory-access operators and access conflicts and ultimately improving the execution performance of the dynamically reconfigurable processor.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the related art to some extent.
Therefore, a first object of the present application is to provide a dynamic reconfigurable processor for data-intensive applications, so as to improve the parallel data access capability of the reconfigurable processor.
A second object of the application is to propose a non-transitory computer readable storage medium.
To achieve the above objects, an embodiment of the first aspect of the present application provides a dynamically reconfigurable processor for data-intensive applications. The processor includes a processing unit array, an on-chip multi-bank scratchpad memory (SPM), and a configuration memory. The processing unit array is formed by m×n processing units (PEs) arranged as a two-dimensional array, where m and n are positive integers; PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a crossbar selection matrix unit.
Optionally, in an embodiment of the present application, each PE includes a functional unit (FU) for performing various fixed-point operations, a local register file (RF), an output register, and a configuration register. The FU has two multiplexers at its inputs for accessing data from different sources; the RF is divided into r individual registers, where r is a positive integer, and each register selects data from the FU or from the previous register. The information in the configuration register of each PE comes from the configuration memory; the configuration register is connected to the components inside the PE, and the distributed configuration stream configures the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
Optionally, in one embodiment of the application, the processing units PE are interconnected by a two-channel network comprising a result network for transferring the computation results of the functional unit FU and a value network for transferring the values of the local register file RF.
Optionally, in one embodiment of the application, when a value fetched from memory by a memory-access operator needs to be referenced multiple times within a short period, the value is distributed over the value network to the other processing units PE that need to reference it.
Optionally, in one embodiment of the application, a serial-shift data path is added among the internal registers of the local register file RF.
Optionally, in one embodiment of the present application, the method comprises the following steps:
Step 1: convert the application pseudocode into the original data dependence graph. Since data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, remove the two memory operators L1 and L2... specifically, remove the two operators L2 and L3 and add two new reuse dependence edges (L1, *) and (L1, +) to obtain a new data dependence graph. (L1, *) indicates that the data fetched by memory operator L1 is transmitted through the value network to the multiplication operator "*" for consumption; (L1, +) indicates that the data fetched by memory operator L1 is transmitted through the value network to the addition operator "+" for consumption. A compilation scheme is obtained with a compilation tool and a configuration information stream is generated, yielding a placement result with II = 1: memory operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the store operator S1 on PE4.
Step 2: at time t, driven by the configuration stream, after the L1 operator finishes fetching its data, the data is placed in the last register of PE1.
Step 3: at time t+1, driven by the configuration stream, the time-t L1 data is transferred out through multiplexer Mb of PE1, passes through multiplexers Ma and M1 of PE2, and reaches the first register R1 of PE2; after the L1 operator at time t+1 finishes fetching its data, the data is placed in the output register of PE1 and simultaneously in the last register of PE1.
Step 4: at time t+2, driven by the configuration stream, the time-t L1 data is transferred out through multiplexer Mb of PE2, passes through multiplexers Ma and M1 of PE3, and reaches the first register R1 of PE3. At the same time, the time-t L1 data is also passed to the second input port of the FU of PE2 through the multiplexers Md and Mf in front of that FU; meanwhile, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2. Driven by the configuration stream, the FU performs the multiplication "*" operation, and the result is stored in the output register of PE2.
Step 5: at time t+3, driven by the configuration stream, the time-t L1 data is transferred through the output port of the first register R1 of PE3 and the multiplexer M2 to the second register R2 of PE3; at the same time, the FU of PE2 performs the multiplication operation, and the result is stored in the output register of PE2.
Step 6: at time t+4, driven by the configuration stream, the time-t L1 data reaches the second input port of the FU of PE3 through the multiplexers Md and Mf in front of that FU; at the same time, the data in the output register of PE2 is transmitted to the first input port through the multiplexer Me in front of the FU of PE3. Driven by the configuration stream, the FU performs the addition "+" operation, and the result is stored in the output register of PE3.
Step 7: at time t+5, driven by the configuration stream, the data in the output register of PE3 is transferred through the multiplexer Me to the first input port of the FU of PE4, and the data is stored in the bank through the bus.
To achieve the above objects, an embodiment of the second aspect of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the reusable-data transfer method according to the embodiment of the first aspect of the present application.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a dynamic reconfigurable processor for data-intensive applications according to an embodiment of the present application;
fig. 2 is a schematic diagram of an implementation of the architecture according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
A dynamically reconfigurable processor for data-intensive applications according to embodiments of the present application is described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a dynamically reconfigurable processor for data-intensive applications according to an embodiment of the present application.
As shown in fig. 1, to achieve the above objects, an embodiment of the present application proposes a dynamically reconfigurable processor for data-intensive applications. The processor includes a processing unit array, an on-chip multi-bank scratchpad memory (SPM), and a configuration memory. The processing unit array is formed by m×n processing units (PEs) arranged as a two-dimensional array, where m and n are positive integers; PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a crossbar selection matrix unit.
In one embodiment of the present application, each PE further comprises a functional unit (FU) for performing various fixed-point operations, a local register file (RF), an output register, and a configuration register. The FU has two multiplexers at its inputs for accessing data from different sources. The RF is divided into r individual registers, where r is a positive integer, and each register selects data originating from the FU or from the previous register. The information in the configuration register of each PE comes from the configuration memory, which is connected to the components inside the PE; the distributed configuration stream configures the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
In one embodiment of the application, further, the processing units PE are interconnected by a two-channel network comprising a result network for transferring the computation results of the functional unit FU and a value network for transferring the values of the local register file RF.
In one embodiment of the application, further, when a value fetched from memory by a memory-access operator needs to be referenced multiple times within a short period, the value is distributed over the value network to the other processing units PE that need to reference it.
In one embodiment of the application, further, a serial-shift data path is added among the internal registers of the local register file RF.
Specifically, in one embodiment of the present application, the FU may perform various fixed-point operations, including addition, subtraction, multiplication, and logical operations. The FU inputs have two 6-to-1 multiplexers (Me, Mf) that can access data from different sources, e.g., the FUs of neighboring PEs, the individual local registers of the PE's own register file RF, and memory. The FU output has three destinations: memory, the output register, and the individual registers of the register file RF.
The RF is divided into r individual registers (R1, R2, ..., Rr, where r is a positive integer), each preceded by a 2-to-1 multiplexer (M1, M2, ..., Mr) that can select data from the FU or from the previous register (the "previous register" of the first register R1 is a register of an adjacent RF). The output port of each register is connected to its successor register (the last register Rr has no successor) and to three r-to-1 multiplexers (Mb, Mc, Md), through which the data in any one register can be selected and sent to the local FU for computation. The data in a register can also be selected and sent to the first register (R1) of an adjacent RF through the multiplexer Mb. Meanwhile, the first register R1 of an RF selects data from one of the adjacent RFs through the 4-to-1 multiplexer Ma.
The information in the configuration register of each PE comes from the configuration memory, which is connected to the components inside the PE; the distributed configuration stream configures the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
In one embodiment of the application, specifically, the interconnection between processing units changes from the original single-channel network to a two-channel network: one channel transfers the FU computation results (the result network) and the other transfers RF values (the value network). The result network passes the FU's computation result through the output register to a neighboring PE, or through the bus to memory. The value network passes the values in the RF through a multiplexer, flexibly configuring the output direction of the data, and transfers the data to the first register of an adjacent RF or to other local registers.
When a value fetched from memory by one memory-access operator needs to be referenced multiple times within a short period, the value can be distributed over the constructed value network to the other PEs that need to reference it, which strengthens the data-reuse capability of the reconfigurable computing array. The two-channel interconnection network can reduce the number of memory-access operators in the data flow graph at the source, thereby reducing access conflicts in the data memory.
In one embodiment of the present application, specifically, regarding the register interconnection within the processing unit: while the two-channel inter-PE interconnection network design enhances the flexibility of data transfer, in pipelined mode the correctness of the computed function is guaranteed only when the required time (RT), the clock cycle at which the data should arrive, equals the arrival time (AT), the clock cycle at which the data actually arrives. The AT of reused data depends on the Manhattan distance between the data-producer PE and the data-consumer PE, and it is difficult to guarantee during compilation that this Manhattan distance matches the RT. We therefore add a serial-shift data path to the RF-internal registers, i.e., the registers inside an RF are connected head to tail, so that in pipelined mode the reused data can still be held inside the RF of the same PE for multiple clock cycles.
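A minimal behavioral model of this serial-shift path (the class and method names are our own, not from the patent) shows how a value is retained inside one RF for r clock cycles before leaving the chain:

```python
class ShiftRegisterFile:
    """RF in shift-register mode: on each clock tick the data moves one
    register along the chain R1 -> R2 -> ... -> Rr."""

    def __init__(self, r):
        self.regs = [None] * r  # R1 .. Rr, initially empty

    def tick(self, incoming):
        """Shift in a new value; return the value shifted out of Rr."""
        leaving = self.regs[-1]
        self.regs = [incoming] + self.regs[:-1]
        return leaving

# With r = 2, a value entering on one tick exits two ticks later,
# i.e. it is held inside the same PE's RF for r cycles.
rf = ShiftRegisterFile(2)
assert rf.tick(10) is None
assert rf.tick(20) is None
assert rf.tick(30) == 10
```

This is exactly the mechanism that lets the compiler delay reused data without dedicating a PE to routing, as the text explains.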
After adding the serial-shift data path, each register can select data from the FU or from the previous register, since a 2-to-1 multiplexer precedes each register. The RF can now operate in either normal mode or shift-register mode. In normal mode, a register holds data for the next time step. In shift-register mode, the registers of the processing units form a register chain whose length can be selected by multiplexer, so the number of clock cycles the data spends in transit is flexibly configurable. Assuming the Manhattan distance between the data producer and consumer is M1 and the number of registers inside each RF is r, the AT tuning range of the data is [M1+1, r×(M1+1)], which greatly enhances the ability of data to arrive synchronously. This inter-PE register interconnection structure therefore provides a hardware basis for data synchronization and a flexibility guarantee for subsequent compilation mapping.
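The tuning-range formula above can be checked numerically with a one-line sketch (the helper name is ours):

```python
def at_range(m1, r):
    """Arrival-time (AT) tuning range [M1+1, r*(M1+1)] for reused data,
    where m1 is the producer-consumer Manhattan distance and r is the
    number of registers inside each RF."""
    base = m1 + 1
    return (base, r * base)

# With r = 2 (as in the Fig. 2 example architecture), data sent to an
# adjacent PE (M1 = 1) can be made to arrive after 2 to 4 clock cycles.
assert at_range(1, 2) == (2, 4)
assert at_range(3, 4) == (4, 16)
```

The widened upper bound is what lets the compiler match AT to RT without constraining operator placement.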
The technical effects of the application are as follows: the embodiments provide a dynamically reconfigurable processor for data-intensive applications whose architecture addresses the shortcoming that, in a conventional PE interconnection network, only the functional units (FUs) are connected through output registers and there is no direct interconnection channel between the register files of the PEs. By adding an interconnection function to the register file (RF) of each PE in the processing unit array, reusable data can flow efficiently within the array, repeated accesses to the same storage location are avoided, the volume of memory accesses is reduced at the source, and the loop-pipelining performance of the dynamically reconfigurable processor is greatly improved.
As shown in fig. 2, in one embodiment of the present application, the method further includes the following steps:
Step 1: convert the application pseudocode into the original data dependence graph. Observing that data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, remove the two memory operators L2 and L3 and add two new reuse dependence edges (L1, *) and (L1, +) to obtain a new data dependence graph. (L1, *) indicates that the data fetched by memory operator L1 is transmitted through the value network to the multiplication operator "*" for consumption; (L1, +) indicates that the data fetched by memory operator L1 is transmitted through the value network to the addition operator "+" for consumption. A compilation scheme is obtained with a compilation tool and a configuration information stream is generated, yielding a placement result with II = 1: memory operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the store operator S1 on PE4.
Step 2: at time t, driven by the configuration stream, after the L1 operator finishes fetching its data, the data is placed in the last register of PE1.
Step 3: at time t+1, driven by the configuration stream, the time-t L1 data is transferred out through multiplexer Mb of PE1, passes through multiplexers Ma and M1 of PE2, and reaches the first register R1 of PE2; after the L1 operator at time t+1 finishes fetching its data, the data is placed in the output register of PE1 and simultaneously in the last register of PE1.
Step 4: at time t+2, driven by the configuration stream, the time-t L1 data is transferred out through multiplexer Mb of PE2, passes through multiplexers Ma and M1 of PE3, and reaches the first register R1 of PE3. At the same time, the time-t L1 data is also passed to the second input port of the FU of PE2 through the multiplexers Md and Mf in front of that FU; meanwhile, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2. Driven by the configuration stream, the FU performs the multiplication "*" operation, and the result is stored in the output register of PE2.
Step 5: at time t+3, driven by the configuration stream, the time-t L1 data is transferred through the output port of the first register R1 of PE3 and the multiplexer M2 to the second register R2 of PE3; at the same time, the FU of PE2 performs the multiplication operation, and the result is stored in the output register of PE2.
Step 6: at time t+4, driven by the configuration stream, the time-t L1 data reaches the second input port of the FU of PE3 through the multiplexers Md and Mf in front of that FU; at the same time, the data in the output register of PE2 is transmitted to the first input port through the multiplexer Me in front of the FU of PE3. Driven by the configuration stream, the FU performs the addition "+" operation, and the result is stored in the output register of PE3.
Step 7: at time t+5, driven by the configuration stream, the data in the output register of PE3 is transferred through the multiplexer Me to the first input port of the FU of PE4, and the data is stored in the bank through the bus.
In one embodiment of the application, specifically, after 6 clock cycles one complete iteration of the loop pipeline has been executed. Since II = 1, the processing unit array executes with the same configuration information in every clock cycle.
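This throughput claim can be sketched with the standard software-pipelining cycle count, using the 6-cycle iteration latency of the example above (the function name is ours):

```python
def pipeline_cycles(iterations, depth=6, ii=1):
    """Total cycles for a modulo-scheduled loop: `depth` cycles for the
    first iteration to drain, plus one new iteration started every
    `ii` cycles thereafter."""
    return depth + (iterations - 1) * ii

assert pipeline_cycles(1) == 6      # a single iteration spans t .. t+5
assert pipeline_cycles(100) == 105  # with II = 1, ~one iteration per cycle
```

With II = 1 the fill cost of the pipeline is amortized quickly, which is why minimizing the initiation interval is the central goal of the modulo-scheduling flow described in the Background.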
In one embodiment of the present application, specifically, FIG. 2(a) is an example loop pseudocode; (b) is the original DDG obtained from (a); (c) is the data-reuse DDG obtained from (b); (d) is the example application architecture (m=2, n=2, r=2); and (e) illustrates the transmission of the data fetched by operator L1 at time t on the example architecture.
In order to implement the above embodiments, the present application further proposes a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the reusable-data transfer method according to the embodiment of the first aspect of the present application.
Although the application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and is not intended to limit the scope of the application. The scope of the application is defined by the appended claims and may include various modifications, alterations and equivalents without departing from the scope and spirit of the application.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Additional implementations are included within the scope of the preferred embodiments of the present application, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.
Logic and/or steps represented in the schematic diagrams or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following well-known techniques may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (5)

1. A dynamic reconfigurable processor for data intensive applications, wherein the dynamic reconfigurable processor comprises a processing unit array, an on-chip multi-bank scratchpad memory and a configuration memory, the processing unit array is composed of m x n processing units PE in the form of a two-dimensional array, m and n being positive integers, wherein the PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a cross-selection matrix unit;
each PE comprises a functional unit FU, a local register file RF, an output register and a configuration register; the functional unit FU is configured to perform various fixed-point operations, and two multiplexers at the input of the functional unit FU are configured to select data from different sources; the local register file RF is divided into r independent registers, r being a positive integer, and each register selects data either from the functional unit FU or from the previous register; the information in the configuration register of each processing unit PE comes from the configuration memory, the configuration memory is connected to every component inside the processing unit PE, and the distributed configuration stream configures the selection signal of each multiplexer, the function of the functional unit FU, and the read/write enables of the registers;
wherein the method for delivering reusable data based on the dynamically reconfigurable processor comprises the following steps:
step 1, converting the application pseudocode into the original data dependency graph; since the data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, removing the two operators L2 and L3 and adding two new reuse dependence edges (L1, *) and (L1, +) to obtain a new data dependency graph, wherein (L1, *) indicates that the data fetched by operator L1 is transferred through the numerical network to the multiplication operator "*" for consumption, and (L1, +) indicates that the data fetched by operator L1 is transferred through the numerical network to the addition operator "+" for consumption; obtaining a compilation scheme through a compilation tool, generating a configuration-information stream, and obtaining a placement result with II = 1: the memory-access operator L1 is placed on PE1, the multiplication operator on PE2, the addition operator on PE3, and the memory-store operator S1 on PE4;
step 2, at time t, driven by the configuration stream, after operator L1 finishes fetching the data, the data is placed in the last register of PE1;
step 3, at time t+1, driven by the configuration stream, the data fetched by L1 at time t is transferred out through the multiplexer Mb of PE1, passes through the multiplexers Ma and M1 of PE2, and reaches the first register R1 of PE2; at time t+1, after operator L1 finishes fetching its data, that data is placed in the output register of PE1 and simultaneously in the last register of PE1;
step 4, at time t+2, driven by the configuration stream, the data fetched by L1 at time t is transferred out through the multiplexer Mb of PE2, passes through the multiplexers Ma and M1 of PE3, and reaches the first register R1 of PE3; at the same time, driven by the configuration stream, the data fetched by L1 at time t is also passed to the second input port of the FU through the multiplexers Md and Mf in front of the FU of PE2; meanwhile, driven by the configuration stream, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2, the FU performs the multiplication operation, and the result is stored in the output register of PE2;
step 5, at time t+3, driven by the configuration stream, the data fetched by L1 at time t is transferred through the output port of the first register R1 of PE3 and the multiplexer M2 to the second register R2 of PE3; meanwhile, PE2 performs the multiplication operation, and the result is stored in the output register of PE2;
step 6, at time t+4, driven by the configuration stream, the data fetched by L1 at time t reaches the second input port of the FU through the multiplexers Md and Mf in front of the FU of PE3; meanwhile, the data in the output register of PE2 is transferred to the first input port of the FU through the multiplexer Me in front of the FU of PE3; driven by the configuration stream, the FU performs the addition "+" operation, and the result is stored in the output register of PE3;
step 7, at time t+5, driven by the configuration stream, the data in the output register of PE3 is transferred through the multiplexer Me to the first input port of the FU of PE4, and the data is stored in a Bank via the bus.
2. The dynamically reconfigurable processor according to claim 1, wherein the processing units PE are interconnected by a two-channel network comprising a result network and a numerical network, the result network carrying the transfer of the computation results of the functional units FU, and the numerical network carrying the value transfers of the local register files RF.
3. The dynamically reconfigurable processor according to claim 2, wherein, when a value obtained from the memory by one memory-access operator needs to be referenced multiple times within a short time, the value is distributed through the numerical network to the other processing units PE that require it.
4. A dynamically reconfigurable processor according to claim 2 or 3, wherein serially-shifted data channels are added between the internal registers of the local register file RF.
5. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a method of delivering reusable data as claimed in claim 1.
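For illustration only, the behavioural sketch below replays one iteration of the step 2 to step 7 schedule of claim 1, again assuming a loop body of the form y = x[i] * x[i+1] + x[i+2]; the PE and multiplexer names appear only in comments, the timing labels are taken from the claim, and the simulation is a simplified behavioural model, not the hardware:

```python
# Behavioural sketch (assumption: loop body y = x[i] * x[i+1] + x[i+2]).
# a, b, c stand for x[i], x[i+1], x[i+2]; hardware names live in comments.

def one_iteration(a, b, c):
    trace = []
    trace.append(("t",   "PE1: L1 fetches data into last RF register", c))  # step 2
    trace.append(("t+1", "PE2: R1 <- data via numerical network", c))       # step 3
    prod = a * b                                    # step 4: FU of PE2, "*"
    trace.append(("t+2", "PE2: output register <- a*b", prod))
    trace.append(("t+3", "PE3: R2 <- R1 via M2", c))                        # step 5
    total = prod + c                                # step 6: FU of PE3, "+"
    trace.append(("t+4", "PE3: output register <- a*b + c", total))
    trace.append(("t+5", "PE4: store result to Bank via bus", total))       # step 7
    return total, trace

result, trace = one_iteration(2, 3, 4)   # result == 2*3 + 4 == 10
```

The trace makes explicit that the value fetched by L1 rides the RF chain for several cycles and is consumed at two different delays, which is exactly what the reuse edges (L1, *) and (L1, +) encode.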
CN202110703118.0A 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application Active CN113407483B (en)


Publications (2)

Publication Number Publication Date
CN113407483A CN113407483A (en) 2021-09-17
CN113407483B true CN113407483B (en) 2023-12-12

