WO2023030054A1 - Computing device, computing system and computing method - Google Patents

Computing device, computing system and computing method

Info

Publication number
WO2023030054A1
WO2023030054A1, PCT/CN2022/113709, CN2022113709W
Authority
WO
WIPO (PCT)
Prior art keywords
computing
target
array
reconfigurable
function
Prior art date
Application number
PCT/CN2022/113709
Other languages
English (en)
French (fr)
Inventor
郭一欣
刘琦
周骏
唐秦伟
Original Assignee
西安紫光国芯半导体有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安紫光国芯半导体有限公司 filed Critical 西安紫光国芯半导体有限公司
Publication of WO2023030054A1 publication Critical patent/WO2023030054A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7867 - Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F 15/7871 - Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C 11/00 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/21 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C 11/34 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C 11/40 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C 11/401 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming cells needing refreshing or charge regeneration, i.e. dynamic cells
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of integrated chips, and in particular to a computing device, a computing system and a computing method.
  • the in-memory computing system of a three-dimensional (3D) chip used as a computing device is an effective means of overcoming the storage wall.
  • the upper system can use a standard DDR (Double Data Rate Synchronous Dynamic Random Access Memory) interface, which may be DDR1, DDR2, DDR3, DDR4 or DDR5, LPDDR2, LPDDR3, LPDDR4 or LPDDR5, or GDDR1, GDDR2, GDDR3, GDDR4, GDDR5 or GDDR6, etc., to write data and configuration and control commands to the in-memory computing system.
  • the upper system retrieves the calculation result.
  • the input of data and the output of calculation results are transmitted through the external interface of the computing system and therefore must cross the storage wall; the storage accesses in the middle of the calculation process are completed inside the in-memory computing system.
  • Most of the storage accesses take place inside the in-memory computing system, and sharing computing data across the execution of multiple computing steps reduces the storage-wall barrier, that is, it reduces the extra power consumption and the loss of bandwidth caused by storage accesses that cross the storage wall.
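  • The host-device division of labor described above can be pictured with the following minimal sketch, in which the class and method names are purely illustrative: data and a target instruction cross the external interface once, all intermediate storage accesses stay inside the device, and only the final result crosses back.

    class InMemoryComputingSystem:
        """Toy stand-in for the 3D-chip in-memory computing device."""

        def __init__(self):
            self.data = {}       # host-visible data storage region
            self.results = {}    # result region read back by the host

        def write(self, address, payload):
            # One external-interface transfer: crosses the storage wall once.
            self.data[address] = payload

        def execute(self, instruction):
            # All intermediate storage accesses happen here, inside the device,
            # without crossing the external interface again.
            operands = self.data[instruction["src"]]
            self.results[instruction["dst"]] = [x * 2 for x in operands]  # placeholder compute

        def read(self, address):
            # One external-interface transfer to retrieve the final result.
            return self.results[address]

    device = InMemoryComputingSystem()
    device.write(0x1000, [1, 2, 3, 4])              # write target data
    device.execute({"src": 0x1000, "dst": 0x2000})  # issue target instruction
    print(device.read(0x2000))                      # retrieve the calculation result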
  • the next computing array then executes the next stage of calculation through local internal storage accesses in the corresponding next storage array, and the whole calculation process is completed step by step in this way.
  • the calculation result of the previous calculation array is usually part of the input data of the next calculation array.
  • as the position of the active computing array changes (i.e., as the calculation switches from one computing array to another), the computing data must also be transferred, and data cascading exists between adjacent computing arrays; as the amount of data transferred during the calculation process increases, global internal storage accesses incur a huge overhead and the calculation efficiency decreases.
  • the embodiments of the present application provide a computing device, a computing system and a computing method, which can improve the storage access structure of an existing three-dimensional-chip computing device, avoid frequent movement of data in the data storage arrays, reduce the cost of global internal storage accesses, and improve computing efficiency.
  • the first aspect of the embodiments of the present application provides a computing device, including: a data storage chip assembly, including at least one layer of data storage chips, where each data storage chip includes a plurality of data storage arrays and the data storage arrays are used to store target data and target instructions; a dynamically reconfigurable memory chip assembly, including at least one layer of dynamically reconfigurable memory chips, where each dynamically reconfigurable memory chip includes a plurality of dynamically reconfigurable memory arrays and the dynamically reconfigurable memory arrays are used to store computing function configuration files; and a reconfigurable computing chip assembly, including at least one layer of transiently reconfigurable computing chips and at least one layer of transient reconfiguration chips, where each transiently reconfigurable computing chip includes a plurality of transiently reconfigurable computing arrays and each transient reconfiguration chip includes a plurality of transient reconfiguration arrays; the transient reconfiguration arrays are used to obtain at least one target computing function configuration file from the dynamically reconfigurable memory arrays according to the instruction sequence of the target instruction and to configure the corresponding target computing functions, and the transiently reconfigurable computing arrays are used to execute, based on the target data, the target computing functions configured by the transient reconfiguration arrays.
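  • For orientation only, the sketch below models the three chip assemblies named in this aspect as plain Python classes; the class and field names are illustrative assumptions, not the claimed structure.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Optional

    @dataclass
    class DataStorageArray:
        """Holds target data and target instructions (and, later, result data)."""
        data: Dict[str, list] = field(default_factory=dict)
        instructions: List[dict] = field(default_factory=list)

    @dataclass
    class DynamicReconfigStorageArray:
        """Holds computing function configuration files, keyed by function name."""
        config_files: Dict[str, Callable] = field(default_factory=dict)

    @dataclass
    class TransientReconfigArray:
        """Fetches configuration files and makes one of them take effect."""
        loaded: Dict[str, Callable] = field(default_factory=dict)
        active: Optional[Callable] = None

        def load(self, store: DynamicReconfigStorageArray, names: List[str]) -> None:
            self.loaded = {n: store.config_files[n] for n in names}

        def activate(self, name: str) -> None:
            self.active = self.loaded[name]

    @dataclass
    class TransientReconfigComputingArray:
        """Executes whatever target computing function is currently configured."""
        config: TransientReconfigArray = field(default_factory=TransientReconfigArray)

        def run(self, operands):
            assert self.config.active is not None, "configuration has not taken effect yet"
            return self.config.active(operands)

    store = DynamicReconfigStorageArray(config_files={"double": lambda xs: [2 * x for x in xs]})
    cfg = TransientReconfigArray()
    cfg.load(store, ["double"])
    cfg.activate("double")
    print(TransientReconfigComputingArray(config=cfg).run([1, 2, 3]))  # [2, 4, 6]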
  • the second aspect of the embodiments of the present application provides a computing system, including the computing device described in the first aspect and a host system; the computing device includes an external storage access interface, and the host system is connected to the external storage access interface and delivers target instructions and target data to the computing device through the external storage access interface.
  • the third aspect of the embodiments of the present application provides a computing method, which is applied to the computing device described in the first aspect.
  • the method includes: storing, by the data storage arrays of the data storage chip assembly, the target data and the target instruction; obtaining, by the transient reconfiguration arrays of the reconfigurable computing chip assembly and from the dynamically reconfigurable memory arrays of the dynamically reconfigurable memory chip assembly, at least one target computing function configuration file corresponding to the at least one target computing function recorded in the instruction sequence of the target instruction; configuring, by the transient reconfiguration arrays, the at least one obtained target computing function configuration file; and executing, by the transiently reconfigurable computing arrays and based on the target data, the target computing functions in the order given by the target instruction, to obtain the corresponding result data.
  • the computing device, computing system and computing method provided by the embodiments of the present application use the data storage arrays in the data storage chip assembly to store the target instructions and target data issued by the upper system, use the dynamically reconfigurable memory arrays in the dynamically reconfigurable memory chip assembly to store the computing function configuration files, and use the transient reconfiguration arrays in the reconfigurable computing chip assembly to obtain the target computing function configuration files and configure the target computing functions, while the transiently reconfigurable computing arrays execute the target computing functions configured by the transient reconfiguration arrays.
  • the transient reconfiguration array can obtain at least one target computing function configuration file at one time, and complete the configuration of the corresponding target computing function.
  • the transiently reconfigurable computing array only needs to wait for the first function configuration of the transient reconfiguration array to be completed.
  • after that first function configuration is completed, the transiently reconfigurable computing array can execute the corresponding target computing function that has already been configured, and can then quickly switch to and execute the other corresponding target computing functions without waiting again for the function configuration of the transient reconfiguration array.
  • the computing function performed by the transiently reconfigurable computing array is determined by whichever target computing function configuration of the transient reconfiguration array currently takes effect, and the computing function of the transiently reconfigurable computing array can therefore be reconfigured.
  • when the computing function of each computing array is fixed, after one computing target is completed during the computing process, the next computing function moves to another computing array, and the intermediate result generated by the previous computing function is accessed by the computing array corresponding to the next computing function; therefore, an internal global storage access network connection needs to be established between all computing arrays and all data storage arrays.
  • after a computing array completes one stage of calculation through local internal storage accesses in its data storage array, the next computing array performs the next stage of calculation in the next data storage array through local internal storage accesses, and the whole calculation process is completed step by step.
  • the calculation result of the previous computing array is used as the input data of the next computing array, so as the calculation switches from one computing array to another, the computing data also needs to be transferred.
  • during the computing process, as the amount of data transferred increases, global internal storage accesses incur a huge overhead, which reduces computing efficiency and seriously affects the economy and practicability of 3D-chip computing devices.
  • in the computing device provided by the embodiments of the present application, providing the transiently reconfigurable computing array and the transient reconfiguration array makes the computing function performed by the transiently reconfigurable computing array reconfigurable, so all or part of the target computing functions corresponding to one target instruction can be completed in the same transiently reconfigurable computing array; no internal global storage access network connection needs to be established between the transiently reconfigurable computing arrays and the data storage arrays, and a one-to-one or many-to-one connection between the transiently reconfigurable computing arrays and the data storage arrays is sufficient, which avoids a large number of internal global storage accesses during the calculation process under one target instruction, avoids frequent switching of the transiently reconfigurable computing array and a large amount of data movement, and thereby improves the computing efficiency of the computing device and reduces computing power consumption.
  • when the transiently reconfigurable computing array executes the target computing functions recorded in the instruction sequence of the target instruction, it only needs to wait for the first function configuration of the transient reconfiguration array to be completed; between two adjacent target computing functions there is no need to wait for the transient reconfiguration array to complete a new computing function configuration, which saves configuration-waiting time, improves the execution efficiency of the target computing functions of the target instruction, and thereby further improves the computing efficiency of the computing device and further reduces computing power consumption.
  • FIG. 1 is a schematic structural diagram of a computing device provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a logical structure of a computing device provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the logical structure of another computing device provided by the embodiment of the present application.
  • FIG. 4 is a schematic diagram of a logical structure of another computing device provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a logical structure of another computing device provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of a transient reconstruction principle provided by the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computing system provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a partial structure of a computing device provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a computing method of a computing device provided by an embodiment of the present application.
  • the in-memory computing system of the computing device is an effective means to overcome the storage wall.
  • the upper system can use a standard DDR interface, which may be DDR1, DDR2, DDR3, DDR4 or DDR5, LPDDR2, LPDDR3, LPDDR4 or LPDDR5, or GDDR1, GDDR2, GDDR3, GDDR4, GDDR5 or GDDR6, etc., to write data and configuration and control instructions to the in-memory computing system.
  • the upper system retrieves the calculation result.
  • the input of data and the output of calculation results are transmitted through the external interface of the computing system and therefore must cross the storage wall; the storage accesses in the middle of the calculation process are completed inside the in-memory computing system.
  • the embodiments of the present application provide a computing device, a computing system, and a computing method, which can improve existing computing devices.
  • FIG. 1 is a schematic structural diagram of a computing device provided in an embodiment of the present application.
  • the computing device provided by the embodiment of the present application includes: a data storage chip assembly 100 , a reconfigurable computing chip assembly 200 and a dynamically reconfigurable storage chip assembly 300 .
  • the data storage chip assembly 100 includes at least one layer of data storage chips 110.
  • the data storage chip assembly 100 shown in FIG. 1 has only one layer of data storage chips 110; FIG. 1 is only schematic and is not intended as a specific limitation of this application.
  • the data storage chip 110 includes a plurality of data storage arrays 111, and the data storage arrays 111 are used to store target data, target instructions and calculation protocol data; the calculation protocol data includes, for example, the original data address, length and format type, and the target address (the storage address of the calculated or processed data), length and format type, etc.
  • Calculations can include numerical calculations, such as multiply-accumulate, convolution, correlation, matrix operations and image or video compression and decompression; they can also include digital signal processing calculations, such as the discrete Fourier transform, digital filtering and the discrete cosine transform; and they can include hybrid calculations mixing numerical calculation and digital signal processing, which is not specifically limited in this application. According to different storage requirements and storage scales, the data storage chip 110 can be provided with different numbers of data storage arrays 111.
  • FIG. 1 only schematically shows the number and arrangement of the data storage arrays 111, which are not specifically limited in this application.
  • the data storage array 111 may include at least one data storage unit, which is used to store different target data, which is not specifically limited in this application.
  • the target data may be distributed by the host system, which is not specifically limited in this application.
  • the dynamically reconfigurable memory chip assembly 300 includes at least one layer of dynamically reconfigurable memory chips 310.
  • the dynamically reconfigurable memory chip assembly 300 shown in FIG. 1 only includes one layer of dynamically reconfigurable memory chips 310.
  • FIG. 1 is only schematic and is not intended as a specific limitation of this application.
  • the dynamically reconfigurable memory chip 310 includes a plurality of dynamically reconfigurable memory arrays 311.
  • the dynamically reconfigurable memory arrays 311 are used to store computing function configuration files and fixed computing data. Some computing functions require fixed computing data, and the fixed computing data may include programming files and calculation constants, such as the convolution kernel weights of an image convolution or the coefficients of a finite impulse response filter, which are not specifically limited in this application.
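  • One way to picture a computing function configuration file that bundles a reconfiguration bitstream with fixed computing data (for example FIR coefficients or convolution kernel weights) is sketched below; the field names are assumptions for illustration, not the patent's storage format.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ComputingFunctionConfigFile:
        name: str                                   # e.g. "fir_filter" or "conv2d"
        bitstream: bytes                            # configuration data for the programmable logic
        fixed_data: Dict[str, List[float]] = field(default_factory=dict)

    fir_config = ComputingFunctionConfigFile(
        name="fir_filter",
        bitstream=b"\x00" * 256,                                  # placeholder bitstream
        fixed_data={"coefficients": [0.1, 0.2, 0.4, 0.2, 0.1]},   # fixed computing data
    )
    conv_config = ComputingFunctionConfigFile(
        name="conv2d",
        bitstream=b"\x00" * 256,
        fixed_data={"kernel": [0.0, -1.0, 0.0, -1.0, 5.0, -1.0, 0.0, -1.0, 0.0]},  # 3x3 weights
    )
    print(fir_config.fixed_data["coefficients"], conv_config.name)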
  • the reconfigurable computing chip assembly 200 includes at least one layer of transiently reconfigurable computing chips 210 and at least one layer of transient reconfiguration chips 220. The reconfigurable computing chip assembly shown in FIG. 1 includes one layer of transiently reconfigurable computing chips 210 and one layer of transient reconfiguration chips 220; FIG. 1 is only an exemplary diagram and is not intended as a specific limitation of the present application.
  • the transient reconfiguration computing chip 210 includes multiple transient reconfiguration computing arrays 211, and the transient reconfiguration chip 220 includes multiple transient reconfiguration arrays 221.
  • the target computing function configuration file is a computing function configuration file stored in the dynamically reconfigurable storage array 311 and corresponding to the target computing function.
  • the host system can control the transient reconstruction array 221 to call the target computing function configuration file through the target command.
  • alternatively, the dynamically reconfigurable memory array 311 may actively send the target computing function configuration file to the transient reconfiguration array 221, which is not specifically limited in this application.
  • target computing functions may be recorded in the instruction sequence of the target instruction, and the target computing function and the target computing function configuration file are one-to-one or many-to-one.
  • after the transient reconfiguration array 221 obtains at least one target computing function configuration file, it can configure the target computing function according to that configuration file; after the configuration takes effect, the transiently reconfigurable computing array 211 possesses the corresponding target computing function.
  • the transient reconstruction computing array 211 can execute the target computing function configured by the transient reconstruction array 221 based on the target data.
  • the target data may be obtained by the instantaneous reconstruction computing array 211 from the data storage array 111 according to the target instruction.
  • the transient reconfiguration array 221 can obtain, in one pass from the dynamically reconfigurable memory array 311, the computing function configuration files (target computing function configuration files) corresponding to all the target computing functions recorded in the instruction sequence of the target instruction, load all of these configuration files into the transient reconfiguration array 221, and make the loaded configuration files take effect one by one according to the computing steps recorded in the instruction sequence; alternatively, it can obtain at one time only part of the target computing function configuration files recorded in the instruction sequence of the target instruction, and then use the computing time of the transiently reconfigurable computing array 211 to preload the configuration files corresponding to the target computing functions of subsequent computing steps into a preparation area, so that when the function of the transiently reconfigurable computing array 211 must switch to a subsequent computing step, the corresponding target computing function configuration file is simply made to take effect.
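  • The second strategy (preloading the next step's configuration while the current step computes) can be mimicked in a few lines; the helper names and timings below are hypothetical and only illustrate the overlap of loading and computing.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def load_config(step):
        """Simulate fetching one target computing function configuration file."""
        time.sleep(0.05)                 # pretend transfer from the dynamic reconfiguration array
        return f"config_for_{step}"

    def execute_step(step, config):
        """Simulate the transiently reconfigurable computing array running one step."""
        time.sleep(0.10)                 # pretend computation time
        print(f"{step} executed with {config}")

    steps = ["step1", "step2", "step3", "step4"]

    with ThreadPoolExecutor(max_workers=1) as loader:
        active_config = load_config(steps[0])             # only the first load is waited on
        for i, step in enumerate(steps):
            nxt = loader.submit(load_config, steps[i + 1]) if i + 1 < len(steps) else None
            execute_step(step, active_config)             # computation overlaps the preload
            if nxt is not None:
                active_config = nxt.result()              # usually already finished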
  • the transiently reconfigurable computing array 211 only needs to wait for the first function configuration of the transient reconfiguration array 221 to be completed; after that first function configuration is completed, the transiently reconfigurable computing array 211 can execute the configured corresponding target computing function.
  • the transient reconfiguration array 221 can obtain, at the first fetch, the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction; correspondingly, the transiently reconfigurable computing array 211 executes the target computing functions in the order recorded in the instruction sequence of the target instruction.
  • the transient reconfiguration computing array 211 needs to wait for the completion of the first function configuration of the transient reconfiguration array 221 .
  • alternatively, the transient reconfiguration array 221 can first obtain the target computing function configuration files corresponding to only some of the target computing functions recorded in the instruction sequence of the target instruction; then, while the transiently reconfigurable computing array 211 executes the already configured target computing functions, the transient reconfiguration array 221 simultaneously obtains the target computing function configuration files corresponding to the remaining target computing functions and completes their configuration. In either case, the transiently reconfigurable computing array 211 only needs to wait for the first function configuration of the transient reconfiguration array 221 to be completed.
  • the data storage chip 110 also includes a first connection structure 112
  • the transient reconfiguration computing chip 210 also includes a second connection structure 212
  • the transient reconfiguration chip 220 also includes a third connection structure 222
  • the dynamic reconfiguration memory chip 310 also includes a fourth connection structure 312 .
  • a first inter-chip connection structure 130 is provided between the first connection structure 112 and the second connection structure 212
  • a second inter-chip connection structure 230 is provided between the second connection structure 212 and the third connection structure 222
  • a third inter-chip connection structure 320 is provided between the third connection structure 222 and the fourth connection structure 312.
  • the data storage chip 110 and the transiently reconfigurable computing chip 210 are connected through the first connection structure 112, the second connection structure 212 and the first inter-chip connection structure 130; the transiently reconfigurable computing chip 210 and the transient reconfiguration chip 220 are connected through the second connection structure 212, the third connection structure 222 and the second inter-chip connection structure 230; and the transient reconfiguration chip 220 and the dynamically reconfigurable memory chip 310 are connected through the third connection structure 222, the fourth connection structure 312 and the third inter-chip connection structure 320.
  • the specific connection mode and connection process are not specifically limited in this application; the connection mode and connection relationship shown in FIG. 1 are only schematic and are not intended as a specific limitation of this application.
  • the dynamically reconfigurable memory array 311 in the dynamically reconfigurable memory chip assembly 300 is arranged to store the computing function configuration files; the transient reconfiguration array 221 in the reconfigurable computing chip assembly 200 is arranged to obtain the target computing function configuration files and configure the target computing functions; and the transiently reconfigurable computing array 211 is used to execute the target computing functions configured by the transient reconfiguration array 221.
  • the transient reconfiguration array 221 can obtain at least one target computing function configuration file at one time, and complete the configuration of the corresponding target computing function.
  • during the execution of one target instruction, only when the transient reconfiguration array 221 performs its first function configuration does the transiently reconfigurable computing array 211 need to wait; after that first function configuration is completed, the transiently reconfigurable computing array 211 can continuously execute the configured corresponding target computing functions without waiting again for the function configuration of the transient reconfiguration array 221.
  • the computing function performed by the transiently reconfigurable computing array 211 is determined by the target computing function configured by the transient reconfiguration array 221, and the computing function of the transiently reconfigurable computing array 211 can therefore be reconfigured.
  • when the computing function of each computing array is fixed, after one computing target is completed during the computing process, the next computing function moves to another computing array, and the intermediate result generated by the previous computing function is accessed by the computing array corresponding to the next computing function; therefore, an internal global storage access network connection must be established between all computing arrays and all data storage arrays. During the calculation process of one target instruction, after a computing array completes one stage of calculation through local internal storage accesses in its data storage array, the next computing array executes the next stage of calculation through local internal storage accesses in the next data storage array, and the whole calculation process is completed step by step, with the calculation result of the previous computing array used as the input data of the next computing array. As the calculation switches from one computing array to another, the computing data must also be transferred; as the amount of data transferred increases, global internal storage accesses incur a huge overhead, which reduces computing efficiency and seriously affects the economy and practicability of the computing device.
  • in addition, the internal global storage access bus that must be designed for such data movement not only occupies a large area of the computing array chip but also disrupts the design layout of the computing array chip, resulting in a decrease in system performance.
  • the computing device provided in the embodiments of the present application, by providing the transiently reconfigurable computing array 211 and the transient reconfiguration array 221, makes the computing function executed by the transiently reconfigurable computing array 211 reconfigurable. All or part of the target computing functions corresponding to one target instruction can be completed in the same transiently reconfigurable computing array 211, so no internal global storage access network connection needs to be established between the transiently reconfigurable computing arrays 211 and the data storage arrays 111; instead, a one-to-one or many-to-one connection between the transiently reconfigurable computing arrays 211 and the data storage arrays 111 provides high-bandwidth local interconnection. This greatly reduces or even avoids the large number of internal global storage accesses during the calculation process under one target instruction, significantly reduces the frequent switching of transiently reconfigurable computing arrays and the massive movement of data, greatly improves the computing efficiency of the computing device, and reduces computing power consumption.
  • when the transiently reconfigurable computing array 211 executes the target computing functions recorded in the instruction sequence of the target instruction, it only needs to wait for the first function configuration of the transient reconfiguration array 221 to be completed; between two adjacent target computing functions there is no need to wait for the transient reconfiguration array to complete a new computing function configuration, which saves configuration-waiting time, improves the execution efficiency of the target computing functions of the target instruction, and thereby further improves the computing efficiency of the computing device and further reduces computing power consumption.
  • the transient reconfiguration array 221 is used to obtain all target computing function configuration files corresponding to all target computing functions recorded in the instruction sequence of the target instruction, and complete the function configuration.
  • the transient reconfiguration array 221 can obtain, at one time, all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction and complete the loading of all of these configuration files; in this way the transient reconfiguration array 221 configures the target computing functions for the transiently reconfigurable computing array 211. Configuring the target computing functions can be understood as making the loaded configuration files take effect on the transiently reconfigurable computing array 211 one by one according to the computing steps, so that the transiently reconfigurable computing array 211 executes the corresponding target computing functions one by one.
  • the transiently reconfigurable computing array 211 only needs to wait for the first function configuration of the transient reconfiguration array 221 to be completed and does not need to wait for any later function configuration of the transient reconfiguration array 221, which further saves the time spent executing the target computing functions of the target instruction, improves their execution efficiency, and thereby further improves the computing efficiency of the computing device.
  • the data storage array 111 that stores the target data is also used to store the result data, where the result data is obtained by the transiently reconfigurable computing array 211 executing the target computing functions on the target data and includes intermediate result data and final result data. The data on which the transiently reconfigurable computing array 211 executes the current target computing function may be the intermediate result data obtained by executing the previous target computing function, and the final result data is obtained by executing the last target computing function. Because the target data, the intermediate result data and the final result data are all stored in the same data storage array 111, the large amount of data transfer that would otherwise arise from storing the input and output data of different target computing functions in different data storage arrays 111 is avoided; this avoids a large number of internal global storage accesses during the calculation process under one target instruction and can further improve the computing efficiency of the computing device.
  • FIG. 2 is a schematic diagram of a logic structure of a computing device provided in an embodiment of the present application. Exemplarily, as shown in FIG. 2 , there is a one-to-one correspondence between the data storage array 111 and the instantaneous reconstruction calculation array 211; and/or,
  • the instantaneous reconstruction calculation array 211 is in one-to-one correspondence with the instantaneous reconstruction array 221 .
  • FIG. 2 shows m data storage arrays 111, namely data storage array 1, data storage array 2, data storage array 3 ... data storage array m; m transiently reconfigurable computing arrays 211, namely transiently reconfigurable computing array 1, transiently reconfigurable computing array 2, transiently reconfigurable computing array 3 ... transiently reconfigurable computing array m; m transient reconfiguration arrays 221, namely transient reconfiguration array 1, transient reconfiguration array 2, transient reconfiguration array 3 ... transient reconfiguration array m; and m dynamically reconfigurable memory arrays 311, namely dynamically reconfigurable memory array 1, dynamically reconfigurable memory array 2, dynamically reconfigurable memory array 3 ... dynamically reconfigurable memory array m, where m is a natural number greater than zero.
  • the data storage arrays 111, transiently reconfigurable computing arrays 211, transient reconfiguration arrays 221 and dynamically reconfigurable memory arrays 311 shown in FIG. 2 may also adopt other correspondences. All target computing function configuration files can be obtained from the same dynamically reconfigurable memory array 311, which avoids frequent internal global storage accesses when the target computing function configuration files are called and improves the efficiency of file invocation; some or all of the data storage arrays 111 can correspond to multiple transiently reconfigurable computing arrays 211, which facilitates storage access between transiently reconfigurable computing arrays 211; and multiple transiently reconfigurable computing arrays 211 can correspond to one transient reconfiguration array 221, which improves the utilization efficiency of the data loading logic units in the transient reconfiguration array 221.
  • when the data storage arrays 111 are in one-to-one correspondence with the transiently reconfigurable computing arrays 211, and/or the transiently reconfigurable computing arrays 211 are in one-to-one correspondence with the transient reconfiguration arrays 221, it is possible to avoid establishing global storage access connections between all data storage arrays 111 and all transiently reconfigurable computing arrays 211; only one-to-one connections between corresponding data storage arrays 111 and transiently reconfigurable computing arrays 211 need to be established.
  • likewise, it is possible to avoid establishing global storage access connections between all transiently reconfigurable computing arrays 211 and all transient reconfiguration arrays 221; only one-to-one connections between corresponding transiently reconfigurable computing arrays 211 and transient reconfiguration arrays 221 need to be established.
  • the data corresponding to all target computing functions can be stored in the same data storage array 111, which can avoid frequent internal global storage accesses when calling data, and can improve the efficiency of data retrieval and data storage.
  • the one-to-one correspondence between the transient reconfiguration computing array 211 and the transient reconfiguration array 221 can avoid internal global storage access when executing the target computing function, further increase the speed of executing the target computing function, and improve the computing efficiency of computing devices.
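  • As a rough, purely illustrative comparison (not part of the disclosed device), replacing an all-to-all internal storage access network between m data storage arrays and m computing arrays with the one-to-one pairing described above cuts the number of required connections from m*m to m:

    def global_links(m: int) -> int:
        return m * m    # every computing array wired to every data storage array

    def one_to_one_links(m: int) -> int:
        return m        # computing array i wired only to data storage array i

    for m in (4, 16, 64):
        print(m, global_links(m), one_to_one_links(m))
    # 4 16 4 / 16 256 16 / 64 4096 64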
  • the transiently reconfigurable computing array 211 that executes all target computing functions recorded in the instruction sequence of the target instruction is the same transiently reconfigurable computing array 211 .
  • the execution of all target computing functions in one target instruction can be regarded as one computing cycle.
  • Figure 2 shows m computing cycles, which are respectively computing cycle 1, computing cycle 2, computing cycle 3... computing cycle m.
  • when the transiently reconfigurable computing array 211 that executes all the target computing functions recorded in the instruction sequence of a target instruction is the same transiently reconfigurable computing array 211, all target computing functions of one computing cycle are completed in the same transiently reconfigurable computing array 211; this avoids frequent internal global accesses to different transiently reconfigurable computing arrays 211 within a computing cycle and improves the computing efficiency of the computing cycle.
  • FIG. 3 is a schematic diagram of a logic structure of another computing device provided in the embodiment of the present application.
  • the dynamically reconfigurable storage array 311 includes at least one reconfigurable storage unit, and the reconfigurable storage unit is used to store the computing function configuration file.
  • Multiple reconfigurable storage units in each dynamically reconfigurable storage array 311 may be represented as step1, step2, step3 to stepk respectively, and k may be a natural number greater than 0.
  • each reconfigurable storage unit can be regarded as the original storage space of a computing function configuration file; the more reconfigurable storage units there are, the greater the storage density and the more functions that can be stored.
  • All reconfigurable storage units in a dynamically reconfigurable storage array can store all target computing function configuration files required for a computing cycle.
  • the transient reconfiguration array 221 may correspond to only one dynamically reconfigurable memory array 311, and that dynamically reconfigurable memory array 311 may have a relatively large granularity. This avoids the frequent internal global storage accesses that calling the target computing function configuration files would otherwise cause, removes the need to establish an internal global storage access connection between the dynamically reconfigurable memory arrays 311 and the transient reconfiguration arrays 221, and improves the efficiency of file invocation.
  • FIG. 4 is a schematic diagram of a logical structure of another computing device provided by an embodiment of the present application.
  • the transient reconfiguration array 221 may include at least two transient configuration storage modules, each of which may include a multiplexer 221a and at least two configuration storage modules 221b; the configuration storage modules 221b are used to obtain all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction, and through the switching of the multiplexer 221a, the corresponding target computing function configuration file is made to take effect on the transiently reconfigurable computing array 211.
  • the multiplexer 221a is used to select and connect, based on the order recorded in the instruction sequence of the target instruction, the configuration storage module 221b configured with the corresponding target computing function, so that the transiently reconfigurable computing array 211 executes the target computing function configured in that configuration storage module 221b.
  • the configuration storage module 221b can be realized by any memory unit that supports random read, such as SRAM or NOR Flash, which is not specifically limited in this application.
  • the transiently reconfigurable computing chip 210 includes a plurality of transiently reconfigurable computing arrays 211, and each transiently reconfigurable computing array 211 includes a plurality of programmable logic blocks (LAB/CLB); the programmable logic blocks shown in FIG. 4 can be represented as LAB/CLB_00, LAB/CLB_01..., LAB/CLB_10, LAB/CLB_11..., LAB/CLB_20, LAB/CLB_21..., LAB/CLB_30, LAB/CLB_31...;
  • the transient reconfiguration chip 220 includes a plurality of transient reconfiguration arrays 221, and the transient reconfiguration arrays 221 correspond to the transiently reconfigurable computing arrays 211, as shown in FIG. 4.
  • each transient reconfiguration array 221 includes a plurality of transient configuration storage modules; each transient configuration storage module includes configuration storage modules 221b, and the configuration storage modules 221b correspond to the programmable logic blocks.
  • each transient configuration storage module may include a multiplexer 221a and i configuration storage modules 221b, where i is a natural number that can represent the maximum number of calculation steps for which the transiently reconfigurable computing chip 210 is designed; in FIG. 4, the configuration storage modules 221b connected to each multiplexer 221a are expressed as CRAM_STP1, CRAM_STP2, CRAM_STP3...CRAM_STPi.
  • All the target computing functions recorded in the instruction sequence of one target instruction can be configured in one transient reconfiguration array 221, and all the target computing functions are decomposed onto the programmable logic blocks of one transiently reconfigurable computing array 211, with each programmable logic block executing a part of every target computing function. Each configuration storage module 221b configures the programmable logic block for one target computing function, and CRAM_STP1, CRAM_STP2, CRAM_STP3...CRAM_STPi respectively correspond to the programmable logic block's configuration states for the i computing steps. The multiplexer 221a selects, according to the execution order recorded in the instruction sequence of the target instruction, which configuration memory among the configuration storage modules 221b is switched to the corresponding programmable logic block; for example, LAB/CLB_00 is first interconnected with CRAM_STP1, and after completing the computing function of the corresponding step, it is switched to interconnect with CRAM_STP2, and so on, until the computing function of the last step is completed.
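  • A compact sketch of the per-step switching just described: the multiplexer points the programmable logic block at the next pre-loaded configuration memory (CRAM_STP1 ... CRAM_STPi) after each step. The Python callables below stand in for real configuration bitstreams; all names are illustrative.

    class ProgrammableLogicBlock:
        def __init__(self, cram_steps):
            self.cram_steps = cram_steps   # CRAM_STP1 ... CRAM_STPi, all pre-loaded at once
            self.selected = 0              # multiplexer 221a selection

        def switch_to_next_step(self):
            self.selected += 1             # e.g. move from CRAM_STP1 to CRAM_STP2

        def execute(self, value):
            return self.cram_steps[self.selected](value)

    # i = 3 calculation steps for this toy LAB/CLB_00
    lab_clb_00 = ProgrammableLogicBlock([
        lambda x: x + 1,    # CRAM_STP1: step-1 behaviour
        lambda x: x * 2,    # CRAM_STP2: step-2 behaviour
        lambda x: x - 3,    # CRAM_STP3: step-3 behaviour
    ])

    result = 5
    for _ in range(len(lab_clb_00.cram_steps)):
        result = lab_clb_00.execute(result)
        if lab_clb_00.selected < len(lab_clb_00.cram_steps) - 1:
            lab_clb_00.switch_to_next_step()
    print(result)   # ((5 + 1) * 2) - 3 = 9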
  • in this way, each programmable logic block, configured step by step from its configuration storage modules 221b, can complete one calculation cycle corresponding to one target instruction.
  • FIG. 4 is only schematic and is not a specific limitation of the present application.
  • the dynamically reconfigurable memory chip 310 may be connected to the transient reconfiguration chip 220 through a bus, and the dynamically reconfigurable memory chip 310 may transmit all target computing function configuration files to the transient reconfiguration chip 220 at one time, to be stored in the multiple transient configuration storage modules.
  • one multiplexer 221a is connected to multiple configuration storage modules 221b, and each configuration storage module 221b is configured with a part of a target computing function; each transiently reconfigurable computing array 211 may include a plurality of programmable logic blocks, each of which performs a part of a target computing function and corresponds to one multiplexer 221a. According to the execution order recorded in the instruction sequence of the target instruction, the programmable logic block executes the configuration held in the configuration memory of the configuration storage module 221b selected and connected by the multiplexer 221a, and the selected configuration memory is configured with the corresponding part of the target computing function.
  • the transient reconfiguration chip 220 further includes a transient reconfiguration control logic module 223, which is used to obtain, from the dynamically reconfigurable memory array 311 and according to the instruction sequence of the target instruction, the target computing function configuration files corresponding to the target computing functions and to load them into the respective configuration memories.
  • one multiplexer 221a is correspondingly connected to multiple configuration storage modules 221b.
  • the transient reconfiguration control logic module 223 can automatically retrieve the target computing function configuration files according to the target instruction, so that the transient reconfiguration array 221 retrieves, at one time, all the target computing function configuration files corresponding to all the target computing functions of a computing cycle, loads them into the transient reconfiguration array 221, and makes the loaded configuration files take effect on the transiently reconfigurable computing array 211 one by one according to the calculation steps. This avoids frequent switching between computing arrays realizing different computing functions and frequent transfer of computing data, improves the computing efficiency of the computing device, and further enhances the economy and practicability of the computing device.
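  • Reduced to a toy model, the transient reconfiguration control logic module 223 walks the instruction sequence once, fetches each referenced configuration file from the dynamically reconfigurable memory array, and parks it in a configuration memory slot; the function and variable names below are hypothetical.

    def load_computing_cycle(instruction_sequence, dynamic_reconfig_store):
        """Fetch every referenced configuration file once and park it in a CRAM slot."""
        cram_slots = {}                                          # CRAM_STP1 ... CRAM_STPi
        for step, function_name in enumerate(instruction_sequence, start=1):
            config_file = dynamic_reconfig_store[function_name]  # one-time retrieval from 311
            cram_slots[f"CRAM_STP{step}"] = config_file
        return cram_slots

    store_311 = {"conv2d": b"<bitstream A>", "relu": b"<bitstream B>", "pool": b"<bitstream C>"}
    print(load_computing_cycle(["conv2d", "relu", "pool"], store_311))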
  • the transient reconfiguration computing chip and the transient reconfiguration chip are arranged on the same chip layer.
  • that is, at least one transiently reconfigurable computing array and at least one transient reconfiguration array are arranged on the same chip. Integrating the two functions into one chip layer reduces the number of chip layers of the computing device, simplifies the manufacturing process, and saves cost.
  • FIG. 5 is a schematic diagram of a logic structure of another computing device provided in the embodiment of the present application.
  • the transient reconstruction computing chip 210 and the transient reconstruction chip 220 are arranged on the same chip layer, that is, multiple transient reconstruction computing arrays 211 and multiple transient reconstruction arrays 221 are arranged on the same chip layer.
  • each transient reconfiguration array 221 includes a multiplexer MUX, a first configuration memory CRAMA, and a second configuration memory CRAMB.
  • the multiplexer MUX is used to select and connect, based on the order recorded in the instruction sequence of the target instruction, the first configuration memory CRAMA configured with the current target computing function, so that the transiently reconfigurable computing array executes the current target computing function configured in the first configuration memory CRAMA;
  • the second configuration memory CRAMB is used to obtain, from the dynamically reconfigurable memory array 311 and while the transiently reconfigurable computing array executes the current target computing function configured in the first configuration memory CRAMA, the configuration file of the next target computing function recorded in the instruction sequence of the target instruction, and to complete that function configuration.
  • the transiently reconfigurable computing array 211 may include a plurality of programmable logic blocks, such as the programmable logic blocks shown in FIG. 5.
  • each programmable logic block can correspond to one multiplexer MUX, one first configuration memory CRAMA and one second configuration memory CRAMB, and the programmable logic block executes the target computing function configured in whichever of the first configuration memory CRAMA or the second configuration memory CRAMB the multiplexer MUX selects and connects.
  • the instantaneous reconstruction calculation array 211 can be represented as LAB/CLB_00, LAB/CLB_01..., LAB/CLB_10, LAB/CLB_11..., and the instantaneous reconstruction calculation array 211 corresponds to the instantaneous reconstruction array 221 one by one, which is not specifically limited in this application .
  • the multiple dynamically reconfigurable memory arrays in the dynamically reconfigurable memory chip 310 can be expressed as PRF1STP1, PRF1STP2, ..., PRF1STPx, each holding all configuration files of one calculation step and containing multiple configuration sub-files, where 0 < x ≤ i and x is a natural number; the configuration sub-files in PRF1STP1, PRF1STP2... that correspond to LAB/CLB_00, LAB/CLB_01..., LAB/CLB_10, LAB/CLB_11... need to be loaded into the programmable logic blocks LAB/CLB_00, LAB/CLB_01..., LAB/CLB_10, LAB/CLB_11... for the corresponding calculation steps, and the configuration sub-files are made to take effect through the multiplexers.
  • LAB/CLB_00, LAB/CLB_01..., LAB/CLB_10, LAB/CLB_11... together represent one transiently reconfigurable computing array 211, and all the target computing functions recorded in the instruction sequence of one target instruction can be completed in this one transiently reconfigurable computing array 211.
  • suppose four target computing functions are recorded in the instruction sequence of a target instruction, namely the first target computing function, the second target computing function, the third target computing function and the fourth target computing function, to be executed in that order.
  • first, all the multiplexers MUX, according to the execution order recorded in the instruction sequence of the target instruction, select and connect all the first configuration memories CRAMA configured with the first target computing function, and the transiently reconfigurable computing array 211 possesses and executes the first target computing function configured in all the first configuration memories CRAMA.
  • next, the multiplexers MUX select and connect, according to the execution order recorded in the instruction sequence of the target instruction, all the second configuration memories CRAMB configured with the second target computing function, and the transiently reconfigurable computing array 211 possesses and executes the second target computing function configured in all the second configuration memories CRAMB; while the transiently reconfigurable computing array 211 executes the second target computing function, all the first configuration memories CRAMA are released, begin loading the configuration file of the third target computing function, and complete that function configuration.
  • then, the multiplexers MUX select and connect, according to the execution order recorded in the instruction sequence of the target instruction, all the first configuration memories CRAMA now configured with the third target computing function, and the transiently reconfigurable computing array 211 executes the third target computing function configured in all the first configuration memories CRAMA; at the same time, all the second configuration memories CRAMB are released, begin loading the configuration file of the fourth target computing function, and complete that function configuration.
  • finally, the multiplexers MUX select and connect, according to the execution order recorded in the instruction sequence of the target instruction, all the second configuration memories CRAMB now configured with the fourth target computing function, and the transiently reconfigurable computing array 211 executes the fourth target computing function prepared in all the second configuration memories CRAMB; after the fourth target computing function has been executed, all the target computing functions recorded in the instruction sequence of the target instruction have been executed.
  • in this example each multiplexer corresponds to two configuration memories; in other implementations each multiplexer may correspond to more configuration memories. While the configuration memory selected and connected by the multiplexer drives the calculation, the remaining configuration memories are synchronously loaded with the target computing function configuration files of subsequent calculation steps. This greatly reduces the risk of a waiting delay caused by a calculation step being so short that the configuration memory for the next calculation step has not yet finished loading its configuration file.
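  • The CRAMA/CRAMB alternation described above is a ping-pong (double-buffer) pattern. The sketch below mimics it for a four-function example: one buffer drives the computation while the idle buffer is reloaded for the step after next. It is illustrative only, not the patent's implementation.

    functions = {                      # target computing functions, in execution order
        "f1": lambda x: x + 10,
        "f2": lambda x: x * 3,
        "f3": lambda x: x - 4,
        "f4": lambda x: x // 2,
    }
    order = ["f1", "f2", "f3", "f4"]

    buffers = [functions[order[0]],    # CRAMA holds function 1 before execution starts
               functions[order[1]]]    # CRAMB holds function 2

    value = 7
    for step, name in enumerate(order):
        active = buffers[step % 2]     # the MUX selects CRAMA on even steps, CRAMB on odd
        value = active(value)          # the computing array executes the configured function
        reload_step = step + 2         # the idle buffer is reloaded for the step after next
        if reload_step < len(order):
            buffers[step % 2] = functions[order[reload_step]]
    print(value)                       # ((((7 + 10) * 3) - 4) // 2) = 23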
  • the data in the configuration memory determines the function of the programmable logic block; that is, by configuring the configuration memory data, the function configuration of the transiently reconfigurable computing array 211 is realized.
  • the lookup table LUT is one of the reconfigurable basic structures of FPGA/eFPGA. Multiple LUTs form a programmable logic block.
  • the 4-input lookup table (4-LUT) in FIG. 6 is a typical reconfigurable basic structure of the programmable logic blocks that constitute a LAB/CLB; the 4-LUT has four logic inputs A, B, C and D and one logic output Y, and each ladder-shaped structure in FIG. 6 is a two-to-one multiplexer MUX, which is not specifically limited in this application.
  • 4-LUT is a lookup table for 4 input channels
  • 3-LUT is a lookup table for 3 input channels
  • the four logic inputs of the 4-LUT serve as the selection terminals of the multiplexers: when the selection terminal of a multiplexer is 1, the data on its 1-side input is passed to the output, and when the selection terminal is 0, the data on its 0-side input is passed to the output. The relationship between the logic output Y of the 4-LUT and its four logic inputs A, B, C and D is therefore determined by the data in the configuration memory; for example, when the data in the configuration memory, read from top to bottom, is hexadecimal 0x8009 (binary 1000 0000 0000 1001), Y is 1 only for the three input combinations of A, B, C and D whose truth-table positions hold a 1.
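  • A small simulation of such a 4-LUT is given below, assuming the common convention that the 16 configuration bits form the truth table and the input combination of A, B, C and D selects one bit; the exact bit-ordering of FIG. 6 is not reproduced above, so the ordering used here is an assumption.

    def lut4(config: int, a: int, b: int, c: int, d: int) -> int:
        """Return the configured truth-table bit selected by inputs A, B, C, D."""
        index = (a << 3) | (b << 2) | (c << 1) | d   # assumed bit-ordering convention
        return (config >> index) & 1

    CONFIG = 0x8009   # binary 1000 0000 0000 1001, as in the example above

    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                for d in (0, 1):
                    if lut4(CONFIG, a, b, c, d):
                        print(f"A={a} B={b} C={c} D={d} -> Y=1")
    # With this convention, Y=1 for exactly the three input combinations whose
    # truth-table positions hold a 1 (indices 0, 3 and 15).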
  • the transiently reconfigurable computing array 211 may also include a processing module, which is scheduled by the programmable logic blocks in the transiently reconfigurable computing array 211 and configures the corresponding target computing function according to the obtained target computing function configuration file; the processing module is also used to perform the configured target computing function based on the target data.
  • the processing module may include a calculation unit and a static random-access storage module, and the calculation unit may include, but is not limited to, a multiply-accumulate unit, a multiplication unit, a pulse processor, a hash calculation unit, a machine learning unit, etc., which are not specifically limited in this application.
  • the transient reconfiguration computing array may also include other hard core IPs, which can be understood as existing effective computing units (hardware devices), which are not specifically limited in this application.
  • the processing module and/or hard core IP can also be embedded in the internal structure (fabric) of FPGA (Field Programmable Gate Array) or eFPGA (Embedded Field Programmable Gate Array), and its programmability can be used to realize reconfigurability function, which is not specifically limited in this application.
  • the use of FPGA or eFPGA can adaptively increase the effective computing density, that is, increase the density of computing devices, thereby increasing the types and quantities of computing functions.
  • the configuration storage module can be used to configure the target computing function according to the obtained target computing function configuration file, and the calculation unit can perform calculation of the corresponding function according to the target computing function configured by the configuration storage module to obtain result data.
  • at least two configuration storage modules are used to alternately configure the target computing functions, and a multiplexer selects and connects the configuration storage module configured with the current target computing function recorded in the instruction sequence of the target instruction.
  • the instantaneously reconfigurable computing array executes the target computing function configured in the configuration storage module selected by the multiplexer, while the configuration storage module that is not selected can simultaneously load the configuration of the next target computing function. There is no need to wait for the computing function configuration of the instantaneously reconfigurable array between two adjacent target computing functions.
  • the execution of two adjacent target computing functions is therefore continuous, which further saves time, improves the execution efficiency of the target computing functions of the target instruction, and thereby further improves the computing efficiency of the computing device.
  • the instantaneously reconfigurable computing array 211 can continuously execute the configured target computing functions without waiting for the functional configuration of the instantaneously reconfigurable array 221, which can be realized through two technical routes. In the first route, the instantaneously reconfigurable array 221 loads all subsequent target computing function configuration files at one time into its multiple configuration storage modules; by switching the multiplexer, the configuration storage module holding the currently required configuration file configures the instantaneously reconfigurable computing array 211, and after the array 211 completes the current target computing function, the multiplexer is switched again so that the computing function of the array 211 changes to that of the configuration storage module holding the configuration file of the next computing step. In the second route, besides loading and applying the current target computing function configuration file, the instantaneously reconfigurable array 221 preloads at least the configuration file of the next computing step into a spare configuration storage module while the array 211 is completing the current target computing function; once the current function is complete, the multiplexer is switched to the spare module so that the preloaded configuration file takes effect. After the switch, the configuration storage module of the previous computing step is released and serves as the new spare configuration storage module, into which the configuration files of subsequent computing steps are preloaded during computation. A minimal sketch of this scheme is given below.
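  • Below is a minimal sketch, under stated assumptions, of the alternating (ping-pong) configuration scheme described above: while the selected configuration memory drives the current computing step, the spare one preloads the next step's configuration file, and a multiplexer switch makes the preloaded configuration take effect without waiting. Class and function names are illustrative and do not appear in the publication.

```python
# Illustrative model of two configuration memories (CRAMA/CRAMB) behind one
# multiplexer: the selected memory configures the computing array while the
# other preloads the next computing step. All names are illustrative.

class InstantReconfigUnit:
    def __init__(self):
        self.config_mems = [None, None]   # stand-ins for CRAMA and CRAMB
        self.selected = 0                 # multiplexer position

    def load(self, slot: int, config_file: str) -> None:
        """Load a target computing function configuration file into one slot."""
        self.config_mems[slot] = config_file

    def switch(self) -> None:
        """Flip the multiplexer so the other configuration memory takes effect."""
        self.selected ^= 1

    def execute(self, data):
        active = self.config_mems[self.selected]
        return f"result of {active} on {data}"

def run(unit: InstantReconfigUnit, config_files, data):
    unit.load(0, config_files[0])          # only the first configuration is waited for
    results = []
    for step in range(len(config_files)):
        spare = unit.selected ^ 1
        if step + 1 < len(config_files):
            unit.load(spare, config_files[step + 1])   # preload next step into spare memory
        data = unit.execute(data)                      # execute current target computing function
        results.append(data)
        if step + 1 < len(config_files):
            unit.switch()                              # instantaneous switch, no waiting
    return results

print(run(InstantReconfigUnit(), ["step1.cfg", "step2.cfg", "step3.cfg"], "target_data"))
```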
  • two adjacent layers of chips are stacked and connected through a heterogeneous integrated connection component, which is used to connect chips produced by different manufacturing processes. Because the data storage array chip 110, the instantaneously reconfigurable computing chip 210, the instantaneously reconfigurable chip 220 and the dynamically reconfigurable memory chip 310 have different functions, their preparation processes may differ to a greater or lesser extent; they are heterogeneous chips. Integrating heterogeneous chips into a computing device requires establishing dense connections between them, and such dense connections can use heterogeneous integrated connection components.
  • as shown in FIG. 1, the heterogeneous integrated connection assembly may include the first connection structure 112, the second connection structure 212 and the first inter-chip connection structure 130; or the second connection structure 212, the third connection structure 222 and the second inter-chip connection structure 230; or the second connection structure 212 and the second inter-chip connection structure 230; or the third connection structure 222, the fourth connection structure 312 and the third inter-chip connection structure 320.
  • the first inter-chip connection structure 130 and the second inter-chip connection structure 230 can be made of the same material; the first connection structure 112, the second connection structure 212, the third connection structure 222 and the fourth connection structure 312 can be made of the same or different materials; the first inter-chip connection structure 130 and the first connection structure 112 may also be made of the same material, none of which is specifically limited in this application.
  • the computing device provided by the embodiment of the present application can, through the heterogeneous integrated connection components, integrate chips prepared by different manufacturing processes into one computing device without establishing an internal global connection network inside the device; during computation based on a target instruction there is no need to perform internal global storage access, which improves the computing efficiency of the computing device.
  • two adjacent layers of chips are connected by metal bonding.
  • the heterogeneous integrated connection components may use the same or different metal materials, such as copper and aluminum.
  • taking the interconnection between the first connection structure 112 and the second connection structure 212 as an example: the first connection structure 112, like the entire data storage array chip assembly 100, uses an aluminum interconnect process; a three-dimensional heterogeneous bonding structure is built below the first connection structure 112 in a back-end process, its outer layer is a copper connection, and it connects to the aluminum contacts for cross-chip interconnection inside the first connection structure 112. The second connection structure 212, like the entire instantaneously reconfigurable computing chip 210, uses a copper interconnect process; a three-dimensional heterogeneous bonding structure is built above the second connection structure 212 in a back-end process, its outer layer is likewise a copper connection, and it connects to the copper contacts for cross-chip interconnection inside the second connection structure 212. The surfaces of the two three-dimensional heterogeneous bonding structures are brought into contact, and hybrid bonding forms the bond between the corresponding interconnection points of the first connection structure 112 and the second connection structure 212, that is, the first inter-chip connection structure 130, which is not specifically limited in this application.
  • the computing device uses metal bonding to connect two adjacent layers of chips. The physical and electrical parameters of the interconnection follow the characteristics of the semiconductor manufacturing process, that is, they are close to those of on-chip interconnects, and cross-chip metal-layer interconnections can be established directly without passing through the input/output circuits of the prior art. This is well suited to establishing high-density interconnections between the chips described in this application: interconnection density and speed are greatly improved, that is, bandwidth is increased, and power consumption is significantly reduced.
  • according to the specific data storage requirements and storage scale, multiple layers of data storage array chips 110 can be provided in the data storage array chip assembly 100; according to the storage requirements or storage scale of the target computing function configuration files, multiple layers of dynamically reconfigurable storage array chips 310 can be provided in the dynamically reconfigurable storage array chip assembly 300; and according to the required amount of computation, multiple layers of instantaneously reconfigurable computing chips 210 and multiple layers of instantaneously reconfigurable chips 220 can be provided in the reconfigurable computing chip assembly 200, which is not specifically limited in this application.
  • for example, a separate layer of instantaneously reconfigurable computing chip 210 composed of hard-core IP may be provided, which is not specifically limited in this application.
  • the computing device provided by the embodiment of the present application can obtain a multi-layer chip structure by setting multi-layer chips to form a chip component, and can obtain corresponding computing devices according to specific functions and scale requirements, so as to realize the desired effect to the maximum extent.
  • the reconfigurable computing chip assembly is disposed between the data storage chip assembly and the dynamically reconfigurable memory chip assembly; and/or,
  • the data storage chip assembly is arranged between the reconfigurable computing chip assembly and the dynamically reconfigurable storage chip assembly; and/or,
  • the dynamic reconfigurable memory chip component is arranged between the reconfigurable computing chip component and the data storage chip component.
  • the transient reconstruction computing chip is arranged between the data storage chip and the transient reconstruction chip; and/or,
  • the instantaneous reconfiguration chip is arranged between the instantaneous reconfiguration computing chip and the dynamic reconfiguration memory chip;
  • the data storage chip is arranged between the instantaneous reconfiguration computing chip and the dynamic reconfiguration storage chip; and/or,
  • the dynamic reconfiguration memory chip is arranged between the instantaneous reconfiguration computing chip and the data storage chip.
  • the present application does not specifically limit the stacking position of each chip.
  • different chip setting positions can be flexibly set according to specific functional requirements, which can also enable the computing device to have more computing functions and a larger computing scale, and can broaden the application scenarios of the computing device .
  • any two or more of the data storage chip, the transient reconfiguration computing chip, the transient reconfiguration chip and the dynamic reconfiguration storage chip are arranged on the same chip layer.
  • the corresponding two or more chips can be integrated into one layer of chips.
  • the data storage chip and the dynamically reconfigurable memory chip are arranged on the same chip layer, that is, at least one data storage array and at least one dynamically reconfigurable storage array are integrated on one layer of chips; specifically, the data storage arrays and the dynamically reconfigurable storage arrays may be arranged alternately and then connected to form a one-layer chip structure that has both the dynamically reconfigurable storage function and the data storage function.
  • the arrays integrated on one layer of chips need to use compatible manufacturing processes to achieve integration on the same layer. Compatible manufacturing processes can be similar or identical manufacturing processes, which are not specifically limited in this application.
  • the computing device realizes the integration of chip functions by merging different chips into one layer of chips, which can reduce the manufacturing process of the computing device, and the reduction of the process flow will also bring about a reduction in the defect rate, thereby Can achieve the effect of reducing production cost.
  • integrating different chips on one layer can increase the interconnection density between different functional arrays, and enhance the computing function and storage function of computing devices.
  • the data storage array chip includes at least one of a data storage array die or a data storage array wafer; and/or,
  • the dynamically reconfigurable memory chip includes at least one of a dynamically reconfigurable memory array die or a dynamically reconfigurable memory array wafer; and/or,
  • the transiently reconfigurable computing chip includes at least one of a transiently reconfigurable computing die or a transiently reconfigurable computing wafer; and/or,
  • the instantaneously reconfigurable chip includes at least one of an instantaneously reconfigurable die or an instantaneously reconfigurable wafer.
  • the chips mentioned in the embodiments of the present application may be products in the form of wafers or grains.
  • the chip may be at least one of a die (die or chip) and a wafer (wafer), but is not limited thereto, and may be any replacement that a person skilled in the art can think of.
  • the wafer refers to the silicon wafer used to make the silicon semiconductor circuit
  • the chip or the grain refers to the silicon wafer after the above-mentioned wafer made of the semiconductor circuit is divided.
  • in the specific embodiments of the present application, the chip is used as an example for description.
  • FIG. 7 is a schematic structural diagram of the computing system provided in an embodiment of the present application.
  • the computing system provided by the embodiment of the present application includes the computing device 1000 described in the first aspect and the host system 2000; the computing device 1000 includes an external storage access interface 400, and the host system 2000 is connected to the external storage access interface 400 and issues target instructions and target data to the computing device 1000 through the external storage access interface 400.
  • the configuration files in the dynamically reconfigured storage array can also be loaded by the host system 2000 through the external storage access interface 400 .
  • in the computing system, the use of the instantaneously reconfigurable computing array 211 and the instantaneously reconfigurable array 221 makes the computing functions executed by the instantaneously reconfigurable computing array 211 reconfigurable. All or part of the target computing functions corresponding to one target instruction can be completed in the same instantaneously reconfigurable computing array 211, largely without relying on an internal global storage access network connection between the instantaneously reconfigurable computing array 211 and the data storage array 111; a one-to-one or many-to-one connection between the instantaneously reconfigurable computing array 211 and the data storage array 111 can be established instead. This avoids a large number of internal global storage accesses during the computation under one target instruction, avoids frequent switching of the instantaneously reconfigurable computing array and large data transfers, and can greatly improve the computing efficiency of the computing device while reducing computing power consumption.
  • in addition, while the instantaneously reconfigurable computing array 211 executes the target computing functions recorded in the instruction sequence of the target instruction, it only needs to wait for the first functional configuration of the instantaneously reconfigurable array 221 to be completed; between two adjacent target computing functions there is no need to wait for the computing function configuration of the instantaneously reconfigurable array. This further saves time, improves the execution efficiency of the target computing functions of the target instruction, and thereby further improves the computing efficiency of the computing device and further reduces computing power consumption.
  • the computing device provided by this application can be a three-dimensional chip. Adjacent chips in the three-dimensional chip are interconnected through three-dimensional heterogeneous integration, building high-density metal-layer interconnections layer by layer inside the chip. Because the chips are stacked, designed and packaged within the same three-dimensional chip, there is no need for the drivers provided by IO circuits, external level boosting (for output), external level step-down (for input), tri-state controllers, ESD and surge protection circuits, and so on; no IO interfaces or IO circuit interconnections are needed, and high-density metal-layer interconnections are established directly across chips or devices.
  • reducing the use of IO structures (IO interfaces or IO circuits) between chips increases the interconnection density and interconnection speed between the data storage chips, the reconfigurable computing chips and the dynamically reconfigurable memory chips; at the same time, because the three-dimensional heterogeneous integrated interconnection does not pass through a traditional IO structure and the interconnection distance is short, the communication power consumption between chips is reduced. This improves the integration degree and interconnection frequency of the three-dimensional chip and reduces interconnection power consumption.
  • three-dimensional heterogeneous integration is a three-dimensional chip interconnection and bonding technology, such as a hybrid bonding process. On the basis of already-fabricated chips (for example, the data storage chip, the reconfigurable computing chip and the dynamically reconfigurable memory chip), a three-dimensional heterogeneous integrated bonding layer manufactured by a BEOL (back-end-of-line) process realizes high-density signal interconnection between chips, yielding the three-dimensional chip.
  • FIG. 8 is a schematic diagram of a partial structure of a computing device provided by an embodiment of the present application.
  • the computing device is a three-dimensional chip including a first functional component A, a second functional component B and a third functional component C; the first functional component A, the second functional component B and the third functional component C can each be one of, or a combination of, a data storage chip, a reconfigurable computing chip and a dynamically reconfigurable memory chip.
  • the first functional component A, the second functional component B and the third functional component C each include a top metal layer, inner metal layers, an active layer and a substrate; the top metal layer and the inner metal layers are used for signal interconnection within the functional component.
  • the active layer is used to fabricate transistors, circuits or functional arrays; the functional arrays can be data storage arrays, dynamically reconfigurable storage arrays or instantaneously reconfigurable computing arrays, and the substrate protects the modules and provides mechanical support.
  • the sides of the first functional component A and the second functional component B that are close to their top metal layers are interconnected through a three-dimensional heterogeneous bonding structure manufactured in a back-end process, forming a face-to-face interconnection; the side of the second functional component B close to its substrate and the side of the third functional component C close to its top metal layer are interconnected through a three-dimensional heterogeneous bonding structure manufactured in a back-end process, forming a back-to-face (or face-to-back) interconnection.
  • between any two of the first functional component A, the second functional component B and the third functional component C, a cross-component signal interconnection can be established through a three-dimensional heterogeneous bonding structure; depending on whether the core voltages of the components are the same, two interconnection techniques are used, as described below.
  • metal-layer connections are provided in the inner metal layers and the top metal layer; an interconnection structure 3DLink is provided in the three-dimensional heterogeneous bonding structure, and vias penetrating the active layer and the substrate form through-silicon vias (TSVs).
  • a level conversion circuit, a first functional array 1 and a first functional array 2 may be arranged in the active layer of the first functional component A; a third functional array 1 and a third functional array 2 may be arranged in the active layer of the third functional component C.
  • when the core voltages of the first functional component A and the third functional component C are the same, take the cross-component interconnection between the first functional array 2 in component A and the third functional array 2 in component C as an example: the signal led out of the inner metal layers of the first functional array 2 is interconnected through the metal-layer connections of component A and the interconnection structure 3DLink; the interconnection signal then passes through the metal-layer connections of the second functional component B and the through-silicon vias (TSVs) penetrating the active layer and thinned substrate of component B to another interconnection structure 3DLink, and onward to the metal-layer connections of the third functional component C; finally, the signal passes through the metal-layer connections of component C to reach the third functional array 2, realizing the cross-chip interconnection.
  • when the core voltages of the first functional component A and the third functional component C differ, take the cross-component interconnection between the first functional array 1 in component A and the third functional array 1 in component C as an example: a level conversion circuit is designed in the first functional component A and is interconnected with the first functional array 1 through the metal layers of component A; the level conversion circuit converts the interconnection signal of the first functional array 1 to match the core voltage of the third functional component C, after which the aforementioned method is used to interconnect across components to the third functional array 1 in component C. The level conversion circuit can also be interconnected through a three-dimensional heterogeneous bonding structure and relocated into the third functional component C or the second functional component B.
  • FIG. 9 is a schematic flow chart of a computing method of a computing device provided in an embodiment of the present application.
  • the calculation method of the calculation device provided by the embodiment of the present application includes:
  • S100: According to the target instruction, the data storage array of the data storage chip assembly stores the target data and the target instruction.
  • the target instruction may include the instruction sequence, the storage address of the target data, and the codes or attributes of the designated data storage array, instantaneously reconfigurable computing array and corresponding dynamically reconfigurable storage array; the target instruction may also include selection protocol rules for the data storage array, the instantaneously reconfigurable computing array and the corresponding dynamically reconfigurable storage array, which are not specifically limited in this application. Both the target instruction and the target data may be issued by the host system, which is not specifically limited in this embodiment of the present application. A hedged sketch of such an instruction structure is given below.
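  • As an illustration only, the sketch below models the kind of information a target instruction may carry according to the list above; the field names are assumptions introduced for clarity and are not defined in the publication.

```python
# Illustrative data structure for a target instruction. Field names are
# assumptions; the publication only lists the kinds of information included.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TargetInstruction:
    instruction_sequence: List[str]                      # target computing functions, in execution order
    target_data_address: int                             # storage address of the target data
    data_storage_array_id: Optional[int] = None          # designated data storage array
    reconfig_computing_array_id: Optional[int] = None    # designated instantaneously reconfigurable computing array
    dynamic_reconfig_array_id: Optional[int] = None      # designated dynamically reconfigurable storage array
    selection_protocol: Optional[dict] = None            # optional array-selection protocol rules

cmd = TargetInstruction(
    instruction_sequence=["conv2d", "relu", "pool"],
    target_data_address=0x1000,
)
print(cmd)
```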
  • S200: The instantaneously reconfigurable array of the reconfigurable computing chip assembly obtains, through the dynamically reconfigurable storage array of the dynamically reconfigurable memory chip assembly, at least one target computing function configuration file corresponding to the at least one target computing function recorded in the instruction sequence of the target instruction.
  • at least one target computing function may be recorded in the instruction sequence of the target instruction; when there are multiple target computing functions, the instruction sequence also records the execution order of each target computing function, which is not specifically limited in this application.
  • the instantaneously reconfigurable array can acquire, at one time, all of the target computing function configuration files corresponding to the target computing functions in the target instruction, or only part of them.
  • S300: The instantaneously reconfigurable array applies the at least one obtained target computing function configuration file. After the obtained configuration file has been applied, the array has the corresponding target computing function.
  • S400: Based on the target data, the instantaneously reconfigurable computing array executes the target computing functions in the order of the target instruction to obtain the corresponding result data; the target data serves as the input data, and the result data is obtained after the target computing functions are executed. A minimal end-to-end sketch of steps S100-S400 follows.
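  • The following is a minimal, self-contained sketch of steps S100-S400, using in-memory dictionaries as stand-ins for the data storage array and the dynamically reconfigurable storage array; the toy computing functions and all names are assumptions made for illustration, not part of the publication.

```python
# Minimal end-to-end sketch of S100-S400. Dictionaries stand in for the
# data storage array and the dynamically reconfigurable storage array.

data_storage_array = {}                    # stores target data, target instruction and results
dynamic_reconfig_storage = {               # stores computing function configuration files
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

def compute(instruction_sequence, target_data):
    # S100: store the target instruction and the target data
    data_storage_array["instruction"] = instruction_sequence
    data_storage_array["data"] = target_data

    # S200: obtain the configuration files for the recorded target computing functions
    config_files = [dynamic_reconfig_storage[name] for name in instruction_sequence]

    # S300: the instantaneously reconfigurable array applies the configurations
    configured_functions = list(config_files)

    # S400: execute the target computing functions in instruction order
    result = target_data
    for fn in configured_functions:
        result = fn(result)
    data_storage_array["result"] = result   # result data stays in the same data storage array
    return result

print(compute(["square", "negate"], 3))     # -> -9
```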
  • with the computing method provided by the embodiment of the present application, through the instantaneously reconfigurable computing array and the instantaneously reconfigurable array, the computing functions executed by the instantaneously reconfigurable computing array can be reconfigured, and all or part of the target computing functions corresponding to one target instruction can be completed in the same instantaneously reconfigurable computing array. There is no need to establish an internal global storage access network connection between the instantaneously reconfigurable computing array and the data storage array; a one-to-one or many-to-one connection between them can be established instead. This avoids a large number of internal global storage accesses during the computation under one target instruction, avoids frequent switching of instantaneously reconfigurable computing arrays and large data transfers, and can greatly improve the computing efficiency of the computing device.
  • in addition, while the instantaneously reconfigurable computing array executes the target computing functions recorded in the instruction sequence of the target instruction, it only needs to wait for the completion of the first functional configuration of the instantaneously reconfigurable array; between two adjacent target computing functions there is no need to wait for the computing function configuration of the instantaneously reconfigurable array, which further saves time, improves the execution efficiency of the target computing functions of the target instruction, further improves the computing efficiency of the computing device, and further reduces computing power consumption.
  • the computing method of the computing device further includes:
  • the data storage array storing the target data stores the result data.
  • step S200 may include:
  • the instantaneously reconfigurable array of the reconfigurable computing chip assembly obtains, through the dynamically reconfigurable storage array of the dynamically reconfigurable memory chip assembly, all target computing function configuration files corresponding to all target computing functions recorded in the instruction sequence of the target instruction.
  • Step S300 may include:
  • the instantaneously reconfigurable array applies all of the obtained target computing function configuration files.
  • with this computing method, the instantaneously reconfigurable array obtains at one time all target computing function configuration files corresponding to all target computing functions recorded in the instruction sequence of the target instruction, and completes loading the configuration files of all target computing functions; the instantaneously reconfigurable computing array then executes the corresponding configured target computing functions.
  • the instantaneously reconfigurable computing array only needs to wait for the first functional configuration of the instantaneously reconfigurable array to be completed and does not need to wait for its functional configuration again, which further saves time, improves the execution efficiency of the target computing functions of the target instruction, and thereby further improves the computing efficiency of the computing device.
  • the instruction sequence of the target instruction records the first through the Nth target computing functions, and the result data include the final result data and N-1 intermediate result data, where N is greater than or equal to 1 and N is a natural number;
  • Step S400 includes:
  • the instantaneously reconfigurable computing array, based on the target data and in the order of the target instruction, executes the nth target computing function to obtain the nth intermediate result data;
  • the instantaneously reconfigurable computing array, based on the nth intermediate result data and in the order of the target instruction, executes the (n+1)th target computing function to obtain the (n+1)th intermediate result data, where 0 < n < N-1 and n is a natural number.
  • with the computing method provided by the embodiment of the present application, the target computing functions are executed serially, and they can be executed serially according to the requirements of the target instruction, as illustrated by the sketch below.
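  • A minimal sketch of the serial case, assuming simple placeholder functions: the nth intermediate result is fed to the (n+1)th target computing function.

```python
# Serial execution sketch: each intermediate result feeds the next
# target computing function; placeholder functions only.

def run_serial(target_functions, target_data):
    intermediate = target_data
    results = []
    for fn in target_functions:            # execute in the order of the target instruction
        intermediate = fn(intermediate)    # nth result feeds the (n+1)th function
        results.append(intermediate)
    return results                         # the last entry is the final result data

print(run_serial([lambda x: x + 1, lambda x: x * 10], 4))   # [5, 50]
```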
  • the instruction sequence of the target instruction records the first through the Nth target computing functions, and the result data include the final result data and N-1 intermediate result data, where N is greater than or equal to 1 and N is a natural number;
  • Step S400 includes:
  • the instantaneously reconfigurable computing array, based on the target data and in the order of the target instruction, executes the qth target computing function and the jth target computing function synchronously to obtain the qth intermediate result data and the jth intermediate result data respectively, where 1 ≤ q < N, 1 ≤ j < N, q and j are natural numbers, and j ≠ q;
  • the instantaneously reconfigurable computing array, based on the qth intermediate result data and the jth intermediate result data and in the order of the target instruction, executes the vth target computing function to obtain the vth intermediate result data, where 1 < v < N, v is a natural number, v ≠ q and v ≠ j.
  • with the computing method provided in the embodiment of the present application, the execution of the target computing functions is partially parallel, and the target computing functions can be partially executed in parallel according to the requirements of the target instruction, as illustrated by the sketch below.
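  • A minimal sketch of the partially parallel case, assuming placeholder functions: the qth and jth target computing functions run synchronously on the target data, and the vth function combines their intermediate results.

```python
# Parallel execution sketch: the qth and jth target computing functions run
# synchronously; the vth function consumes both intermediate results.
from concurrent.futures import ThreadPoolExecutor

def run_parallel(fn_q, fn_j, fn_v, target_data):
    with ThreadPoolExecutor(max_workers=2) as pool:
        q_future = pool.submit(fn_q, target_data)     # qth target computing function
        j_future = pool.submit(fn_j, target_data)     # jth target computing function
        q_data, j_data = q_future.result(), j_future.result()
    return fn_v(q_data, j_data)                       # vth function uses both intermediate results

print(run_parallel(lambda x: x + 1, lambda x: x * 2, lambda a, b: a + b, 3))   # (3+1) + (3*2) = 10
```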
  • the transient reconfiguration array includes a multiplexer, a first configuration memory, and a second configuration memory.
  • Step S200 may include:
  • while the instantaneously reconfigurable computing array executes, based on the target data, the target computing function configured in the first configuration memory, the second configuration memory obtains, through the dynamically reconfigurable storage array, the corresponding target computing function configuration file for the target computing function recorded in the instruction sequence of the target instruction.
  • this computing method uses at least two configuration storage modules to alternately configure the target computing functions; a multiplexer selects and connects the configuration storage module configured with the current target computing function recorded in the instruction sequence of the target instruction, and the instantaneously reconfigurable computing array executes the target computing function configured in the selected configuration storage module, while the configuration storage module that is not selected can configure the next target computing function at the same time. There is no need to wait for the computing function configuration of the instantaneously reconfigurable array between two adjacent target computing functions; their execution is continuous, which further saves time, improves the execution efficiency of the target computing functions of the target instruction, and thereby further improves the computing efficiency of the computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Logic Circuits (AREA)

Abstract

This application discloses a computing device, a computing system and a computing method. The computing device includes: a data storage chip assembly including at least one layer of data storage chips; a dynamically reconfigurable memory chip assembly including at least one layer of dynamically reconfigurable memory chips, each dynamically reconfigurable memory chip including a plurality of dynamically reconfigurable storage arrays; and a reconfigurable computing chip assembly including at least one layer of instantaneously reconfigurable computing chips and at least one layer of instantaneously reconfigurable chips, each instantaneously reconfigurable computing chip including a plurality of instantaneously reconfigurable computing arrays and each instantaneously reconfigurable chip including a plurality of instantaneously reconfigurable arrays. The storage access structure of existing computing devices can be improved, frequent movement of data within the data storage arrays can be avoided, the overhead of global internal storage access can be reduced, computing efficiency can be improved, and computing power consumption can be reduced.

Description

一种计算器件、计算***及计算方法
相关申请的交叉引用
本申请基于2021年9月3日提交的中国专利申请202111033167.4主张其优先权,此处通过参照引入其全部的记载内容。
【技术领域】
本申请涉及集成芯片技术领域,尤其涉及一种计算器件、计算***及方法。
【背景技术】
三维芯片作为计算器件的存内计算***是克服存储墙的有效手段,上位***可以通过标准DDR(双倍速率同步动态随机存储器)接口,DDR接口可以是DDR1、DDR2、DDR3、DDR4、DDR5和LPDDR2、LPDDR3、LPDDR4、LPDDR5以及GDDR1、GDDR2、GDDR3、GDDR4、GDDR5、GDDR6等,对存内计算***写入数据以及配置控制指令等,存内计算***计算完毕后,上位***取回计算结果。计算结果的输入和输出均通过计算***的外部接口传输,需要经过存储墙;计算中间过程的存储访问是在存内计算***中完成的。绝大部分存储访问在存内计算***中,多个计算步骤执行过程中的计算数据共享,能够降低存储墙壁垒,即减少通过存储墙的存储访问带来的功耗增加和带宽降低。
然而,现有三维芯片作为计算器件的存内计算***,通常是计算阵列在所对应存储阵列中通过局部内部存储访问完成阶段计算后,下一个计算阵列在与其对应的下一个存储阵列中通过局部内部存储访问执行下一个阶段计算,依次逐级完成所有计算过程,上一个计算阵列的计算结果通常是下一个计算阵列的输入数据的一部分,随着计算阵列位置的变化(计算阵列的转换),计算数据也需要进行数据转移,相邻计算阵列之间存在数据级联。在计算过程中,随着数据转移量的增多,会带来全局内部存储访问的巨大开销,进而计算效率降低。
【发明内容】
本申请实施例提供一种计算器件、计算***及计算方法,能够改善现有三维芯片作为计算器件的存储访问结构,避免数据在数据存储阵列中的频繁搬移,降低全局内部存储访问开销,提高计算效率。
本申请实施例的第一方面,提供一种计算器件,包括:数据存储芯片组件,包括至少一层数据存储芯片,所述数据存储芯片包括多个数据存储阵列,所述数据存储阵列用于存储目标数据和目标指令;动态重构存储芯片组件,包括至少一层动态重构存储芯片,所述动态重构存储芯片包括多个动态重构存储阵列,所述动态重构存储阵列用于存储计算功能配置文件;可重构计算芯片组件,包括至少一层瞬时重构计算芯片和至少一层瞬时重构芯片,所述瞬时重构计算芯片包括多个瞬时重构计算阵列,所述瞬时重构芯片包括多个瞬时重构阵列,所述瞬时重构阵列用于根据所述目标指令的指令序列通过所述动态重构存储阵列获得至少一个目标计算功能配置文件、根据获得的所述目标计算功能配置文件完成功能配置,所述瞬时重构计算阵列用于基于所述目标数据,执行所述目标指令的指令序列中记录的至少一个目标计算功能,其中,所述目标计算功能配置文件是所述动态重构存储阵列存储的与所述目标计算功能对应的所述计算功能配置文件。
本申请实施例的第二方面,提供一种计算***,包括:如第一方面所述的计算器件和上位***,所述计算器件包括外部存储访问接口;所述上位***连接所述外部存储访问接口,通过所述外部存储访问接口向所述计算器件下发目标指令和目标数据。
本申请实施例的第三方面,提供一种计算器件的计算方法,应用于如第一方面所述的计 算器件,方法包括:根据目标指令,数据存储芯片组件的数据存储阵列存储目标数据和所述目标指令;可重构计算芯片组件的瞬时重构阵列通过动态重构存储芯片组件的动态重构存储阵列按照所述目标指令的指令序列中记录的至少一个目标计算功能获得对应的至少一个目标计算功能配置文件;所述瞬时重构阵列配置获得的至少一个所述目标计算功能配置文件;瞬时重构计算阵列基于所述目标数据,按照所述目标指令的顺序,执行所述目标计算功能,得到对应的结果数据。
本申请实施例提供的计算器件、计算***及计算方法,通过设置数据存储芯片组件中的数据存储阵列存储上位***下发的目标指令和目标数据,设置动态重构存储芯片组件中的动态重构存储阵列存储计算功能配置文件,设置可重构计算芯片组件中的瞬时重构阵列获取目标计算功能配置文件并进行目标计算功能的配置,瞬时重构计算阵列执行瞬时重构阵列配置的目标计算功能。瞬时重构阵列一次性可以获取至少一个目标计算功能配置文件,并完成对应目标计算功能的配置,在一个目标指令的执行过程中,只有瞬时重构阵列第一次获取目标指令的指令序列中记录的所有目标计算功能或者部分目标计算功能对应的目标计算功能配置文件时,瞬时重构计算阵列需要等待瞬时重构阵列的第一次功能配置完成,瞬时重构阵列的第一次功能配置完成后,瞬时重构计算阵列即可执行配置完成的对应目标计算功能,后续瞬时重构计算阵列可以迅速切换并执行对应其它目标计算功能,无需等待瞬时重构阵列的功能配置。瞬时重构计算阵列的执行的计算功能决定于瞬时重构阵列配置生效的目标计算功能,瞬时重构计算阵列的计算功能是可以重构的。针对现有技术中的三维芯片的计算器件,计算阵列的计算功能固定,计算过程中一个计算目标完成后,下一个计算功能转向另一个计算阵列,并伴随上一个计算功能所产生的中间结果,被下一个计算功能所对应的计算阵列所访问,由此,所有的计算阵列与所有的数据存储阵列之间需要建立内部全局存储访问网络连接,在一个目标指令的计算过程中,计算阵列在数据存储阵列中通过局部内部存储访问完成阶段计算后,下一个计算阵列在下一个数据存储阵列中通过局部内部存储访问执行下一个阶段计算,依次逐级完成所有计算过程,上一个计算阵列的计算结果作为下一个计算阵列的输入数据,随着计算阵列的转换,计算数据也需要进行数据转移,在计算过程中,随着数据转移量增多,会带来全局内部存储访问的巨大开销,进而降低计算效率,将严重影响三维芯片的计算器件的经济性和实用性。另外,用于数据搬移而必须设计的内部全局存储访问总线不仅会占用计算阵列芯片中大量面积,还会破坏计算阵列芯片中的设计布局,导致***性能下降。因此,针对现有技术存在的问题,本申请实施例提供的计算器件,通过设置瞬时重构计算阵列和瞬时重构阵列,使得瞬时重构计算阵列执行的计算功能可重构,一个目标指令对应的所有目标计算功能或者部分目标计算功能可以在同一个瞬时重构计算阵列中完成,无需对瞬时重构计算阵列和数据存储阵列建立内部全局存储访问网络连接,可以建立瞬时重构计算阵列与数据存储阵列的一对一连接或者多对一连接,能够避免在一个目标指令下的计算过程中进行大量的内部全局存储访问,避免瞬时重构计算阵列的频繁切换以及数据的大量转移,能够极大的提高计算器件的计算效率,降低计算功耗。另外,在瞬时重构计算阵列执行目标指令的指令序列中记录的目标计算功能过程中,瞬时重构计算阵列需要等待瞬时重构阵列的第一次功能配置完成,执行两个相邻目标计算功能的中间无需等待瞬时重构阵列的计算功能配置,能够进一步节省目标指令的目标计算功能的执行效率的时间,提升目标指令的目标计算功能的执行效率,从而进一步提高计算器件的计算效率,进一步降低计算功耗。
【附图说明】
图1为本申请实施例提供的一种计算器件的结构示意图;
图2为本申请实施例提供的一种计算器件逻辑结构示意图;
图3为本申请实施例提供的另一种计算器件逻辑结构示意图;
图4为本申请实施例提供的又一种计算器件逻辑结构示意图;
图5为本申请实施例提供的再一种计算器件逻辑结构示意图;
图6为本申请实施例提供的一种瞬时重构的原理示意图;
图7为本申请实施例提供的一种计算***的结构示意图;
图8为本申请实施例提供的一种计算器件的局部结构示意图;
图9为本申请实施例提供的一种计算器件的计算方法的示意性流程图。
【具体实施方式】
为了更好的理解本说明书实施例提供的技术方案,下面通过附图以及具体实施例对本说明书实施例的技术方案做详细的说明,应当理解本说明书实施例以及实施例中的具体特征是对本说明书实施例技术方案的详细的说明,而不是对本说明书技术方案的限定,在不冲突的情况下,本说明书实施例以及实施例中的技术特征可以相互组合。
计算器件的存内计算***是克服存储墙的有效手段,上位***可以通过标准DDR接口,DDR接口可以是DDR1、DDR2、DDR3、DDR4、DDR5和LPDDR2、LPDDR3、LPDDR4、LPDDR5以及GDDR1、GDDR2、GDDR3、GDDR4、GDDR5、GDDR6等,对存内计算***写入数据以及配置控制指令等,存内计算***计算完毕后,上位***取回计算结果。计算结果的输入和输出均通过计算***的外部接口传输,需要经过存储墙;计算中间过程的存储访问是在存内计算***中完成的。绝大部分存储访问在存内计算***中,多个计算步骤执行过程中的计算数据共享,能够降低存储墙壁垒,即减少通过存储墙的存储访问带来的功耗增加和带宽降低。然而,现有计算器件的存内计算***,通常是计算阵列在所对应存储阵列中通过局部内部存储访问完成阶段计算后,下一个计算阵列在与其对应的下一个存储阵列中通过局部内部存储访问执行下一个阶段计算,依次逐级以流水线方式完成所有计算过程,上一个计算阵列的计算结果通常是下一个计算阵列的输入数据的一部分,随着计算阵列位置的变化(计算阵列的转换),计算数据也需要进行数据转移,计算阵列之间存在广泛数据级联,在计算过程中,随着计算数据转移量的增多,会带来全局内部存储访问的巨大开销,进而计算效率降低。
有鉴于此,本申请实施例提供一种计算器件、计算***及计算方法,能够改善现有计算器件随着计算流水线越长,数据转移量越大,会带来全局内部存储访问的巨大开销,进而计算效率降低的问题。
本申请实施例的第一方面,提供一种计算器件。示例性的,图1为本申请实施例提供的一种计算器件的结构示意图。如图1所示,本申请实施例提供的计算器件,包括:数据存储芯片组件100、可重构计算芯片组件200和动态重构存储芯片组件300。数据存储芯片组件100包括至少一层数据存储芯片110,图1所示的数据存储芯片组件100只示意出一层数据存储芯片110,图1只是示意性的,不作为本申请的具体限定。数据存储芯片110包括多个数据存储阵列111,数据存储阵列111用于存储目标数据、目标指令和计算协议数据,计算协议数据如原数据地址、长度、格式类型和目标地址(计算或处理后的数据的存储地址)、长度、格式类型等。计算既可以包括数值计算,如乘加、卷积、相关、矩阵运算和图像、视频压缩、解压等;也可以包括数字信号处理计算,如离散傅里叶变换、数字滤波、离散余弦变换等;也包括所述数值计算和数字信号处理计算的混合计算,本申请不作具体限定。根据不同的存储需求和存储规模,数据存储芯片110可以设置不同数量的数据存储阵列111,图1只是示意性示出数据存储阵列111的数量和排列,本申请不作具体限定。根据不同的存储需求和存储规模,数据存储阵列111可以包括至少一个数据存储单元,数据存储单元用于存储不同的目标数据,本申请不作具体限定。目标数据可以来源于上位***的下发,本申请也不作具体限定。
继续参考图1,动态重构存储芯片组件300包括至少一层动态重构存储芯片310,图1所示的动态重构存储芯片组件300只包括一层动态重构存储芯片310,图1只是示例性的示意,不作为本申请的具体限定。动态重构存储芯片310包括多个动态重构存储阵列311,动态重构存储阵列311用于存储计算功能配置文件和固定计算数据,有些计算功能需求包含固 定计算数据,固定计算数据可以包括一些编程文件以及计算常数,例如图像卷积的卷积核权重和有限冲击响应滤波器的系数等,本申请不作具体限定。
可重构计算芯片组件200包括至少一层瞬时重构计算芯片210和至少一层瞬时重构芯片220,图1所示的可重构计算芯片组件包括一层瞬时重构计算芯片210和一层瞬时重构芯片220,图1只是示例性的示意,不作为本申请的具体限定。瞬时重构计算芯片210包括多个瞬时重构计算阵列211,瞬时重构芯片220包括多个瞬时重构阵列221,瞬时重构阵列221用于根据目标指令的指令序列通过动态重构存储阵列311获得至少一个目标计算功能配置文件、根据获得的目标计算功能配置文件完成瞬时重构计算阵列211的功能配置,瞬时重构计算阵列211用于基于目标数据顺序,执行目标指令的指令序列中记录的至少一个目标计算功能,其中,目标计算功能配置文件是动态重构存储阵列311存储的与目标计算功能对应的计算功能配置文件。上位***可以通过目标指令控制瞬时重构阵列221调取目标计算功能配置文件。或者,动态重构存储阵列311将目标计算功能配置文件主动发送给瞬时重构阵列221,本申请不作具体限定。目标指令的指令序列中可以记录有多个目标计算功能,目标计算功能与目标计算功能配置文件一对一或多对一。瞬时重构阵列221获得至少一个目标计算功能配置文件后,可以根据目标计算功能配置文件进行目标计算功能的配置,配置生效后瞬时重构计算阵列211可获得对应的目标计算功能。瞬时重构计算阵列211可以基于目标数据执行瞬时重构阵列221被配置的目标计算功能。目标数据可以是瞬时重构计算阵列211根据目标指令从数据存储阵列111中获取得到。需要说明的是,瞬时重构阵列221可以从动态重构存储阵列311中一次性获取目标指令的指令序列中记录的所有目标计算功能对应的计算功能配置文件(目标计算功能配置文件),之后一次性将所有目标计算功能的配置文件载入瞬时重构阵列221,并按指令序列记录的计算步骤使载入的配置文件逐一生效;或者一次获取目标指令的指令序列中记录的部分目标计算功能配置,之后利用瞬时重构计算阵列211的计算时间,将后续计算步骤对应目标计算功能的配置文件预先载入到预备区域,并在需要瞬时重构计算阵列211功能至后续计算步骤时,使对应目标功能配置文件生效。在一个目标指令的执行过程中,只有瞬时重构阵列221第一次获取目标指令的指令序列中记录的至少一个目标计算功能对应的目标计算功能配置文件时,瞬时重构计算阵列211需要等待瞬时重构阵列221的第一次功能配置完成,瞬时重构阵列221的第一次功能配置完成后,瞬时重构计算阵列211即可执行配置完成的对应目标计算功能。示例性的,瞬时重构阵列221第一次可以获取目标指令的指令序列中记录的所有目标计算功能对应的目标计算功能配置文件,对应的,瞬时重构计算阵列211可以按照目标指令的指令序列记录的顺序,执行目标计算功能,因此,一个目标指令只需进行一次功能配置,瞬时重构计算阵列211需要等待瞬时重构阵列221的第一次功能配置完成。瞬时重构阵列221第一次可以获取目标指令的指令序列中记录的部分目标计算功能对应的目标计算功能配置文件,后续在瞬时重构计算阵列211执行已经配置完成的目标计算功能时,瞬时重构阵列221可以同步获取剩余的目标计算功能对应的目标计算功能配置文件并完成配置,因此,瞬时重构计算阵列211需要等待瞬时重构阵列221的第一次功能配置完成。
继续参考图1,数据存储阵列芯片110还包括第一连接结构112,瞬时重构计算芯片210还包括第二连接结构212,瞬时重构芯片220还包括第三连接结构222,动态重构存储芯片310还包括第四连接结构312。第一连接结构112和第二连接结构212之间设置有第一片间连接结构130,第二连接结构212和第三连接结构222之间设置有第二片间连接结构230,第三连接结构222和第四连接结构312之间设置后第三片间连接结构320。数据存储阵列芯片110与瞬时重构计算芯片210可以分别通过第一连接结构112、第二连接结构212和第一片间连接结构130实现芯片之间的连接;瞬时重构计算芯片210与瞬时重构芯片220之间可以分别通过第二连接结构212、第三连接结构222和第二片间连接结构230实现芯片之间的连接;瞬时重构芯片220与动态重构存储芯片310之间可以分别通过第三连接结构222、第四连接结构312和第三片间连接结构320实现芯片之间的连接。具体连接方式和连接工艺,本申请 不作具体限定,图1所示的连接方式和连接关系只是示意性的,不作为本申请的具体限定。
本申请实施例提供的计算器件,通过设置数据存储芯片组件100中的数据存储阵列111用于存储上位***下发的目标指令和目标数据,设置动态重构存储芯片组件300中的动态重构存储阵列311用于存储计算功能配置文件,设置可重构计算芯片组件200中的瞬时重构阵列221用于获取目标计算功能配置文件并进行目标计算功能的配置,瞬时重构计算阵列211用于执行瞬时重构阵列221配置的目标计算功能。瞬时重构阵列221一次性可以获取至少一个目标计算功能配置文件,并完成对应目标计算功能的配置,在一个目标指令的执行过程中,只有瞬时重构阵列221第一次获取目标指令的指令序列中记录的所有目标计算功能或者部分目标计算功能对应的目标计算功能配置文件时,瞬时重构计算阵列211需要等待瞬时重构阵列221的第一次功能配置完成,瞬时重构阵列221的第一次功能配置完成后,瞬时重构计算阵列211即可连续执行配置完成的对应目标计算功能,后续瞬时重构计算阵列211可以连续执行配置完成的对应目标计算功能,无需等待瞬时重构阵列221的功能配置。瞬时重构计算阵列211的执行的计算功能主要依据瞬时重构阵列221配置的目标计算功能,瞬时重构计算阵列211的计算功能是可以重构的。针对现有技术中的计算器件,计算阵列的计算功能固定,计算过程中一个计算目标完成后,下一个计算功能转向另一个计算阵列,并伴随上一个计算功能所产生的中间结果,被下一个计算功能所对应的计算阵列所访问,由此,所有的计算阵列与所有的数据存储阵列之间需要建立内部全局存储访问网络连接,在一个目标指令的计算过程中,计算阵列在数据存储阵列中通过局部内部存储访问完成阶段计算后,下一个计算阵列在下一个数据存储阵列中通过局部内部存储访问执行下一个阶段计算,依次逐级完成所有计算过程,上一个计算阵列的计算结果作为下一个计算阵列的输入数据,随着计算阵列的转换,计算数据也需要进行数据转移,在计算过程中,随着数据转移量增多,会带来全局内部存储访问的巨大开销,进而降低计算效率,将严重影响计算器件的经济性和实用性。另外,用于数据搬移而必须设计的内部全局存储访问总线不仅会占用计算阵列芯片中大量面积,还会破坏计算阵列芯片中的设计布局,导致***性能下降。因此,针对现有技术存在的问题,本申请实施例提供的计算器件,通过设置瞬时重构计算阵列211和瞬时重构阵列221,使得瞬时重构计算阵列211的执行的计算功能可重构,一个目标指令对应的所有目标计算功能或者部分目标计算功能可以在同一个瞬时重构计算阵列211中完成,无需对瞬时重构计算阵列211和数据存储阵列111建立内部全局存储访问网络连接,可以主要使用瞬时重构计算阵列211与数据存储阵列111的一对一连接或者多对一连接,实现高带宽局部互连,能够大幅降低甚至避免在一个目标指令下的计算过程中进行大量的内部全局存储访问,显著减少瞬时重构计算阵列的频繁切换以及数据的大量搬移,极大提高计算器件的计算效率,降低计算功耗。另外,在瞬时重构计算阵列211执行目标指令的指令序列中记录的目标计算功能过程中,瞬时重构计算阵列211需要等待瞬时重构阵列221的第一次功能配置完成,执行两个相邻目标计算功能的中间无需等待瞬时重构阵列的计算功能配置,能够进一步节省目标指令的目标计算功能的执行效率的时间,提升目标指令的目标计算功能的执行效率,从而进一步提高计算器件的计算效率,进一步降低计算功耗。
在一些实施方式中,瞬时重构阵列221用于获得目标指令的指令序列中记录的所有目标计算功能对应的所有目标计算功能配置文件,并完成功能配置。
本申请实施例提供的计算器件,瞬时重构阵列221可以一次性获取目标指令的指令序列中记录的所有目标计算功能对应的所有目标计算功能配置文件,并完成所有目标计算功能的配置文件载入瞬时重构阵列221,实现瞬时重构计算阵列211配置目标计算功能,配置目标计算功能的过程可以理解为按计算步骤将载入的配置文件逐一在瞬时重构计算阵列211上生效,瞬时重构计算阵列211逐一执行对应目标计算功能,瞬时重构计算阵列211只需要等待瞬时重构阵列221的第一次功能配置完成,无需再次等待瞬时重构阵列221的功能配置,能够进一步节省目标指令的目标计算功能的执行效率的时间,提升目标指令的目标计算功能的执行效率,从而进一步提高计算器件的计算效率。
在一些实施方式中,存储有目标数据的数据存储阵列111还用于存储结果数据,结果数据由瞬时重构计算阵列211基于目标数据执行目标计算功能得到,结果数据包括中间结果数据和最终结果数据,瞬时重构计算阵列211用于执行当前目标计算功能基于的目标数据为执行上一个目标计算功能得到的中间结果数据,最终结果数据由瞬时重构计算阵列211执行最后一个目标计算功能得到。
本申请实施例提供的计算器件,目标数据、中间结果数据和最终结果数据均存储在同一个数据存储阵列111内,可以避免瞬时重构计算阵列211在执行不同目标计算功能的过程中的输入数据和输出数据的存储在不同数据存储阵列111中引起数据的大量转移,能够避免在一个目标指令下的计算过程中进行大量的内部全局存储访问,能够进一步提高计算器件的计算效率。
在一些实施方式中,图2为本申请实施例提供的一种计算器件逻辑结构示意图。示例性的,如图2所示,数据存储阵列111和瞬时重构计算阵列211一一对应;和/或,
瞬时重构计算阵列211与瞬时重构阵列221一一对应。
图2示出m个数据存储阵列111,分别是数据存储阵列1、数据存储阵列2、数据存储阵列3…数据存储阵列m,m个瞬时重构计算阵列211,分别是瞬时重构计算阵列1、瞬时重构计算阵列2、瞬时重构计算阵列3…瞬时重构计算阵列m,m个瞬时重构阵列221,分别包括瞬时重构阵列1、瞬时重构阵列2、瞬时重构阵列3…瞬时重构阵列m,m个动态重构存储阵列311,分别是动态重构存储阵列1、动态重构存储阵列2、动态重构存储阵列3…动态重构存储阵列m,m为大于零的自然数。图2所示的数据存储阵列111、瞬时重构计算阵列211、瞬时重构阵列221和动态重构存储阵列311四者一一对应,图2只是示意性的,不作为本申请的具体限定。所有的目标计算功能配置文件均可以在同一个动态重构存储阵列311中获取,能够避免在调用目标计算功能配置文件时引起频繁的内部全局存储访问,可以提高文件调用的效率;数据存储阵列111可以部分或全部对应多个瞬时重构计算阵列211,以提供瞬时重构计算阵列211间的存储访问便利;多个瞬时重构计算阵列211可以对应一个瞬时重构阵列221,以提高瞬时重构阵列221中数据载入逻辑单元的利用效率。
本申请的计算器件,数据存储阵列111和瞬时重构计算阵列211一一对应;和/或,瞬时重构计算阵列211与瞬时重构阵列221一一对应。可以避免建立所有数据存储阵列111和所有瞬时重构计算阵列211建立全局存储访问连接,数据存储阵列111和瞬时重构计算阵列211建立一一对应连接即可。以及可以避免所有瞬时重构计算阵列211与所有瞬时重构阵列221建立全局存储访问连接,瞬时重构计算阵列211与所有瞬时重构阵列221建立一一对应连接即可。所有目标计算功能对应的数据可以存储在同一个数据存储阵列111中,可以避免在调用数据时发生频繁的内部全局存储访问,可以提高数据调取和数据存储的效率。瞬时重构计算阵列211与瞬时重构阵列221一一对应,可以避免在执行目标计算功能时产生内部全局存储访问,进一步提升执行目标计算功能的速度,提升计算器件的计算效率。
在一些实施方式中,继续参考图2,执行目标指令的指令序列中记录的所有目标计算功能的瞬时重构计算阵列211为同一个瞬时重构计算阵列211。一个目标指令中所有的目标计算功能执行完成可以视为1个计算循环,图2示出m个计算循环,分别为计算循环1、计算循环2、计算循环3…计算循环m。执行目标指令的指令序列中记录的所有目标计算功能的瞬时重构计算阵列211为同一个瞬时重构计算阵列211,则一个计算循环的所有目标计算功能均在同一个瞬时重构计算阵列211中完成,能够避免在计算循环中对于不同瞬时重构计算阵列211的频繁内部全局访问,可以提高一个计算循环的计算效率。
在一些可行的实施方式中,图3为本申请实施例提供的另一种计算器件逻辑结构示意图。如图3所示,动态重构存储阵列311包括至少一个重构存储单元,重构存储单元用于存储所述计算功能配置文件。每个动态重构存储阵列311中的多个重构存储单元可以分别表示为step1、step2、step3至stepk,k可以是大于0的自然数。重构存储单元可以视为计算功能配置文件的原始存储空间,重构存储单元越多,存储密度越大,存储的功能越多。一个动态 重构存储阵列中的所有重构存储单元可以对应存储一个计算循环所需的所有目标计算功能配置文件,在同一个计算循环(即执行同一个目标指令中的所有目标计算功能)中,在对于目标计算功能配置文件调取时,瞬时重构阵列221可以只对应一个动态重构存储阵列311,该动态重构存储阵列311可以是颗粒度较大的动态重构存储阵列311,能够避免在调用目标计算功能配置文件时引起频繁的内部全局存储访问,无需对动态重构存储阵列311和瞬时重构阵列221建立内部全局存储访问连接,可以提高文件调用的效率。
图4为本申请实施例提供的又一种计算器件逻辑结构示意图。示例性的,如图4所示,在一些实施方式中,瞬时重构阵列221可以包括至少两个瞬时配置存储模块,瞬时配置存储模块可以包括多路选择器221a和至少两个配置存储模块221b,配置存储模块221b用于获得目标指令的指令序列中记录的所有目标计算功能对应的所有目标计算功能配置文件,并通过多路选择器221a的切换,使对应目标计算功能配置文件在瞬时重构计算阵列211上生效。多路选择器221a用于基于目标指令的指令序列中记录的顺序,选择连接配置有对应目标计算功能的配置存储模块221b,以使瞬时重构计算阵列211执行配置存储模块221b中配置的目标计算功能。
配置存储模块221b可用任可支持随机读取的存储器单元实现,例如SRAM和Nor Flash等,本申请不作具体限定。
示例性的,如图4所示,瞬时重构计算芯片210包括多个瞬时重构计算阵列211,每个瞬时重构计算阵列211包括多个可编程逻辑块可编程逻辑块LAB/CLB,如图4所示的可编程逻辑块LAB/CLB可以分别表示为LAB/CLB_00、LAB/CLB_01…,LAB/CLB_10、LAB/CLB_11…,LAB/CLB_20、LAB/CLB_21…,LAB/CLB_30、LAB/CLB_31…;瞬时重构芯片220包括多个瞬时重构阵列221,瞬时重构阵列221与瞬时重构计算阵列211对应,如图4所示,每个瞬时重构阵列221包括多个瞬时配置存储模块,瞬时配置存储模块包括配置存储模块221b,配置存储模块221b与可编程逻辑块可编程逻辑块对应,每个瞬时配置存储模块可以包括1个多路选择器221a和i个配置存储模块221b,i为自然数,i可以表示瞬时重构计算芯片210的设计最大计算步骤,图4所示的多路选择器221a表示为MUX_LAB/CLB_00、MUX_LAB/CLB_01…,配置存储模块221b由多个配置存储器CRAM构成,表示为CRAM_STP1、CRAM_STP2、CRAM_STP3…CRAM_STPi。一个目标指令的指令序列中记录的所有目标计算功能可以被配置在一个瞬时重构阵列221中,所有目标计算功能,被分解到每个瞬时重构计算阵列211的每个可编程逻辑块上,每个可编程逻辑块执行所有目标计算功能的一部分,其中配置存储模块221b可以对应配置可编程逻辑块一个目标计算功能,CRAM_STP1、CRAM_STP2、CRAM_STP3…CRAM_STPi可以分别对应i个计算步骤中,可编程逻辑块的配置状态,多路选择器221a则可以根据目标指令的指令序列中记录的执行顺序,选择将配置存储模块221b中的哪一个配置存储器切换至对应可编程逻辑块,如LAB/CLB_00切换为与CRAM_STP1互连,完成对应步骤计算功能后,切换为与CRAM_STP2互连,依次类推,直至完成其最后一个步骤的计算功能。每个可编程逻辑块可以对应完成一个目标指令的一个计算循环,图4只是示意性的,不作为本申请的具体限定。在本申请实施例中,动态重构存储芯片310可以是通过总线与瞬时重构芯片220进行连接,动态重构存储芯片310可以一次性将所有的目标计算功能配置文件传送给瞬时重构芯片220,存放于多个瞬时配置存储模块内。
本申请实施例提供的计算器件,一个多路选择器221a连接多个配置存储模块221b,每个配置存储模块221b中配置有一个目标计算功能的一部分,每个瞬时重构计算阵列211中可以包括多个可编程逻辑块,每个可编程逻辑块可以执行一个目标计算功能的一部分,并与多路选择器221a对应。根据目标指令的指令序列中记录的顺序,可编程逻辑块按照执行顺序,执行多路选择器221a选择的配置存储模块221b中的配置存储器,被选择的配置存储器中配置有当前步骤需要执行的对应部分的目标计算功能。能够实现一次性可以获取一个目标指令中的所有目标计算功能对应的目标计算功能配置文件,并完成所有目标计算功能的配置文件 载入瞬时重构阵列221,并按计算步骤使载入的配置文件逐一在瞬时重构计算阵列211上生效。多个多路选择器切换配置存储器,立刻切换对应瞬时重构计算阵列211的计算功能,实现瞬时重构。
在一些实施方式中,继续参考图4,瞬时重构芯片220还包括瞬时重构控制逻辑模块223,瞬时重构控制逻辑模块223用于根据目标指令的指令序列从动态重构存储阵列311中获得目标计算功能对应的目标计算功能配置文件,并加载到每个配置存储器中。
在一些实施方式中,继续参考图4,一个多路选择器221a对应连接多个配置存储模块221b。
本申请实施例提供的计算器件,瞬时重构控制逻辑模块223可以实现依据目标指令对目标计算功能配置文件的自动调取,可以实现瞬时重构阵列221一次性调取一个计算循环中的所有目标计算功能对应的所有目标计算功能配置文件,以完成所有目标计算功能的配置文件载入瞬时重构阵列221,并按计算步骤使载入的配置文件逐一在瞬时重构计算阵列211上生效,能够避免为实现不同计算功能而频繁转换计算阵列,也同时避免计算数据的频繁转移,能够提高计算器件的计算效率,进一步扩大计算器件的经济性和实用性。
在一些实施方式中,瞬时重构计算芯片和瞬时重构芯片设置在同一个芯片层上。示例性的,至少一个瞬时重构计算阵列和至少一个瞬时重构阵列设置在同一个芯片上。将两种功能的芯片整合为一层,能够简化计算器件的芯片层数,减少制备工艺流程,节约成本。
在一些实施方式中,图5为本申请实施例提供的另外一种计算器件逻辑结构示意图。如图5所示,示例性的,瞬时重构计算芯片210和瞬时重构芯片220设置在同一个芯片层上,即多个瞬时重构计算阵列211与多个瞬时重构阵列221设置在同一个芯片层上,每个瞬时重构阵列221包括多路选择器MUX、第一配置存储器CRAMA和第二配置存储器CRAMB。多路选择器MUX用于基于目标指令的指令序列中记录的顺序,选择连接配置有当前目标计算功能的第一配置存储器CRAMA,以使瞬时重构计算阵列执行第一配置存储器CRAMA配置的当前目标计算功能;第二配置存储器CRAMB用于在瞬时重构计算阵列执行第一配置存储器CRAMA配置的当前目标计算功能的过程中,通过动态重构存储阵列311按照目标指令的指令序列中记录的目标计算功能获得下一个目标计算功能配置文件并完成功能配置。示例性的,如图5所示,瞬时重构计算阵列211可以包括多个可编程逻辑块,图5中可编程逻辑块表示为LAB/CLB_00、LAB/CLB_01…,LAB/CLB_10、LAB/CLB_11…,每个可编程逻辑块可以对应1个多路选择器MUX、1个第一配置存储器CRAMA和1个第二配置存储器CRAMB,可编程逻辑块用于执行多路选择器MUX选择连接的第一配置存储器CRAMA或第二配置存储器CRAMB配置的目标计算功能。或者,瞬时重构计算阵列211可以表示为LAB/CLB_00、LAB/CLB_01…,LAB/CLB_10、LAB/CLB_11…,瞬时重构计算阵列211与瞬时重构阵列221一一对应,本申请不作具体限定。动态重构存储芯片310中的多个动态重构存储阵列可以表示为PRF1STP1、PRF1STP2、…、PRF1STPx、…、PRF1STPi,i为自然数,每个PRF1STPx对应准备加载到可编程逻辑块中,并实现对应计算步骤的所有配置文件,其中每个PRF1STPx包含多个配置子文件,0<x<i,x为自然数,其中LAB/CLB_00、LAB/CLB_01…,LAB/CLB_10、LAB/CLB_11…分别对应PRF1STP1、PRF1STP2…计算步骤上,需要加载到可编程逻辑块LAB/CLB_00、LAB/CLB_01…,LAB/CLB_10、LAB/CLB_11…中,并通过多路选择器使其生效的配置子文件。
示例性的,若LAB/CLB_00、LAB/CLB_01…,LAB/CLB_10、LAB/CLB_11…表示为一个瞬时重构计算阵列211,一个目标指令的指令序列记录的所有目标计算功能可以在一个瞬时重构计算阵列211中完成。具体的,若一个目标指令的指令序列记录有4个目标计算功能,分别是第1目标计算功能、第2目标计算功能、第3目标计算功能和第4目标计算功能,分别对应第1目标计算功能配制文件PRF1STP1、第2目标计算功能配置文件PRF1STP2、第3目标计算功能配置文件PRF1STP3和第4目标计算功能配置文件PRF1STP4,首先,LAB/CLB_00、LAB/CLB_01…,LAB/CLB_10、LAB/CLB_11…的所有第一配置存储器CRAMA 和第二配置存储器CRAMB可以同时分别获得第1目标计算功能配制文件和第2目标计算功能配置文件,并完成第1目标计算功能和第2目标计算功能的功能配置。所有第一配置存储器CRAMA配置,得到第1目标计算功能以及所有第二配置存储器CRAMB配制得到第2目标计算功能后,所有多路选择器MUX可以根据目标指令的指令序列中记录的执行顺序,选择连接配置对应第1目标计算功能的所有第一配置存储器CRAMA,瞬时重构计算阵列211具备并执行所有第一配置存储器CRAMA中配制的第1目标计算功能。第1目标计算功能执行完成后,多路选择器MUX可以根据目标指令的指令序列中记录的执行顺序,选择连接配置对应第2目标计算功能的所有第二配置存储器CRAMB,瞬时重构计算阵列211具备并执行所有第二配置存储器CRAMB中配制的第2目标计算功能,在瞬时重构计算阵列211执行所有第二配置存储器CRAMB中配制的第2目标计算功能的同时,所有第一配置存储器CRAMA被释放,开始加载第3目标计算功能配制文件并完成功能配制。第2目标计算功能执行完成后,多路选择器MUX可以根据目标指令的指令序列中记录的执行顺序,选择连接配置有第3目标计算功能的所有第一配置存储器CRAMA,瞬时重构计算阵列211执行所有第一配置存储器CRAMA中配制的第3目标计算功能,在瞬时重构计算阵列211执行所有第一配置存储器CRAMA中配制的第3目标计算功能的同时,所有第二配置存储器CRAMB被释放,开始加载第4目标计算功能配制文件并完成功能配制。第3目标计算功能执行完成后,多路选择器MUX可以根据目标指令的指令序列中记录的执行顺序,选择连接配置有第4目标计算功能的所有第二配置存储器CRAMB,瞬时重构计算阵列211执行所有第二配置存储器CRAMB中配制的第4目标计算功能,第4目标计算功能执行完成后,目标指令的指令序列中记录的所有目标计算功能执行完成。
需要说明的是,上述实施例只是示意性描述每个多路选择器对应两个配置存储器的情形,每个多路选择器可以对应多个配置存储器。具体的,每个多路选择器可以对应多个配置存储器时,其中,被多路选择器选择连接的配置存储器计算行为,其余配置存储器同步载入后续计算步骤对应的目标计算功能配置文件。能够大大降低了,因某计算步骤过短,下一个计算步骤的配置存储器尚未完成配置文件载入,而产生等待延迟的风险。
示例性的,配置配置存储器里的数据,确定可编程逻辑块的功能,即通过配置配置存储器数据,实现瞬时重构计算阵列211的功能配置,图6为本申请实施例提供的一种瞬时重构的原理示意图。如图6所示,查找表LUT是FPGA/eFPGA的可重构基础结构之一,多个LUT构成一个可编程逻辑块,图6中4输入查找表4-LUT是构成LAB/CLB的典型可重构基础结构,4-LUT有四个逻辑输入A、B、C和D以及一个逻辑输出Y;图6中每个梯形结构为一个二选一的多路选择器MUX,本申请不作具体限定。4-LUT为4输入通道查找表,3-LUT为3输入通道查找表,4-LUT的四个逻辑输入,作为多路选择器的选择端,每个多路选择器的选择端为1时,选通多路选择器的1端输入数据到输出接口,每个多路选择器的选择端为0时,选通多路选择器的0端输入数据到输出接口,所以4-LUT的逻辑输出Y与LUT的四个逻辑输入A、B、C和D的关系由配置存储器中的数据决定;例如当配置存储器的数据自上而下为十六进制0x8009,即二进制1000 0000 0000 1001时,4-LUT的逻辑输出Y与LUT的四个逻辑输入A、B、C和D的关系为:
Figure PCTCN2022113709-appb-000001
更改配置存储器的数据为其它,可实现4-LUT的逻辑输出Y与4-LUT的四个逻辑输入A、B、C和D的任意对应关系;4-LUT的结构,如图6所示,由两个3-LUT加一个多路选择器组合而成;类似的,可以由于4-LUT,扩展成5-LUT和6-LUT结构,分别对应配置存储器的位(bit)数量为25和26个。对没个LUT结构设计多组配置存储器,并由多路选择器切换其中一个配置存储器作用于LUT,可实现瞬时重构LUT功能。
瞬时重构计算阵列211还可以包括处理模块,处理模块受瞬时重构计算阵列211中可编程逻辑块调度,用于根据获得的目标计算功能配置文件配置对应的目标计算功能;处理模块还用于基于目标数据,执行被配置的目标计算功能。示例性的,处理模块可以包括计算单元 和静态随机存储模块,计算单元可以不限于乘加计算单元、乘法计算单元、脉动处理器、哈希计算单元和机器学习单元等,本申请不作具体限定。瞬时重构计算阵列还可以包括其它硬核IP,硬核IP可以理解为现有的有效运算单元(硬件器件),本申请不作具体限定。处理模块和/或硬核IP还可以嵌入FPGA(现场可编程逻辑门阵列)或eFPGA(嵌入式现场可编程逻辑门阵列)的内部结构(fabric)中,可以利用其可编程性实现可重构功能,本申请不作具体限定。采用FPGA或eFPGA,可以适应性的增加有效运算密度,即增加计算器件的密度,进而实现增加计算功能的种类和数量。示例性的,配置存储模块可以用于根据获得的目标计算功能配置文件配置目标计算功能,计算单元可以根据配置存储模块配置的目标计算功能进行对应功能的计算,得到结果数据。
本申请实施例提供的计算器件,通过设置至少两个配置存储模块轮换配制目标计算功能,利用多路选择器选择连接配置有目标指令的指令序列中记录的当前目标计算功能的配置存储模块,瞬时重构计算阵列执行多路选择器选择连接的配置存储模块中配置的目标计算功能,未被选择连接的配置存储模块可以同时进行下一个目标计算功能的配置载入。执行两个相邻目标计算功能的中间无需等待瞬时重构阵列的计算功能配置,两个相邻目标计算功能的执行是连续的,能够进一步节省目标指令的目标计算功能的执行效率的时间,提升目标指令的目标计算功能的执行效率,从而进一步提高计算器件的计算效率。
瞬时重构计算阵列211可连续执行配置完成的对应目标计算功能,无需等待瞬时重构阵列221的功能配置,可通过两种技术线路实现:瞬时重构阵列221一次性加载所有后续目标计算功能配置文件至瞬时重构阵列221中的多个配置存储模块,并通过切换多路选择器将对应当前所需目标计算功能配置文件的配置存储模块,用于配置瞬时重构计算阵列211,在瞬时重构计算阵列211完成当前目标计算功能后,通过切换多路选择器,将瞬时重构计算阵列211的计算功能切换至下一计算步骤对应的目标计算功能配置文件的配置存储模块;瞬时重构阵列221除了加载并配置完成当前目标计算功能配置文件,在瞬时重构计算阵列211完成当前目标计算功能的过程中,将至少下一个计算步骤对应的目标计算功能配置文件,预先加载到预备配置存储模块,在瞬时重构计算阵列211完成当前目标计算功能后,通过切换多路选择器,将瞬时重构计算阵列211的计算功能切换至预备配置存储模块,使预先载入的下一计算步骤对应的目标计算功能配置文件生效,切换后,瞬时重构阵列221中的上一个计算步骤对应的配置存储模块被释放,并作为预备配置存储模块,在瞬时重构计算阵列211计算过程中,预先加载后续计算步骤对应的目标计算功能配置文。
在一些实施方式中,相邻两层芯片之间通过异质集成连接组件层叠连接,异质集成连接组件用于连接不同制备工艺制备的芯片。由于数据存储阵列芯片110、瞬时重构计算芯片210、瞬时重构芯片220和动态重构存储芯片310的功能不同,因此,四者的制备工艺可能存在或多或少的差异,属于异质芯片,将异质芯片集成为计算器件,需要在异质芯片之间建立密集的连接,这种密集的连接可以采用异质集成连接组件。如图1所示,异质集成连接组件可以包括第一连接结构112、第二连接结构212和第一片间连接结构130,或者包括第二连接结构212、第三连接结构222和第二片间连接结构230,或者包括第二连接结构212和第二片间连接结构230,或者包括第三连接结构222、第四连接结构312和第三片间连接结构320。示例性的,第一片间连接结构130和第二片间连接结构230可以采用相同的材质,第一连接结构112、第二连接结构212、第三连接结构222和第四连接结构312可以采用相同或不同的材质。第一片间连接结构130和第一连接结构112可以采用相同的材质,本申请均不作具体限定。
本申请实施例提供的计算器件,通过异质集成连接组件可以实现不同制备工艺制备得到的芯片集成在一起形成计算器件,并实现无需建立计算器件内部的内部全局连接网络,在基于目标指令进行计算过程中,无需进行内部全局存储访问,可以提高计算器件的计算效率。
在一些实施方式中,相邻两层芯片之间采用金属键合的方式连接。示例性的,异质集成连接组件可以采用相同或不同的金属材质,例如铜和铝。示例性的,如图1所示,以异质集成连接组件中的第一连接结构112和到第二连接结构212的互连为例,第一连接结构112随 整个数据存储阵列芯片组件100,为铝连接工艺,通过后道工序在第一连接结构112下层建立三维异质键合结构,该结构对外层为铜连接,并连通第一连接结构112内部的跨芯片互连的铝连接触点;第二连接结构212随整个瞬时重构计算芯片210,为铜连接工艺,通过后道工序在第二连接结构212上层建立三维异质键合结构,该结构对外层为铜连接,并连第二连接结构212内部的跨芯片互连的铜连接触点;两个三维异质键合结构的表面贴合,并通过混合键合形成第一连接结构112和第二连接结构212对对应互连点的键合,即第一片间连接结构130,本申请不作具体限定。
本申请实施例提供的计算器件,采用金属键合的方式实现相邻两层芯片的连接,互连的物理及电气参数遵循半导体制程工艺特征,即接近芯片内互连,且可以直接建立跨芯片金属层互连,无需经过现有技术的输入输出电路,非常适合建立本申请所述芯片之间的高密度互连,互连密度和速度极大提升,即增大了带宽,功耗显著降低。
在一种可行的实施方式中,可以根据数据存储的具体需求和存储规模的设定,可以在数据存储阵列芯片组件100中设置多层数据存储阵列芯片110;也可以根据目标计算功能配置文件的存储需求或者存储规模设定,动态重构存储阵列芯片组件300设置多层动态重构存储阵列芯片310;根据计算量的需求,可重构计算芯片组件200可以设置多层瞬时重构计算芯片210和多层瞬时重构芯片220,本申请不作具体限定。示例性的,瞬时重构计算芯片210中可以单独设置一层由硬核IP组成的瞬时重构计算芯片210,本申请不作具体限定。
本申请实施例提供的计算器件,通过设置多层芯片组成芯片组件,可以得到多层芯片结构,可以根据具体的功能和规模的需求,得到相应的计算器件,最大限度的实现需求的效果。
在一些实施方式中,可重构计算芯片组件设置于数据存储芯片组件与动态重构存储芯片组件之间;和/或,
数据存储芯片组件设置于可重构计算芯片组件与动态重构存储芯片组件之间;和/或,
动态重构存储芯片组件设置于可重构计算芯片组件与数据存储芯片组件之间。
在一些实施方式中,瞬时重构计算芯片设置于数据存储芯片与瞬时重构芯片之间;和/或,
瞬时重构芯片设置于瞬时重构计算芯片与动态重构存储芯片之间;
数据存储芯片设置于瞬时重构计算芯片与动态重构存储芯片之间;和/或,
动态重构存储芯片设置于瞬时重构计算芯片与数据存储芯片之间。
对于各个芯片的层叠位置,本申请不作具体限定。
本申请实施例提供的计算器件,不同芯片设置位置设定可以根据具体功能需求进行灵活设定,同样可以使得计算器件具有更多的计算功能,更大的计算规模,可以拓宽计算器件的应用场景。
在一些实施方式中,数据存储芯片、瞬时重构计算芯片、瞬时重构芯片和动态重构存储芯片中的任意两者或多者设置在同一个芯片层上。
如果需求的功能较少或者需求的计算规模较小,可以将对应的两个或者多个芯片整合为一层芯片。示例性的,数据存储芯片与动态重构存储芯片设置在同一个芯片层上,即将至少一个数据存储阵列和至少一个动态重构存储阵列整合在一层芯片上,具体的,可以是将数据存储阵列和动态重构存储阵列间隔设置,最后连接成一层芯片结构,该层芯片结构可以兼具动态重构存储功能与数据存储功能。需要说明的是,被整合在一层芯片上的阵列需要采用能够兼容的制备工艺比较容易实现同层整合,兼容的制备工艺可以是相似或者相同的制备工艺,本申请不作具体限定。
本申请实施例提供的计算器件,通过将不同芯片合并为一层芯片的方式,实现芯片功能的整合,可以减少计算器件的制备工艺流程,工艺流程的减少也会带来不良率的降低,从而能够达到降低生产成本的效果。另外,将不同芯片整合在一层,可以增大不同功能阵列之间的互连密度,增强计算器件的计算功能和存储功能。
在一些实施方式中,数据存储阵列芯片包括数据存储阵列晶粒或数据存储阵列晶圆中的 至少一种;和/或,
动态重构存储芯片包括动态重构存储阵列晶粒或动态重构存储阵列晶圆中的至少一种;和/或,
瞬时重构计算芯片包括瞬时重构计算晶粒或瞬时重构计算晶圆中的至少一种;和/或,
瞬时重构芯片包括瞬时重构晶粒或瞬时重构晶圆中的至少一种。
需要说明的是,本申请实施例中提到的芯片可以是以晶圆或者晶粒的形态存在的产品。芯片可以为晶粒(die或者chip)、晶圆(wafer)中至少一种,但不以此为限,也可以是本领域技术人员所能想到的任何替换。其中,晶圆是指制作硅半导体电路所用的硅晶片,芯片或晶粒是指将上述制作有半导体电路的晶圆进行分割后的硅晶片,本申请的具体实施例中以芯片为例进行介绍。
本申请实施例的第二方面,提供一种计算器件计算***,图7为本申请实施例提供的一种计算器件计算***的结构示意图。如图7所示,本申请实施例提供的计算器件计算***,包括:第一方面所述的计算器件1000和上位***2000,计算器件1000包括外部存储访问接口400;上位***2000连接外部存储访问接口400,上位***2000用于通过外部存储访问接口400向计算器件1000下发目标指令和目标数据。动态重构存储阵列中的配置文件也可以由于上位***2000通过外部存储访问接口400载入。
本申请实施例提供的计算器件计算***,通过瞬时重构计算阵列211和瞬时重构阵列221,使得瞬时重构计算阵列211的执行的计算功能可重构,一个目标指令对应的所有目标计算功能或者部分目标计算功能可以在同一个瞬时重构计算阵列211中完成,主要不依赖瞬时重构计算阵列211和数据存储阵列111之间的内部全局存储访问网络连接,可以建立瞬时重构计算阵列211与数据存储阵列111的一对一连接或者多对一连接,能够避免在一个目标指令下的计算过程中进行大量的内部全局存储访问,避免瞬时重构计算阵列的频繁切换以及数据的大量转移,能够极大的提高计算器件的计算效率,降低计算功耗。另外,在瞬时重构计算阵列211执行目标指令的指令序列中记录的目标计算功能过程中,主要按顺序,执行,瞬时重构计算阵列211需要等待瞬时重构阵列221的第一次功能配置完成,执行两个相邻目标计算功能的中间无需等待瞬时重构阵列的计算功能配置,能够进一步节省目标指令的目标计算功能的执行效率的时间,提升目标指令的目标计算功能的执行效率,从而进一步提高计算器件的计算效率,进一步降低计算功耗。
本申请提供的计算器件可以是三维芯片,三维芯片中相邻芯片之间通过三维异质集成互连,逐层建立芯片内高密度金属层互连,芯片被层叠设计和封装在同一个三维芯片内,无需IO电路所提供的驱动、外部电平升压(输出时)、外部电平降压(输入时)、三态控制器、静电防护ESD和浪涌保护电路等,无需IO接口或IO电路互连,而直接建立跨芯片或者跨器件的高密度金属层互连。因此减少芯片之间的IO结构(IO接口或IO电路)的使用,增加数据存储芯片、可重构计算芯片、动态重构存储芯片之间的互连密度和互连速度;同时,三维异质集成互连因不通过传统IO结构,且互连距离较短,降低了芯片之间的通讯功耗;进而提高了三维芯片的集成度以及互连频率,并降低了互连功耗。具体的优势体现在两点:动态重构存储芯片中的瞬时重构计算阵列与可重构计算芯片中的瞬时重构阵列之间建立广泛的高密度互连,实现瞬时重构的基础条件;可重构计算芯片与数据存储芯片之间建立广泛的高密度互连,实现可编程、高带宽、低功耗存储访问。
三维异质集成是一种三维芯片互连键合的技术,例如Hybrid Bonding(混合键合)工艺等。通过在已制备的芯片(例如数据存储芯片、可重构计算芯片和动态重构存储芯片)基础上,利用BEOL(后道工序)制造的三维异质集成键合层,实现芯片之间信号的高密度互连,制备得到三维芯片。
示例性的,图8为本申请实施例提供的一种计算器件的局部结构示意图。如图8所示,计算器件为三维芯片,包括第一功能组件A、第二功能组件B和第三功能组件C,第一功能组件A、第二功能组件B和第三功能组件C可以为数据存储芯片、可重构计算芯片、动态重 构存储芯片中的一种或者多种的组合。第一功能组件A、第二功能组件B和第三功能组件C均包含顶层金属层、内部金属层有源层和衬底,其中,顶层金属层和内部金属层用于功能组件内的信号互连;有源层用于制备晶体管、电路或者功能阵列,功能阵列可以为数据存储阵列、动态重构存储阵列、瞬时重构计算阵列;衬底用于保护模块及提供机械支撑等。第一功能组件A和第二功能组件B上接近顶层金属层的一面通过后道工序制造三维异质键合结构进行互连,形成面对面的互连结构;第二功能组件B上接近衬底的一面和第三功能组件C上接近顶层金属层的一面,通过后道工序制造三维异质键合结构互连,形成背对面(或面对背)的互连结构。第一功能组件A、第二功能组件B和第三功能组件C任意两者之间,可以通过三维异质键合结构建立跨组件信号互连。基于第一功能组件A、第二功能组件B和第三功能组件C的内核电压是否相同,对应两种互连技术。内部金属层和顶层金属层内设置有金属层连接,三维异质键合结构内设置有互连结构3DLink,贯穿有源层和沉底层的通孔形成硅通孔TSV。如图8所示,第一功能组件A的有源层内可以设置有电平转换电路、第一功能阵列1和第一功能阵列2;第三功能组件C的有源层内设置有第三功能阵列1和第三功能阵列2。
当第一功能组件A和第三功能组件C的内核电压相同时,第一功能组件A中的第一功能阵列2与第三功能组件C中的第三功能阵列2建立跨组件互连为例:第一功能阵列2在第一功能组件A中内部金属层的引出信号,通过第一功能组件A的金属层连接和互连结构3DLink形成互连;互连信号通过第二功能组件B的金属层连接以及贯穿第二功能组件B的有源层和减薄的衬底的硅通孔TSV互连至互连结构3DLink,进而互连至第三功能组件C的金属层连接;互连信号通过第三功能组件C的金属层连接,实现跨芯片互连第三功能组件C中的第三功能阵列2。
当第一功能组件A和第三功能组件C的内核电压不同时,以第一功能组件A中的第一功能阵列1与第三功能组件C中的第三功能阵列1建立跨组件互连为例:在第一功能组件A中设计电平转换电路,电平转换电路与第一功能阵列1在第一功能组件A中通过金属层互连;电平转换电路将第一功能阵列1的互连信号转换成匹配第三功能组件C的内核电压后,使用前述方法跨组件互连至第三功能组件C中的第三功能阵列1。并且,电平转换电路也可以通过三维异质键合结构互连,被转移设计到第三功能组件C或第二功能组件B中。
本申请实施例的第三方面,提供一种计算器件的计算方法,应用于如第一方面所述的计算器件,图9为本申请实施例提供的一种计算器件的计算方法的示意性流程图。如图9所示,本申请实施例提供的计算器件的计算方法,包括:
S100:根据目标指令,数据存储芯片组件的数据存储阵列存储目标数据和所述目标指令。目标指令中可以包括有指令序列、目标数据的存放地址,指定数据存储阵列、瞬时重构计算阵列以及对应的动态重构存储阵列的编码或者属性等,目标指令还可以包括数据存储阵列、瞬时重构计算阵列以及对应的动态重构存储阵列的选择协议规则等,本申请不作具体限定。目标指令和目标数据均可以来源于上位***的下发,本申请实施例不作具体限定。
S200:可重构计算芯片组件的瞬时重构阵列通过动态重构存储芯片组件的动态重构存储阵列按照目标指令的指令序列中记录的至少一个目标计算功能获得对应的至少一个目标计算功能配置文件。目标指令的指令序列中可以记录有至少一个目标计算功能,当有多个目标计算功能时,指令序列会记录有各个目标计算功能的执行顺序等,本申请不作具体限定。瞬时重构阵列可以一次性获取目标指令中的所有目标计算功能对应的所有目标计算功能配置文件或者部分目标计算功能配置文件。
S300:瞬时重构阵列配置获得的至少一个目标计算功能配置文件。瞬时重构阵列配置获得的目标计算功能配置文件后,具备对应的目标计算功能。
S400:瞬时重构计算阵列基于目标数据,按照目标指令的顺序,执行目标计算功能,得到对应的结果数据。目标数据作为输入数据经过执行目标计算功能,得到结果数据。
本申请实施例提供的计算器件的计算方法,通过瞬时重构计算阵列和瞬时重构阵列,使得瞬时重构计算阵列的执行的计算功能可重构,一个目标指令对应的所有目标计算功能或者 部分目标计算功能可以在同一个瞬时重构计算阵列中完成,无需对瞬时重构计算阵列和数据存储阵列建立内部全局存储访问网络连接,可以建立瞬时重构计算阵列与数据存储阵列的一对一连接或者多对一连接,能够避免在一个目标指令下的计算过程中进行大量的内部全局存储访问,避免瞬时重构计算阵列的频繁切换以及数据的大量转移,能够极大的提高计算器件的计算效率,降低计算功耗。另外,在瞬时重构计算阵列执行目标指令的指令序列中记录的目标计算功能过程中,,瞬时重构计算阵列需要等待瞬时重构阵列的第一次功能配置完成,执行两个相邻目标计算功能的中间无需等待瞬时重构阵列的计算功能配置,能够进一步节省目标指令的目标计算功能的执行效率的时间,提升目标指令的目标计算功能的执行效率,从而进一步提高计算器件的计算效率,进一步降低计算功耗。
In some implementations, the computing method of the computing device further includes:
according to the target instruction, the data storage array that stores the target data stores the result data.
In some implementations, step S200 may include:
the instantaneous reconfiguration array of the reconfigurable computing chip component obtains, through the dynamically reconfigurable storage array of the dynamically reconfigurable storage chip component, all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction.
Step S300 may include:
the instantaneous reconfiguration array is configured with all the obtained target computing function configuration files.
In the computing method of the computing device provided by the embodiments of the present application, the instantaneous reconfiguration array obtains, in one pass, all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction, loads all of these configuration files into the instantaneous reconfiguration array 221, and makes the loaded configuration files take effect one by one on the instantaneously reconfigurable computing array 211 according to the computation steps. The instantaneously reconfigurable computing array then executes the configured target computing functions; it only needs to wait for the first function configuration of the instantaneous reconfiguration array to complete and does not need to wait for any further function configuration, which further saves the execution time of the target computing functions of the target instruction and improves their execution efficiency, thereby further improving the computing efficiency of the computing device.
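As a back-of-the-envelope illustration of this saving, the sketch below compares waiting for a configuration before every target computing function with waiting only for the first one; the time values are made-up illustrative units, not measured figures.

```python
def per_function_configuration_time(num_functions: int,
                                    t_config: float, t_execute: float) -> float:
    """Every target computing function waits for its own configuration."""
    return num_functions * (t_config + t_execute)


def prefetched_configuration_time(num_functions: int,
                                  t_config: float, t_execute: float) -> float:
    """Only the first configuration is waited for; the rest are already loaded."""
    return t_config + num_functions * t_execute


if __name__ == "__main__":
    n, t_cfg, t_exe = 8, 5.0, 2.0  # illustrative time units only
    print(per_function_configuration_time(n, t_cfg, t_exe))  # 56.0
    print(prefetched_configuration_time(n, t_cfg, t_exe))    # 21.0
```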
In some implementations, the instruction sequence of the target instruction records a 1st target computing function to an N-th target computing function, the result data includes final result data and N-1 pieces of intermediate result data, N is greater than or equal to 1, and N is a natural number;
Step S400 includes:
the instantaneously reconfigurable computing array executes, based on the target data and in the order of the target instruction, the n-th target computing function to obtain n-th intermediate result data;
the instantaneously reconfigurable computing array executes, based on the n-th intermediate result data and in the order of the target instruction, the (n+1)-th target computing function to obtain (n+1)-th intermediate result data, where 0<n<N-1 and n is a natural number.
In the computing method of the computing device provided by the embodiments of the present application, the target computing functions are executed in a serial manner, and serial execution of the target computing functions can be performed according to the requirements of the target instruction.
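A minimal sketch of this serial embodiment, assuming the target computing functions can be stood in for by ordinary Python callables; the example functions and values are hypothetical.

```python
from typing import Callable, Sequence


def run_serial(target_functions: Sequence[Callable], target_data):
    """Execute functions 1..N in order; each consumes the previous intermediate result."""
    intermediate = target_data
    for target_function in target_functions:  # n-th function, then (n+1)-th, ...
        intermediate = target_function(intermediate)
    return intermediate                        # final result data


# Example: three chained functions applied to 10 give ((10 + 1) * 2) - 3 = 19.
assert run_serial([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3], 10) == 19
```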
In some implementations, the instruction sequence of the target instruction records a 1st target computing function to an N-th target computing function, the result data includes final result data and N-1 pieces of intermediate result data, N is greater than or equal to 1, and N is a natural number;
Step S400 includes:
the instantaneously reconfigurable computing array executes, based on the target data and in the order of the target instruction, the q-th target computing function and the j-th target computing function synchronously, obtaining q-th intermediate result data and j-th intermediate result data respectively, where 1≤q<N, 1≤j<N, q and j are both natural numbers, and j≠q;
the instantaneously reconfigurable computing array executes, based on the q-th intermediate result data and the j-th intermediate result data and in the order of the target instruction, the v-th target computing function to obtain v-th intermediate result data, where 1<v<N, v is a natural number, v≠q and v≠j.
In the computing method of the computing device provided by the embodiments of the present application, the target computing functions are executed in a partially parallel manner, and partial parallel execution of the target computing functions can be performed according to the requirements of the target instruction.
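A minimal sketch of this partially parallel embodiment, again with hypothetical callables standing in for the target computing functions; thread-based concurrency is used only to illustrate the synchronous execution of the q-th and j-th functions, not the hardware mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_parallel_then_merge(func_q: Callable, func_j: Callable,
                            func_v: Callable, target_data):
    """Run the q-th and j-th functions concurrently, then feed both results to the v-th."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_q = pool.submit(func_q, target_data)
        future_j = pool.submit(func_j, target_data)
        result_q = future_q.result()   # q-th intermediate result data
        result_j = future_j.result()   # j-th intermediate result data
    return func_v(result_q, result_j)  # v-th function combines both


# Example: (10 + 1) and (10 * 2) computed in parallel, then summed: 11 + 20 = 31.
assert run_parallel_then_merge(lambda x: x + 1, lambda x: x * 2,
                               lambda a, b: a + b, 10) == 31
```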
In some implementations, the instantaneous reconfiguration array includes a multiplexer, a first configuration memory and a second configuration memory.
Step S200 may include:
while the instantaneously reconfigurable computing array executes, based on the target data, the target computing function configured in the first configuration memory, the second configuration memory obtains, through the dynamically reconfigurable storage array, the target computing function configuration file corresponding to a target computing function recorded in the instruction sequence of the target instruction.
In the computing method of the computing device provided by the embodiments of the present application, the target computing functions are configured alternately by at least two configuration storage modules. The multiplexer selects and connects the configuration storage module configured with the current target computing function recorded in the instruction sequence of the target instruction; the instantaneously reconfigurable computing array executes the target computing function configured in the configuration storage module selected by the multiplexer, while the configuration storage module that is not selected can be configured with the next target computing function at the same time. No waiting for the function configuration of the instantaneous reconfiguration array is required between two adjacent target computing functions, and the execution of two adjacent target computing functions is continuous, which further saves the execution time of the target computing functions of the target instruction and improves their execution efficiency, thereby further improving the computing efficiency of the computing device.
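The sketch below models this double-buffered (ping-pong) configuration in software under simplified assumptions: the two list slots stand in for the first and second configuration memories, the `selected` index stands in for the multiplexer, and the prefetch that would overlap with execution in hardware is modelled sequentially here; all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence


def run_with_double_buffering(function_sequence: Sequence[str],
                              config_store: Dict[str, Callable], target_data):
    """Two configuration memories alternate: one drives execution while the
    other prefetches the next configuration file."""
    buffers: List[Optional[Callable]] = [None, None]  # first / second configuration memory
    selected = 0                                      # multiplexer selection

    # The only explicit wait: load the first configuration file.
    buffers[selected] = config_store[function_sequence[0]]

    result = target_data
    for step in range(len(function_sequence)):
        # The idle configuration memory prefetches the next configuration file
        # (in hardware this overlaps with execution).
        if step + 1 < len(function_sequence):
            buffers[1 - selected] = config_store[function_sequence[step + 1]]

        result = buffers[selected](result)  # execute the current target computing function
        selected = 1 - selected             # the multiplexer switches to the other memory
    return result
```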
Although preferred embodiments of this specification have been described, those skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of this specification.
Obviously, those skilled in the art may make various changes and variations to this specification without departing from its spirit and scope. If these modifications and variations of this specification fall within the scope of the claims of this specification and their technical equivalents, this specification is also intended to encompass them.

Claims (20)

  1. A computing device, characterized by comprising:
    a data storage chip component, comprising at least one layer of data storage chip, the data storage chip comprising a plurality of data storage arrays, the data storage arrays being configured to store target data and a target instruction;
    a dynamically reconfigurable storage chip component, comprising at least one layer of dynamically reconfigurable storage chip, the dynamically reconfigurable storage chip comprising a plurality of dynamically reconfigurable storage arrays, the dynamically reconfigurable storage arrays being configured to store computing function configuration files; and
    a reconfigurable computing chip component, comprising at least one layer of instantaneously reconfigurable computing chip and at least one layer of instantaneous reconfiguration chip, the instantaneously reconfigurable computing chip comprising a plurality of instantaneously reconfigurable computing arrays, the instantaneous reconfiguration chip comprising a plurality of instantaneous reconfiguration arrays, wherein the instantaneous reconfiguration array is configured to obtain at least one target computing function configuration file through the dynamically reconfigurable storage array according to an instruction sequence of the target instruction and to complete function configuration according to the obtained target computing function configuration file, the instantaneously reconfigurable computing array is configured to execute, based on the target data, at least one target computing function recorded in the instruction sequence of the target instruction, and the target computing function configuration file is the computing function configuration file, stored in the dynamically reconfigurable storage array, that corresponds to the target computing function.
  2. The computing device according to claim 1, wherein the instantaneous reconfiguration array is configured to obtain all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction and to complete function configuration.
  3. The computing device according to claim 2, wherein the instantaneous reconfiguration array comprises a multiplexer and at least two configuration storage modules, the configuration storage modules being configured to obtain all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction and to complete function configuration; and
    the multiplexer is configured to select and connect, based on the order recorded in the instruction sequence of the target instruction, the configuration storage module configured with the corresponding target computing function, so that the instantaneously reconfigurable computing array executes the target computing function configured in that configuration storage module.
  4. The computing device according to claim 1, wherein the instantaneous reconfiguration array comprises a multiplexer, a first configuration memory and a second configuration memory;
    the multiplexer is configured to select and connect, based on the order recorded in the instruction sequence of the target instruction, the first configuration memory configured with the current target computing function, so that the instantaneously reconfigurable computing array executes the current target computing function configured in the first configuration memory; and
    the second configuration memory is configured to, while the instantaneously reconfigurable computing array executes the current target computing function configured in the first configuration memory, obtain the next target computing function configuration file through the dynamically reconfigurable storage array according to the target computing functions recorded in the instruction sequence of the target instruction and complete function configuration.
  5. The computing device according to claim 1, wherein the data storage arrays correspond one-to-one to the instantaneously reconfigurable computing arrays; and/or
    the instantaneously reconfigurable computing arrays correspond one-to-one to the instantaneous reconfiguration arrays.
  6. The computing device according to claim 5, wherein the instantaneously reconfigurable computing array that executes all the target computing functions recorded in the instruction sequence of the target instruction is one and the same instantaneously reconfigurable computing array.
  7. The computing device according to claim 1, wherein the data storage array storing the target data is further configured to store result data, the result data is obtained by the instantaneously reconfigurable computing array executing the target computing functions based on the target data, and the result data comprises intermediate result data and final result data, wherein the target data on which the instantaneously reconfigurable computing array executes the current target computing function is the intermediate result data obtained by executing the previous target computing function, and the final result data is obtained by the instantaneously reconfigurable computing array executing the last target computing function.
  8. The computing device according to claim 1, wherein the instantaneous reconfiguration chip further comprises an instantaneous reconfiguration control logic module, the instantaneous reconfiguration control logic module being configured to obtain, from the dynamically reconfigurable storage array according to the instruction sequence of the target instruction, the target computing function configuration file corresponding to the target computing function.
  9. The computing device according to claim 1, wherein two adjacent chip layers are stacked and connected through a heterogeneous integration connection component, the heterogeneous integration connection component being configured to connect chips fabricated by the same or different fabrication processes.
  10. The computing device according to claim 1, wherein the reconfigurable computing chip component is disposed between the data storage chip component and the dynamically reconfigurable storage chip component; or
    the data storage chip component is disposed between the reconfigurable computing chip component and the dynamically reconfigurable storage chip component; or
    the dynamically reconfigurable storage chip component is disposed between the reconfigurable computing chip component and the data storage chip component.
  11. The computing device according to claim 1, wherein any two or more of the data storage chip, the instantaneously reconfigurable computing chip, the instantaneous reconfiguration chip and the dynamically reconfigurable storage chip are disposed on the same chip layer.
  12. A computing system, characterized by comprising:
    a computing device and a host system, the computing device comprising an external memory access interface;
    the host system being connected to the external memory access interface and delivering a target instruction and target data to the computing device through the external memory access interface;
    wherein the computing device comprises: a data storage chip component, comprising at least one layer of data storage chip, the data storage chip comprising a plurality of data storage arrays, the data storage arrays being configured to store the target data and the target instruction;
    a dynamically reconfigurable storage chip component, comprising at least one layer of dynamically reconfigurable storage chip, the dynamically reconfigurable storage chip comprising a plurality of dynamically reconfigurable storage arrays, the dynamically reconfigurable storage arrays being configured to store computing function configuration files; and
    a reconfigurable computing chip component, comprising at least one layer of instantaneously reconfigurable computing chip and at least one layer of instantaneous reconfiguration chip, the instantaneously reconfigurable computing chip comprising a plurality of instantaneously reconfigurable computing arrays, the instantaneous reconfiguration chip comprising a plurality of instantaneous reconfiguration arrays, wherein the instantaneous reconfiguration array is configured to obtain at least one target computing function configuration file through the dynamically reconfigurable storage array according to an instruction sequence of the target instruction and to complete function configuration according to the obtained target computing function configuration file, the instantaneously reconfigurable computing array is configured to execute, based on the target data, at least one target computing function recorded in the instruction sequence of the target instruction, and the target computing function configuration file is the computing function configuration file, stored in the dynamically reconfigurable storage array, that corresponds to the target computing function.
  13. The computing system according to claim 12, wherein the instantaneous reconfiguration array is configured to obtain all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction and to complete function configuration;
    the instantaneous reconfiguration array comprises a multiplexer and at least two configuration storage modules, the configuration storage modules being configured to obtain all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction and to complete function configuration; and
    the multiplexer is configured to select and connect, based on the order recorded in the instruction sequence of the target instruction, the configuration storage module configured with the corresponding target computing function, so that the instantaneously reconfigurable computing array executes the target computing function configured in that configuration storage module.
  14. The computing system according to claim 12, wherein the instantaneous reconfiguration array comprises a multiplexer, a first configuration memory and a second configuration memory;
    the multiplexer is configured to select and connect, based on the order recorded in the instruction sequence of the target instruction, the first configuration memory configured with the current target computing function, so that the instantaneously reconfigurable computing array executes the current target computing function configured in the first configuration memory; and
    the second configuration memory is configured to, while the instantaneously reconfigurable computing array executes the current target computing function configured in the first configuration memory, obtain the next target computing function configuration file through the dynamically reconfigurable storage array according to the target computing functions recorded in the instruction sequence of the target instruction and complete function configuration.
  15. The computing system according to claim 12, wherein the data storage array storing the target data is further configured to store result data, the result data is obtained by the instantaneously reconfigurable computing array executing the target computing functions based on the target data, and the result data comprises intermediate result data and final result data, wherein the target data on which the instantaneously reconfigurable computing array executes the current target computing function is the intermediate result data obtained by executing the previous target computing function, and the final result data is obtained by the instantaneously reconfigurable computing array executing the last target computing function.
  16. A computing method of a computing device, characterized in that it is applied to a computing device, the computing device comprising: a data storage chip component, comprising at least one layer of data storage chip, the data storage chip comprising a plurality of data storage arrays, the data storage arrays being configured to store target data and a target instruction; a dynamically reconfigurable storage chip component, comprising at least one layer of dynamically reconfigurable storage chip, the dynamically reconfigurable storage chip comprising a plurality of dynamically reconfigurable storage arrays, the dynamically reconfigurable storage arrays being configured to store computing function configuration files; and a reconfigurable computing chip component, comprising at least one layer of instantaneously reconfigurable computing chip and at least one layer of instantaneous reconfiguration chip, the instantaneously reconfigurable computing chip comprising a plurality of instantaneously reconfigurable computing arrays, the instantaneous reconfiguration chip comprising a plurality of instantaneous reconfiguration arrays, the instantaneous reconfiguration array being configured to obtain at least one target computing function configuration file through the dynamically reconfigurable storage array according to an instruction sequence of the target instruction and to complete function configuration according to the obtained target computing function configuration file, the instantaneously reconfigurable computing array being configured to execute, based on the target data, at least one target computing function recorded in the instruction sequence of the target instruction, wherein the target computing function configuration file is the computing function configuration file, stored in the dynamically reconfigurable storage array, that corresponds to the target computing function; the method comprising:
    storing, by a data storage array of the data storage chip component according to the target instruction, the target data and the target instruction;
    obtaining, by an instantaneous reconfiguration array of the reconfigurable computing chip component through a dynamically reconfigurable storage array of the dynamically reconfigurable storage chip component, at least one target computing function configuration file corresponding to at least one target computing function recorded in the instruction sequence of the target instruction;
    configuring, by the instantaneous reconfiguration array, the obtained at least one target computing function configuration file; and
    executing, by an instantaneously reconfigurable computing array based on the target data and in the order of the target instruction, the target computing function to obtain corresponding result data.
  17. The computing method of the computing device according to claim 16, wherein the step of obtaining, by the instantaneous reconfiguration array of the reconfigurable computing chip component through the dynamically reconfigurable storage array of the dynamically reconfigurable storage chip component, at least one target computing function configuration file corresponding to at least one target computing function recorded in the instruction sequence of the target instruction comprises:
    obtaining, by the instantaneous reconfiguration array of the reconfigurable computing chip component through the dynamically reconfigurable storage array of the dynamically reconfigurable storage chip component, all the target computing function configuration files corresponding to all the target computing functions recorded in the instruction sequence of the target instruction; and
    the step of configuring, by the instantaneous reconfiguration array, the obtained at least one target computing function configuration file comprises:
    configuring, by the instantaneous reconfiguration array, all the obtained target computing function configuration files.
  18. The computing method of the computing device according to claim 17, wherein the instruction sequence of the target instruction records a 1st target computing function to an N-th target computing function, the result data comprises final result data and N-1 pieces of intermediate result data, N is greater than or equal to 1, and N is a natural number;
    the step of executing, by the instantaneously reconfigurable computing array based on the target data and in the order of the target instruction, the target computing function to obtain the corresponding result data comprises:
    executing, by the instantaneously reconfigurable computing array based on the target data and in the order of the target instruction, the n-th target computing function to obtain n-th intermediate result data; and
    executing, by the instantaneously reconfigurable computing array based on the n-th intermediate result data and in the order of the target instruction, the (n+1)-th target computing function to obtain (n+1)-th intermediate result data, where 0<n<N-1 and n is a natural number.
  19. The computing method of the computing device according to claim 17, wherein the instruction sequence of the target instruction records a 1st target computing function to an N-th target computing function, the result data comprises final result data and N-1 pieces of intermediate result data, N is greater than or equal to 1, and N is a natural number;
    the step of executing, by the instantaneously reconfigurable computing array based on the target data and in the order of the target instruction, the target computing function to obtain the corresponding result data comprises:
    executing, by the instantaneously reconfigurable computing array based on the target data and in the order of the target instruction, the q-th target computing function and the j-th target computing function synchronously to obtain q-th intermediate result data and j-th intermediate result data respectively, where 1≤q<N, 1≤j<N, q and j are both natural numbers, and j≠q; and
    executing, by the instantaneously reconfigurable computing array based on the q-th intermediate result data and the j-th intermediate result data and in the order of the target instruction, the v-th target computing function to obtain v-th intermediate result data, where 1<v<N, v is a natural number, v≠q and v≠j.
  20. The computing method of the computing device according to claim 16, wherein the instantaneous reconfiguration array comprises a multiplexer, a first configuration memory and a second configuration memory; and
    the step of obtaining, by the instantaneous reconfiguration array of the reconfigurable computing chip component through the dynamically reconfigurable storage array of the dynamically reconfigurable storage chip component, at least one target computing function configuration file corresponding to at least one target computing function recorded in the instruction sequence of the target instruction comprises:
    while the instantaneously reconfigurable computing array executes, based on the target data, the target computing function configured in the first configuration memory, obtaining, by the second configuration memory through the dynamically reconfigurable storage array, the target computing function configuration file corresponding to the target computing function recorded in the instruction sequence of the target instruction.
PCT/CN2022/113709 2021-09-03 2022-08-19 Computing device, computing system and computing method WO2023030054A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111033167.4A CN113656345B (zh) 2021-09-03 2021-09-03 Computing device, computing system and computing method
CN202111033167.4 2021-09-03

Publications (1)

Publication Number Publication Date
WO2023030054A1 true WO2023030054A1 (zh) 2023-03-09

Family

ID=78482822

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113709 WO2023030054A1 (zh) 2021-09-03 2022-08-19 一种计算器件、计算***及计算方法

Country Status (2)

Country Link
CN (1) CN113656345B (zh)
WO (1) WO2023030054A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656345B (zh) * 2021-09-03 2024-04-12 西安紫光国芯半导体有限公司 一种计算器件、计算***及计算方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6047115A (en) * 1997-05-29 2000-04-04 Xilinx, Inc. Method for configuring FPGA memory planes for virtual hardware computation
US20130138894A1 (en) * 2011-11-30 2013-05-30 Gabriel H. Loh Hardware filter for tracking block presence in large caches
CN109033008A (zh) * 2018-07-24 2018-12-18 山东大学 Dynamically reconfigurable hash computing architecture and method thereof, and key-value storage system
CN112463719A (zh) * 2020-12-04 2021-03-09 上海交通大学 In-memory computing method implemented on the basis of a coarse-grained reconfigurable array
CN113656345A (zh) * 2021-09-03 2021-11-16 西安紫光国芯半导体有限公司 Computing device, computing system and computing method

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7843215B2 (en) * 2007-03-09 2010-11-30 Quadric, Inc. Reconfigurable array to compute digital algorithms
CN101788927B (zh) 2010-01-20 2012-08-01 哈尔滨工业大学 Method for dynamic allocation of internal resources by an FPGA-based adaptive spaceborne computer
CN103942181B (zh) 2014-03-31 2017-06-06 清华大学 Method and apparatus for generating configuration information of a dynamically reconfigurable processor
CN104360982B (zh) 2014-11-21 2017-11-10 浪潮(北京)电子信息产业有限公司 Method and system for implementing a host-system directory structure based on reconfigurable chip technology
CN104750660A (zh) 2015-04-08 2015-07-01 华侨大学 Embedded reconfigurable processor with multiple working modes
US20180081834A1 (en) * 2016-09-16 2018-03-22 Futurewei Technologies, Inc. Apparatus and method for configuring hardware to operate in multiple modes during runtime
CN106953811B (zh) 2017-03-14 2020-05-26 东华大学 Behavior reconstruction method for a large-scale network service system
CN111433758B (zh) 2018-11-21 2024-04-02 吴国盛 Programmable operation and control chip, design method and device thereof
CN111488114B (zh) 2019-01-28 2021-12-21 北京灵汐科技有限公司 Reconfigurable processor architecture and computing device
CN111611197B (zh) 2019-02-26 2021-10-08 北京知存科技有限公司 Operation control method and apparatus for a software-definable computing-in-memory chip
US10923450B2 (en) * 2019-06-11 2021-02-16 Intel Corporation Memory arrays with bonded and shared logic circuitry
CN112214448B (zh) 2020-10-10 2024-04-09 声龙(新加坡)私人有限公司 Data dynamic reconfiguration circuit and method for a heterogeneously integrated proof-of-work computing chip
CN112328517B (zh) 2020-11-10 2024-04-02 西安紫光国芯半导体有限公司 Memory data communication apparatus and method based on a three-dimensional chip, and related device

Also Published As

Publication number Publication date
CN113656345B (zh) 2024-04-12
CN113656345A (zh) 2021-11-16

Similar Documents

Publication Publication Date Title
US10972103B2 (en) Multiplier-accumulator circuitry, and processing pipeline including same
US9577644B2 (en) Reconfigurable logic architecture
KR102381158B1 (ko) Stand-alone interface for integration of stacked silicon interconnect (SSI) technology
US20180358313A1 (en) High bandwidth memory (hbm) bandwidth aggregation switch
US11288076B2 (en) IC including logic tile, having reconfigurable MAC pipeline, and reconfigurable memory
KR20110091905A (ko) Parallel-plane memory and processor coupling in a 3-D microstructure system
WO2023030054A1 (zh) Computing device, computing system and computing method
WO2023030051A1 (zh) Stacked chip
US20230051480A1 (en) Signal routing between memory die and logic die for mode based operations
CN113515240A (zh) Chip computing device and computing system
CN113656346B (zh) Three-dimensional chip and computing system
CN113722268B (zh) Stacked chip integrating storage and computation
CN216118778U (zh) Stacked chip
CN113626373A (zh) Integrated chip
CN113793632B (zh) Non-volatile programmable chip
WO2019114070A1 (zh) FPGA chip with a distributed multi-functional layer structure
WO2021241048A1 (ja) AI chip
CN113626372B (zh) Integrated chip integrating storage and computation
CN216118777U (zh) Integrated chip
CN113793844A (zh) Three-dimensional integrated chip
CN113705142A (zh) Three-dimensional chip, computing system and computing method
CN215769709U (zh) Chip computing device and computing system
US20220283779A1 (en) MAC Processing Pipelines, Circuitry to Configure Same, and Methods of Operating Same
TWI844011B (zh) Memory circuit, neural network circuit and method of manufacturing an integrated circuit device
CN113745197A (zh) Three-dimensional heterogeneously integrated programmable array chip structure and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22863193

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE