WO2023016383A1 - Procédé de mémoire cache et produits associés - Google Patents

Procédé de mémoire cache et produits associés (Cache memory method and related products)

Info

Publication number
WO2023016383A1
WO2023016383A1 · PCT/CN2022/110740 · CN2022110740W
Authority
WO
WIPO (PCT)
Prior art keywords
latch
data
area
cluster
chip
Prior art date
Application number
PCT/CN2022/110740
Other languages
English (en)
Chinese (zh)
Inventor
葛祥轩
张尧
刘少礼
梁军
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110926707.5A external-priority patent/CN115705300A/zh
Priority claimed from CN202110926703.7A external-priority patent/CN115878553A/zh
Application filed by 寒武纪(西安)集成电路有限公司 filed Critical 寒武纪(西安)集成电路有限公司
Publication of WO2023016383A1 publication Critical patent/WO2023016383A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems

Definitions

  • the present disclosure generally relates to the field of chip technology. More specifically, the present disclosure relates to a method for a cache memory, a cache memory, a system-on-chip including the cache memory, a board including the system-on-chip, and a computing device including the board.
  • the operational performance of a computing system is largely determined by the average memory access latency.
  • System performance can be significantly improved by effectively reducing the number of memory accesses by increasing the hit rate of the cache memory (referred to as "cache").
  • processors typically employ a cache mechanism, and use the cache to accommodate the mismatch in speed and performance between the processor and slow main memory.
  • the current cache implements a multi-level cache mechanism, such as three-level cache (L1, L2, and L3), and the cache closest to the main memory is called the last level cache (“Last Level Cache", LLC).
  • L1, L2, and L3 three-level cache
  • LLC last level cache
  • how to expand the application of LLC for different scenarios has also become a problem that needs to be solved.
  • the present disclosure provides a residency scheme for a cache memory.
  • a specific area in the cache memory can be configured as a locked area, and multiple-used data can be stored in the locked area, thereby improving the cache hit rate and improving the overall performance of the system.
  • the present disclosure provides a solution for a cache memory in the following aspects.
  • the present disclosure provides a method for a cache memory, comprising: configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each of the latch modes corresponds to a latch-related operation performed on data in the latch area; receiving a latch-related request for performing a latch-related operation on the data in the latch area; and, according to the latch-related request, performing the latch-related operation on the data in the latch area in the corresponding latch mode.
  • the present disclosure provides a cache memory, including: a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each of the latch modes corresponds to a latch-related operation performed on data in the latch area; and a latch execution module configured to: receive a latch-related request for performing a latch-related operation on the data in the latch area; and perform the latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the present disclosure provides a system-on-chip comprising a cache memory as described above and in various embodiments below; and a processor configured to generate the latch-related request; wherein the latch execution module of the cache memory is configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the present disclosure provides a board, including the system-on-chip as described above and in the following embodiments.
  • the present disclosure provides a computing device, including the board as described above and described in various embodiments below.
  • the latch area can be used to perform latch and unlock operations on data used multiple times, thereby significantly improving the cache hit rate. Further, since the latch area of the present disclosure supports multiple latch modes, and these latch modes can be selected and used according to the configuration, the application scenarios of the latch area are expanded. When used in the scenario of the producer core and the consumer core, the latch area of the present disclosure can serve as a medium for data transfer, thereby improving the accessibility and utilization of data. In addition, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the computing system.
  • the present disclosure proposes the use scenario of expanding the cache memory.
  • the present disclosure provides solutions for a system on chip in the following aspects.
  • the present disclosure provides a method for a system-on-chip comprising at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster comprising a plurality of processor cores for performing the operations. The method includes: mapping a designated storage space of the off-chip memory to a latch area of the cache memory, so as to use the latch area as a cluster storage area for inter-cluster data communication; and performing operations of the clusters using the cluster storage area.
  • the present disclosure provides a system-on-chip, comprising: a plurality of clusters, each of which includes a plurality of processor cores for at least performing arithmetic operations; and a cache memory interconnected with the plurality of clusters and configured to: use a latch area as a cluster storage area for inter-cluster data communication, wherein the latch area forms a mapping relationship with a designated storage space of the off-chip memory; and use the cluster storage area to perform the operations of the clusters.
  • the present disclosure provides a computing device comprising a system-on-chip as described above and in various embodiments below.
  • the present disclosure provides a board, including the computing device as described above and described in various embodiments below.
  • the present disclosure provides a computing device, including the board as described above and in the following embodiments.
  • the latch area of the cache memory can be used to realize efficient communication between the clusters of the SoC. Therefore, the data that needs to be transferred through the off-chip memory can be directly transferred through the latch area, thereby speeding up data access and significantly improving the cache hit rate. Further, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the SoC. In addition, the division of the latch area simplifies the management of the cache memory and expands the usage scenarios of the cache memory. With the help of the latch area, multiple clusters of the SoC can implement multiple flexible communication mechanisms, thereby also improving the operational performance of the cluster.
  • FIG. 1 is a structural diagram showing a board according to an embodiment of the present disclosure
  • FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart illustrating a method for a cache memory according to an embodiment of the present disclosure
  • Figure 7 is a simplified block diagram illustrating a cache memory according to an embodiment of the disclosure.
  • FIG. 8 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure.
  • FIG. 9 is a detailed block diagram illustrating a system-on-chip according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram showing a hash operation in window mode according to an embodiment of the present disclosure.
  • Figure 12 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure.
  • FIG. 13 is a flowchart illustrating a method for a system on a chip according to an embodiment of the present disclosure.
  • FIG. 14 is a block diagram illustrating an operation of a system on chip according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • the phrase “if determined” or “if [the described condition or event] is detected” may be construed, depending on the context, to mean “once determined” or “in response to the determination” or “once detected [the described condition or event] ]” or “in response to detection of [described condition or event]”.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. It can be understood that the structure and composition shown in FIG. 1 are only an example, and are not intended to limit the solution of the present disclosure in any respect.
  • the board 10 includes a chip 101, which may be a system-on-chip (System on Chip, SoC), that is, a system-on-chip described in the context of the present disclosure. In one implementation scenario, it may be integrated with one or more combined processing devices.
  • the aforementioned combined processing device can be an artificial intelligence computing unit, which is used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining; in particular, deep learning technology is widely applied in the field of cloud intelligence.
  • a notable feature of cloud intelligent applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligent applications and has huge off-chip storage, huge on-chip storage, and powerful computing power.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 may also include a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected and data transmitted with the control device 106 and the chip 101 through the bus.
  • the control device 106 in the board 10 may be configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a structural diagram showing a combination processing device in the chip 101 according to the above-described embodiment.
  • the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a dynamic random access memory (DRAM) 204.
  • DRAM Dynamic Random Access Memory
  • the computing device 201 can be configured to perform user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it can be used to perform deep learning or machine learning calculations, and can also interact with the processing device 203 through the interface device 202 to jointly complete operations specified by the user.
  • the interface device 202 can be used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may be one or more types of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU) or other general and/or special purpose processors.
  • these processors include, but are not limited to, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considering the integration of the computing device 201 and the processing device 203 together, they are considered to form a heterogeneous multi-core structure.
  • the DRAM 204 is used to store data to be processed, and is a DDR memory, usually 16G or larger in size, for storing data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core.
  • the single-core computing device 301 is used to process input data such as computer vision, speech, natural language, data mining, etc.
  • the single-core computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, which includes an instruction fetch unit (Instruction Fetch Unit, IFU) 311 and an instruction decoding unit (Instruction Decode Unit, IDU) 312.
  • the instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transport related data, including a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (Direct Memory Access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons, and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • DMA 333 is connected to DRAM 204 through bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
  • FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 as a multi-core.
  • the multi-core computing device 41 adopts a hierarchical structure design, and the multi-core computing device 41 is a system on chip, which includes at least one cluster (cluster) according to the present disclosure, and each cluster includes multiple processor cores.
  • the multi-core computing device 41 is constituted at the level of SoC-cluster-processor core.
  • the multi-core computing device 41 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnection module 403 , a synchronization module 404 and multiple clusters 405 .
  • the peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to execute tasks.
  • the on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402 and multiple clusters 405 to transmit data and control signals among the modules.
  • the synchronization module 404 is a global synchronization barrier controller (Global Barrier Controller, GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • GBC Global Barrier Controller
  • the plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41 . Although 4 clusters are exemplarily shown in FIG. 4 , with the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405 . In an application scenario, the cluster 405 can be used to efficiently execute deep learning algorithms.
  • each cluster 405 may include a plurality of processor cores (IPU cores) 406 and a storage core (MEM core) 407, which may include, for example, the cache memory (e.g., the LLC).
  • IPU core processor core
  • MEM core storage core
  • the number of processor cores 406 is exemplarily shown in the figure as four, and the present disclosure does not limit the number of processor cores 406, and its internal architecture is shown in FIG. 5 .
  • Each processor core 406 is similar to the single-core computing device 301 in FIG. 3 , and may also include three modules: a control module 51 , an operation module 52 and a storage module 53 .
  • the functions and structures of the control module 51 , computing module 52 and storage module 53 are roughly the same as those of the control module 31 , computing module 32 and storage module 33 , and will not be repeated here.
  • the storage module 53 may include an input/output direct memory access module (Input/Output Direct Memory Access, IODMA) 533 and a moving direct memory access module (Move Direct Memory Access, MVDMA) 534.
  • IODMA 533 controls memory access of NRAM 531/WRAM 532 and DRAM 204 through broadcast bus 409;
  • MVDMA 534 is used to control memory access of NRAM 531/WRAM 532 and storage unit (SRAM) 408.
  • the storage core 407 is mainly used for storage and communication, that is, for storing shared data or intermediate results between the processor cores 406, and for executing communication between the cluster 405 and the DRAM 204, communication between clusters 405, communication between processor cores 406, and the like.
  • the storage core 407 may have a scalar operation capability to perform scalar operations.
  • the storage core 407 may include a static random access memory (Static Random-Access Memory, SRAM) 408, a broadcast bus 409, a cluster direct memory access module (Cluster Direct Memory Access, CDMA) 410 and a global direct memory access module (Global Direct Memory Access , GDMA) 411.
  • SRAM static random access memory
  • CDMA Cluster Direct Memory Access
  • GDMA global direct memory access module
  • the SRAM 408 can assume the role of a high-performance data transfer station.
  • the data multiplexed between different processor cores 406 in the same cluster 405 does not need to be obtained from the DRAM 204 through the processor cores 406 respectively, but is transferred between the processor cores 406 through the SRAM 408.
  • the storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to multiple processor cores 406, thereby improving the efficiency of inter-core communication and significantly reducing on-chip and off-chip input/output access.
  • the broadcast bus 409, the CDMA 410 and the GDMA 411 are respectively used to perform communication between the processor cores 406, communication between the clusters 405, and data transmission between the clusters 405 and the DRAM 204. They will be described separately below.
  • the broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405 .
  • the broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • unicast refers to point-to-point (for example, single processor core to single processor core) data transmission; multicast is a communication method that transmits a piece of data from the SRAM 408 to several specific processor cores 406; and broadcast is the communication method that transmits a piece of data from the SRAM 408 to all processor cores 406, which is a special case of multicast.
  • the CDMA 410 is used to control the memory access of the SRAM 408 between different clusters 405 in the same computing device 201.
  • the GDMA 411 cooperates with the external memory controller 401 to control memory access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 to the SRAM 408.
  • the communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be realized in two ways.
  • the first way is to communicate directly between the DRAM 204 and the NRAM 431 or WRAM 432 through the IODMA 433; the second way is to first transmit data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then transfer the data between the SRAM 408 and the NRAM 431 or WRAM 432 through the MVDMA 534.
  • although the second way may require more components to participate and the data flow is longer, in some embodiments the bandwidth of the second way is much larger than that of the first way, so using the second way to implement communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient. It can be understood that the data transmission methods described here are only exemplary, and those skilled in the art can flexibly select and apply various data transmission methods according to the specific arrangement of the hardware and the teaching of the present disclosure.
  • the function of GDMA 411 and the function of IODMA 533 can be integrated in the same component.
  • although the present disclosure regards the GDMA 411 and the IODMA 533 as different components for convenience of description, for those skilled in the art, as long as the functions achieved and the technical effects attained are similar to those of the present disclosure, such implementations belong to the protection scope of the present disclosure.
  • the function of GDMA 411, the function of IODMA 533, the function of CDMA 410, and the function of MVDMA 534 can also be realized by the same part.
  • the hardware architecture and its internal structure of the present disclosure have been described in detail above with reference to FIGS. 1-5 . It is to be understood that the foregoing description is illustrative only and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also make changes to the board card and its internal structure of the present disclosure, and these changes still fall within the protection scope of the present disclosure.
  • the corresponding hardware architecture may not include the CDMA 410 used to control the access to the SRAM 408 among different clusters 405 in the same computing device 201 .
  • the underlying approach of the present disclosure involves improving and optimizing the cache, eg, disposed between SRAM 408 and DRAM 204, to enable efficient on-demand latching of data and communication between different clusters through the cache.
  • the following scheme of the present disclosure proposes to configure a specific storage space in the cache memory as a latch area for data latch operations, especially for data that will be used frequently.
  • the aforementioned frequently used data may be data to be reused between at least one task having a data dependency. It will be appreciated that data need not be locked in the cache memory when the data need only be used once.
  • the following solution of the present disclosure also proposes to configure the cache memory to support multiple latch modes, so that when a latch-related request is received, the cache memory operates in the latch mode corresponding to the aforementioned latch-related request.
  • various latch modes of the present disclosure may have a specific priority order to satisfy different latch-related operations.
  • the solution of the present disclosure also proposes a variety of different configuration methods, so that the cache memory can be used more flexibly and used to realize inter-cluster communication.
  • FIG. 6 is a flowchart illustrating a method 600 for a cache memory according to an embodiment of the disclosure.
  • the method 600 includes, at step S602 , configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes.
  • the aforementioned multiple latch modes may include, but are not limited to, an instruction mode for performing latch-related operations based on hardware instructions, a window mode for performing latch-related operations based on window attributes, a stream mode for performing latch-related operations based on data streams, and/or a page mode for performing latch-related operations based on cache pages.
  • the aforementioned data streams may be instruction streams or data streams of different types.
  • the data stream may be the neuron data stream, weight data stream, output result data stream, etc. of the neural network model.
  • the data targeted by the latch-related operation is data that will be used multiple times by the processor of the system-on-chip, and has relatively higher priority than the data that is not subjected to the latch operation.
  • the cache hit rate can be significantly improved, thereby improving the overall performance of the system.
  • by keeping the reused data in the latch area of the LLC, read and write operations between the on-chip system and the off-chip memory (such as DDR or DRAM) can be reduced, thereby improving memory access efficiency.
  • the above-mentioned multiple latch modes can be set to have different priorities according to user preferences or system preferences.
  • in one implementation, the order of priority may be instruction mode -> window mode -> stream mode -> page mode; in another implementation, the order of priority may be instruction mode -> page mode -> stream mode -> window mode.
  • in this way, the latch area in the cache memory can be used in more ways, increasing the flexibility of using the latch area to cope with different application scenarios and system requirements. Further, the latch modes may be traversed sequentially according to the above-mentioned priority order, and when a higher-priority latch mode is disabled, a lower-priority latch mode may be adopted.
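  • As an illustration of this priority traversal, the following C++ sketch (with assumed names; the disclosure does not prescribe an implementation) picks the highest-priority latch mode that is currently enabled and otherwise falls back to normal caching.

```cpp
#include <optional>
#include <utility>
#include <vector>

// Latch modes named as in the disclosure; the selection routine itself is an
// illustrative assumption, not the patented implementation.
enum class LatchMode { Instruction, Window, Stream, Page };

// 'priority' lists modes from highest to lowest priority together with an
// enabled flag, e.g. instruction -> window -> stream -> page.
std::optional<LatchMode> select_latch_mode(
        const std::vector<std::pair<LatchMode, bool>>& priority) {
    for (const auto& [mode, enabled] : priority)
        if (enabled)
            return mode;      // first enabled mode in priority order wins
    return std::nullopt;      // no mode enabled: operate as a normal cache
}
```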
  • a specific storage space may be configured as a latch area supporting a corresponding latch mode according to one configuration instruction among the received configuration instructions.
  • the configuration instruction may include one or more configuration items, so as to realize the configuration of the aforementioned latch area.
  • the plurality of configuration items may include configuration items for enabling a latch area, disabling a latch area, and/or a size of a latch area.
  • in addition, the corresponding latch strategy (such as the size of the latched data or the specific data to be latched) can be configured in the aforementioned instruction mode, window mode, stream mode or page mode, so as to latch different types of, or specific, instructions, data or data streams.
  • the scheme of the present disclosure can realize the flexible use of the cache memory, so that it can operate in one of the various latch modes of the present disclosure, or operate in the normal mode as required.
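  • A minimal sketch of such a configuration, assuming an 8-way LLC as in FIG. 7 and invented type and field names, might look as follows; the real configuration instruction format is not specified here.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Way attribute: either part of the latch area or an ordinary cache way.
enum class WayAttr { Normal, Latch };

// Hypothetical configuration items: enable/disable the latch area and its size in ways.
struct LatchAreaConfig {
    bool        enable     = false;
    std::size_t latch_ways = 0;
};

constexpr std::size_t kTotalWays = 8;   // e.g. way0-way7 as in FIG. 7

std::array<WayAttr, kTotalWays> apply_latch_config(const LatchAreaConfig& cfg) {
    // The latch area occupies only part of the cache, so at least one way stays normal.
    if (cfg.enable && cfg.latch_ways >= kTotalWays)
        throw std::invalid_argument("latch area must be smaller than the whole cache");
    std::array<WayAttr, kTotalWays> ways{};
    ways.fill(WayAttr::Normal);
    if (cfg.enable)
        for (std::size_t w = 0; w < cfg.latch_ways; ++w)
            ways[w] = WayAttr::Latch;   // e.g. way0-way5 latched, way6-way7 stay normal
    return ways;
}
```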
  • a latch-related request for performing latch-related operations on data in the latch area is received.
  • the latch-related request may be triggered by an operation intended to reside specific data in a latch region.
  • the latch-related request may also be triggered by an operation intended to remove or release specific data from the latch area.
  • the latch-related requests of the present disclosure may also have different expressions or contents. For example, for an instruction mode, a window mode, or a stream mode, the latch-related request may include a configuration item for indicating a behavior attribute of the cache memory, and the like.
  • the above-mentioned configuration item for indicating the behavior attribute of the cache memory includes at least one of the following multiple configuration attributes:
  • Transient attribute: do not cache in the LLC, that is, perform data read and write operations directly with the off-chip memory (such as DDR); for data that is accessed only once, not caching it in the LLC avoids occupying LLC resources;
  • Lock attribute: reside specific data in the latch area, and read and write the data from the hit cache line (cacheline). If the cache line belongs to the latch area, the attribute of the cache line is configured as the persisting attribute; if the cache line does not belong to the latch area, the attribute of the cache line remains unchanged, that is, the normal attribute described below is maintained. It should be clear that a cache line in the latch area has one of two attributes, namely the persisting attribute or the normal attribute. A cache line with the persisting attribute in the lock area can only be accessed and replaced by a latch-related request carrying the Lock attribute;
  • Unlock attribute: after reading and writing data from the hit cache line, release the storage space corresponding to the data in the latch area of the LLC, and set the attribute of the corresponding cache line in the latch area to the normal attribute described below;
  • Invalid attribute: invalidate the data directly after reading, so that it will not be replaced and written back to the off-chip memory;
  • Clean attribute: when performing a write operation, data can be written into the hit cache line, the contents stored in the entire cache memory can be written back to the off-chip memory, and the attributes of the cache line remain unchanged; during a read operation, data is read from the hit cache line, and when the hit cache line is dirty, it is written back to the off-chip memory;
  • Default attribute: the default item can be used to indicate that the configuration of the latch mode is ignored.
  • the solution of the present disclosure can execute corresponding latch-related operations in the instruction mode according to these attached attributes.
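  • The behaviour attributes listed above can be summarised by the following hedged C++ sketch; the enumeration mirrors the attribute names from the disclosure, while the request structure and field names are assumptions for illustration.

```cpp
#include <cstdint>

// Per-request behaviour attributes as described above.
enum class CacheBehavior {
    Transient,  // bypass the LLC; read/write the off-chip memory (e.g. DDR) directly
    Lock,       // reside the data in the latch area; hit lines become "persisting"
    Unlock,     // after access, release the data's space in the latch area
    Invalid,    // invalidate the line after reading so it is never written back
    Clean,      // write dirty contents back to off-chip memory; line attributes unchanged
    Default     // ignore the instruction-mode configuration; defer to other latch modes
};

// Hypothetical shape of a latch-related request carrying such an attribute.
struct LatchRelatedRequest {
    std::uint64_t address  = 0;                        // target address of the access
    CacheBehavior behavior = CacheBehavior::Default;
};
```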
  • the latch-related request may indicate that the data related to a specific page is to be latched in the latch area for subsequent multiple uses, or may indicate that, after multiple uses, the data related to the specific page is to be unlocked from the latch area to release more storage space for subsequent data latching. It can be understood that, through the release operation, the storage space of the latch area can be used flexibly, thereby improving the utilization efficiency of the latch area of the present disclosure.
  • a latch-related operation may be performed on data in the latch area in a corresponding latch mode.
  • the aforementioned latch-related operations may include a read operation and a write operation for the latch area.
  • the method 600 may also include latching data or a selected part of the data in a specified area of the latch area according to a latch-related request, so as to be used in subsequent multiple reads.
  • in some embodiments, the method 600 may further include, after the read operation is completed, releasing the data or the selected part of the data from the specified area of the latch area according to the latch-related request.
  • a predetermined proportion of data may be randomly selected from the data to form the aforementioned partial data to be latched in the latch area.
  • a predetermined proportion of data may be selected from the data by using a hash algorithm as the aforementioned partial data to be latched in the latch area.
  • the aforementioned hash algorithm may be used to select part of the data that can be locked in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 .
  • the solution of the present disclosure enables the cache memory to support multiple latch modes, thereby expanding the application scenarios of the cache memory and significantly improving the cache hit rate. Furthermore, due to the introduction of multiple latch modes, the use of the latch area is more flexible and adaptable, so as to meet different application scenarios and user requirements. In addition, due to the effective latching of data in the latch area, the sharing of data between a producer kernel and one or more consumer kernels is also promoted, improving data accessibility and utilization.
  • the producer kernel and the consumer kernel here can be understood as two dependent tasks, where the output of the producer kernel will be used as the input to the consumer kernel, so that the consumer kernel can use the input to complete the corresponding task.
  • the output of the producer kernel can be treated as data that needs to be used multiple times in the future, and such data can be temporarily stored in the latch area of the cache memory, so that the consumer kernel can directly obtain its input from the cache memory without accessing the off-chip memory, thereby reducing the memory interaction between the artificial intelligence processor and the off-chip memory and reducing the IO memory access overhead, which in turn can improve the processing efficiency and performance of the artificial intelligence processor.
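  • Purely as a software analogy of this producer/consumer exchange (not the hardware mechanism itself), the latch area can be modelled as a small store keyed by address, where the producer's Lock corresponds to inserting and the consumer's Unlock corresponds to erasing:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model: latched data keyed by its (virtual) address.
std::unordered_map<std::uint64_t, std::vector<float>> g_latch_area;

// Producer kernel: its output will be reused, so it is resided in the latch area.
void producer_kernel(std::uint64_t addr, const std::vector<float>& output) {
    g_latch_area[addr] = output;                 // Lock: data now lives in the LLC latch area
}

// Consumer kernel: reads its input from the latch area (a cache hit, no DDR access)
// and then releases the space for other data.
std::vector<float> consumer_kernel(std::uint64_t addr) {
    std::vector<float> input = g_latch_area.at(addr);
    g_latch_area.erase(addr);                    // Unlock: free the latch area storage
    return input;
}
```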
  • FIG. 7 is a simplified block diagram illustrating a cache memory 700 according to an embodiment of the disclosure. It can be understood that the cache memory 700 shown in FIG. 7 may be the cache memory described in conjunction with FIG. 6 , so the cache memory described in FIG. 6 is also applicable to the following description in relation to FIG. 7 .
  • the cache memory 700 of the present disclosure may include a configuration module 701 and a latch execution module 702. Further, the cache memory 700 also includes a storage space for performing cache operations; for example, as shown in the figure, the storage space is equally divided into 8 ways (way0-way7), wherein each way includes a number of cache lines (cachelines).
  • the above-mentioned configuration module can be used to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein the size of the specific storage space is smaller than the total storage size of the cache memory .
  • way0-way5 in FIG. 7 can be configured as a specific storage space that supports latching.
  • ways 6-7 in FIG. 7 can maintain the normal attributes of the cache memory, that is, be used as a general cache.
  • the latch mode can be instruction mode, window mode, stream mode and/or page mode.
  • the latch execution module may be configured to receive a latch-related request for performing latch-related operations on data in the latch area.
  • the latch execution module can perform latch-related operations on data in the latch area in a corresponding latch mode according to the latch-related request.
  • the latch-related operations here may include a write operation for the latch area (that is, writing data into the latch area) or releasing data in the latch area from the latch area. For example, when the consumer core has used up the data in the lock area and the data will no longer be used by other consumer cores, the space storing that data in the lock area can be released for locking other data.
  • FIG. 8 is a simplified block diagram illustrating a system-on-chip 800 according to an embodiment of the disclosure.
  • a system-on-chip 800 of the present disclosure may include a cache memory 700 and a processor (or processor core) 802 as shown in FIG. 7 .
  • the latch execution module of the cache memory may be configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the cache memory 700 it has been described above in conjunction with FIG. 6 and FIG. 7 , and will not be repeated here.
  • the processor 802 may be various types of processors, and may include one or more processor cores to generate latch-related requests.
  • the latch execution module of the cache memory is configured to perform latch-related operations on data in the latch area in a corresponding latch mode according to the generated latch-related request.
  • the processor can be configured to generate latch-related requests according to received hardware instructions.
  • the latch mode is the page mode
  • the processor may be configured to generate a latch-related request according to the cache page configuration.
  • the processor may be used to configure a lock window, and generate a latch-related request according to the lock window.
  • the processor 802 may also be an intelligent processor or intelligence processing unit (“Intelligence Processing Unit”, abbreviated as “IPU”) including multiple computing cores, which may be configured to execute computations in various artificial intelligence fields (such as neural network computations).
  • IPU Intelligent Processing Unit
  • FIG. 9 is a detailed block diagram illustrating a system on chip 900 according to an embodiment of the present disclosure. It can be understood that the system-on-chip 900 shown here may be a specific implementation of the system-on-chip shown in FIG. 8 , and therefore the content described with respect to FIG. 8 is also applicable to FIG. 9 . Further, for the purpose of example only, the operation of the system-on-chip 900 will be described in a window mode (or stream mode) among a plurality of latch modes.
  • a system on chip 900 may include a task scheduler (“Job Scheduler”) 902 including a scheduling unit 903 and a configurator 904 .
  • the configurator 904 may be configured to generate configuration instructions according to assigned configuration tasks (e.g., obtained from a task queue) and to send them to a configuration module (such as a CLR) in the cache memory (that is, the "LLC" 906).
  • the scheduling unit 903 can be used to schedule multiple tasks in the task scheduler (that is, the "kernels" to be executed on the artificial intelligence processor), so as to send them to the intelligent processor (IPU) 905 in the system-on-chip of the present disclosure.
  • the intelligent processor 905 here may include multiple processor cores, and the multiple processor cores may form a cluster as shown in FIG. 4 .
  • the scheduling unit may allocate tasks to appropriate processor cores according to the idleness (eg utilization) of the multiple processor cores.
  • the system-on-chip 900 also includes a system memory management unit ("System Memory Management Unit", abbreviated as "SMMU"), which is used to convert the virtual address of the accessed data into a physical address, so that access to the associated storage location can be achieved according to the physical address.
  • system memory management unit includes an address translation buffer TLB (Translation Lookaside Buffer, also called fast table).
  • TLB Translation Lookaside Buffer
  • a page table is maintained in the TLB, and the page table includes at least one page table entry, and each page table entry includes a page (page) and a page frame (Frame) corresponding to the page.
  • the system memory management unit can determine the page corresponding to the virtual address according to the received virtual address, and then can determine the physical address (PA) corresponding to the virtual address through the mapping relationship between the page and the page frame, so that access to the relevant storage location of the cache memory can be realized according to the physical address.
  • PA Physical Address
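  • The translation step can be pictured with the following sketch, which assumes a 4 KiB page size and a page-number to page-frame map purely for illustration; the actual SMMU/TLB organisation is not detailed in this publication.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

constexpr std::uint64_t kPageShift = 12;   // assumed 4 KiB pages, for illustration only

// Page table kept in the TLB: page number -> page frame number.
using PageTable = std::unordered_map<std::uint64_t, std::uint64_t>;

// Convert a virtual address into a physical address via the page/frame mapping.
std::optional<std::uint64_t> translate(const PageTable& tlb, std::uint64_t va) {
    const std::uint64_t page   = va >> kPageShift;
    const std::uint64_t offset = va & ((1ull << kPageShift) - 1);
    const auto it = tlb.find(page);
    if (it == tlb.end())
        return std::nullopt;                          // TLB miss (page-table walk not shown)
    return (it->second << kPageShift) | offset;       // PA = frame base + offset
}
```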
  • access to the cache memory can be implemented through the above-mentioned window mode or stream mode.
  • the intelligent processor can obtain the parameter table from the memory, configure, according to the parameter table, a lock window ("Lock window") associated with the data on which the latch-related operation is to be performed, and generate latch-related requests (e.g., IO access requests with lock/unlock attributes attached) according to the configured lock window.
  • the SMMU can perform latch-related operations on the LLC according to the IO access request. Specifically, the SMMU may send the aforementioned IO access request to the cache policy module 907 of the LLC 906 (which performs the same operation as the latch execution module 702 in FIG. 7 ) for execution.
  • the parameter table may include parameter items for configuring a lock window or a stream latch attribute in a stream mode.
  • these parameter items may include, but are not limited to, the lock/unlock window ("lock/unlock window"), per-stream lock/unlock ("per stream lock/unlock"), the lock ratio ("Lock Ratio"), the lock window flag ("lock window flag"), and other information.
  • the parameters in the parameter table may be user-defined.
  • the relevant parameters in the parameter table can be obtained during the running phase of the program, and the parameter table can be stored in the memory (such as DDR), so that the intelligent processor (such as the IPU 905 in the figure) can use it in the execution phase.
  • the above-mentioned lock window is used to represent the storage space that the software user wishes to lock, and the size of the lock window may be larger than the size of the lock area on the cache memory.
  • the above-mentioned lock window includes one or more of the following: the base address and the size of the window, wherein the base address of the window can be a virtual address ("Virtual Address", abbreviated "VA") configured by the upper-layer software, the base address of the window corresponds to the starting address of the data to be latched, and the size of the window may correspond to the size of the data to be latched.
  • specifically, the intelligent processor can determine the memory access address of the data in the task (the memory access address can be a virtual address) according to the task issued by the task scheduler, and compare the memory access address of the data in the task with the address range defined by the lock window. If the access address of the data in the task is within the address range of the lock window, the lock window is hit, and the lock window can be enabled ("Enabled") at this time. Otherwise, if the access address of the data in the task is outside the address range of the lock window, the lock window is not hit; at this time, the lock window can be ignored, which means that the data in the task will not be temporarily stored in the cache memory.
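  • The window hit test described above amounts to a simple range check; the sketch below uses invented field names and assumes the window is described by a virtual base address and a byte size, as configured from the parameter table.

```cpp
#include <cstdint>

// Hypothetical lock-window descriptor (base VA + size), configured from the parameter table.
struct LockWindow {
    std::uint64_t base_va = 0;   // starting address of the data to be latched
    std::uint64_t size    = 0;   // size of the data to be latched, in bytes
    bool          enabled = false;
};

// True when the access address falls inside the window, i.e. the request should
// carry the lock (or unlock) attribute; otherwise the window is ignored.
bool lock_window_hit(const LockWindow& w, std::uint64_t access_va) {
    return w.enabled && access_va >= w.base_va && access_va < w.base_va + w.size;
}
```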
  • a predetermined proportion of data may be selected from the data by using a hash algorithm as the aforementioned partial data and stored in the latch area.
  • the specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 .
  • the intelligent processor can send the lock-related request attached with the Lock attribute to the cache memory LLC through the SMMU.
  • the lock-related request attached with the Lock attribute may be used to indicate that specific data resides in the lock area, and the specific data may be part of data selected according to a hash algorithm.
  • the latching process and release process of the LLC will be described below in the window mode with reference to FIG. 9 .
  • Step 1 The task scheduler configures the LLC with the help of the configurator (e.g., via the cache policy module) to enable the lock region ("Lock enable"), disable the lock region ("Lock disable"), and set the size of the lock region, expressed as the number of ways ("Ways") shown in the figure, e.g., Way0-Way7.
  • Step 2 The task scheduler sends the task kernel to the IPU;
  • Step 3 The IPU obtains the lock window flag (“lock window flag”) from the parameter table, reads and configures the lock window.
  • the parameter table here can be configured by software and stored at a storage address of an off-chip dynamic random access memory ("Dynamic Random Access Memory", abbreviated as "DRAM"). Then, the task scheduler can transmit the address to the IPU and the IPU can read the parameter table according to the address, so as to complete the configuration of the locking window.
  • DRAM Dynamic Random Access Memory
  • Step 4 The IPU generates a lock-related request through the memory management unit SMMU, and when sending the request to the cache policy module of the LLC, the request can be attached with a lock attribute according to the lock window information.
  • Step 5 After receiving the lock-related request with the lock attribute, the cache policy module of the LLC stores the corresponding data in the corresponding cache line and marks the lock attribute of the cache line (that is, the lock area), for example setting it to "persisting" as described above;
  • Step 6 The task scheduler sends the kernel to the IPU
  • Step 7 The IPU obtains the unlock window ID from the parameter table, reads and configures the unlock window;
  • Step 8 When the IPU transmits the request, it attaches the unlock (“unlock”) attribute according to the unlock window information;
  • Step 9 After receiving the request with the unlock attribute, the cache policy module of the LLC switches the cache line of the hit lock attribute to a normal attribute, such as the normal (Normal) attribute described in conjunction with the instruction mode above;
  • Step 10 The task scheduler disables the lock area (ie, LLC lock disable) by means of the configurator and through the CLR module.
  • the CLR module may clear the previous locking attribute configuration according to the instruction of the configurator.
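  • Steps 5 and 9 above boil down to the cache policy module toggling the attribute of the hit cache line; a toy model of that state change (addresses and tags simplified to integers, names assumed) is:

```cpp
#include <cstdint>
#include <unordered_map>

enum class LineAttr { Normal, Persisting };

// Toy model of the LLC cache policy module's attribute bookkeeping.
struct LlcPolicyModel {
    std::unordered_map<std::uint64_t, LineAttr> lines;   // line address -> attribute

    // Step 5: a request with the lock attribute resides the line in the lock area.
    void on_lock(std::uint64_t addr)   { lines[addr] = LineAttr::Persisting; }

    // Step 9: a request with the unlock attribute returns the hit line to normal,
    // so it can be replaced like any other cache line again.
    void on_unlock(std::uint64_t addr) {
        auto it = lines.find(addr);
        if (it != lines.end()) it->second = LineAttr::Normal;
    }
};
```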
  • the latch scheme of the system on chip of the present disclosure in the window mode has been described in detail above with reference to FIG. 9 .
  • the probability of a cache hit can be significantly increased, the utilization efficiency of the cache memory is improved, and the application scenarios are expanded.
  • the embodiments of the present disclosure also support latch-related operations in stream mode.
  • when the enable bit corresponding to the data stream in the task of the present disclosure is low, it is regarded as the default situation, that is, the latch-related operations in stream mode are not performed.
  • otherwise, when the enable bit is high, the corresponding latch-related operations can be performed on the data stream in stream mode.
  • the window mode and the stream mode of the present disclosure operate similarly; a predetermined proportion of data can be selected from the data stream, by using the hash algorithm and the lock ratio of the data stream, as the aforementioned partial data to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11.
  • the embodiment of the present disclosure also supports latch-related operations in the page mode, and the page mode will be described below with reference to FIG. 10 .
  • FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure.
  • in the page mode, the cache page can be directly configured so that it has the lock attribute of the present disclosure, so that the cache page that forms a mapping relationship with the memory (such as "memory" in the figure) can be shared and accessed by multiple kernels (kernel 0-2 shown in the figure).
  • the programmer may use an instruction (such as Malloc) to mark the cache page with a lock attribute.
  • Malloc an instruction
  • the SMMU can lock the data corresponding to the cache page in the latch area of the present disclosure.
  • the disclosed scheme improves the sharing and accessibility of data among multiple cores.
  • the software driver can directly configure the system memory management unit ("System Memory Management Unit", abbreviated as "SMMU") information in the page table through instructions, and determine whether to perform page-based latch operations or to operate with the normal configuration.
  • SMMU System Memory Management Unit
  • in the normal configuration, the attribute of the cache line in the cache memory can be the normal (Normal) attribute.
  • the page-based latch operation may be set according to the SMMU linearly mapped window configuration. For example, the data corresponding to the cache page in the linear mapping window is locked in the latch area of the present disclosure.
  • the SMMU can generate a corresponding lock-related request based on the information in the page table, and send the lock-related request to the LLC, and the cache policy module of the LLC can configure the cache line of the LLC according to the lock-related request to execute The corresponding cache-related operations.
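  • A hedged sketch of the page-mode idea follows: a page-table entry carries a lock flag (the flag name and layout are assumptions), and the SMMU derives lock-related requests for the pages that carry it.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical page-table entry with a per-page lock flag set, e.g., by a
// Malloc-style allocation that marks the page for residency in the LLC.
struct PageTableEntry {
    std::uint64_t page        = 0;
    std::uint64_t frame       = 0;
    bool          lock_in_llc = false;
};

// Collect the page frames whose data the SMMU should latch in the LLC.
std::vector<std::uint64_t> frames_to_latch(const std::vector<PageTableEntry>& table) {
    std::vector<std::uint64_t> frames;
    for (const auto& e : table)
        if (e.lock_in_llc)
            frames.push_back(e.frame);   // a Lock-attribute request is issued for each
    return frames;
}
```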
  • the embodiment of the present disclosure also supports an instruction mode, at this time, the system-on-chip can configure the latch area in the LLC through a memory access instruction (IO instruction) in the instruction set.
  • IO instruction memory access instruction
  • the IO instruction is accompanied by at least one configuration domain that latches related attributes, so that the LLC can be flexibly configured by means of the configuration domain.
  • various configuration domains may represent corresponding operation behaviors that the LLC may perform when performing data access to off-chip memory (such as DDR space).
  • the above configuration attributes are included in the instruction: the Transient attribute, the Lock attribute, the Unlock attribute, the Normal attribute, the Invalid attribute, the Clean attribute, or the Default attribute, and so on. Since the instruction mode has the highest priority, when the IO access instruction indicates the Default attribute, it means that the other modes (such as the window mode, stream mode or page mode) can perform latch-related operations.
  • the solution of the present disclosure can execute corresponding latch-related operations in the instruction mode according to these attached attributes.
  • the IPU can determine the latch-related request according to the IO instruction in the task. Specifically, when the configuration domain of the Lock attribute in the IO instruction is enabled, the Lock attribute can be attached to the lock-related request, so that the LLC can store the specific data of the lock-related request into the locked area according to the Lock attribute. When the configuration domain of the Unlock attribute in the IO instruction is enabled, the Unlock attribute can be attached to the lock-related request, so that the LLC can release the locked area according to the lock-related request attached with the Unlock attribute. According to different application scenarios, the latch-related request here can also have other attributes similarly attached.
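  • The mapping from the IO instruction's configuration fields to the request attribute can be sketched as below; the bit positions and names (kLockEn, kUnlockEn) are invented for illustration and are not the instruction encoding of the disclosure.

```cpp
#include <cstdint>

// Hypothetical packed configuration domain of an IO instruction.
struct IoInstruction {
    std::uint32_t cfg = 0;
};

constexpr std::uint32_t kLockEn   = 1u << 0;   // assumed Lock-enable field
constexpr std::uint32_t kUnlockEn = 1u << 1;   // assumed Unlock-enable field

enum class ReqAttr { Default, Lock, Unlock };

// Derive the attribute attached to the latch-related request from the instruction.
ReqAttr attr_from_io_instruction(const IoInstruction& io) {
    if (io.cfg & kLockEn)   return ReqAttr::Lock;    // LLC resides the data in the lock area
    if (io.cfg & kUnlockEn) return ReqAttr::Unlock;  // LLC releases the locked area
    return ReqAttr::Default;                         // defer to window/stream/page mode
}
```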
  • in some embodiments, the instruction also includes a specific configuration field for indicating the latch ratio, for example, a specific bit inst_ratio_en in the instruction.
  • the specific use of the hash algorithm will be described in detail below in conjunction with FIG. 11 .
  • FIG. 11 illustrates a hash operation in window mode or stream mode according to an embodiment of the present disclosure.
  • the scheme of the present disclosure uses a hash operation to enforce a certain percentage of residency (ie locking) because one of the key issues with LLC residency is the bandwidth versus capacity tradeoff ("tradeoff"). Therefore, this disclosure proposes to implement a certain ratio of residency (ie, Lock Ratio), so that different bandwidths and residency capacities can be obtained for different tasks.
  • Lock Ratio can be configured in the lock/unlock window or for specific data streams. Also, although hash operations in window mode or stream mode are described below, similar operations are also applicable to hash operations in instruction mode.
  • the intelligent processor core first compares the access address of the data with the address range defined by the lock window to determine whether the requested address is within the address range of the lock window.
  • a hash operation may be performed on the hit window address range.
  • the access address of each data may be a virtual address.
  • the VA of the access address can be mapped to the Hash space (that is, the "Hash Map” in the figure), and the Hash process can preferentially retain the low-order information of the address.
  • the Hash value obtained at 1102 can be compared with the lock ratio Lock Ratio at 1104 to randomly select data of a corresponding ratio.
  • when the hash value of the access address is smaller than the latch ratio, it is considered a hit, and therefore this part of the data (i.e., the data conforming to the ratio) can be latched in the cache memory.
  • when the hash value of the access address is greater than or equal to the latch ratio, it is considered a miss, and therefore this part of the data will not be latched in the cache memory.
  • for example, when the lock ratio (Lock Ratio) is set to 10%, the part of the data corresponding to the first 10% of the hash values can be selected in order; that is, the part of the data whose latch-address hash value is smaller than the lock ratio is latched for the related operations.
  • The latch ratio can also take other values and can be customized by the software user, and the aforementioned selection operation can likewise be implemented according to the setting of the Hash algorithm.
  • For example, the latch ratio may also be 20%-30%, in which case the partial data corresponding to the first 20%-30% of the Hash values may be sequentially selected for latch-related operations. Thereafter, at 1106, the selected data can be processed according to the specified request type, that is, locked or unlocked.
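  • The ratio-based selection described above can be sketched as follows; the concrete mixing function hashVa and the 8-bit hash space are assumptions chosen only to illustrate the idea of comparing a low-order-preserving hash of the access address against the configured Lock Ratio:

```cpp
#include <cstdint>

// Map a virtual access address into an 8-bit hash space. The disclosure only
// states that the hash preferentially retains low-order address information;
// the concrete mixing below is an assumption made for illustration.
uint32_t hashVa(uint64_t va) {
    uint64_t line = va >> 6;                     // drop cache-line offset bits
    line ^= line >> 13;                          // fold higher bits downward
    line *= 0x9E3779B97F4A7C15ull;               // multiplicative mixing
    return static_cast<uint32_t>(line & 0xFFu);  // keep the low-order 8 bits
}

// Decide whether an access that hits the lock window is actually latched.
// lockRatioPercent is the configured Lock Ratio, e.g. 10 for 10%.
bool shouldLatch(uint64_t va, uint32_t lockRatioPercent) {
    uint32_t threshold = (256u * lockRatioPercent) / 100u;
    return hashVa(va) < threshold;  // hash < ratio -> hit, latch this data
}
```

  With lockRatioPercent set to 10, roughly 10% of the addresses falling inside the lock window would be selected for latching, matching the example above.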
  • The latch scheme of the cache memory of the present disclosure has been described in detail above with reference to FIGS. 6-11. Based on the idea of the aforementioned latch scheme, and as a supplement to it, another extended application of the present disclosure for the cache memory, namely inter-cluster communication, will be described below in conjunction with FIG. 12 to FIG. 14.
  • the system on chip here may be the system on chip included in the computing device 201 shown in FIG. 2 , for example, the system on chip constituted by the multi-core computing device 41 .
  • The system-on-chip 1200 includes four exemplarily shown clusters, cluster 0 to cluster 3. Since the cluster has been described in detail above, it will not be repeated here.
  • The system-on-chip 1200 also includes a cache memory 1201, which can be provided, for example, in the SRAM 408 previously shown in FIG. 5, for performing inter-cluster data transfer operations.
  • the cache memory 1201 can also perform on-chip and off-chip bidirectional communication with DRAM (such as DDR), including the transfer of various types of data or instructions.
  • FIG. 13 is a flowchart illustrating a method 1300 for a system on chip according to an embodiment of the present disclosure.
  • the system on chip here may be the system on chip as shown in FIG. 12 .
  • the system-on-chip includes at least a plurality of clusters for performing computing operations and a cache memory interconnected with the plurality of clusters.
  • each cluster may include multiple processor cores for performing the computing operations.
  • The above-mentioned latch area determined in the cache memory can be used to complete inter-cluster data communication, so that the system-on-chip does not need to be provided with communication modules such as the CDMA 410 and the GDMA 411.
  • the above-mentioned latch area can be used to transfer data between tasks with dependencies, for example, the latch area can be used to transfer data between a producer core and a consumer core.
  • The processor can lock, in the LLC through the configured lock window, the data that the producer core needs to exchange to the consumer core.
  • After the processor finishes executing the producer kernel, it can latch the data that needs to be delivered to the consumer kernel (which may be the input data or the output data of the producer kernel).
  • the processor can perform the latch-related operations of the present disclosure on the LLC through the configured lock window and by means of, for example, the SMMU, so as to latch the above-mentioned data that needs to be exchanged in the LLC in the window mode, for later use by the consumer kernel.
  • The processor can also release the latch area according to the unlock window configured for the consumer kernel; that is, when the processor completes the execution of the consumer kernel by performing a read operation on the data latched in the LLC, it can release the storage space corresponding to that data in the latch area of the LLC.
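  • A minimal host-side sketch of this producer/consumer exchange is shown below; configureLockWindow, configureUnlockWindow, and the kernel functions are hypothetical stand-ins (stubbed so the sketch compiles) for the window configuration performed, for example, via the SMMU and for the dispatch of the producer and consumer kernels:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical lock/unlock window descriptor; not the disclosure's actual interface.
struct Window { uint64_t base; uint64_t size; };

void configureLockWindow(const Window&)   { /* latch output falling in the window */ }
void configureUnlockWindow(const Window&) { /* release latched lines after reads  */ }

void producerKernel(std::vector<int>& buf) {
    for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = static_cast<int>(i);
}
void consumerKernel(const std::vector<int>& buf) {
    long sum = 0;
    for (int v : buf) sum += v;
    std::printf("consumer read sum=%ld\n", sum);
}

// Producer output is latched in the LLC by the lock window, read by the
// consumer, then released by the unlock window so the space can be reused.
void exchangeThroughLatchArea(std::vector<int>& buf) {
    Window w{static_cast<uint64_t>(reinterpret_cast<std::uintptr_t>(buf.data())),
             buf.size() * sizeof(int)};
    configureLockWindow(w);
    producerKernel(buf);        // data written in the window resides in the LLC
    configureUnlockWindow(w);
    consumerKernel(buf);        // reads hit the latched lines, then space is freed
}
```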
  • the latch area can also be used in the application scenario of inter-chip communication.
  • a cluster or processor core of the processor transmits data (the data may be data that the producer core needs to exchange to the consumer core) via the latch area to processors in other clusters for merge processing.
  • Processors in other clusters read data from the latch area for processing, thereby realizing inter-chip data transfer.
  • For how inter-cluster communication is performed using the latch area, please refer to the description below.
  • the present disclosure also includes a method for performing inter-cluster communication using a latch area of a cache memory, the method comprising:
  • The specified storage space of the off-chip memory is mapped to a given storage area of the high-speed cache ("cache"), whose physical properties are the same as those of the locking area described above in conjunction with the accompanying drawings, so that the given storage area is used as the cluster storage area for inter-cluster data communication.
  • the cache memory may include LLC
  • the off-chip memory may include DDR.
  • the specified storage space may be the storage space specified at 1402 in FIG. 14 .
  • the cluster storage area may be a given storage area in the cache memory at 1404 in FIG. 14 .
  • The specified storage space of the DDR can be designated through software configuration and mapped to a given space in the cache for inter-cluster communication (for example, communication between cluster 0 and cluster 1 shown in FIG. 14).
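  • A sketch of such a software-configured mapping is given below; the descriptor structure and its field names are assumptions made for illustration rather than the disclosure's actual configuration interface:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor for the software-configured mapping between a
// designated DDR storage space and a given storage area of the cache.
struct ClusterAreaMapping {
    uint64_t ddrBase;     // start of the designated off-chip (DDR) storage space
    size_t   size;        // size of the designated storage space
    uint32_t cacheSetLo;  // first cache set reserved as the cluster storage area
    uint32_t cacheSetHi;  // last cache set reserved as the cluster storage area
};

// Accesses whose addresses fall inside the designated DDR space are serviced
// from the mapped cache area, making the data visible to all clusters without
// a round trip to the off-chip memory.
bool routedToClusterArea(const ClusterAreaMapping& m, uint64_t addr) {
    return addr >= m.ddrBase && addr < m.ddrBase + m.size;
}
```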
  • the determined cluster storage area may be used to perform cluster operations.
  • using the cluster store to perform operations of the cluster may include using the cluster store for inter-cluster communication.
  • using the cluster storage area for inter-cluster communication may specifically include: using the cluster storage area to implement point-to-point communication between clusters.
  • the cluster storage area may be used to implement broadcast communication from one of the multiple clusters to other clusters.
  • The cluster storage area can be used to receive the write operation of the first cluster for writing data and, in response to the read operation of the second cluster, send the data previously written by the first cluster to the second cluster.
  • The cluster storage area may also be used to receive a lock indication that causes the write data associated with the above-mentioned write operation to reside in the cluster storage area, such as the write lock ("write lock") shown in FIG. 14, that is, the above-mentioned latch-related request with the Lock attribute. Then, based on the lock indication, the written data may reside in the cluster storage area, where the cluster storage area may be the latch area determined in the above embodiment. Through such a residency manner, the hit ratio in the cache memory of data that is to be read many times can be significantly improved.
  • The producer kernel executing in one of the clusters can lock, in the LLC through the above-mentioned write lock, the data that needs to be exchanged to the consumer kernel for later use; for example, the producer core transmits data via the LLC to processors in other clusters for merge processing, and processors in other clusters can read the data from the cluster storage area for processing, thereby realizing the transfer of data between clusters.
  • The cluster storage area can also be used to receive a read invalidation indication that causes the write data not to be written back to the off-chip memory, such as the read invalidation ("read invalid") issued by cluster 1 in FIG. 14.
  • The read invalidation indication may be a latch-related request with the Invalid attribute; for the manner of generating the latch-related request, refer to the above description. In different latch modes, the latch-related requests can be different. Then, after sending the write data to cluster 1, the cluster storage area may invalidate the cache line associated with the write data based on the read invalidation indication.
  • The cluster (such as cluster 0) that writes data to the cluster storage area can send a synchronization command, such as the hsem ("hardware semaphore") shown in FIG. 14, to another cluster (such as cluster 1) after the write operation is completed.
  • After reading the data written into the cluster storage area by cluster 0, cluster 1 can send the above-mentioned read invalidation request for the cluster storage area to invalidate the cache line, thereby preventing the write-back of the aforementioned data.
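  • The write-lock / hsem / read-invalid handshake described above can be modeled at the host level as in the following sketch, where a shared buffer stands in for the cluster storage area and an atomic flag stands in for the hardware semaphore; these stand-ins are assumptions for illustration, not the hardware mechanism itself:

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-level model of the point-to-point handshake between cluster 0 (writer)
// and cluster 1 (reader).
struct ClusterArea {
    std::vector<char> lines;        // models the latched lines of the cluster storage area
    std::atomic<bool> hsem{false};  // models the hardware semaphore ("hsem")
};

void cluster0Send(ClusterArea& area, const char* src, std::size_t bytes) {
    area.lines.assign(src, src + bytes);               // "write lock": data resides
    area.hsem.store(true, std::memory_order_release);  // notify cluster 1
}

void cluster1Receive(ClusterArea& area, char* dst, std::size_t bytes) {
    while (!area.hsem.load(std::memory_order_acquire)) { /* wait for hsem */ }
    std::memcpy(dst, area.lines.data(), bytes);  // read the exchanged data
    area.lines.clear();                          // "read invalid": the lines are
                                                 // never written back to DDR
}
```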
  • The above-mentioned behaviors of writing data to and reading data from the cluster storage area can also be collectively referred to as latch-related operations triggered by latch-related requests; for the manner of determining the latch-related requests, see the description above.
  • the latch-related request may be used to indicate a latch operation. Through the latch operation, the data will be latched in the cluster storage area for subsequent multiple uses. Further, the latch-related request can be used to indicate a release operation, and through the release operation, data can be unlocked from the cluster storage area to release more storage space for subsequent data latches. It can be understood that, through the release operation, the storage space of the cluster storage area can be used flexibly, thereby improving the usage efficiency of the cluster storage area in the present disclosure.
  • the data or a selected part of the data may be released from the specified area of the cluster storage area according to a latch-related request.
  • a predetermined proportion of data may be randomly selected from the data to form the aforementioned partial data to be latched in the latch area.
  • hash algorithm can be used to select a predetermined proportion of data from the data as the aforementioned partial data to be latched in the cluster storage area.
  • The electronic equipment or devices disclosed in this disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, Internet-of-Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous-driving terminals, vehicles, household appliances, and/or medical equipment.
  • The vehicles include airplanes, ships, and/or automobiles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, ultrasound scanners, and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
  • Electronic devices or apparatuses with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that according to the hardware information of the terminal device and/or the edge device, the hardware resources of the cloud device can be Match appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-end integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the description of some embodiments in this disclosure also has different emphases. In view of this, those skilled in the art may understand the part that is not described in detail in a certain embodiment of the present disclosure, and may also refer to related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
  • The above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the solution of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory and may include several instructions to cause a computer device (such as a personal computer, a server, or a network device) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory ("Read Only Memory", abbreviated as ROM), a random access memory ("Random Access Memory", abbreviated as RAM), a removable hard disk, a magnetic disk, or various other media, such as optical discs, that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as CPU, GPU, FPGA, DSP, and ASIC.
  • The aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which can be, for example, a resistive random access memory ("Resistive Random Access Memory", abbreviated as RRAM), a dynamic random access memory ("Dynamic Random Access Memory", abbreviated as DRAM), a static random access memory ("Static Random Access Memory", abbreviated as SRAM), an enhanced dynamic random access memory ("Enhanced Dynamic Random Access Memory", abbreviated as EDRAM), a high bandwidth memory ("High Bandwidth Memory", abbreviated as HBM), a hybrid memory cube ("Hybrid Memory Cube", abbreviated as HMC), a ROM, a RAM, and so on.
  • Clause A1 A method for a cache memory, comprising:
  • Clause A2 The method of Clause A1, wherein the plurality of latch modes are performed in a predetermined order of priority.
  • Clause A3 The method according to Clause A1 or A2, wherein the plurality of latch modes include an instruction mode for performing latch-related operations based on hardware instructions, a window mode for performing latch-related operations based on window attributes, a stream mode for performing latch-related operations based on data streams, and/or a page mode for performing latch-related operations based on cache pages.
  • Clause A4 The method of Clause A3, wherein in the instruction mode, the latch-related request is determined according to the hardware instruction; in the page mode, the latch-related request is determined according to a cache page configuration; and in the window mode or the stream mode, the latch-related request is determined according to a lock window.
  • Clause A5 The method of Clause A4, wherein in the instruction mode, the window mode, or the stream mode, the latch-related request can be accompanied by a Lock attribute, the Lock attribute being used to indicate that specific data is retained in the latch area, the specific data being a part of the data selected according to the hash algorithm.
  • Clause A6 The method of clause A3 or A4, wherein in page mode, the method comprises:
  • the cache page-based latch operation is performed according to the linear mapping window of the system memory management unit.
  • Clause A7 The method of Clause A3, wherein configuring the latch region to support the plurality of latch modes comprises:
  • Clause A8 The method of Clause A7, wherein for a write operation to the latch area, the method includes latching the data or a selected portion of the data in a specified area of the latch area according to the latch-related request, for subsequent multiple reads.
  • Clause A9 The method according to Clause A7, wherein for a read operation on the latch area, the method comprises, after performing the read operation, releasing the data or the selected portion of the data from a designated area of the latch area according to the latch-related request.
  • Clause A10 A cache memory, comprising:
  • a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes
  • a latch execution module for:
  • a latch-related operation is performed on the data in the latch region in the corresponding latch mode.
  • Clause A11 A system-on-chip, comprising:
  • a processor configured to generate the latch-related request
  • the latch execution module of the cache memory is configured to perform latch-related operations on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • Clause A12 The system-on-chip of Clause A11, wherein the latch mode comprises an instruction mode, and in the instruction mode, the processor is configured to generate the latch-related request in accordance with a received hardware instruction.
  • Clause A13 The system-on-chip of Clause A11, wherein the latch mode comprises a page mode, and in the page mode, the processor is configured to generate the latch-related request according to a cache page configuration.
  • the configurator is used to generate the configuration instruction according to the assigned configuration task, so as to send it to the configuration module of the cache memory;
  • the scheduling unit is used to schedule multiple tasks in the task scheduler so as to send them to the processor core.
  • Clause A15 The system-on-chip of Clause A14, wherein the configuration instructions include configuration items for enabling latch regions, disabling latch regions, and/or latch region sizes.
  • Clause A16 The system-on-chip of Clause A15, wherein the processor further comprises a system memory management unit configured to, in the window mode or the stream mode:
  • Clause A17 The system-on-chip of Clause A16, wherein the configuration items of the lock window include one or more of the following:
  • the base address of the window, which corresponds to the start address of the data on which latch-related operations are to be performed, and the size of the window, which corresponds to the size of that data;
  • the latch ratio, which indicates the proportion of data to be actually latched among the data on which latch-related operations are to be performed.
  • Clause A18 The system-on-chip according to Clause A17, wherein the processor is further configured to use a hash algorithm to select, from the data on which latch-related operations are to be performed and whose access addresses fall within the address range of the lock window, the part of the data that can be locked in the latch area.
  • Clause A19 The system-on-chip according to Clause A17, wherein the processor is configured to randomly select the portion of data satisfying the predetermined latch ratio from the data to be latched according to a hash algorithm, and generate a Latching the associated request for latching in the latching area.
  • Clause A20 The system-on-chip of clause A14, wherein the processor is configured to perform a write operation on the data in the latch area, and the latch execution module is configured to write The data or a selected portion of the data is latched in a specified area of the latch area, and wherein the processor is further configured to perform a read operation on the data in the latch area, and the The latch execution module is configured to release the data after the read operation is performed from the designated area of the latch area according to the latch-related request.
  • Clause A21 The system-on-chip of any one of Clauses A16-A20, wherein the tasks include producer cores and consumer cores, wherein:
  • when executing the producer core, the processor is used to latch the data output by the producer core in the latch area through the latch-related request, for use by the consumer core;
  • when executing the consumer kernel, the processor is configured to read data from the latch area and, after reading the data, unlock the data from the latch area through the latch-related request, in order to release the storage space occupied by the data in the latch area.
  • Clause A23 A computing device comprising the board of Clause A22.
  • Clause B1 A method for a system-on-chip comprising at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster comprising a plurality of processor cores for performing said computational operations, the method comprising:
  • Operations of the cluster are performed using the cluster storage area.
  • Clause B2 The method of Clause B1, wherein using the cluster storage area to perform operations of the cluster comprises using the cluster storage area for inter-cluster communication.
  • the cluster storage area is used to implement broadcast communication from one of the multiple clusters to other clusters.
  • Clause B4 The method of Clause B3, wherein utilizing the cluster storage area to enable peer-to-peer communication between clusters comprises:
  • the write data is sent to the second cluster in response to a read operation by the second cluster.
  • Clause B5. The method of Clause B4, wherein in the write operation, the method further comprises:
  • the write data is resident in the cluster storage area based on the lock indication.
  • Clause B6 The method of clause B4 or B5, wherein in the read operation, the method further comprises:
  • a cache line associated with the write data is invalidated based on the read invalidation indication after the write data is sent to the second cluster.
  • Clause B7 A system-on-chip, comprising:
  • each cluster includes at least a plurality of processor cores for performing computational operations
  • a cache memory interconnected with the plurality of clusters and configured to:
  • use the latch area as a cluster storage area for inter-cluster data communication, wherein the latch area forms a mapping relationship with a designated storage space of the off-chip memory;
  • Operations of the cluster are performed using the cluster storage area.
  • Clause B8 The system-on-chip of Clause B7, wherein the cluster memory area is configured for inter-cluster communication.
  • Clause B9 The system-on-chip of Clause B8, wherein the cluster storage area is configured for point-to-point communication between clusters or broadcast communication from one of the plurality of clusters to the remaining clusters.
  • Clause B10 The system-on-chip of Clause B9, wherein in the peer-to-peer communication, the cluster storage area is configured to:
  • the write data is sent to the second cluster in response to a read operation by the second cluster.
  • Clause B11 The system-on-chip of Clause B10, wherein the second cluster is configured to:
  • the read operation is performed on the cluster memory area.
  • Clause B12 The system-on-chip of Clause B10, wherein in the write operation, the first cluster is configured to send, to the cluster storage area, a lock indication for causing the write data to reside in the cluster storage area, so that the cluster storage area resides the write data based on the lock indication.
  • Clause B13 The system-on-chip of Clause B12, wherein in the read operation, the second cluster is configured to send, to the cluster storage area, a read invalidation indication that causes the write data not to be written back to the off-chip memory, such that the cluster storage area invalidates the cache line associated with the write data based on the read invalidation indication.
  • Clause B14 A computing device comprising the system-on-chip according to any one of clauses B7-B13.
  • Clause B15 A board comprising the computing device according to Clause B14.
  • Clause B16 A computing device comprising the board according to Clause B15.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Disclosed are a method for a cache memory, a cache memory, a system-on-chip, a board, and a computing device. The computing device is characterized by a computing processing means included in a combined processing means (20); the combined processing means (20) may further comprise a universal interconnection interface and other processing means. The computing processing means interacts with the other processing means to jointly complete a computing operation specified by a user. The combined processing means (20) may further comprise a storage means, which is connected separately to the computing processing means and the other processing means and is used to store data of the computing processing means and the other processing means. The combined processing means (20) can improve the usage efficiency of the cache memory.
PCT/CN2022/110740 2021-08-12 2022-08-08 Procédé de mémoire cache et produits associés WO2023016383A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110926703.7 2021-08-12
CN202110926707.5 2021-08-12
CN202110926707.5A CN115705300A (zh) 2021-08-12 2021-08-12 Method for cache memory and related products thereof
CN202110926703.7A CN115878553A (zh) 2021-08-12 2021-08-12 Method for system-on-chip and related products thereof

Publications (1)

Publication Number Publication Date
WO2023016383A1 (fr)

Family

ID=85200562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110740 WO2023016383A1 (fr) 2021-08-12 2022-08-08 Procédé de mémoire cache et produits associés

Country Status (1)

Country Link
WO (1) WO2023016383A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030018855A1 (en) * 2001-07-16 2003-01-23 Mcwilliams Thomas M. Method and apparatus for caching with variable size locking regions
CN102750227A (zh) * 2011-04-19 2012-10-24 飞思卡尔半导体公司 Cache memory with dynamic lockstep support
CN106547619A (zh) * 2016-10-20 2017-03-29 深圳市云海麒麟计算机系统有限公司 Multi-user storage management method and system
CN110634517A (zh) * 2018-06-25 2019-12-31 成都康元多商贸有限公司 A high-performance static random access memory

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22855364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE