CN109684237B - Data access method and device based on multi-core processor - Google Patents

Info

Publication number: CN109684237B (application number CN201811385741.0A)
Authority: CN (China)
Prior art keywords: data, cache, core, cores, read
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN109684237A
Inventor: 宋昌 (Song Chang)
Current assignee: Huawei Technologies Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd; priority to CN201811385741.0A
Publication of application CN109684237A; application granted and published as CN109684237B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0893 Caches characterised by their organisation or structure
    • G06F 12/0897 Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • G06F 12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0815 Cache consistency protocols


Abstract

Embodiments of this application provide a data access method and apparatus based on a multi-core processor. The embodiments combine two cache coherence protocols: a directory-based cache coherence protocol is used first to maintain data consistency, and if the data is not recorded in the directory, a snoop-based cache coherence protocol is used instead. This reduces the number of snoop messages sent over the bus.

Description

Data access method and device based on multi-core processor
Technical Field
The present application relates to the field of information technology, and more particularly, to a method and apparatus for data access based on a multi-core processor.
Background
A multi-core processor is a processor that integrates two or more complete computing engines ("cores") into one package. Each core has its own second-level (L2) cache for storing data and instructions; this L2 cache is a private cache of the core, meaning its storage space serves only that specific core. However, different cores may use the same data, so copies of the same data may be stored in the L2 caches of different cores at the same time; such data is called shared data.
The presence of shared data introduces cache coherence problems. Currently there are two main protocols for ensuring cache coherence: the snoop-based cache coherence protocol (snooping protocol) and the directory-based cache coherence protocol (directory protocol). The snooping protocol relies on a bus or bus-like interconnect: every memory access request sent by a core is broadcast to the other cores of the multi-core processor, and the other cores modify the data in their private caches based on the received requests to keep the caches coherent. However, because all access requests travel over the bus and bus bandwidth is limited, snoop-based coherence limits the scalability of the whole system. In the directory protocol, a directory structure records how shared data is currently distributed across the multi-core processor, and a memory access request sent by a core is delivered, based on the directory, only to the target core that holds the requested data. However, the directory must record the sharing state of all data in the multi-core processor and the cores holding copies of each shared datum, so the directory occupies a large amount of storage space.
It can be seen from the above that both cache coherence protocols have drawbacks. To balance the storage space occupied by the directory against the bus bandwidth consumed by sending memory access requests, this application provides a new cache coherence solution.
Disclosure of Invention
This application provides a method and apparatus for reading data from a multi-core processor, which help balance the storage space occupied by the directory against the bus bandwidth consumed by sending memory access requests.
In a first aspect, a method for reading data from a multi-core processor is provided. The multi-core processor includes a plurality of cores and a third-level cache; the second-level cache of each core includes a first cache region and a second cache region, and data stored in the first cache region is modified more frequently than data stored in the second cache region. The method includes: a first core of the plurality of cores sends a read request to the second-level cache of the first core, where the read request requests to read first read data; if the first read data is stored in neither the second-level cache of the first core nor the third-level cache, the first core determines a target core from the plurality of cores by querying a directory and forwards the read request to the target core, where the directory records that the first read data is stored in a cache (for example, the first cache region) of the target core; and if the first read data is not recorded in the directory, the first core sends the read request to the other cores of the plurality of cores.
In this embodiment of the application, the two protocols are combined: the directory-based cache coherence protocol is used first to maintain data consistency, and if the data is not recorded in the directory, the snoop-based cache coherence protocol is used instead, which reduces the number of snoop messages sent over the bus.
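The hybrid read path described above can be sketched as follows. This is a minimal illustrative model, not code from the patent: the `Core` class, the dict-based caches, and `read_miss` are assumptions made for the sketch.

```python
# Hybrid coherence read path: query the directory first; only when the
# address is absent from the directory fall back to a snoop-style
# broadcast to every other core.

class Core:
    def __init__(self):
        self.l2 = {}  # address -> data; stands in for the private L2 cache

def read_miss(core_id, addr, directory, cores):
    """Resolve an L2/L3 miss in core `core_id` for address `addr`."""
    owner = directory.get(addr)              # directory-based path
    if owner is not None:
        return cores[owner].l2.get(addr)     # point-to-point forward
    for cid, core in cores.items():          # snoop-based fallback: broadcast
        if cid != core_id and addr in core.l2:
            return core.l2[addr]
    return None                              # miss everywhere: go to memory
```

Note that the directory hit avoids touching any core other than the owner, which is the source of the bus-bandwidth saving claimed above.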
Optionally, the directory records the data stored in the first cache region of each of the plurality of cores; specifically, the directory records an index of the data stored in the first cache region of each of the plurality of cores.
It should be noted that the directory may also record part of the data in the second cache regions of the cores; it is not limited to the data stored in the first cache region of each core.
In this embodiment of the application, because data stored in the second cache region is modified less frequently than data stored in the first cache region, a directory-based cache coherence protocol is used to maintain consistency for data stored in the first cache region, and a snoop-based cache coherence protocol is used to maintain consistency for data not recorded in the directory. Compared with using only a snoop-based cache coherence protocol, this combination reduces the number of snoops sent over the bus; compared with using only a directory-based cache coherence protocol, it reduces the storage space occupied by the directory.
In a possible implementation, if the first read data is not recorded in the directory, the first core sending the read request to the other cores includes: the first core sends the read request to the other cores, reads the first read data from the second cache region of another core, and loads it into the second cache region of the first core.
In this embodiment of the application, because data stored in the second cache region is generally data that does not need frequent modification, the first read data read from the second cache region of another core is stored in the second cache region of the first core, saving storage space in the first cache region while helping to reduce subsequent write operations in the second cache region and prolong its service life.
In one possible implementation, the first cache region is located in static random access memory (SRAM) and the second cache region is located in non-volatile memory (NVM), where the NVM includes, but is not limited to, spin-transfer torque RAM (STT-RAM) or magnetic RAM (MRAM).
In a possible implementation, if the first read data is not recorded in the directory, the first core sending the read request to the other cores includes: the first core sends the read request to the other cores, reads the first read data from the first cache region of another core, and loads it into the first cache region of the first core.
In this embodiment of the application, because data stored in the first cache region is generally data that needs frequent modification, the first read data read from the first cache region of another core is stored in the first cache region of the first core, reducing write operations performed in the second cache region and helping to prolong the service life of the second cache region.
In a possible implementation manner, the first cache region is used for storing dirty data, and the second cache region is used for storing read-only data.
In a second aspect, a method for writing data into a multi-core processor is provided. The multi-core processor includes a plurality of cores and a third-level cache; the second-level cache of each core includes a first cache region and a second cache region, and data stored in the first cache region is modified more frequently than data stored in the second cache region. The method includes: a first core of the plurality of cores sends a write request to the second-level cache of the first core, where the write request requests to write target write data; if the target data is stored in neither the second-level cache of the first core nor the third-level cache, the first core determines a target core from the plurality of cores by querying a directory and sends an invalidation signal to the target core, where the directory records that the target data is stored in the second-level cache of the target core, and the invalidation signal indicates that the target data is to be marked as invalid data; and if the target data is not recorded in the directory, the first core sends the invalidation signal to the other cores of the plurality of cores.
The target data is the data that needs to be marked invalid in the process of writing the target write data; it can be understood as the data to be modified by the target write data, or the data to be replaced by the target write data.
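The write-side counterpart of the hybrid protocol can be sketched as below. This is an illustrative assumption, not the patent's implementation: if the directory records where the target data lives, one point-to-point invalidation suffices; otherwise the invalidation is broadcast to all other cores. The returned message count makes the bus saving visible.

```python
# Invalidation before a write under the hybrid coherence scheme.

class Core:
    def __init__(self):
        self.l2 = {}  # address -> data

def invalidate_for_write(core_id, addr, directory, cores):
    """Invalidate copies of `addr` before core `core_id` writes it.
    Returns the number of invalidation messages sent."""
    owner = directory.pop(addr, None)
    if owner is not None:                 # directory hit: point-to-point
        cores[owner].l2.pop(addr, None)
        return 1
    sent = 0
    for cid, core in cores.items():       # directory miss: snoop broadcast
        if cid != core_id:
            core.l2.pop(addr, None)
            sent += 1
    return sent
```

With N cores, the directory path sends 1 message where the snoop path sends N-1, which is the trade-off the two aspects of this application exploit.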
In this embodiment of the application, the two protocols are combined: the directory-based cache coherence protocol is used first to maintain data consistency, and if the data is not recorded in the directory, the snoop-based cache coherence protocol is used instead, which reduces the number of snoop messages sent over the bus.
Optionally, the directory records the data stored in the first cache region of each of the plurality of cores; specifically, the directory records an index of the data stored in the first cache region of each of the plurality of cores.
It should be noted that the directory may also record part of the data in the second cache regions of the cores; it is not limited to the data stored in the first cache region of each core.
In this embodiment of the application, because data stored in the second cache region is modified less frequently than data stored in the first cache region, a directory-based cache coherence protocol is used to maintain consistency for data stored in the first cache region, and a snoop-based cache coherence protocol is used to maintain consistency for data not recorded in the directory. Compared with using only a snoop-based cache coherence protocol, this combination reduces the number of snoops sent over the bus; compared with using only a directory-based cache coherence protocol, it reduces the storage space occupied by the directory.
In one possible implementation, the first cache region is located in static random access memory (SRAM) and the second cache region is located in non-volatile memory (NVM), where the NVM includes, but is not limited to, spin-transfer torque RAM (STT-RAM) or magnetic RAM (MRAM).
In a possible implementation manner, the first cache region is used for storing dirty data, and the second cache region is used for storing read-only data.
In a third aspect, an apparatus for reading data from a multi-core processor is provided. The multi-core processor includes a plurality of cores and a third-level cache; the second-level cache of each core includes a first cache region and a second cache region, and data stored in the first cache region is modified more frequently than data stored in the second cache region. The apparatus is disposed in a first core of the plurality of cores and includes modules for executing any one of the possible implementations of the first aspect.
In a fourth aspect, an apparatus for writing data into a multi-core processor is provided, where the multi-core processor includes a plurality of cores and a third-level cache, a second-level cache of each core includes a first cache region and a second cache region, a modification frequency of data stored in the first cache region is higher than a modification frequency of data stored in the second cache region, the apparatus is disposed in a first core of the plurality of cores, and the apparatus includes modules for executing any one of possible implementations of the second aspect.
In a fifth aspect, a multi-core processor is provided. The multi-core processor includes a plurality of cores, a third-level cache, and a memory; the second-level cache of each core includes a first cache region and a second cache region, and data stored in the first cache region is modified more frequently than data stored in the second cache region. The memory stores a computer program, and a first core of the plurality of cores calls and runs the computer program from the memory to execute the methods in the foregoing aspects.
In a sixth aspect, there is provided a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of the above-mentioned aspects.
It should be noted that all or part of the computer program code may be stored in the first storage medium, where the first storage medium may be packaged together with the multicore processor or may be packaged separately from the multicore processor, and this is not specifically limited in this embodiment of the present application.
In a seventh aspect, a computer-readable medium is provided, which stores program code, which, when run on a computer, causes the computer to perform the method of the above-mentioned aspects.
In an eighth aspect, a chip system is provided, the chip system comprising a multi-core processor for performing the functions referred to in the above aspects, such as generating, receiving, sending, or processing data and/or information referred to in the above methods. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the multicore processor. The chip system may be formed by a chip, or may include a chip and other discrete devices.
Drawings
Fig. 1 is a schematic diagram of an architecture of a multicore processor of an embodiment of the present application.
FIG. 2 illustrates a level two buffer architecture.
Fig. 3 is a flowchart of a method for reading and writing data based on a multi-core processor according to an embodiment of the present application.
FIG. 4 is a schematic flow chart diagram of a method for reading data from a multicore processor according to an embodiment of the present application.
FIG. 5 is a schematic flow chart of a method for writing data into a multi-core processor according to an embodiment of the present application.
FIG. 6 is a schematic diagram of an apparatus for reading data from a multicore processor according to an embodiment of the present application.
Fig. 7 is a schematic diagram of an apparatus for writing data into a multicore processor according to an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
For ease of understanding, the architecture of the multi-core processor of the embodiment of the present application will be described with reference to fig. 1. The multi-core processor 100 shown in fig. 1 includes a plurality of cores 110, a plurality of second-level caches 120, a plurality of hybrid cache management modules 130, a plurality of coherency management modules 140, a third-level cache 150, and a memory 160.
A core 110 for executing instructions.
The second-level cache 120 (L2 cache) provides cache space for the core corresponding to it. Each second-level cache is a private cache of one core; that is, the storage resources provided by the second-level cache can be used only by the core corresponding to it.
FIG. 2 illustrates a second-level cache architecture. The second-level cache 120 shown in fig. 2 may contain a plurality of cache sets 210, each of which includes a plurality of cache lines. A cache set can be understood as the unit of address mapping between a cache address and a memory address, and a cache line as the minimum unit of data exchange between the cache and the memory. To meet varied requirements on the second-level cache, it may be composed of memories of several different storage media; such a cache is also called a hybrid second-level cache.
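The set/line organization just described can be modelled in a few lines. The way counts below (2 SRAM lines and 6 NVM lines per set) and the modulo set mapping are assumptions made for this sketch, not figures from the patent.

```python
# Illustrative model of a hybrid L2 cache: each cache set mixes
# SRAM-backed cache lines (220) and NVM-backed cache lines (230).

class CacheLine:
    def __init__(self, medium):
        self.medium = medium   # "sram" (line 220) or "nvm" (line 230)
        self.tag = None
        self.data = None

class HybridL2:
    def __init__(self, num_sets=256, sram_ways=2, nvm_ways=6):
        self.sets = [
            [CacheLine("sram") for _ in range(sram_ways)] +
            [CacheLine("nvm") for _ in range(nvm_ways)]
            for _ in range(num_sets)
        ]

    def set_index(self, addr):
        # the cache set is the unit of address mapping (cache <-> memory)
        return addr % len(self.sets)
```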
For example, a second-level cache composed of non-volatile memory (NVM) and static random-access memory (SRAM) is a typical hybrid second-level cache. In such a cache, each cache set includes SRAM-based cache lines 220 and NVM-based cache lines 230.
The hybrid cache management module 130 is configured to manage the storage space in the second-level cache, storing data with different modification frequencies according to the storage characteristics of the different storage media.
In a second-level cache composed of NVM and SRAM, the NVM may be resistive RAM (ReRAM), spin-transfer torque RAM (STT-RAM), phase-change memory (PCM), or Intel 3D XPoint. Compared with conventional DRAM/SRAM, NVM has higher storage density, lower static power consumption, and non-volatility. However, because of its inherent characteristics, NVM also suffers from high write latency, limited write endurance, and high write power consumption. SRAM has the disadvantages of high power consumption, low density, and high cost, but its write endurance is much longer than that of NVM.
Based on the above unbalanced read/write power consumption and limited write endurance of the NVM, the number of write operations performed on the NVM should be reduced and, correspondingly, more write operations directed to the SRAM. The specific implementation is described in detail below with reference to fig. 2.
The coherence management module 140 is configured to maintain the consistency of the shared data stored in the multiple second-level caches, for example based on the snooping protocol, the directory protocol, or the method of maintaining consistency provided in this application.
Based on the architecture of the multi-core processor shown in fig. 1, the following describes a method for reading and writing data with reference to fig. 3. Fig. 3 is a flowchart of a method for reading and writing data based on a multi-core processor according to an embodiment of this application. The method shown in fig. 3 may be performed by any one of the cores in the multi-core processor, or by the hybrid cache management module in that core. The method shown in fig. 3 includes steps 301 to 303 (the write flow) and steps 321 to 327 (the read flow).
301, the first core sends a write request to the second-level cache of the first core, where the write request requests to write target write data into the second-level cache of the first core. If the target write data is being written into the second-level cache of the first core for the first time, step 302 is executed; if the write request requests to modify target data already stored in the first core into the target write data, step 303 is executed.
It should be noted that if the cache line to which the target write data is to be written is in the exclusive (E) state, no snoops need to be sent. If the cache line is in the shared (S) state, the write request can be understood as a request to modify the target data in that cache line into the target write data, and step 303 is executed.
The exclusive state means that the data in the cache line has no copy elsewhere; that is, the data is stored only in the current cache line, and the data in the cache line is consistent with the data stored in the memory. The shared state means that the data in the cache line is also stored in the second-level caches and/or the third-level cache corresponding to other cores of the plurality of cores, and the data in the cache line is consistent with the data stored in the memory.
302, the first core stores the target write data into an SRAM-based cache line.
Because of the "locality principle" of programs, newly written data can be considered likely to be modified again, so storing it in an SRAM cache line reduces the possibility of modifications to NVM cache lines.
303, the first core stores the target write data into an SRAM-based cache line, and replaces the mapping between the cache address of the target data and its memory address with a mapping between the cache address of the target write data in the SRAM-based cache line and that memory address, thereby modifying the target data into the target write data.
It should be noted that if the multi-core processor adopts a directory-based cache coherence protocol, the cores whose second-level caches store copies of the target data can be determined by querying the directory, so the first core can directly send an invalidation signal "point to point" to those cores to mark the cache lines storing copies of the target data as invalid.
If the multi-core processor adopts a snoop-based cache coherence protocol, the first core cannot know where copies of the target data are stored, so it must broadcast an invalidation signal to the other cores over the bus. By snooping the invalidation signal on the bus, the other cores learn that the data to be marked invalid is the target data and, if they store the target data, mark the corresponding cache line as invalid.
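Steps 301 to 303 can be sketched as follows, under the simplifying assumption that each core's L2 is modelled as two dicts (`sram`, `nvm`) plus a per-address MESI-style state; all names are illustrative, not from the patent. Writes always land in the SRAM region; an exclusive line needs no snoop, while a shared line triggers invalidation of the peers' copies.

```python
# Write flow sketch: new or modified data is placed in an SRAM line;
# a write to a Shared line invalidates copies held by other cores.

class Core:
    def __init__(self):
        self.sram = {}   # frequently modified ("dirty") data
        self.nvm = {}    # rarely modified ("read-only") data
        self.state = {}  # address -> "M" / "E" / "S"; absent means invalid

def handle_write(core_id, addr, data, cores):
    me = cores[core_id]
    old_state = me.state.get(addr)
    me.nvm.pop(addr, None)          # remap NVM-resident data, if any
    me.sram[addr] = data            # steps 302/303: write into an SRAM line
    me.state[addr] = "M"
    if old_state == "S":            # copies elsewhere must be invalidated
        for cid, peer in cores.items():
            if cid != core_id:
                peer.sram.pop(addr, None)
                peer.nvm.pop(addr, None)
                peer.state.pop(addr, None)
```

The invalidation loop stands in for either delivery mechanism (point-to-point via the directory, or a bus broadcast); the two previous paragraphs describe how the messages actually travel.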
321, the first core sends a read request to the second-level cache of the first core, where the read request requests to read first read data. If the second-level cache of the first core hits, step 322 is executed; if it misses, step 323 is executed.
Specifically, a hit in the second-level cache of the first core means that the first read data is stored in the second-level cache of the first core.
322, the second-level cache of the first core sends the first read data to the first core.
323, the first core reads the first read data from the third-level cache. If the third-level cache hits, step 324 is executed; if it misses, step 325 is executed.
The third-level cache is a shared cache of the multi-core processor, and every core of the multi-core processor can access it.
324, the third-level cache sends the first read data to the first core, and the first read data is stored in an NVM cache line of the first core.
325, the first core checks whether the second-level caches of the other cores of the multi-core processor store the first read data. If the first read data is stored there, step 326 is executed; if not, step 327 is executed.
326, the first read data is read from the second-level cache of another core and stored into the second-level cache of the first core.
Based on the write flow above, data stored in the NVM is usually read-only data, and data in the SRAM is usually dirty (that is, modified) data. To further reduce write operations on NVM cache lines and extend the service life of the NVM: if the first read data is in an SRAM-based cache line of another core's second-level cache, it can be regarded as dirty data that is likely to be modified again, and is stored directly into an SRAM-based cache line of the first core's second-level cache; if the first read data is in an NVM-based cache line of another core's second-level cache, it can be regarded as read-only data with little likelihood of subsequent modification, and is stored directly into an NVM-based cache line of the first core's second-level cache.
327, the first core reads the first read data from the memory.
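The lookup order of steps 321 to 327 (local L2, then the shared L3, then the peers' L2 caches, finally memory) can be sketched as below. The `Core` class and dict-based caches are illustrative assumptions; the placement rule follows the description above: data found in a peer's SRAM line (likely dirty) lands in the local SRAM region, while data found in the L3 or a peer's NVM line (read-only) lands in the local NVM region.

```python
# Read flow sketch: L2 -> L3 -> peer L2s -> memory, with medium-aware
# placement of the returned data in the requesting core's L2.

class Core:
    def __init__(self):
        self.sram = {}
        self.nvm = {}

def handle_read(core_id, addr, cores, l3, memory):
    me = cores[core_id]
    if addr in me.sram:                       # 321/322: local L2 hit
        return me.sram[addr]
    if addr in me.nvm:
        return me.nvm[addr]
    if addr in l3:                            # 323/324: shared L3 hit
        me.nvm[addr] = l3[addr]               # read-only -> local NVM line
        return me.nvm[addr]
    for cid, peer in cores.items():           # 325/326: peers' L2 caches
        if cid == core_id:
            continue
        if addr in peer.sram:                 # dirty data -> local SRAM line
            me.sram[addr] = peer.sram[addr]
            return me.sram[addr]
        if addr in peer.nvm:                  # read-only -> local NVM line
            me.nvm[addr] = peer.nvm[addr]
            return me.nvm[addr]
    return memory.get(addr)                   # 327: read from memory
```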
From the above write and read flows it can be seen that data stored in SRAM-based cache lines of the second-level cache is modified more frequently than data in NVM-based cache lines; in other words, SRAM-based cache lines usually store "dirty data" while NVM-based cache lines usually store "read-only data".
Therefore, based on these different characteristics of the data stored in the different cache spaces, this application provides a new method for maintaining cache coherence. Because read-only data needs to send fewer memory access requests over the bus during reads and writes than dirty data does, a directory-based cache coherence protocol is adopted for data in the cache region storing dirty data, and a snoop-based cache coherence protocol is adopted for data in the cache region storing read-only data. Meanwhile, the directory stores indexes only for dirty data rather than for all data in the multi-core processor, which greatly reduces the storage space occupied by the directory. Compared with the two prior-art cache coherence protocols, this combination balances the storage space occupied by the directory against the bus resources occupied by memory access requests (for example, invalidation requests or forwarded read requests). The schemes for maintaining cache coherence while reading and writing data according to embodiments of this application are described in detail below with reference to fig. 4 and fig. 5 respectively.
FIG. 4 is a schematic flow chart diagram of a method for reading data from a multicore processor according to an embodiment of the present application. The second-level buffer of each core of the multi-core processor comprises a first buffer area and a second buffer area, and the modification frequency of data stored in the first buffer area is higher than that of data stored in the second buffer area. The method shown in fig. 4 may be executed by any core in the multi-core processor shown in fig. 1, and specifically, may be executed by a consistency management module in the core, which is not limited in this embodiment of the present application. The method shown in fig. 4 includes steps 410 to 430.
410, a first core of the plurality of cores sends a read request to the second-level cache of the first core, where the read request requests to read first read data.
420, if the first read data is stored in neither the second-level cache nor the third-level cache of the first core, the first core determines a target core from the plurality of cores by querying a directory and forwards the read request to the target core, where the directory records that the first read data is stored in the cache of the target core.
Optionally, the directory is configured to record data stored in the first cache region of each of the plurality of cores, and specifically, the directory is configured to record an index of the data stored in the first cache region of each of the plurality of cores.
It should be noted that the directory may also record an index of part of the data in the second cache regions of the plurality of cores, and is not limited to the data stored in the first cache region of each core.
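For illustration only, the dirty-data-only directory described above can be sketched as a mapping from a cache-line address to the set of cores whose first (SRAM-based) cache region holds a copy of that line. The class and method names below are invented for this sketch and are not part of the application:

```python
class DirtyDataDirectory:
    """Illustrative directory that tracks only dirty cache lines.

    Lines held in a core's second (read-only, NVM-based) cache region
    are deliberately not recorded, which is what shrinks the storage
    footprint compared with a directory covering all data."""

    def __init__(self):
        self._entries = {}  # cache-line address -> set of core ids

    def record(self, address, core_id):
        # Called when a core installs a dirty line in its first cache region.
        self._entries.setdefault(address, set()).add(core_id)

    def evict(self, address, core_id):
        # Called when a core drops or invalidates its dirty copy.
        holders = self._entries.get(address)
        if holders is not None:
            holders.discard(core_id)
            if not holders:
                del self._entries[address]

    def lookup(self, address):
        # Returns the holder set, or None when the line is not recorded;
        # a None result means the requester must fall back to snooping
        # on the bus.
        return self._entries.get(address)
```

Because only dirty lines are recorded, a lookup miss does not mean the line is absent from every cache, only that it cannot be located through the directory.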
The first cache region, in which the stored data is modified more frequently than in the second cache region, may be the cache region described above for storing dirty data, for example, the SRAM-based cache region. The second cache region may be the cache region described above for storing read-only data, for example, the NVM-based cache region.
That the directory is used to record an index of the data stored in the first cache region of each of the plurality of cores can be understood as follows: not all data in the multi-core processor is recorded in the directory; rather, the directory records only the distribution of the copies of the data in the first cache region of each second-level cache, and in that respect it plays the same role as the directory in a directory-based cache coherence protocol.
The first core determining a target core by querying the directory and forwarding the read request to the target core can be understood as follows: if an index of the first read data is recorded in the directory, the target core storing the first read data can be determined from that index, and the read request is forwarded to the target core so as to read the first read data from the first cache region (i.e., the SRAM-based cache region) in the second-level cache of the target core.
430, if the first read data is not recorded in the directory, the first core sends the read request to other cores of the plurality of cores.
The first core may send the read request to the other cores through the bus by using a snoop-based cache coherence protocol. Correspondingly, the other cores determine the data requested by the read request by snooping the read request sent through the bus, and any core that stores the first read data may send the first read data to the first core.
The first read data read back by sending the read request to the other cores may be data stored in the NVM-based cache region in the second-level cache of another core, or data stored in the SRAM-based cache region in the second-level cache of another core; in either case, an index of the data is not recorded in the directory.
It should be noted that the cache coherence scheme provided in this application may be used in combination with the data storage policy shown in fig. 3, and may also be applied to any cache that is partitioned according to the modification frequency of data.
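The read path of steps 410 to 430 can be summarized in a minimal executable sketch. The Core class, its field names, and the plain-dictionary directory are assumptions made for this example only; the third-level cache and main memory are omitted:

```python
class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.sram_region = {}  # first cache region: dirty data, directory-tracked
        self.nvm_region = {}   # second cache region: read-only data, snooped

    def lookup_local(self, address):
        # Search both regions of this core's second-level cache.
        if address in self.sram_region:
            return self.sram_region[address]
        return self.nvm_region.get(address)


def read_line(first_core, address, directory, cores):
    # 410: the read request first goes to the requesting core's own L2.
    data = first_core.lookup_local(address)
    if data is not None:
        return data
    # 420: directory hit -> forward the request to one recorded holder
    # and read from that core's first (SRAM-based) cache region.
    holders = directory.get(address)
    if holders:
        target = cores[next(iter(holders))]
        return target.sram_region[address]
    # 430: directory miss -> snoop-based fallback: the request is
    # broadcast on the bus and any other core holding the line replies.
    for core in cores.values():
        if core is not first_core:
            data = core.lookup_local(address)
            if data is not None:
                return data
    return None  # would fall through to the L3 cache / main memory
```

The design point of the hybrid scheme is visible here: frequently modified lines are located with a single directory lookup, and the bus broadcast is reserved for lines the slimmed-down directory does not track.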
FIG. 5 is a schematic flowchart of a method for writing data into a multi-core processor according to an embodiment of the present application. The second-level cache of each core of the multi-core processor includes a first cache region and a second cache region, and the modification frequency of the data stored in the first cache region is higher than that of the data stored in the second cache region. It should be understood that the method shown in fig. 5 may be executed by any core in the multi-core processor shown in fig. 1, and specifically by a consistency management module in the core, which is not limited in this embodiment of the present application. The method shown in fig. 5 includes steps 510 to 530.
510, a first core of the plurality of cores sends a write request to the second-level cache of the first core, where the write request requests to write target write data.
520, if the target data is stored in neither the second-level cache nor the third-level cache of the first core, the first core determines a target core from the plurality of cores by querying a directory and sends an invalid signal to the target core, where the directory records that the target data is stored in the second-level cache of the target core, and the invalid signal is used to indicate that the target data is to be marked as invalid.
The target data is the data that needs to be marked as invalid in the process of writing the target write data; in other words, the target data is the data to be modified by the target write data, or the data to be replaced by the target write data.
The first cache region, in which the stored data is modified more frequently than in the second cache region, may be the cache region described above for storing dirty data, for example, the SRAM-based cache region. The second cache region may be the cache region described above for storing read-only data, for example, the NVM-based cache region.
Optionally, the directory is configured to record data stored in the first cache region of each of the plurality of cores, and specifically, the directory is configured to record an index of the data stored in the first cache region of each of the plurality of cores.
It should be noted that the directory may also record an index of part of the data in the second cache regions of the plurality of cores, and is not limited to the data stored in the first cache region of each core.
That the directory is used to record an index of the data stored in the first cache region of each of the plurality of cores can be understood as follows: not all data in the multi-core processor is recorded in the directory; rather, the directory records only the distribution of the copies of the data in the first cache region of each second-level cache, and in that respect it plays the same role as the directory in a directory-based cache coherence protocol.
The first core determining the target core by querying the directory and sending the invalid signal to the target core can be understood as follows: if an index of the target data is recorded in the directory, the target core storing the target data can be determined from that index, and the invalid signal is sent to the target core to mark the target data as invalid.
530, if the target data is not recorded in the directory, the first core sends the invalidation signal to other cores in the plurality of cores.
The first core may send the invalid signal to the other cores through the bus by using a snoop-based cache coherence protocol. Correspondingly, the other cores determine the data to be invalidated by snooping the invalid signal sent through the bus, and any core that stores the target data marks it as invalid.
The target data marked as invalid by sending the invalid signal to the other cores may be data stored in the NVM-based cache region in the second-level cache of another core, or data stored in the SRAM-based cache region in the second-level cache of another core; in either case, an index of the data is not recorded in the directory.
It should be noted that the cache coherence scheme provided in this application may be used in combination with the data storage policy shown in fig. 3, and may also be applied to any cache that is partitioned according to the modification frequency of data.
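Analogously, the write path of steps 510 to 530 can be sketched as follows. Again, the Core class, its field names, and the plain-dictionary directory are illustrative assumptions rather than the application's prescribed implementation:

```python
class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.sram_region = {}  # first cache region: dirty data, directory-tracked
        self.nvm_region = {}   # second cache region: read-only data, snooped

    def invalidate(self, address):
        # Drop any local copy of the line, whichever region holds it.
        self.sram_region.pop(address, None)
        self.nvm_region.pop(address, None)


def write_line(first_core, address, value, directory, cores):
    # 510: the write request first goes to the writing core's own L2
    # (a local hit is omitted from this sketch).
    # 520: directory hit -> targeted invalid signals, no bus broadcast.
    holders = directory.pop(address, None)
    if holders:
        for core_id in holders:
            cores[core_id].invalidate(address)
    else:
        # 530: directory miss -> the invalid signal is broadcast on the
        # bus; the other cores snoop it and drop any stale copy.
        for core in cores.values():
            if core is not first_core:
                core.invalidate(address)
    # The newly written line is dirty, so it lands in the first cache
    # region and its index is recorded in the directory.
    first_core.sram_region[address] = value
    directory[address] = {first_core.core_id}
```

Note how the directory entry for the written line always ends up pointing at the writer, keeping the dirty-data index consistent for the next read in fig. 4.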
The method provided by the embodiment of the present application is described in detail above with reference to fig. 1 to 5, and the apparatus of the embodiment of the present application is described below with reference to fig. 6 and 7. It should be understood that the apparatus shown in fig. 6 and 7 can implement the steps of the above method, and the description is omitted here for brevity.
FIG. 6 is a schematic diagram of an apparatus for reading data from a multi-core processor according to an embodiment of the present application. The multi-core processor includes a plurality of cores and a third-level cache, the second-level cache of each core includes a first cache region and a second cache region, the modification frequency of the data stored in the first cache region is higher than that of the data stored in the second cache region, and the apparatus 600 is disposed in a first core of the plurality of cores.
The apparatus includes:
a sending module 610, configured to send a read request to a secondary buffer of the first core, where the read request requests to read first read data;
a processing module 620, configured to determine a target core from the multiple cores by querying a directory and forward the read request to the target core if the first read data is not stored in the second-level buffer and the third-level buffer of the first core, where the directory records that the first read data is stored in the cache of the target core;
the processing module 620 is further configured to send the read request to another core of the multiple cores if the first read data is not recorded in the directory.
Optionally, the directory is configured to record data stored in the first cache region of each of the plurality of cores, and specifically, the directory is configured to record an index of the data stored in the first cache region of each of the plurality of cores.
Optionally, as an embodiment, the processing module is specifically further configured to: and sending the read request to the other cores to read the first read data from the cache regions of the other cores and load the first read data to the second cache region of the first core.
Optionally, as an embodiment, the first buffer area is located in a static random access memory SRAM, and the second buffer area is located in any one of a nonvolatile memory NVM, a spin transfer torque random access memory STT-RAM, and a magnetic random access memory MRAM.
Optionally, as an embodiment, the first cache region is configured to store dirty data, and the second cache region is configured to store read-only data.
Fig. 7 is a schematic diagram of an apparatus for writing data into a multicore processor according to an embodiment of the present application. The multi-core processor comprises a plurality of cores and three-level caches, the second-level cache of each core comprises a first cache region and a second cache region, the modification frequency of data stored in the first cache region is higher than that of data stored in the second cache region, the device is arranged in the first core of the plurality of cores, and the device comprises: a sending module 710 and a processing module 720.
A sending module 710, configured to send a write request to the second-level buffer of the first core, where the write request requests to write target write data.
The processing module 720 is configured to: if the target data is not stored in the second-level cache and the third-level cache of the first core, determine a target core from the plurality of cores by querying a directory, and send an invalid signal to the target core, where the directory records that the target data is stored in the second-level cache (for example, in the first cache region) of the target core, and the invalid signal is used to indicate that the target data is marked as invalid data.
The processing module 720 is further configured to send the invalid signal to other cores of the plurality of cores if the target data is not recorded in the directory.
Optionally, the directory is configured to record data stored in the first cache region of each of the plurality of cores, and specifically, the directory is configured to record an index of the data stored in the first cache region of each of the plurality of cores.
Optionally, as an embodiment, the first cache region is located in a static random access memory SRAM, and the second cache region is located in any one of a nonvolatile memory NVM, a spin transfer torque random access memory STT-RAM, and a magnetic random access memory MRAM.
Optionally, as an embodiment, the first cache region is configured to store dirty data, and the second cache region is configured to store read-only data.
In an alternative embodiment, the sending module may be an input/output interface of a first core in the multi-core processor shown in fig. 1, and the processing module may be the first core. The architecture of the multi-core processor in which the first core is located may be seen in fig. 1, and for brevity, no further description is provided here.
It should be understood that in the embodiments of the present application, the processor may be a Central Processing Unit (CPU), and the processor may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should be understood that in the embodiment of the present application, "B corresponding to a" means that B is associated with a, from which B can be determined. It should also be understood that determining B from a does not mean determining B from a alone, but may be determined from a and/or other information.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean that A exists alone, that both A and B exist, or that B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., by infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid-state drive (SSD)), among others.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

1. A method for reading data from a multi-core processor is characterized in that the multi-core processor comprises a plurality of cores and three-level caches, a second-level cache of each core comprises a first cache region and a second cache region, the modification frequency of the data stored in the first cache region is higher than that of the data stored in the second cache region,
the method comprises the following steps:
a first core of the plurality of cores sends a read request to a second-level cache of the first core, wherein the read request requests to read first read data;
if the first read data is not stored in the second-level cache and the third-level cache of the first core, the first core determines a target core from the plurality of cores by querying a directory, and forwards the read request to the target core, wherein the directory records that the first read data is stored in the first cache region of the target core;
if the first read data is not recorded in the directory, the first core sends the read request to other cores of the plurality of cores.
2. The method of claim 1, wherein the directory is to record an index of data stored in the first cache region of each of the plurality of cores.
3. The method as claimed in claim 1 or 2, wherein if the first read data is not recorded in the directory, the sending, by the first core, the read request to other cores of the plurality of cores comprises:
the first core sends the read request to the other cores to read the first read data from the second cache regions of the other cores and load the first read data to the second cache regions of the first core.
4. The method of claim 1 or 2, wherein the first buffer region is located in a Static Random Access Memory (SRAM) and the second buffer region is located in a non-volatile memory (NVM), wherein the NVM comprises a spin transfer torque random access memory (STT-RAM) or a Magnetic Random Access Memory (MRAM).
5. The method of claim 1 or 2, wherein the first cache region is to store dirty data and the second cache region is to store read-only data.
6. A method for writing data into a multi-core processor is characterized in that the multi-core processor comprises a plurality of cores and a third-level cache, a second-level cache of each core comprises a first cache region and a second cache region, the modification frequency of the data stored in the first cache region is higher than that of the data stored in the second cache region,
the method comprises the following steps:
a first core of the plurality of cores sends a write request to a second-level cache of the first core, wherein the write request requests to write target write data;
if the target data is not stored in the second-level cache and the third-level cache of the first core, the first core determines a target core by querying a directory and sends an invalid signal to the target core, wherein the invalid signal is used for marking the target data in the target core as invalid data, and the directory is used for recording that the target data is stored in the first cache region of the target core;
if the target data is not recorded in the directory, the first core sends the invalid signal to other cores of the plurality of cores.
7. The method of claim 6, wherein the directory is to record an index of data stored in the first cache region of each of the plurality of cores.
8. The method of claim 6 or 7, wherein the first buffer region is located in a Static Random Access Memory (SRAM) and the second buffer region is located in a non-volatile memory (NVM), wherein the NVM comprises a spin transfer torque random access memory (STT-RAM) or a Magnetic Random Access Memory (MRAM).
9. The method of claim 6 or 7, wherein the first cache region is to store dirty data and the second cache region is to store read-only data.
10. An apparatus for reading data from a multi-core processor, the multi-core processor including a plurality of cores and a third-level cache, a second-level cache of each core including a first cache region and a second cache region, the data stored in the first cache region having a higher frequency of modification than the data stored in the second cache region, the apparatus being disposed in a first core of the plurality of cores,
the device comprises:
a sending module, configured to send a read request to the second-level cache of the first core, wherein the read request requests to read first read data;
a processing module, configured to: if the first read data is not stored in the second-level cache and the third-level cache of the first core, determine a target core from the plurality of cores by querying a directory, and forward the read request to the target core, wherein the directory is used to record that the first read data is stored in the first cache region of the target core;
the processing module being further configured to: if the first read data is not recorded in the directory, send the read request to other cores of the plurality of cores.
11. The apparatus of claim 10, wherein the directory is to record an index of data stored in the first cache region of each of the plurality of cores.
12. The apparatus according to claim 10 or 11, wherein the processing module is further specifically configured to:
and sending the read request to the other cores to read the first read data from the cache regions of the other cores and load the first read data to the second cache region of the first core.
13. The apparatus of claim 10 or 11, wherein the first buffer region is located in a Static Random Access Memory (SRAM) and the second buffer region is located in a non-volatile memory (NVM), wherein the NVM comprises a spin transfer torque random access memory (STT-RAM) or a Magnetic Random Access Memory (MRAM).
14. The apparatus of claim 10 or 11, wherein the first buffer is to store dirty data and the second buffer is to store read-only data.
15. An apparatus for writing data into a multi-core processor, the multi-core processor including a plurality of cores and a third-level cache, a second-level cache of each core including a first cache region and a second cache region, the first cache region storing data with a modification frequency higher than that of the second cache region, the apparatus being disposed in a first core of the plurality of cores,
the device comprises:
a sending module, configured to send a write request to the second-level cache of the first core, wherein the write request requests to write target write data;
a processing module, configured to: if the target data is not stored in the second-level cache and the third-level cache of the first core, determine a target core from the plurality of cores by querying a directory, and send an invalid signal to the target core, wherein the directory records that the target data is stored in the first cache region of the target core, and the invalid signal is used to indicate that the target data is marked as invalid data;
the processing module being further configured to: if the target data is not recorded in the directory, send the invalid signal to other cores of the plurality of cores.
16. The apparatus of claim 15, wherein the directory is to record an index of data stored in the first cache region of each of the plurality of cores.
17. The apparatus of claim 15 or 16, wherein the first buffer region is located in a Static Random Access Memory (SRAM) and the second buffer region is located in a non-volatile memory (NVM), wherein the NVM comprises a spin transfer torque random access memory (STT-RAM) or a Magnetic Random Access Memory (MRAM).
18. The apparatus as claimed in claim 15 or 16, wherein said first buffer is for storing dirty data and said second buffer is for storing read-only data.
19. A multi-core processor, comprising a plurality of cores and a third-level cache, wherein the second-level cache of each core comprises a first cache region and a second cache region, the modification frequency of data stored in the first cache region is higher than that of data stored in the second cache region, and the first core of the plurality of cores is configured to call and run a computer program from a memory in which the computer program is stored, so as to execute the method according to any one of claims 1 to 9.
CN201811385741.0A 2018-11-20 2018-11-20 Data access method and device based on multi-core processor Active CN109684237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811385741.0A CN109684237B (en) 2018-11-20 2018-11-20 Data access method and device based on multi-core processor

Publications (2)

Publication Number Publication Date
CN109684237A CN109684237A (en) 2019-04-26
CN109684237B true CN109684237B (en) 2021-06-01

Family

ID=66184828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811385741.0A Active CN109684237B (en) 2018-11-20 2018-11-20 Data access method and device based on multi-core processor

Country Status (1)

Country Link
CN (1) CN109684237B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11669454B2 (en) * 2019-05-07 2023-06-06 Intel Corporation Hybrid directory and snoopy-based coherency to reduce directory update overhead in two-level memory
WO2021155491A1 (en) * 2020-02-04 2021-08-12 Qualcomm Incorporated Data transfer with media transfer protocol (mtp) over universal serial bus (usb)
CN112885867B (en) * 2021-01-29 2021-11-09 长江先进存储产业创新中心有限责任公司 Manufacturing method of central processing unit, central processing unit and control method thereof
CN113703958B (en) * 2021-07-15 2024-03-29 山东云海国创云计算装备产业创新中心有限公司 Method, device, equipment and storage medium for accessing data among multi-architecture processors
CN114116590B (en) * 2021-11-03 2023-10-31 中汽创智科技有限公司 Data acquisition method, device, vehicle, storage medium and electronic equipment
CN116700629B (en) * 2023-08-01 2023-09-26 北京中电华大电子设计有限责任公司 Data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101958834A (en) * 2010-09-27 2011-01-26 清华大学 On-chip network system supporting cache coherence and data request method
CN102473140A (en) * 2009-07-17 2012-05-23 株式会社东芝 Memory management device
CN104252423A (en) * 2013-06-26 2014-12-31 华为技术有限公司 Consistency processing method and device based on multi-core processor
CN104508637A (en) * 2012-07-30 2015-04-08 华为技术有限公司 Method for peer to peer cache forwarding
CN105740164A (en) * 2014-12-10 2016-07-06 阿里巴巴集团控股有限公司 Multi-core processor supporting cache consistency, reading and writing methods and apparatuses as well as device
CN106775476A (en) * 2016-12-19 2017-05-31 中国人民解放军理工大学 Mixing memory system and its management method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729166B (en) * 2012-10-10 2017-04-12 华为技术有限公司 Method, device and system for determining thread relation of program
CN104951240B (en) * 2014-03-26 2018-08-24 阿里巴巴集团控股有限公司 A kind of data processing method and processor
CN105095116B (en) * 2014-05-19 2017-12-12 华为技术有限公司 Cache method, cache controller and the processor replaced
EP3249539B1 (en) * 2015-02-16 2021-08-18 Huawei Technologies Co., Ltd. Method and device for accessing data visitor directory in multi-core system
CN105045729B (en) * 2015-09-08 2018-11-23 浪潮(北京)电子信息产业有限公司 A kind of buffer consistency processing method and system of the remote agent with catalogue

Also Published As

Publication number Publication date
CN109684237A (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN109684237B (en) Data access method and device based on multi-core processor
US8015365B2 (en) Reducing back invalidation transactions from a snoop filter
US10402327B2 (en) Network-aware cache coherence protocol enhancement
US7827357B2 (en) Providing an inclusive shared cache among multiple core-cache clusters
US8949544B2 (en) Bypassing a cache when handling memory requests
TWI391821B (en) Processor unit, data processing system and method for issuing a request on an interconnect fabric without reference to a lower level cache based upon a tagged cache state
US7774551B2 (en) Hierarchical cache coherence directory structure
US7669010B2 (en) Prefetch miss indicator for cache coherence directory misses on external caches
US7281092B2 (en) System and method of managing cache hierarchies with adaptive mechanisms
TWI291651B (en) Apparatus and methods for managing and filtering processor core caches by using core indicating bit and processing system therefor
US20120102273A1 (en) Memory agent to access memory blade as part of the cache coherency domain
US20130073811A1 (en) Region privatization in directory-based cache coherence
US20090006668A1 (en) Performing direct data transactions with a cache memory
US9164910B2 (en) Managing the storage of data in coherent data stores
US20190026225A1 (en) Multiple chip multiprocessor cache coherence operation method and multiple chip multiprocessor
US7380068B2 (en) System and method for contention-based cache performance optimization
CN107341114B (en) Directory management method, node controller and system
CN112256604B (en) Direct memory access system and method
KR20140098096A (en) Integrated circuits with cache-coherency
CN106339330B (en) The method and system of cache flush
JP6343722B2 (en) Method and device for accessing a data visitor directory in a multi-core system
US20230100746A1 (en) Multi-level partitioned snoop filter
CN117971718B (en) Cache replacement method and device for multi-core processor
US11954033B1 (en) Page rinsing scheme to keep a directory page in an exclusive state in a single complex
JP2019164497A (en) Management device, information processing device, management method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant