CN116881192A - Cluster architecture for GPU and internal first-level cache management method thereof

Cluster architecture for GPU and internal first-level cache management method thereof

Info

Publication number
CN116881192A
Authority
CN
China
Prior art keywords
access request
level cache
access
stream processor
crossbar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310660072.8A
Other languages
Chinese (zh)
Inventor
赵夏
王璐
张光达
王会权
何益百
温家辉
方健
蒋艳德
赵杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN202310660072.8A
Publication of CN116881192A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a cluster architecture for a GPU and an internal first-level cache management method thereof, wherein the cluster architecture comprises: a plurality of stream processors, the plurality of stream processors being connected to a crossbar; the crossbar, which is provided with a plurality of input ports and a plurality of output ports, the input ports being respectively connected to the stream processors and the output ports being respectively connected to the stream processors and to the on-chip interconnection network of the GPU, wherein the crossbar carries out communication among the stream processors and between the stream processors and the on-chip interconnection network; and an L1 index routing module, which is arranged in the crossbar and is used for calculating the index of the corresponding first-level cache according to the address of an access request sent by a stream processor and sending the access request, through the crossbar, to the stream processor containing the corresponding first-level cache. The invention realizes first-level cache sharing among the stream processors in the cluster architecture, makes full use of the first-level cache resources in the GPU, and improves the performance of the stream processors and the GPU.

Description

Cluster architecture for GPU and internal first-level cache management method thereof
Technical Field
The invention relates to the technical field of processors, in particular to a cluster architecture for a GPU and an internal first-level cache management method thereof.
Background
A graphics processor (Graphics Processing Unit, GPU) is a microprocessor that performs image- and graphics-related computation. Owing to their powerful computing capability, GPUs are widely used in cloud computing platforms and data centers to provide the necessary computation to users.
In a GPU, an on-chip interconnection network connects the stream processors (Streaming Multiprocessor, SM) to the last-level cache (LLC) and the storage system. To keep increasing the computing power of GPUs, the number of stream processors in a GPU keeps growing, and connecting numerous stream processors directly to the on-chip interconnection network would greatly increase the hardware overhead of the network and therefore of the whole GPU. Referring to fig. 1, a schematic diagram of an exemplary on-chip interconnection structure of a GPU, current GPUs therefore adopt a cluster structure to solve the scalability problem of the on-chip interconnection network: several stream processors are grouped together, and the stream processors in each cluster are connected to a port of the on-chip interconnection network through a crossbar (Crossbar) arranged inside the cluster. Meanwhile, the first-level cache (L1 cache) of each stream processor in the GPU adopts a private design, i.e., the L1 cache in each stream processor is owned exclusively by that stream processor. When the memory access unit in a stream processor issues a memory access request, the request accesses the L1 cache of the current stream processor; if the L1 cache hits, data is returned and the current thread continues to execute; if the L1 cache misses, the current thread is suspended, and the access request is sent through the crossbar in the cluster to the on-chip interconnection network and then routed to the corresponding LLC and memory controller for access; after the access request completes, the returned data message reaches the corresponding cluster through the on-chip interconnection network and is sent through the crossbar to the stream processor that initiated the request, and the suspended thread in that stream processor resumes execution.
Constrained by on-chip resources, the capacity of the L1 cache cannot grow without bound; in the Nvidia A100 GPU, for example, the L1 cache in each stream processor is only 192KB. Because a GPU program generates a large number of memory access requests at run time, the L1 cache faces a high miss rate, and threads in the stream processor stall while waiting for data, which greatly constrains the performance of the stream processor and seriously affects the performance of the GPU.
Disclosure of Invention
In order to solve some or all of the technical problems in the prior art, the present invention provides a cluster architecture for a GPU and an internal first-level cache management method thereof.
The technical scheme of the invention is as follows:
in a first aspect, a cluster architecture for a GPU is provided, the cluster architecture comprising:
a plurality of stream processors, wherein the plurality of stream processors are connected to a crossbar;
the crossbar, which is provided with a plurality of input ports and a plurality of output ports, the input ports being respectively connected to the stream processors and the output ports being respectively connected to the stream processors and to the on-chip interconnection network of the GPU, wherein the crossbar carries out communication among the stream processors and between the stream processors and the on-chip interconnection network;
and an L1 index routing module, which is arranged in the crossbar and is used for calculating the index of the corresponding first-level cache according to the address of an access request sent by a stream processor and sending the access request, through the crossbar, to the stream processor containing the corresponding first-level cache.
In some possible implementations, the communication carried out by the crossbar among the plurality of stream processors and between them and the on-chip interconnection network includes:
receiving a memory access request sent by a stream processor, and sending the access request to a stream processor or to the on-chip interconnection network;
and receiving access response data returned by a stream processor or by the on-chip interconnection network, and sending the access response data to the stream processor that initiated the request.
In some possible implementations, the stream processor is further provided with an access request queue unit, an access miss queue unit and an access response queue unit;
the access request queue unit is connected to the crossbar and is used for buffering access requests sent by the crossbar;
the access miss queue unit is connected to the crossbar and is used for buffering access requests that missed in the first-level cache;
the access response queue unit is connected to the crossbar and is used for buffering access response data after a first-level cache hit.
In a second aspect, a first-level cache management method for the above cluster architecture is provided, the method comprising:
the crossbar receives a memory access request sent by a stream processor;
the L1 index routing module calculates the index of the first-level cache corresponding to the current access request, and sends the access request to the port of the crossbar connected to the stream processor containing the corresponding first-level cache;
the crossbar sends the access request to the stream processor containing the corresponding first-level cache, so that the access request accesses the first-level cache of that stream processor;
whether the first-level cache hits is judged;
if the first-level cache hits, the crossbar receives the returned access response data and sends the access response data to the stream processor that initiated the access request;
if the first-level cache misses, the crossbar receives the returned access request and sends it to the on-chip interconnection network so that the access request accesses the storage system of the GPU; the crossbar then receives the access response data returned by the storage system through the on-chip interconnection network and sends the access response data to the stream processor that initiated the access request.
In some possible implementations, the L1 index routing module calculates, according to the address of the access request, the index of the first-level cache corresponding to the current access request through address bit mapping or hash mapping.
In some possible implementations, an access request queue unit, an access miss queue unit and an access response queue unit are provided in the stream processor;
the access request queue unit is used to receive and buffer access requests sent by the crossbar, the access miss queue unit is used to receive and buffer access requests that missed in the first-level cache, and the access response queue unit is used to receive and buffer access response data after a first-level cache hit.
The main advantages of the technical scheme of the invention are as follows:
the cluster architecture for the GPU and the internal first-level cache management method thereof can realize the first-level cache sharing of each stream processor in the cluster architecture on the basis of setting the cluster architecture to reduce the hardware overhead of an on-chip interconnection network and the GPU, can fully utilize first-level cache resources in the GPU, improve the performances of the stream processor and the GPU, and improve the utilization rate of system resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention without limiting it. In the drawings:
FIG. 1 is a schematic diagram of an on-chip network interconnect structure of an exemplary GPU;
FIG. 2 is a schematic diagram of a cluster architecture according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a first level cache management method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to specific embodiments of the present invention and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes in detail the technical scheme provided by the embodiment of the invention with reference to the accompanying drawings.
Referring to FIG. 2, in a first aspect, an embodiment of the present invention provides a cluster architecture for a GPU, the cluster architecture comprising:
a plurality of stream processors, wherein the plurality of stream processors are connected to a crossbar;
the crossbar, which is provided with a plurality of input ports and a plurality of output ports, the input ports being respectively connected to the stream processors and the output ports being respectively connected to the stream processors and to the on-chip interconnection network of the GPU, wherein the crossbar carries out communication among the stream processors and between the stream processors and the on-chip interconnection network;
and an L1 index routing module, which is arranged in the crossbar and is used for calculating the index of the corresponding first-level cache according to the address of an access request sent by a stream processor and sending the access request, through the crossbar, to the stream processor containing the corresponding first-level cache.
Based on this structure, when the cluster architecture provided by the embodiment of the invention performs a data access, a stream processor in the cluster initiates a memory access request and sends it to the crossbar, and the crossbar receives the access request; the L1 index routing module in the crossbar calculates, according to the address of the access request and through address bit mapping or hash mapping, the index of the first-level cache corresponding to the current access request, and sends the access request to the port of the crossbar connected to the stream processor containing the corresponding first-level cache; the crossbar sends the access request to that stream processor so that the request accesses its first-level cache. When the first-level cache hits, the stream processor returns the corresponding access response data, and the crossbar receives the data and sends it to the stream processor that initiated the request. When the first-level cache misses, the stream processor returns the access request, and the crossbar receives it and sends it to the on-chip interconnection network so that the request accesses the storage system of the GPU; the storage system returns the corresponding access response data, which reaches the crossbar through the on-chip interconnection network; the crossbar receives the data and sends it to the stream processor that initiated the request.
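To make this flow concrete, the following C++ sketch models one request passing through the crossbar. It is a behavioral illustration only, not the patented hardware: all type, member and function names are hypothetical, the 128-byte line size, the low-order-bit owner selection and the allocate-on-miss fill are assumptions, and a real design would pipeline these steps as hardware messages rather than function calls.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    // Toy model of one stream processor's first-level cache slice.
    struct L1Slice {
        std::unordered_map<uint64_t, uint32_t> lines;  // line number -> data
        bool Lookup(uint64_t line, uint32_t& data) {
            auto it = lines.find(line);
            if (it == lines.end()) return false;       // miss
            data = it->second;                         // hit
            return true;
        }
    };

    struct ClusterCrossbar {
        std::vector<L1Slice> sms;                      // one slice per SM
        static constexpr unsigned kLineBits = 7;       // assume 128B lines

        // L1 index routing module: pick the owning SM from the address.
        unsigned Owner(uint64_t addr) const {
            return static_cast<unsigned>((addr >> kLineBits) % sms.size());
        }

        // One memory access request arriving on input port `src`.
        uint32_t Handle(unsigned src, uint64_t addr) {
            (void)src;  // in hardware, src selects the response's return port
            unsigned owner = Owner(addr);
            uint64_t line = addr >> kLineBits;
            uint32_t data;
            if (sms[owner].Lookup(line, data))
                return data;                   // hit: response back to src
            data = FetchFromNoC(addr);         // miss: forward to the NoC
            sms[owner].lines[line] = data;     // assumed allocate-on-miss fill
            return data;                       // response back to src
        }

        uint32_t FetchFromNoC(uint64_t addr) { // stand-in for LLC/DRAM access
            return static_cast<uint32_t>(addr * 2654435761u);
        }
    };

    int main() {
        ClusterCrossbar xbar{std::vector<L1Slice>(4)};  // assume 4 SMs
        uint32_t a = xbar.Handle(0, 0x1000);  // SM0: misses, filled via NoC
        uint32_t b = xbar.Handle(3, 0x1000);  // SM3: hits the shared slice
        std::printf("%u %u\n", a, b);         // same data; second is an L1 hit
    }

The point of the shared design shows in the second call: a different stream processor hits on data first fetched by another, which a private L1 cannot do.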
According to the above data access process, in an embodiment of the present invention the communication carried out by the crossbar among the plurality of stream processors and between them and the on-chip interconnection network includes:
receiving a memory access request sent by a stream processor, and sending the access request to a stream processor or to the on-chip interconnection network;
and receiving access response data returned by a stream processor or by the on-chip interconnection network, and sending the access response data to the stream processor that initiated the request.
Further, in an embodiment of the present invention, the crossbar in the cluster architecture may adopt an N×(N+1) port configuration, where N is the number of stream processors in the cluster: the N input ports are respectively connected to the N stream processors, and the N+1 output ports are respectively connected to the N stream processors and to the on-chip interconnection network. For example, a cluster of 8 stream processors would use a crossbar with 8 input ports and 9 output ports.
In order to realize first-level cache sharing among the stream processors in the cluster architecture and ensure that the memory access process proceeds in order, in one embodiment of the invention the stream processor is further provided with an access request queue unit, an access miss queue unit and an access response queue unit; the access request queue unit is connected to the crossbar and is used for buffering access requests sent by the crossbar; the access miss queue unit is connected to the crossbar and is used for buffering access requests that missed in the first-level cache; the access response queue unit is connected to the crossbar and is used for buffering access response data after a first-level cache hit.
Specifically, an access request sent by the crossbar is stored in the access request queue unit in a queue structure while it waits to access the first-level cache of the stream processor; when the first-level cache is accessed, the access request queue unit takes out the access request at the head of the queue. When the first-level cache misses, the corresponding access request is stored in the access miss queue unit in a queue structure while it waits to be sent to the crossbar; when sending to the crossbar, the access miss queue unit takes out the access request at the head of the queue. When the first-level cache hits, the corresponding access response data is stored in the access response queue unit in a queue structure while it waits to be sent to the crossbar; when sending to the crossbar, the access response queue unit takes out the access response data at the head of the queue.
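As a minimal C++ sketch of these three units (all names, the depths and the toy tag check are assumptions; the embodiment specifies only the head-of-queue, first-in-first-out behavior), each unit can be modeled as a simple queue drained from the front:

    #include <cstdint>
    #include <cstdio>
    #include <queue>

    struct MemRequest  { uint64_t addr; unsigned srcSM; };
    struct MemResponse { uint64_t addr; uint32_t data; unsigned srcSM; };

    struct SMQueueUnits {
        std::queue<MemRequest>  requestQ;   // access request queue unit
        std::queue<MemRequest>  missQ;      // access miss queue unit
        std::queue<MemResponse> responseQ;  // access response queue unit

        // Each step, the head request is taken out and probes this SM's L1.
        void StepL1Access() {
            if (requestQ.empty()) return;
            MemRequest req = requestQ.front();   // head-of-queue request
            requestQ.pop();
            uint32_t data;
            if (ProbeL1(req.addr, data))
                responseQ.push({req.addr, data, req.srcSM});  // hit
            else
                missQ.push(req);                              // miss
        }

        // Toy tag check standing in for a real L1 lookup.
        bool ProbeL1(uint64_t addr, uint32_t& data) {
            if ((addr & 0xFF) != 0) return false;
            data = static_cast<uint32_t>(addr);
            return true;
        }
    };

    int main() {
        SMQueueUnits q;
        q.requestQ.push({0x100, 2});  // hits in the toy L1
        q.requestQ.push({0x101, 5});  // misses
        q.StepL1Access();
        q.StepL1Access();
        std::printf("responses=%zu misses=%zu\n",
                    q.responseQ.size(), q.missQ.size());
    }

The stream processor then drains the miss queue and the response queue toward the crossbar in the same head-of-queue fashion, as described above.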
Referring to fig. 3, in a second aspect, an embodiment of the present invention further provides a first-level cache management method for the above cluster architecture, the method comprising steps S1 to S6:
step S1, the crossbar receives a memory access request sent by a stream processor;
step S2, the L1 index routing module calculates the index of the first-level cache corresponding to the current access request, and sends the access request to the port of the crossbar connected to the stream processor containing the corresponding first-level cache;
step S3, the crossbar sends the access request to the stream processor containing the corresponding first-level cache, so that the access request accesses the first-level cache of that stream processor;
step S4, whether the first-level cache hits is judged;
step S5, if the first-level cache hits, the crossbar receives the returned access response data and sends the access response data to the stream processor that initiated the access request;
step S6, if the first-level cache misses, the crossbar receives the returned access request and sends it to the on-chip interconnection network so that the access request accesses the storage system of the GPU; the crossbar then receives the access response data returned by the storage system through the on-chip interconnection network and sends the access response data to the stream processor that initiated the access request.
Further, in an embodiment of the present invention, the L1 index routing module calculates, according to the address of the memory access request, the index of the first-level cache corresponding to the current access request through address bit mapping or hash mapping.
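The embodiment leaves the concrete mapping open; as an illustrative C++ sketch under assumed parameters (128-byte lines, 8 stream processors per cluster, an arbitrary XOR-fold as the hash), the two options might look as follows:

    #include <cstdint>
    #include <cstdio>

    constexpr unsigned kLineBits = 7;  // assumed 128-byte cache lines
    constexpr unsigned kNumSMs   = 8;  // assumed 8 stream processors per cluster

    // Address bit mapping: take the low-order bits just above the line offset.
    unsigned L1IndexByBits(uint64_t addr) {
        return static_cast<unsigned>((addr >> kLineBits) % kNumSMs);
    }

    // Hash mapping: XOR-fold higher address bits into the index so that
    // power-of-two strides spread across the L1 slices instead of all
    // landing on one stream processor.
    unsigned L1IndexByHash(uint64_t addr) {
        uint64_t line = addr >> kLineBits;
        line ^= line >> 16;
        line ^= line >> 8;
        return static_cast<unsigned>(line % kNumSMs);
    }

    int main() {
        std::printf("%u %u\n", L1IndexByBits(0x4080), L1IndexByHash(0x4080));
    }

Address bit mapping is essentially free in hardware, while a hash mapping costs a few XOR gates but spreads power-of-two access strides across the first-level cache slices instead of concentrating them on a single stream processor.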
In an embodiment of the present invention, in order to realize first-level cache sharing among the stream processors in the cluster architecture and ensure that the memory access process proceeds in order, an access request queue unit, an access miss queue unit and an access response queue unit may be provided in the stream processor; the access request queue unit is used to receive and buffer access requests sent by the crossbar, the access miss queue unit is used to receive and buffer access requests that missed in the first-level cache, and the access response queue unit is used to receive and buffer access response data after a first-level cache hit.
On the basis of using a cluster architecture to reduce the hardware overhead of the on-chip interconnection network and of the GPU, the cluster architecture for a GPU and the internal first-level cache management method thereof provided by the embodiments of the invention realize first-level cache sharing among the stream processors in the cluster, so that the first-level cache resources in the GPU can be fully utilized, the performance of the stream processors and of the GPU is improved, and the utilization of system resources is increased.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. In this context, "front", "rear", "left", "right", "upper" and "lower" are referred to with respect to the placement state shown in the drawings.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting thereof; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A cluster architecture for a GPU, the cluster architecture comprising:
a plurality of stream processors, wherein the plurality of stream processors are connected to a crossbar;
the crossbar, which is provided with a plurality of input ports and a plurality of output ports, the input ports being respectively connected to the stream processors and the output ports being respectively connected to the stream processors and to the on-chip interconnection network of the GPU, wherein the crossbar carries out communication among the stream processors and between the stream processors and the on-chip interconnection network;
and an L1 index routing module, which is arranged in the crossbar and is used for calculating the index of the corresponding first-level cache according to the address of an access request sent by a stream processor and sending the access request, through the crossbar, to the stream processor containing the corresponding first-level cache.
2. The cluster architecture for a GPU of claim 1, wherein the communication carried out by the crossbar among the plurality of stream processors and between them and the on-chip interconnection network comprises:
receiving a memory access request sent by a stream processor, and sending the access request to a stream processor or to the on-chip interconnection network;
and receiving access response data returned by a stream processor or by the on-chip interconnection network, and sending the access response data to the stream processor that initiated the request.
3. The cluster architecture for a GPU of claim 2, wherein the stream processor is further provided with an access request queue unit, an access miss queue unit and an access response queue unit;
the access request queue unit is connected to the crossbar and is used for buffering access requests sent by the crossbar;
the access miss queue unit is connected to the crossbar and is used for buffering access requests that missed in the first-level cache;
the access response queue unit is connected to the crossbar and is used for buffering access response data after a first-level cache hit.
4. A first-level cache management method for the cluster architecture for a GPU as recited in any of claims 1-3, wherein the method comprises:
the crossbar receives a memory access request sent by a stream processor;
the L1 index routing module calculates the index of the first-level cache corresponding to the current access request, and sends the access request to the port of the crossbar connected to the stream processor containing the corresponding first-level cache;
the crossbar sends the access request to the stream processor containing the corresponding first-level cache, so that the access request accesses the first-level cache of that stream processor;
whether the first-level cache hits is judged;
if the first-level cache hits, the crossbar receives the returned access response data and sends the access response data to the stream processor that initiated the access request;
if the first-level cache misses, the crossbar receives the returned access request and sends it to the on-chip interconnection network so that the access request accesses the storage system of the GPU; the crossbar then receives the access response data returned by the storage system through the on-chip interconnection network and sends the access response data to the stream processor that initiated the access request.
5. The method of claim 4, wherein the L1 index routing module calculates, according to the address of the access request, the index of the first-level cache corresponding to the current access request through address bit mapping or hash mapping.
6. The method according to claim 4, wherein an access request queue unit, an access miss queue unit and an access response queue unit are provided in the stream processor;
and the access request queue unit is used to receive and buffer access requests sent by the crossbar, the access miss queue unit is used to receive and buffer access requests that missed in the first-level cache, and the access response queue unit is used to receive and buffer access response data after a first-level cache hit.
CN202310660072.8A 2023-06-06 2023-06-06 Cluster architecture for GPU and internal first-level cache management method thereof Pending CN116881192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310660072.8A CN116881192A (en) 2023-06-06 2023-06-06 Cluster architecture for GPU and internal first-level cache management method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310660072.8A CN116881192A (en) 2023-06-06 2023-06-06 Cluster architecture for GPU and internal first-level cache management method thereof

Publications (1)

Publication Number Publication Date
CN116881192A 2023-10-13

Family

ID=88263310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310660072.8A Pending CN116881192A (en) 2023-06-06 2023-06-06 Cluster architecture for GPU and internal first-level cache management method thereof

Country Status (1)

Country Link
CN (1) CN116881192A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117555824A (en) * 2024-01-12 2024-02-13 深圳中微电科技有限公司 Cache storage architecture in GPU simulator based on MVP architecture

Similar Documents

Publication Publication Date Title
US10339061B2 (en) Caching for heterogeneous processors
CN110347635B (en) Heterogeneous multi-core microprocessor based on multilayer bus
US7865570B2 (en) Memory server
US8700857B2 (en) Optimizing memory copy routine selection for message passing in a multicore architecture
KR102409024B1 (en) Multi-core interconnect in a network processor
WO2014094374A1 (en) Method for constructing multiprocessor system with node having a plurality of cache uniformity domains
US9256555B2 (en) Method and system for queue descriptor cache management for a host channel adapter
US11343177B2 (en) Technologies for quality of service based throttling in fabric architectures
CN116881192A (en) Cluster architecture for GPU and internal first-level cache management method thereof
WO2016019566A1 (en) Memory management method, device and system and network-on-chip
CN111190735A (en) Linux-based on-chip CPU/GPU (Central processing Unit/graphics processing Unit) pipelined computing method and computer system
CN102375789A (en) Non-buffer zero-copy method of universal network card and zero-copy system
GB2610015A (en) Cache for storing coherent and non-coherent data
WO2014206229A1 (en) Accelerator and data processing method
CN111858096B (en) Directory-based method and system for monitoring reading of cache at shortest distance
WO2013185660A1 (en) Instruction storage device of network processor and instruction storage method for same
Li et al. Designing registration caching free high-performance MPI library with implicit on-demand paging (ODP) of InfiniBand
CN112148453A (en) Computing chip for privacy computation and network computing system
US11874783B2 (en) Coherent block read fulfillment
US11841793B2 (en) Switch-based free memory tracking in data center environments
US20230129107A1 (en) Method and apparatus to aggregate objects to be stored in a memory to optimize the memory bandwidth
Kornaros RSMCC: Enabling Ring-based Software Managed Cache-Coherent Embedded SoCs
Chaves et al. Exploiting multicast messages in cache-coherence protocols for NoC-based MPSoCs
CN116132375A (en) Multi-node arbitrary inter-core global communication method based on domestic DSP
CN116957902A (en) NoC arbitration method for GPU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination