CN116881192A - Cluster architecture for GPU and internal first-level cache management method thereof

Cluster architecture for GPU and internal first-level cache management method thereof

Info

Publication number
CN116881192A
Authority
CN
China
Prior art keywords
access request
level cache
access
stream processor
crossbar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310660072.8A
Other languages
Chinese (zh)
Inventor
赵夏
王璐
张光达
王会权
何益百
温家辉
方健
蒋艳德
赵杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN202310660072.8A
Publication of CN116881192A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a cluster architecture for a GPU and an internal first-level cache management method thereof, wherein the cluster architecture comprises: a plurality of stream processors, the plurality of stream processors being connected to a crossbar; the crossbar, which is provided with a plurality of input ports and a plurality of output ports, the input ports being respectively connected to the stream processors and the output ports being respectively connected to the stream processors and to the on-chip interconnection network of the GPU, wherein the crossbar carries out communication among the stream processors and between the stream processors and the on-chip interconnection network; and an L1 index routing module, which is arranged in the crossbar and is used for calculating the index of the corresponding first-level cache according to the address of an access request sent by a stream processor and sending the access request, through the crossbar, to the stream processor containing the corresponding first-level cache. The invention realizes first-level cache sharing among the stream processors in the cluster architecture, makes full use of the first-level cache resources in the GPU, and improves the performance of the stream processors and the GPU.

Description

Cluster architecture for GPU and internal first-level cache management method thereof
Technical Field
The invention relates to the technical field of processors, in particular to a cluster architecture for a GPU and an internal first-level cache management method thereof.
Background
A graphics processor (Graphics Processing Unit, GPU) is a microprocessor that performs image- and graphics-related computation. Owing to their powerful computing capability, GPUs are widely used in cloud computing platforms and data centers to provide the necessary computation to users.
In a GPU, an on-chip interconnection network connects the stream processors (Streaming Multiprocessor, SM) to the last-level cache (LLC) and the storage system. To keep increasing the computing power of GPUs, the number of stream processors in a GPU keeps growing, and connecting numerous stream processors directly to the on-chip interconnection network would greatly increase the hardware overhead of the network and therefore of the whole GPU. Referring to fig. 1, a schematic diagram of an exemplary on-chip interconnection structure of a GPU, current GPUs therefore adopt a cluster structure to solve the scalability problem of the on-chip interconnection network: several stream processors are grouped together, and the stream processors in each cluster are connected to a port of the on-chip interconnection network through a crossbar (Crossbar) arranged inside the cluster. Meanwhile, the first-level cache (L1 cache) of each stream processor in the GPU adopts a private design, i.e., the L1 cache in each stream processor is owned exclusively by that stream processor. When the memory access unit in a stream processor issues a memory access request, the request accesses the L1 cache of the current stream processor; if the L1 cache hits, data is returned and the current thread continues to execute; if the L1 cache misses, the current thread is suspended, and the access request is sent through the crossbar in the cluster to the on-chip interconnection network and then routed to the corresponding LLC and memory controller for access; after the access request completes, the returned data message reaches the corresponding cluster through the on-chip interconnection network and is sent through the crossbar to the stream processor that initiated the request, and the suspended thread in that stream processor resumes execution.
Constrained by on-chip resources, the capacity of the L1 cache cannot grow without bound; in the Nvidia A100 GPU, for example, the L1 cache in each stream processor is only 192KB. Because a GPU program generates a large number of memory access requests at run time, the L1 cache faces a high miss rate, and threads in the stream processor stall while waiting for data, which greatly constrains the performance of the stream processor and seriously affects the performance of the GPU.
Disclosure of Invention
In order to solve some or all of the technical problems in the prior art, the present invention provides a cluster architecture for a GPU and an internal first-level cache management method thereof.
The technical scheme of the invention is as follows:
in a first aspect, a cluster architecture for a GPU is provided, the cluster architecture comprising:
a plurality of stream processors, wherein the plurality of stream processors are connected to a crossbar;
the crossbar, which is provided with a plurality of input ports and a plurality of output ports, the input ports being respectively connected to the stream processors and the output ports being respectively connected to the stream processors and to the on-chip interconnection network of the GPU, wherein the crossbar carries out communication among the stream processors and between the stream processors and the on-chip interconnection network;
and an L1 index routing module, which is arranged in the crossbar and is used for calculating the index of the corresponding first-level cache according to the address of an access request sent by a stream processor and sending the access request, through the crossbar, to the stream processor containing the corresponding first-level cache.
In some possible implementations, the communication carried out by the crossbar among the plurality of stream processors and between them and the on-chip interconnection network includes:
receiving a memory access request sent by a stream processor, and sending the access request to a stream processor or to the on-chip interconnection network;
and receiving access response data returned by a stream processor or by the on-chip interconnection network, and sending the access response data to the stream processor that initiated the request.
In some possible implementations, the stream processor is further provided with an access request queue unit, an access miss queue unit and an access response queue unit;
the access request queue unit is connected to the crossbar and is used for buffering access requests sent by the crossbar;
the access miss queue unit is connected to the crossbar and is used for buffering access requests that missed in the first-level cache;
the access response queue unit is connected to the crossbar and is used for buffering access response data after a first-level cache hit.
In a second aspect, a first-level cache management method for the above cluster architecture is provided, the method comprising:
the crossbar receives a memory access request sent by a stream processor;
the L1 index routing module calculates the index of the first-level cache corresponding to the current access request, and sends the access request to the port of the crossbar connected to the stream processor containing the corresponding first-level cache;
the crossbar sends the access request to the stream processor containing the corresponding first-level cache, so that the access request accesses the first-level cache of that stream processor;
whether the first-level cache hits is judged;
if the first-level cache hits, the crossbar receives the returned access response data and sends the access response data to the stream processor that initiated the access request;
if the first-level cache misses, the crossbar receives the returned access request and sends it to the on-chip interconnection network so that the access request accesses the storage system of the GPU; the crossbar then receives the access response data returned by the storage system through the on-chip interconnection network and sends the access response data to the stream processor that initiated the access request.
In some possible implementations, the L1 index routing module calculates, according to the address of the access request, the index of the first-level cache corresponding to the current access request through address bit mapping or hash mapping.
In some possible implementations, an access request queue unit, an access miss queue unit and an access response queue unit are provided in the stream processor;
the access request queue unit is used to receive and buffer access requests sent by the crossbar, the access miss queue unit is used to receive and buffer access requests that missed in the first-level cache, and the access response queue unit is used to receive and buffer access response data after a first-level cache hit.
The main advantages of the technical scheme of the invention are as follows:
the cluster architecture for the GPU and the internal first-level cache management method thereof can realize the first-level cache sharing of each stream processor in the cluster architecture on the basis of setting the cluster architecture to reduce the hardware overhead of an on-chip interconnection network and the GPU, can fully utilize first-level cache resources in the GPU, improve the performances of the stream processor and the GPU, and improve the utilization rate of system resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention without limiting it. In the drawings:
FIG. 1 is a schematic diagram of an on-chip network interconnect structure of an exemplary GPU;
FIG. 2 is a schematic diagram of a cluster architecture according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a first level cache management method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to specific embodiments of the present invention and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes in detail the technical scheme provided by the embodiment of the invention with reference to the accompanying drawings.
Referring to FIG. 2, in a first aspect, an embodiment of the present invention provides a cluster architecture for a GPU, the cluster architecture comprising:
a plurality of stream processors, wherein the plurality of stream processors are connected to a crossbar;
the crossbar, which is provided with a plurality of input ports and a plurality of output ports, the input ports being respectively connected to the stream processors and the output ports being respectively connected to the stream processors and to the on-chip interconnection network of the GPU, wherein the crossbar carries out communication among the stream processors and between the stream processors and the on-chip interconnection network;
and an L1 index routing module, which is arranged in the crossbar and is used for calculating the index of the corresponding first-level cache according to the address of an access request sent by a stream processor and sending the access request, through the crossbar, to the stream processor containing the corresponding first-level cache.
Based on this structure, when the cluster architecture provided by the embodiment of the invention performs a data access, a stream processor in the cluster initiates a memory access request and sends it to the crossbar, and the crossbar receives the access request; the L1 index routing module in the crossbar calculates, according to the address of the access request and through address bit mapping or hash mapping, the index of the first-level cache corresponding to the current access request, and sends the access request to the port of the crossbar connected to the stream processor containing the corresponding first-level cache; the crossbar sends the access request to that stream processor so that the request accesses its first-level cache. When the first-level cache hits, the stream processor returns the corresponding access response data, and the crossbar receives the data and sends it to the stream processor that initiated the request. When the first-level cache misses, the stream processor returns the access request, and the crossbar receives it and sends it to the on-chip interconnection network so that the request accesses the storage system of the GPU; the storage system returns the corresponding access response data, which reaches the crossbar through the on-chip interconnection network; the crossbar receives the data and sends it to the stream processor that initiated the request.
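To make this flow concrete, the following C++ sketch models one request passing through the crossbar. It is a behavioral illustration only, not the patented hardware: all type, member and function names are hypothetical, the 128-byte line size, the low-order-bit owner selection and the allocate-on-miss fill are assumptions, and a real design would pipeline these steps as hardware messages rather than function calls.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    // Toy model of one stream processor's first-level cache slice.
    struct L1Slice {
        std::unordered_map<uint64_t, uint32_t> lines;  // line number -> data
        bool Lookup(uint64_t line, uint32_t& data) {
            auto it = lines.find(line);
            if (it == lines.end()) return false;       // miss
            data = it->second;                         // hit
            return true;
        }
    };

    struct ClusterCrossbar {
        std::vector<L1Slice> sms;                      // one slice per SM
        static constexpr unsigned kLineBits = 7;       // assume 128B lines

        // L1 index routing module: pick the owning SM from the address.
        unsigned Owner(uint64_t addr) const {
            return static_cast<unsigned>((addr >> kLineBits) % sms.size());
        }

        // One memory access request arriving on input port `src`.
        uint32_t Handle(unsigned src, uint64_t addr) {
            (void)src;  // in hardware, src selects the response's return port
            unsigned owner = Owner(addr);
            uint64_t line = addr >> kLineBits;
            uint32_t data;
            if (sms[owner].Lookup(line, data))
                return data;                   // hit: response back to src
            data = FetchFromNoC(addr);         // miss: forward to the NoC
            sms[owner].lines[line] = data;     // assumed allocate-on-miss fill
            return data;                       // response back to src
        }

        uint32_t FetchFromNoC(uint64_t addr) { // stand-in for LLC/DRAM access
            return static_cast<uint32_t>(addr * 2654435761u);
        }
    };

    int main() {
        ClusterCrossbar xbar{std::vector<L1Slice>(4)};  // assume 4 SMs
        uint32_t a = xbar.Handle(0, 0x1000);  // SM0: misses, filled via NoC
        uint32_t b = xbar.Handle(3, 0x1000);  // SM3: hits the shared slice
        std::printf("%u %u\n", a, b);         // same data; second is an L1 hit
    }

The point of the shared design shows in the second call: a different stream processor hits on data first fetched by another, which a private L1 cannot do.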
According to the above data access process, in an embodiment of the present invention the communication carried out by the crossbar among the plurality of stream processors and between them and the on-chip interconnection network includes:
receiving a memory access request sent by a stream processor, and sending the access request to a stream processor or to the on-chip interconnection network;
and receiving access response data returned by a stream processor or by the on-chip interconnection network, and sending the access response data to the stream processor that initiated the request.
Further, in an embodiment of the present invention, the crossbar in the cluster architecture may adopt an N×(N+1) port configuration, where N is the number of stream processors in the cluster: the N input ports are respectively connected to the N stream processors, and the N+1 output ports are respectively connected to the N stream processors and to the on-chip interconnection network. For example, a cluster of 8 stream processors would use a crossbar with 8 input ports and 9 output ports.
In order to realize first-level cache sharing among the stream processors in the cluster architecture and ensure that the memory access process proceeds in order, in one embodiment of the invention the stream processor is further provided with an access request queue unit, an access miss queue unit and an access response queue unit; the access request queue unit is connected to the crossbar and is used for buffering access requests sent by the crossbar; the access miss queue unit is connected to the crossbar and is used for buffering access requests that missed in the first-level cache; the access response queue unit is connected to the crossbar and is used for buffering access response data after a first-level cache hit.
Specifically, an access request sent by the crossbar is stored in the access request queue unit in a queue structure while it waits to access the first-level cache of the stream processor; when the first-level cache is accessed, the access request queue unit takes out the access request at the head of the queue. When the first-level cache misses, the corresponding access request is stored in the access miss queue unit in a queue structure while it waits to be sent to the crossbar; when sending to the crossbar, the access miss queue unit takes out the access request at the head of the queue. When the first-level cache hits, the corresponding access response data is stored in the access response queue unit in a queue structure while it waits to be sent to the crossbar; when sending to the crossbar, the access response queue unit takes out the access response data at the head of the queue.
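As a minimal C++ sketch of these three units (all names, the depths and the toy tag check are assumptions; the embodiment specifies only the head-of-queue, first-in-first-out behavior), each unit can be modeled as a simple queue drained from the front:

    #include <cstdint>
    #include <cstdio>
    #include <queue>

    struct MemRequest  { uint64_t addr; unsigned srcSM; };
    struct MemResponse { uint64_t addr; uint32_t data; unsigned srcSM; };

    struct SMQueueUnits {
        std::queue<MemRequest>  requestQ;   // access request queue unit
        std::queue<MemRequest>  missQ;      // access miss queue unit
        std::queue<MemResponse> responseQ;  // access response queue unit

        // Each step, the head request is taken out and probes this SM's L1.
        void StepL1Access() {
            if (requestQ.empty()) return;
            MemRequest req = requestQ.front();   // head-of-queue request
            requestQ.pop();
            uint32_t data;
            if (ProbeL1(req.addr, data))
                responseQ.push({req.addr, data, req.srcSM});  // hit
            else
                missQ.push(req);                              // miss
        }

        // Toy tag check standing in for a real L1 lookup.
        bool ProbeL1(uint64_t addr, uint32_t& data) {
            if ((addr & 0xFF) != 0) return false;
            data = static_cast<uint32_t>(addr);
            return true;
        }
    };

    int main() {
        SMQueueUnits q;
        q.requestQ.push({0x100, 2});  // hits in the toy L1
        q.requestQ.push({0x101, 5});  // misses
        q.StepL1Access();
        q.StepL1Access();
        std::printf("responses=%zu misses=%zu\n",
                    q.responseQ.size(), q.missQ.size());
    }

The stream processor then drains the miss queue and the response queue toward the crossbar in the same head-of-queue fashion, as described above.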
Referring to fig. 3, in a second aspect, an embodiment of the present invention further provides a first-level cache management method for the above cluster architecture, the method comprising steps S1 to S6:
step S1, the crossbar receives a memory access request sent by a stream processor;
step S2, the L1 index routing module calculates the index of the first-level cache corresponding to the current access request, and sends the access request to the port of the crossbar connected to the stream processor containing the corresponding first-level cache;
step S3, the crossbar sends the access request to the stream processor containing the corresponding first-level cache, so that the access request accesses the first-level cache of that stream processor;
step S4, whether the first-level cache hits is judged;
step S5, if the first-level cache hits, the crossbar receives the returned access response data and sends the access response data to the stream processor that initiated the access request;
step S6, if the first-level cache misses, the crossbar receives the returned access request and sends it to the on-chip interconnection network so that the access request accesses the storage system of the GPU; the crossbar then receives the access response data returned by the storage system through the on-chip interconnection network and sends the access response data to the stream processor that initiated the access request.
Further, in an embodiment of the present invention, the L1 index routing module calculates, according to the address of the memory access request, the index of the first-level cache corresponding to the current access request through address bit mapping or hash mapping.
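The embodiment leaves the concrete mapping open; as an illustrative C++ sketch under assumed parameters (128-byte lines, 8 stream processors per cluster, an arbitrary XOR-fold as the hash), the two options might look as follows:

    #include <cstdint>
    #include <cstdio>

    constexpr unsigned kLineBits = 7;  // assumed 128-byte cache lines
    constexpr unsigned kNumSMs   = 8;  // assumed 8 stream processors per cluster

    // Address bit mapping: take the low-order bits just above the line offset.
    unsigned L1IndexByBits(uint64_t addr) {
        return static_cast<unsigned>((addr >> kLineBits) % kNumSMs);
    }

    // Hash mapping: XOR-fold higher address bits into the index so that
    // power-of-two strides spread across the L1 slices instead of all
    // landing on one stream processor.
    unsigned L1IndexByHash(uint64_t addr) {
        uint64_t line = addr >> kLineBits;
        line ^= line >> 16;
        line ^= line >> 8;
        return static_cast<unsigned>(line % kNumSMs);
    }

    int main() {
        std::printf("%u %u\n", L1IndexByBits(0x4080), L1IndexByHash(0x4080));
    }

Address bit mapping is essentially free in hardware, while a hash mapping costs a few XOR gates but spreads power-of-two access strides across the first-level cache slices instead of concentrating them on a single stream processor.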
In an embodiment of the present invention, in order to realize first-level cache sharing among the stream processors in the cluster architecture and ensure that the memory access process proceeds in order, an access request queue unit, an access miss queue unit and an access response queue unit may be provided in the stream processor; the access request queue unit is used to receive and buffer access requests sent by the crossbar, the access miss queue unit is used to receive and buffer access requests that missed in the first-level cache, and the access response queue unit is used to receive and buffer access response data after a first-level cache hit.
On the basis of using a cluster architecture to reduce the hardware overhead of the on-chip interconnection network and of the GPU, the cluster architecture for a GPU and the internal first-level cache management method thereof provided by the embodiments of the invention realize first-level cache sharing among the stream processors in the cluster, so that the first-level cache resources in the GPU can be fully utilized, the performance of the stream processors and of the GPU is improved, and the utilization of system resources is increased.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. In this context, "front", "rear", "left", "right", "upper" and "lower" are referred to with respect to the placement state shown in the drawings.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting thereof; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A cluster architecture for a GPU, the cluster architecture comprising:
a plurality of stream processors, wherein the plurality of stream processors are connected to a crossbar;
the crossbar, which is provided with a plurality of input ports and a plurality of output ports, the input ports being respectively connected to the stream processors and the output ports being respectively connected to the stream processors and to the on-chip interconnection network of the GPU, wherein the crossbar carries out communication among the stream processors and between the stream processors and the on-chip interconnection network;
and an L1 index routing module, which is arranged in the crossbar and is used for calculating the index of the corresponding first-level cache according to the address of an access request sent by a stream processor and sending the access request, through the crossbar, to the stream processor containing the corresponding first-level cache.
2. The cluster architecture for a GPU of claim 1, wherein the communication carried out by the crossbar among the plurality of stream processors and between them and the on-chip interconnection network comprises:
receiving a memory access request sent by a stream processor, and sending the access request to a stream processor or to the on-chip interconnection network;
and receiving access response data returned by a stream processor or by the on-chip interconnection network, and sending the access response data to the stream processor that initiated the request.
3. The cluster architecture for a GPU of claim 2, wherein the stream processor is further provided with an access request queue unit, an access miss queue unit and an access response queue unit;
the access request queue unit is connected to the crossbar and is used for buffering access requests sent by the crossbar;
the access miss queue unit is connected to the crossbar and is used for buffering access requests that missed in the first-level cache;
the access response queue unit is connected to the crossbar and is used for buffering access response data after a first-level cache hit.
4. A first-level cache management method for the cluster architecture for a GPU as recited in any of claims 1-3, wherein the method comprises:
the crossbar receives a memory access request sent by a stream processor;
the L1 index routing module calculates the index of the first-level cache corresponding to the current access request, and sends the access request to the port of the crossbar connected to the stream processor containing the corresponding first-level cache;
the crossbar sends the access request to the stream processor containing the corresponding first-level cache, so that the access request accesses the first-level cache of that stream processor;
whether the first-level cache hits is judged;
if the first-level cache hits, the crossbar receives the returned access response data and sends the access response data to the stream processor that initiated the access request;
if the first-level cache misses, the crossbar receives the returned access request and sends it to the on-chip interconnection network so that the access request accesses the storage system of the GPU; the crossbar then receives the access response data returned by the storage system through the on-chip interconnection network and sends the access response data to the stream processor that initiated the access request.
5. The method of claim 4, wherein the L1 index routing module calculates, according to the address of the access request, the index of the first-level cache corresponding to the current access request through address bit mapping or hash mapping.
6. The method according to claim 4, wherein an access request queue unit, an access miss queue unit and an access response queue unit are provided in the stream processor;
and the access request queue unit is used to receive and buffer access requests sent by the crossbar, the access miss queue unit is used to receive and buffer access requests that missed in the first-level cache, and the access response queue unit is used to receive and buffer access response data after a first-level cache hit.
CN202310660072.8A 2023-06-06 2023-06-06 Cluster architecture for GPU and internal first-level cache management method thereof Pending CN116881192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310660072.8A CN116881192A (en) 2023-06-06 2023-06-06 Cluster architecture for GPU and internal first-level cache management method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310660072.8A CN116881192A (en) 2023-06-06 2023-06-06 Cluster architecture for GPU and internal first-level cache management method thereof

Publications (1)

Publication Number Publication Date
CN116881192A 2023-10-13

Family

ID=88263310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310660072.8A Pending CN116881192A (en) 2023-06-06 2023-06-06 Cluster architecture for GPU and internal first-level cache management method thereof

Country Status (1)

Country Link
CN (1) CN116881192A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117555824A (en) * 2024-01-12 2024-02-13 深圳中微电科技有限公司 Cache storage architecture in GPU simulator based on MVP architecture

Similar Documents

Publication Publication Date Title
US10339061B2 (en) Caching for heterogeneous processors
CN110347635B (en) Heterogeneous multi-core microprocessor based on multilayer bus
US7865570B2 (en) Memory server
US8700857B2 (en) Optimizing memory copy routine selection for message passing in a multicore architecture
KR102409024B1 (en) Multi-core interconnect in a network processor
WO2014094374A1 (en) Method for constructing multiprocessor system with node having a plurality of cache uniformity domains
US9256555B2 (en) Method and system for queue descriptor cache management for a host channel adapter
US11343177B2 (en) Technologies for quality of service based throttling in fabric architectures
CN116881192A (en) Cluster architecture for GPU and internal first-level cache management method thereof
WO2016019566A1 (en) Memory management method, device and system and network-on-chip
CN111190735A (en) Linux-based on-chip CPU/GPU (Central processing Unit/graphics processing Unit) pipelined computing method and computer system
CN102375789A (en) Non-buffer zero-copy method of universal network card and zero-copy system
GB2610015A (en) Cache for storing coherent and non-coherent data
WO2014206229A1 (en) Accelerator and data processing method
CN111858096B (en) Directory-based method and system for monitoring reading of cache at shortest distance
WO2013185660A1 (en) Instruction storage device of network processor and instruction storage method for same
Li et al. Designing registration caching free high-performance MPI library with implicit on-demand paging (ODP) of InfiniBand
CN112148453A (en) Computing chip for privacy computation and network computing system
US11874783B2 (en) Coherent block read fulfillment
US11841793B2 (en) Switch-based free memory tracking in data center environments
US20230129107A1 (en) Method and apparatus to aggregate objects to be stored in a memory to optimize the memory bandwidth
Kornaros RSMCC: Enabling Ring-based Software Managed Cache-Coherent Embedded SoCs
Chaves et al. Exploiting multicast messages in cache-coherence protocols for NoC-based MPSoCs
CN116132375A (en) Multi-node arbitrary inter-core global communication method based on domestic DSP
CN116957902A (en) NoC arbitration method for GPU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination