CN109408411A - L1 Cache management method for GPGPU based on data access count - Google Patents

L1 Cache management method for GPGPU based on data access count

Info

Publication number
CN109408411A
Authority
CN
China
Prior art keywords
cache
value
data
data block
counter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811113134.9A
Other languages
Chinese (zh)
Inventor
章铁飞
傅均
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University
Priority to CN201811113134.9A
Publication of CN109408411A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention discloses an L1 Cache management method for GPGPU based on data access count, which specifically includes: hardware modifications to the L1 Cache; an L1 Cache management policy based on the DA counter value; and periodic adjustment of the unified default DA value. The invention aims to solve the L1 Cache data block thrashing problem of GPGPU. The main idea is to add an access-count counter to each cache block in the L1 Cache and to compare the counter value with a default setting value in order to decide operations such as replacement and bypass of cache blocks. The goal is to keep frequently accessed cache blocks in the cache to improve the hit rate, and to bypass cache blocks that will not be accessed again to improve the utilization of cache space, thereby solving the data block thrashing problem of the L1 Cache to the greatest extent.

Description

L1 Cache management method for GPGPU based on data access count
Technical field
The present invention relates to an L1 Cache management method for GPGPU. For the data block thrashing problem of the L1 Cache, it proposes a solution based on the data access count: frequently accessed cache blocks are kept in the cache to improve the hit rate, while cache blocks that will not be accessed again are bypassed to improve the utilization of cache space, thereby solving the data block thrashing problem of the L1 Cache to the greatest extent.
Background art
Compared with conventional processors (CPUs), general-purpose graphics processing units (GPGPUs) are better suited to computing tasks with a high degree of data parallelism and offer better computational energy efficiency. Based on the CUDA and OpenCL programming frameworks, a GPGPU can accelerate tasks in many fields, such as the currently popular machine learning applications. A GPGPU contains multiple independent compute cores (SIMT Cores) that can compute independently and simultaneously, giving it high concurrent computing capability. Similar to a CPU, a GPGPU stores the code and data of its computing tasks in off-chip DRAM, and the speed at which the GPGPU processor computes is much higher than the speed of DRAM data access, so the GPGPU also needs a complex memory hierarchy to bridge the speed gap between the processor and DRAM.
The memory hierarchy of a GPGPU includes registers, the L1 cache, the shared L2 cache, and off-chip DRAM. A typical GPGPU contains multiple SIMT Cores, each running multiple thread warps to raise thread-level parallelism. Each SIMT Core has a private L1 cache, and all SIMT Cores are connected through an internal bus to a shared L2 cache. The L2 cache is responsible for controlling the consistency of cached data and uses a banked structure; each bank is connected to off-chip DRAM through a private memory channel. If the target data of an access request is not in the L1 Cache, the request is classified as a miss request and sent to the L2 Cache; if the target data hits in the L2 Cache, the L2 Cache sends the data to the L1 Cache; otherwise, the L2 Cache generates an access miss request and sends it to the next-level memory.
The basic storage unit in the L1 Cache of a GPGPU is the cache block; the data block each cache block can hold is usually 128 bytes, and every four cache blocks form a cache set. A data block that is read in is mapped to and stored in a free cache block of a particular cache set. If that cache set has no free cache block, the data block can either bypass the L1 Cache and go directly to the processing core, or a cache block in the set can be selected for data block replacement. When a replaced data block is accessed again, the L1 Cache data block thrashing (cache block thrashing) problem may arise. Cache data block thrashing occurs when a data block has been fetched into the cache but, before it is accessed again, is replaced by newly arriving cache blocks because cache space is limited; when the block is accessed again it is no longer in the cache and must be fetched from the next-level memory with a very long latency, which hurts access efficiency and performance. Each SIMT Core in a GPGPU schedules and executes multiple thread warps, and all threads share the same private L1 Cache; the data blocks of the current warp in the L1 Cache are easily replaced by the data blocks of the warps executed in the next round, so the multithreaded execution environment of the GPGPU further aggravates the data block thrashing problem.
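By way of illustration only, the following minimal C++ sketch shows how a data block address could map onto a 4-way set of 128-byte cache blocks as described above. The block size and associativity come from the description; the total number of sets and the exact indexing scheme are assumptions made for the example.

```cpp
#include <cstdint>
#include <cstdio>

// Parameters taken from the description: 128-byte blocks, 4 blocks per set.
constexpr uint32_t BLOCK_SIZE = 128;
constexpr uint32_t WAYS       = 4;
constexpr uint32_t NUM_SETS   = 64;   // assumed (32 KB total) for illustration only

// Map a byte address to a set index and tag. A simple modulo indexing scheme is
// assumed here; the patent does not fix the exact indexing function.
void decode(uint64_t addr, uint32_t& set, uint64_t& tag) {
    uint64_t blockAddr = addr / BLOCK_SIZE;
    set = static_cast<uint32_t>(blockAddr % NUM_SETS);
    tag = blockAddr / NUM_SETS;
}

int main() {
    uint32_t set; uint64_t tag;
    decode(0x12345680, set, tag);
    std::printf("set=%u tag=%llu\n", set, (unsigned long long)tag);
    return 0;
}
```

When all four ways of the selected set are occupied, the incoming block must either replace a resident block or bypass the L1 Cache, which is exactly the situation the DA-based policy described below is designed to handle.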
One way to solve cache data block thrashing is to increase cache capacity, but increasing the L1 Cache capacity brings significant negative effects, such as higher energy consumption, longer data access latency, and higher cost, and is therefore impractical. On the other hand, different application programs on the GPGPU exhibit different cache access patterns. For some applications, once a data block has been read into the L1 Cache it will not be accessed again, as in the typical streaming access pattern, and the cache data block thrashing problem simply does not exist. For other applications, the data blocks read into the L1 Cache show a markedly uneven re-access rate: most of the data blocks read into the L1 Cache are never accessed again before they are replaced, while a small number of data blocks are accessed repeatedly. If frequently accessed data blocks are evicted from the L1 Cache too early, the cache data block thrashing problem arises. Therefore, as long as the data blocks that will no longer be accessed are replaced while the small number of repeatedly accessed data blocks are kept in the cache, the cache data block thrashing problem can be effectively avoided.
Summary of the invention
To overcome the above drawbacks of the prior art, the present invention proposes an L1 Cache management method based on the data access (Data Access, DA) count.
The content and features of the present invention are as follows: each cache block in the L1 Cache is given a DA counter, and decisions such as replacement and bypass of cache blocks are made according to the size of the DA value; frequently accessed data blocks are kept in the cache to contribute to the hit rate, and data blocks that will not be accessed again are bypassed to improve cache space utilization, thereby alleviating the cache data block thrashing problem of the L1 Cache to the greatest extent.
The L1 Cache management method for GPGPU based on data access count of the present invention comprises the following technical steps:
1) hardware modifications to the L1 Cache;
A hit counter that counts hits is added to the L1 Cache; when a data access request occurs, if the data block is in the L1 Cache, the hit counter is incremented by 1. Each cache block in the L1 Cache is given a 4-bit data access counter (DA), initialized to a unified default value; each time a data access reaches the cache set to which a cache block belongs, the DA values of all cache blocks in that set are decremented by 1. A bypass address recorder is added to the L1 Cache, comprising an access count counter, a random mapping function, a data-bit memory, and a hit count counter;
2) an L1 Cache management policy based on the DA counter value;
If the target data block of a data access request is not in the L1 Cache, the L2 Cache must be accessed to read the data block. If there is a cache block in the target cache set of the L1 Cache that is free or whose DA value is zero, the data block that has been read is filled into that cache block; otherwise, bypass processing is applied to the data block that has been read, the data block bypasses the L1 Cache, and the address of the data block is sent to the bypass address recorder for processing;
3) the unified default DA value is adjusted periodically;
Different application programs exhibit different L1 Cache access patterns, and different default DA values are needed to manage the L1 Cache most effectively, so the unified default DA value must be adjusted periodically. The bypass address recorder determines the adjustment period according to the number and addresses of the data blocks bypassed so far, and the hit ratios of the L1 Cache and the bypass address recorder are compared to decide whether to increase, decrease, or keep the current DA value.
The advantages of the present invention are: the method is simple; based on the additional DA count value of each cache block, frequently accessed data blocks are kept in the cache to improve the hit rate, and data blocks that will not be accessed again are bypassed to improve cache space utilization; in addition, the hardware cost is low.
Description of the drawings
Fig. 1 is a diagram of the GPGPU memory hierarchy used by the method of the present invention.
Fig. 2 is a diagram of an L1 Cache cache block of the method of the present invention.
Fig. 3 is a diagram of the bypass address recorder of the method of the present invention.
Specific embodiments
The technical solution of the method of the present invention is further described below with reference to the accompanying drawings.
Fig. 1 shows the memory hierarchy of a GPGPU: each SIMT Core has a private L1 Cache and is connected through an internal bus to the L2 Cache; the L2 Cache is divided into multiple banks, and each bank is connected to the external DRAM via a private channel. Fig. 2 shows an L1 Cache cache block: each cache block carries an additional 4-bit DA counter for tracking accesses to its cache set. Fig. 3 shows the bypass address recorder: f(addr) is the random mapping function, which takes the address addr as input and whose output points to a particular bit of the data-bit memory.
The L1 Cache management method for GPGPU based on data access count of the present invention comprises the following technical steps:
1. Hardware modifications to the L1 Cache;
A hit counter that records the number of L1 Cache hits is added; when a data access request occurs, if the target data block is in the L1 Cache, the hit counter is incremented by 1. Each cache block in the L1 Cache is given a 4-bit DA counter (shown in Fig. 2), initialized to a unified default value; each time a data access reaches the cache set to which a cache block belongs, the DA values of all cache blocks in that set are decremented by 1. A bypass address recorder (BAR) is added, comprising an access count counter, a random mapping function, a data-bit memory, and a hit count counter. All bits of the data-bit memory are initialized to 0. Each time a data block address arrives, the access count counter is incremented by 1 and the random mapping function, taking the address as input, outputs a position pointing to a bit of the data-bit memory; if that bit is already 1, a hit is indicated and the hit count counter is incremented by 1; otherwise, the bit is set to 1.
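A minimal C++ sketch of the added state is given below. The per-block 4-bit DA counter, the L1 hit and read-in counters, and the BAR fields (access counter, random mapping coefficients, p-bit data-bit memory, hit counter) follow the description above; the number of sets, the DA default value, the prime p, and the number of address bits are illustrative assumptions.

```cpp
#include <array>
#include <cstdint>
#include <random>
#include <vector>

constexpr int      WAYS     = 4;     // blocks per set (from the description)
constexpr int      NUM_SETS = 64;    // assumed for illustration
constexpr uint8_t  DA_MAX   = 15;    // largest value of the 4-bit DA counter
constexpr uint32_t P        = 1021;  // prime size of the data-bit memory (assumed)

struct CacheBlock {
    bool     valid = false;
    uint64_t tag   = 0;
    uint8_t  da    = 0;              // 4-bit data access counter
};

// Bypass Address Recorder (BAR): access counter, random mapping coefficients,
// p-bit data-bit memory, and hit counter, as described in the text.
struct BypassAddressRecorder {
    uint64_t accesses = 0;
    uint64_t hits     = 0;
    std::vector<uint32_t> b;         // one coefficient in [0, P-1] per address bit
    std::vector<bool>     bits = std::vector<bool>(P, false);

    explicit BypassAddressRecorder(int addrBits, uint32_t seed = 1) : b(addrBits) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<uint32_t> dist(0, P - 1);
        for (auto& bi : b) bi = dist(rng);
    }
};

struct L1Cache {
    std::array<std::array<CacheBlock, WAYS>, NUM_SETS> sets{};
    uint64_t hitCounter  = 0;        // counts L1 hits
    uint64_t fillCounter = 0;        // counts blocks read into the L1 ("read-in" counter)
    uint8_t  daDefault   = 8;        // unified default DA value (assumed starting point)
    BypassAddressRecorder bar{40};   // 40 address bits assumed for illustration
};

int main() {
    L1Cache l1;                      // all DA counters start at 0 until blocks are filled
    return l1.sets.size() == NUM_SETS ? 0 : 1;
}
```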
2. L1 Cache management policy based on the DA counter value;
If the target data block of a data access request is in the L1 Cache, it is a hit: the DA value of the hit cache block is reset to the default value, the DA values of the other cache blocks in the same cache set are decremented by 1, and the hit counter is incremented by 1. If the data block requested by the access is not in the L1 Cache, a cache miss occurs and the L2 Cache must be accessed to read the data block. If there is a free cache block in the target cache set of the L1 Cache, the data block that has been read is filled into the free cache block, its DA value is initialized to the default value, and the L1 Cache read-in counter is incremented by 1.
If there is no free cache block in the target cache set, a cache block whose DA value is 0 is located (if several blocks have a DA value of zero, one of them is selected at random) and its data block is replaced with the newly read data block. If no cache block has a DA value of 0, bypass processing is applied to the newly read data block, i.e., it bypasses the cache and is used directly by the SIMT Core, and the address of the bypassed data block is passed to the BAR. The BAR comprises an access counter, a random mapping function, a data-bit memory, and a hit counter. The data-bit memory consists of p one-bit 0/1 cells, each initialized to 0; the output of the random mapping function determines which cell is to be set to 1. After a data block address is passed to the BAR, the access counter is incremented by 1. The data block address is used as the input of the random function, whose output lies in the range [0, p-1], where p is a prime number. If the output of the random function is x, the x-th bit of the data-bit memory is to be set to 1; if that bit is already 1, a BAR hit is indicated and the hit counter is incremented by 1; if the bit is 0, it is set to 1.
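The following self-contained C++ sketch illustrates this per-set policy: every access decrements the DA values in the set, a hit resets the hit block's DA to the default, a miss fills a free block or replaces a block whose DA value is 0 (random choice on ties), and otherwise the block is bypassed. The default DA value and the tie-breaking random source are assumptions for the example; reporting the bypassed address to the BAR is omitted here and shown in the next sketch.

```cpp
#include <cstdint>
#include <random>
#include <vector>

struct Block { bool valid = false; uint64_t tag = 0; uint8_t da = 0; };

constexpr uint8_t DA_DEFAULT = 8;   // unified default DA value (assumed)

enum class Outcome { Hit, FilledFree, Replaced, Bypassed };

Outcome accessSet(std::vector<Block>& set, uint64_t tag, std::mt19937& rng) {
    // Every access that reaches the set decrements the DA value of all resident blocks.
    for (auto& b : set)
        if (b.valid && b.da > 0) --b.da;

    // Hit: reset the hit block's DA value to the default.
    for (auto& b : set)
        if (b.valid && b.tag == tag) { b.da = DA_DEFAULT; return Outcome::Hit; }

    // Miss: fill a free block if one exists.
    for (auto& b : set)
        if (!b.valid) { b = {true, tag, DA_DEFAULT}; return Outcome::FilledFree; }

    // Otherwise replace a block whose DA value has reached 0 (random choice on ties).
    std::vector<size_t> zeros;
    for (size_t i = 0; i < set.size(); ++i)
        if (set[i].da == 0) zeros.push_back(i);
    if (!zeros.empty()) {
        size_t victim = zeros[std::uniform_int_distribution<size_t>(0, zeros.size() - 1)(rng)];
        set[victim] = {true, tag, DA_DEFAULT};
        return Outcome::Replaced;
    }

    // No free block and no DA==0 block: the incoming block bypasses the L1 Cache
    // and its address would be reported to the bypass address recorder.
    return Outcome::Bypassed;
}

int main() {
    std::vector<Block> set(4);
    std::mt19937 rng(42);
    accessSet(set, 0xA, rng);   // miss: fills a free block with DA = DA_DEFAULT
    accessSet(set, 0xA, rng);   // hit: DA of the block is reset to DA_DEFAULT
    return 0;
}
```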
Each data block address can be regarded as n 0/1 values and expressed as addr = (a1, a2, a3, ..., an); the random mapping function contains n corresponding integers (b1, b2, b3, ..., bn), where each bi belongs to [0, p-1], and maps the address to a position in [0, p-1].
If the addresses of two data blocks are different and they reach the BAR one after the other, the probability that they map to the same bit is only 1/p, so by choosing a large value of p, 1/p becomes negligibly small.
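The displayed formula of the random mapping function is not reproduced in this text, so the sketch below assumes the form implied by the surrounding description, f(addr) = (a1·b1 + a2·b2 + ... + an·bn) mod p, i.e., a weighted sum of the address bits with fixed random coefficients, which is consistent with the stated 1/p probability that two different addresses map to the same bit.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the bypass address recorder's mapping and recording step.
// Assumes f(addr) = (sum of a_i * b_i) mod p over the n address bits a_i (n <= 64 here),
// with fixed coefficients b_i in [0, p-1] and p prime.
struct BAR {
    uint32_t p;
    std::vector<uint32_t> b;       // n coefficients in [0, p-1]
    std::vector<bool>     bits;    // p-bit data-bit memory, initialized to 0
    uint64_t accesses = 0;
    uint64_t hits     = 0;

    BAR(uint32_t prime, std::vector<uint32_t> coeffs)
        : p(prime), b(std::move(coeffs)), bits(prime, false) {}

    // Random mapping function f(addr).
    uint32_t map(uint64_t addr) const {
        uint64_t sum = 0;
        for (size_t i = 0; i < b.size(); ++i) {
            uint64_t a_i = (addr >> i) & 1u;      // i-th address bit
            sum = (sum + a_i * b[i]) % p;
        }
        return static_cast<uint32_t>(sum);
    }

    // Record the address of a bypassed block: count the access, then either
    // count a hit (bit already 1) or set the bit.
    void record(uint64_t addr) {
        ++accesses;
        uint32_t x = map(addr);
        if (bits[x]) ++hits;
        else         bits[x] = true;
    }
};
```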
3. Periodic adjustment of the unified default DA value;
Different application programs have different cache access patterns, so a fixed DA value is not suitable for all applications; the DA value must be dynamically adapted to the application. How should the DA value be adjusted? When the re-access interval of cache blocks is small, the cache blocks in a set should be replaced frequently so that they keep contributing to the hit rate and the utilization of the cache blocks improves; the corresponding DA value should be small. Likewise, when the re-access interval of cache blocks is large, a data block should stay in its cache set for a longer time until it is accessed again; the corresponding DA value should be large. However, an excessively large DA value harms cache utilization and negatively affects system performance. For most GPGPU application programs the re-access interval of cache blocks is large, and the optimal DA values that match overall performance are large and close to each other, so a unified default DA value is applicable.
For the small number of applications whose cache block re-access interval is small, the DA value is adjusted dynamically, which involves two processes: determining the adjustment period and changing the DA value. When the number of cells whose value is 1 in the data-bit memory reaches p/2, a new adjustment period begins, and the hit ratios of the L1 Cache and the BAR must be computed. The value of the L1 Cache hit counter divided by the value of the read-in counter gives the hit ratio h1 of the L1 Cache; the value of the BAR hit counter divided by the value of the BAR access counter gives the hit ratio h2 of the BAR. If h1 is greater than or equal to h2, the L1 Cache is working effectively and the current default DA value does not need to be adjusted. If h1 is less than h2, the current default DA value cannot provide a hit rate commensurate with the L1 Cache capacity and the DA value must be adjusted: if the current DA value is at its maximum, it is halved; if it is at its minimum, it is doubled; in other cases, the adjustment follows the previous adjustment direction, i.e., if the previous adjustment period increased the value, the current period also increases it, and vice versa.
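A C++ sketch of this adjustment step is given below. The hit ratios h1 and h2 are computed as described (L1 hits divided by the read-in count, and BAR hits divided by BAR accesses); the bounds 1 and 15 for the 4-bit DA counter and the use of halving/doubling as the step in the "continue the previous direction" case are assumptions, since the description only fixes the behavior at the extremes. The trigger for calling it would be the moment the number of 1-bits in the data-bit memory reaches p/2.

```cpp
#include <cstdint>

// Sketch of the periodic adjustment of the unified default DA value.
constexpr uint8_t DA_MIN = 1;    // assumed lower bound of the 4-bit counter
constexpr uint8_t DA_MAX = 15;   // upper bound of the 4-bit counter

struct PeriodStats {
    uint64_t l1Hits    = 0;      // L1 Cache hit counter
    uint64_t l1ReadIns = 0;      // L1 Cache read-in (fill) counter
    uint64_t barHits   = 0;      // BAR hit counter
    uint64_t barAccs   = 0;      // BAR access counter
};

// Returns the default DA value for the next period and updates the direction flag.
uint8_t adjustDefaultDA(uint8_t da, const PeriodStats& s, bool& lastWasIncrease) {
    double h1 = s.l1ReadIns ? double(s.l1Hits)  / s.l1ReadIns : 0.0;
    double h2 = s.barAccs   ? double(s.barHits) / s.barAccs   : 0.0;

    if (h1 >= h2) return da;                            // L1 Cache is effective: keep the value

    bool increase;
    if      (da >= DA_MAX) increase = false;            // at the maximum: decrease (halve)
    else if (da <= DA_MIN) increase = true;             // at the minimum: increase (double)
    else                   increase = lastWasIncrease;  // otherwise keep the previous direction

    lastWasIncrease = increase;
    uint8_t next = increase ? uint8_t(da * 2) : uint8_t(da / 2);
    if (next > DA_MAX) next = DA_MAX;
    if (next < DA_MIN) next = DA_MIN;
    return next;
}
```

After the adjustment, the counters and the data-bit memory would be reinitialized for the new period, as described in the following paragraph.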
When the number of cells whose value is 1 in the data-bit memory reaches p/2, in addition to adjusting the DA value, the adjusted DA value is taken as the new unified default value of the L1 Cache, and the read-in counter and hit counter of the L1 Cache, the access counter and hit counter of the BAR, and the data-bit memory are all reinitialized to start the record of a new period.
What is described in the embodiments of this specification is merely an enumeration of the forms in which the inventive concept may be realized. The protection scope of the present invention should not be regarded as limited to the specific forms stated in the embodiments; it also extends to equivalent technical means that those skilled in the art can conceive of on the basis of the inventive concept.

Claims (1)

1. An L1 Cache management method for GPGPU based on data access count, comprising the following steps:
1) hardware modifications to the L1 Cache;
a hit counter that counts hits is added to the L1 Cache; when a data access request occurs, if the data block is in the L1 Cache, the hit counter is incremented by 1; each cache block in the L1 Cache is given a 4-bit data access (Data Access, DA) counter, initialized to a unified default value; each time a data access reaches the cache set to which a cache block belongs, the DA values of all cache blocks in that set are decremented by 1; a bypass address recorder is added to the L1 Cache, comprising an access count counter, a random mapping function, a data-bit memory, and a hit count counter;
2) an L1 Cache management policy based on the DA counter value;
if the target data block of a data access request is not in the L1 Cache, the L2 Cache must be accessed to read the data block; if there is a cache block in the target cache set of the L1 Cache that is free or whose DA value is zero, the data block that has been read is filled into that cache block; otherwise, bypass processing is applied to the data block that has been read, the data block bypasses the L1 Cache, and the address of the data block is sent to the bypass address recorder for processing;
3) the unified default DA value is adjusted periodically;
different application programs exhibit different L1 Cache access patterns, and different default DA values are needed to manage the L1 Cache most effectively, so the unified default DA value must be adjusted periodically; the bypass address recorder determines the adjustment period according to the number and addresses of the data blocks bypassed so far; and the hit ratios of the L1 Cache and the bypass address recorder are compared to decide whether to increase, decrease, or keep the current DA value.
CN201811113134.9A 2018-09-25 2018-09-25 L1 Cache management method for GPGPU based on data access count Pending CN109408411A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811113134.9A CN109408411A (en) L1 Cache management method for GPGPU based on data access count

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811113134.9A CN109408411A (en) L1 Cache management method for GPGPU based on data access count

Publications (1)

Publication Number Publication Date
CN109408411A true CN109408411A (en) 2019-03-01

Family

ID=65465105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811113134.9A Pending CN109408411A (en) L1 Cache management method for GPGPU based on data access count

Country Status (1)

Country Link
CN (1) CN109408411A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03175545A (en) * 1989-12-04 1991-07-30 Nec Corp Cache memory control circuit
CN1829979A (en) * 2003-08-05 2006-09-06 Sap股份公司 A method of data caching
CN101571835A (en) * 2009-03-26 2009-11-04 浙江大学 Realization method for changing Cache group associativity based on requirement of program
CN101944068A (en) * 2010-08-23 2011-01-12 中国科学技术大学苏州研究院 Performance optimization method for sharing cache
CN102999443A (en) * 2012-11-16 2013-03-27 广州优倍达信息科技有限公司 Management method of computer cache system
CN104778132A (en) * 2015-04-08 2015-07-15 浪潮电子信息产业股份有限公司 Multi-core processor directory cache replacement method
WO2017117734A1 (en) * 2016-01-06 2017-07-13 华为技术有限公司 Cache management method, cache controller and computer system
CN108139872A (en) * 2016-01-06 2018-06-08 华为技术有限公司 A kind of buffer memory management method, cache controller and computer system
CN108073446A (en) * 2016-11-10 2018-05-25 华为技术有限公司 Overtime pre-judging method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MAHMOUD KHAIRY et al.: "SACAT: Streaming-Aware Conflict-Avoiding Thrashing-Resistant GPGPU Cache Management Scheme", IEEE Transactions on Parallel and Distributed Systems *
SREYA SREEDHARAN: "A cache replacement policy based on re-reference count", 2017 International Conference on Inventive Communication and Computational Technologies (ICICCT) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472200A (en) * 2019-07-29 2019-11-19 深圳市中兴新云服务有限公司 A kind of data processing method based on list, device and electronic equipment
CN110472200B (en) * 2019-07-29 2023-10-27 深圳市中兴新云服务有限公司 Form-based data processing method and device and electronic equipment
CN111880726A (en) * 2020-06-19 2020-11-03 浙江工商大学 Method for improving CNFET cache performance
CN111880726B (en) * 2020-06-19 2022-05-10 浙江工商大学 Method for improving CNFET cache performance
CN112667534A (en) * 2020-12-31 2021-04-16 海光信息技术股份有限公司 Buffer storage device, processor and electronic equipment
CN112667534B (en) * 2020-12-31 2023-10-20 海光信息技术股份有限公司 Buffer storage device, processor and electronic equipment
CN112799976A (en) * 2021-02-15 2021-05-14 浙江工商大学 DRAM row buffer management method based on two-stage Q table
CN115454502A (en) * 2022-09-02 2022-12-09 杭州登临瀚海科技有限公司 Method for scheduling return data of SIMT architecture processor and corresponding processor

Similar Documents

Publication Publication Date Title
CN109408411A (en) L1 Cache management method for GPGPU based on data access count
US11531617B2 (en) Allocating and accessing memory pages with near and far memory blocks from heterogenous memories
CN105068940B (en) A kind of adaptive page strategy based on Bank divisions determines method
RU2607984C1 (en) Method and corresponding device for determining page shared by virtual memory control mode
US11126555B2 (en) Multi-line data prefetching using dynamic prefetch depth
DE102012221504B4 (en) Multilevel-Instruction-Cache-Pre-Fetch
US9304920B2 (en) System and method for providing cache-aware lightweight producer consumer queues
CN106708626A (en) Low power consumption-oriented heterogeneous multi-core shared cache partitioning method
DE102007012058A1 (en) Synchronizing novelty information in an inclusive cache hierarchy
CN104699631A (en) Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN106909515A (en) Towards multinuclear shared last level cache management method and device that mixing is hosted
CN107463510B (en) High-performance heterogeneous multi-core shared cache buffer management method
CN105359103A (en) Memory resource optimization method and apparatus
US20130111175A1 (en) Methods and apparatus to control generation of memory access requests
CN1828773A (en) Multidimensional array rapid read-write method and apparatus on dynamic random access memory
WO2013155750A1 (en) Page colouring technology-based memory database access optimization method
Syu et al. High-endurance hybrid cache design in CMP architecture with cache partitioning and access-aware policy
US20160342514A1 (en) Method for managing a last level cache and apparatus utilizing the same
CN114968588A (en) Data caching method and device for multi-concurrent deep learning training task
CN106201918B (en) A kind of method and system based on big data quantity and extensive caching quick release
CN106126434B (en) The replacement method and its device of the cache lines of the buffer area of central processing unit
CN103955397B (en) A kind of scheduling virtual machine many policy selection method based on micro-architecture perception
CN108255590A (en) A kind of method of data flow control and device
CN105808160A (en) mpCache hybrid storage system based on SSD (Solid State Disk)
CN112817639B (en) Method for accessing register file by GPU read-write unit through operand collector

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 2019-03-01)