CN109408411A - L1 Cache management method for GPGPU based on data access count - Google Patents

L1 Cache management method for GPGPU based on data access count

Info

Publication number
CN109408411A
Authority
CN
China
Prior art keywords
cache
value
data
data block
counter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811113134.9A
Other languages
Chinese (zh)
Inventor
章铁飞
傅均
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University
Priority to CN201811113134.9A
Publication of CN109408411A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention discloses an L1 Cache management method for GPGPU based on data access count, which specifically includes: hardware modifications to the L1 Cache; an L1 Cache management policy based on the DA counter value; and periodic adjustment of the unified default DA value. The invention aims to solve the L1 Cache data block thrashing problem of GPGPU. The main idea is to add an access-count counter to each cache block in the L1 Cache and to compare the counter value with a default setting value in order to decide operations such as replacement and bypass of cache blocks. The goal is to keep frequently accessed cache blocks in the cache to improve the hit rate, and to bypass cache blocks that will not be accessed again to improve the utilization of cache space, thereby solving the data block thrashing problem of the L1 Cache to the greatest extent.

Description

L1 Cache management method for GPGPU based on data access count
Technical field
The present invention relates to an L1 Cache management method for GPGPU. For the data block thrashing problem of the L1 Cache, it proposes a solution based on the data access count: frequently accessed cache blocks are kept in the cache to improve the hit rate, while cache blocks that will not be accessed again are bypassed to improve the utilization of cache space, thereby solving the data block thrashing problem of the L1 Cache to the greatest extent.
Background art
Compared with conventional processors (CPUs), general-purpose graphics processing units (GPGPUs) are better suited to computing tasks with a high degree of data parallelism and offer better computational energy efficiency. Based on the CUDA and OpenCL programming frameworks, a GPGPU can accelerate tasks in many fields, such as the currently popular machine learning applications. A GPGPU contains multiple independent compute cores (SIMT Cores) that can compute independently and simultaneously, giving it high concurrent computing capability. Similar to a CPU, a GPGPU stores the code and data of its computing tasks in off-chip DRAM, and the speed at which the GPGPU processor computes is much higher than the speed of DRAM data access, so the GPGPU also needs a complex memory hierarchy to bridge the speed gap between the processor and DRAM.
The memory hierarchy of a GPGPU includes registers, the L1 cache, the shared L2 cache, and off-chip DRAM. A typical GPGPU contains multiple SIMT Cores, each running multiple thread warps to raise thread-level parallelism. Each SIMT Core has a private L1 cache, and all SIMT Cores are connected through an internal bus to a shared L2 cache. The L2 cache is responsible for controlling the consistency of cached data and uses a banked structure; each bank is connected to off-chip DRAM through a private memory channel. If the target data of an access request is not in the L1 Cache, the request is classified as a miss request and sent to the L2 Cache; if the target data hits in the L2 Cache, the L2 Cache sends the data to the L1 Cache; otherwise, the L2 Cache generates an access miss request and sends it to the next-level memory.
The basic storage unit in the L1 Cache of a GPGPU is the cache block; the data block each cache block can hold is usually 128 bytes, and every four cache blocks form a cache set. A data block that is read in is mapped to and stored in a free cache block of a particular cache set. If that cache set has no free cache block, the data block can either bypass the L1 Cache and go directly to the processing core, or a cache block in the set can be selected for data block replacement. When a replaced data block is accessed again, the L1 Cache data block thrashing (cache block thrashing) problem may arise. Cache data block thrashing occurs when a data block has been fetched into the cache but, before it is accessed again, is replaced by newly arriving cache blocks because cache space is limited; when the block is accessed again it is no longer in the cache and must be fetched from the next-level memory with a very long latency, which hurts access efficiency and performance. Each SIMT Core in a GPGPU schedules and executes multiple thread warps, and all threads share the same private L1 Cache; the data blocks of the current warp in the L1 Cache are easily replaced by the data blocks of the warps executed in the next round, so the multithreaded execution environment of the GPGPU further aggravates the data block thrashing problem.
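By way of illustration only, the following minimal C++ sketch shows how a data block address could map onto a 4-way set of 128-byte cache blocks as described above. The block size and associativity come from the description; the total number of sets and the exact indexing scheme are assumptions made for the example.

```cpp
#include <cstdint>
#include <cstdio>

// Parameters taken from the description: 128-byte blocks, 4 blocks per set.
constexpr uint32_t BLOCK_SIZE = 128;
constexpr uint32_t WAYS       = 4;
constexpr uint32_t NUM_SETS   = 64;   // assumed (32 KB total) for illustration only

// Map a byte address to a set index and tag. A simple modulo indexing scheme is
// assumed here; the patent does not fix the exact indexing function.
void decode(uint64_t addr, uint32_t& set, uint64_t& tag) {
    uint64_t blockAddr = addr / BLOCK_SIZE;
    set = static_cast<uint32_t>(blockAddr % NUM_SETS);
    tag = blockAddr / NUM_SETS;
}

int main() {
    uint32_t set; uint64_t tag;
    decode(0x12345680, set, tag);
    std::printf("set=%u tag=%llu\n", set, (unsigned long long)tag);
    return 0;
}
```

When all four ways of the selected set are occupied, the incoming block must either replace a resident block or bypass the L1 Cache, which is exactly the situation the DA-based policy described below is designed to handle.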
One way to solve cache data block thrashing is to increase cache capacity, but increasing the L1 Cache capacity brings significant negative effects, such as higher energy consumption, longer data access latency, and higher cost, and is therefore impractical. On the other hand, different application programs on the GPGPU exhibit different cache access patterns. For some applications, once a data block has been read into the L1 Cache it will not be accessed again, as in the typical streaming access pattern, and the cache data block thrashing problem simply does not exist. For other applications, the data blocks read into the L1 Cache show a markedly uneven re-access rate: most of the data blocks read into the L1 Cache are never accessed again before they are replaced, while a small number of data blocks are accessed repeatedly. If frequently accessed data blocks are evicted from the L1 Cache too early, the cache data block thrashing problem arises. Therefore, as long as the data blocks that will no longer be accessed are replaced while the small number of repeatedly accessed data blocks are kept in the cache, the cache data block thrashing problem can be effectively avoided.
Summary of the invention
To overcome the above drawbacks of the prior art, the present invention proposes an L1 Cache management method based on the data access (Data Access, DA) count.
The content and features of the present invention are as follows: each cache block in the L1 Cache is given a DA counter, and decisions such as replacement and bypass of cache blocks are made according to the size of the DA value; frequently accessed data blocks are kept in the cache to contribute to the hit rate, and data blocks that will not be accessed again are bypassed to improve cache space utilization, thereby alleviating the cache data block thrashing problem of the L1 Cache to the greatest extent.
The L1 Cache management method for GPGPU based on data access count of the present invention comprises the following technical steps:
1) hardware modifications to the L1 Cache;
A hit counter that counts hits is added to the L1 Cache; when a data access request occurs, if the data block is in the L1 Cache, the hit counter is incremented by 1. Each cache block in the L1 Cache is given a 4-bit data access counter (DA), initialized to a unified default value; each time a data access reaches the cache set to which a cache block belongs, the DA values of all cache blocks in that set are decremented by 1. A bypass address recorder is added to the L1 Cache, comprising an access count counter, a random mapping function, a data-bit memory, and a hit count counter;
2) an L1 Cache management policy based on the DA counter value;
If the target data block of a data access request is not in the L1 Cache, the L2 Cache must be accessed to read the data block. If there is a cache block in the target cache set of the L1 Cache that is free or whose DA value is zero, the data block that has been read is filled into that cache block; otherwise, bypass processing is applied to the data block that has been read, the data block bypasses the L1 Cache, and the address of the data block is sent to the bypass address recorder for processing;
3) the unified default DA value is adjusted periodically;
Different application programs exhibit different L1 Cache access patterns, and different default DA values are needed to manage the L1 Cache most effectively, so the unified default DA value must be adjusted periodically. The bypass address recorder determines the adjustment period according to the number and addresses of the data blocks bypassed so far, and the hit ratios of the L1 Cache and the bypass address recorder are compared to decide whether to increase, decrease, or keep the current DA value.
The advantages of the present invention are: the method is simple; based on the additional DA count value of each cache block, frequently accessed data blocks are kept in the cache to improve the hit rate, and data blocks that will not be accessed again are bypassed to improve cache space utilization; in addition, the hardware cost is low.
Description of the drawings
Fig. 1 is a diagram of the GPGPU memory hierarchy used by the method of the present invention.
Fig. 2 is a diagram of an L1 Cache cache block of the method of the present invention.
Fig. 3 is a diagram of the bypass address recorder of the method of the present invention.
Specific embodiments
The technical solution of the method of the present invention is further described below with reference to the accompanying drawings.
Fig. 1 shows the memory hierarchy of a GPGPU: each SIMT Core has a private L1 Cache and is connected through an internal bus to the L2 Cache; the L2 Cache is divided into multiple banks, and each bank is connected to the external DRAM via a private channel. Fig. 2 shows an L1 Cache cache block: each cache block carries an additional 4-bit DA counter for tracking accesses to its cache set. Fig. 3 shows the bypass address recorder: f(addr) is the random mapping function, which takes the address addr as input and whose output points to a particular bit of the data-bit memory.
The L1 Cache management method for GPGPU based on data access count of the present invention comprises the following technical steps:
1. Hardware modifications to the L1 Cache;
A hit counter that records the number of L1 Cache hits is added; when a data access request occurs, if the target data block is in the L1 Cache, the hit counter is incremented by 1. Each cache block in the L1 Cache is given a 4-bit DA counter (shown in Fig. 2), initialized to a unified default value; each time a data access reaches the cache set to which a cache block belongs, the DA values of all cache blocks in that set are decremented by 1. A bypass address recorder (BAR) is added, comprising an access count counter, a random mapping function, a data-bit memory, and a hit count counter. All bits of the data-bit memory are initialized to 0. Each time a data block address arrives, the access count counter is incremented by 1 and the random mapping function, taking the address as input, outputs a position pointing to a bit of the data-bit memory; if that bit is already 1, a hit is indicated and the hit count counter is incremented by 1; otherwise, the bit is set to 1.
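A minimal C++ sketch of the added state is given below. The per-block 4-bit DA counter, the L1 hit and read-in counters, and the BAR fields (access counter, random mapping coefficients, p-bit data-bit memory, hit counter) follow the description above; the number of sets, the DA default value, the prime p, and the number of address bits are illustrative assumptions.

```cpp
#include <array>
#include <cstdint>
#include <random>
#include <vector>

constexpr int      WAYS     = 4;     // blocks per set (from the description)
constexpr int      NUM_SETS = 64;    // assumed for illustration
constexpr uint8_t  DA_MAX   = 15;    // largest value of the 4-bit DA counter
constexpr uint32_t P        = 1021;  // prime size of the data-bit memory (assumed)

struct CacheBlock {
    bool     valid = false;
    uint64_t tag   = 0;
    uint8_t  da    = 0;              // 4-bit data access counter
};

// Bypass Address Recorder (BAR): access counter, random mapping coefficients,
// p-bit data-bit memory, and hit counter, as described in the text.
struct BypassAddressRecorder {
    uint64_t accesses = 0;
    uint64_t hits     = 0;
    std::vector<uint32_t> b;         // one coefficient in [0, P-1] per address bit
    std::vector<bool>     bits = std::vector<bool>(P, false);

    explicit BypassAddressRecorder(int addrBits, uint32_t seed = 1) : b(addrBits) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<uint32_t> dist(0, P - 1);
        for (auto& bi : b) bi = dist(rng);
    }
};

struct L1Cache {
    std::array<std::array<CacheBlock, WAYS>, NUM_SETS> sets{};
    uint64_t hitCounter  = 0;        // counts L1 hits
    uint64_t fillCounter = 0;        // counts blocks read into the L1 ("read-in" counter)
    uint8_t  daDefault   = 8;        // unified default DA value (assumed starting point)
    BypassAddressRecorder bar{40};   // 40 address bits assumed for illustration
};

int main() {
    L1Cache l1;                      // all DA counters start at 0 until blocks are filled
    return l1.sets.size() == NUM_SETS ? 0 : 1;
}
```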
2. L1 Cache management policy based on the DA counter value;
If the target data block of a data access request is in the L1 Cache, it is a hit: the DA value of the hit cache block is reset to the default value, the DA values of the other cache blocks in the same cache set are decremented by 1, and the hit counter is incremented by 1. If the data block requested by the access is not in the L1 Cache, a cache miss occurs and the L2 Cache must be accessed to read the data block. If there is a free cache block in the target cache set of the L1 Cache, the data block that has been read is filled into the free cache block, its DA value is initialized to the default value, and the L1 Cache read-in counter is incremented by 1.
If there is no free cache block in the target cache set, a cache block whose DA value is 0 is located (if several blocks have a DA value of zero, one of them is selected at random) and its data block is replaced with the newly read data block. If no cache block has a DA value of 0, bypass processing is applied to the newly read data block, i.e., it bypasses the cache and is used directly by the SIMT Core, and the address of the bypassed data block is passed to the BAR. The BAR comprises an access counter, a random mapping function, a data-bit memory, and a hit counter. The data-bit memory consists of p one-bit 0/1 cells, each initialized to 0; the output of the random mapping function determines which cell is to be set to 1. After a data block address is passed to the BAR, the access counter is incremented by 1. The data block address is used as the input of the random function, whose output lies in the range [0, p-1], where p is a prime number. If the output of the random function is x, the x-th bit of the data-bit memory is to be set to 1; if that bit is already 1, a BAR hit is indicated and the hit counter is incremented by 1; if the bit is 0, it is set to 1.
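The following self-contained C++ sketch illustrates this per-set policy: every access decrements the DA values in the set, a hit resets the hit block's DA to the default, a miss fills a free block or replaces a block whose DA value is 0 (random choice on ties), and otherwise the block is bypassed. The default DA value and the tie-breaking random source are assumptions for the example; reporting the bypassed address to the BAR is omitted here and shown in the next sketch.

```cpp
#include <cstdint>
#include <random>
#include <vector>

struct Block { bool valid = false; uint64_t tag = 0; uint8_t da = 0; };

constexpr uint8_t DA_DEFAULT = 8;   // unified default DA value (assumed)

enum class Outcome { Hit, FilledFree, Replaced, Bypassed };

Outcome accessSet(std::vector<Block>& set, uint64_t tag, std::mt19937& rng) {
    // Every access that reaches the set decrements the DA value of all resident blocks.
    for (auto& b : set)
        if (b.valid && b.da > 0) --b.da;

    // Hit: reset the hit block's DA value to the default.
    for (auto& b : set)
        if (b.valid && b.tag == tag) { b.da = DA_DEFAULT; return Outcome::Hit; }

    // Miss: fill a free block if one exists.
    for (auto& b : set)
        if (!b.valid) { b = {true, tag, DA_DEFAULT}; return Outcome::FilledFree; }

    // Otherwise replace a block whose DA value has reached 0 (random choice on ties).
    std::vector<size_t> zeros;
    for (size_t i = 0; i < set.size(); ++i)
        if (set[i].da == 0) zeros.push_back(i);
    if (!zeros.empty()) {
        size_t victim = zeros[std::uniform_int_distribution<size_t>(0, zeros.size() - 1)(rng)];
        set[victim] = {true, tag, DA_DEFAULT};
        return Outcome::Replaced;
    }

    // No free block and no DA==0 block: the incoming block bypasses the L1 Cache
    // and its address would be reported to the bypass address recorder.
    return Outcome::Bypassed;
}

int main() {
    std::vector<Block> set(4);
    std::mt19937 rng(42);
    accessSet(set, 0xA, rng);   // miss: fills a free block with DA = DA_DEFAULT
    accessSet(set, 0xA, rng);   // hit: DA of the block is reset to DA_DEFAULT
    return 0;
}
```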
Each data block address can be regarded as n 0/1 values and expressed as addr = (a1, a2, a3, ..., an); the random mapping function contains n corresponding integers (b1, b2, b3, ..., bn), where each bi belongs to [0, p-1], and maps the address to a position in [0, p-1].
If the addresses of two data blocks are different and they reach the BAR one after the other, the probability that they map to the same bit is only 1/p, so by choosing a large value of p, 1/p becomes negligibly small.
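The displayed formula of the random mapping function is not reproduced in this text, so the sketch below assumes the form implied by the surrounding description, f(addr) = (a1·b1 + a2·b2 + ... + an·bn) mod p, i.e., a weighted sum of the address bits with fixed random coefficients, which is consistent with the stated 1/p probability that two different addresses map to the same bit.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the bypass address recorder's mapping and recording step.
// Assumes f(addr) = (sum of a_i * b_i) mod p over the n address bits a_i (n <= 64 here),
// with fixed coefficients b_i in [0, p-1] and p prime.
struct BAR {
    uint32_t p;
    std::vector<uint32_t> b;       // n coefficients in [0, p-1]
    std::vector<bool>     bits;    // p-bit data-bit memory, initialized to 0
    uint64_t accesses = 0;
    uint64_t hits     = 0;

    BAR(uint32_t prime, std::vector<uint32_t> coeffs)
        : p(prime), b(std::move(coeffs)), bits(prime, false) {}

    // Random mapping function f(addr).
    uint32_t map(uint64_t addr) const {
        uint64_t sum = 0;
        for (size_t i = 0; i < b.size(); ++i) {
            uint64_t a_i = (addr >> i) & 1u;      // i-th address bit
            sum = (sum + a_i * b[i]) % p;
        }
        return static_cast<uint32_t>(sum);
    }

    // Record the address of a bypassed block: count the access, then either
    // count a hit (bit already 1) or set the bit.
    void record(uint64_t addr) {
        ++accesses;
        uint32_t x = map(addr);
        if (bits[x]) ++hits;
        else         bits[x] = true;
    }
};
```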
3. Periodic adjustment of the unified default DA value;
Different application programs have different cache access patterns, so a fixed DA value is not suitable for all applications; the DA value must be dynamically adapted to the application. How should the DA value be adjusted? When the re-access interval of cache blocks is small, the cache blocks in a set should be replaced frequently so that they keep contributing to the hit rate and the utilization of the cache blocks improves; the corresponding DA value should be small. Likewise, when the re-access interval of cache blocks is large, a data block should stay in its cache set for a longer time until it is accessed again; the corresponding DA value should be large. However, an excessively large DA value harms cache utilization and negatively affects system performance. For most GPGPU application programs the re-access interval of cache blocks is large, and the optimal DA values that match overall performance are large and close to each other, so a unified default DA value is applicable.
For the small number of applications whose cache block re-access interval is small, the DA value is adjusted dynamically, which involves two processes: determining the adjustment period and changing the DA value. When the number of cells whose value is 1 in the data-bit memory reaches p/2, a new adjustment period begins, and the hit ratios of the L1 Cache and the BAR must be computed. The value of the L1 Cache hit counter divided by the value of the read-in counter gives the hit ratio h1 of the L1 Cache; the value of the BAR hit counter divided by the value of the BAR access counter gives the hit ratio h2 of the BAR. If h1 is greater than or equal to h2, the L1 Cache is working effectively and the current default DA value does not need to be adjusted. If h1 is less than h2, the current default DA value cannot provide a hit rate commensurate with the L1 Cache capacity and the DA value must be adjusted: if the current DA value is at its maximum, it is halved; if it is at its minimum, it is doubled; in other cases, the adjustment follows the previous adjustment direction, i.e., if the previous adjustment period increased the value, the current period also increases it, and vice versa.
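A C++ sketch of this adjustment step is given below. The hit ratios h1 and h2 are computed as described (L1 hits divided by the read-in count, and BAR hits divided by BAR accesses); the bounds 1 and 15 for the 4-bit DA counter and the use of halving/doubling as the step in the "continue the previous direction" case are assumptions, since the description only fixes the behavior at the extremes. The trigger for calling it would be the moment the number of 1-bits in the data-bit memory reaches p/2.

```cpp
#include <cstdint>

// Sketch of the periodic adjustment of the unified default DA value.
constexpr uint8_t DA_MIN = 1;    // assumed lower bound of the 4-bit counter
constexpr uint8_t DA_MAX = 15;   // upper bound of the 4-bit counter

struct PeriodStats {
    uint64_t l1Hits    = 0;      // L1 Cache hit counter
    uint64_t l1ReadIns = 0;      // L1 Cache read-in (fill) counter
    uint64_t barHits   = 0;      // BAR hit counter
    uint64_t barAccs   = 0;      // BAR access counter
};

// Returns the default DA value for the next period and updates the direction flag.
uint8_t adjustDefaultDA(uint8_t da, const PeriodStats& s, bool& lastWasIncrease) {
    double h1 = s.l1ReadIns ? double(s.l1Hits)  / s.l1ReadIns : 0.0;
    double h2 = s.barAccs   ? double(s.barHits) / s.barAccs   : 0.0;

    if (h1 >= h2) return da;                            // L1 Cache is effective: keep the value

    bool increase;
    if      (da >= DA_MAX) increase = false;            // at the maximum: decrease (halve)
    else if (da <= DA_MIN) increase = true;             // at the minimum: increase (double)
    else                   increase = lastWasIncrease;  // otherwise keep the previous direction

    lastWasIncrease = increase;
    uint8_t next = increase ? uint8_t(da * 2) : uint8_t(da / 2);
    if (next > DA_MAX) next = DA_MAX;
    if (next < DA_MIN) next = DA_MIN;
    return next;
}
```

After the adjustment, the counters and the data-bit memory would be reinitialized for the new period, as described in the following paragraph.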
When the number of cells whose value is 1 in the data-bit memory reaches p/2, in addition to adjusting the DA value, the adjusted DA value is taken as the new unified default value of the L1 Cache, and the read-in counter and hit counter of the L1 Cache, the access counter and hit counter of the BAR, and the data-bit memory are all reinitialized to start the record of a new period.
What is described in the embodiments of this specification is merely an enumeration of the forms in which the inventive concept may be realized. The protection scope of the present invention should not be regarded as limited to the specific forms stated in the embodiments; it also extends to equivalent technical means that those skilled in the art can conceive of on the basis of the inventive concept.

Claims (1)

1. An L1 Cache management method for GPGPU based on data access count, comprising the following steps:
1) hardware modifications to the L1 Cache;
a hit counter that counts hits is added to the L1 Cache; when a data access request occurs, if the data block is in the L1 Cache, the hit counter is incremented by 1; each cache block in the L1 Cache is given a 4-bit data access (Data Access, DA) counter, initialized to a unified default value; each time a data access reaches the cache set to which a cache block belongs, the DA values of all cache blocks in that set are decremented by 1; a bypass address recorder is added to the L1 Cache, comprising an access count counter, a random mapping function, a data-bit memory, and a hit count counter;
2) an L1 Cache management policy based on the DA counter value;
if the target data block of a data access request is not in the L1 Cache, the L2 Cache must be accessed to read the data block; if there is a cache block in the target cache set of the L1 Cache that is free or whose DA value is zero, the data block that has been read is filled into that cache block; otherwise, bypass processing is applied to the data block that has been read, the data block bypasses the L1 Cache, and the address of the data block is sent to the bypass address recorder for processing;
3) the unified default DA value is adjusted periodically;
different application programs exhibit different L1 Cache access patterns, and different default DA values are needed to manage the L1 Cache most effectively, so the unified default DA value must be adjusted periodically; the bypass address recorder determines the adjustment period according to the number and addresses of the data blocks bypassed so far; and the hit ratios of the L1 Cache and the bypass address recorder are compared to decide whether to increase, decrease, or keep the current DA value.
CN201811113134.9A 2018-09-25 2018-09-25 L1 Cache management method for GPGPU based on data access count Pending CN109408411A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811113134.9A CN109408411A (en) L1 Cache management method for GPGPU based on data access count

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811113134.9A CN109408411A (en) L1 Cache management method for GPGPU based on data access count

Publications (1)

Publication Number Publication Date
CN109408411A true CN109408411A (en) 2019-03-01

Family

ID=65465105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811113134.9A Pending CN109408411A (en) L1 Cache management method for GPGPU based on data access count

Country Status (1)

Country Link
CN (1) CN109408411A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03175545A (en) * 1989-12-04 1991-07-30 Nec Corp Cache memory control circuit
CN1829979A (en) * 2003-08-05 2006-09-06 Sap股份公司 A method of data caching
CN101571835A (en) * 2009-03-26 2009-11-04 浙江大学 Realization method for changing Cache group associativity based on requirement of program
CN101944068A (en) * 2010-08-23 2011-01-12 中国科学技术大学苏州研究院 Performance optimization method for sharing cache
CN102999443A (en) * 2012-11-16 2013-03-27 广州优倍达信息科技有限公司 Management method of computer cache system
CN104778132A (en) * 2015-04-08 2015-07-15 浪潮电子信息产业股份有限公司 Multi-core processor directory cache replacement method
WO2017117734A1 (en) * 2016-01-06 2017-07-13 华为技术有限公司 Cache management method, cache controller and computer system
CN108139872A (en) * 2016-01-06 2018-06-08 华为技术有限公司 A kind of buffer memory management method, cache controller and computer system
CN108073446A (en) * 2016-11-10 2018-05-25 华为技术有限公司 Overtime pre-judging method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MAHMOUD KHAIRY et al.: "SACAT: Streaming-Aware Conflict-Avoiding Thrashing-Resistant GPGPU Cache Management Scheme", IEEE Transactions on Parallel and Distributed Systems *
SREYA SREEDHARAN: "A cache replacement policy based on re-reference count", 2017 International Conference on Inventive Communication and Computational Technologies (ICICCT) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472200A (en) * 2019-07-29 2019-11-19 深圳市中兴新云服务有限公司 A kind of data processing method based on list, device and electronic equipment
CN110472200B (en) * 2019-07-29 2023-10-27 深圳市中兴新云服务有限公司 Form-based data processing method and device and electronic equipment
CN111880726A (en) * 2020-06-19 2020-11-03 浙江工商大学 Method for improving CNFET cache performance
CN111880726B (en) * 2020-06-19 2022-05-10 浙江工商大学 Method for improving CNFET cache performance
CN112667534A (en) * 2020-12-31 2021-04-16 海光信息技术股份有限公司 Buffer storage device, processor and electronic equipment
CN112667534B (en) * 2020-12-31 2023-10-20 海光信息技术股份有限公司 Buffer storage device, processor and electronic equipment
CN112799976A (en) * 2021-02-15 2021-05-14 浙江工商大学 DRAM row buffer management method based on two-stage Q table
CN115454502A (en) * 2022-09-02 2022-12-09 杭州登临瀚海科技有限公司 Method for scheduling return data of SIMT architecture processor and corresponding processor

Similar Documents

Publication Publication Date Title
CN109408411A (en) L1 Cache management method for GPGPU based on data access count
US11531617B2 (en) Allocating and accessing memory pages with near and far memory blocks from heterogenous memories
CN105068940B (en) A kind of adaptive page strategy based on Bank divisions determines method
RU2607984C1 (en) Method and corresponding device for determining page shared by virtual memory control mode
US11126555B2 (en) Multi-line data prefetching using dynamic prefetch depth
DE102012221504B4 (en) Multilevel-Instruction-Cache-Pre-Fetch
US9304920B2 (en) System and method for providing cache-aware lightweight producer consumer queues
CN106708626A (en) Low power consumption-oriented heterogeneous multi-core shared cache partitioning method
DE102007012058A1 (en) Synchronizing novelty information in an inclusive cache hierarchy
CN104699631A (en) Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN106909515A (en) Towards multinuclear shared last level cache management method and device that mixing is hosted
CN107463510B (en) High-performance heterogeneous multi-core shared cache buffer management method
CN105359103A (en) Memory resource optimization method and apparatus
US20130111175A1 (en) Methods and apparatus to control generation of memory access requests
CN1828773A (en) Multidimensional array rapid read-write method and apparatus on dynamic random access memory
WO2013155750A1 (en) Page colouring technology-based memory database access optimization method
Syu et al. High-endurance hybrid cache design in CMP architecture with cache partitioning and access-aware policy
US20160342514A1 (en) Method for managing a last level cache and apparatus utilizing the same
CN114968588A (en) Data caching method and device for multi-concurrent deep learning training task
CN106201918B (en) A kind of method and system based on big data quantity and extensive caching quick release
CN106126434B (en) The replacement method and its device of the cache lines of the buffer area of central processing unit
CN103955397B (en) A kind of scheduling virtual machine many policy selection method based on micro-architecture perception
CN108255590A (en) A kind of method of data flow control and device
CN105808160A (en) mpCache hybrid storage system based on SSD (Solid State Disk)
CN112817639B (en) Method for accessing register file by GPU read-write unit through operand collector

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 2019-03-01)