CN115599710A - Method for rapidly realizing variable cacheline switching - Google Patents

Method for rapidly realizing variable cacheline switching

Info

Publication number
CN115599710A
Authority
CN
China
Prior art keywords
cacheline
cache
cachelines
switching
gear
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211212006.6A
Other languages
Chinese (zh)
Inventor
钱家祥
石小刚
黄光新
戴程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhihua Microelectronics Technology Nanjing Co ltd
Original Assignee
Zhihua Microelectronics Technology Nanjing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhihua Microelectronics Technology Nanjing Co ltd filed Critical Zhihua Microelectronics Technology Nanjing Co ltd
Priority to CN202211212006.6A
Publication of CN115599710A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877: Cache access modes
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a method for rapidly realizing variable cacheline switching. It addresses the problem that conventional cache designs face mixed requests of weak locality (such as control logic) and strong locality (such as data-flow logic), which lowers memory access efficiency. The main technical scheme comprises the following steps: S1, using the size of the cache's original cacheline as the reference unit, merge consecutive reference cachelines into a new cacheline, whose size is denoted the Xn gear, meaning the cacheline spans n reference cachelines; the minimum n is 1, and the maximum n is the number of all reference cachelines in the cache. S2, keep the tag marks sized to the reference cacheline; on a cache miss, the X1 gear updates and replaces one reference cacheline, while the Xn gear updates and replaces n reference cachelines; that is, the Xn gear is obtained by contiguously enlarging the original replacement logic's processing unit from 1 to n reference cachelines. S3, Xn gear switching sets a switching control register through a configuration interface; the register drives the internal switching selection logic, and switching occurs in any interval between cache request processing.

Description

Method for rapidly realizing variable cacheline switching
Technical Field
The invention relates to the technical field of cache design, and in particular to a method for rapidly realizing variable cacheline switching.
Background
Cache design is used to improve data access efficiency. Its theoretical basis is locality: spatial locality (data near recently used data is likely to be used soon) and temporal locality (recently used data is likely to be used again soon). A cache is inserted between the request end and the storage. According to a locality-based algorithm, part of the data is selectively copied from a large-capacity storage (slow access, responding to a request in tens to hundreds of cycles; DDR is typical) into a small-capacity storage (fast access, responding in a few cycles; SRAM is typical). If requested data is in the cache, it is returned quickly; otherwise it is fetched from the large storage. The better the locality of the requests, the closer the response speed approaches that of the cache's small storage, and thus the greater the improvement in access efficiency.
General principle of cache design: first, the whole DDR is partitioned into smaller contiguous data units (denoted cachelines). For each cacheline held in the cache, a specific valid part of the corresponding address is stored as a unique identifier (denoted tag) distinguishing it from other cachelines; the whole cache is the combination of several cachelines. When the cache receives a read/write request, it judges from the tag whether the requested data is in some cacheline (if so, the case is recorded as a hit; if not, as a miss). On a hit the corresponding read/write is executed; on a miss the cacheline containing the data of the missed request is fetched from DDR to replace a cacheline chosen by the algorithm. Under the joint control of the requests and the replacement algorithm, the cache thus always holds the cachelines related to the most recent requests. By the locality principle, future accesses have a high probability of falling in a cacheline already in the cache, so the response of those requests is accelerated from the DDR response level to the cache response level, achieving memory access acceleration.
A cacheline consists of several contiguous data items (denoted data). Cacheline size is closely related to locality: the better the locality, the larger the cacheline should be, and conversely the smaller it should be. For example, with sequential address accesses the data in a cacheline is used contiguously, and a larger cacheline obtains a higher hit rate. In loop logic a few data items are reused repeatedly, and branch logic jumps far and then stays within a small range; these cases suit a small cacheline, avoiding the hit-rate loss caused by too much useless data in a large cacheline.
In actual designs, control-logic requests are irregular and weakly local, suiting small cachelines, while data-processing requests are regular and strongly local, suiting large cachelines. Some designs handle the two separately, but many designs are constrained by implementation resources, power budgets, and the difficulty of splitting control and data streams, so a single cache must handle access requests of both strong and weak locality.
Disclosure of Invention
The invention provides a variable cacheline switching technique that can dynamically switch the cacheline size during operation, improving the hit rate under requests of different locality, promoting more request responses to the cache response level, and thereby improving the access efficiency of requests with mixed strong and weak locality.
To solve the above technical problems, the invention adopts the following technical scheme: a method for rapidly realizing variable cacheline switching, comprising the following steps:
S1, using the size of the cache's original cacheline as the reference unit, merging consecutive reference cachelines into a new cacheline, whose size is denoted the Xn gear, meaning the cacheline spans n reference cachelines; the minimum n is 1 and the maximum n is the number of all reference cachelines in the cache;
S2, keeping the tag marks sized to the reference cacheline, and processing requests in the X1 gear exactly as the original cache does; on a cache hit the X1 gear and the Xn gear behave identically; on a cache miss the X1 gear updates and replaces only one reference cacheline, while switching to the Xn gear contiguously enlarges the original replacement logic's processing unit, changing from replacing 1 to replacing n reference cachelines;
S3, Xn gear switching sets a switching control register through a configuration interface; the register drives the internal switching selection logic, and the whole switching process occurs in an interval between cache request processing.
Further, the reference cachelines in step S1 are grouped in consecutive powers of 2.
Further, the tag identification in step S2 is the existing flag of a cacheline's unique valid information; no new tags are added.
Further, the original hit judgment logic and the post-hit processing logic are retained in step S2; on a miss, the original replacement logic's processing unit is enlarged, changing from replacing 1 to replacing n reference cachelines.
Compared with the prior art, the invention has the following beneficial effects: dynamic cacheline switching can be realized by modifying 2 nodes, the internal marks and the replacement request sending logic, and adding switching control logic. This improves the access efficiency of the original cache, and the method features low implementation cost, low complexity, and a good improvement effect.
Drawings
The disclosure of the present invention is illustrated with reference to the accompanying drawings. It is to be understood that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. In the drawings, like reference numerals are used to refer to like parts. Wherein:
FIG. 1 is a flow chart of the conventional cache principle according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of a conventional cache internal judgment flow according to a preferred embodiment of the present invention;
FIG. 3 is a simplified diagram of the internal judgment of a conventional cache according to a preferred embodiment of the present invention;
FIG. 4 is a comparison of a cache design and a conventional design according to a preferred embodiment of the present invention;
FIG. 5 is a flow chart of the internal judgment of the cache design according to a preferred embodiment of the present invention;
FIG. 6 is a flow chart of the cache design replacement logic expansion according to a preferred embodiment of the present invention;
FIG. 7 is a flow chart of the correction process for the replacement logic of the cache design according to a preferred embodiment of the present invention;
FIG. 8 is a simplified flow chart of the cache design replacement logic according to a preferred embodiment of the present invention;
FIG. 9 is a flow chart of the cache design replacement logic synchronization process according to a preferred embodiment of the present invention;
FIG. 10 is a flowchart of the overall cache design process according to a preferred embodiment of the present invention.
Detailed Description
It is easily understood that those skilled in the art can, based on the technical solution of the present invention, propose various alternative structures and implementations without changing its spirit. Therefore, the following detailed description and the accompanying drawings merely illustrate the technical solution of the present invention and should not be construed as limiting or restricting it.
An embodiment according to the present invention is shown in conjunction with fig. 1-10.
As shown in fig. 1-2, the conventional cache design includes the following parts:
Storage mem (SRAM): used to store the data copied from the main storage; it is divided equally into several storage positions, the size of each being one cacheline;
Internal marks: record the information of each cacheline position in mem, as follows;
vld: validity flags of the cacheline positions, in one-to-one correspondence with the cachelines; when data is placed into a cacheline position it becomes valid (1 valid, 0 invalid);
tag: the location in main storage of the data held at the cacheline position; its content comes from part of the bits of the corresponding request address;
lru: the read/write access record of the cacheline position, updated on every read/write access;
hit/miss determination: partial bits of the request address (the same bit positions used when storing the tag) are compared with the tags of all valid cacheline positions; a match is a hit, otherwise a miss;
replacement request sending logic: on a miss, the current request is aligned to the cacheline size and sent to the main storage, requesting the cacheline data block containing the missed data;
mem read/write requests: the read/write requests to mem generated in the various cases, including the read/write on a hit, the replacement write on a miss, and the write/read after a miss replacement completes.
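For illustration, the following C sketch models these internal marks for a small fully associative cache, together with the hit/miss determination. The sizes (8 cacheline positions, a 32-byte X1 cacheline) and all identifiers are assumptions made for exposition, not the patent's concrete design.

    #include <stdint.h>
    #include <stdbool.h>

    #define NLINES      8   /* number of X1 cacheline positions (assumed) */
    #define OFFSET_BITS 5   /* 32-byte X1 cacheline -> 5 offset bits (assumed) */

    typedef struct {
        bool     vld[NLINES];   /* validity flag per cacheline position */
        uint32_t tag[NLINES];   /* address bits identifying the resident block */
        uint32_t lru[NLINES];   /* access record (weight) per position */
    } cache_marks_t;

    /* tag = the request address with the in-line offset bits stripped */
    static uint32_t addr_tag(uint32_t addr) { return addr >> OFFSET_BITS; }

    /* hit/miss determination: compare the request tag against every valid tag */
    static int lookup(const cache_marks_t *m, uint32_t addr)
    {
        uint32_t t = addr_tag(addr);
        for (int i = 0; i < NLINES; i++)
            if (m->vld[i] && m->tag[i] == t)
                return i;   /* hit: index of the matching cacheline position */
        return -1;          /* miss */
    }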
As shown in FIG. 3 (the triangle numbers match the indications in FIG. 2), the operation of the whole cache comprises a hit processing flow and a miss processing flow:
For the hit flow, the request address is first compared with the internal marks to judge hit or miss; on a hit, the mem read/write request logic generates a mem request, which mem then executes (a read returns data, a write only writes mem), and the internal marks are updated during this process.
For the miss flow, the request address is first compared with the internal marks; on a miss, the replacement request sending logic requests the cacheline data block containing the missed data; when the block returns, a mem write request is generated, and after the write completes a mem read request for the missed data is generated; mem executes these read/write requests, and the internal marks are updated throughout.
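A minimal sketch of the two flows combined for a read, reusing the marks structure from the sketch above; replace_request, mem_read and pick_victim are assumed stand-ins for the replacement request sending logic, the mem read logic, and the replacement algorithm (pick_victim is sketched under the replacement logic below).

    /* assumed stand-ins for the replacement request sending logic,
       the mem read logic, and the replacement algorithm */
    extern void replace_request(uint32_t aligned_addr, int line); /* fetch block, write mem */
    extern uint32_t mem_read(int line, uint32_t addr);
    extern int pick_victim(const cache_marks_t *m);

    uint32_t handle_read(cache_marks_t *m, uint32_t addr, uint32_t now)
    {
        int line = lookup(m, addr);
        if (line < 0) {                               /* miss flow */
            line = pick_victim(m);                    /* replacement object */
            uint32_t aligned = addr & ~((1u << OFFSET_BITS) - 1); /* cacheline align */
            replace_request(aligned, line);           /* block from main storage into mem */
            m->vld[line] = true;                      /* update internal marks */
            m->tag[line] = addr_tag(addr);
        }
        m->lru[line] = now;                           /* access record */
        return mem_read(line, addr);                  /* execute the read against mem */
    }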
The replacement just described exploits the locality regularity of cache operation, including temporal locality (recently used data has a high probability of being reused) and spatial locality (data adjacent to recently used data has a high probability of being used soon). The first access to the cache always misses; a cacheline data block containing the requested data, i.e., the requested data plus several adjacent data items, is selected and retained, exploiting both kinds of locality. The probability that subsequently accessed data is exactly the data moved in earlier is proportional to the strength of locality: the stronger the locality, the more of the fetched data is accessed and the less goes unused.
The better the locality in a design, the larger the cacheline should be, reducing the number of copy transfers from the main storage (besides transferring one data item per cycle, a main storage access also spends time interpreting the request, and the latter dominates the total response time); conversely, worse locality calls for a smaller cacheline. This rule shows up very commonly in practice: with sequential address accesses, the data in a cacheline is used contiguously, locality is good, and a larger cacheline obtains a higher hit rate; in loop logic a few data items are reused repeatedly, and branch logic jumps far and then stays within a small range, so locality is weak and suits a small cacheline, avoiding a large cacheline full of useless data that occupies precious small-storage space and lowers the hit rate.
Therefore, starting from a conventional fixed-cacheline cache design, the present application identifies and modifies the key nodes to realize a switchable cacheline function, improving the cache's memory access efficiency when facing requests of varying locality.
As shown in fig. 4, the core principle takes the cacheline size of the conventional cache as the X1 gear. Merging 2 cachelines of the X1 gear into one cacheline, so that when a replacement occurs a data block twice the size of the X1 cacheline is requested to update the 2 merged cacheline positions, is recorded as the X2 gear; X4, X8 follow by analogy up to the Xn gear, where n grows by powers of 2 until it reaches the size of the whole mem. Switching gears means the cacheline can move within 1 to n times the X1 cacheline size: when locality is good, a large n fully exploits it and reduces the overhead of repeated requests; when locality is poor, a small n avoids the overhead of fetching too much useless data.
As shown in figs. 5-6, the implementation details of making the cacheline switchable are as follows. In the cache design there is a one-to-one correspondence between a cacheline and its mark signals. The mark signals corresponding to one X1 cacheline form the basic unit; the Xn gear updates the mark signals of n cachelines simultaneously, so that n X1 cachelines update exactly like one Xn cacheline.
For the replacement logic: it generates the cacheline object to be replaced when a miss occurs (the X1 gear points to one cacheline; the Xn extension points to n cachelines). Concretely, the read/write order history of the cachelines is recorded and a replacement object is then determined by some algorithm, of which there are many. Take the common Least Recently Used (LRU) algorithm as an example: suppose the X1-gear cache has 8 cachelines, numbered 0 to 7, and they are accessed in order 0 to 7. Then number 0 is the earliest used, i.e., the least recently used cacheline; by locality it has the lowest probability of being revisited, so it becomes the object to be replaced. In implementation, each cacheline is given a weight; more recent accesses raise the weight, and at any moment the cacheline with the lowest weight is the next object to be replaced.
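A sketch of the weight-based lru selection just described: each access stamps its position with the current time, the position with the smallest stamp is the least recently used, and invalid positions are taken first. The timestamp realization is one common choice and an assumption here.

    /* replacement object: an invalid position if one exists, otherwise the
       position with the lowest access weight (the least recently used one) */
    int pick_victim(const cache_marks_t *m)
    {
        int victim = 0;
        for (int i = 0; i < NLINES; i++) {
            if (!m->vld[i])
                return i;                   /* empty position: fill it first */
            if (m->lru[i] < m->lru[victim])
                victim = i;                 /* older access -> lower weight */
        }
        return victim;
    }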
Valid flags for cachelines (vld): suppose the X1-gear cache has 8 cacheline positions in total; when an update occurs, 1 of the 8 is selected and set valid. After switching to the Xn gear, under the internal switching command (from the switching control logic) the original control signals are extended so that n cacheline positions are operated on simultaneously instead of 1. Extension example: number the 8 X1-gear cachelines 0-7 in order. Switching to the X2 gear merges {0,1}, {2,3}, {4,5}, {6,7} into 4 X2 cachelines; merging follows the contiguity principle, so combinations such as {1,2} or {3,5} cannot occur. The original control signal generation logic still follows the X1-gear design, pointing at a fixed single number from 0 to 7, and the corresponding X2 cacheline can be read off directly from the grouping: for example, a pointer to 6 corresponds to X2 cacheline {6,7}, and when the update occurs the X2 gear must simultaneously set vld to 1 for cachelines 6 and 7 to indicate their data is valid.
For example, with the original pointer at number 3: switching to the X2 gear, clearing the lowest 1 bit of binary 11 gives 10 (decimal 2), the first number of the X2-gear combination {2,3}, and the 2 consecutive numbers starting from 2 are all the members of that combination. Switching to the X4 gear, clearing the lowest 2 bits of binary 11 gives 00 (decimal 0), the first number of the X4-gear combination {0,1,2,3}, and the 4 consecutive numbers starting from 0 are all the members of that combination. In general, switching to the Xn gear means clearing the low x = log2(n) bits of the cacheline number pointed to by the original signal to obtain the first number of the Xn-gear cacheline combination; the following n-1 consecutive numbers are the remaining members.
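Since the gears are powers of two, clearing the low log2(n) bits is the same as masking with ~(n-1). A sketch of the grouping rule and the extended vld update, reusing the earlier structure (names assumed):

    /* first member of the Xn-gear combination containing X1 cacheline idx;
       n must be a power of two (X1, X2, X4, ...) */
    static int group_base(int idx, int n) { return idx & ~(n - 1); }

    /* Xn-gear vld update: the single X1 set-valid action is extended to the
       n contiguous positions of the combination */
    static void set_valid_xn(cache_marks_t *m, int idx, int n)
    {
        for (int i = 0; i < n; i++)
            m->vld[group_base(idx, n) + i] = true;
    }

For idx = 3, group_base(3, 2) = 2, giving the X2 combination {2,3}; group_base(3, 4) = 0, giving the X4 combination {0,1,2,3}, matching the worked example above.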
Cacheline main-storage tags (tag): in the X1 gear, when an update occurs, part of the bits of the address corresponding to the cacheline is stored in the corresponding tag position. Switching to the Xn gear means updating n tag positions simultaneously; the method extends from 1 to n just as for vld, and in addition the stored tag values need alignment correction and incremental extension.
Alignment correction: after switching to the X2 gear, denote the tags of the 2 merged X1 cachelines, in order, tag_A and tag_B. In practice the tag value of the request may be tag_B, which must then be corrected to tag_A (because the replacement request logic requests the data corresponding to tag_A first, in order). Since the extension follows the sequential contiguous alignment principle, the correction is realized by the same clear-the-low-bits method.
Incremental extension: alignment correction fixes the first updated tag; when updating multiple tags, the values for the subsequent cachelines are obtained by successively adding 1 to the first tag.
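The same masking yields the alignment correction of the tag value (tag_B down to tag_A), and the incremental extension is a loop adding 1. A sketch under the same assumptions, reusing group_base from the previous sketch:

    /* Xn-gear tag update: align the request tag down to the combination's
       first tag (tag_A), then fill the n positions with tag_A, tag_A+1, ... */
    static void set_tags_xn(cache_marks_t *m, int idx, uint32_t req_tag, int n)
    {
        uint32_t tag_a = req_tag & ~(uint32_t)(n - 1);   /* alignment correction */
        for (int i = 0; i < n; i++)
            m->tag[group_base(idx, n) + i] = tag_a + (uint32_t)i; /* incremental extension */
    }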
as shown in fig. 7, the replacement request sending logic: when the internal switching command is valid, the original read-write request is corrected according to the lru output extension replacement object to obtain a replacement request, and the replacement request is sent to the main storage.
The switching control logic: first, the input switching control signals are synchronized, to avoid possible asynchronous sampling and to improve design compatibility. Concretely there are 2 signals, a switching valid signal and a switching code. The valid signal is a single bit marking the validity of the switching code; it is only deasserted after the cache has received the code and fed back an acknowledgment, so the switching code received by the cache is reliable. The switching code is a numeric code whose values correspond to the gears of different cacheline sizes.
After synchronization comes switching protection: during cache reads and writes the internal mark signals directly drive the execution, so the cache must wait for the interval after the current request has been executed before switching, and must not accept a new request before the switch finishes. After switching protection, the internal switching command is generated to command the switching logic.
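A behavioral C sketch of this control sequence, consistent with the other sketches; the register fields, the idle test and the acknowledge handshake are assumptions about one plausible realization, not the patent's exact interface:

    #include <stdbool.h>

    typedef struct {
        volatile bool switch_valid; /* single-bit: the switching code is valid */
        volatile int  switch_code;  /* numeric code per gear: 0 -> X1, 1 -> X2, ... */
        volatile bool switch_ack;   /* cache feedback: code received, valid may drop */
    } switch_ctrl_t;

    extern bool cache_idle(void);            /* no request in flight (assumed) */
    extern void gate_new_requests(bool on);  /* request gate (assumed) */

    /* switching control logic: synchronize, protect, then command the switch */
    void do_switch(switch_ctrl_t *r, int *gear_log2)
    {
        if (!r->switch_valid)
            return;
        gate_new_requests(true);      /* switching protection: no new requests */
        while (!cache_idle())
            ;                         /* wait for a request-processing interval */
        *gear_log2 = r->switch_code;  /* internal switching command takes effect */
        r->switch_ack = true;         /* feedback ends the handshake */
        gate_new_requests(false);
    }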
In summary, after the 2 nodes, the internal marks and the replacement request sending logic, have been modified, adding the switching control logic realizes the switchable cacheline function.
The scheme has the following characteristics:
1. Building a large cacheline by splicing small cachelines avoids the problems that complicated alignment would otherwise cause when switching cachelines, so the implementation, which corrects only 2 control nodes of the principle, is simple to design and easy to realize; the logic cost is roughly 1 adder of tag bit width, no more than 20 registers, and a few combinational logic units;
2. Under continuous accesses of similar locality characteristics the cacheline size tends toward an optimal value; the scheme flexibly provides multi-level switching, so the cacheline setting can approach that optimum and better switching efficiency is obtained. The detailed derivation is as follows.
Taking a read operation as the analysis object, the request response divides into 3 stages, as shown in the figure below: cache return to the request end, cache replacement write, and storage return to the cache.
Cache return to request end: the request hits directly, or the replacement after a miss has completed; this time is usually fixed at 1-2 cycles and is denoted A.
Cache replacement write: the time to write the replacement data into the cache; each returned data item takes 1 cycle, so the total equals the number of request data items contained in one cacheline, denoted m; for example, with 32b request data and a 256b cacheline, one cacheline holds 256/32 = 8 request data items.
Storage return to cache: the time from the storage receiving the request to the data returning, usually tens to hundreds of cycles, denoted B.
The cache request response in each case is summarized in the following table:
    Case   Request response period
    hit    A
    miss   A + B + m
By analysis, a hit involves no replacement, so the response period is always A, independent of cacheline size; the request response period on a miss is A + B + m, where A and B are constants.
Suppose there are X requests in total and that on average n data items are used per cacheline (the application or program at the request end determines how the used data is distributed in main storage; dividing main storage by cachelines and averaging gives the number of used data items per cacheline). The average request response period is then (X/n × (A + B + m) + (X - X/n) × A) / X = A + B/n + m/n, where m and n depend on the cacheline size and on the locality of the request-end data; the result is optimal when m/n approaches 1, i.e., when nearly every fetched data item is actually used.
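A small numeric check of the model A + B/n + m/n, with illustrative constants (A = 2 and B = 100 cycles are assumptions):

    #include <stdio.h>

    /* average request response period: A + B/n + m/n
       A: cache return time, B: storage return time,
       m: data items per cacheline, n: data items actually used per cacheline */
    static double avg_cycles(double A, double B, double m, double n)
    {
        return A + B / n + m / n;
    }

    int main(void)
    {
        /* strong locality: a big cacheline (m = 64) is almost fully used */
        printf("%.1f\n", avg_cycles(2, 100, 64, 60));  /* ~4.7 cycles */
        /* weak locality: the same big cacheline is mostly wasted */
        printf("%.1f\n", avg_cycles(2, 100, 64, 2));   /* 84.0 cycles */
        /* weak locality served by a small cacheline (m = 8) */
        printf("%.1f\n", avg_cycles(2, 100, 8, 2));    /* 56.0 cycles */
        return 0;
    }

The three cases illustrate why gear switching helps: a large cacheline wins when locality is strong, and a small one wins when it is weak.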
The technical scope of the present invention is not limited to the above description; those skilled in the art can make various changes and modifications to the above embodiments without departing from the technical spirit of the invention, and such changes and modifications shall all fall within the protection scope of the invention.

Claims (4)

1. A method for rapidly realizing variable cacheline switching, characterized by comprising the following steps:
S1, using the size of the cache's original cacheline as the reference unit, merging consecutive reference cachelines into a new cacheline, whose size is denoted the Xn gear, meaning the cacheline spans n reference cachelines; the minimum n is 1 and the maximum n is the number of all reference cachelines in the cache;
S2, keeping the tag marks sized to the reference cacheline, and processing requests in the X1 gear exactly as the original cache does; on a cache hit the X1 gear and the Xn gear behave identically; on a cache miss the X1 gear updates and replaces only one reference cacheline, while switching to the Xn gear contiguously enlarges the original replacement logic's processing unit, changing from replacing 1 to replacing n reference cachelines;
S3, Xn gear switching sets a switching control register through a configuration interface; the register drives the internal switching selection logic, and the whole switching process occurs in an interval between cache request processing.
2. The method for rapidly realizing variable cacheline switching according to claim 1, characterized in that: the reference cachelines in step S1 are grouped in consecutive powers of 2.
3. The method for rapidly realizing variable cacheline switching according to claim 1, characterized in that: the tag identification in step S2 is the existing flag of a cacheline's unique valid information; no new tags are added.
4. The method for rapidly realizing variable cacheline switching according to claim 1, characterized in that: the original hit judgment logic and the post-hit processing logic are retained in step S2; on a cache miss, the original replacement logic's processing unit is contiguously enlarged, changing from replacing 1 to replacing n reference cachelines.
CN202211212006.6A 2022-09-30 2022-09-30 Method for rapidly realizing variable cacheline switching Pending CN115599710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211212006.6A CN115599710A (en) 2022-09-30 2022-09-30 Method for rapidly realizing variable cacheline switching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211212006.6A CN115599710A (en) 2022-09-30 2022-09-30 Method for rapidly realizing variable cacheline switching

Publications (1)

Publication Number Publication Date
CN115599710A true CN115599710A (en) 2023-01-13

Family

ID=84844801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211212006.6A Pending CN115599710A (en) 2022-09-30 2022-09-30 Method for rapidly realizing variable cacheline switching

Country Status (1)

Country Link
CN (1) CN115599710A (en)

Similar Documents

Publication Publication Date Title
CN102110058B Caching method and device with low miss rate and low miss penalty
US5535361A (en) Cache block replacement scheme based on directory control bit set/reset and hit/miss basis in a multiheading multiprocessor environment
US6356990B1 (en) Set-associative cache memory having a built-in set prediction array
US6023747A (en) Method and system for handling conflicts between cache operation requests in a data processing system
EP3964967B1 (en) Cache memory and method of using same
JP2011065574A (en) Cache memory controller and control method
CN110297787B (en) Method, device and equipment for accessing memory by I/O equipment
US6745291B1 (en) High speed LRU line replacement system for cache memories
US10275358B2 (en) High-performance instruction cache system and method
CN110427332B (en) Data prefetching device, data prefetching method and microprocessor
CN115168248B (en) Cache memory supporting SIMT architecture and corresponding processor
US6240489B1 (en) Method for implementing a pseudo least recent used (LRU) mechanism in a four-way cache memory within a data processing system
CN105095104A (en) Method and device for data caching processing
CN114036077B (en) Data processing method and related device
CN116501249A (en) Method for reducing repeated data read-write of GPU memory and related equipment
EP0386719A2 (en) Partial store control circuit
CN113392043A (en) Cache data replacement method, device, equipment and storage medium
US4608671A (en) Buffer storage including a swapping circuit
US6412059B1 (en) Method and device for controlling cache memory
CN109669881B (en) Computing method based on Cache space reservation algorithm
CN107506139 Write request optimization device oriented to phase-change memory
CN115599710A (en) Method for rapidly realizing variable cacheline switching
CN100471175C (en) Message storage forwarding method and message storage forwarding circuit
CN111831587A (en) Data writing method and device and electronic equipment
CN116166606B (en) Cache control architecture based on shared tightly coupled memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination