CN116521578A - Chip system and method for improving instruction cache prefetching execution efficiency - Google Patents

Chip system and method for improving instruction cache prefetching execution efficiency

Info

Publication number
CN116521578A
CN116521578A (application CN202310799269.XA)
Authority
CN
China
Prior art keywords
prefetch
cache
processor
address
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310799269.XA
Other languages
Chinese (zh)
Inventor
陈小平
强鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taichu Wuxi Electronic Technology Co ltd
Original Assignee
Taichu Wuxi Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taichu Wuxi Electronic Technology Co ltd filed Critical Taichu Wuxi Electronic Technology Co ltd
Priority to CN202310799269.XA priority Critical patent/CN116521578A/en
Publication of CN116521578A publication Critical patent/CN116521578A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses a chip system and a method for improving the execution efficiency of instruction cache prefetching, belonging to the technical field of computers. The system comprises a processor, a branch predictor, a first-level cache, a prefetch buffer, a prefetch control unit and a second-level cache. The prefetch buffer stores data transferred between devices running at different speeds, reducing the time they spend waiting on each other; the prefetch control unit controls prefetch operations issued to the second-level cache, and is connected to the sequential addresses of instructions executed by the processor and the branch prediction addresses output by the branch predictor. The invention combines sequential prefetching with branch prefetching to improve the accuracy of prefetch addresses; it also controls the depth of the prefetch buffer, balancing improvements in prefetch coverage against accuracy; in addition, it eliminates prefetch operations that could cause the program to run away, and cancels unnecessary prefetch operations.

Description

Chip system and method for improving instruction cache prefetching execution efficiency
Technical Field
The invention relates to the technical field of computers, in particular to a chip system and a method for improving the execution efficiency of instruction cache prefetching.
Background
In modern superscalar multi-core processor architectures, the processor clock frequency is above 1 GHz, while the memories that cache data and instructions run at lower clock frequencies; if the processor read and wrote data directly from these slower memories, its operating performance would be severely limited. To improve whole-chip performance, a common design places buffer memory between the processor and external memory to bridge the frequency gap when the processor accesses data and instructions. Multi-core processor architectures typically employ a multi-level cache hierarchy, with the first-level cache split into an instruction cache and a data cache, and the second-level cache shared among the cores.
The first-level cache is the memory closest to the processor. When the chip executes a program, the processor performs frequent read and write accesses to the first-level cache, so the frequency of first-level cache misses directly determines system performance. Because the first-level cache is limited in capacity, and because of the compulsory misses that occur while its contents are first being filled after power-on, misses inevitably occur when the processor accesses it; each miss requires many clock cycles to read the second-level cache, stalling the processor pipeline and markedly reducing system performance. To improve system performance and reduce the first-level cache miss rate, prefetching is generally employed in the first-level cache design to mitigate the performance loss these misses cause.
Prefetch designs include software prefetching and hardware prefetching. Software prefetching requires the compiler to insert address-predicting prefetch instructions during program compilation, with the hardware performing the prefetch operation when those instructions execute; software prefetching therefore increases the program's instruction overhead. Hardware prefetching predicts the instruction addresses that follow the current program counter after the processor encounters a cache miss during program execution, and hides the time needed to read those subsequent addresses behind the second-level cache read issued for the current program counter. The prefetches described in this disclosure are hardware prefetches.
A prefetch unit brings a measurable benefit in reducing the processor's instruction cache miss rate and improving system performance. A prior-art instruction cache with a prefetch unit is shown in FIG. 1. The processor accesses the instruction cache as follows: while a program executes, the processor sends an instruction query address to the first-level cache. On receiving the query address, the first-level cache queries its cache unit and the prefetch buffer simultaneously; if the query address matches in neither, the query is a first-level cache miss. The first-level cache then starts a second-level cache read, using both the currently queried cache line address and the prefetch address predicted for the processor. Reading the second-level cache typically takes more than 50 clock cycles, during which the processor pipeline stalls (except on processors with out-of-order execution capability). After the instruction data is read from the second-level cache, the data for the current cache line address is returned to the processor and the prefetched data is written back to the prefetch buffer. The next time the processor fetches an instruction, if the query address fails to match in the first-level cache but matches in the prefetch buffer, it is still counted as a first-level cache hit: the first-level cache returns the prefetch buffer's data to the processor, saving the second-level cache read time and improving system performance.
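The prior-art lookup flow described above can be sketched with a toy model. This is an illustrative sketch only: the function names, the address arithmetic, and the single-cycle hit cost are assumptions, not part of the disclosure; only the simultaneous L1/prefetch-buffer query and the roughly 50-cycle L2 read come from the text.

```python
# Toy model of the prior-art flow: the query address is checked against the
# L1 cache and the prefetch buffer simultaneously; a match in either counts
# as an L1 hit, a miss pays the L2 latency and fills the prefetch buffer.
L2_LATENCY = 50  # clock cycles, per the figure quoted in the text

def lookup(addr, l1_lines, prefetch_buffer):
    """Return (hit, cycles) for one instruction query (addresses are line-granular)."""
    if addr in l1_lines or addr in prefetch_buffer:
        return True, 1                  # hit in L1 or the prefetch buffer
    # Miss: read L2 for the demand line and one predicted prefetch line.
    l1_lines.add(addr)                  # demand data returned to the processor
    prefetch_buffer.add(addr + 1)       # prefetched line written to the buffer
    return False, L2_LATENCY

l1, pbuf = set(), set()
hit0, c0 = lookup(100, l1, pbuf)        # cold miss: reads the L2 cache
hit1, c1 = lookup(101, l1, pbuf)        # sequential next line: prefetch buffer hit
```

As the model shows, the second, sequential fetch avoids the L2 read entirely because the line was already placed in the prefetch buffer.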
Current instruction first-level cache designs use prefetch techniques including next-line prefetch, branch-directed prefetching, and temporal instruction streaming. These techniques reduce the processor's first-level cache miss rate to some extent, but the prior art still has several problems:
1. Each prefetch technique has design limitations:
In a next-line prefetch design, when a first-level cache miss occurs, the address of the line following the missed cache line is always selected as the prefetch address. This prefetch operation is simple to implement but has low accuracy; in particular, when the processor takes a branch jump, the prefetch performed beforehand by the prefetch unit becomes useless, so a next-line design cannot resolve such cache misses. For most program instruction streams, roughly half of the instructions are not executed sequentially, so the accuracy of next-line prefetching falls below 50%.
In branch-directed prefetching, determining the subsequent prefetch address is the key design issue, and prefetch accuracy depends entirely on the accuracy of branch prediction.
Temporal instruction streaming requires a relatively large miss log table that records the program's miss history before prefetching can be performed; when the processor encounters a first-level cache miss, the prefetch unit must first search this table and find the corresponding record to obtain a prefetch address. The drawbacks of this design are that the miss log table occupies considerable storage, that querying it costs time, and that it cannot resolve first-level cache misses caused by compulsory misses.
2. Prefetching can pollute the cache. When the processor encounters a first-level cache miss, a correctly predicted prefetch fetches the data in advance and hides the memory access latency. If the address cannot be predicted accurately, however, the prefetch may pollute the cache (the prefetched cache blocks may evict potentially useful cache blocks), actually reducing the efficiency of the first-level cache.
3. Regarding prefetch bandwidth overhead: when the processor's instruction query address misses in both the instruction cache and the prefetch buffer, and the prefetch unit then performs a prefetch, it may read the second-level cache for data that did not need to be prefetched, incurring extra second-level cache read bandwidth.
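The accuracy limitation of next-line prefetching noted in point 1 can be illustrated with a toy metric. The traces below are hypothetical and not from the disclosure; the sketch simply counts how often the next fetch really is the next sequential line, which is exactly when a next-line prefetch is useful.

```python
# Next-line prefetch predicts addr+1 after every miss, so its best-case
# accuracy equals the fraction of sequential transitions in the fetch stream.
def next_line_accuracy(trace):
    """Fraction of transitions in a line-address trace where the next fetch is addr+1."""
    sequential = sum(1 for a, b in zip(trace, trace[1:]) if b == a + 1)
    return sequential / (len(trace) - 1)

straight = [0, 1, 2, 3, 4, 5]            # purely sequential code: accuracy 1.0
branchy  = [0, 1, 8, 9, 2, 3, 8, 0]      # a stream with frequent branch jumps
```

On the hypothetical branchy trace the accuracy drops below 50%, matching the observation in the text that next-line prefetching cannot exceed the sequential fraction of the instruction stream.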
Disclosure of Invention
To improve the prefetch execution efficiency of the instruction cache, the invention provides a chip system and a method for improving instruction cache prefetch execution efficiency.
In order to solve the technical problems, the technical scheme of the invention is as follows:
according to a first aspect of the present disclosure, the present invention provides a chip system for improving instruction cache prefetch execution efficiency, including a processor, a branch predictor, a first level cache, a prefetch buffer, a prefetch control unit, and a second level cache;
the first-level cache and the second-level cache are buffer memories located between the processor and the main memory; the prefetch buffer is used for storing data transferred between devices running at different speeds, so as to reduce mutual waiting time between processes; the prefetch control unit is used for controlling prefetch operations issued to the second-level cache;
the prefetch control unit is coupled to sequential addresses of instructions executed by the processor and branch prediction addresses output by the branch predictor.
Further, the prefetch buffer is a depth configurable cache unit.
Further, upon processor access, the prefetch buffer is designed as a fully associative prefetch buffer.
The technical scheme adopts a design structure combining sequential prefetching and branch prefetching, thereby improving the accuracy of prefetching addresses.
According to a second aspect of the present disclosure, the present invention provides a method for improving the execution efficiency of instruction cache prefetching, where the method is implemented by using the aforementioned chip system for improving the execution efficiency of instruction cache prefetching;
the method for improving the execution efficiency of the instruction cache prefetching comprises the following steps:
s1, in the process of executing a program, a chip system sends an instruction inquiry address to a first-level cache, and the first-level cache inquires a prefetch buffer and an instruction cache of the first-level cache at the same time after receiving the inquiry address of the processor;
when the query address of the processor is successfully matched with the address of the first-level cache or the prefetch buffer, hit data is returned to the processor;
when the matching of the query address of the processor and the addresses of the first-level cache and the prefetch buffer fails, querying that the first-level cache is not hit, and performing step S2;
s2, when the processor inquires a first-level cache miss, the prefetching control part inquires the addresses needing prefetching to the first-level cache after obtaining the prefetched addresses;
when the pre-fetch address is successfully matched with the first-level cache, the data needing to be pre-fetched is indicated to be already in the first-level cache, and the pre-fetch control part cancels the operation of reading the second-level cache by using the pre-fetch address;
when the first-level cache is failed to be matched by the pre-fetch address, the data needing to be pre-fetched is not in the first-level cache, and the pre-fetch control part reads the second-level cache by using the pre-fetch address.
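The filter in step S2 can be sketched in a few lines. This is an illustrative model with hypothetical addresses; only the decision rule (query L1 first, cancel the L2 read on a match, issue it on a mismatch) comes from the disclosure.

```python
# Step S2 filter: check each prefetch address against the L1 cache before
# reading L2, cancelling prefetches whose data is already cached.
def issue_prefetches(prefetch_addrs, l1_lines):
    """Return the subset of prefetch addresses that actually go to the L2 cache."""
    l2_reads = []
    for addr in prefetch_addrs:
        if addr in l1_lines:
            continue              # already in L1: cancel the L2 read
        l2_reads.append(addr)     # not in L1: read L2 with this address
    return l2_reads

l1 = {0x100, 0x140}               # hypothetical resident L1 line addresses
```

Only the addresses absent from L1 reach the second-level cache, which is how the design saves L2 bus bandwidth.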
Compared with the prior art, in this design, when the processor encounters a first-level cache miss, the prefetch control unit does not immediately issue a prefetch read to the second-level cache after obtaining the prefetch addresses; it first queries the first-level cache with the addresses to be prefetched. If a prefetch address matches in the first-level cache, the prefetch control unit cancels the second-level cache read for that address; if it fails to match, the prefetch control unit reads the second-level cache with it. Through this processing, the prefetch control unit cancels invalid prefetch operations and saves second-level cache bus bandwidth.
Further, in step S2, after the instruction data is read from the second-level cache, the data for the current cache line address is returned to the processor and the prefetched data is written back to the prefetch buffer. The next time the processor fetches an instruction, if the query address fails to match in the first-level cache but matches in the prefetch buffer, it is still counted as a first-level cache hit: the first-level cache returns the prefetch buffer's data to the processor, saving the second-level cache read time and improving system performance.
Further, according to the relationship between the current instruction address and the next instruction address, the instructions executed by the processor are classified as sequential fetch instructions, unconditional jump instructions, and conditional branch jump instructions;
when the processor executes program instructions, the instruction type is indicated to the first-level cache by the sequential-branch instruction identifier (seq_branch indication);
when the seq_branch indication output by the processor to the prefetch control unit indicates a sequential fetch instruction, the prefetch address selected by the prefetch control unit is the sequential fetch address;
when the seq_branch indication indicates an unconditional jump instruction, the prefetch address is the branch instruction address;
when the seq_branch indication indicates a conditional branch jump instruction, the prefetch addresses are both the sequential fetch address and the branch instruction address;
the target addresses of conditional branch jump and unconditional jump instructions are calculated by the branch predictor. When the processor encounters a first-level cache miss, the prefetch control unit can thus accurately prefetch the address of the next instruction to execute according to the seq_branch indication.
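The address-selection rule above can be sketched as a small dispatch function. The string encodings of the seq_branch indication are hypothetical; the disclosure fixes only the mapping from instruction type to the prefetch address or addresses.

```python
# Mapping from the seq_branch indication to the prefetch address(es) the
# prefetch control unit selects. Encodings are illustrative assumptions.
SEQ, UNCOND_JUMP, COND_BRANCH = "seq", "uncond", "cond"

def select_prefetch_addrs(kind, seq_addr, branch_addr):
    if kind == SEQ:
        return [seq_addr]                # sequential fetch address only
    if kind == UNCOND_JUMP:
        return [branch_addr]             # branch target from the branch predictor
    if kind == COND_BRANCH:
        return [seq_addr, branch_addr]   # both possible successor addresses
    raise ValueError(f"unknown seq_branch indication: {kind}")
```

Prefetching both successors of a conditional branch is what gives the combined sequential-plus-branch design its higher address accuracy.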
Further, the width and depth of the prefetch buffer units are set by parameter configuration. The width is configurable so as to adapt to the cache line sizes of the different cache levels and the fetch address widths of different processors. The depth configuration must balance prefetch coverage against accuracy: once the prefetch buffer is configured, the prefetch control unit can set how much data is prefetched on each first-level cache miss according to the current phase of program execution (for example, the compulsory-miss phase after power-on versus the phase in which misses are caused by capacity limits), so as to improve both coverage and accuracy.
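One possible policy for this depth trade-off is sketched below. The disclosure does not specify how the phase is detected or what the thresholds are; the miss-rate heuristic and all numeric values here are assumptions for illustration only.

```python
# Hypothetical policy: prefetch aggressively during the compulsory-miss
# (warm-up) phase, and more conservatively once the cache is warm, capped
# by the configured prefetch buffer depth.
def prefetch_degree(miss_count, access_count, buffer_depth, warmup_rate=0.5):
    """Number of lines to prefetch per L1 miss, given the observed miss rate."""
    miss_rate = miss_count / max(access_count, 1)
    if miss_rate > warmup_rate:
        return buffer_depth                 # warm-up: maximize coverage
    return max(buffer_depth // 4, 1)        # warm cache: favor accuracy
```

A high observed miss rate suggests compulsory misses, where broad coverage helps; a low one suggests capacity-driven misses, where smaller prefetches avoid pollution.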
Furthermore, each unit of the prefetch buffer adopts a fully associative design; while running a program, the processor uses the program counter to query the first-level cache and, at the same time, match the cache line address of every prefetch buffer unit;
a successful tag match in the first-level cache, or a successful match in any prefetch buffer unit, is regarded as a first-level cache hit. After a prefetch buffer unit matches successfully, only that matched unit is written back into the first-level cache, which solves the problem of the prefetch unit polluting the cache.
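A minimal sketch of this fully associative probe follows. The parallel tag comparison of real hardware is modelled here as a scan, and the data structures and names are illustrative assumptions; the key behavior from the text is that only the single matched unit is written back into L1.

```python
# Fully associative prefetch buffer probe: every entry's cache-line address
# is compared against the program counter's line address; on a match, only
# that entry is promoted into L1, so unused prefetched lines never evict
# useful L1 lines (the cache-pollution fix described in the text).
def probe(pc_line_addr, l1_tags, pbuf_entries):
    """pbuf_entries: list of (line_addr, data). Returns (hit, updated_l1_tags)."""
    if pc_line_addr in l1_tags:
        return True, l1_tags                        # L1 tag match
    for line_addr, _data in pbuf_entries:
        if line_addr == pc_line_addr:               # fully associative match
            return True, l1_tags | {line_addr}      # write back only this unit
    return False, l1_tags                           # counted as an L1 miss
```

Note that a miss leaves the L1 tag set unchanged: nothing speculative is installed until a prefetched line is actually demanded.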
Further, in step S2, when the processor encounters a first-level cache miss, the prefetch control unit, after obtaining the prefetch addresses, checks every prefetch address against the current processor query address for physical page crossing; only when a prefetch address and the processor's query address belong to the same physical page (typically 4 KB) does the prefetch control unit perform a prefetch for that address.
This eliminates prefetch operations that could cause the program to run away. Because the instruction first-level cache generally adopts a VIPT (virtually indexed, physically tagged) structure, the processor queries it with virtual addresses while the second-level cache is read with physical addresses, and contiguous virtual addresses may be discontiguous in physical memory. If the prefetch data address and the processor query address do not belong to the same physical page, the prefetch control unit prefetches erroneous data; the next time the processor executes a subsequent instruction, its first-level cache query may unfortunately hit this erroneous data, causing the program to run away.
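The page-crossing check reduces to a shift comparison, sketched below. The 12-bit shift corresponds to the 4 KB page size the text calls typical; function names are illustrative.

```python
# Same-physical-page filter: a prefetch is issued only when the prefetch
# address lies in the same (typically 4 KB) page as the demand address,
# since contiguous virtual addresses may map to discontiguous physical pages
# under the VIPT structure described in the text.
PAGE_SHIFT = 12   # 4 KB pages

def same_page(a, b):
    return (a >> PAGE_SHIFT) == (b >> PAGE_SHIFT)

def filter_prefetches(demand_addr, prefetch_addrs):
    """Keep only prefetch addresses in the demand address's physical page."""
    return [p for p in prefetch_addrs if same_page(demand_addr, p)]
```

A prefetch address just past a page boundary (for example 0x2000 when the demand address is 0x1FF0) is dropped, since its physical page is unknown.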
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
1. The invention adopts a design structure combining sequential prefetching and branch prefetching, improving the accuracy of prefetch addresses;
2. when the processor (CPU) encounters a first-level cache miss (L1-cache miss), the prefetch control unit (prefetch ctrl), after obtaining the prefetch addresses, does not immediately issue a prefetch read to the second-level cache (L2-cache) but first queries the first-level cache (L1-cache) with the addresses to be prefetched; if a prefetch address matches in the first-level cache, the data to be prefetched is already there, and the prefetch control unit cancels the second-level cache read for that address; if it fails to match, the data is not in the first-level cache, and the prefetch control unit reads the second-level cache with it; through this processing, the prefetch control unit cancels invalid prefetch operations and saves the bus bandwidth of second-level cache reads;
3. each unit of the prefetch buffer adopts a fully associative design: while running a program, the processor uses the program counter (PC) to query the first-level cache and simultaneously match the cache line addresses of all prefetch buffer units; a successful tag (TAG) match in the first-level cache, or a successful match in any prefetch buffer unit, counts as a first-level cache hit, and after a unit matches, only that matched unit is written back into the first-level cache, solving the problem of the prefetch unit polluting the cache;
4. the width and depth of the prefetch buffer units are set by parameter configuration: the width adapts to the cache line sizes of the different cache levels and the fetch address widths of different processors (CPUs), while the depth configuration balances prefetch coverage against accuracy; once the prefetch buffer is configured, the prefetch control unit can set how much data is prefetched on each first-level cache miss according to the current phase of program execution, improving both coverage and accuracy;
5. when the processor encounters a first-level cache miss, the prefetch control unit, after obtaining the prefetch addresses, checks every prefetch address against the current processor query address for physical page crossing; a prefetch is performed only when the prefetch address and the query address belong to the same physical page, eliminating prefetch operations that could cause the program to run away.
In general, the invention combines sequential prefetching with branch prefetching to improve prefetch address accuracy; it controls the depth of the prefetch buffer, balancing improvements in prefetch coverage against accuracy; in addition, it eliminates prefetch operations that could cause the program to run away, and cancels unnecessary prefetch operations.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a prior art system-on-chip;
FIG. 2 is a schematic diagram illustrating a chip system for improving the execution efficiency of instruction cache prefetching according to the present invention;
FIG. 3 is a flow chart of a method for improving the instruction cache prefetch execution efficiency according to the present invention;
the figure indicates:
11. a processor a; 12. a first level cache a; 13. prefetch buffer a; 14. a second level cache a;
1. a processor; 2. a branch predictor; 3. first-level caching; 4. a prefetch buffer; 5. a prefetch control section; 6. and (5) a second level cache.
Detailed Description
For a better understanding of the objects, structures and functions of the present invention, the technical solution of the present invention will be described in further detail with reference to the drawings and the specific preferred embodiments.
In the description of the present invention, it should be understood that terms such as "left", "right", "upper", and "lower" indicate orientations or positional relationships based on those shown in the drawings, are used merely for convenience in describing the invention and simplifying the description, and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be configured and operated in a specific orientation; "first", "second", and the like do not indicate the relative importance of components. None of these terms is to be construed as limiting the invention. The specific dimensions used in the examples illustrate the technical solution only and do not limit the scope of protection. It will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
Unless specifically stated or limited otherwise, the terms "mounted", "configured", "connected", "secured", and the like should be construed broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; and direct, indirect through an intermediate medium, or internal between two elements. The specific meanings of these terms in this application will be understood by those of ordinary skill in the art according to the circumstances.
The following references the Chinese and English terms appearing in the technical scheme of the invention:
CPU is the processor; Branch predictor is the branch predictor; L1-cache is the first-level cache; L2-cache is the second-level cache; prefetch buffer is the prefetch buffer; prefetch ctrl is the prefetch control unit; L1-cache HIT is a first-level cache hit; L1-cache miss is a first-level cache miss; cache line is a cache line; hit data is the hit data; return means returned; write back means written back; addr is an address; PC is the program counter. The full English word for seq is sequence, referring to sequential execution of instructions; the seq address is the sequential fetch address; branch refers to branch execution; the branch address is the branch instruction address; TAG is the cache tag; the seq_branch indication is the sequential-branch instruction identifier.
the L1-Cache comprises a Data Cache and a Instruction Cache, the Data Cache is a Data Cache, the Instruction Cache is an instruction Cache, the Data Cache and the instruction Cache are respectively used for storing Data and executing instructions of the Data, and the Data and the instruction Cache can be simultaneously accessed by a CPU, so that conflicts caused by the contention Cache are reduced, and CPU efficiency is improved.
Example 1:
As shown in fig. 2, the present invention provides a technical solution: a chip system for improving instruction cache prefetch execution efficiency, comprising a processor 1, a branch predictor 2, a first-level cache 3, a prefetch buffer 4, a prefetch control unit 5, and a second-level cache 6;
the first-level cache 3 and the second-level cache 6 are buffer memories located between the processor 1 and the main memory; the prefetch buffer 4 is used for buffering data transferred between devices operating at different speeds, so as to reduce the time they spend waiting for each other; the prefetch control unit 5 is used for controlling the prefetch operations performed on the second-level cache 6;
the prefetch control unit 5 receives the sequential address of instructions executed by the processor 1 and the branch prediction address output by the branch predictor 2.
Further, the prefetch buffer 4 is a depth configurable cache unit.
Further, the prefetch buffer 4 is designed as a fully associative buffer with respect to processor accesses.
The beneficial effects of the embodiment are that: a design structure combining sequential prefetching and branch prefetching is adopted, which improves the accuracy of the prefetch addresses.
Example 2:
as shown in fig. 1-3, the present invention further provides a technical solution: the method for improving the instruction cache prefetching execution efficiency is implemented by adopting the chip system for improving the instruction cache prefetching execution efficiency;
the method for improving the execution efficiency of the instruction cache prefetching comprises the following steps:
S1, during program execution, the processor 1 of the chip system sends an instruction query address to the first-level cache 3; after receiving the query address of the processor 1, the first-level cache 3 simultaneously queries the prefetch buffer 4 and the instruction cache of the first-level cache 3;
when the query address of the processor 1 matches an address in the first-level cache 3 or the prefetch buffer 4, the hit data is returned to the processor 1;
when the query address of the processor 1 fails to match in both the first-level cache 3 and the prefetch buffer 4, a first-level cache miss is recorded and step S2 is performed;
S2, when the processor 1 records a first-level cache miss, the prefetch control unit 5 obtains the prefetch addresses and queries the first-level cache 3 with the addresses to be prefetched;
when a prefetch address matches in the first-level cache 3, indicating that the data to be prefetched is already in the first-level cache 3, the prefetch control unit 5 cancels the operation of reading the second-level cache 6 with that prefetch address;
when a prefetch address fails to match in the first-level cache 3, indicating that the data to be prefetched is not in the first-level cache 3, the prefetch control unit 5 reads the second-level cache 6 with the prefetch address.
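The two-step query flow of steps S1-S2 can be sketched as a minimal behavioral model. This is an illustrative Python simulation, not the patent's hardware: the class and field names (`PrefetchSystem`, `l2_reads`, and so on) are assumptions, the caches are modeled as plain dictionaries, and the filtering step mirrors the cancellation rule described above.

```python
# Minimal behavioral model of steps S1-S2 (illustrative names only).
class PrefetchSystem:
    def __init__(self, l2):
        self.l1 = {}               # L1 instruction cache: line address -> data
        self.prefetch_buffer = {}  # prefetch buffer: line address -> data
        self.l2 = l2               # L2 cache modeled as the backing store
        self.l2_reads = 0          # counts read operations issued to the L2

    def fetch(self, addr, prefetch_addr):
        # S1: query the L1 cache and the prefetch buffer simultaneously.
        if addr in self.l1:
            return self.l1[addr]
        if addr in self.prefetch_buffer:
            # A prefetch-buffer match also counts as an L1 hit; only the
            # matched unit is written back into the L1.
            data = self.prefetch_buffer.pop(addr)
            self.l1[addr] = data
            return data
        # S2: on an L1 miss, read the demanded line from the L2 ...
        self.l2_reads += 1
        data = self.l2[addr]
        self.l1[addr] = data
        # ... and filter the prefetch: if the prefetch address already
        # matches in the L1, the L2 read for it is cancelled.
        if prefetch_addr not in self.l1:
            self.l2_reads += 1
            self.prefetch_buffer[prefetch_addr] = self.l2[prefetch_addr]
        return data
```

In this model a demand miss on one line with a prefetch of the next costs two L2 reads, and a later fetch of the prefetched line is served from the prefetch buffer with no additional L2 traffic.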
Referring to fig. 1, which shows the prior art, the detailed operation of the processor a11 accessing the instruction cache is as follows: during program execution, the processor a11 sends an instruction query address to the first-level cache a12. After receiving the query address of the processor a11, the first-level cache a12 simultaneously queries its cache units and the prefetch buffer a13; if the query address of the processor a11 fails to match in both, a first-level cache miss is recorded. The first-level cache a12 then initiates a read of the second-level cache a14, where the read address is the current query cache line address and the prefetch address is predicted by the processor a11. Reading the second-level cache a14 typically takes more than 50 clock cycles, during which the pipeline of the processor a11 stalls (except in processors capable of running ahead). When the instruction data arrives from the second-level cache a14, the data associated with the current cache line address is returned to the processor a11, and the data associated with the prefetch is written back to the prefetch buffer a13. The next time the processor a11 executes an instruction, if the instruction query address fails to match in the first-level cache a12 but matches in the prefetch buffer a13, this is also counted as a first-level cache a12 hit. The first-level cache a12 returns the data from the prefetch buffer a13 to the processor a11, which saves the time of reading the second-level cache a14 and improves system performance.
The beneficial effects of the embodiment are that: compared with the prior art of fig. 1, in the design of the present invention, when the processor 1 records a first-level cache miss, the prefetch control unit 5 obtains the prefetch addresses and, instead of immediately issuing the prefetch read to the second-level cache 6, first queries the first-level cache 3 with the addresses to be prefetched. If a prefetch address matches in the first-level cache 3, the data to be prefetched is already in the first-level cache 3, and the prefetch control unit 5 cancels the read of the second-level cache 6 for that prefetch address; if a prefetch address fails to match in the first-level cache 3, the data to be prefetched is not in the first-level cache 3, and the prefetch control unit 5 reads the second-level cache 6 with the prefetch address. In this way, the prefetch control unit 5 cancels some invalid prefetch operations, saving bus bandwidth for reading the second-level cache 6.
Example 3:
Referring to fig. 2, in step S2, after the instruction data is read from the second-level cache 6, the data associated with the current cache line address is returned to the processor 1, and the data associated with the prefetch is written back to the prefetch buffer 4. When the processor 1 next executes an instruction, if the instruction query address fails to match in the first-level cache 3 but matches in the prefetch buffer 4, this is also counted as a first-level cache 3 hit; the first-level cache 3 returns the data from the prefetch buffer 4 to the processor 1, which saves the time of reading the second-level cache 6 and improves system performance.
Example 4:
Referring to fig. 2, on the basis of embodiment 2, classified by the relationship between the address of the current instruction and the address of the next instruction, the types of instructions executed by the processor 1 include sequential instruction fetches, unconditional jump instructions, and conditional branch jump instructions;
when the processor 1 executes program instructions, the instruction type output by the processor 1 to the first-level cache 3 is indicated by the sequential branch instruction identifier;
when the sequential branch instruction identifier output by the processor 1 to the prefetch control unit 5 indicates a sequential instruction fetch, the prefetch address controlled by the prefetch control unit 5 is the sequential instruction-fetch address;
when the identifier indicates an unconditional jump instruction, the prefetch address controlled by the prefetch control unit 5 is the branch instruction address;
when the identifier indicates a conditional branch jump instruction, the prefetch addresses controlled by the prefetch control unit 5 are both the sequential instruction-fetch address and the branch instruction address;
the addresses of conditional branch jump instructions and unconditional jump instructions are calculated by the branch predictor 2. When the processor 1 queries the first-level cache 3 and misses, the prefetch control unit 5 can therefore accurately prefetch the address of the next instruction to be executed according to the sequential branch instruction identifier.
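The selection of prefetch addresses by the seq_branch identifier can be written as a small decision function. This is an illustrative sketch: the string tags used for the identifier values are hypothetical, not the actual hardware encoding.

```python
# Illustrative mapping from the seq_branch identifier to the prefetch
# address(es); the string tags are assumptions for this sketch.
def select_prefetch_addrs(seq_branch_id, seq_addr, branch_addr):
    if seq_branch_id == "sequential":
        # Sequential instruction fetch: prefetch the sequential address.
        return [seq_addr]
    if seq_branch_id == "unconditional_jump":
        # Unconditional jump: prefetch the predicted branch target only.
        return [branch_addr]
    if seq_branch_id == "conditional_branch":
        # Conditional branch: either outcome is possible, so both the
        # sequential address and the branch target are prefetched.
        return [seq_addr, branch_addr]
    raise ValueError(f"unknown identifier: {seq_branch_id}")
```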
Example 5:
Referring to fig. 2, on the basis of embodiment 2, the width and depth of the prefetch buffer 4 are parameterized. The configurable width adapts to different cache line sizes of the first-level cache 3 and different instruction-fetch address widths of the processor 1; the configurable depth balances prefetch coverage against prefetch accuracy. After the prefetch buffer 4 is configured, the prefetch control unit 5 can set the amount of data prefetched each time the processor 1 records a first-level cache miss according to the stage of program operation (for example, the compulsory-miss stage of the cache versus the miss stage caused by capacity limits), so as to improve both coverage and accuracy.
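A width/depth-parameterized buffer can be sketched as follows. The FIFO replacement policy is an assumption made for illustration; the embodiment only states that the width and depth are configurable.

```python
from collections import OrderedDict

class PrefetchBuffer:
    """Sketch of a prefetch buffer with parameterized width and depth."""

    def __init__(self, depth, line_bytes):
        self.depth = depth            # number of buffer units (configurable depth)
        self.line_bytes = line_bytes  # unit width, matching the L1 cache line size
        self.units = OrderedDict()    # cache line address -> line data

    def insert(self, addr, data):
        assert len(data) == self.line_bytes, "data must fill one cache line"
        if len(self.units) >= self.depth:
            self.units.popitem(last=False)  # evict the oldest unit (assumed FIFO)
        self.units[addr] = data

    def lookup(self, addr):
        # Fully associative: any unit may hold any cache line address.
        return self.units.get(addr)
```

With depth 2, inserting a third line evicts the oldest, so a deeper buffer raises coverage while a shallower one keeps only the most recent, presumably most accurate, prefetches.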
Example 6:
On the basis of embodiment 2, each unit of the prefetch buffer 4 adopts a fully associative design. During program execution, the processor 1 uses the program counter to query the first-level cache 3 while matching against the cache line address of each unit of the prefetch buffer 4;
a successful tag match in the first-level cache 3, or a successful match of any unit of the prefetch buffer 4, is counted as a first-level cache hit for the processor's query. After a buffer unit of the prefetch buffer 4 matches successfully, only that matched buffer unit is written back into the first-level cache 3, which prevents the prefetch unit from polluting the cache.
Example 7:
On the basis of embodiment 2, in step S2, when the processor 1 records a first-level cache miss, the prefetch control unit 5, after obtaining the prefetch addresses, must check all prefetch addresses against the current query address of the processor 1 for physical-page crossings; only when all prefetch addresses belong to the same physical page (typically 4 KB) as the query address of the processor 1 does the prefetch control unit 5 perform the prefetch operation on the obtained prefetch addresses.
This embodiment eliminates prefetch operations that could cause the program to "run off course". Because the instruction first-level cache generally adopts a VIPT (virtually indexed, physically tagged) design, the processor queries the instruction first-level cache with virtual addresses, while the addresses used to read the second-level cache 6 are physical addresses, and consecutive virtual addresses may be discontinuous in physical memory. If the prefetch data address and the processor query address do not belong to the same physical page, the prefetch control unit 5 would prefetch erroneous data. The next time the processor 1 executes a subsequent instruction and queries the first-level cache 3, it could "unfortunately" hit this erroneous data, causing the program to run off course.
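The cross-page filter of this embodiment reduces to an integer comparison of page numbers. A minimal sketch, assuming the typical 4 KB page size mentioned above; the function names are illustrative:

```python
PAGE_SIZE = 4096  # typical 4 KB physical page, as stated in the embodiment

def same_physical_page(query_addr, prefetch_addr, page_size=PAGE_SIZE):
    # Two addresses share a physical page iff their page numbers are equal.
    return (query_addr // page_size) == (prefetch_addr // page_size)

def filter_prefetches(query_addr, prefetch_addrs):
    # Issue the prefetch only when every prefetch address stays on the
    # same physical page as the processor's query address.
    if all(same_physical_page(query_addr, a) for a in prefetch_addrs):
        return prefetch_addrs
    return []  # a crossing-page prefetch is suppressed entirely
```

For example, a prefetch of 0x2000 issued alongside a query at 0x1FC0 crosses the 4 KB boundary and is discarded, since the contiguous virtual addresses may map to discontiguous physical pages.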
It is to be understood that the above examples of the present invention are provided by way of illustration only and are not intended to limit the embodiments of the present invention. Other variations or modifications in light of the above teachings will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention is intended to be covered by the following claims.

Claims (10)

1. The chip system for improving the execution efficiency of the instruction cache prefetching is characterized in that: the system comprises a processor (1), a branch predictor (2), a first-level cache (3), a prefetch buffer (4), a prefetch control unit (5) and a second-level cache (6);
the first-level cache (3) and the second-level cache (6) are buffer memories located between the processor (1) and the main memory; the prefetch buffer (4) is used for buffering data transferred between devices operating at different speeds, so as to reduce the time they spend waiting for each other; the prefetch control unit (5) is used for controlling the prefetch operations performed on the second-level cache (6);
the prefetch control unit (5) receives the sequential address of instructions executed by the processor (1) and the branch prediction address output by the branch predictor (2).
2. The system on a chip of claim 1, wherein the prefetch buffer (4) is a depth configurable cache unit.
3. The system on a chip of claim 1, wherein the mapping strategy of the prefetch buffer (4) is fully associative.
4. A method for improving the execution efficiency of instruction cache prefetching, which is implemented by adopting the chip system for improving the execution efficiency of instruction cache prefetching according to any one of claims 1 to 3;
the method for improving the execution efficiency of the instruction cache prefetching comprises the following steps:
s1, in the process of executing a program, a processor (1) sends an instruction inquiry address to a first-level cache (3), and after the first-level cache (3) receives the inquiry address of the processor (1), the instruction caches in a prefetch buffer (4) and the first-level cache (3) are inquired at the same time;
when the query address of the processor (1) is successfully matched with the address of the first-level cache (3) or the prefetch buffer (4), hit data is returned to the processor (1);
when the query address of the processor (1) is failed to match with the addresses of the first-level cache (3) and the prefetch buffer (4), querying the record of the first-level cache miss, and performing step S2;
s2, the processor (1) inquires the record of the first-level cache miss, and after the prefetching control part (5) obtains prefetched addresses, the addresses needing prefetching are inquired to the first-level cache (3);
when the pre-fetch address is successfully matched with the first-level cache (3), the pre-fetch control part (5) cancels the operation of reading the second-level cache (6) by using the pre-fetch address, wherein the operation indicates that the data needing to be pre-fetched is already in the first-level cache (3);
when the first-level cache (3) is failed to be matched with the prefetch address, the prefetch control part (5) uses the prefetch address to read the second-level cache (6) to indicate that the data needing to be prefetched is not in the first-level cache (3).
5. The method according to claim 4, wherein in step S2, after the instruction data is read from the secondary cache (6), the data associated with the current cache line address is returned to the processor (1), and the data associated with the prefetch is written back to the prefetch buffer (4).
6. The method according to claim 4, wherein the types of instructions executed by the processor (1) include sequential instruction fetch, unconditional jump, and conditional branch jump instructions; when the processor (1) executes program instructions, the instruction type output by the processor (1) to the first-level cache (3) is indicated by a sequential branch instruction identifier; both conditional branch jump instructions and unconditional jump instruction addresses are calculated by the branch predictor (2).
7. The method for improving instruction cache prefetch execution efficiency of claim 6, wherein:
when the sequential branch instruction identifier output by the processor (1) to the prefetch control unit (5) indicates a sequential instruction fetch, the prefetch address controlled by the prefetch control unit (5) is the sequential instruction-fetch address;
when the identifier indicates an unconditional jump instruction, the prefetch address controlled by the prefetch control unit (5) is the branch instruction address;
when the identifier indicates a conditional branch jump instruction, the prefetch addresses controlled by the prefetch control unit (5) are both the sequential instruction-fetch address and the branch instruction address.
8. The method for improving instruction cache prefetch execution efficiency according to claim 4, wherein the prefetch buffer (4) is a cache unit with configurable width and depth; the width of the prefetch buffer (4) can be configured to accommodate different instruction-fetch address widths of the processor (1) and different cache line sizes of the first-level cache (3); the depth of the prefetch buffer (4) can be configured to balance prefetch coverage and accuracy; after the prefetch buffer (4) is configured, the prefetch control unit (5) sets the amount of data prefetched each time the processor (1) records a first-level cache miss according to the stage of program operation.
9. The method according to claim 4, wherein each unit of the prefetch buffer (4) adopts a fully associative mapping strategy, and during operation of the chip system the processor (1) uses the program counter to query the first-level cache (3) while matching against the cache line address of each unit of the prefetch buffer (4);
a successful tag match in the first-level cache (3), or a successful match of any unit of the prefetch buffer (4), is counted as a first-level cache hit for the query of the processor (1); after a buffer unit of the prefetch buffer (4) matches successfully, only the successfully matched buffer unit is written back into the first-level cache (3).
10. The method according to claim 4, wherein in step S2, when the processor (1) records a first-level cache miss, the prefetch control unit (5), after obtaining the prefetch addresses, checks all prefetch addresses against the current query address of the processor (1) for physical-page crossings; only when all prefetch addresses and the query address of the processor (1) belong to the same physical page does the prefetch control unit (5) perform the prefetch operation on the obtained prefetch addresses.
CN202310799269.XA 2023-07-03 2023-07-03 Chip system and method for improving instruction cache prefetching execution efficiency Pending CN116521578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310799269.XA CN116521578A (en) 2023-07-03 2023-07-03 Chip system and method for improving instruction cache prefetching execution efficiency

Publications (1)

Publication Number Publication Date
CN116521578A true CN116521578A (en) 2023-08-01

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101495962A (en) * 2006-08-02 2009-07-29 高通股份有限公司 Method and apparatus for prefetching non-sequential instruction addresses
CN110520836A (en) * 2017-02-03 2019-11-29 爱丁堡大学董事会 Branch target buffer for data processing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230801