CN104252425B - Instruction cache management method and processor - Google Patents

Instruction cache management method and processor

Info

Publication number
CN104252425B
CN104252425B (application CN201310269557.0A)
Authority
CN
China
Prior art keywords
instruction
cache
hardware thread
private
shared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310269557.0A
Other languages
Chinese (zh)
Other versions
CN104252425A (en)
Inventor
郭旭斌
侯锐
冯煜晶
苏东锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Institute of Computing Technology of CAS
Original Assignee
Huawei Technologies Co Ltd
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd and Institute of Computing Technology of CAS
Priority to CN201310269557.0A
Priority to PCT/CN2014/080059 (WO2014206217A1)
Publication of CN104252425A
Application granted
Publication of CN104252425B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875 Caches with dedicated cache, e.g. instruction or stack
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842 Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802 Instruction prefetching
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851 Instruction issuing from multiple instruction streams, e.g. multistreaming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Embodiments of the present invention provide an instruction cache management method and a processor, in the field of computing, which can expand the effective instruction cache capacity of each hardware thread, reduce the instruction cache miss rate, and improve system performance. In the processor, a hardware thread identifier in the shared instruction cache identifies the hardware thread to which each cache line in the shared instruction cache belongs; a private instruction cache stores instruction cache lines evicted from the shared instruction cache; and a miss buffer is further included. When a hardware thread of the processor fetches an instruction from the instruction cache, it accesses the shared instruction cache and its own private instruction cache simultaneously, determines whether the instruction is present in the shared instruction cache and in its private instruction cache, and obtains the instruction from the shared instruction cache or from its private instruction cache according to the result. Embodiments of the present invention are used to manage the instruction cache of a processor.

Description

Instruction cache management method and processor
Technical field
The present invention relates to the field of computing, and in particular to an instruction cache management method and processor.
Background technology
A CPU (Central Processing Unit) cache (cache memory) is a small, fast temporary store located between the CPU and main memory. Its capacity is much smaller than that of main memory; it bridges the gap between the CPU's computation speed and the read/write speed of main memory, accelerating the CPU's reads.
In a multithreaded processor, multiple hardware threads fetch instructions from the same I-Cache (instruction cache). When the requested instruction is not present in the I-Cache, a miss request is sent to the next-level cache while the processor switches to another hardware thread, which continues fetching from the I-Cache; this reduces pipeline stalls caused by I-Cache misses and improves pipeline efficiency. However, when the share of the I-Cache available to each hardware thread is insufficient, the I-Cache miss rate increases and miss requests to the next-level cache become frequent. Moreover, when a line fetched from the next-level cache is backfilled into the I-Cache immediately while the amount of in-flight thread data grows, the inserted cache line may not be used right away, whereas the line it evicted may well be used again.
In addition, a thread scheduling policy can be adjusted according to cache hit behavior, so that threads with high instruction-cache hit rates are scheduled preferentially over a period of time; but this does not address the problem that each hardware thread's share of the shared I-Cache is insufficient.
Summary of the invention
Embodiments of the invention provide an instruction cache management method and a processor, which can expand the effective instruction cache capacity of each hardware thread, reduce the instruction cache miss rate, and improve system performance.
To achieve the above objective, embodiments of the invention adopt the following technical solutions.
In a first aspect, a processor is provided, comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache and an internal bus, and further comprising:
a shared instruction cache, configured to store the instructions shared by all hardware threads, and comprising a tag storage array and a data storage array, wherein the tag storage array is configured to store tags, the data storage array comprises the stored instructions and hardware thread identifiers, and a hardware thread identifier identifies the hardware thread to which a cache line in the shared instruction cache belongs;
a private instruction cache, configured to store instruction cache lines evicted from the shared instruction cache, the private instruction caches corresponding one-to-one to the hardware threads; and
a miss buffer, configured to store, when a requested instruction is absent from the shared instruction cache, the cache line fetched from the next-level cache of the shared instruction cache into the miss buffer of the requesting hardware thread, and to backfill the cache line from the miss buffer into the shared instruction cache when that hardware thread next fetches, the miss buffers corresponding one-to-one to the hardware threads.
With reference to the first aspect, in a first possible implementation, the processor further comprises:
tag comparison logic, configured to compare, when a hardware thread fetches, the tags in that hardware thread's private instruction cache with the physical address translated by the translation lookaside buffer, the private instruction cache being connected to the tag comparison logic so that the hardware thread accesses its private instruction cache while it accesses the shared instruction cache.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the processor is a multithreaded processor, and the private instruction cache has a fully associative structure, in which any instruction cache block in main memory can map to any instruction cache block in the private instruction cache.
With reference to the second possible implementation of the first aspect, in a third possible implementation, the shared instruction cache, the private instruction caches and the miss buffers are static or dynamic memory chips.
In a second aspect, an instruction cache management method is provided, comprising:
when a hardware thread of a processor fetches an instruction from the instruction cache, accessing simultaneously the shared instruction cache within the instruction cache and the private instruction cache corresponding to the hardware thread; and
determining whether the instruction is present in the shared instruction cache and in the hardware thread's private instruction cache, and obtaining the instruction from the shared instruction cache or from the hardware thread's private instruction cache according to the result.
With reference to the second aspect, in a first possible implementation of the second aspect, the shared instruction cache comprises a tag storage array and a data storage array, the tag storage array is configured to store tags, the data storage array comprises the stored instructions and hardware thread identifiers, and a hardware thread identifier identifies the hardware thread to which a cache line in the shared instruction cache belongs;
the private instruction cache has a fully associative structure, in which any instruction cache block in main memory can map to any instruction cache block in the private instruction cache, and the private instruction caches correspond one-to-one to the hardware threads.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the determining whether the instruction is present in the shared instruction cache and in the hardware thread's private instruction cache, and obtaining the instruction from the shared instruction cache or from the hardware thread's private instruction cache according to the result, comprises:
if the instruction is present in both the shared instruction cache and the hardware thread's private instruction cache, obtaining the instruction from the shared instruction cache;
if the instruction is present in the shared instruction cache and absent from the hardware thread's private instruction cache, obtaining the instruction from the shared instruction cache; and
if the instruction is present in the hardware thread's private instruction cache and absent from the shared instruction cache, obtaining the instruction from the hardware thread's private instruction cache.
With reference to the second possible implementation of the second aspect, in a third possible implementation, the method further comprises:
if the instruction is absent from both the shared instruction cache and the private instruction cache, sending, by the hardware thread, a cache miss request to the next-level cache of the shared instruction cache;
if the instruction is present in the next-level cache, obtaining, by the hardware thread, the instruction from the next-level cache, storing the cache line containing the instruction into the miss buffer corresponding to the hardware thread, and backfilling the cache line into the shared instruction cache when the hardware thread next fetches; and
if the instruction is absent from the next-level cache, sending, by the hardware thread, the miss request to main memory, obtaining the instruction from main memory, storing the cache line containing the instruction into the miss buffer corresponding to the hardware thread, and backfilling the cache line into the shared instruction cache when the hardware thread next fetches;
wherein the miss buffers correspond one-to-one to the hardware threads.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, when the cache line is backfilled into the shared instruction cache and the shared instruction cache has no free resources, the cache line replaces a first cache line in the shared instruction cache, the cache line is backfilled into the shared instruction cache, and, according to the hardware thread identifier of the first hardware thread that fetched the first cache line, the first cache line is stored into the private instruction cache corresponding to the first hardware thread;
wherein the first cache line is determined by a least-recently-used (LRU) algorithm.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, when the evicted first cache line is stored into the private instruction cache corresponding to the first hardware thread, if that private instruction cache has no free resources, the first cache line replaces a second cache line in the private instruction cache corresponding to the first hardware thread, and the first cache line is backfilled into that private instruction cache;
wherein the second cache line is determined by the least-recently-used algorithm.
Embodiments of the present invention provide an instruction cache management method and a processor. The processor comprises a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache and an internal bus, and further comprises a shared instruction cache, private instruction caches, miss buffers and tag comparison logic. The shared instruction cache stores the instructions shared by all hardware threads and comprises a tag storage array and a data storage array; the data storage array comprises the stored instructions and hardware thread identifiers, where a hardware thread identifier identifies the hardware thread to which a cache line in the shared instruction cache belongs. A private instruction cache stores instruction cache lines evicted from the shared instruction cache, with one private instruction cache per hardware thread. The tag comparison logic compares, when a hardware thread fetches, the tags in that thread's private instruction cache with the physical address translated by the translation lookaside buffer; the private instruction cache is connected to the tag comparison logic so that a hardware thread accesses its private instruction cache while accessing the shared instruction cache. When a hardware thread of the processor fetches an instruction from the instruction cache, it accesses the shared instruction cache and its private instruction cache simultaneously, determines whether the instruction is present in each, and obtains the instruction from the shared instruction cache or from its private instruction cache according to the result. This expands the effective instruction cache capacity of each hardware thread, reduces the instruction cache miss rate and improves system performance.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Apparently, the drawings described below are only some embodiments of the present invention, and persons of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a processor according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of an instruction cache management method according to an embodiment of the present invention;
Fig. 3 is a logic diagram of accessing the shared instruction cache and a private instruction cache simultaneously according to an embodiment of the present invention;
Fig. 4 is a logic diagram of fetching a cache line in response to a cache miss request according to an embodiment of the present invention.
Embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In modern multithreaded processor designs, as the number of hardware threads grows, the shared resources available to each hardware thread may become insufficient; this is especially true for an important shared cache resource such as the L1 (Level 1) cache. When the share of the L1 instruction cache available to each hardware thread is too small, L1 misses occur and the L1 miss rate rises, so traffic between the L1 cache and the L2 cache increases, instructions must be fetched from the L2 cache or from main memory, and processor power consumption increases.
An embodiment of the present invention provides a processor 01 which, as shown in Fig. 1, comprises a program counter 011, a register file 012, an instruction prefetch unit 013, an instruction decode unit 014, an instruction issue unit 015, an address generation unit 016, an arithmetic logic unit 017, a shared floating-point unit 018, a data cache 019 and an internal bus, and further comprises:
a shared instruction cache (I-Cache) 020, private instruction caches 021, miss buffers (Miss Buffer) 022 and tag (Tag) comparison logic 023.
The shared instruction cache 020 stores the instructions shared by all hardware threads and comprises a tag storage array (Tag Array) 0201 and a data storage array (Data Array) 0202. The tag storage array 0201 stores tags; the data storage array 0202 comprises the stored instructions 02021 and hardware thread identifiers (Thread ID) 02022, where a hardware thread identifier 02022 identifies the hardware thread to which a cache line in the shared instruction cache 020 belongs.
A private instruction cache 021 stores instruction cache lines evicted from the shared instruction cache 020; the private instruction caches 021 correspond one-to-one to the hardware threads.
A miss buffer 022 stores, when a requested instruction is absent from the shared instruction cache 020, the cache line fetched from the next-level cache of the shared instruction cache 020, and the cache line in the miss buffer 022 is backfilled into the shared instruction cache when the requesting hardware thread next fetches; the miss buffers 022 correspond one-to-one to the hardware threads.
The tag comparison logic compares, when a hardware thread fetches, the tags in that hardware thread's private instruction cache with the PA (Physical Address) translated by the TLB (Translation Lookaside Buffer); the private instruction cache 021 is connected to the tag comparison logic so that a hardware thread accesses its private instruction cache 021 while accessing the shared instruction cache 020.
The TLB, also called the page table buffer, stores a number of page table entries (mappings from virtual addresses to physical addresses). Through the TLB, the virtual address of the requested instruction is translated into a physical address, and the physical address is compared with the tags of the private instruction cache; if the physical address matches a tag in the private instruction cache, the private instruction cache holds the requested line, so a hardware thread effectively accesses its private instruction cache while accessing the shared instruction cache.
For example, there may be 16 PCs (Program Counters), PC0 to PC15; the number of logical processor cores (hardware threads) in one processor core matches the number of PCs.
Each logical processor core in a processor core corresponds to one GRF (General Register File, register file), so the number of GRFs likewise matches the number of PCs.
The Fetch unit (instruction prefetch unit) obtains instructions; the Decoder (instruction decode unit) decodes instructions; Issue is the instruction issue unit, which issues instructions; and the AGU (Address Generation Unit) is the module that performs all address computations, generating the addresses used to access memory. The ALU (Arithmetic Logic Unit) is the execution unit of the CPU (Central Processing Unit) and can be built from AND gates and OR gates. The shared floating-point unit (Shared Float Point Unit) is the circuit unit in the processor dedicated to floating-point arithmetic; the data cache (D-Cache) stores data; and the internal bus connects the components of the processor.
For example, the processor 01 is a multithreaded processor, and the private instruction cache 021 has a fully associative structure, in which any instruction cache block in main memory can map to any instruction cache block in the private instruction cache.
For example, the shared instruction cache 020, the private instruction caches 021 and the miss buffers 022 are static or dynamic memory chips.
For example, a Thread ID (hardware thread identifier) field can be added to the I-Cache Data Array (the instruction cache data storage array); the Thread ID records which hardware thread's Cache Miss request fetched the cache line.
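For example, the role of the Thread ID field can be illustrated with a small software model of a cache line; the class name and field layout below are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    """One line in the shared I-Cache data storage array."""
    tag: int        # address tag, held in the tag storage array
    data: bytes     # the cached instructions (e.g. a 64-byte line)
    thread_id: int  # hardware thread whose Cache Miss request fetched this line

# A line fetched on behalf of hardware thread 3:
line = CacheLine(tag=0x1A2B, data=b"\x90" * 64, thread_id=3)
print(line.thread_id)  # -> 3
```

With such a field, the replacement logic can later consult `thread_id` to decide which private cache should receive the line on eviction.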
For example, when a hardware thread's access to the L1 I-Cache misses, i.e. when the instruction the hardware thread wants is absent from the I-Cache, L1 sends a Cache Miss request to its next-level cache, the L2 cache. If the L2 cache hits, i.e. the instruction the hardware thread wants is present in the L2 cache, the hardware thread does not insert the cache line (Cache Line) containing the instruction into the L1 cache immediately upon receiving the returned line; instead, the cache line is stored into the hardware thread's miss buffer, and the cache line is inserted into the L1 cache when that hardware thread's turn to fetch comes around.
In this way, when backfilling the cache line into the L1 cache causes a replacement, the evicted cache line is not simply discarded: according to the Thread ID of the evicted cache line, the evicted line is inserted into the private instruction cache of the hardware thread to which it belongs.
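For example, this eviction path might be sketched as follows, with the replacement logic consulting the evicted line's Thread ID to route it into the owning thread's private cache; all identifiers below are hypothetical:

```python
def evict_to_private(victim, private_caches):
    """Route a line evicted from the shared I-Cache to the private
    cache of the hardware thread recorded in its thread_id field."""
    owner = victim["thread_id"]
    private_caches[owner].append(victim)  # keep the line instead of discarding it
    return owner

# Two hardware threads, each with an (initially empty) private cache:
private_caches = {0: [], 1: []}
victim = {"tag": 0x40, "thread_id": 1}  # line originally fetched by thread 1
owner = evict_to_private(victim, private_caches)
print(owner, len(private_caches[1]))  # -> 1 1
```

The point of the design is visible here: thread 0's private cache is untouched, so one thread's evictions never pollute another thread's private capacity.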
For example, a replacement can occur because the L1 cache has no free resources; the cache line to evict can be determined by the LRU (Least Recently Used) algorithm.
The LRU algorithm works as follows: on an instruction cache miss, the cache line that has gone unused for the longest time is evicted from the cache; in other words, the cache preferentially retains the instructions that have been used most recently.
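For example, the LRU policy can be modeled in a few lines: a hit moves the line to the most-recently-used position, and a fill that exceeds capacity evicts the line that has gone unused the longest. This is a sketch of the policy only, not of the hardware implementation:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag, data=None):
        """Return the evicted tag on an overflowing fill, else None."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: now most recently used
            return None
        self.lines[tag] = data           # miss: fill the line
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)  # evict LRU line
            return victim
        return None

c = LRUCache(2)
c.access("A"); c.access("B")
c.access("A")         # touch A, so B is now least recently used
print(c.access("C"))  # -> B (evicted)
```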
For example, when fetching, a hardware thread can access the I-Cache and its own private cache simultaneously.
If the instruction is present in the I-Cache and absent from the hardware thread's private cache, the instruction is obtained from the I-Cache;
if the instruction is absent from the I-Cache and present in the hardware thread's private cache, the instruction is obtained from the hardware thread's private cache;
if the instruction is present in both the I-Cache and the hardware thread's private cache, the instruction is obtained from the I-Cache;
if the instruction is absent from both the I-Cache and the hardware thread's private cache, the hardware thread sends a Cache Miss request to the next-level cache of the I-Cache to obtain the instruction.
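For example, these four cases reduce to a simple priority rule: a shared-cache hit always wins, a private-cache hit is the fallback, and a double miss escalates to the next cache level. The function and dictionaries below are illustrative stand-ins for the hardware:

```python
def fetch(addr, shared, private, next_level):
    """Resolve a fetch against the shared I-Cache and the thread's
    private cache, which the patent probes in parallel."""
    if addr in shared:       # shared hit wins, even when both caches hit
        return shared[addr], "shared"
    if addr in private:      # private-only hit
        return private[addr], "private"
    # double miss: send a Cache Miss request to the next level
    return next_level[addr], "miss->L2"

shared = {0x10: "insn_a"}
private = {0x10: "insn_a", 0x20: "insn_b"}
l2 = {0x30: "insn_c"}
print(fetch(0x10, shared, private, l2))  # -> ('insn_a', 'shared')
print(fetch(0x20, shared, private, l2))  # -> ('insn_b', 'private')
print(fetch(0x30, shared, private, l2))  # -> ('insn_c', 'miss->L2')
```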
For example, according to the thread scheduling policy, when the next cycle switches fetching to another thread, the new hardware thread's private cache is connected to the Tag (tag) comparison logic while the shared instruction cache is accessed; the tags read from that private cache are compared against the PA (Physical Address) output by the TLB (Translation Lookaside Buffer), generating a private Cache Miss signal and a private cache data output. When the requested instruction is present in the new hardware thread's private cache, the private Cache Miss signal indicates that the instruction is present, and the instruction is output.
In summary, an embodiment of the present invention provides a processor comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache and an internal bus, and further comprising a shared instruction cache, private instruction caches, miss buffers and tag comparison logic. A hardware thread identifier is added to the data storage array of the shared instruction cache, recording on a cache miss which hardware thread's miss request fetched the cache line. When the shared instruction cache performs a replacement, the evicted cache line is stored, according to its hardware thread identifier, into the private instruction cache of the corresponding hardware thread. The miss buffer ensures that when a hardware thread receives the cache line returned for a cache miss request, the line is not backfilled into the shared instruction cache immediately; instead it is stored in the miss buffer and backfilled into the shared instruction cache when that hardware thread's turn to fetch comes, which reduces the probability of evicting a cache line that is about to be accessed. In addition, the added private instruction caches increase each hardware thread's cache capacity and improve system performance.
A further embodiment of the present invention provides an instruction cache management method, as shown in Fig. 2, comprising:
101. When a hardware thread of the processor fetches an instruction from the instruction cache, the processor accesses the shared instruction cache within the instruction cache and the private instruction cache corresponding to the hardware thread simultaneously.
Exemplarily, the processor (Central Processing Unit, CPU) may be a multithreaded processor. One physical core can have multiple hardware threads, also called logical cores or logical processors, but a hardware thread does not by itself constitute a physical core; Windows treats each hardware thread as a schedulable logical processor, and each logical processor can run the code of a software thread. The instruction cache here comprises the shared instruction cache (I-Cache) in the processor's L1 cache together with the hardware threads' private instruction caches, where the L1 cache comprises a data cache (D-Cache) and an instruction cache (I-Cache).
Specifically, a block of fully associative private cache can be provided for each hardware thread, i.e. the private caches correspond one-to-one to the hardware threads. In a fully associative structure, any instruction cache block in main memory can map to any instruction cache block in the private instruction cache.
Furthermore, Tag (tag) comparison logic can be added; when a hardware thread fetches, that hardware thread's private cache is actively connected to the tag comparison logic, so that the hardware thread accesses the I-Cache and its own private cache at the same time.
102. The processor determines whether the instruction is present in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and then proceeds to step 103, 104, 105, or 106.
Exemplarily, when a hardware thread accesses the I-Cache and its private cache simultaneously, it simultaneously determines whether the fetched instruction is present in the I-Cache and in that private cache.
For example, suppose the multithreaded processor has 32 hardware threads sharing one 64 KB I-Cache, i.e., the shared instruction cache capacity is 64 KB. Each hardware thread has a 32-way fully associative private cache that can hold 32 evicted cache lines; each cache line is 64 bytes, so each private cache has a capacity of 2 KB.
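The capacity figures in this example can be checked with a little arithmetic. The sketch below uses only the numbers given in the text (32 threads, 64 KB shared I-Cache, 32 lines of 64 bytes per private cache); the variable names are illustrative, not part of the patented design:

```python
# Capacity arithmetic for the example configuration described above.
NUM_THREADS = 32                    # hardware threads sharing one I-Cache
SHARED_ICACHE_BYTES = 64 * 1024     # 64 KB shared instruction cache
LINES_PER_PRIVATE = 32              # 32-way fully associative private cache
LINE_BYTES = 64                     # each cache line holds 64 bytes

private_cache_bytes = LINES_PER_PRIVATE * LINE_BYTES
print(private_cache_bytes)          # 2048 bytes, i.e. 2 KB per thread

total_private_bytes = NUM_THREADS * private_cache_bytes
print(total_private_bytes)          # 65536 bytes, i.e. 64 KB across all threads
```

Note that the private caches taken together add as much capacity again as the shared I-Cache itself.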
When a 32-way tag compare logic is added, while a hardware thread accesses the shared I-Cache, the 32 tags read for that thread are compared against the physical address (PA) output by the TLB, generating a private-cache-miss signal and the private cache's data output. If one of the 32 tags matches the PA, the private-cache-miss signal indicates that the thread's private cache holds the fetched instruction, and the private cache data output is a valid instruction, as shown in Fig. 3.
Here the TLB (translation lookaside buffer, also called a page table buffer) stores page-table entries, i.e., virtual-to-physical address translations. Through the TLB, the virtual address of the fetched instruction is translated to a physical address, which is then compared against the tags of the private instruction cache; if the physical address matches a tag in the private instruction cache, the hardware thread has hit in the private cache while also accessing the shared instruction cache.
103. If the instruction is present in both the shared instruction cache and the private instruction cache corresponding to the hardware thread, the processor obtains it from the shared instruction cache.
Exemplarily, when the I-Cache and the private cache are accessed simultaneously and both hold the fetched instruction, the instruction is obtained from the I-Cache.
104. If the instruction is present in the shared instruction cache but absent from the private instruction cache corresponding to the hardware thread, the processor obtains it from the shared instruction cache.
Exemplarily, if the fetched instruction is present in the I-Cache but absent from the hardware thread's private cache, the instruction is obtained from the I-Cache.
105. If the instruction is present in the private instruction cache corresponding to the hardware thread but absent from the shared instruction cache, the processor obtains it from that private instruction cache.
Exemplarily, if the I-Cache misses, i.e., the fetched instruction is absent there but present in the hardware thread's private cache, the instruction is obtained from the private cache. In this way, by actively selecting the thread's private cache to participate in the tag comparison, the cache capacity available to each hardware thread is enlarged and the thread's instruction cache hit rate increases.
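The selection among steps 103–106 can be sketched as follows. This is a hypothetical software model, not the hardware implementation: the dict-based caches stand in for tag arrays, and the function name and addresses are invented for illustration.

```python
def fetch(pa, shared_icache, private_icache):
    """Model of steps 103-106: both caches are probed with the same
    physical address; the shared I-Cache wins on a double hit."""
    in_shared = pa in shared_icache      # shared I-Cache tag match
    in_private = pa in private_icache    # private-cache tag match (tag compare logic)
    if in_shared:                        # steps 103 and 104: shared cache serves the fetch
        return ("shared", shared_icache[pa])
    if in_private:                       # step 105: private-only hit
        return ("private", private_icache[pa])
    return ("miss", None)                # step 106: send a miss request downward

shared = {0x1000: "add r1, r2"}
private = {0x1000: "add r1, r2", 0x2000: "sub r3, r4"}
print(fetch(0x1000, shared, private))   # both hit: served from the shared cache
print(fetch(0x2000, shared, private))   # hit only in the private cache
print(fetch(0x3000, shared, private))   # miss in both caches
```

The key point the model captures is that the two lookups are independent probes of the same physical address, so they can proceed in parallel.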
106. If the instruction is absent from both the shared instruction cache and the private instruction cache, the processor has the hardware thread send a cache miss request to the next-level cache below the shared instruction cache.
Exemplarily, if the fetched instruction is absent from both the I-Cache and the hardware thread's private cache, the thread sends a cache miss request to the next-level cache below the I-Cache.
For example, if neither the L1 cache nor the thread's private cache holds the fetched instruction, the hardware thread sends a cache miss request to the L2 cache, the next level below the L1 cache, in the hope of obtaining the instruction from the L2 cache.
107. If the instruction is present in the next-level cache, the processor obtains it from the next-level cache via the hardware thread and stores the cache line containing the instruction into the miss buffer corresponding to that thread; when the thread next fetches, the cache line is refilled into the shared instruction cache.
Exemplarily, when the fetched instruction is present in the L2 cache, the instruction is obtained from the L2 cache, but the cache line containing it is not refilled into the L1 cache immediately; instead, the line is stored in the miss buffer corresponding to the hardware thread, and only when that thread's turn to fetch comes around is the line inserted into the L1 cache.
The miss buffers correspond one-to-one with the hardware threads, i.e., each hardware thread has its own miss buffer in which the cache lines returned for its miss requests are held. The rationale is that if such a line were refilled into the L1 cache straight away, the line it evicts might be one that is about to be accessed. The miss buffer therefore lets the refill be deferred to a better moment, reducing the probability that a soon-to-be-accessed cache line is evicted.
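The deferred-refill behaviour of the per-thread miss buffer can be sketched like this. It is a simplified model under the assumption that the buffer simply queues pending lines; the class and method names are invented for illustration.

```python
class MissBuffer:
    """Per-thread miss buffer: holds lines returned for miss requests and
    refills them into the shared I-Cache only on the owner's fetch turn."""
    def __init__(self):
        self.pending = []                  # (address, line) pairs awaiting refill

    def on_miss_return(self, addr, line):
        self.pending.append((addr, line))  # do NOT refill the I-Cache yet

    def on_fetch_turn(self, shared_icache):
        for addr, line in self.pending:    # now it is this thread's turn
            shared_icache[addr] = line     # refill into the shared I-Cache
        self.pending.clear()

shared_icache = {}
mb = MissBuffer()
mb.on_miss_return(0x40, "mov r0, r1")
print(0x40 in shared_icache)   # False: the refill is deferred
mb.on_fetch_turn(shared_icache)
print(0x40 in shared_icache)   # True: refilled on the thread's fetch turn
```

The interval between `on_miss_return` and `on_fetch_turn` is exactly the window during which an immediate refill could have evicted a line another thread was about to use.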
108. If the instruction is absent from the next-level cache, the processor has the hardware thread send a miss request to main memory, obtains the instruction from main memory, and stores the cache line containing the instruction into the miss buffer corresponding to that thread; when the thread next fetches, the cache line is refilled into the shared instruction cache.
Exemplarily, if the fetched instruction is also absent from the L2 cache, the hardware thread sends a cache miss request to main memory in the hope of obtaining the instruction there. If main memory holds the instruction, the instruction is obtained and the cache line containing it is stored in the thread's miss buffer until the thread's turn to fetch comes around, at which point the line is inserted into the L1 cache.
Alternatively, when the fetched instruction is absent from the L2 cache, the hardware thread may send a cache miss request to the L3 cache; if the L3 cache holds the instruction, it is obtained there, and otherwise a cache miss request is sent to main memory to obtain it.
The transfer unit between the CPU and the cache is the word. When the CPU reads a word from main memory, the word's memory address is sent to the cache and to main memory simultaneously; the cache control logic of the L1, L2, or L3 cache judges from the tag field of the address whether the word is present. On a hit, the CPU obtains the word; on a miss, the word is read from main memory in a memory read cycle and delivered to the CPU. Even though the CPU reads only one word, the cache controller copies the complete cache line containing that word from main memory into the cache; this operation of transferring a whole line of data is called a cache line fill.
In addition, when a cache line is refilled into the shared instruction cache, if the shared instruction cache has no free resource, the refilled line replaces a first cache line in the shared instruction cache, and at the same time, according to the obtained hardware-thread identifier of the first hardware thread of the first cache line, the first cache line is stored into the private instruction cache corresponding to the first hardware thread. The first cache line is selected by the LRU (Least Recently Used) algorithm.
Exemplarily, a Thread ID (hardware-thread identifier) may be added to the I-Cache data array (instruction cache data storage array); it records which hardware thread's cache miss request fetched each cache line. Thus, once each hardware thread has its fully associative private cache, a line evicted on an I-Cache replacement is not discarded outright: according to its Thread ID, the evicted line is inserted into the private cache of the hardware thread identified by that Thread ID, because the evicted line may well be accessed again soon, as shown in Fig. 4.
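The Thread-ID-directed eviction path can be sketched as follows. This is an illustrative model only: the dict-based caches, the `choose_victim` callback standing in for the LRU selection, and all names are invented, not the hardware design.

```python
def refill_with_victim_routing(addr, line, thread_id, shared_icache,
                               private_caches, capacity, choose_victim):
    """On a refill into a full shared I-Cache, evict a victim line and
    route it into the private cache of the thread named by its Thread ID."""
    if len(shared_icache) >= capacity:
        victim_addr = choose_victim(shared_icache)     # e.g. the LRU line
        victim_line, victim_tid = shared_icache.pop(victim_addr)
        # The Thread ID stored alongside the line decides which
        # private cache receives the evicted line.
        private_caches[victim_tid][victim_addr] = victim_line
    shared_icache[addr] = (line, thread_id)  # new line tagged with its fetching thread

shared = {0x10: ("nop", 3)}                  # one resident line, fetched by thread 3
privates = {3: {}, 7: {}}
refill_with_victim_routing(0x20, "jmp 0x80", 7, shared, privates,
                           capacity=1, choose_victim=lambda c: next(iter(c)))
print(shared)        # only the new line remains, tagged with thread 7
print(privates[3])   # the evicted line landed in thread 3's private cache
```

Storing the Thread ID in the data array is what makes this routing possible: at eviction time the shared cache would otherwise no longer know which thread originally fetched the line.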
When the evicted first cache line is to be stored into the private instruction cache corresponding to the obtained first hardware thread, if that private cache has no free resource, the first cache line replaces a second cache line in the first hardware thread's private instruction cache and is refilled into that private cache. The second cache line is likewise selected by the LRU algorithm.
The LRU algorithm simply means that, once the instruction cache misses, the instruction that has gone unused for the longest time is evicted from the cache; in other words, the cache preferentially retains the instructions that have been used most recently.
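The LRU policy described above can be sketched with an ordered structure. This is illustrative only: `OrderedDict` stands in for the hardware's replacement state, and the class and addresses are invented.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU model: on a miss with a full cache, the entry unused the
    longest is evicted; recently used entries are retained."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # oldest (least recently used) entry first

    def access(self, addr, line):
        if addr in self.lines:
            self.lines.move_to_end(addr)    # touched: now most recently used
            return None                     # hit: nothing evicted
        evicted = None
        if len(self.lines) >= self.capacity:
            evicted = self.lines.popitem(last=False)  # drop the LRU entry
        self.lines[addr] = line
        return evicted                      # the replaced "first cache line", if any

c = LRUCache(capacity=2)
c.access(0xA, "i1")
c.access(0xB, "i2")
c.access(0xA, "i1")             # touch 0xA, so 0xB becomes the LRU line
print(c.access(0xC, "i3"))      # evicts (0xB, "i2"), the least recently used
```

The same policy is applied twice in the design: once to pick the victim in the shared I-Cache, and once to pick the victim in the receiving private cache.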
In this way, adding the private caches effectively enlarges the instruction cache capacity available to each hardware thread and increases each thread's instruction cache hit rate, reducing traffic between the I-Cache and the next-level cache; the added miss buffers optimize the timing of cache line refills, reducing the probability that a soon-to-be-accessed cache line is evicted; and the added tag compare logic allows the shared instruction cache and the private instruction cache to be accessed simultaneously on an I-Cache access, further increasing the instruction cache hit rate.
The embodiment of the present invention thus provides a management method for an instruction cache. When a hardware thread of the processor fetches an instruction from the instruction cache, the shared instruction cache in the instruction cache and the private instruction cache corresponding to the thread are accessed simultaneously; it is determined whether the instruction is present in the shared instruction cache and in the thread's private instruction cache, and according to the result the instruction is obtained from the shared instruction cache or from the thread's private instruction cache. If the instruction is absent from both, the hardware thread sends a cache miss request to the next-level cache below the shared instruction cache, and the cache line containing the instruction is stored in the thread's miss buffer; when the thread next fetches, the line is refilled into the shared instruction cache. When refilling, if the shared instruction cache has no free resource, the refilled line replaces a first cache line in the shared instruction cache, and, according to the obtained hardware-thread identifier of the first hardware thread of the first cache line, the first cache line is stored into the private instruction cache corresponding to that thread. This enlarges the instruction cache capacity of each hardware thread, reduces the instruction cache miss rate, and improves system performance.
In the several embodiments provided in this application, it should be understood that the disclosed processor and method may be implemented in other ways. For example, the apparatus embodiments described above are merely schematic; the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Moreover, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
In addition, in the devices and systems of the embodiments of the present invention, the functional units may be integrated into one processing unit, may exist as physically separate units, or two or more units may be integrated into one unit. Each of the above units may be implemented in hardware, or in hardware combined with software functional units.
All or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The storage medium may be any medium that can store program code, such as a USB flash drive, removable hard disk, read-only memory (Read Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk, or optical disc.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A processor, comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an ALU, a shared floating-point unit, a data cache, and an internal bus, characterized in that it further comprises:
a shared instruction cache, configured to store the shared instructions of all hardware threads, and comprising a tag storage array and a data storage array, wherein the tag storage array is configured to store tags, the data storage array comprises stored instructions and hardware-thread identifiers, and a hardware-thread identifier is used to identify the hardware thread corresponding to a cache line in the shared instruction cache;
private instruction caches, configured to store instruction cache lines evicted from the shared instruction cache, wherein the private instruction caches correspond one-to-one with the hardware threads; and
miss buffers, configured to, when a fetched instruction is absent from the shared instruction cache, store the cache line fetched from the next-level cache below the shared instruction cache into the miss buffer of the hardware thread, and, when the hardware thread corresponding to the fetched instruction next fetches, refill the cache line in the miss buffer into the shared instruction cache, wherein the miss buffers correspond one-to-one with the hardware threads.
2. The processor according to claim 1, characterized in that it further comprises:
tag compare logic, configured to, when a hardware thread fetches an instruction, compare the tags of the private instruction cache corresponding to the hardware thread with the physical address translated by a translation lookaside buffer, wherein the private instruction cache is connected to the tag compare logic so that the hardware thread accesses the private instruction cache while accessing the shared instruction cache.
3. The processor according to claim 2, characterized in that the processor is a multithreaded processor, and the structure of the private instruction cache is fully associative, whereby any instruction cache line in main memory may map to any instruction cache line in the private instruction cache.
4. The processor according to claim 3, characterized in that the shared instruction cache, the private instruction caches, and the miss buffers are static memory chips or dynamic memory chips.
5. A management method for an instruction cache, characterized by comprising:
when a hardware thread of a processor fetches an instruction from the instruction cache, simultaneously accessing the shared instruction cache in the instruction cache and the private instruction cache corresponding to the hardware thread; and
determining whether the instruction is present in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and, according to the determination result, obtaining the instruction from the shared instruction cache or from the private instruction cache corresponding to the hardware thread;
wherein the shared instruction cache comprises a tag storage array and a data storage array, the tag storage array is configured to store tags, the data storage array comprises stored instructions and hardware-thread identifiers, and a hardware-thread identifier is used to identify the hardware thread corresponding to a cache line in the shared instruction cache; and
wherein the structure of the private instruction cache is fully associative, whereby any instruction cache line in main memory may map to any instruction cache line in the private instruction cache, and the private instruction caches correspond one-to-one with the hardware threads.
6. The method according to claim 5, characterized in that determining whether the instruction is present in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and obtaining the instruction from the shared instruction cache or from the private instruction cache corresponding to the hardware thread according to the determination result, comprises:
if the instruction is present in both the shared instruction cache and the private instruction cache corresponding to the hardware thread, obtaining the instruction from the shared instruction cache;
if the instruction is present in the shared instruction cache but absent from the private instruction cache corresponding to the hardware thread, obtaining the instruction from the shared instruction cache; and
if the instruction is present in the private instruction cache corresponding to the hardware thread but absent from the shared instruction cache, obtaining the instruction from the private instruction cache corresponding to the hardware thread.
7. The method according to claim 6, characterized in that the method further comprises:
if the instruction is absent from both the shared instruction cache and the private instruction cache, sending, by the hardware thread, a cache miss request to the next-level cache below the shared instruction cache;
if the instruction is present in the next-level cache, obtaining, by the hardware thread, the instruction from the next-level cache, storing the cache line containing the instruction into the miss buffer corresponding to the hardware thread, and, when the hardware thread next fetches, refilling the cache line into the shared instruction cache; and
if the instruction is absent from the next-level cache, sending, by the hardware thread, the miss request to main memory, obtaining the instruction from main memory, storing the cache line containing the instruction into the miss buffer corresponding to the hardware thread, and, when the hardware thread next fetches, refilling the cache line into the shared instruction cache;
wherein the miss buffers correspond one-to-one with the hardware threads.
8. The method according to claim 7, characterized in that, when the cache line is refilled into the shared instruction cache, if the shared instruction cache has no free resource, the cache line replaces a first cache line in the shared instruction cache and is refilled into the shared instruction cache, and at the same time, according to the obtained hardware-thread identifier of the first hardware thread of the first cache line, the first cache line is stored into the private instruction cache corresponding to the first hardware thread;
wherein the first cache line is determined by a least recently used algorithm.
9. The method according to claim 8, characterized in that, when the evicted first cache line is to be stored into the private instruction cache corresponding to the obtained first hardware thread, if the private instruction cache corresponding to the first hardware thread has no free resource, the first cache line replaces a second cache line in the private instruction cache corresponding to the first hardware thread and is refilled into the private instruction cache corresponding to the first hardware thread;
wherein the second cache line is determined by the least recently used algorithm.
CN201310269557.0A 2013-06-28 2013-06-28 The management method and processor of a kind of instruction buffer Active CN104252425B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310269557.0A CN104252425B (en) 2013-06-28 2013-06-28 The management method and processor of a kind of instruction buffer
PCT/CN2014/080059 WO2014206217A1 (en) 2013-06-28 2014-06-17 Management method for instruction cache, and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310269557.0A CN104252425B (en) 2013-06-28 2013-06-28 The management method and processor of a kind of instruction buffer

Publications (2)

Publication Number Publication Date
CN104252425A CN104252425A (en) 2014-12-31
CN104252425B true CN104252425B (en) 2017-07-28

Family

ID=52141028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310269557.0A Active CN104252425B (en) 2013-06-28 2013-06-28 The management method and processor of a kind of instruction buffer

Country Status (2)

Country Link
CN (1) CN104252425B (en)
WO (1) WO2014206217A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809078B (en) * 2015-04-14 2019-05-14 苏州中晟宏芯信息科技有限公司 Based on the shared cache hardware resource access method for exiting yielding mechanism
WO2017006235A1 (en) * 2015-07-09 2017-01-12 Centipede Semi Ltd. Processor with efficient memory access
CN106484310B (en) * 2015-08-31 2020-01-10 华为数字技术(成都)有限公司 Storage array operation method and device
US10831664B2 (en) 2017-06-16 2020-11-10 International Business Machines Corporation Cache structure using a logical directory
US10606762B2 (en) 2017-06-16 2020-03-31 International Business Machines Corporation Sharing virtual and real translations in a virtual cache
US10698836B2 (en) * 2017-06-16 2020-06-30 International Business Machines Corporation Translation support for a virtual cache
CN109308190B (en) * 2018-07-09 2023-03-14 北京中科睿芯科技集团有限公司 Shared line buffer system based on 3D stack memory architecture and shared line buffer
US11099999B2 (en) * 2019-04-19 2021-08-24 Chengdu Haiguang Integrated Circuit Design Co., Ltd. Cache management method, cache controller, processor and storage medium
CN110990062B (en) * 2019-11-27 2023-03-28 上海高性能集成电路设计中心 Instruction prefetching filtering method
CN111078592A (en) * 2019-12-27 2020-04-28 无锡中感微电子股份有限公司 Multi-level instruction cache of low-power-consumption system on chip
WO2022150996A1 (en) * 2021-01-13 2022-07-21 王志平 Method for implementing processor cache structure
CN114116533B (en) * 2021-11-29 2023-03-10 海光信息技术股份有限公司 Method for storing data by using shared memory
CN115098169B (en) * 2022-06-24 2024-03-05 海光信息技术股份有限公司 Method and device for fetching instruction based on capacity sharing
CN117851278B (en) * 2024-03-08 2024-06-18 上海芯联芯智能科技有限公司 Method for sharing static random access memory and central processing unit

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510191A (en) * 2009-03-26 2009-08-19 浙江大学 Multi-core system structure with buffer window and implementing method thereof
CN103020003A (en) * 2012-12-31 2013-04-03 哈尔滨工业大学 Multi-core program determinacy replay-facing memory competition recording device and control method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6212604B1 (en) * 1998-12-03 2001-04-03 Sun Microsystems, Inc. Shared instruction cache for multiple processors
US20110320720A1 (en) * 2010-06-23 2011-12-29 International Business Machines Corporation Cache Line Replacement In A Symmetric Multiprocessing Computer

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510191A (en) * 2009-03-26 2009-08-19 浙江大学 Multi-core system structure with buffer window and implementing method thereof
CN103020003A (en) * 2012-12-31 2013-04-03 哈尔滨工业大学 Multi-core program determinacy replay-facing memory competition recording device and control method thereof

Also Published As

Publication number Publication date
CN104252425A (en) 2014-12-31
WO2014206217A1 (en) 2014-12-31

Similar Documents

Publication Publication Date Title
CN104252425B (en) The management method and processor of a kind of instruction buffer
CN105095116B (en) Cache method, cache controller and the processor replaced
CN104252392B (en) A kind of method and processor accessing data buffer storage
CN105446900B (en) The method of processor and compartment system management mode entry
JP5108002B2 (en) Virtually tagged instruction cache using physical tagging operations
US8984254B2 (en) Techniques for utilizing translation lookaside buffer entry numbers to improve processor performance
US9311239B2 (en) Power efficient level one data cache access with pre-validated tags
CN100478918C (en) Segmental high speed cache design method in microprocessor and segmental high speed cache
US9396117B2 (en) Instruction cache power reduction
CN103729306B (en) The method and data processing equipment of cache block invalidation
CN108139981B (en) Access method for page table cache TLB table entry and processing chip
CN104834483B (en) A kind of implementation method for lifting embedded MCU performance
US20120005454A1 (en) Data processing apparatus for storing address translations
CN112631961A (en) Memory management unit, address translation method and processor
CN104679681A (en) High-speed bridge device for AHB (advanced high-performance bus) accessing on-chip SRAM (static random access memory) and operating method of high-speed bridge device
CN104346295B (en) A kind of cache flush method and apparatus
CN108763106B (en) Cache implementation method based on cross storage
CN108874690A (en) The implementation method and processor of data pre-fetching
CN100428200C (en) Method for implementing on-chip command cache
Abumwais et al. Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture.
Hameed et al. Rethinking on-chip DRAM cache for simultaneous performance and energy optimization
CN103207844B (en) Caching system and cache access method
US8756362B1 (en) Methods and systems for determining a cache address
CN105843360B (en) A kind of device and method for reducing power consumption of instruction cache memory
CN111338987A (en) Method for quickly invalidating set associative TLB

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant