CN101706755A - Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof - Google Patents


Info

Publication number
CN101706755A
Authority
CN
China
Prior art keywords
cache
data
processor core
processor
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910186558A
Other languages
Chinese (zh)
Inventor
吴俊敏
赵小雨
隋秀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute for Advanced Study USTC
Original Assignee
Suzhou Institute for Advanced Study USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Institute for Advanced Study USTC filed Critical Suzhou Institute for Advanced Study USTC
Priority to CN200910186558A priority Critical patent/CN101706755A/en
Publication of CN101706755A publication Critical patent/CN101706755A/en
Pending legal-status Critical Current

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a cache cooperation system for an on-chip multi-core processor and a cooperative processing method thereof. The system comprises a memory and an on-chip multi-core processor coupled to the memory, and is characterized in that each processor core of the multi-core processor includes a private L1 cache and a private L2 cache that maintains a strict inclusion relation with the L1 cache; all processor cores of the multi-core processor share a centralized coherence directory, which is used to enforce the coherence policy across the L1 caches, L2 caches, and memory. The cooperative cache system has the advantages of small hardware overhead, good scalability, and high system throughput.

Description

Cache cooperation system for an on-chip multi-core processor and cooperative processing method thereof
Technical field
The invention belongs to the technical field of processor memory systems in information processing systems, and specifically relates to a cache cooperation system for an on-chip multi-core processor and a cooperative processing method thereof.
Background technology
As integrated circuit technology advances toward the nanometer scale, continually shrinking feature sizes have driven chips toward miniaturization, higher speed, and higher integration. The chip multi-core processor (Chip Multi-Processor, CMP) is an architectural design that emerged in the 1990s, first proposed by researchers at Stanford University. Its idea is to exploit abundant transistor resources to integrate multiple processor cores on a single chip, and to improve performance by developing parallelism at all levels (instruction level, thread level, and so on) through parallel execution on multiple cores. In a chip multi-core environment, many cores run simultaneously and contend for limited on-chip storage resources such as cache capacity and memory bandwidth, aggravating memory-access conflicts, so the traditional memory-access bottleneck becomes even more prominent.
A marked change brought about as integrated circuit technology enters the deep-submicron stage is that interconnect delay replaces gate delay as the dominant component of circuit delay. Signal delay in an integrated circuit consists of two parts: gate delay and interconnect delay. In integrated circuit design before the deep-submicron era, interconnect delay was negligible and only the delay of the gates themselves had to be considered, so chip design could be divided into two independent stages, logic design and physical implementation. After entering the deep-submicron stage, however, as transistor feature sizes shrink, gates become ever faster, and the main factor limiting circuit performance is no longer switching speed but interconnect delay. Interconnect delay arises mainly from two sources: first, as circuit clock frequencies rise, the signal wavelength falls to the millimeter or micron range and becomes comparable to interconnect lengths, so a signal needs a certain time to traverse a wire of a given length; this delay is called propagation delay. Second, capacitive coupling and inductive coupling between interconnect wires introduce additional delay.
To relieve the pressure that multiple cores place on memory access, multi-core processors generally integrate a large-capacity on-chip Cache to improve memory-system performance. A large on-chip Cache directly increases the Cache hit rate and reduces the frequency of off-chip memory accesses. However, because such a Cache occupies a large area and is distributed across different locations of the chip, wire delay causes Cache accesses at the same level but over different distances to exhibit different latencies, the so-called Non-Uniform Cache Access (NUCA).
To date, the second-level cache of a chip multi-core processor has two basic design schemes: private Cache and shared Cache. In a CMP with a private organization, each processor core has its own independent second-level cache; like the core's first-level Cache, this second-level cache serves only the core of its own node and cannot be accessed directly by other cores, so the average second-level hit latency is relatively low but the miss rate is high. In a CMP with a shared organization, the second-level cache uses a unified addressing policy and every core on the chip can access all second-level cache banks directly, so the on-chip Cache resources are utilized in a more balanced and thorough way. Because the intended applications of CMPs are very diverse, different programs can exhibit significantly different memory-access characteristics, and even different phases of the same program can differ greatly. Faced with this diversity of application behavior, neither a purely shared nor a purely private second-level cache structure can support a wide range of applications well.
Under a NUCA structure, neither a purely private design nor a purely shared design achieves optimal performance across different workloads. Academia has therefore proposed hybrid Cache designs, namely shared Cache structures extended logically toward privacy, or private Cache structures extended with capacity-sharing policies, so that the second-level cache resources can be shared by multiple processor cores. The cooperative Cache scheme is one of the typical representatives of hybrid Cache design. It reduces the average latency of memory requests through cooperation among the private second-level caches, performing capacity stealing and spilling of Cache data to other Caches with the support of a centralized coherence engine. However, traditional cooperative Caches do not effectively control and manage the cooperation policy, which can harm performance as workload types grow more complex and working-set sizes grow ever larger.
Summary of the invention
The object of the invention is to provide a cache cooperation system for a chip multi-core processor, solving the problem that traditional cooperative Caches in the prior art do not effectively control and manage the cooperation policy, which degrades performance as workload types grow more complex and working-set sizes grow ever larger.
To solve these problems of the prior art, the technical scheme provided by the invention is as follows:
A cache cooperation system for a chip multi-core processor comprises a memory and a chip multi-core processor coupled to the memory, characterized in that each processor core of the multi-core processor includes a private first-level cache and a private second-level cache that maintains a strict inclusion relation with the first-level cache; all processor cores of the multi-core processor share a centralized coherence directory, which is used to enforce the coherence policy across the first-level caches, second-level caches, and memory.
Preferably, the first-level cache consists of private first-level instruction and data caches, and each processor core of the multi-core processor is provided with a miss queue MissQ at the first-level cache; the miss queue MissQ handles requests of the local processor core, or sends remote requests to non-local processor cores after looking up the centralized coherence directory.
Preferably, when the miss queue MissQ handles a local request, it accepts and processes the data access request of the local core and accesses the second-level cache, or it looks up the centralized coherence directory and then handles the coherence request raised by that data access on the local core.
Preferably, the processor cores of the multi-core processor are connected by routers, and the miss queue MissQ communicates with non-local processor cores through the routers.
Preferably, a write-back queue WrBQ is provided at the first-level cache of each processor core; the write-back queue WrBQ writes dirty data replaced out of the local core's first-level cache back into the second-level cache. The centralized coherence directory comprises the tags (Tag) of the second-level caches of all processor cores.
Another object of the invention is to provide a cache cooperative processing method for a chip multi-core processor, characterized in that the method comprises the following steps:
(1) The local processor core accesses its local first-level data cache; on a hit, the data is returned and the access completes; on a miss, the access state is placed in the miss queue and step (2) is performed.
(2) The second-level cache of the local processor core is accessed; on a hit, the hit status is returned to the miss queue, which returns the data to the first-level cache, and the access completes; on a miss, step (3) is performed.
(3) The centralized coherence directory is searched. If the directory shows that the data exists on a non-local target processor core, the second-level cache miss request in the miss queue is routed to the miss queue of the target core, which processes the request and locates the data. If the directory shows that no non-local core holds the data, step (4) is performed.
(4) Memory is accessed; the data is read from memory and placed in the local second-level cache, and the access completes.
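Steps (1) through (4) above can be sketched as a minimal Python model. All names here (the dict-based caches, the directory, the function signature) are illustrative assumptions; the patent specifies behavior, not an API, and the real MissQ handling is modeled inline.

```python
# Illustrative sketch of steps (1)-(4); names and data shapes are hypothetical.
from types import SimpleNamespace

def access(core_id, addr, cores, directory, memory):
    """Service a data access from core `core_id` following steps (1)-(4)."""
    core = cores[core_id]
    # (1) Probe the private L1 data cache.
    if addr in core.l1:
        return core.l1[addr]                      # hit: return data, done
    # On an L1 miss the request would enter the miss queue (MissQ);
    # here the miss handling simply continues inline.
    # (2) Probe the private L2 cache.
    if addr in core.l2:
        core.l1[addr] = core.l2[addr]             # fill L1 (strict inclusion holds)
        return core.l1[addr]
    # (3) Search the centralized coherence directory for a remote copy.
    owner = directory.get(addr)
    if owner is not None:
        core.l1[addr] = cores[owner].l2[addr]     # remote L2 supplies the data
        return core.l1[addr]
    # (4) Fall back to memory, fill the local L2, and record it in the directory.
    data = memory[addr]
    core.l2[addr] = data
    core.l1[addr] = data
    directory[addr] = core_id
    return data

cores = [SimpleNamespace(l1={}, l2={}) for _ in range(4)]
directory, memory = {}, {0x40: 42}
```

A first access from core 0 falls through to memory (step 4); a subsequent access from core 1 is then served on chip from core 0's L2 via the directory (step 3).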
Preferably, when the target processor core has a plurality of processor core, select the processor core of sequence number minimum in the described step (3).
Preferably, described step (3) also comprises after the target processor core finds these data, when the disappearance formation request of target processor core is read-only and this data when being clean, copy data on the target processor core and and carry out adaptability revision centralized consistance catalogue; When the disappearance formation request of target processor core is write request or this data when being dirty data, then data migtation is carried out adaptability revision in the heart and to centralized consistance catalogue to target processor core or other processor cores.
Preferably, when the data block replacement takes place,, then need not to move in the described method if find in the centralized consistance catalogue that there are a plurality of copies in this data block; If only a copy is then moved to certain processor core in the heart by network-on-chip with data block, and is revised centralized consistance catalogue simultaneously.
Preferably, also carry out the determining step of Streaming Media program in the described method in the data migration process; When to certain hour at interval in the MPKI parameter of data to be migrated when having properties of flow, then stop data migtation, directly copy on the target processor core and handle.
Compared with the schemes of the prior art, the advantages of the invention are:
1. The technical scheme of the invention covers both the hardware structure design of the cooperative Cache and the concrete cooperation process between Caches, improving processor performance through cooperation among the private Caches of a chip multi-core processor. Tests in a simulated processor environment show that the cooperative Cache system of the invention reduces the average memory-access latency and the Cache miss rate of a chip multi-core processor and improves the scalability of the Cache structure.
2. The private second-level cache structure of the invention is physically distributed, while mutual cooperation among multiple second-level Cache banks forms a logically shared Cache structure. This cooperative Cache memory system maintains a strict inclusion relation, completes data migration through the on-chip network, and improves processor performance through a stream-aware cooperation technique, so the cooperative Cache offers low average memory-access latency, a low Cache miss rate, and good scalability.
3. The hardware design of the technical scheme guarantees a strict inclusion relation between the first-level and second-level Caches, which reduces the hardware overhead of the coherence directory; since most current commercial multi-core processors also adopt inclusive designs, the cooperative Cache proposed by the invention is highly practical. In addition, during data migration the data moves directly between processor cores without being controlled by the coherence engine, which prevents the coherence engine from becoming a system bottleneck during migration and enhances the scalability of the system. A further advantage is that the stream-aware cooperation technique reduces the pollution that streaming-media programs inflict on the second-level caches of other cores, preserving the overall throughput of the multi-core system.
Description of drawings
The invention is further described below with reference to the drawings and embodiments:
Fig. 1 is a structural diagram of the cooperative Cache system with a centralized coherence directory according to an embodiment of the invention;
Fig. 2 is the neighbor-selection hierarchy diagram for processor P0 in the four-core configuration of the embodiment;
Fig. 3 is the neighbor-selection hierarchy diagram for processor P0 in a 16-core configuration;
Fig. 4 shows how the MPKI of gobmk (SPEC 2006) varies with Cache capacity;
Fig. 5 shows how the MPKI of bwaves (SPEC 2006) varies with Cache capacity;
Fig. 6 shows the speedup of the cooperative Cache relative to the shared Cache in the simulation tests of the embodiment.
Embodiment
The above scheme is further described below with reference to specific embodiments. It should be understood that these embodiments are intended to illustrate the invention, not to limit its scope. The implementation conditions used in the embodiments can be further adjusted according to the conditions of a specific manufacturer; unspecified implementation conditions are generally those of routine experiments.
Embodiment
This embodiment uses simulation technology to realize the cooperative Cache on a multi-core processor. A multi-core simulator based on SimpleScalar is adopted; the configuration parameters of the components are shown in Table 1, and the hardware structure of the chip multi-core processor realized by this configuration is shown in Fig. 1.
Table 1: Simulator configuration parameters
(The content of Table 1 appears only as images in the original publication and is not reproduced here.)
Each processor core of the chip multi-core processor has private first-level instruction and data Caches as well as a private second-level cache bank, and the first-level and second-level Caches maintain a strict inclusion relation. Each core has a miss queue (MissQ), which accepts and processes the data access requests or coherence requests of the core and accesses the second-level cache or the router; it handles both local and remote requests. Each core also has a write-back queue (WrBQ), which writes dirty data replaced out of the first-level Cache back into the second-level cache; the write-back queue handles only local requests. The whole multi-core chip has one centralized coherence directory (CCD), formed from the tags of all cores' second-level caches. The cores are connected by routers.
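The per-core structures just described can be summarized in a small Python sketch. The class name, field names, and dict/list representations are assumptions made for illustration; the patent describes hardware queues, not software objects.

```python
# Illustrative model of a core's private structures and the CCD.
from dataclasses import dataclass, field

@dataclass
class Core:
    l1: dict = field(default_factory=dict)      # private L1 (instr + data merged here)
    l2: dict = field(default_factory=dict)      # private L2 bank
    miss_q: list = field(default_factory=list)  # MissQ: handles local and remote requests
    wrb_q: list = field(default_factory=list)   # WrBQ: handles local requests only

def drain_wrbq(core):
    """WrBQ behavior: write dirty blocks replaced out of L1 back into L2."""
    while core.wrb_q:
        addr, data = core.wrb_q.pop(0)
        core.l2[addr] = data  # strict inclusion: the L1 victim still resides in L2

# The CCD is built from the L2 tags of every core: address -> set of sharer core IDs.
ccd = {}
```

Draining the WrBQ after an L1 replacement leaves the dirty block in L2, which is what the strict inclusion relation requires.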
The cooperative Cache system of this embodiment supports a cache coherence protocol. A simple data-transfer and lookup procedure between the Caches of different processor cores comprises the following steps:
1) The processor core accesses its local first-level Cache. On a hit, the data is returned and the access completes; on a miss, the miss status is placed in the miss queue, and the procedure goes to step 2.
2) The miss queue processes the first-level Cache miss entry by accessing the second-level cache. If the second-level access hits, the hit status is returned to the miss queue, which returns the data to the first-level Cache, and the access completes; if the second-level access misses, the centralized coherence directory is searched, and the procedure goes to step 3.
3) If the directory shows that the data exists on another core, the second-level miss request in the miss queue is routed to the miss queue of the processor holding the data (the one with the lowest processor number), and the procedure goes to step 4. If no other core holds the data, memory is accessed and the data is read from memory into the second-level cache (which may cause second-level cache data to migrate to another core), and the access completes.
4) The target core processes the request in its miss queue and finds the data. If the request is read-only and the data is clean, the data is copied to the target core. If the request is a write request or the found data is dirty, the data is migrated to the target core. The above operations make corresponding modifications to the centralized coherence directory.
The cooperative Cache realized on private second-level caches forms a logically shared Cache structure through mutual cooperation among multiple second-level cache banks. It mainly involves the following four aspects.
(1) Cache-to-Cache data transmission
The cooperative Cache design proposed in this embodiment realizes clean-data transmission from Cache to Cache. Such transmission requires the support of a Cache coherence protocol; most importantly, when multiple clean copies exist on the chip, one copy must be designated to respond to the request of the missing processor core. The principle adopted here is to select the core with the lowest sequence number as the owner of the clean data, responsible for supplying it. In addition, the directory is located at the center of all processor cores, so it can provide fast access for all second-level cache banks.
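The owner-selection rule above (lowest-numbered core among the holders of a clean copy supplies the data) is trivially small in code. The dict-of-sets directory shape below is an assumption for illustration only.

```python
# Sketch of the clean-copy owner-selection rule described above.

def select_owner(directory, addr):
    """Return the lowest-numbered core holding the block, or None if no core does."""
    sharers = directory.get(addr, set())
    return min(sharers) if sharers else None

directory = {0x100: {5, 2, 7}}  # cores 2, 5, and 7 hold clean copies of 0x100
```

With three sharers {5, 2, 7}, core 2 is designated owner and responds to the miss.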
(2) Data migration method
In the cooperative Cache design, when a data block is replaced, the centralized coherence directory is searched; if multiple copies of the block exist, no migration is needed. If the block has only one copy, the push-based spilling protocol proposed in [Jichuan Chang and Gurindar S. Sohi, "Cooperative Caching for Chip Multiprocessors," in Proc. of the 33rd International Symposium on Computer Architecture (ISCA '06)] can be adopted to migrate it to a designated processor core through the on-chip network while modifying the CCD, so that the next access to the same block can obtain it directly on chip, reducing the number of off-chip memory accesses.
There are two schemes for selecting the migration target. One is to select a core at random as the target; the other is neighbor-preferred selection, i.e., selecting a nearby core as the target. The core layout is shown in Fig. 2: when P0 needs to migrate data, neighbor-preferred selection always favors the cores closest to P0 (P1 and P2) as migration targets. The random scheme is simple to implement but may hurt performance. When the number of cores grows to 16, as shown in Fig. 3, neighbor-preferred selection favors P1 and P4, the cores closest to P0, as migration targets. With random selection there is some probability that P0 migrates data all the way to P15, and if P0 then accesses the data again it must suffer a long network delay; neighbor-preferred selection avoids this situation to a certain extent.
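The two target-selection schemes can be sketched as follows, assuming the cores are laid out row-major on a width-by-height mesh (the mesh layout is an assumption; Figs. 2 and 3 show the actual selection hierarchy).

```python
# Sketch of random vs. neighbor-preferred migration-target selection.
import random

def random_target(src, n_cores):
    """Random scheme: any other core may receive the migrated block."""
    return random.choice([c for c in range(n_cores) if c != src])

def neighbor_preferred(src, width, height):
    """Rank candidate targets by Manhattan distance on the mesh, nearest first."""
    sx, sy = src % width, src // width
    others = [c for c in range(width * height) if c != src]
    return sorted(others, key=lambda c: abs(c % width - sx) + abs(c // width - sy))
```

Under this mesh assumption the ranking reproduces the examples in the text: in a 4-core layout P0's nearest targets are P1 and P2, and in a 16-core layout they are P1 and P4, while P15 is the worst choice.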
(3) Replacement caused by migration
When a source core migrates data to another core, the corresponding set on that core may consist entirely of valid blocks. If migration were unrestricted, that core would in turn select a block from the set and migrate it to yet another core; the migration could be handed on again and again, with the end result that a cycle forms and some block is perpetually in flight. To avoid this situation, this embodiment adopts the 1-FWD technique from [Jichuan Chang and Gurindar S. Sohi, "Cooperative Caching for Chip Multiprocessors," in Proc. of the 33rd International Symposium on Computer Architecture (ISCA '06)]: when a data block is migrated onto a target core whose corresponding set contains only valid data, a block is selected from that set and simply replaced away, so circular migration cannot occur.
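The 1-FWD rule above amounts to: a migrated block travels at most one hop, and a victim displaced by it is evicted outright rather than forwarded again. A minimal sketch, with the cache set modeled as a Python set and the replacement policy left arbitrary:

```python
# Sketch of 1-FWD: receive a migrated block without forwarding its victim.

def receive_migration(target_set, block, ways):
    """Insert a migrated block into a set with `ways` ways; return the evicted
    victim (dropped, never re-migrated) or None if the set had room."""
    evicted = None
    if len(target_set) >= ways:           # set full of valid blocks
        evicted = sorted(target_set)[0]   # any replacement policy would do here
        target_set.remove(evicted)        # evicted outright: no migration cycle
    target_set.add(block)
    return evicted
```

Because the displaced victim is dropped instead of migrated onward, a chain of migrations terminates after one hop.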
(4) Stream-aware cooperation method
Streaming-media programs access memory with very poor spatial locality, and data they access is seldom accessed again. For the two programs shown in Fig. 4 and Fig. 5, after a certain amount of Cache space is allocated to them, their MPKI (Cache misses per thousand instructions) stays at a high level and changes very little as Cache capacity increases; that is, they exhibit streaming characteristics. When designing the cooperative Cache, the effect of such loads on cooperative Cache performance must be considered. When a streaming-media program runs on some core, its second-level cache suffers a large number of capacity misses. Under the migration strategy described above, a large amount of data would be migrated to other cores, and since this data is essentially never reused, the effective Cache space on the target cores would shrink, causing capacity misses there and reducing the throughput of the whole multi-core system.
In view of this situation, the solution adopted in this embodiment is: once a load is detected to exhibit streaming-media characteristics within a certain time interval, the migration of its evicted data is stopped and the data is handled in the conventional manner.
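The stream-aware gate can be sketched as below: a load whose MPKI stays high and nearly flat as its Cache allocation grows is classified as streaming, and its evicted blocks are not migrated. The threshold and flatness tolerance are assumed values chosen for illustration; the patent does not specify them.

```python
# Sketch of the stream-aware cooperation gate described above.

def is_streaming(mpki_by_capacity, high=10.0, flat_tol=0.05):
    """Heuristic: MPKI stays above `high` and varies by at most `flat_tol`
    (relative) across increasing cache allocations (assumed thresholds)."""
    samples = list(mpki_by_capacity)
    if not samples:
        return False
    high_enough = min(samples) >= high
    flat = (max(samples) - min(samples)) <= flat_tol * max(samples)
    return high_enough and flat

def on_eviction(block, streaming):
    """Suppress migration for streaming loads; otherwise migrate as usual."""
    return "drop" if streaming else "migrate"
```

A gobmk-like profile (high, flat MPKI) is flagged as streaming and its victims are dropped, while an ordinary load keeps the normal migration path.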
In the simulation tests, the chip multi-core processor of this embodiment was run with typical SPLASH-2 test programs under two configurations, shared Cache and cooperative Cache, and the IPC values were measured to compute the speedup.
Experimental results: the IPC values measured by the chip multi-core processor under the shared Cache and cooperative Cache configurations are shown in Table 2, and the corresponding speedups are shown in Fig. 6. The figure shows that the chip multi-core processor performs well with the Cache cooperation method.
Table 2: IPC of the test programs

Test program    Shared Cache (IPC)    Cooperative Cache (IPC)
FMM             0.9689                1.0752
RADIX           1.0739                1.6982
NON_LU          0.8868                0.9180
CON_LU          0.8025                0.9147
FFT             1.1308                1.2734
WATER_NS        0.9141                0.9703
WATER_SP        0.9696                1.0742
In summary, the cooperative Cache realized by this embodiment has small hardware overhead, good scalability, and high system throughput. The experimental results show that, compared with the shared second-level cache design, the IPC of the cooperative Cache processor realized by the invention improves by 16.6% on average.
The above example merely illustrates the technical concept and characteristics of the invention; its purpose is to let those familiar with the technology understand and implement the content of the invention, and it does not limit the protection scope of the invention. All equivalent transformations or modifications made according to the spirit of the invention shall be covered by the protection scope of the invention.

Claims (10)

1. A cache cooperation system for a chip multi-core processor, comprising a memory and a chip multi-core processor coupled to the memory, characterized in that each processor core of the multi-core processor includes a private first-level cache and a private second-level cache that maintains a strict inclusion relation with the first-level cache; all processor cores of the multi-core processor share a centralized coherence directory, which is used to enforce the coherence policy across the first-level caches, second-level caches, and memory.
2. The cache cooperation system of a chip multi-core processor according to claim 1, characterized in that the first-level cache consists of private first-level instruction and data caches, and each processor core of the multi-core processor is provided with a miss queue MissQ at the first-level cache; the miss queue MissQ handles requests of the local processor core, or sends remote requests to non-local processor cores after looking up the centralized coherence directory.
3. The cache cooperation system of a chip multi-core processor according to claim 2, characterized in that when the miss queue MissQ handles a local request, it accepts and processes the data access request of the local core and accesses the second-level cache, or it looks up the centralized coherence directory and then handles the coherence request raised by that data access on the local core.
4. The cache cooperation system of a chip multi-core processor according to claim 2, characterized in that the processor cores of the multi-core processor are connected by routers, and the miss queue MissQ communicates with non-local processor cores through the routers.
5. The cache cooperation system of a chip multi-core processor according to claim 1, characterized in that a write-back queue WrBQ is provided at the first-level cache of each processor core; the write-back queue WrBQ writes dirty data replaced out of the local core's first-level cache back into the second-level cache; the centralized coherence directory comprises the tags (Tag) of the second-level caches of all processor cores.
6. A cache cooperative processing method for a chip multi-core processor, characterized in that the method comprises the following steps:
(1) the local processor core accesses its local first-level data cache; on a hit, the data is returned and the access completes; on a miss, the access state is placed in the miss queue and step (2) is performed;
(2) the second-level cache of the local processor core is accessed; on a hit, the hit status is returned to the miss queue, which returns the data to the first-level cache, and the access completes; on a miss, step (3) is performed;
(3) the centralized coherence directory is searched; if the directory shows that the data exists on a non-local target processor core, the second-level cache miss request in the miss queue is routed to the miss queue of the target core, which processes the request and locates the data; if the directory shows that no non-local core holds the data, step (4) is performed;
(4) memory is accessed; the data is read from memory and placed in the local second-level cache, and the access completes.
7. method according to claim 6 is characterized in that in the described step (3) when the target processor core has a plurality of processor core, selects the processor core of sequence number minimum.
8. method according to claim 6, it is characterized in that described step (3) also comprises after the target processor core finds these data, when the disappearance formation request of target processor core is read-only and this data when being clean, copy data on the target processor core and and carry out adaptability revision centralized consistance catalogue; When the disappearance formation request of target processor core is write request or this data when being dirty data, then data migtation is carried out adaptability revision in the heart and to centralized consistance catalogue to target processor core or other processor cores.
9. The method according to claim 8, wherein when a data block is replaced, if the centralized coherence directory shows that multiple copies of the block exist, no migration is needed; if only one copy exists, the block is migrated through the on-chip network to another processor core, and the centralized coherence directory is modified accordingly.
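The replacement rule of claim 9 can be sketched as follows. This is an assumed model: cores are integer IDs, the per-core L2 caches and the directory are dicts, and picking the lowest-numbered other core as the migration destination is a hypothetical choice (the claim only says "a certain processor core").

```python
# Hypothetical sketch of claim 9's replacement handling: an evicted block is
# simply dropped if the directory shows other on-chip copies, and migrated to
# another core's L2 over the on-chip network if it is the sole copy.

def on_replace(core_id, addr, l2_caches, directory):
    """l2_caches: {core_id: {addr: data}}; directory: {addr: [core_ids]}.

    Returns the destination core ID if the block was migrated, else None."""
    copies = directory.get(addr, [])
    if len(copies) > 1:
        # Other copies exist on chip: no migration needed, just drop ours.
        copies.remove(core_id)
        del l2_caches[core_id][addr]
        return None
    # Sole copy: migrate the block to another core and update the directory.
    data = l2_caches[core_id].pop(addr)
    dest = min(c for c in l2_caches if c != core_id)  # assumed destination policy
    l2_caches[dest][addr] = data
    directory[addr] = [dest]
    return dest
```

Keeping the sole on-chip copy alive at eviction time is what lets a later step-(3) directory lookup hit in a remote L2 instead of going to memory.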
10. The method according to claim 8, wherein the method further comprises a streaming-media decision step during data migration: when the MPKI parameter of the data to be migrated over a given time interval exhibits streaming characteristics, the migration is stopped and the data is instead copied directly to the target processor core for processing.
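The streaming check of claim 10 can be sketched as a simple threshold test on MPKI (misses per kilo-instruction). The threshold value and the copy/migrate return convention are assumptions for illustration; the claim does not fix a concrete criterion for "streaming characteristics".

```python
# Hypothetical sketch of claim 10: blocks whose MPKI over the sampling
# interval looks stream-like are copied to the target core rather than
# migrated, since streaming data has little reuse and migration is wasted.

STREAM_MPKI_THRESHOLD = 10.0   # assumed tuning parameter, not from the patent

def mpki(misses, instructions):
    """Misses per thousand instructions over the sampling interval."""
    return misses * 1000.0 / instructions

def handle_migration(block, misses, instructions):
    """Return ('copy', block) for streaming blocks, ('migrate', block) otherwise."""
    if mpki(misses, instructions) >= STREAM_MPKI_THRESHOLD:
        return ('copy', block)     # streaming: stop migration, copy directly
    return ('migrate', block)
```

The rationale is that migrating a block that will not be re-referenced (as in streaming media) only burns on-chip network bandwidth, so a cheap copy suffices.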
CN200910186558A 2009-11-24 2009-11-24 Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof Pending CN101706755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910186558A CN101706755A (en) 2009-11-24 2009-11-24 Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910186558A CN101706755A (en) 2009-11-24 2009-11-24 Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof

Publications (1)

Publication Number Publication Date
CN101706755A true CN101706755A (en) 2010-05-12

Family

ID=42376981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910186558A Pending CN101706755A (en) 2009-11-24 2009-11-24 Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof

Country Status (1)

Country Link
CN (1) CN101706755A (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011150803A1 (en) * 2010-09-28 2011-12-08 华为技术有限公司 Method, apparatus and system for cache collaboration
CN102137138A (en) * 2010-09-28 2011-07-27 华为技术有限公司 Method, device and system for cache collaboration
CN102137138B (en) * 2010-09-28 2013-04-24 华为技术有限公司 Method, device and system for cache collaboration
US8719552B2 (en) 2010-09-28 2014-05-06 Huawei Technologies Co., Ltd. Cache collaboration method, apparatus, and system
CN102270180B (en) * 2011-08-09 2014-04-02 清华大学 Multicore processor cache and management method thereof
CN102270180A (en) * 2011-08-09 2011-12-07 清华大学 Multicore processor cache and management method thereof
CN102346714A (en) * 2011-10-09 2012-02-08 西安交通大学 Consistency maintenance device for multi-kernel processor and consistency interaction method
CN102346714B (en) * 2011-10-09 2014-07-02 西安交通大学 Consistency maintenance device for multi-kernel processor and consistency interaction method
CN103729248A (en) * 2012-10-16 2014-04-16 华为技术有限公司 Method and device for determining tasks to be migrated based on cache perception
CN103729248B (en) * 2012-10-16 2017-12-15 华为技术有限公司 A kind of method and apparatus of determination based on cache perception task to be migrated
US9483321B2 (en) 2012-10-16 2016-11-01 Huawei Technologies Co., Ltd. Method and apparatus for determining to-be-migrated task based on cache awareness
CN103092788A (en) * 2012-12-24 2013-05-08 华为技术有限公司 Multi-core processor and data access method
CN103092788B (en) * 2012-12-24 2016-01-27 华为技术有限公司 Polycaryon processor and data access method
CN104077171B (en) * 2013-03-28 2017-12-15 华为技术有限公司 Processing method and equipment during scheduling virtual machine
CN104077171A (en) * 2013-03-28 2014-10-01 华为技术有限公司 Processing method and equipment during virtual machine scheduling
CN104202650A (en) * 2014-09-28 2014-12-10 西安诺瓦电子科技有限公司 Streaming media broadcast system and method and LED display screen system
US10409723B2 (en) 2014-12-10 2019-09-10 Alibaba Group Holding Limited Multi-core processor supporting cache consistency, method, apparatus and system for data reading and writing by use thereof
CN104866457A (en) * 2015-06-04 2015-08-26 电子科技大学 On-chip multi-core processor static architecture based on shared cache
WO2017016427A1 (en) * 2015-07-27 2017-02-02 华为技术有限公司 Method and device for maintaining cache data consistency according to directory information
CN106406745A (en) * 2015-07-27 2017-02-15 杭州华为数字技术有限公司 Method and device for maintaining Cache data uniformity according to directory information
CN106406745B (en) * 2015-07-27 2020-06-09 华为技术有限公司 Method and device for maintaining Cache data consistency according to directory information
CN110704362A (en) * 2019-09-12 2020-01-17 无锡江南计算技术研究所 Processor array local storage hybrid management technology
CN113392604A (en) * 2021-06-04 2021-09-14 中国科学院计算技术研究所 Advanced packaging technology-based dynamic capacity expansion method and system for cache under multi-CPU (Central processing Unit) common-package architecture
CN113392604B (en) * 2021-06-04 2023-08-01 中国科学院计算技术研究所 Dynamic capacity expansion method and system for cache under multi-CPU co-encapsulation architecture based on advanced encapsulation technology

Similar Documents

Publication Publication Date Title
CN101706755A (en) Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof
US8904154B2 (en) Execution migration
CN101354682B (en) Apparatus and method for settling access catalog conflict of multi-processor
US6976131B2 (en) Method and apparatus for shared cache coherency for a chip multiprocessor or multiprocessor system
Jerger et al. Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence
CN102999522B (en) A kind of date storage method and device
US20040088487A1 (en) Scalable architecture based on single-chip multiprocessing
Cuesta et al. Increasing the effectiveness of directory caches by avoiding the tracking of noncoherent memory blocks
CN102063406A (en) Network shared Cache for multi-core processor and directory control method thereof
Maas et al. Buzzard: A numa-aware in-memory indexing system
CN100557581C (en) A kind of Cache management method of data-oriented stream
CN101339527B (en) Shadow EMS memory backup method and apparatus
Valero et al. An hybrid eDRAM/SRAM macrocell to implement first-level data caches
Guz et al. Utilizing shared data in chip multiprocessors with the Nahalal architecture
CN102117262B (en) Method and system for active replication for Cache of multi-core processor
CN104765572B (en) The virtual storage server system and its dispatching method of a kind of energy-conservation
Das et al. Victim retention for reducing cache misses in tiled chip multiprocessors
Valls et al. PS directory: A scalable multilevel directory cache for CMPs
CN106547488B (en) A kind of hybrid cache management method
Hijaz et al. Locality-aware data replication in the last-level cache for large scale multicores
Lira et al. Replacement techniques for dynamic NUCA cache designs on CMPs
Kurian Locality-aware cache hierarchy management for multicore processors
Chou et al. No cache-coherence: a single-cycle ring interconnection for multi-core L1-NUCA sharing on 3D chips
Jing et al. Construction and optimization of heterogeneous memory system based on NUMA architecture
Zhao et al. RouteReplies: Alleviating Long Latency in Many-Chip-Module GPUs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20100512