CN112083957A - Bandwidth control device, multithread controller system and memory access bandwidth control method - Google Patents


Info

Publication number
CN112083957A
CN112083957A (application CN202010991780.6A)
Authority
CN
China
Prior art keywords
thread
rate
memory access
processor core
bandwidth control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010991780.6A
Other languages
Chinese (zh)
Other versions
CN112083957B (en)
Inventor
姚涛
时兴
贾琳黎
林江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202010991780.6A priority Critical patent/CN112083957B/en
Publication of CN112083957A publication Critical patent/CN112083957A/en
Application granted granted Critical
Publication of CN112083957B publication Critical patent/CN112083957B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047 Prefetch instructions; cache control instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842 Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10 Providing a specific technical effect
    • G06F2212/1016 Performance improvement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60 Details of cache memory
    • G06F2212/6042 Allocation of cache space to multiple users or processors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application provides a bandwidth control device, a multithread controller system, and a memory access bandwidth control method. The bandwidth control device is connected to the last-level cache (LLC) and to a processor core; the processor core supports multithreading and communicates with a multi-level cache. The bandwidth control device obtains a first memory access instruction sent by the LLC to a lower-level storage unit, determines a first processing priority corresponding to the first thread identifier, and determines the limit rate at which the first thread may send memory access instructions after a preset clock cycle. The bandwidth control device then sends the limit rate to the first processor core, instructing it to limit, according to the limit rate, the number of memory access instructions the first thread sends within the preset clock cycle. Because the memory access bandwidth of low-priority threads is limited at the processor core, more cache resources remain available when high-priority threads generate memory access instructions, so the scheme both limits the bandwidth resources of low-priority threads and preserves the smooth operation of high-priority threads.

Description

Bandwidth control device, multithread controller system and memory access bandwidth control method
Technical Field
The application relates to the field of computers, in particular to a bandwidth control device, a multithreading controller system and a memory access bandwidth control method.
Background
In a multi-core multithreaded processor, cache bandwidth management supporting Quality of Service (QoS) enables more orderly execution of programs. QoS provides better service to high priority threads by limiting the bandwidth resources occupied by low priority threads.
However, in practice, multiple threads share resources throughout the processor's multi-level cache system, and the threads contend for those resources. Because the last-level cache (LLC) serves multiple processor cores and multiple threads, contention is especially intense in the LLC. Moreover, throttling low-priority threads inside the LLC actually causes them to occupy LLC resources for longer, leaving fewer resources available to high-priority threads; requests from high-priority threads may even be blocked because the resources are occupied. The prior art therefore cannot limit the bandwidth resources of low-priority threads without interfering with the operation of high-priority threads.
Disclosure of Invention
An object of the embodiments of the present application is to provide a bandwidth control device, a multithread controller system, and a memory access bandwidth control method that solve the problem that the prior art cannot limit the bandwidth resources of low-priority threads without interfering with the operation of high-priority threads.
In a first aspect, an embodiment of the present application provides a bandwidth control device, where the bandwidth control device is respectively connected to a last-level cache LLC in a multi-level cache and at least one processor core, where the at least one processor core supports multithreading, and the at least one processor core is in communication with the multi-level cache; the bandwidth control device is configured to obtain a first access instruction sent by the LLC to a lower-level storage unit, where the first access instruction carries a first core identifier of a first processor core that generates the first access instruction and a first thread identifier of a first thread that the first processor core runs and generates the first access instruction; the bandwidth control device is used for determining a first processing priority corresponding to the first thread identifier and determining the limiting rate of the first thread for sending the memory access instruction after a preset clock cycle according to the first processing priority; and the bandwidth control device is used for sending the limiting rate to the first processor core and indicating the processor core to limit the number of the access instructions sent by the first thread after a preset clock cycle according to the limiting rate.
In the foregoing embodiment, the bandwidth control device obtains a memory access instruction issued by the LLC and extracts from it the thread identifier of the thread that generated the instruction. It determines that thread's processing priority from the identifier, calculates a limit rate from the priority, and sends the limit rate to the processor core from which the instruction came. The processor core then limits the number of memory access instructions the thread may send per unit clock cycle according to the limit rate; the throttled threads are those with lower priority. In this way, the memory access bandwidth of low-priority threads is limited at the processor core (i.e., at the dispatch stage of processor instructions), and the pressure from low-priority threads on the multi-level cache is correspondingly reduced. With fewer memory access instructions from low-priority threads in the multi-level cache, more cache resources are available to high-priority threads when they generate memory access instructions, so the scheme both limits the bandwidth resources of low-priority threads and preserves the smooth operation of high-priority threads.
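The dispatch-stage throttle described above can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: the class name, the rate encoding (memory access issues allowed per window of clock cycles), and the per-thread bookkeeping are all assumptions.

```python
class DispatchThrottle:
    """Per-core limiter applied at the instruction dispatch stage (illustrative)."""

    def __init__(self):
        self.limit = {}    # thread id -> max memory-access issues per cycle window
        self.issued = {}   # thread id -> issues in the current window

    def set_limit_rate(self, tid, rate):
        """Called when the bandwidth control device sends a new limit rate."""
        self.limit[tid] = rate
        self.issued.setdefault(tid, 0)

    def may_issue(self, tid):
        """True if thread `tid` may dispatch another memory access instruction."""
        if tid not in self.limit:          # unthrottled (e.g. high-priority) thread
            return True
        if self.issued[tid] < self.limit[tid]:
            self.issued[tid] += 1
            return True
        return False                       # hold the instruction until the next window

    def new_window(self):
        """Reset counters at the start of each preset clock-cycle window."""
        for tid in self.issued:
            self.issued[tid] = 0
```

A low-priority thread whose limit rate is 2 would be allowed two memory access issues per window, while threads with no configured limit dispatch freely.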
In one possible design, the bandwidth control device includes a plurality of control calculation units, at least one of the plurality of control calculation units is in a running state, the number of control calculation units in the running state is the same as the number of all processing priorities of all threads supported by the multithreaded processor system, and the control calculation units in the running state correspond to the processing priorities one to one; the bandwidth control device is used for determining a first control calculation unit corresponding to the first processing priority; the bandwidth control device is used for calculating the limit rate of the first thread sending the memory access instruction after the preset clock cycle by using the first control calculation unit.
In the above-described embodiment, the control calculation unit in the running state in the bandwidth control apparatus corresponds one-to-one to all the processing priorities of all the threads supported by the multithreaded processor system. The bandwidth control device may select a first control calculation unit of the corresponding processing priority from the plurality of control calculation units in the operating state, and realize the calculation of the limit rate by using the first control calculation unit. Each control calculation unit in the running state is provided with a component with parameters corresponding to the processing priority, so that the limit rate corresponding to the processing priority is calculated.
In one possible design, the first control calculation unit includes a token generator, a token counter, and a limit rate calculation unit, and both the token generator and the limit rate calculation unit are connected to the token counter; the token generator is used for generating a corresponding token for each thread belonging to the first priority, and the generation period of the generated token is the same as the preset sending period of the access instruction corresponding to the first priority; the token counter is used for adding one to the current number of the tokens of the first thread when receiving the tokens newly generated by the token generator for the first thread; the token counter is further configured to reduce the current number of tokens of the first thread by one when the LLC sends a memory access instruction of the first thread to a lower-level storage unit; the limit rate calculation unit is used for calculating a difference value between the initial number of the tokens of the first thread and the current number of the tokens of the first thread in the current clock cycle, and calculating the limit rate of the first thread for sending the access instruction after the preset clock cycle according to the difference value.
In the above embodiment, the token generation cycle of the token generator included in the first control calculation unit may be the same as a preset sending cycle of the memory access instruction, where the preset sending cycle may be obtained according to a set memory access bandwidth, and the memory access bandwidth may be set by a user according to a processing priority of a thread. The token counter can increase the current number of the tokens by one when receiving the tokens newly generated by the token generator; and when the memory access instruction is received, reducing the current number of the token by one. By the method, the limit rate calculation unit can obtain the change trend and the change value of the number of the tokens in the time period of the current clock cycle; and calculating the limit rate according to the change trend and the change value of the number of the tokens. The memory access bandwidth can be set according to the processing priority of the threads, so that the calculated limit rate is related to the processing priority, the threads with different processing priorities can be distinguished and treated, and the processing efficiency is improved.
In one possible design, the limit rate calculating unit is configured to calculate a limit rate variation according to a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in a current clock cycle and a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in a latest one or more clock cycles; the limit rate calculation unit is used for calculating the sum of the historical limit rate and the limit rate variable quantity, wherein the sum is the limit rate of the first thread sending the access instruction after the preset clock cycle.
In the above embodiment, the rate of change of the limit rate may be calculated according to the token difference of the current clock cycle and the token difference of the historical clock cycle, and then the sum of the rate of change of the limit rate and the historical rate of limit may be calculated, so as to obtain a new limit rate, where the limit rate is a limit rate at which the processor core sends the access instruction after the preset clock cycle. And calculating the limiting rate by combining historical data, so that the limiting rate is more in line with the requirement of memory access bandwidth control.
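The update rule above can be sketched as a simple feedback correction: the new limit rate is the historical rate plus a change computed from the current and recent token deficits. The gains `kp` and `kd` and the rate floor are assumptions introduced for illustration; the patent states only that the change is derived from the two deficits.

```python
def next_limit_rate(prev_rate, deficit_now, deficit_prev, kp=0.5, kd=0.25, min_rate=1):
    """Illustrative limit-rate update for one thread's control calculation unit.

    deficit_now / deficit_prev: (initial tokens - current tokens) in the
    current and in a recent clock cycle.  A larger deficit means tokens are
    drained faster than replenished, so the issue rate is tightened.
    """
    delta = -kp * deficit_now - kd * (deficit_now - deficit_prev)
    return max(min_rate, prev_rate + delta)
```

For example, a rate of 10 with deficits of 4 now and 2 previously would be reduced, while a zero deficit leaves the rate unchanged.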
In a second aspect, the present application provides a multithreaded processor system, including at least one processor core, a multi-level cache, and the bandwidth control device according to the first aspect or any optional implementation manner of the first aspect, where the at least one processor core supports multithreading, the at least one processor core communicates with the multi-level cache, the multi-level cache includes a last level cache LLC, and the LLC and the at least one processor core are both connected to the bandwidth control device.
In the above embodiment, the limitation on the access bandwidth of the low-priority thread is realized in the link of the processor core, and the limitation on the access bandwidth of the low-priority thread in the multi-level cache is correspondingly reduced; the reduction of the access instructions generated by the threads with low priority in the multi-level cache enables more cache resources which can be used by the threads with high priority in the cache to generate the access instructions, so that the bandwidth resource limitation of the threads with low priority and the smooth operation of the threads with high priority are both realized.
In one possible design, the processor core comprises an instruction output logic module and an execution/memory access unit, the instruction output logic module is connected with the execution/memory access unit, and an emission limiting module is arranged in the instruction output logic module; the emission limiting module in the processor core is configured to receive the limit rate sent by the bandwidth control device and to limit, according to the limit rate, the number of memory access instructions sent by the first thread after a preset clock cycle.
In a third aspect, an embodiment of the present application provides a memory access bandwidth control method, which is applied to a multithreaded processor system in any optional implementation manner of the second aspect and the second aspect, and the method includes: the bandwidth control device acquires a first memory access instruction sent by the LLC to a lower-level storage unit, wherein the first memory access instruction carries a first core identifier of a first processor core which generates the first memory access instruction and a first thread identifier of a first thread which is operated by the first processor core and generates the first memory access instruction; the bandwidth control equipment determines a first processing priority corresponding to the first thread identifier, and determines the limit rate of the first thread for sending the memory access instruction after a preset clock cycle according to the first processing priority; and the bandwidth control equipment sends the limiting rate to the first processor core and instructs the processor core to limit the number of the memory access instructions sent by the first thread after a preset clock cycle according to the limiting rate.
In the foregoing embodiment, the bandwidth control apparatus may obtain a memory access instruction issued by the LLC, and obtain, from the memory access instruction, a thread identifier of a thread that generates the memory access instruction. And determining the processing priority of the thread generating the memory access instruction according to the thread identification, further calculating the limit rate according to the processing priority, and then sending the limit rate to the processor core from which the memory access instruction comes by the bandwidth control equipment. The processor core can limit the number of memory access instructions sent by the same thread in a unit clock cycle according to a limiting rate, and the processing priority of the limited thread is lower priority. By the mode, the limitation on the access bandwidth of the thread with low priority is realized in the link of the processor core (namely the dispatching stage of the processor instruction), and the limitation on the access bandwidth of the thread with low priority in the multi-level cache is correspondingly reduced; the reduction of the access instructions generated by the threads with low priority in the multi-level cache enables more cache resources which can be used by the threads with high priority in the cache to generate the access instructions, so that the bandwidth resource limitation of the threads with low priority and the smooth operation of the threads with high priority are both realized.
In one possible design, the bandwidth control device includes a plurality of control calculation units, at least one of the plurality of control calculation units is in a running state, the number of control calculation units in the running state is the same as the number of all processing priorities of all threads supported by the multithreaded processor system, and the control calculation units in the running state correspond to the processing priorities one to one; the determining, according to the first processing priority, a limit rate at which the first thread sends the memory access instruction after a preset clock cycle includes: the bandwidth control equipment determines a first control calculation unit corresponding to the first processing priority; and the bandwidth control equipment calculates the limit rate of the first thread sending the memory access instruction after a preset clock cycle by using the first control calculation unit.
In the above-described embodiment, the control calculation unit in the running state in the bandwidth control apparatus corresponds one-to-one to all the processing priorities of all the threads supported by the multithreaded processor system. The bandwidth control device may select a first control calculation unit of the corresponding processing priority from the plurality of control calculation units in the operating state, and realize the calculation of the limit rate by using the first control calculation unit. Each control calculation unit in the running state is provided with a component with parameters corresponding to the processing priority, so that the limit rate corresponding to the processing priority is calculated.
In one possible design, the first control calculation unit includes a token generator, a token counter, and a limit rate calculation unit, and both the token generator and the limit rate calculation unit are connected to the token counter; the token generator is used for generating a corresponding token for each thread belonging to the first priority, and the generation cycle of the token is the same as the preset sending cycle of the memory access instruction corresponding to the first priority; the step in which the bandwidth control device calculates, by using the first control calculation unit, the limit rate at which the processor core sends memory access instructions after the preset clock cycle includes: when the token counter receives a token newly generated by the token generator for the first thread, adding one to the current number of tokens of the first thread; when the LLC sends a memory access instruction of the first thread to a lower-level storage unit, the token counter reduces the current number of tokens of the first thread by one; and the limit rate calculation unit calculates the difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock cycle, and calculates, according to the difference, the limit rate at which the first thread sends memory access instructions after the preset clock cycle.
In the above embodiment, the token generation cycle of the token generator included in the first control calculation unit may be the same as a preset sending cycle of the memory access instruction, where the preset sending cycle may be obtained according to a set memory access bandwidth, and the memory access bandwidth may be set by a user according to a processing priority of a thread. The token counter can increase the current number of the tokens by one when receiving the tokens newly generated by the token generator; and when the memory access instruction is received, reducing the current number of the token by one. By the method, the limit rate calculation unit can obtain the change trend and the change value of the number of the tokens in the time period of the current clock cycle; and calculating the limit rate according to the change trend and the change value of the number of the tokens. The memory access bandwidth can be set according to the processing priority of the threads, so that the calculated limit rate is related to the processing priority, the threads with different processing priorities can be distinguished and treated, and the processing efficiency is improved.
In a possible design, the calculating, according to the difference, a limit rate at which the first thread sends the memory access instruction after a preset clock cycle includes: the limit rate calculation unit calculates a limit rate variation according to a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in a current clock cycle and a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in one or more latest clock cycles; and the limiting rate calculation unit calculates the sum of the historical limiting rate and the limiting rate variable quantity, wherein the sum is the limiting rate of the first thread sending the access instruction after a preset clock cycle.
In the above embodiment, the rate of change of the limit rate may be calculated according to the token difference of the current clock cycle and the token difference of the historical clock cycle, and then the sum of the rate of change of the limit rate and the historical rate of limit may be calculated, so as to obtain a new limit rate, where the limit rate is a limit rate at which the processor core sends the access instruction after the preset clock cycle. And calculating the limiting rate by combining historical data, so that the limiting rate is more in line with the requirement of memory access bandwidth control.
In one possible design, the first processor core comprises an instruction output logic module and an execution/access unit, the instruction output logic module is connected with the execution/access unit, and an emission limiting module is arranged in the instruction output logic module; the bandwidth control device sending the limit rate to the first processor core, including: and the bandwidth control equipment sends a limiting rate to an emission limiting module in the first processor core, and instructs the emission limiting module to limit the number of the memory access instructions sent by the first thread after a preset clock cycle according to the limiting rate.
In the foregoing embodiment, the bandwidth control device specifically sends the limit rate to the emission limit module in the processor core, so that the emission limit module limits the emission rate of the instruction output logic module of the processor core according to the limit rate, thereby limiting the number of the memory access instructions sent by the first thread after the preset clock cycle.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a schematic structural block diagram of a multithread controller system provided by an embodiment of the present application;
FIG. 2 is a schematic block diagram of one embodiment of the multithread controller system provided in the embodiments of the present application;
FIG. 3 is a schematic block diagram of a bandwidth control device;
FIG. 4 is a schematic block diagram of a single control calculation unit in the bandwidth control device;
FIG. 5 is a schematic block diagram of any one of a plurality of processor cores;
FIG. 6 is a schematic flowchart of a memory access bandwidth control method provided in an embodiment of the present application;
FIG. 7 is a flowchart of the specific steps of step S120 in FIG. 6;
FIG. 8 is a flowchart of the specific steps of step S122 in FIG. 7.
Detailed Description
In contrast to the embodiments of the present application, conventional QoS-capable cache bandwidth management typically throttles low-priority threads inside the multi-level cache system; this throttling causes low-priority threads to occupy cache resources for long periods. For example, in the LLC, memory access instructions from a low-priority thread often linger in the miss queue (MissQueue) or the request queue (ReqQueue), leaving fewer entries available to high-priority threads. The MissQueue holds memory access instructions waiting to be sent to the next-level storage unit, and the ReqQueue holds memory access instructions received from the previous-level cache.
The bandwidth control device, the multithread controller system and the memory access bandwidth control method provided by the embodiment of the application limit the memory access bandwidth of the thread with low priority through a link of a processor core, namely a Dispatch stage (Dispatch/issue) of a processor instruction, so that the limitation of the memory access bandwidth of the thread with low priority in a multi-level cache is reduced, more cache resources can be used in the cache when the thread with high priority generates the memory access instruction, and the bandwidth resource limitation of the thread with low priority and the smooth operation of the thread with high priority are considered.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Referring to fig. 1, fig. 1 shows a schematic structural block diagram of a multithread controller system provided in an embodiment of the present application, where the system 10 includes a plurality of processor cores 110, a multi-level cache, and a bandwidth control device 210. Each processor core 110 of the plurality of processor cores 110 can support multithreading, with the plurality of processor cores 110 each communicating with multiple levels of cache.
Specifically, referring to fig. 1, assume without loss of generality that there are n processor cores 110 and m cache levels; the multi-level cache comprises a first-level cache 120, a second-level cache 130, and so on up to the m-th level, where the last level is denoted LLC 140. That is, in the multithread controller system shown in fig. 1, the m-th level cache is the LLC 140. The LLC 140 and the n processor cores 110 are each connected to the bandwidth control device 210.
Any one of the multiple threads supported by any one of the processor cores 110 may generate a number of memory accesses within a clock cycle of the processor core 110, where a memory access is an instruction that accesses particular data in a memory location. The access instruction can preferentially search whether the specific data to be accessed is cached in the high-level cache, and if the specific data is not cached in the high-level cache, the specific data is called to be 'missing' (Miss) in the high-level cache; if the specific data is cached in the high-level cache, the specific data is said to be "Hit" (Hit) in the high-level cache.
Referring to fig. 1, a memory access instruction generated by a thread in the current clock cycle of the processor core 110 first looks up the specific data to be accessed in the primary cache 120. If the data is not found there, the primary cache 120 is determined to have missed, and it sends the instruction to the secondary cache 130; the instruction then continues the lookup in the secondary cache 130, and on a miss there the secondary cache 130 sends it to the tertiary cache, and so on, until the level-(m-1) cache misses and sends the instruction to the level-m cache, i.e., the LLC 140. If the LLC 140 also misses, the LLC 140 sends the memory access instruction to the lower-level storage unit. The memory access instruction carries identity information SrcID, which includes a processor core identifier Core ID and a thread identifier TID; the Core ID identifies the processor core 110 from which the instruction came, and the TID identifies the thread from which it came.
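The level-by-level lookup above can be sketched as follows. This is a minimal, dict-backed model for illustration only; the class and function names are assumptions, not from the patent.

```python
# Minimal sketch of the cache hierarchy lookup: each level is searched in
# order, and a miss in the LLC forwards the request to the lower-level store.
class CacheLevel:
    def __init__(self, name):
        self.name = name
        self.lines = {}                      # address -> cached data

    def lookup(self, addr):
        return self.lines.get(addr)          # None means a miss

def access(levels, lower_store, addr, src_id):
    """src_id = (Core ID, TID) carried by the instruction, as in the text."""
    for level in levels:                     # primary cache first, LLC last
        data = level.lookup(addr)
        if data is not None:
            return level.name, data          # hit at this level
    return "memory", lower_store.get(addr)   # missed everywhere, incl. LLC
```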
The memory access instruction is sent through an on-chip interconnect system to a lower-level memory system, which is usually a Dynamic Random Access Memory (DRAM). Besides DRAM, the on-chip interconnect system may also connect to IO devices and other processor systems.
Optionally, in a specific embodiment, the primary cache 120 may include a primary instruction cache (abbreviated as L1I) and a primary data cache (abbreviated as L1D), and the processor core 110 may send a memory access instruction to either L1I or L1D according to the data type of the specific data to be accessed. For example, if the data type of the specific data to be accessed is an instruction, the processor core 110 may send the memory access instruction to L1I, so that on an L1I miss the instruction is sent by L1I to the secondary cache 130; if the data type is data, the processor core 110 may send the instruction to L1D, so that on an L1D miss the instruction is sent by L1D to the secondary cache 130. In another possible implementation, the instruction may continue the lookup in L1D when L1I misses and be sent to the secondary cache 130 only when L1D also misses; or it may continue the lookup in L1I when L1D misses and be sent to the secondary cache 130 only when L1I also misses. The specific lookup process in the primary cache 120 should not be construed as limiting the application.
The bandwidth control device 210 may obtain the memory access instruction sent by the LLC 140 to the lower-level storage unit and, through a series of calculations, obtain the limit rate at which the thread that generated the instruction may send memory access instructions after a preset clock cycle, then send that limit rate to the processor core 110 supporting the thread. The specific structure of the bandwidth control device 210 and the calculation of the limit rate are described in detail below.
The processor core 110 may limit, according to the limit rate, the number of new memory access instructions sent per unit clock cycle, after the preset clock cycle, by the thread that generated the instruction. The preset clock cycle is a number of clock cycles counted from the current clock cycle of the processor core 110. Because signal transmission takes time, the limit rate calculated in the current clock cycle typically acts on memory access instructions issued by the processor core 110 only after this preset period. The preset clock cycle may be 10 clock cycles, or another number such as 12 clock cycles; the specific number should not be construed as limiting the application.
For ease of description, take the multithread controller system shown in fig. 2 as an example. That is, assume there are 4 processor cores 110, namely Core 0, Core 1, Core 2, and Core 3, each of which supports two threads; and assume the multi-level cache has 3 levels in total, L1, L2, and L3, where L1 includes L1D and L1I.
Referring to fig. 3, fig. 3 shows a schematic block diagram of the bandwidth control device 210, and the bandwidth control device 210 includes a request allocating unit 211, a plurality of control calculating units 212, a control signal allocating unit 213, and a QoS control management register 214.
The request allocating unit 211 is configured to determine, according to the SrcID of the access instruction, a thread that generates the access instruction and a processing priority corresponding to the thread, where the processing priority is a Class of Service (CoS for short). The request allocating unit 211 may determine a control calculating unit 212 corresponding to the CoS from among the plurality of control calculating units 212, and allocate the access instruction to the control calculating unit 212 described above.
The number of control calculating units 212 may be the same as the total number of threads supported by the LLC 140. For example, in the multithread controller system shown in fig. 2 there are 4 processor cores 110, each supporting two threads, all of which are served by the LLC 140 (i.e., the three-level cache); the total number of threads supported by the LLC 140 is therefore 4 × 2 = 8, and the number of control calculating units 212 is likewise 8, namely control calculating unit 0, control calculating unit 1, …, control calculating unit 7.
The 8 threads all have their own CoS. Alternatively, one thread may correspond to one CoS, or a plurality of threads may correspond to one CoS, that is, the number of CoS is less than or equal to the number of threads supported by LLC 140. The number of the control computing units 212 in the operating state in the plurality of control computing units 212 is the same as the number of CoS, that is, the number of the control computing units 212 in the operating state is less than or equal to the total number of the control computing units 212.
The control calculating unit 212 is configured to calculate a limiting rate according to the CoS corresponding to the control calculating unit and the received access instruction, and input the limiting rate to the control signal allocating unit 213. The specific process of controlling the calculation unit 212 to calculate the limiting rate is described in detail below.
The control signal distribution unit 213 is configured to convert the limiting rate into an emission control signal, and feed back the emission control signal to the processor core 110 supporting the thread according to the mapping relationship between the thread and the CoS.
The QoS control management register 214 is used to maintain the bandwidth quota of each CoS in the system and the parameters required by the control calculating units 212, and also maintains the mapping between each thread and its CoS, that is, the mapping between SrcID and CoS. The control calculating unit 212 corresponding to each CoS may also be dynamically allocated through the QoS control management register 214. For example, assume CoS has x levels in total, of which y levels are used in the embodiment of the present application; then y of the control calculating units 212 are in an operating state, corresponding one-to-one to the y CoS levels.
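The two mappings the register maintains, and how the request allocating unit uses them, can be sketched as below. The table contents are invented examples for illustration; the real register layout is not specified by the text.

```python
# Sketch of the mappings held by the QoS control management register:
# SrcID -> CoS, and CoS -> control calculating unit.
srcid_to_cos = {
    (0, 0): 10, (0, 1): 3,                  # (Core ID, TID) -> CoS level
    (1, 0): 10, (1, 1): 3,
}
cos_to_unit = {10: 0, 3: 1}                 # CoS -> control calculating unit

def dispatch(src_id):
    """Mimic the request allocating unit: route a memory access instruction
    to the control calculating unit that handles its thread's CoS."""
    cos = srcid_to_cos[src_id]
    return cos, cos_to_unit[cos]
```

Note that several threads may share one CoS (here, threads of two cores map to CoS 10), so only as many control calculating units are active as there are CoS levels in use.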
Referring to fig. 4, fig. 4 shows a schematic block diagram of a single control calculation unit 212, the control calculation unit 212 includes a token generator 2121, a token counter 2122, and a limit rate calculation unit 2123, and the token generator 2121 and the limit rate calculation unit 2123 are connected to the token counter 2122.
The token generator 2121 is configured to generate tokens according to a generation cycle of tokens corresponding to itself. The generation period of the token is the same as the preset sending period of the memory access instruction, the preset sending period of the memory access instruction is obtained according to the set memory access bandwidth, and the memory access bandwidth can be set by a user according to the CoS of the thread.
Alternatively, in a specific embodiment, the generation period of the token generator 2121 may be calculated as follows:
let the frequency of the LLC 140 be F GHz and the memory access bandwidth of the thread's CoS be N × 1/8 GByte/s, so that the thread sends N memory access instructions in T clock cycles, each requesting 64 Bytes of data. The relationship between the clock cycle count and the clock frequency is then:
N × 64 Byte × F GHz / T = N × 1/8 GByte/s
from the above formula, it can be obtained: t512 × F.
If N memory access instructions need to be sent in T clock cycles, the preset sending period of one memory access instruction is T/N = 512 × F/N, and the generation period of the token generator 2121 is the same as this preset sending period, i.e., 512 × F/N.
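The derivation above can be checked numerically. F and N below are assumed example values, not values from the text:

```python
# Sanity check of T = 512 * F and the per-instruction sending period,
# with 64-Byte requests and a bandwidth quota of N * 1/8 GByte/s.
F = 2.0                                     # assumed LLC frequency in GHz
N = 8                                       # instructions per window of T cycles
T = 512 * F                                 # cycles per window
delivered = N * 64 * F / T                  # GByte/s actually delivered
assert delivered == N / 8                   # matches the quota N * 1/8 GByte/s
period = T / N                              # token generation period, 512*F/N
```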
Alternatively, the token generator 2121 may be a counter with a step of 1 that increments by 1 every clock cycle; when the count reaches 512 × F/N, a token is output, and the counter is cleared and starts counting again.
Alternatively, the token generator 2121 may be a counter with a configurable step. For example, with a memory access bandwidth of N × 1/8 GByte/s, the step may be configured as N: the counter increments by N every clock cycle until its value is greater than or equal to T, at which point a token is output, T is subtracted from the counter value, and accumulation resumes.
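The step-configurable variant can be sketched as follows; the class name and interface are assumptions for illustration:

```python
# Step-configurable token generator: add the step N each cycle, emit a token
# when the count reaches T, and keep the remainder for the next cycle.
class TokenGenerator:
    def __init__(self, step, threshold):
        self.step = step                    # N, from the bandwidth quota
        self.threshold = threshold          # T = 512 * F
        self.count = 0

    def tick(self):
        """Advance one clock cycle; return True when a token is emitted."""
        self.count += self.step
        if self.count >= self.threshold:
            self.count -= self.threshold
            return True
        return False
```

With step N, the generator emits one token every T/N cycles on average, matching the preset sending period derived above.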
In another embodiment, T may be an integer power of 2. For example, let T = 2048, and let M memory access instructions be issued in T clock cycles; then:
M × 64 Byte × F GHz / 2048 = N × 1/8 GByte/s
from the above formula, it can be obtained: m is 4N/F.
Therefore, the counter of the token generator 2121 can be given a bit width of 11 bits; the counter accumulates M every clock cycle, and whenever the accumulated result is greater than or equal to 2048, the highest bit carries out and overflows, at which point a token is output.
It should be understood that the generation process of the token generator 2121 can be implemented by other schemes besides the above-mentioned manner, and the specific generation process of the token generator 2121 should not be construed as limiting the present application.
The token counter 2122 is configured to add one to the current number of tokens when a token generated by the token generator 2121 is received, taking the result as the new current number of tokens; and to subtract one from the current number of tokens when a memory access instruction is received, taking the result as the new current number of tokens.
Optionally, to prevent the token count from going negative, a relatively large initial fixed value may be set for the token counter 2122, chosen according to the size of the miss buffer queue of the LLC 140 and the memory access latency. For example, suppose the miss buffer queue holds 100 entries, the memory access latency is 100 clock cycles, and at most 100 memory access requests are issued per unit clock cycle; the initial fixed value may then be set to a value greater than 100, for example 256, so that the token count never becomes negative.
If the number of tokens stays at the initial fixed value over a period of time, the LLC 140 memory access traffic of the thread or threads corresponding to the CoS satisfies the bandwidth quota; if the number of tokens is less than the initial fixed value, those threads are using too much memory access bandwidth; and if the number of tokens is greater than the initial fixed value, those threads are not using their full memory access bandwidth quota.
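The counter and its sign convention can be sketched as below. The initial value 256 follows the example above; the class interface is an assumption:

```python
# Token counter with a large initial fixed value so the count cannot go
# negative in practice. A positive deficit (initial - current) means the
# thread(s) overused their quota; a negative deficit means underuse.
class TokenCounter:
    def __init__(self, initial=256):
        self.initial = initial
        self.count = initial

    def on_token(self):                     # token arrives from the generator
        self.count += 1

    def on_access(self):                    # LLC forwards an access instruction
        self.count -= 1

    def deficit(self):
        """Err = initial number of tokens - current number of tokens."""
        return self.initial - self.count
```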
The limit rate calculation unit 2123 may calculate a difference between the initial number of tokens and the current number of tokens in the current clock cycle, and calculate a limit rate for a thread supported by the processor core 110 to send a memory access instruction after a preset clock cycle according to the difference.
Optionally, in a specific embodiment, in order to limit a certain thread, the token sent by the token generator 2121 may carry identification information representing the identity of the thread, such as a Core identification Core ID and a thread identification TID, and the token counter 2122 may calculate respective differences for different threads according to the Core identification Core ID and the thread identification TID of the thread.
Alternatively, in another embodiment, the number of token generators 2121 may be the same as the number of threads, and the number of token counters 2122 may also be the same as the number of threads, that is, each thread has a corresponding token generator and token counter, and at this time, the difference of the number of tokens obtained by each token counter corresponds to each thread.
The specific process of calculating the limiting rate by the limiting rate calculating unit 2123 is described in detail later. If the initial number of the tokens is larger than the current number of the tokens, the difference value is a positive number; and if the initial number of the tokens is less than the current number of the tokens, the difference value is a negative number.
Referring to fig. 5, fig. 5 is a schematic block diagram of any processor Core 110 in the plurality of processor cores 110, and the processor Core 0 is taken as an example for illustration.
The processor core 110 includes an instruction output logic module 111 and an execution/memory access unit 112, which are connected; emission limiting modules are arranged in the instruction output logic module 111. Since the processor core 110 supports two threads, it further includes an address decoding module 113 for thread 0 and an address decoding module 114 for thread 1. There are likewise two emission limiting modules: emission limiting module 1111 is connected to the address decoding module of thread 0, and emission limiting module 1112 is connected to the address decoding module of thread 1. Optionally, there may be multiple emission limiting modules corresponding one-to-one to the threads, or a single emission limiting module managing multiple threads.
The control signal allocating unit 213 in the bandwidth control device 210 is specifically configured to convert the limit rate into an emission control signal and send it to the corresponding emission limiting module in the processor core 110 according to the thread to which the signal corresponds. For example, if the limit rate corresponds to thread 0 of Core 0, it may be converted into the emission control signal limit rate_T0 and sent to emission limiting module 1111, which is connected to the address decoding module of thread 0; if it corresponds to thread 1 of Core 0, it may be converted into the emission control signal limit rate_T1 and sent to emission limiting module 1112, which is connected to the address decoding module of thread 1. The emission limiting module limits the emission rate of the instruction output logic module 111 according to the limit rate, thereby limiting the number of memory access instructions the thread sends after the preset clock cycle. The instruction output logic module 111 may output instructions by dispatching them or by issuing them; the specific manner of outputting instructions should not be construed as limiting the application.
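The effect of the emission control signal on a thread's per-cycle dispatch budget can be modeled as below. The set value of 4 and the 0–4 limit numbers follow the mapping discussed later in the text; the module interface itself is an assumption:

```python
# Illustrative model of an emission limiting module: each unit clock cycle
# the thread may dispatch at most (set value - limit number) memory access
# instructions.
class EmissionLimiter:
    SET_VALUE = 4                           # dispatches per unit clock cycle

    def __init__(self):
        self.limit_number = 0               # from the emission control signal

    def on_control_signal(self, limit_number):
        self.limit_number = limit_number

    def allowed_this_cycle(self):
        return max(0, self.SET_VALUE - self.limit_number)
```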
Fig. 6 is a schematic flow diagram of a memory access bandwidth control method provided in an embodiment of the present application, where the memory access bandwidth control method is executed by the bandwidth control device 210, and the memory access bandwidth control method shown in fig. 6 includes the following steps S110 to S130:
in step S110, the bandwidth control device 210 obtains a first access instruction sent by the LLC 140 to the lower-level storage unit.
The first memory access instruction carries a first core identifier of the first processor core that generated it and a first thread identifier of the first thread, run by that processor core, that generated it. For illustration, assume the first memory access instruction comes from thread L0 of core 0 in fig. 2; that is, the first core identifier is core 0 and the first thread identifier is L0.
The identity information is the above-described SrcID, the Core ID is the Core ID, and the thread ID is the TID.
The lower level memory unit is typically a DRAM, but may be other memory units such as a memory, and the specific type of the lower level memory unit should not be construed as limiting the present application.
After the access instruction is sent to the lower-level storage unit, whether specific data to be accessed is cached in the lower-level storage unit or not is searched in the lower-level storage unit, and if the specific data is not cached in the lower-level storage unit, the specific data is called to be missed in the lower-level storage unit; if the specific data is cached in the lower-level memory unit, the specific data is said to hit in the lower-level memory unit.
Step S120, the bandwidth control device determines a first processing priority corresponding to the first thread identifier, and determines a limit rate of the first thread sending the access instruction after a preset clock cycle according to the first processing priority.
The processing priority is the CoS described above, and each thread has a corresponding CoS, which may be one thread corresponding to one CoS, or multiple threads corresponding to one CoS. The bandwidth control device 210 may determine, according to the SrcID of the memory access instruction, a thread that generates the memory access instruction, and perform the calculation of the limit rate according to the CoS corresponding to the thread.
For thread L0 of core 0 in fig. 2, the CoS of the thread can be determined from the correspondence between threads and CoS; assume it is CoS 10. The bandwidth control device may then determine, according to the first processing priority CoS 10, the limit rate at which the first thread, thread L0 of core 0, sends memory access instructions after the preset clock cycle.
Step S130, the bandwidth control device sends the limiting rate to the first processor core, and instructs the processor core to limit the number of the access instructions sent by the first thread after a preset clock cycle according to the limiting rate.
Continuing with the above example, the processor core 0 may limit the thread L0 supported by the processor core 0 according to the limit rate sent by the bandwidth control device, which is specifically represented as: the limit thread L0 issues the number of memory access instructions after a preset clock cycle.
The bandwidth control device 210 may obtain a memory access instruction issued by the LLC 140, and obtain a thread identifier of a thread that generates the memory access instruction from the memory access instruction. According to the thread identification, the processing priority of the thread generating the memory access instruction is determined, the limit rate is calculated according to the processing priority, and then the bandwidth control device 210 sends the limit rate to the processor core 110 from which the memory access instruction comes. Processor core 110 may limit the number of memory access instructions sent by the same thread per clock cycle based on a limit rate, the processing priority of the limited thread being generally a lower priority.
Through the mode, the limitation on the access bandwidth of the thread with low priority is realized in the link of the processor core 110, namely the dispatching stage of the processor instruction, and the limitation on the access bandwidth of the thread with low priority in the multi-level cache is correspondingly reduced; the reduction of the access instructions generated by the threads with low priority in the multi-level cache enables more cache resources which can be used by the threads with high priority in the cache to generate the access instructions, so that the bandwidth resource limitation of the threads with low priority and the smooth operation of the threads with high priority are both realized.
Referring to fig. 7, fig. 7 shows a flowchart of the specific step of step S120, which specifically includes the following steps S121 to S122:
in step S121, the bandwidth control device determines a first control calculation unit corresponding to the first processing priority.
The processing priorities and the control calculating units are in a one-to-one mapping relationship; assume the first control calculating unit corresponding to the first processing priority CoS 10 is control calculating unit 1 in fig. 3.
The QoS control management register 214 in the bandwidth control device 210 holds the mapping relationship between the thread and the processing priority CoS, and also holds the correspondence relationship between the CoS and the control calculation unit 212, so that the CoS corresponding to the thread can be selected first, the control calculation unit 212 corresponding to the CoS can be selected, and the control calculation unit 212 can be referred to as the first control calculation unit 212.
And step S122, the bandwidth control device calculates the limit rate of the first thread sending the memory access instruction after the preset clock cycle by using the first control calculation unit.
The control calculation units 212 in the bandwidth control device 210 in the running state correspond to the processing priorities corresponding to the thread identifications one to one. The bandwidth control device 210 may select a first control calculation unit 212 of the corresponding processing priority from the plurality of control calculation units 212 in the operating state, and implement calculation of the limit rate using the first control calculation unit 212. Each control calculation unit 212 in the operating state has a component whose parameter corresponds to the processing priority, and calculates the limit rate corresponding to the processing priority.
The first control calculating unit, namely control calculating unit 1, may calculate the limit rate as follows:
referring to fig. 8, fig. 8 shows a flowchart of the specific step of step S122, which specifically includes the following steps S1221 to S1223:
step S1221, when the token counter receives the token newly generated by the token generator for the first thread, the token counter increases the current number of tokens of the first thread by one.
In step S1222, the token counter decreases the current number of tokens of the first thread by one when the LLC sends the access instruction of the first thread to the lower-level storage unit.
Wherein, the processing priority corresponding to the thread identifier of the target access instruction corresponds to the first control computing unit 212.
Step S1223, the limit rate calculating unit calculates a difference value between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock cycle, and calculates the limit rate of the first thread for sending the access instruction after the preset clock cycle according to the difference value.
If the initial number of the tokens is larger than the current number of the tokens, the difference value is a positive number; and if the initial number of the tokens is less than the current number of the tokens, the difference value is a negative number.
The token generation period of the token generator 2121 included in the first control calculating unit 212 may be the same as the preset sending period of the memory access instruction, which is obtained from the set memory access bandwidth; the memory access bandwidth may be set by the user according to the processing priority of the thread. The token counter 2122 increments the current number of tokens by one when a newly generated token is received from the token generator 2121, and decrements it by one when a memory access instruction is received. In this way, the limit rate calculating unit 2123 can observe the trend and magnitude of the change in the number of tokens over the current clock cycle, and calculate the limit rate from that trend and magnitude. Since the memory access bandwidth can be set according to the processing priority of a thread, the calculated limit rate is tied to that priority, so threads with different processing priorities can be treated differently, improving processing efficiency.
Alternatively, in step S1223, the limit ratio may be specifically calculated as follows: the limit rate calculation unit calculates a limit rate variation according to a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in a current clock cycle and a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in one or more latest clock cycles; and the limiting rate calculation unit calculates the sum of the historical limiting rate and the limiting rate variable quantity, wherein the sum is the limiting rate of the first thread sending the access instruction after a preset clock cycle.
The most recent multiple clock cycles may be two or more; for convenience of description, take two clock cycles as an example:
for example, introduce Err0, Err1 and Err2, where Err0 is the difference between the initial and current number of tokens in the current clock cycle; Err1 is that difference one clock cycle ago; and Err2 is that difference two clock cycles ago.
The value of Err1 may be obtained by Err0 latching for one clock cycle and the value of Err2 may be obtained by Err1 latching for one clock cycle.
After Err0, Err1 and Err2 are determined, the limit rate variation is calculated according to the formula:
Throttle_Value_delta = K0 × Err0 + K1 × Err1 + K2 × Err2. K0, K1 and K2 are three parameters that can be obtained through repeated experimental verification; for convenience of calculation, they may be integer powers of 2.
After the limit rate variation Throttle_Value_delta is calculated, the limit rate Throttle_Value at which the processor core 110 sends memory access instructions after the preset clock cycle can be obtained from the formula:
Throttle_Value = Throttle_Value_d1 + Throttle_Value_delta, where Throttle_Value_d1 is the historical limit rate, obtained by latching Throttle_Value for one clock cycle.
That is, the variation of the limit rate is calculated from the token difference of the current clock cycle and those of historical clock cycles, and the new limit rate is the sum of this variation and the historical limit rate; this is the limit rate at which the processor core 110 sends memory access instructions after the preset clock cycle. Incorporating historical data into the calculation makes the limit rate better match the requirements of memory access bandwidth control.
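The update rule above can be sketched as a small filter. The coefficient values below are assumed examples; the text only says K0, K1, K2 are found experimentally and are conveniently integer powers of 2:

```python
# Limit rate update: a weighted sum of the current and two latched token
# differences, added to the latched previous rate.
K0, K1, K2 = 4, 2, 1                        # assumed example coefficients

def update_throttle(throttle_d1, err0, err1, err2):
    """Throttle_Value = Throttle_Value_d1 + K0*Err0 + K1*Err1 + K2*Err2."""
    throttle_delta = K0 * err0 + K1 * err1 + K2 * err2
    return throttle_d1 + throttle_delta
```

A positive token deficit (bandwidth overuse) raises the limit rate, while a run of negative deficits lowers it back toward the historical value.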
Optionally, in a specific embodiment, after the limit rate Throttle_Value is calculated, a specific limit number may be determined according to the correspondence between Throttle_Value and the limit number. The limit rate is used to limit the number of memory access instructions sent, per unit clock cycle after the preset clock cycle, by the thread that issued the memory access instruction; a specific limit number can therefore be specified according to this correspondence.
For example, the limit rate Throttle_Value may map to the following five intervals: (0, 1/8 × max_rate], (1/8 × max_rate, 3/8 × max_rate], (3/8 × max_rate, 5/8 × max_rate], (5/8 × max_rate, 7/8 × max_rate] and (7/8 × max_rate, max_rate]. The corresponding limit numbers are 0, 1, 2, 3 and 4, respectively. max_rate is a user-set value that can be obtained through repeated experiments.
That is, if the limit rate Throttle_Value falls in the interval (0, 1/8 × max_rate], the corresponding limit number is 0, i.e., the processor core 110 does not need to limit the thread corresponding to the limit rate;
if it falls in (1/8 × max_rate, 3/8 × max_rate], the limit number is 1, i.e., the number of memory access instructions the corresponding thread may send per unit clock cycle is reduced by 1 from the set value;
if it falls in (3/8 × max_rate, 5/8 × max_rate], the limit number is 2, i.e., the number of memory access instructions per unit clock cycle is reduced by 2 from the set value;
if it falls in (5/8 × max_rate, 7/8 × max_rate], the limit number is 3, i.e., the number of memory access instructions per unit clock cycle is reduced by 3 from the set value;
if it falls in (7/8 × max_rate, max_rate], the limit number is 4, i.e., the number of memory access instructions per unit clock cycle is reduced by 4 from the set value.
The set value of memory access instructions sent by a thread in a unit clock cycle may be configured according to the actual requirements of the multithreaded processor system; for example, the set value may be set to 4. It should be understood that the set value may also take other values, and the specific value should not be construed as limiting the application.
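As an illustrative sketch only (the interval boundaries follow the example above; the function name `limit_number`, the use of floating-point rates, and the set value of 4 instructions per unit clock cycle are assumptions for illustration, not part of the disclosure), the interval-to-limit-number mapping could be expressed as:

```python
def limit_number(throttle_value: float, max_rate: float) -> int:
    """Map a limit rate Throttle_Value in (0, max_rate] to a limit number,
    following the five mapping intervals of the example above."""
    if not (0 < throttle_value <= max_rate):
        raise ValueError("Throttle_Value must lie in (0, max_rate]")
    # Upper bounds of the first four intervals, as fractions of max_rate.
    for n, frac in enumerate((1 / 8, 3 / 8, 5 / 8, 7 / 8)):
        if throttle_value <= frac * max_rate:
            return n
    return 4  # falls in (7/8 * max_rate, max_rate]

# With a set value of 4 instructions per unit clock cycle, the thread would
# then be allowed max(0, 4 - limit_number(...)) memory access instructions
# per cycle after the preset clock cycle.
```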
In a specific implementation provided in the embodiments of the present application, the limit rate calculated by the bandwidth control device 210 may be sent to the emission limit module in the instruction output logic module 111, to the instruction fetch unit in the processor core 110, or to the prefetch unit in the multi-level cache, so as to dynamically control the operating speed of these units, thereby controlling the memory access bandwidth consumed in the LLC 140 by the thread corresponding to the limit rate.
Because the present application limits resources unique to each thread, when a thread with low priority is throttled, threads with high priority can still use the resources shared among threads, such as the bandwidth resources of the LLC 140, thereby achieving higher performance. Meanwhile, the throttling of threads occurs within the processor core 110, namely at the Dispatch/Issue stage of processor instructions, so that mutual interference among threads is avoided, resources are freed from low-priority threads, and the execution performance of high-priority threads is improved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A bandwidth control device, wherein the bandwidth control device is respectively connected with a last-level cache (LLC) in a multi-level cache and at least one processor core, the at least one processor core supports multithreading, and the at least one processor core communicates with the multi-level cache;
the bandwidth control device is configured to obtain a first memory access instruction sent by the LLC to a lower-level storage unit, where the first memory access instruction carries a first core identifier of a first processor core that generates the first memory access instruction and a first thread identifier of a first thread that is run by the first processor core and generates the first memory access instruction;
the bandwidth control device is used for determining a first processing priority corresponding to the first thread identifier and determining, according to the first processing priority, the limit rate at which the first thread sends memory access instructions after a preset clock cycle;
and the bandwidth control device is used for sending the limit rate to the first processor core and instructing the first processor core to limit, according to the limit rate, the number of memory access instructions sent by the first thread after a preset clock cycle.
2. The bandwidth control device according to claim 1, wherein the bandwidth control device includes a plurality of control calculation units, at least one of which is in a running state; the number of control calculation units in the running state is the same as the number of all processing priorities of all threads supported by the multithreaded processor system, and the control calculation units in the running state correspond one-to-one to the processing priorities;
the bandwidth control device is used for determining a first control calculation unit corresponding to the first processing priority;
the bandwidth control device is used for calculating the limit rate of the first thread sending the memory access instruction after the preset clock cycle by using the first control calculation unit.
3. The bandwidth control device according to claim 2, wherein the first control calculation unit includes a token generator, a token counter, and a limit rate calculation unit, the token generator and the limit rate calculation unit each being connected to the token counter;
the token generator is used for generating a corresponding token for each thread belonging to the first processing priority, and the generation period of the tokens is the same as the preset sending period of memory access instructions corresponding to the first processing priority;
the token counter is used for adding one to the current number of the tokens of the first thread when receiving the tokens newly generated by the token generator for the first thread;
the token counter is further configured to reduce the current number of tokens of the first thread by one when the LLC sends a memory access instruction of the first thread to a lower-level storage unit;
the limit rate calculation unit is used for calculating a difference value between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock cycle, and calculating, according to the difference value, the limit rate of the first thread for sending memory access instructions after the preset clock cycle.
4. The bandwidth control device according to claim 3, wherein
the limit rate calculation unit is used for calculating a limit rate variation according to a difference value between the initial number of tokens of the first thread in the current clock cycle and the current number of tokens of the first thread, and a difference value between the initial number of tokens of the first thread in one or more recent clock cycles and the current number of tokens of the first thread;
and the limit rate calculation unit is used for calculating the sum of the historical limit rate and the limit rate variation, wherein the sum is the limit rate of the first thread for sending memory access instructions after the preset clock cycle.
5. A multithreaded processor system comprising at least one processor core, a multi-level cache, and the bandwidth control device of any of claims 1-4, the at least one processor core supporting multithreading, the at least one processor core in communication with the multi-level cache, the multi-level cache comprising a last level cache, LLC, the LLC and the at least one processor core both being connected to the bandwidth control device.
6. The multithreaded processor system of claim 5 wherein the processor core comprises an instruction output logic module and an execution/memory access unit, the instruction output logic module being coupled to the execution/memory access unit, the instruction output logic module having an emission limit module disposed therein;
and the emission limit module in the processor core is used for receiving the limit rate sent by the bandwidth control device and limiting, according to the limit rate, the number of memory access instructions sent by the first thread after a preset clock cycle.
7. A memory access bandwidth control method applied to the multithreaded processor system described in any one of claims 5 to 6, the method comprising:
the bandwidth control device acquires a first memory access instruction sent by the LLC to a lower-level storage unit, wherein the first memory access instruction carries a first core identifier of a first processor core generating the first memory access instruction and a first thread identifier of a first thread that is run by the first processor core and generates the first memory access instruction;
the bandwidth control device determines a first processing priority corresponding to the first thread identifier, and determines, according to the first processing priority, the limit rate of the first thread for sending memory access instructions after a preset clock cycle;
and the bandwidth control device sends the limit rate to the first processor core and instructs the first processor core to limit, according to the limit rate, the number of memory access instructions sent by the first thread after a preset clock cycle.
8. The method of claim 7, wherein determining a limit rate at which the first thread sends memory access instructions after a preset clock cycle based on the first processing priority comprises:
the bandwidth control device determines a first control calculation unit corresponding to the first processing priority;
and the bandwidth control device calculates, by using the first control calculation unit, the limit rate of the first thread for sending memory access instructions after a preset clock cycle.
9. The method as claimed in claim 8, wherein calculating, by the bandwidth control device using the first control calculation unit, the limit rate of the first thread for sending memory access instructions after a preset clock cycle includes:
when a token counter receives a token newly generated by a token generator for the first thread, adding one to the current number of the tokens of the first thread;
when the LLC sends a memory access instruction of the first thread to a lower-level storage unit, the token counter reduces the current number of tokens of the first thread by one;
and the limit rate calculation unit calculates the difference value between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock cycle, and calculates, according to the difference value, the limit rate of the first thread for sending memory access instructions after the preset clock cycle.
10. The method of claim 9, wherein said calculating a limit rate for said first thread to issue memory access instructions after a predetermined clock cycle based on said difference comprises:
the limit rate calculation unit calculates a limit rate variation according to a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in a current clock cycle and a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in one or more latest clock cycles;
and the limit rate calculation unit calculates the sum of the historical limit rate and the limit rate variation, wherein the sum is the limit rate of the first thread for sending memory access instructions after a preset clock cycle.
11. The method of claim 7, wherein sending the limit rate to the first processor core by the bandwidth control device comprises:
and the bandwidth control device sends the limit rate to the emission limit module in the first processor core and instructs the emission limit module to limit, according to the limit rate, the number of memory access instructions sent by the first thread after a preset clock cycle.
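Claims 3-4 and 9-10 above describe a token-based calculation of the limit rate. The following is a minimal behavioural sketch under stated assumptions: the class and member names are invented, and since the claims do not fix how the differences from recent clock cycles are weighted, their mean is used here purely for illustration.

```python
class TokenLimitCalculator:
    """Sketch of the token counter and limit rate calculation unit of
    claims 3-4: tokens are added periodically by the token generator and
    consumed when the LLC forwards a memory access instruction of the
    thread to the lower-level storage unit."""

    def __init__(self, initial_tokens: int):
        self.initial_tokens = initial_tokens  # token count at the start of each cycle
        self.tokens = initial_tokens          # current token count
        self.history = []                     # differences from recent cycles
        self.limit_rate = 0.0                 # historical limit rate

    def on_token_generated(self):
        self.tokens += 1                      # claim 3: counter plus one per new token

    def on_llc_request_sent(self):
        self.tokens -= 1                      # claim 3: counter minus one per LLC request

    def end_of_cycle(self) -> float:
        # Claim 4: limit rate variation from the current difference and the
        # differences of one or more recent cycles (here: their mean).
        diff = self.initial_tokens - self.tokens
        recent = self.history[-4:] if self.history else [0]
        delta = (diff + sum(recent) / len(recent)) / 2
        self.limit_rate += delta              # claim 4: historical rate + variation
        self.history.append(diff)
        self.tokens = self.initial_tokens     # reset counter for the next cycle
        return self.limit_rate
```

A heavily memory-bound thread drains tokens faster than they are replenished, so the difference, and with it the limit rate, grows until the processor core throttles the thread's issue rate.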
CN202010991780.6A 2020-09-18 2020-09-18 Bandwidth control device, multithread controller system and memory access bandwidth control method Active CN112083957B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010991780.6A CN112083957B (en) 2020-09-18 2020-09-18 Bandwidth control device, multithread controller system and memory access bandwidth control method

Publications (2)

Publication Number Publication Date
CN112083957A true CN112083957A (en) 2020-12-15
CN112083957B CN112083957B (en) 2023-10-20

Family

ID=73739256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010991780.6A Active CN112083957B (en) 2020-09-18 2020-09-18 Bandwidth control device, multithread controller system and memory access bandwidth control method

Country Status (1)

Country Link
CN (1) CN112083957B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716183A (en) * 2004-06-30 2006-01-04 中国科学院计算技术研究所 A kind of charge system of getting devices and methods therefor of multiline procedure processor simultaneously that is applied to
CN1987825A (en) * 2005-12-23 2007-06-27 中国科学院计算技术研究所 Fetching method and system for multiple line distance processor using path predicting technology
CN101763251A (en) * 2010-01-05 2010-06-30 浙江大学 Instruction decode buffer device of multithreading microprocessor
US20150067691A1 (en) * 2013-09-04 2015-03-05 Nvidia Corporation System, method, and computer program product for prioritized access for multithreaded processing
CN110032266A (en) * 2018-01-10 2019-07-19 广东欧珀移动通信有限公司 Information processing method, device, computer equipment and computer readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505084A (en) * 2021-06-24 2021-10-15 中国科学院计算技术研究所 Memory resource dynamic regulation and control method and system based on memory access and performance modeling
WO2022267443A1 (en) * 2021-06-24 2022-12-29 中国科学院计算技术研究所 Memory resource dynamic regulation and control method and system based on memory access and performance modeling
CN113505084B (en) * 2021-06-24 2023-09-12 中国科学院计算技术研究所 Memory resource dynamic regulation and control method and system based on memory access and performance modeling
CN113821324A (en) * 2021-09-17 2021-12-21 海光信息技术股份有限公司 Cache system, method, apparatus and computer medium for processor
CN113821324B (en) * 2021-09-17 2022-08-09 海光信息技术股份有限公司 Cache system, method, apparatus and computer medium for processor
WO2023226791A1 (en) * 2022-05-26 2023-11-30 华为技术有限公司 Control apparatus, control method, and related device

Also Published As

Publication number Publication date
CN112083957B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN112083957B (en) Bandwidth control device, multithread controller system and memory access bandwidth control method
US8180941B2 (en) Mechanisms for priority control in resource allocation
US7054968B2 (en) Method and apparatus for multi-port memory controller
JP4142068B2 (en) Information processing apparatus and access control method
US20070022416A1 (en) Execution device and application program
US8463958B2 (en) Dynamic resource allocation for transaction requests issued by initiator devices to recipient devices
US9772958B2 (en) Methods and apparatus to control generation of memory access requests
CN103210382A (en) Arbitrating bus transactions on a communications bus based on bus device health information and related power management
JP6262407B1 (en) Providing shared cache memory allocation control in shared cache memory systems
GB2493594A (en) Mapping quality of service parameters in a transaction to priority levels to allocate resources to service transaction requests.
JP2007334749A (en) Information processor and access control method
US20130042252A1 (en) Processing resource allocation within an integrated circuit
CN115269190A (en) Memory allocation method and device, electronic equipment, storage medium and product
US8707006B2 (en) Cache index coloring for virtual-address dynamic allocators
Lin et al. Coordinated CTA combination and bandwidth partitioning for GPU concurrent kernel execution
GB2569304A (en) Regulation for atomic data access requests
US10599577B2 (en) Admission control for memory access requests
CN113886305B (en) Bus-based arbitration method, system, storage medium and equipment
CN101196833B (en) Method and apparatus for memory utilization
CN112099974B (en) Multithreaded processor system and memory access bandwidth control method
JP2009252133A (en) Device and method for processing vector
CN111797052B (en) System single chip and system memory acceleration access method
JPH10143382A (en) Method for managing resource for shared memory multiprocessor system
CN115185860B (en) Cache access system
WO2021115326A1 (en) Data processing method and apparatus, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant