CN113190350B - LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers

LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers

Info

Publication number
CN113190350B
CN113190350B (application CN202110480587.0A)
Authority
CN
China
Prior art keywords
cache
llc
performance
scheme
total
Prior art date
Legal status
Active
Application number
CN202110480587.0A
Other languages
Chinese (zh)
Other versions
CN113190350A (en)
Inventor
王振宇
吴俊贤
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority claimed from application CN202110480587.0A
Publication of CN113190350A
Application granted
Publication of CN113190350B
Legal status: Active

Classifications

    • G06F9/5027 — Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals (under G — Physics; G06 — Computing; G06F — Electric digital data processing; G06F9/50 — Allocation of resources)
    • G06F9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues (under G06F9/48 — Program initiating; program switching)


Abstract

The invention discloses an LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers, applied in a cache allocation program and comprising the following steps: S1, obtain all online-service and offline-task container processes on the host where the cache allocation program runs, and monitor their performance; S2, compute each process's cache usage characteristics from the monitoring results, and judge whether the process is cache-sensitive or a problem type; S3, limit the total amount of LLC cache available to problem-type processes; S4, collect the memory access records of cache-sensitive processes, compute a reuse time histogram, and substitute the histogram into the average eviction time (AET) model to compute a miss rate curve; and S5, substitute the miss rate curves of all applications and the performance monitoring data collected in step S1 into a cache allocation annealing algorithm, and compute an allocation scheme that protects both online-service and offline-task performance. The method solves the problem of CPU cache allocation in an environment where online services and offline tasks are deployed together, and safeguards the performance of both.

Description

LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers
Technical Field
The invention belongs to the technical field of computers, and particularly relates to an LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers.
Background
In recent years, internet companies and internet services have multiplied, and user counts and application counts grow year by year, placing very high performance demands on the infrastructure of the new era: the cloud data center. In a container runtime environment, one server typically hosts hundreds or thousands of containers, and these containers compete fiercely for computing resources, lengthening application response times. Under these circumstances, using existing computing resources efficiently has become the goal that enterprises and researchers strive for.
As CPU performance has improved, slow IO and memory rates have become the largest factor restricting performance. CPU caches are therefore used to narrow the huge gap between register and memory speeds. A modern CPU typically has 3 cache levels organized as N-way set associative caches: a memory address is mapped to one cache set through a mapping function, and several cache ways store the data within that set. From L1 through L2 to L3 (i.e., the LLC), capacity grows larger and so does access latency. This multi-level cache exploits the spatial and temporal locality of memory accesses well and greatly raises the effective memory access rate.
However, one server in a cloud container platform must carry hundreds or even thousands of containers. The processes of these containers fight over only a few MB of cache space and evict each other's cache lines, so the fierce competition raises the miss rate and ultimately lowers the memory access rate.
With Intel's introduction of CAT (Cache Allocation Technology), operations staff can restrict each process's use of the LLC (Last Level Cache) by assigning different cache ways to different processes. However, CAT supports at most 16 allocation schemes (classes of service), so how to use these schemes to the greatest effect becomes a problem that must be studied.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art by providing an LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers, which solves the problem of CPU (Central Processing Unit) cache allocation in a mixed online/offline deployment environment and safeguards the performance of online services and offline tasks.
To achieve the above purpose, the invention adopts the following technical scheme:
an LLC (logical Link control) allocation method oriented to mixed deployment of offline containers is applied to a cache allocation program and comprises the following steps of:
S1, obtain all online-service and offline-task container processes on the host where the cache allocation program runs, and monitor their performance twice: once restricted to two LLC cache ways and once with all LLC cache ways;
S2, compute each process's cache usage characteristics from the two performance monitoring passes, and judge whether the process is cache-sensitive or a problem type;
S3, limit the total amount of LLC cache available to problem-type processes;
S4, collect the memory access records of cache-sensitive processes, compute a reuse time histogram by reservoir sampling, and substitute the histogram into the average eviction time (AET) model to compute a miss rate curve;
and S5, substitute the miss rate curves of all application programs and the performance monitoring data collected in step S1 into the cache allocation annealing algorithm, and compute an allocation scheme that protects both online-service and offline-task performance.
Furthermore, an online service is a user-facing application program with quality-of-service requirements, and an offline task is a batch-processing application program that does not face users;
other types of processes on the host where the cache allocation program runs are outside the scope of cache control;
the performance monitoring specifically uses CPU performance counters to monitor 8 events for the processes of cache-sensitive and problem-type applications, namely: instructions completed, CPU run cycles, total read instructions, total write instructions, total memory access cycles, total data access cycles, total LLC misses, and total LLC hits;
the specific steps of the performance monitoring are as follows:
and setting two cache ways used by the monitored process by adopting CAT technology, monitoring 8 events for 1 minute by using a monitoring tool, then setting all the cache ways used by the monitored process, repeating the monitoring process, and recording data obtained by monitoring twice.
Further, in step S2, the process's cache usage characteristics are computed from the absolute values of the performance event counts and their change between the two passes; the computation covers the following performance indicators:
the calculation formula of the instruction number per cycle is as follows:
instruction completion number ÷ CPU run cycle number
The LLC access number per period is calculated by the following formula:
(total LLC lost + total LLC hit) ÷ instruction complete number
The average LLC cache loss penalty is calculated according to the following formula:
total number of memory access cycles divided by total number of LLC losses
The LLC loss rate is calculated by the following formula:
LLC total lost ÷ (LLC total hit + LLC total lost)
The LLC loss number of every thousand instructions is calculated according to the following formula:
total LLC loss ÷ instruction completion number 1000
The LLC hit number per thousand instructions is calculated by the following formula:
total number of LLC hits/instruction completion number 1000
The LLC access number of each thousand instructions is calculated by the following formula:
(total LLC hits + total LLC misses) ÷ instruction completion number 1000
The non-access instruction has the following calculation formula of the number of cycles per instruction:
(CPU operating cycle number-total data access cycle number) ÷ (instruction completion number- (total number of read instructions + total number of write instructions))
The average cache access delay period is calculated by the following formula:
(total number of data accesses-total number of memory accesses) ÷ ((total number of read instructions + total number of write instructions) -LLC lost number)
Calculate the variation in instructions per cycle, the variation in LLC misses per thousand instructions, and the variation in LLC hits per thousand instructions; the calculation formula for all three variations is:
|all-cache-way performance event count − two-cache-way performance event count| ÷ two-cache-way performance event count
Judge the process's cache usage characteristics under the following conditions:
Thrashing type: LLC misses per thousand instructions greater than 4, LLC hits per thousand instructions less than 0.5, and the variations less than 0.3;
Bully type: instructions per cycle at most 0.6, LLC misses per thousand instructions at least 10, LLC hits per thousand instructions at least 10, and the variations less than 0.3;
Highly sensitive: instructions per cycle less than 1.3, its variation greater than 0.1, and the variation in LLC misses per thousand instructions greater than 0.3;
Moderately sensitive: instructions per cycle greater than 1.3, its variation greater than 0.1, and the variation in LLC misses per thousand instructions greater than 0.3;
Low sensitivity: all cases not matching the thrashing, bully, highly sensitive, or moderately sensitive types.
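As a concrete reading of these thresholds, the sketch below encodes the judgment table in a single function. The Metrics field names are invented for illustration, and since the original leaves ambiguous exactly which variation the thrashing and bully rules test, the sketch assumes the misses-per-thousand-instructions variation:

```cpp
// Encoding of the judgment conditions above. The variation fields follow the
// |all-way - two-way| / two-way formula from the text.
#include <string>

struct Metrics {
    double ipc;       // instructions per cycle
    double mpki;      // LLC misses per thousand instructions
    double hpki;      // LLC hits per thousand instructions
    double ipc_var;   // variation in instructions per cycle
    double mpki_var;  // variation in LLC misses per thousand instructions
};

std::string classify(const Metrics& m) {
    if (m.mpki > 4 && m.hpki < 0.5 && m.mpki_var < 0.3)
        return "thrashing";            // problem type: misses but rarely hits
    if (m.ipc <= 0.6 && m.mpki >= 10 && m.hpki >= 10 && m.mpki_var < 0.3)
        return "bully";                // problem type: hoards cache capacity
    if (m.ipc < 1.3 && m.ipc_var > 0.1 && m.mpki_var > 0.3)
        return "highly sensitive";
    if (m.ipc > 1.3 && m.ipc_var > 0.1 && m.mpki_var > 0.3)
        return "moderately sensitive";
    return "low sensitivity";          // everything else
}
```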
Further, the problem-type processes are the thrashing-type and bully-type applications, and the sensitive processes are the highly, moderately, and low-sensitivity applications;
the cache space of problem-type processes is limited by using CAT to restrict thrashing-type and bully-type applications to at most 4 cache ways.
Further, the specific steps of collecting memory accesses and computing the miss rate curve in step S4 are:
screen out the target threads whose memory accesses need to be collected; a thread must belong to an online-service or offline-task process, must not belong to a problem-type or low-sensitivity process, and must have run for 1 hour;
develop a reuse time histogram collection program with the Pin tool, collect all memory access addresses of the target process, and then apply reservoir sampling to obtain the reuse time histogram;
and substitute the reuse time histogram into the cache average eviction time model to compute the miss rate curve.
Further, the specific steps of the reservoir sampling method are:
Create a reservoir of size k. When the i-th address arrives and is not in the reservoir, insert it with probability min(1, k/i); if the reservoir is full, pick one address in the reservoir at random, delete it, and insert the new address. A newly inserted address is marked as not yet sampled and its access sequence number i is recorded; when the address is accessed a second time, with access sequence number j, the interval j − i between the two accesses is the reuse time of the address; record it and mark the address as sampled. After the access trace has been sampled, traverse the address records in the reservoir: for each record already sampled, add one to the count for its reuse time; for each record not sampled, add one to the count for infinite reuse time. Once all addresses have been traversed, the reuse time histogram is obtained.
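A minimal sketch of this reservoir sampler, written directly from the steps above; the class name, the RNG seed, and the use of UINT64_MAX to bucket infinite reuse times are illustrative choices, not the patent's:

```cpp
// Reservoir-based reuse time histogram. A pool of at most k addresses is
// kept; reuse times are bucketed in a map, with UINT64_MAX standing in for
// "infinite" (inserted but never reused).
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <random>
#include <unordered_map>

class ReuseSampler {
    struct Entry { uint64_t first_seen; bool sampled; };
    size_t k_;
    uint64_t i_ = 0;                            // running access number
    std::unordered_map<uint64_t, Entry> pool_;  // the reservoir
    std::map<uint64_t, uint64_t> hist_;         // reuse time -> count
    std::mt19937_64 rng_{42};

public:
    explicit ReuseSampler(size_t k) : k_(k) {}

    void access(uint64_t addr) {
        ++i_;
        auto it = pool_.find(addr);
        if (it != pool_.end()) {
            if (!it->second.sampled) {          // second access: reuse time
                ++hist_[i_ - it->second.first_seen];
                it->second.sampled = true;
            }
            return;
        }
        std::uniform_real_distribution<double> u(0.0, 1.0);
        if (u(rng_) < std::min(1.0, double(k_) / double(i_))) {
            if (pool_.size() >= k_) {           // full: evict a random entry
                auto victim = pool_.begin();
                std::advance(victim, rng_() % pool_.size());
                pool_.erase(victim);
            }
            pool_[addr] = {i_, false};          // insert, not yet sampled
        }
    }

    // Final traversal: surviving unsampled entries count as infinite reuse.
    std::map<uint64_t, uint64_t> finish() {
        for (auto& [addr, e] : pool_)
            if (!e.sampled) ++hist_[UINT64_MAX];
        return hist_;
    }
};
```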
Further, the cache average eviction time (AET) model is specifically:
Let n be the total number of memory accesses, RT(t) the number of accesses with reuse time t, and f(t) the fraction of accesses with reuse time t; then:

f(t) = RT(t) / n    (1)

For one access, define P(t) as the probability that its reuse time is greater than t:

P(t) = \sum_{x=t+1}^{\infty} f(x)    (2)

The question of whether a cached line moves down the eviction stack now becomes probabilistic: P(t) is understood to mean that a cache line whose data has reuse time t moves P(t) positions per unit time, so P(t) is also the movement rate of the cache line. For the line at stack position m, let T_m be the time at which the data arrives there; its movement speed v(T_m) is:

v(T_m) = P(T_m)    (3)

From equations (2) and (3), the following relationship is obtained (the data moves exactly one position between T_m and T_{m+1}):

\int_{T_m}^{T_{m+1}} v(t) \, dt = 1    (4)

Integrating v(T_m) gives the distance the data moves in the stack: from T_m to T_{m+1} the data moves one position, and accumulating the movement over all positions m yields the size c of the whole stack:

\sum_{m=0}^{c-1} \int_{T_m}^{T_{m+1}} v(t) \, dt = c    (5)

Transforming the left side of equation (5), with T_0 = 0 and T_c the eviction time, gives equation (6):

\int_{0}^{AET(c)} P(t) \, dt = c    (6)

AET(c), the average eviction time when the cache size is c, is computed from equation (6); the miss rate when the cache size is c is:

mr(c) = P(AET(c))    (7)

Substituting the values 1, 2, 3, …, c into equation (7) yields the relationship between cache size and miss rate, i.e., the miss rate curve.
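Under these definitions, each point of the miss rate curve can be computed by accumulating P(t) until the area reaches c (equation (6)) and reading off P at that time (equation (7)). The sketch below assumes a discretized histogram with unit time steps; function and variable names are illustrative:

```cpp
// AET solver: build the tail distribution P(t) from the histogram, then for
// each cache size c integrate P(t) until the area reaches c (equation (6))
// and emit mr(c) = P(AET(c)) (equation (7)).
#include <cstdint>
#include <vector>

// hist[t] = number of accesses with reuse time t (index 0 unused);
// n = total number of accesses, including never-reused ones.
std::vector<double> miss_rate_curve(const std::vector<uint64_t>& hist,
                                    double n, uint64_t max_c) {
    // P[t] = fraction of accesses with reuse time > t; never-reused accesses
    // remain in the tail forever, so the tail starts at n.
    std::vector<double> P(hist.size() + 1);
    double tail = n;
    for (size_t t = 0; t < P.size(); ++t) {
        P[t] = tail / n;
        if (t + 1 < hist.size()) tail -= double(hist[t + 1]);
    }
    std::vector<double> mrc(max_c + 1, 1.0);    // mrc[c] for c = 1..max_c
    double area = 0.0;
    size_t t = 0;
    for (uint64_t c = 1; c <= max_c; ++c) {
        while (area < double(c) && t + 1 < P.size())
            area += P[t++];                     // unit-step integration of P
        mrc[c] = P[t];                          // mr(c) = P(AET(c))
    }
    return mrc;
}
```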
Further, the cache allocation annealing algorithm is specifically:
Initialize the cache space usage of all target threads whose memory accesses were collected, initialize the current temperature T to 100000, initialize all threads to share the cache equally, initialize all cache allocation schemes, and set the initial allocation plan as the optimal plan; then compute with the following steps:
S51, predict each process's run cycles, average cycles per instruction, and miss rate from its cache space usage;
S52, predict the new cache space usage from the predicted average cycles per instruction and miss rate;
S53, update the predicted run cycles to 95% of the previous value;
S54, repeat steps S52 and S53 until the predicted run cycles reach a minimum or the change in cache space usage reaches a minimum;
S55, compute the performance index from the predicted average cycles per instruction and miss rate;
S56, compute a neighbor plan: randomly select a scheme to modify and randomly select a modification method; after modification, verify that the new plan has not been visited before; if it has, re-execute this step, otherwise continue;
S57, substitute the neighbor plan into steps S51 to S55 to obtain its predicted performance index;
S58, compare the performance index of the current optimal plan with that of the neighbor plan, and if the neighbor's index is better, set the neighbor plan as the optimal plan;
S59, compute the difference between the performance indices of the current plan and the neighbor plan, defined so that if a larger index is better, the difference is the current plan's index minus the neighbor's, and otherwise the opposite;
S510, decide whether the neighbor plan becomes the current plan: when the difference is less than 0, or a random number drawn from [0, 1] is below
e^{-diff/T}
(the Metropolis criterion), replace the current plan with the neighbor plan;
S511, update the temperature T to 95% of its previous value;
S512, when the temperature T is above the minimum temperature, return to step S56; otherwise stop, yielding the optimal plan;
and S513, apply the optimal plan to each thread through CAT.
Further, the performance indices are specifically:
Overall throughput, defined as the sum of the instructions-per-cycle values of all threads; a larger index value is better;
Online-service misses per thousand instructions, defined as the average of the misses per thousand instructions of all online-service threads; a smaller index value is better;
Performance degradation ratio, defined per thread as the ratio of its instructions per cycle before cache control to that after cache control, with the index defined as the average of the degradation ratios of all online-service threads; a smaller index value is better;
Maximum online-service performance degradation ratio, defined as the maximum of the degradation ratios of all online-service threads; a smaller index value is better.
Further, the modification methods are specifically the following 4:
move one thread of the selected scheme to another random scheme whose cache ways differ from this scheme's;
shift the scheme's cache-way range one position to the left or right;
add one cache way to the scheme; it must be added at either end of the scheme's existing contiguous ways, and nothing is added when the ways are already full;
remove one cache way from the scheme; it must be removed at either end of the scheme's existing ways, and nothing is removed when only 1 cache way remains.
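Putting these pieces together, the sketch below shows the annealing skeleton (steps S56 to S513) with the four neighbor moves. The performance-prediction fixed point of steps S51 to S55 is abstracted behind an evaluate() placeholder, and the visited-plan check, the stopping temperature of 1.0, and the larger-is-better objective are assumptions:

```cpp
// Compact annealing sketch: initial T = 100000 and the 0.95 cooling factor
// follow the text; evaluate() stands in for the CPI / miss-rate fixed point.
#include <cmath>
#include <random>
#include <vector>

struct Scheme { int lo, hi; std::vector<int> threads; }; // ways [lo, hi]
using Plan = std::vector<Scheme>;

// Placeholder objective (S51-S55); the real system predicts a performance
// index such as overall throughput from the miss rate curves.
double evaluate(const Plan& p) {
    double v = 0;
    for (const auto& s : p) v += s.threads.size() * (s.hi - s.lo + 1);
    return v;
}

// S56: pick a scheme at random and apply one of the four modifications.
Plan neighbor(Plan p, int n_ways, std::mt19937& rng) {
    Scheme& s = p[rng() % p.size()];
    switch (rng() % 4) {
        case 0:  // move one thread to a scheme with different ways
            if (p.size() > 1 && !s.threads.empty()) {
                Scheme& dst = p[rng() % p.size()];
                if (dst.lo != s.lo || dst.hi != s.hi) {
                    dst.threads.push_back(s.threads.back());
                    s.threads.pop_back();
                }
            }
            break;
        case 1:  // shift the way range one position left or right
            if (rng() % 2) { if (s.hi + 1 < n_ways) { ++s.lo; ++s.hi; } }
            else           { if (s.lo > 0)          { --s.lo; --s.hi; } }
            break;
        case 2:  // grow by one way at either end, unless already full
            if (rng() % 2) { if (s.hi + 1 < n_ways) ++s.hi; }
            else           { if (s.lo > 0) --s.lo; }
            break;
        case 3:  // shrink by one way at either end, keeping at least one
            if (s.hi > s.lo) { if (rng() % 2) --s.hi; else ++s.lo; }
            break;
    }
    return p;  // a real implementation would also reject visited plans
}

Plan anneal(Plan current, int n_ways) {
    std::mt19937 rng(7);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double T = 100000.0;
    const double Tmin = 1.0;                  // assumed stopping temperature
    Plan best = current;
    double cur_v = evaluate(current), best_v = cur_v;
    while (T > Tmin) {                                   // S512
        Plan nb = neighbor(current, n_ways, rng);        // S56
        double nb_v = evaluate(nb);                      // S57
        if (nb_v > best_v) { best = nb; best_v = nb_v; } // S58
        double diff = cur_v - nb_v;                      // S59
        if (diff < 0 || u(rng) < std::exp(-diff / T))    // S510, Metropolis
            { current = nb; cur_v = nb_v; }
        T *= 0.95;                                       // S511
    }
    return best;  // S513: the winning plan is then applied through CAT
}
```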
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Two-stage performance sampling gives the cache classification higher accuracy and generality, and can screen out problem-type and sensitive processes or threads within a short test time. The original cache allocation annealing algorithm suits single-process, single-thread applications in general scenarios; this method is tailored to the hybrid deployment of online and offline containers and is extended to the multi-threaded case. Cache allocation based on this scheme brings roughly a 10% performance improvement on average and up to 40% at best, while the performance loss of the restricted problem-type processes is only about 5%.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
In this embodiment, for the performance event collection part: since container clouds generally run on the Linux operating system, the perf tool provided by Linux can be used to monitor a process's performance events, supporting the CPU's various performance events as well as custom events. For memory access collection, the PEBS mechanism (Precise Event-Based Sampling) supported by Intel CPUs can record the memory addresses accessed by the CPU, or the Pin tool provided by Intel can be used, which supports developing a Pintool to collect memory access addresses. For cache way allocation, Intel's CAT technology supports defining cache allocation schemes, i.e., which cache ways a scheme may use and which scheme a process belongs to.
Therefore, the steps of the embodiment of the invention are as follows:
and monitoring the processes of all online service containers and offline task containers through per, acquiring and classifying performance data, and limiting the cache way of the problem type application program by means of CAT technology. Then, screening target threads, acquiring memory access addresses and calculating a loss rate curve through a Pin tool or a PEBS mechanism, calculating an optimal allocation scheme according to the loss rate curves of all the target threads by using a cache allocation annealing algorithm, and implementing the allocation scheme by using a CAT technology.
As shown in fig. 1, the present invention provides an LLC allocation method for hybrid deployment of online and offline containers, applied in a cache allocation program, comprising the following steps:
S1, obtain all online-service and offline-task container processes on the host where the cache allocation program runs, and monitor their performance under two LLC cache ways and under all LLC cache ways, specifically:
the online service is a user-facing application program with quality-of-service requirements; the offline task is a batch-processing application program that does not face users;
other types of processes on the host where the cache allocation program runs are outside the scope of cache control;
the performance monitoring uses CPU performance counters, for the processes of cache-sensitive and problem-type applications, to monitor the instructions completed, CPU run cycles, total read instructions, total write instructions, total memory access cycles, total data access cycles, total LLC misses, and total LLC hits;
the specific steps of the performance monitoring are:
using CAT, set the monitored process to two cache ways and monitor the 8 events for 1 minute with the monitoring tool; then set the monitored process to all cache ways, repeat the monitoring, and record the data obtained from both passes.
In this embodiment, the following are specifically mentioned:
on the intel skylackecpu architecture, the specific events measured are shown in table 1 below:
Figure BDA0003048415330000111
TABLE 1
Process sampling is divided into two stages, one using two cache ways and one using all cache ways, each sampled for 1 minute. After the results are obtained, the processes are classified; the sampling data used by the subsequent cache allocation annealing algorithm is also obtained from the all-cache-way stage.
S2, compute the process's cache usage characteristics from the two performance monitoring passes, and judge whether the process is cache-sensitive or a problem type;
the cache usage characteristics are computed from the before-and-after changes in, and absolute values of, the performance event counts, covering the following performance indicators:
instructions per cycle; LLC accesses per cycle; average LLC miss penalty; LLC miss rate; LLC misses per thousand instructions; LLC hits per thousand instructions; LLC accesses per thousand instructions; cycles per instruction for non-memory-access instructions; and average cache access delay cycles, as listed in Table 2 (rendered as an image in the original; the formulas are those given above);
compute the 3 variations, in instructions per cycle, LLC misses per thousand instructions, and LLC hits per thousand instructions, each calculated as:
|all-cache-way performance event count − two-cache-way performance event count| ÷ two-cache-way performance event count;
judge the process's cache usage characteristics; the judgment conditions are those given above (Table 3 is rendered as an image in the original).
S3, limit the total amount of LLC cache available to problem-type processes;
the problem-type processes are the thrashing-type and bully-type applications; the sensitive processes are the highly, moderately, and low-sensitivity applications;
the cache space of problem-type processes is limited through CAT, restricting these two kinds of processes to at most 4 cache ways.
In this embodiment, specifically: after the process classification results of step S2 are obtained, the processes belonging to the thrashing and bully types are selected, the CLOS with sequence number 1 in CAT is set to the lowest 4 cache ways, and these processes are then assigned to the scheme using CLOS 1.
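A hedged sketch of this step through the Linux resctrl interface, the usual kernel front end for Intel CAT: the mount point /sys/fs/resctrl, the group name clos1, and the example PIDs are assumptions, and the mask 0xf selects the lowest 4 ways:

```cpp
// Restrict CLOS 1 to the lowest 4 cache ways and move the problem-type
// processes into it via resctrl.
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <vector>

int main() {
    // Creating a directory under /sys/fs/resctrl allocates a CLOS.
    mkdir("/sys/fs/resctrl/clos1", 0755);
    std::ofstream("/sys/fs/resctrl/clos1/schemata") << "L3:0=f\n";

    // Assign every thrashing- or bully-type process to the restricted group.
    std::vector<int> problem_pids = {1234, 5678};   // hypothetical PIDs
    std::ofstream tasks("/sys/fs/resctrl/clos1/tasks");
    for (int pid : problem_pids)
        tasks << pid << std::endl;  // endl flushes: one PID per write,
                                    // which the kernel interface expects
    return 0;
}
```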
S4, collect the memory access records of cache-sensitive processes, compute the reuse time histogram by reservoir sampling, and substitute the histogram into the average eviction time model to compute the miss rate curve;
the specific steps of collecting memory accesses and computing the miss rate curve in step S4 are:
screen out the target threads whose memory accesses need to be collected; a thread must belong to an online-service or offline-task process from step S1, must not belong to a problem-type or low-sensitivity process, and must have run for 1 hour;
develop a reuse time histogram collection program with the Pin tool, use it to collect all memory access addresses of the target process, and then apply reservoir sampling;
the general flow of reservoir sampling is as described above: create a reservoir of size k; insert an arriving address not in the reservoir with probability min(1, k/i), evicting a random resident address when the reservoir is full; on a second access at sequence number j to an address inserted at sequence number i, record the reuse time j − i and mark the address as sampled; finally traverse the reservoir, counting sampled records under their reuse times and unsampled records under infinite reuse time, yielding the reuse time histogram;
the reuse time histogram is substituted into the cache average eviction time model to compute the miss rate curve, where the model is specifically:
Let n be the total number of memory accesses, RT(t) the number of accesses with reuse time t, and f(t) the fraction of accesses with reuse time t; then:

f(t) = RT(t) / n    (1)

For one access, define P(t) as the probability that its reuse time is greater than t:

P(t) = \sum_{x=t+1}^{\infty} f(x)    (2)

Whether a cached line moves down the eviction stack now becomes probabilistic: P(t) says that a cache line whose data has reuse time t moves P(t) positions per unit time, so P(t) is also the movement rate of the cache line. For the line at stack position m, let T_m be the time at which the data arrives there; its movement speed v(T_m) is:

v(T_m) = P(T_m)    (3)

From the above equations (2) and (3), the following relationship is derived:

\int_{T_m}^{T_{m+1}} v(t) \, dt = 1    (4)

The derivation is that between T_m and T_{m+1} the data moves exactly one stack position, so the integral of its speed over that interval equals 1.

Integrating v(T_m) gives the distance the data moves in the stack: from T_m to T_{m+1} the data moves one position, and accumulating the movement over all positions m yields the size c of the whole stack:

\sum_{m=0}^{c-1} \int_{T_m}^{T_{m+1}} v(t) \, dt = c    (5)

Transforming the left side of equation (5), with T_0 = 0 and T_c the eviction time, gives equation (6):

\int_{0}^{AET(c)} P(t) \, dt = c    (6)

AET(c), the average eviction time when the cache size is c, is computed from equation (6); the miss rate when the cache size is c is:

mr(c) = P(AET(c))    (7)

Substituting the values 1, 2, 3, …, c into equation (7) yields the relationship between cache size and miss rate, i.e., the miss rate curve.
In this embodiment, there are two methods for collecting memory access records. The first is PEBS-based memory sampling, which intrudes little on the program and has low performance impact but low sampling precision. The second is Pin-based memory sampling, which is precise but slow to collect and has a large impact on process performance. Which collection mode to use is decided by the actual requirements.
The PEBS-based implementation method is:
PEBS can be used on all performance counters, but only for the few performance events that support PEBS. Enabling PEBS requires, besides setting the enable bit in the specified MSR, setting the PEBS buffer base address, the buffer write location (the index), the buffer maximum size (the absolute maximum), the interrupt threshold (the address in the buffer at which a performance counter interrupt is raised), and the CounterReset. The performance counter then counts down from CounterReset; when the count reaches 0, a PEBS assist event is generated that writes a PEBS record at the buffer's index location, after which the index is advanced and the counter is reset to CounterReset to count down again. When the index reaches the interrupt threshold, a performance counter interrupt is raised; a preinstalled interrupt service routine reads the buffer's PEBS records and resets the index, so sampling can restart. The format of a PEBS record on the Skylake microarchitecture is shown in Table 4 below.
Table 4 (rendered as an image in the original) lists the PEBS record fields on the Skylake microarchitecture.
The data this method needs lies at offset 98H of the record. Linux supports collecting PEBS records, but the kernel neither processes them nor exports them anywhere: Linux uses only the precision property, namely that PEBS guarantees the recorded event differs by only 1 to 2 clock cycles from the moment the performance event counted the counter down to 0. The required PEBS records are therefore obtained by hooking the PEBS processing function of Linux, which relies on the KProbe and Tracepoint facilities of the kernel.
The general flow of collecting PEBS records with KProbe and Tracepoint is to attach a KProbe pre-handler to the kernel's PEBS processing function, so that before that function runs, our handler executes, reads all PEBS records, and calls a custom Tracepoint hook function. perf can trace the Tracepoint and record its events, and this perf feature is used to store the PEBS records obtained from all Tracepoint invocations.
Reading a PEBS record requires the base address and index of the PEBS buffer. This information is stored in the CPU's Debug Store MSR, represented in Linux by MSR_IA32_DS_AREA and readable with the rdmsrl function. The MSR contents can be cast to a debug_store structure; the base address and index are read from its pebs_buffer_base and pebs_index fields, after which the 64-bit value at offset 98H can be read.
After the addresses are obtained, a program reads perf's output file and samples the reuse time histogram of the addresses with the reservoir sampling method.
The memory collection method based on the Pin tool is:
Concrete analysis with Pin requires developing a Pintool, which calls Pin's API to run user-defined analysis logic, defining specifically:
lifecycle functions, executed when a process or thread starts or exits, used for initialization, writing out final data, and similar operations;
a binary-code instrumentation function, responsible for analyzing the process's binary instructions and inserting the Pintool's analysis functions at specified positions, such as before or after an instruction, declaring the needed parameters such as registers or thread IDs; this function usually runs only once and is not called again after instrumentation finishes;
and analysis functions, user-supplied routines that perform the analysis task with the actual parameters they receive.
In the instrumentation function, each instruction is checked for memory accesses. Pin provides the APIs INS_IsMemoryRead, INS_HasMemoryRead2, and INS_IsMemoryWrite to query whether a single instruction performs a read or write, needing only the instruction as a parameter. For each memory access instruction, an analysis function that collects the access address is inserted in front of it; the analysis function obtains the address, from which the reuse time of each address is collected. The lifecycle functions perform reservoir initialization and write the result file.
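The sketch below shows such a Pintool, using the INS_IsMemoryRead, INS_HasMemoryRead2, and INS_IsMemoryWrite APIs named above together with Pin's call-insertion API; it assumes the ReuseSampler class from the earlier reservoir sketch and must be built against the Pin kit rather than as a standalone program:

```cpp
// Pintool sketch: instrument every memory access and feed its effective
// address into the reservoir sampler. "pin.H" and the INS_* / IARG_*
// symbols come from the Pin kit.
#include "pin.H"

static ReuseSampler sampler(4096);   // assumed reservoir size; one shared
                                     // pool here, per-thread pools in practice

// Analysis function: receives the effective address of one memory operand.
VOID RecordAccess(ADDRINT addr, THREADID tid) {
    sampler.access((uint64_t)addr);
}

// Instrumentation function: runs once per instruction at code-modification
// time and plants RecordAccess in front of each memory access.
VOID Instruction(INS ins, VOID*) {
    if (INS_IsMemoryRead(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordAccess,
                                 IARG_MEMORYREAD_EA, IARG_THREAD_ID, IARG_END);
    if (INS_HasMemoryRead2(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordAccess,
                                 IARG_MEMORYREAD2_EA, IARG_THREAD_ID, IARG_END);
    if (INS_IsMemoryWrite(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordAccess,
                                 IARG_MEMORYWRITE_EA, IARG_THREAD_ID, IARG_END);
}

// Lifecycle function: dump the reuse time histogram when the process exits.
VOID Fini(INT32 code, VOID*) {
    // write sampler.finish() to a result file here
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, nullptr);
    PIN_AddFiniFunction(Fini, nullptr);
    PIN_StartProgram();              // never returns
    return 0;
}
```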
S5, substitute the miss rate curves of all applications and the performance monitoring data collected in step S1 into the cache allocation annealing algorithm, and compute an allocation plan that protects both online-service and offline-task performance.
The cache allocation annealing algorithm is specifically:
First initialize the cache space usage of all threads, initialize the current temperature T to 100000, initialize all threads to share the cache equally, initialize all cache allocation schemes, and set the initial plan as the optimal plan; then compute with the following steps:
S51, predict each process's run cycles, average cycles per instruction, and miss rate from its cache space usage;
S52, predict the new cache space usage from the predicted average cycles per instruction and miss rate;
S53, update the predicted run cycles to 95% of the previous value;
S54, repeat steps S52 and S53 until the predicted run cycles reach a minimum or the change in cache space usage reaches a minimum;
S55, compute the performance index from the predicted average cycles per instruction and miss rate, where the performance indices comprise:
1) overall throughput, defined as the sum of the instructions-per-cycle values of all threads; a larger index value is better;
2) average online-service misses per thousand instructions, defined as the average of the misses per thousand instructions of all online-service threads (computed as in Table 2); a smaller index value is better;
3) performance degradation ratio, the per-thread ratio of instructions per cycle before cache control to that after control, with the index defined as the average of the degradation ratios of all online-service threads; a smaller index value is better;
4) maximum online-service performance degradation ratio, defined as the maximum of the degradation ratios of all online-service threads; a smaller index value is better.
S56, compute a neighbor plan: randomly select a scheme to modify, and randomly select one of the following four modification methods:
1) move one thread of the scheme to another random scheme whose cache ways differ from this scheme's;
2) shift the scheme's cache-way range one position to the left or right;
3) add one cache way to the scheme, at either end of its existing contiguous ways, adding nothing when the ways are already full;
4) remove one cache way from the scheme, at either end of its existing ways, removing nothing when only 1 cache way remains.
After modification, verify that the new plan has not been visited before; if it has, re-execute this step, otherwise continue;
S57, substitute the neighbor plan into steps S51 to S55 to obtain its predicted performance index;
S58, compare the performance index of the current optimal plan with that of the neighbor plan, and if the neighbor's index is better, set the neighbor plan as the optimal plan;
S59, compute the difference between the performance indices of the current plan and the neighbor plan, defined so that if a larger index is better, the difference is the current plan's index minus the neighbor's, and otherwise the opposite;
S510, decide whether the neighbor plan becomes the current plan: when the difference is less than 0, or a random number drawn from [0, 1] is below
e^{-diff/T}
(the Metropolis criterion), replace the current plan with the neighbor plan;
S511, update the temperature T to 95% of its previous value;
S512, when the temperature T is above the minimum temperature, return to step S56; otherwise continue to the next step with the optimal plan obtained;
and S513, apply the optimal plan to each thread through CAT.
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. An LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers, applied in a cache allocation program, comprising the following steps:
S1, obtaining all online-service and offline-task container processes on the host where the cache allocation program runs, and monitoring their performance under two LLC cache ways and under all LLC cache ways respectively; the online service being a user-facing application program with quality-of-service requirements, and the offline task being a batch-processing application program that does not face users;
other types of processes on the host where the cache allocation program runs being outside the scope of cache control;
the performance monitoring specifically using CPU performance counters to monitor 8 events for the processes of cache-sensitive and problem-type applications, namely: instructions completed, CPU run cycles, total read instructions, total write instructions, total memory access cycles, total data access cycles, total LLC misses, and total LLC hits;
the specific steps of the performance monitoring being:
using CAT to set the monitored process to two cache ways, monitoring the 8 events for 1 minute with a monitoring tool, then setting the monitored process to all cache ways, repeating the monitoring, and recording the data obtained from both passes;
S2, computing the process's cache usage characteristics from the two performance monitoring passes, and judging whether the process is cache-sensitive or a problem type; the cache usage characteristics being computed from the absolute values of the performance event counts and their change between the two passes, covering the following performance indicators:
the calculation formula of the instruction number per cycle is as follows:
instruction completion number ÷ CPU run cycle number
The LLC access number per period is calculated by the following formula:
(total LLC lost + total LLC hit) ÷ instruction complete number
The average LLC cache loss penalty is calculated according to the following formula:
total number of memory access cycles divided by total number of LLC losses
The LLC loss rate is calculated by the following formula:
LLC total lost ÷ (LLC total hit + LLC total lost)
The LLC loss number of every thousand instructions is calculated according to the following formula:
total LLC loss ÷ instruction completion number 1000
The LLC hit number per thousand instructions is calculated by the following formula:
total number of LLC hits/instruction completion number 1000
The LLC access number of each thousand instructions is calculated by the following formula:
(total LLC hits + total LLC misses) ÷ instruction completion number 1000
The non-access instruction has the following calculation formula of the number of cycles per instruction:
(CPU operating cycle number-total data access cycle number) ÷ (instruction completion number- (total number of read instructions + total number of write instructions))
The average cache access delay period is calculated by the following formula:
(total number of data accesses-total number of memory accesses) ÷ ((total number of read instructions + total number of write instructions) -LLC lost number)
calculating the variation in instructions per cycle, the variation in LLC misses per thousand instructions, and the variation in LLC hits per thousand instructions, each variation being calculated as:
|all-cache-way performance event count − two-cache-way performance event count| ÷ two-cache-way performance event count
judging the process's cache usage characteristics under the following conditions:
thrashing type: LLC misses per thousand instructions greater than 4, LLC hits per thousand instructions less than 0.5, and the variations less than 0.3;
bully type: instructions per cycle at most 0.6, LLC misses per thousand instructions at least 10, LLC hits per thousand instructions at least 10, and the variations less than 0.3;
highly sensitive: instructions per cycle less than 1.3, its variation greater than 0.1, and the variation in LLC misses per thousand instructions greater than 0.3;
moderately sensitive: instructions per cycle greater than 1.3, its variation greater than 0.1, and the variation in LLC misses per thousand instructions greater than 0.3;
low sensitivity: all other cases not matching the thrashing, bully, highly sensitive, or moderately sensitive types;
S3, limiting the total amount of LLC cache available to problem-type processes; the problem-type processes being the thrashing-type and bully-type applications, and the sensitive processes being the highly, moderately, and low-sensitivity applications;
specifically, the cache space of problem-type processes is limited by using CAT to restrict thrashing-type and bully-type applications to at most 4 cache ways;
S4, collecting the memory access records of cache-sensitive processes, computing a reuse time histogram by reservoir sampling, and substituting the histogram into the average eviction time model to compute a miss rate curve; the average eviction time model being specifically:
let n be the total number of memory accesses, RT(t) the number of accesses with reuse time t, and f(t) the fraction of accesses with reuse time t; then:

f(t) = RT(t) / n    (1)

for one access, define P(t) as the probability that its reuse time is greater than t:

P(t) = \sum_{x=t+1}^{\infty} f(x)    (2)

whether a cached line moves down the eviction stack now becomes probabilistic: P(t) is understood to mean that a cache line whose data has reuse time t moves P(t) positions per unit time, so P(t) is also the movement rate of the cache line; for the line at stack position m, let T_m be the time at which the data arrives there; its movement speed v(T_m) is:

v(T_m) = P(T_m)    (3)

the following relationship is obtained from equations (2) and (3), the data moving exactly one position between T_m and T_{m+1}:

\int_{T_m}^{T_{m+1}} v(t) \, dt = 1    (4)

integrating v(T_m) gives the distance the data moves in the stack; from T_m to T_{m+1} the data moves one position; accumulating the movement over all positions m yields the size c of the whole stack:

\sum_{m=0}^{c-1} \int_{T_m}^{T_{m+1}} v(t) \, dt = c    (5)

transforming the left side of equation (5), with T_0 = 0 and T_c the eviction time, gives equation (6):

\int_{0}^{AET(c)} P(t) \, dt = c    (6)

AET(c), the average eviction time when the cache size is c, is computed from equation (6); the miss rate when the cache size is c is:

mr(c) = P(AET(c))    (7)

substituting the values 1, 2, 3, …, c into equation (7) yields the relationship between cache size and miss rate, i.e., the miss rate curve;
and S5, substituting the miss rate curves of all application programs and the performance monitoring data collected in step S1 into a cache allocation annealing algorithm, and computing an allocation scheme that protects both online-service and offline-task performance.
2. The LLC allocation method for hybrid deployment of online and offline containers according to claim 1, wherein the specific steps of collecting memory accesses and computing the miss rate curve in step S4 are:
screening out target threads whose memory accesses need to be collected, each thread belonging to an online-service or offline-task process, not belonging to a problem-type or low-sensitivity process, and having run for 1 hour;
developing a reuse time histogram collection program with the Pin tool, collecting all memory access addresses of the target process, and then applying reservoir sampling to obtain the reuse time histogram;
and substituting the reuse time histogram into the average eviction time model to compute the miss rate curve.
3. The LLC allocation method for hybrid deployment of online and offline containers according to claim 2, wherein the specific steps of the reservoir sampling method are:
creating a reservoir of size k; when the i-th address arrives and is not in the reservoir, inserting it with probability min(1, k/i), and if the reservoir is full, randomly selecting one address in the reservoir, deleting it, and inserting the new address; marking a newly inserted address as not sampled and recording its access sequence number i; when the address is accessed a second time with access sequence number j, recording the interval j − i of the two accesses, i.e., the reuse time of the address, and marking the address as sampled; after the access trace has been sampled, traversing the address records in the reservoir, adding one to the count for the corresponding reuse time for each sampled record and one to the count for infinite reuse time for each unsampled record; and obtaining the reuse time histogram after all addresses are traversed.
4. The LLC allocation method for hybrid deployment of online and offline containers according to claim 1, wherein the cache allocation annealing algorithm is specifically:
initializing the cache space usage of all target threads whose memory accesses were collected, initializing the current temperature T to 100000, initializing all threads to share the cache equally, initializing all cache allocation schemes, and setting the initial plan as the optimal plan; then computing with the following steps:
S51, predicting each process's run cycles, average cycles per instruction, and miss rate from its cache space usage;
S52, predicting the new cache space usage from the predicted average cycles per instruction and miss rate;
S53, updating the predicted run cycles to 95% of the previous value;
S54, repeating steps S52 and S53 until the predicted run cycles reach a minimum or the change in cache space usage reaches a minimum;
S55, computing the performance index from the predicted average cycles per instruction and miss rate;
S56, computing a neighbor plan: randomly selecting a scheme to modify and randomly selecting a modification method; after modification, verifying that the new plan has not been visited; if it has, re-executing this step, otherwise continuing;
S57, substituting the neighbor plan into steps S51 to S55 to obtain its predicted performance index;
S58, comparing the performance index of the current optimal plan with that of the neighbor plan, and if the neighbor's index is better, setting the neighbor plan as the optimal plan;
S59, computing the difference between the performance indices of the current plan and the neighbor plan, defined so that if a larger index is better, the difference is the current plan's index minus the neighbor's, and otherwise the opposite;
S510, deciding whether the neighbor plan becomes the current plan: when the difference is less than 0, or a random number drawn from [0, 1] is below
e^{-diff/T}
, replacing the current plan with the neighbor plan;
S511, updating the temperature T to 95% of its previous value;
S512, when the temperature T is above the minimum temperature, returning to step S56, otherwise stopping with the optimal plan obtained;
and S513, applying the optimal plan to each thread through CAT.
5. The LLC allocation method for hybrid deployment of online and offline containers according to claim 4, wherein the performance indices specifically comprise:
overall throughput, defined as the sum of the instructions-per-cycle values of all threads, a larger index value being better;
online-service misses per thousand instructions, defined as the average of the misses per thousand instructions of all online-service threads, a smaller index value being better;
performance degradation ratio, the per-thread ratio of instructions per cycle before cache control to that after control, the index being defined as the average of the degradation ratios of all online-service threads, a smaller index value being better;
and maximum online-service performance degradation ratio, defined as the maximum of the degradation ratios of all online-service threads, a smaller index value being better.
6. The LLC allocation method for hybrid deployment of online and offline containers according to claim 4, wherein the modification methods specifically comprise the following 4:
moving one thread of the scheme to another random scheme whose cache ways differ from this scheme's;
shifting the scheme's cache-way range one position to the left or right;
adding one cache way to the scheme, at either end of its existing contiguous ways, adding nothing when the ways are already full;
and removing one cache way from the scheme, at either end of its existing ways, removing nothing when only 1 cache way remains.
CN202110480587.0A 2021-04-30 2021-04-30 LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers Active CN113190350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110480587.0A CN113190350B (en) 2021-04-30 2021-04-30 LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110480587.0A CN113190350B (en) 2021-04-30 2021-04-30 LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers

Publications (2)

Publication Number Publication Date
CN113190350A CN113190350A (en) 2021-07-30
CN113190350B 2022-06-14

Family

ID=76983155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110480587.0A Active CN113190350B (en) LLC (Last Level Cache) allocation method for hybrid deployment of online and offline containers

Country Status (1)

Country Link
CN (1) CN113190350B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9336141B2 (en) * 2013-03-13 2016-05-10 Cloud Physics, Inc. Hash-based spatial sampling for efficient cache utility curve estimation and cache allocation
CN106997351B (en) * 2016-01-22 2021-03-02 斑马智行网络(香港)有限公司 Resource cache management method, system and device
US11294726B2 (en) * 2017-05-04 2022-04-05 Salesforce.Com, Inc. Systems, methods, and apparatuses for implementing a scalable scheduler with heterogeneous resource allocation of large competing workloads types using QoS
CN110795202B (en) * 2018-08-02 2023-11-17 华为技术有限公司 Resource allocation method and device of virtualized cluster resource management system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1666182A (en) * 2002-05-08 2005-09-07 英特尔公司 Method and system for optimally sharing memory between a host processor and graphic processor
CN103729248A (en) * 2012-10-16 2014-04-16 华为技术有限公司 Method and device for determining tasks to be migrated based on cache perception
CN103595653A (en) * 2013-11-18 2014-02-19 福建星网锐捷网络有限公司 Cache distribution method, device and apparatus
CN111258927A (en) * 2019-11-13 2020-06-09 北京大学 Application program CPU last-level cache miss rate curve prediction method based on sampling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Miss-Aware LLC Cache Management Policies under Heterogeneous Multi-core Processors; Zhang Xibei; China Masters' Theses Full-text Database, Information Science and Technology Series; 2019-05-15 (No. 05); I137-40 *

Also Published As

Publication number Publication date
CN113190350A (en) 2021-07-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant