CN113868068B - Kernel performance testing method, computing device and storage medium - Google Patents

Kernel performance testing method, computing device and storage medium

Info

Publication number
CN113868068B
Authority
CN
China
Prior art keywords
cache
processor
simulator
hardware
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111448514.XA
Other languages
Chinese (zh)
Other versions
CN113868068A (en)
Inventor
郭克
卢彦
孟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Uniontech Software Technology Co Ltd
Original Assignee
Uniontech Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Uniontech Software Technology Co Ltd
Priority to CN202111448514.XA
Publication of CN113868068A
Application granted
Publication of CN113868068B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F 11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing, using arrangements specific to the hardware being tested
    • G06F 11/2273 Test methods
    • G06F 11/26 Functional testing
    • G06F 11/261 Functional testing by simulating additional hardware, e.g. fault simulation
    • G06F 11/30 Monitoring
    • G06F 11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation, for performance assessment

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a kernel performance testing method, a computing device and a storage medium. The method comprises the following steps: acquiring first hardware simulation parameters according to the simulation target hardware, and constructing a first simulator according to the first hardware simulation parameters; running the first simulator to obtain index parameters; modifying the first hardware simulation parameters according to the index parameters to obtain second hardware simulation parameters; constructing a second simulator according to the second hardware simulation parameters; acquiring a kernel to be tested, and running the kernel to be tested in the second simulator to obtain test parameters; and generating a performance test result of the kernel to be tested according to the test parameters. The invention can obtain a simulator that is closer to the simulation target hardware, improve the tuning quality of the operating system, and accelerate the adaptation of new hardware.

Description

Kernel performance testing method, computing device and storage medium
Technical Field
The present invention relates to the field of system testing, and in particular, to a kernel performance testing method, a computing device, and a storage medium.
Background
With the development of computer technology, the requirements placed on operating systems are increasingly demanding. When hardware such as processor chips is updated and iterated rapidly, the operating system needs to be adapted to the new hardware. To meet user requirements, this adaptation work must be completed within a certain time; however, because of the realities of chip production, operating system vendors often cannot obtain the new hardware early enough to do this work, so the hardware needs to be simulated and the simulated hardware is used to carry out the adaptation of the operating system.
In the prior art, when simulated hardware is used for operating system adaptation, many parameters need to be acquired. These parameters come from public information on the network and from data provided by the hardware manufacturer, and the acquired parameters are often not comprehensive enough, or even not accurate enough, so it is difficult to construct simulated hardware that is close enough to the real hardware to perform the adaptation of the operating system.
For this reason, a new kernel performance testing method is required.
Disclosure of Invention
To this end, the present invention provides a kernel performance testing method that seeks to solve, or at least alleviate, the above-identified problems.
According to one aspect of the invention, a kernel performance testing method is provided, which is suitable to be executed in a computing device and comprises the following steps: acquiring first hardware simulation parameters according to the simulation target hardware, and constructing a first simulator according to the first hardware simulation parameters; running the first simulator to obtain index parameters; modifying the first hardware simulation parameters according to the index parameters to obtain second hardware simulation parameters; constructing a second simulator according to the second hardware simulation parameters; acquiring a kernel to be tested, and running the kernel to be tested in the second simulator to obtain test parameters; and generating a performance test result of the kernel to be tested according to the test parameters.
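The flow above can be summarized in code. The following Python sketch is purely illustrative: every helper it calls (collect_hardware_parameters, build_simulator, run_and_measure, adjust_parameters, run_kernel, generate_report) is a hypothetical placeholder for a step described in this disclosure, not an API defined by the invention.

```python
# Illustrative outline of the claimed method. Every helper called here is a
# hypothetical placeholder for a step described in this disclosure.
def test_kernel_performance(target_hw_spec, kernel_image, workload):
    # Step 1: first hardware simulation parameters from published/vendor data
    # of the simulation target hardware, then the first simulator.
    params_v1 = collect_hardware_parameters(target_hw_spec)
    simulator_v1 = build_simulator(params_v1)

    # Step 2: run the first simulator and collect the index parameters
    # (cache miss rates, data access rate, operation rate).
    indices = run_and_measure(simulator_v1, workload)

    # Step 3: modify the first parameters according to the index parameters
    # to obtain the second hardware simulation parameters.
    params_v2 = adjust_parameters(params_v1, indices, target_hw_spec)

    # Step 4: construct the second simulator, closer to the target hardware.
    simulator_v2 = build_simulator(params_v2)

    # Steps 5-6: run the kernel to be tested in the second simulator and turn
    # the collected test parameters into a performance test result.
    test_params = run_kernel(simulator_v2, kernel_image, workload)
    return generate_report(test_params)
```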
Optionally, in the method according to the present invention, the first hardware simulation parameters include a processor first cache replacement policy, a processor first cache prefetch mechanism, a processor first cache size, a processor first cache delay and a processor first operation weight, and constructing the first simulator according to the first hardware simulation parameters comprises the step of: constructing the first simulator according to the processor first cache replacement policy, the processor first cache prefetch mechanism, the processor first cache size, the processor first cache delay and the processor first operation weight.
Optionally, in the method according to the present invention, the index parameter includes a cache miss rate, and modifying the first hardware simulation parameter according to the index parameter includes the steps of: when the error value of the cache miss rate of the first simulator is larger than the miss rate error threshold value, the first cache replacement strategy of the processor is modified into the second cache replacement strategy of the processor, and the first cache prefetching mechanism is modified into the second cache prefetching mechanism of the processor.
Optionally, in the method according to the present invention, the first cache replacement policy of the processor comprises an LRU algorithm and the second cache replacement policy of the processor comprises a pseudo-random algorithm.
Optionally, in the method according to the present invention, the processor first cache prefetch mechanism comprises a fixed prefetch algorithm and the processor second cache prefetch mechanism comprises a predictive prefetch algorithm.
Optionally, in the method according to the present invention, further comprising the step of: and when the error value of the cache miss rate of the first simulator is larger than the miss rate error threshold value, modifying the first cache size of the processor into the second cache size of the processor.
Optionally, in the method according to the present invention, the index parameter further includes a data access rate, and the method further includes the steps of: and when the error value of the data access rate of the first simulator is larger than the access rate error threshold value, modifying the first cache delay of the processor into the second cache delay of the processor.
Optionally, in the method according to the present invention, the index parameter further includes an operation rate, and the method further includes the steps of: and when the error value of the operation rate of the first simulator is larger than the error threshold value of the operation rate, modifying the first operation weight of the processor into a second operation weight.
Optionally, in the method according to the present invention, the test parameters include the running duration of a code section, and generating the performance test result of the kernel to be tested according to the test parameters comprises the steps of: when the running duration of the code section is longer than a preset duration, acquiring the running process of the second simulator executing the code section; and generating the performance test result of the kernel to be tested according to the running process of the second simulator executing the code section.
According to another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the kernel performance testing method according to the present invention.
According to yet another aspect of the present invention, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform a kernel performance testing method according to the present invention.
The kernel performance testing method of the present invention is suitable to be executed in a computing device and comprises the following steps: first hardware simulation parameters are obtained according to the simulation target hardware, and a first simulator is constructed according to the first hardware simulation parameters. Because there is a difference between a first simulator constructed from unprocessed hardware simulation parameters and the actual target hardware to be simulated, the first simulator is run to obtain index parameters, and the first hardware simulation parameters are modified according to the index parameters to obtain second hardware simulation parameters, so that a second simulator closer to the target hardware is constructed according to the second hardware simulation parameters. Finally, the kernel to be tested is acquired and run in the second simulator to obtain test parameters, a performance test result of the kernel to be tested is generated according to the test parameters, and the adaptation and tuning of the operating system to the new hardware are carried out according to the performance test result. The method can obtain a simulator that is closer to the simulation target hardware, improves the tuning quality of the operating system, accelerates the adaptation of new hardware, can simulate the time behavior and performance bottlenecks of the kernel accurately enough, improves the speed at which an operating system vendor tracks and repairs performance problems, and avoids having to build a complex environment in order to track performance problems.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a hardware emulation architecture in accordance with an exemplary embodiment of the present invention;
FIG. 2 illustrates a block diagram of a computing device 200, according to an exemplary embodiment of the invention; and
FIG. 3 shows a flowchart of a kernel performance testing method 300 according to an exemplary embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. Like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a hardware emulation architecture according to an exemplary embodiment of the present invention. As shown in FIG. 1, the hardware emulation architecture is built in a computing device 200. The computing device 200 includes a host hardware module 110 and a host operating system 120 running on the host hardware module 110. The host hardware module 110 is the hardware module of the computing device 200; as the host of the simulated hardware, the computing device 200 provides the hardware operating environment for simulating the target hardware. The host hardware module 110 includes hardware devices such as a processor and internal memory, and the present invention does not limit the specific configuration of the host hardware module 110. The host operating system 120 is the operating system of the computing device 200 and provides the software operating environment for simulating the target hardware; the present invention does not limit the type of the host operating system 120.
The computing device 200 also includes a simulator 130 and a performance testing tool 150. The emulator 130 is adapted to run in the host operating system 120. Simulator 130 includes a first simulator and a second simulator for simulating the target hardware. When the hardware adaptation work of the operating system is required, the kernel 140 to be tested is operated in the simulator 130, the test parameters of the kernel 140 to be tested are obtained through the performance test tool 150, the performance test result is generated, the kernel 140 to be tested is optimized, and the tuning of the operating system is realized.
The specific structure of the computing device 200 in fig. 1 is shown in fig. 2. FIG. 2 illustrates a block diagram of a computing device 200, according to an exemplary embodiment of the invention. As shown in FIG. 2, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. Example processor cores 214 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more programs 222, and program data 228. In some embodiments, the program 222 may be arranged to execute the instructions 223 of the method 300 according to the invention on an operating system by one or more processors 204 using the program data 228.
Computing device 200 may also include a storage interface bus 234. The storage interface bus 234 enables communication from the storage devices 232 (e.g., removable storage 236 and non-removable storage 238) to the basic configuration 202 via the bus/interface controller 230. Operating system 220, programs 222, and at least a portion of data 224 can be stored on removable storage 236 and/or non-removable storage 238, and loaded into system memory 206 via storage interface bus 234 and executed by one or more processors 204 when computing device 200 is powered on or programs 222 are to be executed.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to communicate with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In the computing device 200 according to the present invention, the program 222 includes a plurality of program instructions of the kernel performance testing method 300, which may instruct the processor 204 to perform the steps of the kernel performance testing method 300 executed in the computing device 200 according to the present invention, so that the components in the computing device 200 implement kernel performance testing by executing the kernel performance testing method 300 according to the present invention.
The computing device 200 may be implemented as a server, such as a file server, a database server or an application server, or as a small-size portable (or mobile) electronic device, such as a personal digital assistant (PDA), a wireless web-browsing device, an application-specific device, or a hybrid device including any of the above functions. The computing device 200 may also be implemented as a personal computer including desktop and notebook computer configurations. In some embodiments, the computing device 200 is configured to perform the kernel performance testing method 300.
FIG. 3 shows a flowchart of a kernel performance testing method 300 according to an exemplary embodiment of the invention. The kernel performance testing method 300 of the present invention is suitable to be executed in a computing device. As shown in FIG. 3, the kernel performance testing method 300 starts with step S310: acquiring first hardware simulation parameters according to the simulation target hardware, and constructing a first simulator according to the first hardware simulation parameters. Because of the realities of chip production, operating system vendors often cannot obtain new hardware early enough to carry out the adaptation of the operating system, so the new hardware needs to be simulated; the simulated hardware is the simulation target hardware. Simulating hardware means using software to reproduce the details of how the real hardware runs, so as to approximate the hardware operating environment realistically, reflect problems in the hardware design, and explore the hardware and software design space.
When simulating the simulation target hardware, hardware simulation parameters (including the first hardware simulation parameters and the second hardware simulation parameters) describing the simulation target hardware are needed; specifically, the first hardware simulation parameters can be obtained from design information of the simulation target hardware published on the network and from data provided by the hardware manufacturer. However, the first hardware simulation parameters obtained in this way are often incomplete, or some parameters differ considerably from the simulation target hardware. In order to reduce the difference from the real hardware, a first simulator is constructed according to the first hardware simulation parameters and its parameters are adjusted, so as to obtain a second simulator closer to the simulation target hardware.
In order to adapt the operating system and perform the related tuning work, the kernel of the operating system is run in the hardware operating environment constructed by the simulator (including the first simulator and the second simulator), and the running behavior of the kernel is examined, so that the operating system can be optimized in various respects. The simulation target hardware simulated by the simulator comprises a processor and the communication interfaces related to the processor; the simulated processor executes the instructions related to the operating system and the kernel, while the communication interfaces related to the processor simulate data transmission between the processor and other hardware, the other hardware including internal memory, external storage, network cards and the like.
When a simulator is constructed according to hardware simulation parameters, the level of abstraction at which the simulator models the simulation target hardware needs to be considered. The higher the abstraction level, the lower the corresponding time accuracy; the lower the abstraction level, the higher the time accuracy. When simulating the target hardware, the simulator is required to model the real hardware with sufficient time precision so that time periods can be extracted and the detailed running process analyzed, while at the same time avoiding an abstraction level that is so low that too much data is needed to construct the simulator, or that the real hardware would in effect have to be rebuilt from that data.
In the prior art, Wind River's Simics, QEMU and ARM's Keil Microcontroller Development Kit are simulation tools at the abstract operation-model level; however, for improving the performance of an operating system kernel, their simulation time precision does not reach clock-cycle precision and cannot correctly reflect the timing behavior and software bottlenecks of the operating system.
Therefore, in order to model the computer micro-architecture, that is, the organizational structure that implements the computer architecture and exploits parallelism at the instruction level, at a finer granularity and a lower level of abstraction, the hardware simulation parameters used to construct the simulator (including the first hardware simulation parameters and the second hardware simulation parameters) include the processor cache replacement policy, the processor cache prefetch mechanism, the processor cache size, the processor cache delay and the processor operation weight. A simulator constructed according to these parameters can accurately simulate the time behavior and performance of program execution while still running fast enough.
The first hardware simulation parameters also include the processor main frequency, the branch predictor type, the GHB size, the BTB size, the RAS size, the cache line size, the number of MSHRs, the write buffer size, the size of the Translation Lookaside Buffer (TLB) in the processor, the TLB type, the TLB replacement policy, the number of adders, the number of multipliers, the number of FPUs/SIMD units, the LOAD/STORE unit attributes, the ROB size, the LSQ size, and the delay configuration of each instruction of the processor. These hardware simulation parameters may be obtained from publicly available information or from data provided by the hardware manufacturer; if some parameters cannot be obtained, default values may be filled in to construct the simulator.
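Because some of these parameters may not be obtainable, defaults are filled in. The sketch below is a hedged illustration of that fallback; the parameter keys and default values are this editor's assumptions, not figures from any vendor datasheet.

```python
# Hypothetical defaults for the first hardware simulation parameters; any
# value that cannot be obtained for the simulation target hardware keeps its
# default so that a first simulator can still be constructed.
DEFAULT_PARAMS = {
    "l1i_size_kib": 64, "l1d_size_kib": 64, "l2_size_kib": 512,
    "cache_line_bytes": 64, "replacement_policy": "LRU",
    "prefetch_mechanism": "fixed", "l1_latency_cycles": 4,
    "l2_latency_cycles": 12, "alu_calc_weight": 10.0, "fpu_calc_weight": 10.0,
    "branch_predictor": "tournament", "tlb_entries": 64,
}

def first_simulation_parameters(known_params):
    """Overlay the published/vendor-provided values on top of the defaults."""
    params = dict(DEFAULT_PARAMS)
    params.update({k: v for k, v in known_params.items() if v is not None})
    return params

# Example: only the cache sizes are published for the target hardware.
params_v1 = first_simulation_parameters(
    {"l1i_size_kib": 32, "l1d_size_kib": 32, "l2_size_kib": 1024})
```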
Among the hardware simulation parameters, the branch predictor is a digital circuit that guesses which branch will be taken before the branch instruction finishes executing, in order to improve the flow of the processor's instruction pipeline;
the GHB (Global History Buffer) is a buffer that records the cache miss history and is used in the cache algorithm to improve the cache hit rate and guide the prefetch strategy;
the BTB (Branch Target Buffer) is a buffer used in branch prediction to store the currently most likely jump target address;
the RAS (Return Address Stack) is a hardware stack of limited capacity that stores the return addresses of called functions to speed up function-call returns;
the cache line is the minimum unit of storage and operation in the cache;
the MSHRs (Miss Status Holding Registers) are the registers that track outstanding cache misses;
the write buffer is a dedicated buffer in the cache designed to accelerate writes to memory;
the Translation Lookaside Buffer (TLB) is a cache in the CPU used by the memory management unit to speed up virtual-to-physical address translation;
the FPU (Floating Point Unit) is the floating-point operation unit in the processor, used to execute floating-point calculations;
SIMD (Single Instruction, Multiple Data) refers to the vector operation unit in the processor, used to apply a single instruction to multiple data;
the LOAD/STORE unit attributes include the fetch width, decode width, issue width, commit width, prefetch instruction cache size, and instruction window size;
the ROB (Reorder Buffer) is an internal CPU buffer used for instruction reordering;
the LSQ (Load Store Queue) is a buffer established to hold in-flight memory operations in the pipeline when instructions or data are moved from the L1 cache to the ALU and registers.
The simulator built includes a processor that includes a Cache memory (Cache), an ALU, and an FPU. The Cache memory (Cache) comprises a first-level Cache (L1 Cache), a second-level Cache (L2 Cache) and a third-level Cache (LLC). The first level Cache (L1 Cache) comprises a first level instruction Cache (L1-iCache) and a first level data Cache (L1-dCache); the ALU is an arithmetic logic unit.
The processor cache replacement strategy for constructing the simulator comprises a first-level instruction cache line replacement strategy, a first-level data cache line replacement strategy and a second-level cache line replacement strategy. The processor cache prefetch mechanism includes a first level instruction cache prefetch mechanism, a first level data cache prefetch mechanism, and a second level cache prefetch mechanism. The processor cache size includes a first level instruction cache size, a first level data cache size, and a second level cache size. The processor cache delays include a first level instruction cache delay, a first level data cache delay, and a second level cache delay. The processor operation weight includes an ALU operation weight (ALU-calc-weight) and an FPU operation weight (FPU-calc-weight).
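The decomposition above can be captured in a small data structure. The following sketch is illustrative only; the class and field names are assumptions that mirror the levels and weights just listed.

```python
from dataclasses import dataclass

@dataclass
class CacheLevelParams:
    """Replacement policy, prefetch mechanism, size and delay for one cache."""
    replacement_policy: str   # e.g. "LRU" or "pseudo-random"
    prefetch_mechanism: str   # e.g. "fixed" or "predictive"
    size_kib: int
    delay_cycles: int

@dataclass
class ProcessorSimParams:
    """Processor-level hardware simulation parameters per the decomposition above."""
    l1_icache: CacheLevelParams
    l1_dcache: CacheLevelParams
    l2_cache: CacheLevelParams
    alu_calc_weight: float    # ALU-calc-weight
    fpu_calc_weight: float    # FPU-calc-weight
```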
According to one embodiment of the invention, the simulator may be implemented as a GEM5 simulator. The GEM5 simulator is a hardware simulator supporting different instruction sets (Alpha, ARM, MIPS, PowerPC, SPARC and x86) and different operating systems (Linux, VxWorks, Solaris, FreeBSD, QNX and RTEMS), created by combining M5 and GEMS, and it integrates the advantages of M5 (processor simulation) and GEMS (memory system simulation). The GEM5 simulator implements support for the various objects and events in the memory model, the processor model and the hardware system through C++, Python and Ruby programming, thereby simulating the running behavior of the hardware system completely and accurately. The GEM5 simulator provides a flexible, modular simulation system, so that hardware design engineers can explore the influence of different micro-architectural feature combinations on hardware performance, and it provides a means of measuring software execution with clock-cycle-level time precision.
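For a GEM5-based implementation, the parameters map onto gem5's Python configuration objects. The fragment below is a hedged sketch: it assumes gem5's standard Cache, LRURP, RandomRP and StridePrefetcher configuration classes, whose exact names and fields vary between gem5 versions, and it must be run under gem5's bundled Python interpreter as part of a full system configuration.

```python
# Fragment of a gem5 Python configuration applying the cache-related
# hardware simulation parameters; class and field names follow gem5's
# standard objects but vary between gem5 versions.
from m5.objects import Cache, LRURP, RandomRP, StridePrefetcher

def make_l1_dcache(params):
    """Build an L1 data cache object from the parameter dictionary used above."""
    dcache = Cache()
    dcache.size = "%dkB" % params["l1d_size_kib"]        # processor cache size
    dcache.assoc = params.get("l1d_assoc", 8)
    dcache.tag_latency = params["l1_latency_cycles"]     # processor cache delay
    dcache.data_latency = params["l1_latency_cycles"]
    dcache.response_latency = params["l1_latency_cycles"]
    dcache.mshrs = params.get("mshrs", 4)
    dcache.tgts_per_mshr = params.get("tgts_per_mshr", 20)
    # Cache replacement policy: LRU for the first simulator, pseudo-random
    # if the miss-rate error later forces a change.
    if params["replacement_policy"] == "pseudo-random":
        dcache.replacement_policy = RandomRP()
    else:
        dcache.replacement_policy = LRURP()
    # Cache prefetch mechanism: a stride (predictive) prefetcher stands in
    # for the predictive prefetch algorithm mentioned in the text.
    if params["prefetch_mechanism"] == "predictive":
        dcache.prefetcher = StridePrefetcher()
    return dcache
```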
The first hardware simulation parameters include the processor first cache replacement policy, the processor first cache prefetch mechanism, the processor first cache size, the processor first cache delay and the processor first operation weight, and the first simulator is constructed according to these parameters. A first simulator constructed from these parameters can expose more hardware data, allows problems to be located by single-stepping at breakpoints, and allows the state to be saved and handed over to others for single-step debugging, so that performance bottlenecks can be found more easily and ways of improving performance identified.
Subsequently, step S320 is executed to operate the first simulator to obtain the index parameter. The index parameters include cache miss rate, data access rate and operation rate. The cache miss rate comprises a first level data cache miss rate, a first level instruction cache miss rate, a third level cache miss rate, a data page table cache miss rate and an instruction page table cache miss rate.
The computing device also comprises a performance testing tool, which is used to acquire the performance parameters of the simulator while the simulator runs. Each index parameter is calculated from one or more performance parameters; specifically, when the first simulator is run, the performance testing tool acquires a plurality of performance parameters of the first simulator, and the index parameters are then calculated from these performance parameters. According to an embodiment of the invention, the performance testing tool can be implemented as the perf software, and a perf list command is executed to enumerate the obtainable performance parameters, so that the flow and timing characteristics of the software executed in the gem5 simulation can be observed in real time and performance problems of the actual hardware can be reproduced.
The performance parameters include the number of first-level data cache misses (L1-dcache-load-misses), the total number of first-level data cache loads (L1-dcache-loads), the number of first-level instruction cache misses (L1-icache-load-misses), the total number of first-level instruction cache loads (L1-icache-loads), the number of third-level (last-level) cache misses (LLC-load-misses), the total number of third-level (last-level) cache loads (LLC-loads), the number of data page table cache misses (dTLB-store-misses), the total number of data page table cache stores (dTLB-stores), the number of instruction page table cache misses (iTLB-load-misses), the total number of instruction page table cache loads (iTLB-loads), the number of reference clock cycles (ref-cycles), and the total number of operations.
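A hedged sketch of collecting these counters with perf stat follows; the event names match the list above, but which events are actually available depends on the perf version and the PMU exposed by the (simulated) processor, so missing events simply drop out of the result.

```python
import csv
import subprocess

# Raw performance parameters gathered with `perf stat`; event availability
# depends on the perf version and the PMU of the (simulated) processor.
PERF_EVENTS = [
    "L1-dcache-loads", "L1-dcache-load-misses",
    "L1-icache-loads", "L1-icache-load-misses",
    "LLC-loads", "LLC-load-misses",
    "dTLB-stores", "dTLB-store-misses",
    "iTLB-loads", "iTLB-load-misses",
    "ref-cycles", "instructions", "cpu-cycles",
]

def collect_perf_counters(cmd):
    """Run `perf stat` around cmd (a list of strings) and return {event: count}."""
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(PERF_EVENTS)] + cmd,
        capture_output=True, text=True, check=False)
    counters = {}
    # With -x ',' perf writes one CSV row per event to stderr:
    # value,unit,event,... ; "<not counted>" / "<not supported>" rows are skipped.
    for row in csv.reader(result.stderr.splitlines()):
        if len(row) >= 3 and row[0].strip().isdigit():
            counters[row[2]] = int(row[0])
    return counters
```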
Among the index parameters, the first-level data cache miss rate is obtained by dividing the number of first-level data cache misses by the total number of first-level data cache loads; the first-level instruction cache miss rate is obtained by dividing the number of first-level instruction cache misses by the total number of first-level instruction cache loads; the third-level cache miss rate is obtained by dividing the number of third-level cache misses by the total number of third-level cache loads; the data page table cache miss rate is obtained by dividing the number of data page table cache misses by the total number of data page table cache loads; the instruction page table cache miss rate is obtained by dividing the number of instruction page table cache misses by the total number of instruction page table cache loads; the data access rate is obtained by dividing the specified number of bytes accessed by the measured number of reference clock cycles; and the operation rate is obtained by dividing the specified total number of operations by the number of reference clock cycles.
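The ratios above translate directly into code. The sketch below assumes the counter dictionary produced by the perf sketch earlier; the bytes_accessed and total_operations arguments are supplied by the memory-access and test-case runs described next.

```python
# Index parameters derived from the raw counters, following the ratios above.
# The counter names match the perf events listed earlier; bytes_accessed and
# total_operations come from the measurement runs described below.
def index_parameters(c, bytes_accessed=None, total_operations=None):
    def miss_rate(misses, loads):
        if c.get(loads) and misses in c:
            return c[misses] / c[loads]
        return None

    indices = {
        "l1d_miss_rate": miss_rate("L1-dcache-load-misses", "L1-dcache-loads"),
        "l1i_miss_rate": miss_rate("L1-icache-load-misses", "L1-icache-loads"),
        "llc_miss_rate": miss_rate("LLC-load-misses", "LLC-loads"),
        "dtlb_miss_rate": miss_rate("dTLB-store-misses", "dTLB-stores"),
        "itlb_miss_rate": miss_rate("iTLB-load-misses", "iTLB-loads"),
        # instructions per cycle, used as an additional check further below
        "ipc": c["instructions"] / c["cpu-cycles"] if c.get("cpu-cycles") else None,
    }
    if bytes_accessed is not None and c.get("ref-cycles"):
        # data access rate: specified number of bytes / reference clock cycles
        indices["data_access_rate"] = bytes_accessed / c["ref-cycles"]
    if total_operations is not None and c.get("ref-cycles"):
        # operation rate: specified total number of operations / reference cycles
        indices["operation_rate"] = total_operations / c["ref-cycles"]
    return indices
```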
When the first simulator is run to obtain the data access rate, the processor simulated by the first simulator accesses a specified number of bytes, the performance testing tool measures the access time of the processor at the same time to obtain the number of reference clock cycles, and the specified number of bytes is divided by the number of reference clock cycles to obtain the data access rate. According to an embodiment of the invention, data blocks of 1 machine word, 8 machine words, 64 machine words, 512 machine words, 4096 machine words, ..., up to 4096 × 8 machine words are each accessed 10000 times by the processor, and the corresponding data access rates are then calculated respectively.
When the first simulator is run to obtain the operation rate, the processor simulated by the first simulator runs a specified test case, the performance testing tool measures the running time of the specified test case at the same time to obtain the number of reference clock cycles, and the specified total number of operations is divided by the number of reference clock cycles to obtain the operation rate. The total number of operations is determined by the specified test case. According to one embodiment of the present invention, the specified test cases can be implemented as Dhrystone (testing integer operation speed) and Whetstone (testing floating-point operation speed) from UnixBench.
Subsequently, step S330 is executed to modify the first hardware simulation parameters according to the index parameters to obtain the second hardware simulation parameters. When the first hardware simulation parameters are modified according to the index parameters: if the error value of the cache miss rate of the first simulator is larger than the miss rate error threshold, the processor first cache replacement policy is modified into the processor second cache replacement policy, and the processor first cache prefetch mechanism is modified into the processor second cache prefetch mechanism.
The error value of the cache miss rate of the first simulator is calculated according to the cache miss rate of the first simulator and the cache miss rate of the simulation target hardware, specifically:
error value of the cache miss rate = |a - b| / b, where a is the cache miss rate of the first simulator and b is the cache miss rate of the simulation target hardware. According to one embodiment of the invention, the cache miss rate of the simulation target hardware may be obtained from public data or provided by the hardware vendor. The miss rate error threshold may be set to 5%.
Wherein the first cache replacement policy of the processor comprises an LRU algorithm and the second cache replacement policy of the processor comprises a pseudo-random algorithm. The first cache prefetch mechanism of the processor comprises a fixed prefetch algorithm and the second cache prefetch mechanism of the processor comprises a predictive prefetch algorithm.
Specifically, when calculating the error value of the cache miss rate, an error value is calculated for each item of the cache miss rate, including the first-level data cache miss rate, the first-level instruction cache miss rate, the third-level cache miss rate, the data page table cache miss rate and the instruction page table cache miss rate. For example, the error value of the first-level data cache miss rate is calculated according to the first-level data cache miss rate of the first simulator and the first-level data cache miss rate of the simulation target hardware. When an error value of the cache miss rate is larger than the miss rate error threshold, the processor first cache replacement policy is modified into the processor second cache replacement policy, and the processor first cache prefetch mechanism is modified into the processor second cache prefetch mechanism.
According to an embodiment of the present invention, when the error value of the cache miss rate of the first simulator is greater than the miss rate error threshold, the processor first cache size may additionally be modified so that the error value of the cache miss rate of the first simulator falls below the miss rate error threshold, i.e., the processor first cache size is modified into the processor second cache size. Specifically, if the cache miss rate of the first simulator is greater than that of the simulation target hardware, the processor first cache size is increased to obtain the processor second cache size, for example by adding 15 cache lines; and if the cache hit rate of the first simulator is greater than that of the simulation target hardware (i.e., its miss rate is lower), the processor first cache size is reduced to obtain the processor second cache size. When the processor cache is modified, one cache or all caches may be modified; the present invention does not limit the way in which the processor cache is modified.
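One way to express the miss-rate-driven adjustment in code is sketched below. It is a hedged illustration, not the claimed procedure itself: the parameter keys reuse the dictionaries from the earlier sketches, only the L1 data cache is adjusted, and the 15-cache-line step is simply the example figure quoted above.

```python
# Hedged sketch of the miss-rate-driven adjustment: switch the replacement
# policy and prefetch mechanism, then nudge the (L1 data) cache size toward
# the target; the 15-cache-line step is the example figure from the text.
def adjust_for_miss_rate(params, sim_miss_rate, target_miss_rate, threshold=0.05):
    new_params = dict(params)
    error = abs(sim_miss_rate - target_miss_rate) / target_miss_rate
    if error <= threshold:
        return new_params                       # within tolerance, keep as is
    # processor first -> second cache replacement policy / prefetch mechanism
    new_params["replacement_policy"] = "pseudo-random"
    new_params["prefetch_mechanism"] = "predictive"
    # processor first -> second cache size
    step_kib = 15 * params.get("cache_line_bytes", 64) / 1024.0
    if sim_miss_rate > target_miss_rate:
        new_params["l1d_size_kib"] = params["l1d_size_kib"] + step_kib
    else:
        new_params["l1d_size_kib"] = max(1, params["l1d_size_kib"] - step_kib)
    return new_params
```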
According to one embodiment of the invention, when the error value of the data access rate of the first simulator is larger than the access rate error threshold, the processor first cache delay is modified into the processor second cache delay. The error value of the data access rate of the first simulator is calculated according to the data access rate of the first simulator and the data access rate of the simulation target hardware, specifically:
error value of the data access rate of the first simulator = |c - d| / d, where c is the data access rate of the first simulator and d is the data access rate of the simulation target hardware. According to one embodiment of the invention, the data access rate of the simulation target hardware can be obtained from public material or provided by the hardware manufacturer. The access rate error threshold may be set to 5%.
When the processor first cache delay is modified into the processor second cache delay: if the data access rate of the first simulator is greater than the data access rate of the simulation target hardware, the processor first cache delay is increased to obtain the processor second cache delay; and if the data access rate of the first simulator is less than that of the simulation target hardware, the processor first cache delay is reduced to obtain the processor second cache delay. When the cache delay of the processor cache is modified, one cache or all caches may be modified; the present invention does not limit the way in which the processor cache delay is modified.
According to one embodiment of the invention, when an error value of an operation rate of the first simulator is greater than an operation rate error threshold, a first operation weight of the processor is modified to a second operation weight.
The error value of the operation rate of the first simulator is calculated according to the operation rate of the first simulator and the operation rate of the simulation target hardware, specifically:
and an error value of the operation rate = | e-f |/f, wherein e is the operation rate of the first simulator, and f is the operation rate of the simulation target hardware. According to one embodiment of the present invention, the operation rate of the simulation target hardware can be obtained according to public data or provided by a hardware manufacturer. The operation rate error threshold may be set to 5%.
According to an embodiment of the invention, the operation rate error threshold, the access rate error threshold and the miss rate error threshold can all be adjusted; the smaller these thresholds are set, the smaller the difference between the constructed second simulator and the simulation target hardware, and the closer the second simulator is to the real hardware to be simulated.
When the processor first operation weight is modified into the second operation weight: if the operation rate of the first simulator is greater than the operation rate of the simulation target hardware, the first operation weight is reduced to obtain the second operation weight; and if the operation rate of the first simulator is less than the operation rate of the simulation target hardware, the first operation weight is increased to obtain the second operation weight. Specifically, when the processor first operation weight is modified, the first operation weight of the ALU and the first operation weight of the FPU are modified.
According to one embodiment of the present invention, when the processor first operation weight is modified into the second operation weight and the operation rate of the first simulator is less than the operation rate of the simulation target hardware, the ALU first operation weight of 10 is modified into 12 as the ALU second operation weight, and the FPU first operation weight of 10 is modified into 10.7 as the FPU second operation weight.
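The weight adjustment can be sketched as follows. This is an assumption-laden illustration: the text only says the weights are increased or decreased, so the proportional scaling below is one possible policy, and the example rates are hypothetical values chosen to land near the 12 and 10.7 quoted above.

```python
# Hedged sketch of the operation-weight adjustment. The weights scale the
# simulated ALU/FPU throughput, so a simulator that computes more slowly than
# the target gets larger weights, and vice versa; proportional scaling is an
# assumption, the text only requires an increase or decrease.
def adjust_operation_weights(params, rates, threshold=0.05):
    """rates maps "alu"/"fpu" to (simulator_rate, target_rate) pairs."""
    new_params = dict(params)
    for unit, (sim_rate, target_rate) in rates.items():
        error = abs(sim_rate - target_rate) / target_rate
        if error <= threshold:
            continue
        scale = target_rate / sim_rate          # > 1 if the simulator is slower
        key = unit + "_calc_weight"
        new_params[key] = params[key] * scale
    return new_params

# Hypothetical rates roughly 20% and 7% below the target turn first weights of
# 10 into second weights near the 12 and 10.7 quoted above.
params_v2 = adjust_operation_weights(
    {"alu_calc_weight": 10.0, "fpu_calc_weight": 10.0},
    {"alu": (0.833, 1.0), "fpu": (0.935, 1.0)})
```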
According to one embodiment of the invention, the performance parameters of the simulator acquired by the performance testing tool further include the total number of instructions (instructions) and the number of processor clock cycles (cpu-cycles). The index parameters also include the number of instructions per cycle, which is obtained by dividing the total number of instructions by the number of processor clock cycles. When the index parameters are evaluated, it is judged whether the error value of the number of instructions per cycle is smaller than the instruction count error threshold. The error value of the number of instructions per cycle is calculated according to the number of instructions per cycle of the first simulator and the number of instructions per cycle of the simulation target hardware, specifically:
error value of the number of instructions per cycle of the first simulator = |g - h| / h, where g is the number of instructions per cycle of the first simulator and h is the number of instructions per cycle of the simulation target hardware. According to one embodiment of the present invention, the number of instructions per cycle of the simulation target hardware may be obtained from public data or provided by the hardware vendor. The instruction count error threshold may be set to 5%. When the error value of the number of instructions per cycle of the first simulator is greater than the instruction count error threshold, the first hardware parameters of the first simulator are modified to obtain the second hardware parameters.
According to one embodiment of the invention, the processor first cache replacement policy is modified into the processor second cache replacement policy, the processor first cache prefetch mechanism is modified into the processor second cache prefetch mechanism, and the processor second cache size and the processor second cache delay are obtained; these are taken as the second hardware parameters.
Subsequently, step S340 is executed to construct a second simulator according to the second hardware simulation parameters. The constructed second simulator is closer to the performance of the simulation target hardware. The second hardware parameters further include the processor main frequency, the branch predictor type, the GHB size, the BTB size, the RAS size, the cache line size, the number of MSHRs, the write buffer size, the size of the Translation Lookaside Buffer (TLB) in the processor, the TLB type, the TLB replacement policy, the number of adders, the number of multipliers, the number of FPUs/SIMD units, the LOAD/STORE unit attributes, the ROB size, the LSQ size, and the delay configuration of each instruction of the processor.
Subsequently, step S350 is executed to acquire the kernel to be tested and run it in the second simulator to obtain the test parameters. According to one embodiment of the invention, the second simulator supports a plurality of operating modes; when testing the kernel to be tested, the atomic mode is adopted, in which instructions are executed atomically. The kernel to be tested can be implemented, without limitation, as a Linux kernel. When the kernel to be tested runs in the second simulator, a specified test program is executed.
Finally, step S360 is executed to generate the performance test result of the kernel to be tested according to the test parameters. The test parameters include the running duration of each code section. When the running duration of a code section is greater than the preset duration, the running process of the second simulator executing that code section is acquired, and the performance test result of the kernel to be tested is generated according to that running process. The preset duration refers to the duration for which the simulation target hardware executes the code section. The running process of the second simulator executing the code section includes the processor's reading and processing of data, and so on. When the performance test result of the kernel to be tested is generated, the time-consuming and inefficient parts of the kernel to be tested are identified as the places that need improvement, so that developers can improve the operating system kernel and tune the operating system.
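A minimal sketch of this comparison follows, under the assumption that per-code-section durations can be collected from the second simulator and that the preset durations of the simulation target hardware are known; all names are illustrative.

```python
# Hedged sketch of step S360: compare each code section's running duration in
# the second simulator with the preset duration (the time the simulation
# target hardware needs for the same section) and report the slow sections.
def performance_test_result(section_durations, preset_durations):
    """Both arguments map a code-section name to a duration (e.g. in cycles)."""
    slow_sections = []
    for section, duration in section_durations.items():
        preset = preset_durations.get(section)
        if preset is not None and duration > preset:
            slow_sections.append({
                "section": section,
                "simulated_duration": duration,
                "preset_duration": preset,
                "slowdown": duration / preset,
            })
    # The slowest sections are the places the kernel to be tested should improve.
    return sorted(slow_sections, key=lambda s: s["slowdown"], reverse=True)
```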
The kernel performance testing method of the present invention is suitable to be executed in a computing device and comprises the following steps: first hardware simulation parameters are obtained according to the simulation target hardware, and a first simulator is constructed according to the first hardware simulation parameters. Because there is a difference between a first simulator constructed from unprocessed hardware simulation parameters and the actual target hardware to be simulated, the first simulator is run to obtain index parameters, and the first hardware simulation parameters are modified according to the index parameters to obtain second hardware simulation parameters, so that a second simulator closer to the target hardware is constructed according to the second hardware simulation parameters. Finally, the kernel to be tested is acquired and run in the second simulator to obtain test parameters, a performance test result of the kernel to be tested is generated according to the test parameters, and the adaptation and tuning of the operating system to the new hardware are carried out according to the performance test result. The method can obtain a simulator that is closer to the simulation target hardware, improves the tuning quality of the operating system, accelerates the adaptation of new hardware, can simulate the time behavior and performance bottlenecks of the kernel accurately enough, improves the speed at which an operating system vendor tracks and repairs performance problems, and avoids having to build a complex environment in order to track performance problems.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim.
Those skilled in the art will appreciate that the modules or units or groups of devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. Modules or units or groups in embodiments may be combined into one module or unit or group and may furthermore be divided into sub-modules or sub-units or sub-groups. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to execute the kernel performance testing method of the present invention according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer-readable media includes both computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (10)

1. A kernel performance testing method adapted to be executed in a computing device, the method comprising the steps of:
acquiring first hardware simulation parameters according to target hardware to be simulated, and constructing a first simulator according to the first hardware simulation parameters;
running the first simulator to obtain metric parameters;
modifying the first hardware simulation parameters according to the metric parameters to obtain second hardware simulation parameters;
constructing a second simulator according to the second hardware simulation parameters;
acquiring a kernel to be tested, and running the kernel to be tested in the second simulator to obtain test parameters;
generating a performance test result of the kernel to be tested according to the test parameters;
wherein the metric parameters comprise a data access rate, and the step of modifying the first hardware simulation parameters according to the metric parameters comprises:
when an error value of the data access rate of the first simulator is larger than an access rate error threshold, modifying a processor first cache delay in the first hardware simulation parameters into a processor second cache delay.
2. The method of claim 1, wherein the first hardware simulation parameters further comprise a processor first cache replacement policy, a processor first cache prefetching mechanism, a processor first cache size, a processor first cache delay and a processor first operation weight, and the step of constructing the first simulator according to the first hardware simulation parameters comprises:
constructing the first simulator according to the processor first cache replacement policy, the processor first cache prefetching mechanism, the processor first cache size, the processor first cache delay and the processor first operation weight.
3. The method of claim 2, wherein the metric parameters further comprise a cache miss rate, and the step of modifying the first hardware simulation parameters according to the metric parameters comprises:
when an error value of the cache miss rate of the first simulator is larger than a miss rate error threshold, modifying the processor first cache replacement policy into a processor second cache replacement policy, and modifying the processor first cache prefetching mechanism into a processor second cache prefetching mechanism.
4. The method of claim 3, wherein the processor first cache replacement policy comprises an LRU algorithm and the processor second cache replacement policy comprises a pseudo-random algorithm.
5. The method of claim 3 or 4, wherein the processor first cache prefetching mechanism comprises a fixed prefetching algorithm, and the processor second cache prefetching mechanism comprises a predictive prefetching algorithm.
6. The method of claim 5, wherein the method further comprises the step of:
when the error value of the cache miss rate of the first simulator is larger than the miss rate error threshold, modifying the processor first cache size into a processor second cache size.
7. The method of claim 6, wherein the metric parameters further comprise an operation rate, the method further comprising the step of:
when an error value of the operation rate of the first simulator is larger than an operation rate error threshold, modifying the processor first operation weight into a processor second operation weight.
8. The method of claim 7, wherein the test parameters comprise a running duration of a code section, and the step of generating the performance test result of the kernel to be tested according to the test parameters comprises:
when the running duration of the code section is longer than a preset duration, acquiring the running process of the code section as executed by the second simulator; and
generating the performance test result of the kernel to be tested according to the running process of the code section as executed by the second simulator.
9. A computing device, comprising:
one or more processors;
a memory; and
one or more programs stored in the memory, the one or more programs comprising instructions for performing the method of any of claims 1-8.
10. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the method of any of claims 1-8.
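As an informal illustration of the calibration flow recited in claims 1-7: the first simulator is run on a reference workload, each metric it reports is compared with the value measured on the target hardware, and any parameter whose error exceeds its threshold is replaced before the second simulator is built. The Python sketch below follows that flow; the parameter and metric names, the threshold dictionary and the concrete adjustment rules (adding one cycle of cache delay, doubling the cache size, rescaling the operation weight) are assumptions made for the sketch, since the claims only state that a first value is replaced by a second value.

```python
# Illustrative sketch of the parameter calibration in claims 1-7.
# SimParams, the metric names and the adjustment rules are assumptions,
# not structures defined by the patent.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SimParams:
    cache_replacement: str      # "lru" or "pseudo_random" (claim 4)
    cache_prefetch: str         # "fixed" or "predictive" (claim 5)
    cache_size_kb: int
    cache_delay_cycles: int
    operation_weight: float

def relative_error(simulated: float, measured: float) -> float:
    return abs(simulated - measured) / measured

def calibrate(first: SimParams, sim: dict, hw: dict, thresholds: dict) -> SimParams:
    """Derive the second hardware simulation parameters from the first ones."""
    second = first
    # Claim 1: data access rate error too large -> modify the cache delay.
    if relative_error(sim["access_rate"], hw["access_rate"]) > thresholds["access_rate"]:
        second = replace(second, cache_delay_cycles=second.cache_delay_cycles + 1)
    # Claims 3 and 6: cache miss rate error too large -> swap the replacement
    # policy and prefetching mechanism, and modify the cache size.
    if relative_error(sim["miss_rate"], hw["miss_rate"]) > thresholds["miss_rate"]:
        second = replace(second,
                         cache_replacement="pseudo_random",
                         cache_prefetch="predictive",
                         cache_size_kb=second.cache_size_kb * 2)
    # Claim 7: operation rate error too large -> modify the operation weight.
    if relative_error(sim["op_rate"], hw["op_rate"]) > thresholds["op_rate"]:
        second = replace(second,
                         operation_weight=second.operation_weight
                         * hw["op_rate"] / sim["op_rate"])
    return second
```

A second simulator would then be constructed from the returned parameters and used to run the kernel under test, as in claim 1.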
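Claims 4 and 5 name concrete candidates for the replacement policy and prefetching mechanism (LRU versus pseudo-random replacement, fixed versus predictive prefetching). The toy sketch below, which is not the patent's implementation, models a single cache set of fixed associativity under the two replacement policies; the associativity, tags and trace are illustrative assumptions. A cyclic access pattern one block larger than the associativity is the classic case where LRU thrashes while pseudo-random replacement still produces hits, which is the kind of behavioural difference the calibration step can exploit.

```python
# Toy model of the two replacement policies named in claim 4, for one cache set.
import random
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()               # tag -> None, ordered by recency

    def access(self, tag) -> bool:
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)          # refresh recency on a hit
        else:
            if len(self.lines) >= self.ways:
                self.lines.popitem(last=False)   # evict the least recently used line
            self.lines[tag] = None
        return hit

class PseudoRandomSet:
    def __init__(self, ways: int, seed: int = 0):
        self.ways = ways
        self.lines = []
        self.rng = random.Random(seed)

    def access(self, tag) -> bool:
        hit = tag in self.lines
        if not hit:
            if len(self.lines) >= self.ways:
                self.lines.pop(self.rng.randrange(self.ways))  # evict a random way
            self.lines.append(tag)
        return hit

# Cyclic trace of 5 tags over a 4-way set: LRU misses on every access,
# while pseudo-random replacement keeps some lines resident.
trace = [0, 1, 2, 3, 4] * 160
for cache in (LRUSet(4), PseudoRandomSet(4)):
    misses = sum(not cache.access(t) for t in trace)
    print(type(cache).__name__, "miss rate:", misses / len(trace))
```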
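Claim 8 restricts detailed capture to hot code sections: only a code section whose running duration exceeds a preset duration has its running process acquired from the second simulator and folded into the performance test result. A minimal sketch of that filtering step follows; the section records, the threshold value and the fetch_trace callback are hypothetical.

```python
# Sketch of the claim-8 filtering step; names and formats are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CodeSection:
    name: str
    duration_ms: float                 # running duration from the test parameters

PRESET_DURATION_MS = 50.0              # assumed preset duration

def build_test_result(sections: List[CodeSection],
                      fetch_trace: Callable[[str], list]) -> Dict[str, dict]:
    result: Dict[str, dict] = {}
    for section in sections:
        entry = {"duration_ms": section.duration_ms}
        if section.duration_ms > PRESET_DURATION_MS:
            # Acquire the running process of this code section as executed
            # by the second simulator (supplied here as a callback).
            entry["running_process"] = fetch_trace(section.name)
        result[section.name] = entry
    return result
```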
CN202111448514.XA 2021-12-01 2021-12-01 Kernel performance testing method, computing device and storage medium Active CN113868068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111448514.XA CN113868068B (en) 2021-12-01 2021-12-01 Kernel performance testing method, computing device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111448514.XA CN113868068B (en) 2021-12-01 2021-12-01 Kernel performance testing method, computing device and storage medium

Publications (2)

Publication Number Publication Date
CN113868068A (en) 2021-12-31
CN113868068B (en) 2022-03-18

Family

ID=78985534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111448514.XA Active CN113868068B (en) 2021-12-01 2021-12-01 Kernel performance testing method, computing device and storage medium

Country Status (1)

Country Link
CN (1) CN113868068B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033434B (en) * 2022-06-07 2023-05-26 海光信息技术股份有限公司 Method and device for calculating kernel performance theoretical value and storage medium
CN115658455B (en) * 2022-12-07 2023-03-21 北京开源芯片研究院 Processor performance evaluation method and device, electronic equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008029A (en) * 2013-02-27 2014-08-27 腾讯科技(深圳)有限公司 Kernel performance testing method and device
CN112000550A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Operating system parameter tuning method, system, device and computer readable medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5951704A (en) * 1997-02-19 1999-09-14 Advantest Corp. Test system emulator
US20070136726A1 (en) * 2005-12-12 2007-06-14 Freeland Gregory S Tunable processor performance benchmarking
US7756695B2 (en) * 2006-08-11 2010-07-13 International Business Machines Corporation Accelerated simulation and verification of a system under test (SUT) using cache and replacement management tables
CN105302950B (en) * 2015-10-19 2018-07-24 北京精密机电控制设备研究所 A kind of programmable logic device crosslinking emulation test method of soft and hardware collaboration
CN107515663B (en) * 2016-06-15 2021-01-26 北京京东尚科信息技术有限公司 Method and device for adjusting running frequency of central processing unit kernel
CN106066809B (en) * 2016-06-20 2019-06-18 深圳市普中科技有限公司 A kind of emulation debugging method and system based on 51 emulator of ARM kernel

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008029A (en) * 2013-02-27 2014-08-27 腾讯科技(深圳)有限公司 Kernel performance testing method and device
CN112000550A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Operating system parameter tuning method, system, device and computer readable medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cross-compiling the Linux kernel and running Linux on QEMU-emulated hardware; Ginger-Zone; https://blog.csdn.net/yjj546542806/article/details/92840589; 2019-06-21; pp. 1-2 *

Also Published As

Publication number Publication date
CN113868068A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
Gutierrez et al. Sources of error in full-system simulation
Bazzaz et al. An accurate instruction-level energy estimation model and tool for embedded systems
CN113868068B (en) Kernel performance testing method, computing device and storage medium
CN105408859B (en) For instructing the method and system of scheduling
JP5961971B2 (en) Simulation apparatus, method, and program
JPWO2012049728A1 (en) Simulation apparatus, method, and program
Herdt et al. Fast and accurate performance evaluation for RISC-V using virtual prototypes
EP3391224B1 (en) Method and apparatus for data mining from core traces
US11263150B2 (en) Testing address translation cache
JP5514211B2 (en) Simulating processor execution with branch override
CN113886162A (en) Computing equipment performance test method, computing equipment and storage medium
US20130013283A1 (en) Distributed multi-pass microarchitecture simulation
US20140316761A1 (en) Simulation apparatus and storage medium
Van Biesbrouck et al. Efficient sampling startup for sampled processor simulation
Ottlik et al. Context-sensitive timing simulation of binary embedded software
Van Dung et al. Cache simulation for instruction set simulator QEMU
US9658849B2 (en) Processor simulation environment
Sazeides Modeling value speculation
CN115858092A (en) Time sequence simulation method, device and system
JP2004070862A (en) Memory resources optimization support method, program thereof, and system thereof
US9460247B2 (en) Memory frame architecture for instruction fetches in simulation
Sun et al. Using execution graphs to model a prefetch and write buffers and its application to the Bostan MPPA
Vora et al. Integration of pycachesim with QEMU
Jin et al. The study of hierarchical branch prediction architecture
Uma et al. Hardware Evaluation and Software Framework Construction for Performance Measurement of Embedded Processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant