CN109690497B - System and method for differentiating function performance by input parameters

Info

Publication number: CN109690497B
Application number: CN201780055415.0A
Authority: CN
Inventors: A. Yasin; S. Bratanov
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2016-09-27
Filing date: 2017-08-16
Publication date: 2023-12-19
Anticipated expiration: 2037-08-16
Also published as: US10969995B2; US20190155540A1; US10140056B2; WO2018063550A1; US20180088861A1; CN109690497A; DE112017004837T5

Abstract

Systems and methods for monitoring processor performance are disclosed. The described embodiments relate to differentiating function performance according to input parameters. In one embodiment, a method includes: configuring a counter contained in the processor for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value; configuring a Precision Event Based Sampling (PEBS) processor circuit to generate a PEBS record after at least one overflow and store the PEBS record in a PEBS memory buffer, the PEBS record containing at least one stack entry read from a stack after the at least one overflow; enabling the PEBS processor circuit to generate and store a PEBS record after at least one overflow; generating a PEBS record after at least one overflow and storing the PEBS record in a PEBS memory buffer; and storing the contents of the PEBS memory buffer to a PEBS trace file in memory.

Description

System and method for differentiating function performance by input parameters

Technical Field

Embodiments described herein relate generally to monitoring performance of a computer processor. In particular, the described embodiments relate generally to systems and methods for differentiating function performance by input parameters.

Background

Performance monitoring of a processor may be used to characterize, debug, and tune software and program code. Breaking down performance characteristics by function and by function argument may help to select the correct optimization strategy for different calls of the same function. The performance of the same function may depend on its input parameters, and the function may be optimized differently for different function argument values.

Monitoring processor performance in executing functions based on different argument values may help optimize execution of the functions in the processor. For example, memory copy operations depend heavily on the length of the input/output array, and different lengths call for different optimizations: shorter copies are best served by general purpose registers, while longer copies perform better using SSE/AVX registers.

Drawings

Various advantages of the embodiments disclosed herein will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the accompanying drawings, in which:

FIG. 1 is a block diagram illustrating a processor according to one embodiment;

FIG. 2 illustrates an embodiment of a process for generating and storing PEBS records in a memory buffer and for storing the memory buffer to a PEBS trace file;

FIG. 3 illustrates an embodiment of a process for programming a PEBS processor (handler) circuit to monitor processor performance and generate a PEBS record to be stored in a PEBS memory buffer and then in a PEBS trace file;

FIG. 4 illustrates an embodiment of post-processing a PEBS trace file to decompose performance data on a function call by function call basis;

FIG. 5 is a block diagram of a register architecture according to one embodiment;

FIG. 6 is a register stack according to an embodiment;

FIG. 7 illustrates an embodiment of a PEBS data record configuration manager;

FIG. 8 illustrates different registers for enabling event-based sampling on a fixed function counter, according to one embodiment;

FIG. 9 illustrates different registers for enabling event-based sampling on a fixed function counter, according to one embodiment;

FIG. 10 illustrates updating of a data store buffer management area according to one embodiment;

FIGS. 11A-11B illustrate improvements to performance monitoring implemented by embodiments of the present invention;

FIG. 12 is a block diagram of an exemplary computer system formed with a processor that includes an execution unit for executing instructions according to an embodiment of the present disclosure;

FIG. 13 is a block diagram of a first more specific exemplary system in accordance with an embodiment of the present invention;

FIG. 14 is a block diagram of a first more specific exemplary system in accordance with an embodiment of the present invention;

FIG. 15 is a block diagram of a second more specific exemplary system according to an embodiment of the present invention;

FIG. 16 is a block diagram of a SoC according to an embodiment of the present invention;

FIG. 17 is a block diagram of a processor with more than one core, integrated memory controller, and integrated graphics according to an embodiment of the invention; and

FIG. 18 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the present invention.

Detailed Description

In the following description, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

References in the specification to "one embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Furthermore, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

Monitoring processor performance in executing functions that are called with different argument values and types helps to optimize execution of the functions in the processor based on those values and types. For example, memory copy operations depend heavily on the length of the input/output array, and different lengths call for different optimizations: shorter copies are best served by general purpose registers, while longer copies perform better using SSE/AVX registers.

Another example involves the needs of engineers and scientists working at high power particle accelerators: they apply the same physical modeling function to particle collision data, but the input may call for different optimizations to handle different particles with different trajectories. Understanding the performance impact of different trajectories makes it possible to optimize the function for different types of input.

Instrumentation-based tracing of arguments is not as practical as the embodiments disclosed herein, because instrumented code may run for weeks or months and the resulting performance information may be distorted.

Extending traditional sampling-based statistical analysis methods is also costly, and such methods do not work when interrupts are masked (for example, in kernel mode drivers).

The embodiments disclosed herein implement a HW (hardware) assisted method of tracking function performance based on function arguments in a low overhead manner that allows distinguishing between performance changes of the same function based on input parameters of the function. Some embodiments describe retrieving enough information to reconstruct arguments for a sampled function call and associating performance characteristics with the sampled function based on its actual parameters. Some embodiments allow for accurate sampling of retired function calls. Some embodiments describe extending Precision Event Based Sampling (PEBS) to have the ability to store stack memory contents alongside architectural metadata, register states, and other context information. Post-processing is performed on the stored stack memory contents to decompose the arguments for each function call.

Precision event based sampling and non-precision event based sampling (PEBS and NPEBS)

The performance monitoring capability employed in some embodiments of the processor is built on two sets of event counters: fixed function counters and general purpose counters. Three fixed function counters are currently defined and implemented to count: (1) retired instructions, (2) a reference clock, and (3) a core clock. Various concepts associated with Precision Event Based Sampling (PEBS) and non-precision event based sampling (NPEBS) are described in connection with the description of embodiments of the disclosure.

As used herein, a precise event is a performance event that is linked to a particular instruction or micro-operation in the instruction trace and occurs when the instruction or micro-operation retires. Such precise events may include, but are not limited to, retired instructions, retired branch instructions, cache references, or cache misses, to name a few. By contrast, a non-precise event is a performance event that is not linked to a particular instruction or micro-operation in the instruction trace, or that can occur speculatively even when the instruction or micro-operation does not retire. By way of example, non-precise events may include, but are not limited to, reference clock ticks, core clock ticks, and cycles during which interrupts are masked, to name a few.

In some embodiments, the performance of the processing device is monitored to manage precise and non-precise events. In some embodiments, a processing device utilizes mechanisms on the processing device to track precise and non-precise events and store architecture metadata related to those events in a non-invasive manner without intervention of a Performance Monitoring Interrupt (PMI).

The operation of the processing device may include monitoring the performance of the system for the occurrence of a plurality of events. An event includes any operation, occurrence, or action in the processor. In one embodiment, an event is a response to a given instruction and data stream in a processing device. The event may be associated with architecture metadata that includes state information of the processing device including, but not limited to, an instruction pointer, a timestamp counter, and a register state.

In some embodiments, the performance counter is configured to count one or more types of events. When the counter is being incremented or decremented, the software reads the counter at selected time intervals to determine the number of events that have been counted between the time intervals. The counter can be implemented in a number of ways. In one embodiment, the counter is decremented from a positive start value and overflows when the count reaches zero. In another embodiment, the counter starts at a zero value and counts up until it overflows at the specified value. In yet another embodiment, the counter starts at a negative value and is incremented until it overflows when it reaches zero. The performance counter may generate a performance record or Performance Monitoring Interrupt (PMI) when the counter overflows. To trigger overflow, the counter may be preset to a modulus value that can cause the counter to overflow after a certain number of events have been counted, which generates a PMI or performance record, such as a Precision Event Based Sampling (PEBS) record as described in detail herein below.
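The three counter arrangements described above can be compared with a short software model. This is only an illustrative sketch: the function name `events_until_overflow` and the sampling period of 1000 are assumptions made for the example and do not appear in the embodiments, which implement the counter in hardware.

```python
def events_until_overflow(start: int, step: int, overflow_at: int) -> int:
    """Return how many events are counted before the counter overflows."""
    distance = overflow_at - start
    if step == 0 or distance % step != 0 or distance // step < 0:
        raise ValueError("counter would never reach the overflow value")
    count, events = start, 0
    while count != overflow_at:
        count += step  # one counted event moves the counter by one step
        events += 1
    return events


SAMPLE_PERIOD = 1000  # hypothetical number of events between samples

# Decrement from a positive start value; overflow when the count reaches zero.
down_from_positive = events_until_overflow(SAMPLE_PERIOD, -1, 0)
# Start at zero and count up; overflow at the specified value.
up_from_zero = events_until_overflow(0, +1, SAMPLE_PERIOD)
# Start at a negative value and increment; overflow when the count reaches zero.
up_from_negative = events_until_overflow(-SAMPLE_PERIOD, +1, 0)

print(down_from_positive, up_from_zero, up_from_negative)
```

In this model all three arrangements produce an overflow after the same number of events; they differ only in the preset value and the direction of counting.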

Tracking precise events

There are several types of mechanisms for monitoring and managing various events. One type is the PEBS mechanism, which monitors and manages precise events. A precise event is a performance event that is linked to a particular instruction or micro-operation in the instruction trace and occurs when the instruction or micro-operation retires. Such precise events may include, but are not limited to, retired instructions, retired branch instructions, cache references, or cache misses, to name a few. The PEBS mechanism may include several components, such as Event Select (ES) controls, performance counters, PEBS enable circuitry, and PEBS processor circuitry. An ES control may be programmed with an event identifier that causes the performance counter corresponding to the ES control to begin tracking (e.g., counting occurrences of) the programmed event corresponding to the event identifier.

Embodiments of the present disclosure also include a PEBS enable circuit of the processing device that controls when PEBS records are generated. When the PEBS enable circuit is activated, a PEBS record is stored in a memory of the PEBS processor circuit when the performance counter corresponding to the PEBS enable circuit overflows. In one embodiment, the user activates or sets the PEBS enable circuit. The PEBS record contains architecture metadata that includes the system state at the time of the performance counter overflow. Such architecture metadata may include, but is not limited to, an instruction pointer (IP), a timestamp counter (TSC), and register states. In some embodiments, the PEBS record further comprises at least one stack entry identified by a stack pointer. In some embodiments, the PEBS record includes X doublewords from the top of the stack. Thus, PEBS records not only allow the location of precise events in the instruction trace to be profiled accurately, but also provide additional information for use in software optimization, hardware optimization, performance tuning, and the like.

Tracking imprecise events

Embodiments of the present disclosure further utilize a PEBS mechanism to track and manage imprecise events for processing devices. Imprecise events are performance events that are not linked to a particular instruction or micro-operation in the instruction trace, or that can occur speculatively even when the instruction or micro-operation does not retire. By way of example, imprecise events may include, but are not limited to, reference clock ticks, core clock ticks, cycles during which interrupts are masked, and so forth.

Some embodiments introduce non-precision event based sampling (NPEBS) processor circuitry of a processing device that allows the NPEBS processor circuitry to generate NPEBS records for programmed non-precision events and store the NPEBS records for non-precision events in a PEBS memory buffer of the PEBS processor circuitry.

In some embodiments, the NPEBS record shares the same format as the PEBS record. In other embodiments, the NPEBS record is formatted differently than the PEBS record.

The PEBS processor circuit and the NPEBS processor circuit may share some circuitry. The NPEBS processor circuit may use the resources of the PEBS processor circuit, differing from the PEBS processor circuit only in name. In one example, when an ES control is programmed with a non-precise event identifier, the performance counter associated with the ES control and the PEBS enable circuit tracks the programmed non-precise event. In one embodiment, the NPEBS processor circuit is coupled to a PEBS enable circuit that is coupled to the performance counter, such that the PEBS enable circuit causes the NPEBS processor circuit to generate NPEBS records for non-precise events when the performance counter overflows. Accordingly, architecture metadata associated with non-precise events is captured without requiring a PMI.

In some embodiments, the NPEBS processor circuit controls the timing of the generation of NPEBS records for non-precise events. In one embodiment, NPEBS records for imprecise events are generated immediately upon overflow of a performance counter tracking the imprecise events. In another embodiment, NPEBS records for non-precise events are generated immediately after performance counters tracking the non-precise events overflow (e.g., upon execution of subsequent instructions). In one embodiment, the NPEBS processor stores NPEBS records for non-precise events in a memory store of the NPEBS processor circuit.

The above technique of capturing the architectural state of a system associated with a non-precise event without using a PMI has a number of advantages. One such advantage is that storing the architectural state of imprecise events in memory storage in this manner is not suppressed when interrupts are masked. Previously, imprecise events could only leave a PMI pending and could not log a PEBS record. Unless the PMI is configured to cause a non-maskable interrupt (NMI), the PMI is blocked while interrupts are masked, which obscures where the sample actually occurred. The use of NMIs may lead to stability and security problems on the system, and not all operating systems allow the use of NMIs. Interrupts are masked in interrupt handlers, context switches, locking algorithms, and other critical regions within privileged code (ring 0). With the shift to SoCs (systems on chip), which require interrupts for interaction between a CPU and Intellectual Property (IP) units, the amount of time spent in interrupt handling has increased. Many event-based sampling profiles are erroneous because the PMI handler cannot be entered while interrupts are masked, resulting in the capture of an incorrect instruction pointer. In embodiments of the present disclosure, placing the details of events in the PEBS buffer is not suppressed when interrupts are masked, thus avoiding the drawbacks of PMI handlers mentioned above.

Another advantage of using NPEBS processor circuitry to generate NPEBS records for non-precise events is faster detection, and thus greater accuracy. At least one stack entry and the instruction pointer (along with additional information about architecture state) may be captured in a hardware buffer with less latency than is required for the interrupt handler to be entered on a PMI. A further advantage is lower sampling overhead. Multiple PEBS records (some or all of which may correspond to imprecise events) may be collected on a single PMI, reducing the number of interrupts per collected sample (i.e., per PEBS record). Interrupts are expensive on the system and are responsible for most of the performance perturbation caused by event-based sampling. Therefore, it is advantageous to reduce the number of interrupts needed to obtain performance monitoring samples.
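The overhead argument can be made concrete with a small calculation. The sketch below is illustrative only: the helper `interrupts_needed`, the sample count of 10,000, and the buffer capacity of 64 records are assumptions chosen for the example and are not taken from the embodiments.

```python
def interrupts_needed(samples: int, records_per_pmi: int) -> int:
    """Number of PMIs required to collect `samples` records when each PMI
    drains a buffer holding up to `records_per_pmi` records."""
    if samples < 0 or records_per_pmi < 1:
        raise ValueError("samples must be >= 0 and records_per_pmi >= 1")
    return -(-samples // records_per_pmi)  # integer ceiling division


SAMPLES = 10_000  # hypothetical number of samples to collect

one_record_per_pmi = interrupts_needed(SAMPLES, 1)
batched = interrupts_needed(SAMPLES, 64)  # hypothetical buffer capacity

print(one_record_per_pmi, batched)
```

Under these assumed figures, buffering records before raising a PMI cuts the interrupt count by roughly the buffer capacity.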

Some embodiments of the present disclosure are compact circuits and thus are implemented as an integral part of a wide variety of processing units without significant cost and power consumption increases. Some embodiments of the present disclosure are programmable circuit logic and may be used to track and manage different types of imprecise events on the same circuit logic. The NPEBS processor circuit is also scalable to track multiple processing units. The NPEBS processor circuit may be shared by multiple applications running on the same processor and managed as a shared resource by an Operating System (OS) or virtual machine.

Exemplary processor for generating and storing PEBS and NPEBS records

FIG. 1 is a block diagram illustrating a processor 102 according to one embodiment. The processor 102 includes NPEBS processor circuitry 106 and PEBS processor circuitry 108 having one or more memory stores 110a-110n (which may be implemented as physical memory stores, such as buffers). The PEBS processor circuit 108 may also include a Performance Monitoring Interrupt (PMI) component 112 as described above. In addition, processor 102 may include one or more Event Selection (ES) controls 114a-114n, which correspond to one or more general purpose performance counters 116a-116n and further correspond to one or more PEBS enable circuits 118a-118n (details of which are discussed above). In some implementations, the PEBS enable circuits 118a-118n may be located in a single control register (e.g., a machine-specific register).

Additionally, in the embodiment shown in FIG. 1, PEBS, NPEBS, and PDIR operations are applied using fixed function counters 160a-c. In one embodiment, three fixed function counters 160a-c are defined and implemented for counting retired instructions, reference clocks, and core clocks. However, it will be appreciated that the underlying principles of the invention are not limited to any particular number of fixed function counters or any particular fixed function counter implementation.

As mentioned, the processor 102 may execute an instruction stream embedded with markers for events that may be placed on the bus/interconnect fabric 104. Execution of a segment of instructions may constitute one or more imprecise events. Imprecise events are performance events that are not linked to a particular instruction or micro-operation in the instruction trace, or that can occur speculatively even when the instruction or micro-operation does not retire. Such imprecise events may include, but are not limited to, reference clocks, core clocks, and cycles, to name a few. In one embodiment, the imprecise event is generated by the processor 102. In another embodiment, the imprecise event is generated external to the processor 102 and transmitted to the processor via the bus/interconnect fabric 104.

In one embodiment, the Event Selection (ES) controls 150a-c shown in FIG. 1 perform operations similar to those of the ES controls 114a-c described above, but correspond to the fixed function performance counters 160a-c and further to the PEBS enable circuits 170a-c associated with these fixed function performance counters 160a-c. In some embodiments, the PEBS enable circuits 118a-118n and 170a-c are located in a single control register.

For example, FIG. 8 illustrates an exemplary PEBS enable machine specific register 800, abbreviated as PEBS enable MSR 800, in which bits 0-3 are associated with four general purpose counters GPctr0-GPctr3, respectively, and bits 32-34 are associated with fixed function performance counters FxCtr0-FxCtr2, respectively. In one embodiment, a bit value of 1 in any of bit positions 0-3 enables (N)PEBS for the corresponding general purpose counter, and a value of 1 in any of bit positions 32-34 enables (N)PEBS for the corresponding fixed function counter. Of course, the particular bits used to enable (N)PEBS are not relevant to the underlying principles of the invention. For example, in an alternative implementation, a bit value of 0 is used to indicate that the corresponding counter is enabled for (N)PEBS.
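The bit assignment described for the PEBS enable MSR 800 can be sketched as a pair of helpers that build and decode the register value. The helper names `pebs_enable_mask` and `enabled_counters` are hypothetical; only the bit positions (0-3 for GPctr0-GPctr3, 32-34 for FxCtr0-FxCtr2) and the 1-means-enabled polarity come from the description of FIG. 8. The sketch only computes a value and does not access any real register.

```python
GP_COUNTER_BASE_BIT = 0      # GPctr0-GPctr3 occupy bits 0-3
FIXED_COUNTER_BASE_BIT = 32  # FxCtr0-FxCtr2 occupy bits 32-34
NUM_GP_COUNTERS = 4
NUM_FIXED_COUNTERS = 3


def pebs_enable_mask(gp_counters=(), fixed_counters=()) -> int:
    """Build a register value with a 1 bit for each counter to enable."""
    mask = 0
    for index in gp_counters:
        if not 0 <= index < NUM_GP_COUNTERS:
            raise ValueError(f"no general purpose counter {index}")
        mask |= 1 << (GP_COUNTER_BASE_BIT + index)
    for index in fixed_counters:
        if not 0 <= index < NUM_FIXED_COUNTERS:
            raise ValueError(f"no fixed function counter {index}")
        mask |= 1 << (FIXED_COUNTER_BASE_BIT + index)
    return mask


def enabled_counters(mask: int) -> list:
    """List the counter names whose enable bit is set in `mask`."""
    names = [f"GPctr{i}" for i in range(NUM_GP_COUNTERS)
             if mask >> (GP_COUNTER_BASE_BIT + i) & 1]
    names += [f"FxCtr{i}" for i in range(NUM_FIXED_COUNTERS)
              if mask >> (FIXED_COUNTER_BASE_BIT + i) & 1]
    return names


mask = pebs_enable_mask(gp_counters=[0, 2], fixed_counters=[0])
print(hex(mask), enabled_counters(mask))
```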

In one embodiment, programming of ES controls 150a-c causes the performance counters 160a-c corresponding to the programmed ES controls to track the occurrence of certain programmed imprecise/precise events. In some embodiments, any event that is not defined as a precise event is considered an imprecise event. In one embodiment, ES controls 150a-c are programmed by an executing application. In another embodiment, the user programs the ES controls 150a-c with imprecise/precise event identifiers.

When an ES control 150a-c is programmed with an event identifier, the performance counter 160a-c corresponding to the ES control 150a-c is incremented or decremented on each occurrence of the programmed event. The PEBS enable circuits 170a-c corresponding to the ES controls 150a-c and the fixed function performance counters 160a-c may be set (e.g., activated, flag set, bit set to 1, etc.) to generate PEBS records when the fixed function performance counters 160a-c overflow, or when the fixed function performance counters 160a-c reach a value of 0 if the counters are decremented. In one embodiment, the PEBS enable bit illustrated in FIG. 8 is set to enable the PEBS processor circuit 108 to generate a PEBS record when the fixed function performance counter 160a-c that is counting events overflows or reaches zero. As discussed above, the PEBS record contains architecture metadata that includes the state of the system at the time the fixed function performance counter 160a-c overflows or reaches zero. The architecture metadata may include, but is not limited to, for example, the IP, TSC, or register state.

Exemplary control register for fixed function counter

FIG. 9 illustrates an alternative MSR layout for ES control of the fixed counters. In this embodiment, the event selection controls 150a-c may be implemented in a combined MSR as shown in FIG. 9. Because these are fixed counters, there are no events to program, and there is not even a separate MSR for each counter (i.e., there is no event select or unit mask, since each counter always counts only one event). The PEBS enable circuit 910 is shown for three fixed counters (ia32_fixed_ctr0, ia32_fixed_ctr1, and ia32_fixed_ctr2). In one embodiment, ENABLE is a 2-bit value associated with each counter that is set to 0 (disabled), 1 (OS control), 2 (user control), or 3 (control at all ring levels). In this embodiment, there is limited control associated with each counter because of the other logic that must be programmed (such as the ring level mask and PMI enable).
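The 2-bit ENABLE encoding can be illustrated with a short sketch that packs and unpacks one field per fixed counter. The four values and their meanings come from the description above. The field placement is an assumption: the text does not state the bit positions shown in FIG. 9, so the sketch assumes each counter owns a 4-bit group with ENABLE in its low two bits. The names `FixedEnable`, `set_enable`, and `get_enable` are hypothetical.

```python
from enum import IntEnum


class FixedEnable(IntEnum):
    DISABLED = 0   # counter disabled
    OS = 1         # OS control
    USER = 2       # user control
    ALL_RINGS = 3  # control at all ring levels


FIELD_STRIDE = 4    # ASSUMPTION: bits reserved per counter in the combined MSR
ENABLE_MASK = 0b11  # ENABLE is a 2-bit value


def set_enable(ctrl: int, counter: int, mode: FixedEnable) -> int:
    """Return `ctrl` with the ENABLE field of `counter` set to `mode`."""
    shift = counter * FIELD_STRIDE
    return (ctrl & ~(ENABLE_MASK << shift)) | (int(mode) << shift)


def get_enable(ctrl: int, counter: int) -> FixedEnable:
    """Read back the ENABLE field of `counter` from `ctrl`."""
    return FixedEnable((ctrl >> (counter * FIELD_STRIDE)) & ENABLE_MASK)


ctrl = 0
ctrl = set_enable(ctrl, 0, FixedEnable.ALL_RINGS)  # first fixed counter
ctrl = set_enable(ctrl, 2, FixedEnable.USER)       # third fixed counter
print(hex(ctrl), [get_enable(ctrl, i).name for i in range(3)])
```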

In one embodiment, the NPEBS processor circuit 106 is coupled to the PEBS enable circuits 170a-c such that when the fixed function performance counters 160a-c overflow or reach a zero value, the NPEBS processor circuit 106 causes the PEBS enable circuits 170a-c to generate PEBS records for the event. In some embodiments, the NPEBS processor circuit 106 controls the timing of generating PEBS records for the event. For example, in one embodiment, the NPEBS processor circuit 106 causes the PEBS enable circuits 170a-c to immediately generate a PEBS record for a programmed event upon the occurrence of an overflow or zero value of the performance counter 160a-c that tracks and counts the event.

In another embodiment, the NPEBS processor circuit 106 causes the PEBS enable circuits 170a-c to generate PEBS records for programmed events immediately after the overflow or zero value of the fixed function performance counters 160a-c that track and count the events. In this embodiment, the PEBS record is generated after the next instruction retires (i.e., after completion of the next instruction in the instruction trace following the overflow or run to zero of the fixed function performance counters 160a-c). In one embodiment, the PEBS record for the event generated by the PEBS processor circuit 108 is stored in the memory store 110 of the PEBS processor circuit 108. Accordingly, architecture metadata associated with the event may be captured without utilizing a PMI.

In one embodiment, the PMI component 112 collects PEBS records stored in the memory stores 110a-110n of the PEBS processor circuit 108. The PMI component 112 may immediately collect PEBS records stored in the memory stores 110a-110n. In another embodiment, the PMI component 112 delays collecting the PEBS records stored in the memory stores 110a-110n. The interface may be provided as a Machine Specific Register (MSR).

Applying PEBS/NPEBS/PDIR to the fixed function counters 160a-c provides benefits similar to those of adding these features to the general purpose counters 116a-n, while leaving the general purpose counters free for other activities. These and other benefits and additional features of embodiments of the present invention are discussed below.

Generating PEBS records even when interrupts are masked

Specifically, using the techniques described herein, PEBS samples are not suppressed when interrupts are masked. In current implementations, a fixed event can only leave a PMI pending and cannot log a PEBS record. Unless the PMI is configured to cause a non-maskable interrupt (NMI), the PMI is blocked while interrupts are masked, which obscures where the sample actually occurred. The use of NMIs may lead to stability and security problems on the system, and not all operating systems allow the use of NMIs. With the techniques described herein, the details of the event are placed in the PEBS buffer without suppression even when interrupts are masked. Interrupts are masked in interrupt handlers, context switches, locking algorithms, and other critical regions within privileged code (ring 0). With the shift to SoCs (systems on chip), which require interrupts for interaction between the CPU and other on-chip units, the amount of time required for interrupt handling has increased. Today, many event-based sampling profiles are incorrect because the performance monitoring interrupt handler cannot be entered while interrupts are masked to capture profiling-critical data such as the instruction pointer.

These embodiments also provide for faster detection. For example, a hardware buffer can capture instruction pointers (and additional information about architecture state) with lower latency than is required for an interrupt handler to be entered upon a performance monitoring interrupt from an APIC. This results in more accurate profile information.

These embodiments also provide lower sampling overhead. Multiple (N)PEBS samples may be buffered and collected on a single performance monitoring interrupt, reducing the number of interrupts per collected sample. As mentioned, interrupts are expensive and are responsible for most of the performance perturbation caused by event-based sampling.

For a "retired instruction" fixed event, extending PEBS to cover the fixed counter 160 allows features that use it, such as precise distribution of retired instructions (PDIR), to be further enhanced. This feature ensures that the sampling of the IP captured in the PEBS record is statistically accurate, and it is available today only on the general purpose counters 116. General purpose counters are typically multiplexed in order to collect all requested events, which yields only a partial instruction profile. This problem is solved by embodiments of the present invention in which PDIR is supported on a fixed counter 160.

In addition, in the current implementation, there is no way for a fixed event to utilize the trigger mechanism or buffer of the PEBS event. The lack of the ability to accurately profile when interrupts are masked results in a significant amount of time being wasted debugging platform problems.

Exemplary procedure for generating and storing PEBS records

FIG. 2 illustrates an embodiment of a process for generating and storing PEBS records in a memory buffer and for storing the memory buffer to a PEBS trace file. After start, at 202, a PMU counter is set to -N. Starting at this negative value, the PMU counter in this embodiment is incremented every time a PEBS record is generated, until it reaches zero (0). In an alternative embodiment (not shown), the PMU counter is set to +N and is decremented each time a PEBS record is generated. At 204, N PEBS records are generated and stored in a PEBS memory buffer. At 206, the N PEBS records are stored in a PEBS trace file. This step is also illustrated at 210, which shows the storage of N PEBS records in a PEBS trace file 212. At 208, the PEBS trace file is post-processed, after which the process ends.
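The flow of FIG. 2 can be modeled in software as follows. This is a simplified sketch under stated assumptions: a Python list stands in for the PEBS memory buffer and another for the PEBS trace file 212, records are opaque values, and the function name `collect_records` is hypothetical. In the embodiments the records are produced by the PEBS processor circuit rather than by a loop.

```python
def collect_records(generated_records, n: int):
    """Model of FIG. 2: count up from -N, flushing the buffer at zero.

    Returns (trace_file, buffer), where `buffer` holds any records that
    were generated after the last flush.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    trace_file = []  # stand-in for the PEBS trace file
    buffer = []      # stand-in for the PEBS memory buffer
    counter = -n     # step 202: PMU counter preset to -N
    for record in generated_records:
        buffer.append(record)  # step 204: record stored in the buffer
        counter += 1
        if counter == 0:       # N records have been stored
            trace_file.extend(buffer)  # step 206: buffer saved to the file
            buffer.clear()
            counter = -n               # re-arm for the next N records
    return trace_file, buffer


trace_file, pending = collect_records(range(10), n=4)
print(trace_file, pending)
```

With ten generated records and N = 4, two full buffers reach the trace file and two records remain pending in the buffer.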

FIG. 3 illustrates an embodiment of a process for programming a PEBS processor circuit to monitor processor performance and generate PEBS records to be stored in a PEBS memory buffer and subsequently in a PEBS trace file. After start, at 302, a PMU counter is programmed to count function calls, such as BR_INST_RETIRED.NEAR_CALL_PS events, and to overflow after N calls. At 304, the PEBS processor circuit is programmed to generate, after each overflow, a PEBS record configured to contain the X entries at the top of the stack together with architecture metadata that includes state information for the processor, including but not limited to an instruction pointer, a timestamp counter, and a register state. The configuration of the processor information monitored by PEBS and stored in the PEBS data records is shown in FIGS. 7, 8, and 9 and discussed below. At 306, after the PEBS memory buffer has been filled, its contents are stored to the PEBS trace file. The process then ends.

Exemplary procedure for post-processing PEBS trace files
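A software model of the sampling configured in FIG. 3 might look like the following. It is a sketch only: the `PebsRecord` fields mirror the items named above (instruction pointer, timestamp counter, register state, and X stack entries), but the class, the `sample_calls` function, and the dictionary shape of each simulated call are assumptions made for illustration. The stack is modeled as a list whose index 0 is the top of the stack.

```python
from dataclasses import dataclass


@dataclass
class PebsRecord:
    ip: int               # instruction pointer
    tsc: int              # timestamp counter
    registers: dict       # register state
    stack_entries: tuple  # X entries from the top of the stack


def sample_calls(calls, n: int, x: int) -> list:
    """Emit one record for every N-th simulated call, keeping X stack entries."""
    if n < 1 or x < 0:
        raise ValueError("n must be >= 1 and x must be >= 0")
    records = []
    count = 0
    for call in calls:
        count += 1      # step 302: count retired function calls
        if count == n:  # overflow after N calls
            count = 0
            records.append(PebsRecord(  # step 304: build the record
                ip=call["ip"],
                tsc=call["tsc"],
                registers=dict(call["registers"]),
                stack_entries=tuple(call["stack"][:x]),
            ))
    return records


# Six simulated calls; every third call is sampled with the top two stack entries.
calls = [
    {"ip": 0x1000 + i, "tsc": 100 * i, "registers": {"rdi": i},
     "stack": [0x4000 + i, i * 10, i * 20]}
    for i in range(6)
]
records = sample_calls(calls, n=3, x=2)
print(records)
```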

FIG. 4 illustrates an embodiment of post-processing a PEBS trace file to decompose performance data on a function call by function call basis. After starting, at 402, an Instruction Pointer (IP) is fetched from a record in the PEBS trace file. At 404, the instruction pointer is mapped to symbol information. At 406, the instruction pointer is used to determine the name of the function associated with the instruction and the calling convention of the function, which defines the input parameters to be received by the function and the results to be provided. At 408, using the calling convention of the function, the arguments of the function are fetched from the PEBS trace file, the arguments being contained in the X stack entries and the register values. At 410, performance data is decomposed by specific arguments on a function call by function call basis. The process then ends.
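The post-processing steps of FIG. 4 can be sketched as below. Everything concrete here is hypothetical: the symbol table, the record fields (`ip`, `stack`, `cycles`), and the simplification that a function's arguments are simply the first few captured stack entries. Real calling conventions also pass arguments in registers and place a return address on the stack, so this shows only the shape of the analysis.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical symbol table: (start address, function name, number of stack arguments)
SYMBOLS = [
    (0x1000, "func_a", 1),
    (0x2000, "func_b", 2),
]
STARTS = [start for start, _, _ in SYMBOLS]


def resolve(ip):
    """Steps 404-406: map an instruction pointer to (name, argument count)."""
    idx = bisect_right(STARTS, ip) - 1
    if idx < 0:
        raise KeyError(f"no symbol for IP {ip:#x}")
    _, name, argc = SYMBOLS[idx]
    return name, argc


def break_down(records):
    """Steps 402, 408, 410: group a performance metric by (function, arguments)."""
    profile = defaultdict(list)
    for rec in records:
        name, argc = resolve(rec["ip"])      # step 402 reads the record's IP
        args = tuple(rec["stack"][:argc])    # step 408: arguments from stack entries
        profile[(name, args)].append(rec["cycles"])
    return dict(profile)


# Hypothetical trace records; "cycles" stands in for any per-call metric.
records = [
    {"ip": 0x2004, "stack": [8, 16, 99], "cycles": 120},
    {"ip": 0x2004, "stack": [8, 16, 77], "cycles": 130},
    {"ip": 0x2004, "stack": [1024, 16, 0], "cycles": 900},
    {"ip": 0x1010, "stack": [3, 0, 0], "cycles": 40},
]
profile = break_down(records)
```

Keying the profile on the argument tuple is what separates the calls to `func_b` with a small first argument from the call with a large one, which is the distinction the described embodiments aim to expose.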

Exemplary processor register File

FIG. 5 is a block diagram of a register architecture 500 according to one embodiment of the invention. In the illustrated embodiment, there are 32 vector registers 510 that are 512 bits wide; these registers are referenced as zmm0 through zmm31. The lower order 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The lower order 128 bits of the lower 16 zmm registers (the lower order 128 bits of the ymm registers) are overlaid on registers xmm0-15.
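The overlay relationship between the vector registers can be illustrated with plain integer masking. This is a sketch that models a 512-bit register as a Python integer; the helper names `ymm_view` and `xmm_view` are introduced here for illustration only.

```python
YMM_BITS = 256
XMM_BITS = 128


def ymm_view(zmm_value):
    """Return the low-order 256 bits of a 512-bit register value."""
    return zmm_value & ((1 << YMM_BITS) - 1)


def xmm_view(zmm_value):
    """Return the low-order 128 bits of a 512-bit register value."""
    return zmm_value & ((1 << XMM_BITS) - 1)


# A value with one byte above bit 256, one between bits 128 and 256,
# and one in the lowest byte.
zmm0 = (0xAA << 300) | (0xBB << 200) | 0xCC
ymm0 = ymm_view(zmm0)   # keeps the 0xBB and 0xCC bytes
xmm0 = xmm_view(zmm0)   # keeps only the 0xCC byte
```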

Writemask registers 515: in the illustrated embodiment, there are 8 writemask registers (k0 through k7), each 64 bits in size. In an alternative embodiment, the writemask registers 515 are 16 bits in size. As previously described, in one embodiment of the present invention, vector mask register k0 cannot be used as a writemask; when the encoding that would normally indicate k0 is used for a writemask, it selects a hardwired writemask of 0xFFFF, effectively disabling writemasking for that instruction.

General purpose registers 525: in the illustrated embodiment, there are sixteen 64-bit general purpose registers that are used along with the existing x86 addressing modes to address memory operands. These registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

Scalar floating point stack register file (x87 stack) 545, on which is aliased the MMX packed integer flat register file 550: in the illustrated embodiment, the x87 stack is an eight-element stack used to perform scalar floating point operations on 32/64/80-bit floating point data using the x87 instruction set extension, while the MMX registers are used to perform operations on 64-bit packed integer data and to hold operands for some operations performed between the MMX and XMM registers.

Alternative embodiments of the present invention use wider or narrower registers. In addition, alternative embodiments of the present invention may use more, fewer, or different register files and registers.

Exemplary stacks

FIG. 6 illustrates a register stack according to an embodiment. As illustrated, stack 602 (also referred to as a register stack) includes stack bottom 604, stack limit 608, and stack pointer 606. In some embodiments, stack pointer 606 points to the next stack entry to be popped from the stack. In other embodiments, stack pointer 606 points to an empty stack location where the next stack element will be pushed onto the stack. Stack 602 supports push and pop operations and overflows when an element is pushed beyond stack limit 608.

In some embodiments, a general purpose register, such as general purpose register 525 (FIG. 5), is used to store the stack. In some embodiments, some general purpose registers are used to implement a global stack that is shared by all processes. In some embodiments, some general purpose registers are reserved to implement stacks for a single process.

As illustrated, the stack stores three elements A1, A2, and A3 associated with a stack frame 610 for function A(). The stack also stores five elements B1, B2, B3, B4, and B5 associated with the stack frame 612 for function B(). As illustrated, the remaining elements of the stack are not used. When the PEBS record is configured as in FIG. 7 to capture five entries, that is, the entries from the stack pointer through stack pointer + X with X = 5, PEBS post-processing such as that illustrated in FIG. 4 can decompose the parameters fed to function B() as described. In some embodiments, stack 602 includes general purpose registers as illustrated in FIG. 5.

PEBS configuration register

FIG. 7 illustrates an embodiment of a programmable PEBS configuration register, showing an example of programming the PEBS configuration register 702 to specify what is stored in a PEBS record. In some embodiments, the PEBS configuration register comprises one of the general purpose registers or architectural registers illustrated in FIG. 5. In some embodiments, the PEBS configuration register comprises a separate dedicated register included in the processor. In some embodiments, the PEBS data records include memory locations.

As shown, the lowest order 5 bits of the PEBS configuration register are set to 0b1_1011, which causes the PEBS processor circuit to monitor and record the Instruction Pointer (IP), the timestamp counter (TSC), the general purpose registers (RAX, RBX, etc.), and the last branch records (from, to, info). The next 6 bits cause the PEBS processor circuit to include X doublewords of the stack, [RSP+0], [RSP+4], ..., [RSP+(X-1)×4], starting at stack pointer RSP. As illustrated, X may be set to up to 64 doublewords of the memory stack. When the PEBS configuration register is programmed as shown, in some embodiments, the PEBS processor circuit, when enabled, generates and stores a PEBS record each time the PMU counter overflows. Here, PEBS records 706 and 708 have been generated and stored in PEBS memory buffer 704. The PEBS record is intended to reflect the state of the processor at the time of the overflow. In some embodiments, the PEBS processor circuit reads the stack entries immediately after the PMU counter overflows, although some delay may exist.
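The field layout and the stack addresses described above can be sketched as follows. The code assumes that the 5 state-selection bits occupy bit positions 0 to 4 and that the following 6-bit field (bit positions 5 to 10) holds X directly; the text does not specify the exact encoding of X, so this layout and the names `decode_config` and `stack_addresses` are illustrative assumptions.

```python
STATE_MASK = 0b1_1111     # lowest-order 5 bits: which state groups to record
STACK_SHIFT = 5           # assumed position of the stack-entry count field
STACK_MASK = 0b11_1111    # next 6 bits: number of stack doublewords (assumed encoding)


def decode_config(value):
    """Split a configuration value into (state bits, X)."""
    state_bits = value & STATE_MASK
    x = (value >> STACK_SHIFT) & STACK_MASK
    return state_bits, x


def stack_addresses(rsp, x):
    """Addresses of the X doublewords [RSP+0], [RSP+4], ..., [RSP+(X-1)*4]."""
    return [rsp + 4 * i for i in range(x)]


# Example: state bits 0b1_1011 as in the text, with X = 5 stack entries.
config = (5 << STACK_SHIFT) | 0b1_1011
state_bits, x = decode_config(config)
addrs = stack_addresses(0x7000, x)   # 0x7000 is an arbitrary example RSP value
```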

Exemplary PEBS memory buffer

FIG. 10 illustrates updating of a data store buffer management area according to one embodiment. FIG. 10 illustrates additional details of one embodiment of the present invention in which the data store buffer management area 1000 is extended to include counter reset values 1001 for fixed counters Fixed Cntr0, Fixed Cntr1, and Fixed Cntr2 (similar to fixed function counters 160a-c of FIG. 1). To sample every Nth event, a reset value of "-N" is specified by these values and programmed into the fixed counter, as well as into the memory-based control block location associated with the counter. When the counter reaches 0, and after a slight pipeline delay (during which additional events may occur), the next event causes a sample to be taken. As illustrated, each event that causes a sample to be taken causes a PEBS record to be generated and stored in the PEBS memory buffer 1002. The counter is then reset to "-N" again from the counter reset value 1001 (as execution and counting continue). As shown, consecutive PEBS records (record 0 through record M) are written to PEBS memory buffer 1002. In some embodiments, when a predetermined threshold number of PEBS records have been written to the PEBS memory buffer, the contents of the PEBS memory buffer are copied to the PEBS trace file in memory. The PEBS trace file may be stored in the same memory as the PEBS memory buffer 1002 or in a different memory. In some embodiments, the PEBS memory buffer 1002 is stored in a second memory. In some embodiments, the second memory has a greater capacity than the first memory.
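The reset-and-sample cycle can be modeled in a few lines. This sketch simplifies the hardware behavior: it takes the sample on the event that brings the counter to zero and ignores the pipeline delay mentioned above. The function name `sample_events`, its parameters, and the use of lists for the buffer and trace file are assumptions made for illustration.

```python
def sample_events(events, n, threshold):
    """Sample every n-th event; flush the buffer once it holds `threshold` records."""
    counter = -n                 # programmed from the counter reset value
    buffer = []                  # stand-in for the PEBS memory buffer
    trace_file = []              # stand-in for the PEBS trace file
    for event in events:
        counter += 1
        if counter == 0:
            buffer.append(event)             # a record is generated and stored
            counter = -n                     # counter is reloaded with -N
            if len(buffer) >= threshold:
                trace_file.extend(buffer)    # copy buffer contents to the trace file
                buffer.clear()
    return trace_file, buffer


# Events numbered 1..16, sampling every 3rd event, flushing every 2 records.
trace_file, pending = sample_events(range(1, 17), n=3, threshold=2)
```

With these inputs, events 3, 6, 9, and 12 end up in the trace file, while event 15 is still waiting in the buffer because the threshold has not been reached again.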

In some embodiments, a combination of hardware and microcode is used to collect the samples, and the samples do not require interrupts or any macro code execution. Once the buffer is filled to a predefined threshold, a Performance Monitoring Interrupt (PMI) is taken and a macro code handler is invoked to process the samples in the buffer.

In one embodiment, non-precise event based sampling (NPEBS) uses the same debug store mechanism as PEBS to periodically store a set of architectural state information, but with slightly different semantics. The same sampling control mechanism is used, but the sample is taken at the next opportunity after the counter reaches 0. It is considered "imprecise" because the sampled instruction may not be the one that experienced the event. NPEBS is engaged when PEBS is configured for events that are not part of the PEBS-capable event list, such as reference clocks and core clocks. In the embodiment described above, it is implemented on the general-purpose counters 116a-n. Without NPEBS, the only way to obtain statistical samples based on clock events is to take a costly PMI each time a properly configured counter overflows.

In summary, embodiments of the present invention extend the PEBS enable machine specific register 800 (e.g., the IA32_PEBS_ENABLE MSR), the data store buffer management area 1000, and associated hardware control registers to include status bits for the fixed counters 160a-c. These embodiments allow all fixed events to set the corresponding PEBS enable bit so that they can use the PEBS trigger mechanism and buffer when they reach the programmed sample count, using either PEBS or NPEBS as described above. For reference clocks and core clocks, a fixed event is not guaranteed to be tagged to any particular instruction, but these embodiments allow clock events to use the PEBS buffer to store all of the information already architecturally available through PEBS, such as the instruction pointer (RIP/EIP), timestamp counter (TSC), and general purpose registers. Additionally, in one embodiment, hardware in the exception generation logic takes the additional input and inserts a PEBS assist operation as appropriate. In one embodiment, the fixed counters 160 utilize the PEBS trigger mechanism. Thus, fixed events may program the PEBS enable machine specific register 800 and enable PEBS to be used for those imprecise events.

Exemplary advantages of tracking PEBS events independent of interrupts

FIGS. 11A-11B illustrate improvements to performance monitoring achieved by embodiments of the present disclosure. FIG. 11A illustrates sampling without PEBS and where the PMI is not mapped to an NMI. The end result is an inaccurate profile, in which entire portions of the profile may be missed and samples may be dropped. In contrast, FIG. 11B illustrates event-based sampling of fixed events using the PEBS sampling technique described herein. The result is significantly better accuracy, with samples collected at the time the events occur.

Exemplary System architecture

FIG. 12 illustrates an exemplary computer system, according to an embodiment of the present disclosure, formed with a processor that includes execution units for executing instructions. In accordance with the present disclosure, such as in the embodiments described herein, the system 1200 may include a component, such as a processor 1202, that employs execution units including logic to perform algorithms for processing data. System 1200 may be representative of processing systems based on the Pentium III, Pentium 4, Xeon™, XScale™, and/or StrongARM™ microprocessors available from Intel Corporation of Santa Clara, California, USA, although other systems (including PCs with other microprocessors, engineering workstations, set-top boxes, etc.) may be used. In one embodiment, system 1200 may execute a version of the WINDOWS™ operating system available from Microsoft Corporation of Redmond, Washington, USA, although other operating systems (e.g., UNIX and Linux), embedded software, and/or graphical user interfaces may be used. Thus, embodiments of the disclosure are not limited to any specific combination of hardware circuitry and software.

Embodiments are not limited to computer systems. Embodiments of the present disclosure may be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular telephones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications may include a microcontroller, a digital signal processor (DSP), a system on a chip, a network computer (NetPC), a set-top box, a network hub, a wide area network (WAN) switch, or any other system that may perform one or more instructions in accordance with at least one embodiment.

The system 1200 may include a processor 1202, which processor 1202 may include one or more execution units 1208 for executing algorithms to execute at least one instruction according to one embodiment of the disclosure. One embodiment may be described in the context of a single processor desktop or server system, but other embodiments may be included in a multiprocessor system. System 1200 may be an example of a "hub" system architecture. The system 1200 may include a processor 1202 for processing data signals. The processor 1202 may include, for example, a Complex Instruction Set Computer (CISC) microprocessor, a Reduced Instruction Set Computing (RISC) microprocessor, a Very Long Instruction Word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. In one embodiment, the processor 1202 may be coupled to a processor bus 1210, which processor bus 1210 may transmit data signals between the processor 1202 and other components in the system 1200. The various elements of system 1200 may perform their conventional functions as known to those skilled in the art.

In one embodiment, the processor 1202 may include a Level 1 (L1) internal cache memory 1204. Depending on the architecture, the processor 1202 may have a single internal cache or multiple levels of internal cache. In another embodiment, the cache memory may reside external to the processor 1202. Other embodiments may also include a combination of both internal and external caches, depending on the particular implementation and requirements. The register file 1206 may store different types of data in various registers, including integer registers, floating point registers, status registers, and an instruction pointer register.

Execution units 1208 (including logic for performing integer and floating point operations) also reside in the processor 1202. The processor 1202 may also include a microcode (ucode) ROM that stores microcode for certain macroinstructions. In one embodiment, the execution unit 1208 may include logic to handle a packed instruction set 1209. By including the packed instruction set 1209 in the instruction set of the processor 1202 and associated circuitry for executing instructions, packed data in the processor 1202 may be used to perform operations used by many multimedia applications. Thus, by using the full width of the processor's data bus for performing operations on packed data, many multimedia applications may be accelerated and executed more efficiently. This may eliminate the need to transfer smaller data units across the data bus of the processor in order to perform one or more operations on one data element at a time.

Embodiments of execution unit 1208 may also be used with microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuitry. The system 1200 may include a memory 1220. Memory 1220 may be implemented as a Dynamic Random Access Memory (DRAM) device, a Static Random Access Memory (SRAM) device, a flash memory device, or other memory device. Memory 1220 can store instructions and/or data represented by data signals that can be executed by processor 1202.

Memory controller hub 1216 may be coupled to processor bus 1210 and memory 1220. Memory controller hub 1216 may include a Memory Controller Hub (MCH). The processor 1202 may communicate with a memory controller hub 1216 via a processor bus 1210. Memory controller hub 1216 may provide a high bandwidth memory path 1218 to memory 1220 for instruction and data storage, and for storage of graphics commands, data, and textures. Memory controller hub 1216 may direct data signals between processor 1202, memory 1220, and other components in system 1200, and is used to bridge data signals between processor bus 1210, memory 1220, and input/output (I/O) controller hub 1230. In some embodiments, memory controller hub 1216 provides a graphics port for coupling to graphics/video card 1212. Memory controller hub 1216 may be coupled to memory 1220 through memory interface 1218. Graphics card 1212 may be coupled to memory controller hub 1216 through Accelerated Graphics Port (AGP) interconnect 1214.

System 1200 may use a proprietary hub interface bus 1222 to couple memory controller hub 1216 to an I/O controller hub (ICH) 1230. In one embodiment, ICH 1230 may provide a direct connection to certain I/O devices via a local I/O bus. The local I/O bus may comprise a high-speed I/O bus for connecting peripheral devices to memory 1220, the chipset, and processor 1202. Examples may include an audio controller, a firmware hub (flash BIOS) 1228, a wireless transceiver 1226, a data storage device 1224, a conventional I/O controller including user input and keyboard interfaces, a serial expansion port such as a Universal Serial Bus (USB), and a network controller 1234. Data storage 1224 may include a hard disk drive, floppy disk drive, CD-ROM device, flash memory device, or other mass storage device.

For another embodiment of a system, instructions according to one embodiment may be used with a system on a chip. One embodiment of a system on a chip includes a processor and a memory. The memory for one such system may include flash memory. The flash memory may be located on the same die as the processor and other system components. In addition, other logic blocks, such as a memory controller or a graphics controller, may also be located on the system-on-chip.

Exemplary System architecture

FIGS. 13 and 14 are block diagrams of exemplary system architectures. Other system designs and configurations known in the art for laptop devices, desktop computers, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cellular telephones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to fig. 13, shown is a block diagram of a system 1300 in accordance with one embodiment of the present invention. The system 1300 may include one or more processors 1310, 1315, which are coupled to a controller hub 1320. In one embodiment, controller hub 1320 includes a Graphics Memory Controller Hub (GMCH) 1390 and an input/output hub (IOH) 1350 (which may be on separate chips); GMCH 1390 includes memory and a graphics controller to which memory 1340 and coprocessor 1345 are coupled; IOH 1350 couples input/output (I/O) devices 1360 to the GMCH 1390. Alternatively, one or both of the memory and graphics controller are integrated within a processor (as described herein), the memory 1340 and coprocessor 1345 are directly coupled to the processor 1310, and the controller hub 1320 and IOH 1350 are in a single chip.

The optional nature of the additional processor 1315 is indicated in fig. 13 by dashed lines. Each processor 1310, 1315 may include one or more of the processing cores described herein, and may be some version of processor 1700.

Memory 1340 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of both. For at least one embodiment, controller hub 1320 communicates with processors 1310, 1315 via a multi-drop bus, such as a Front Side Bus (FSB), a point-to-point interface, such as a Quick Path Interconnect (QPI), or similar connection 1395.

In one embodiment, coprocessor 1345 is a special-purpose processor, such as, for example, a high-throughput integrated many-core (MIC) processor, a network or communication processor, compression engine, graphics processor, general-purpose graphics processing unit (GPGPU), embedded processor, or the like. In one embodiment, controller hub 1320 may include an integrated graphics accelerator.

There may be various differences between the processors 1310, 1315 in a range of quality metrics including architecture, microarchitecture, thermal, power consumption characteristics, and the like.

In one embodiment, the processor 1310 executes instructions that control general types of data processing operations. Embedded within these instructions may be coprocessor instructions. The processor 1310 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1345. Thus, the processor 1310 issues these coprocessor instructions (or control signals representing coprocessor instructions) to the coprocessor 1345 on a coprocessor bus or other interconnect. Coprocessor(s) 1345 accept and execute the received coprocessor instructions.

Referring now to FIG. 14, shown is a block diagram of a multiprocessor system 1400 in accordance with an embodiment of the present invention. As shown in fig. 14, multiprocessor system 1400 is a point-to-point interconnect system, and includes a first processor 1470 and a second processor 1480 coupled via a point-to-point interconnect 1450. Each of processor 1470 and processor 1480 may be some version of processor 1700. In one embodiment of the invention, processors 1470 and 1480 are respectively processors 1310 and 1315, and coprocessor 1438 is coprocessor 1345. In another embodiment, processor 1470 and processor 1480 are processor 1310 and coprocessor 1345, respectively.

Processor 1470 and processor 1480 are shown as including Integrated Memory Controller (IMC) units 1472 and 1482, respectively. Processor 1470 also includes point-to-point (P-P) interfaces 1476 and 1478 as part of its bus controller unit; similarly, the second processor 1480 includes P-P interfaces 1486 and 1488. Processor 1470 and processor 1480 may exchange information via a P-P interconnect 1450 using point-to-point (P-P) interface circuits 1478, 1488. As shown in fig. 14, IMCs 1472 and 1482 couple the processors to respective memories, namely a memory 1432 and a memory 1434, which may be portions of main memory locally attached to the respective processors.

Processor 1470 and processor 1480 may each exchange information with chipset 1490 via individual P-P interfaces 1452, 1454 using point to point interface circuits 1476, 1494, 1486, 1498. Chipset 1490 may optionally exchange information with a coprocessor 1438 via a high-performance interface 1492. In one embodiment, the coprocessor 1438 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.

A shared cache (not shown) may be included in either processor or external to both processors but connected to the processors via a P-P interconnect such that if the processors are placed in a low power mode, local cache information for either or both processors may be stored in the shared cache.

Chipset 1490 may be coupled to a first bus 1416 via an interface 1496. In one embodiment, first bus 1416 may be a Peripheral Component Interconnect (PCI) bus or a bus such as a PCI express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in FIG. 14, various I/O devices 1414 may be coupled to first bus 1416 along with a bus bridge 1418, which couples first bus 1416 to a second bus 1420. In one embodiment, one or more additional processors 1415, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, for example, graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to first bus 1416. In one embodiment, the second bus 1420 may be a Low Pin Count (LPC) bus. In one embodiment, various devices may be coupled to the second bus 1420 including, for example, a keyboard and/or mouse 1422, a communication device 1427, and a data storage 1428, such as a disk drive or other mass storage device, which may include instructions/code and data 1430. Further, an audio I/O 1424 may be coupled to the second bus 1420. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 14, a system may implement a multi-drop bus or other such architecture.

Referring now to fig. 15, shown is a block diagram of a second more particular exemplary system 1500 in accordance with an embodiment of the present invention. Like elements in fig. 14 and 15 are given like reference numerals, and certain aspects of fig. 14 are omitted from fig. 15 to avoid obscuring other aspects of fig. 15.

Fig. 15 illustrates that the processor 1470 and the processor 1480 may include integrated memory and I/O control logic ("CL") 1472 and 1482, respectively. Thus, the CL 1472, 1482 includes an integrated memory controller unit and includes I/O control logic. Fig. 15 illustrates that not only are the memories 1432, 1434 coupled to the CLs 1472, 1482, but that the I/O devices 1514 are also coupled to the control logic 1472, 1482. Legacy I/O devices 1515 are coupled to the chipset 1490.

Referring now to FIG. 16, shown is a block diagram of a SoC 1600 in accordance with an embodiment of the present invention. Elements similar to those in FIG. 17 bear like reference numerals. In addition, dashed-line boxes are optional features on more advanced SoCs. In FIG. 16, interconnect unit(s) 1602 are coupled to: an application processor 1610 comprising a set of one or more cores 1702A-N, which include cache units 1704A-N, and shared cache unit(s) 1706; a system agent unit 1710; bus controller unit(s) 1716; integrated memory controller unit(s) 1714; a set 1620 of one or more coprocessors, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1630; a direct memory access (DMA) unit 1632; and a display unit 1640 for coupling to one or more external displays. In one embodiment, coprocessor(s) 1620 include a special-purpose processor, such as, for example, a network or communication processor, compression engine, GPGPU, high-throughput MIC processor, embedded processor, or the like.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementations. Embodiments of the invention may be implemented as a computer program or program code that is executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code (such as code 1430 illustrated in FIG. 14) may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

Program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represent various logic in a processor, which when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, referred to as "IP cores," may be stored on a tangible machine-readable medium and may be supplied to individual customers or manufacturing facilities to load into the manufacturing machines that actually manufacture the logic or processor.

Such machine-readable storage media may include, but are not limited to, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); phase change memory (PCM); magnetic or optical cards; or any other type of medium suitable for storing electronic instructions.

Thus, embodiments of the invention also include a non-transitory, tangible machine-readable medium containing instructions or containing design data, such as a Hardware Description Language (HDL), that defines the structures, circuits, devices, processors, and/or system features described herein. These embodiments are also referred to as program products.

Exemplary core architecture, processor, and computer architecture

Processor cores may be implemented in different ways, for different purposes, and in different processors. For example, implementations of such cores may include: 1) a general purpose in-order core intended for general purpose computing; 2) a high performance general purpose out-of-order core intended for general purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general purpose computing and/or one or more general purpose out-of-order cores intended for general purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. An exemplary core architecture is described next, followed by exemplary processors and computer architectures.

Exemplary core architecture

In-order and out-of-order core block diagram

FIG. 17 is a block diagram of a processor 1700 that may have more than one core, may have an integrated memory controller, and may have an integrated graphics device, according to an embodiment of the invention. The solid line box in fig. 17 illustrates a processor 1700 having a single core 1702A, a system agent 1710, a set 1716 of one or more bus controller units, while the optional addition of a dashed line box illustrates an alternative processor 1700 having multiple cores 1702A-N, a set 1714 of one or more integrated memory controller units in the system agent 1710, and dedicated logic 1708.

Thus, different implementations of the processor 1700 may include: 1) a CPU in which the dedicated logic 1708 is integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1702A-N are one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor in which the cores 1702A-N are a large number of dedicated cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor in which the cores 1702A-N are a large number of general-purpose in-order cores. Thus, the processor 1700 may be a general-purpose processor, a coprocessor, or a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1700 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1706, and external memory (not shown) coupled to the set of integrated memory controller units 1714. The set of shared cache units 1706 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 1712 interconnects the integrated graphics logic 1708 (the integrated graphics logic 1708 is an example of, and is also referred to herein as, dedicated logic), the set of shared cache units 1706, and the system agent unit 1710/integrated memory controller unit(s) 1714, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1706 and the cores 1702A-N.

In some embodiments, one or more of the cores 1702A-N are capable of multithreading. The system agent 1710 includes those components that coordinate and operate the cores 1702A-N. The system agent unit 1710 may include, for example, a power control unit (PCU) and a display unit. The PCU may be, or may include, the logic and components needed to regulate the power states of the cores 1702A-N and the integrated graphics logic 1708. The display unit drives one or more externally connected displays.

The cores 1702A-N may be homogenous or heterogeneous in terms of architectural instruction sets; that is, two or more of the cores 1702A-N may be capable of executing the same instruction set, while other cores may be capable of executing only a subset of the instruction set or a different instruction set.

Exemplary computer architecture

Emulation (including binary translation, code morphing, etc.)

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on-processor, off-processor, or partly on-processor and partly off-processor.
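By way of illustration only, the conversion of one source instruction into one or more target instructions can be sketched as a table lookup. The Python sketch below is a toy model under stated assumptions: the mnemonics, the operand syntax, and the `TRANSLATION_TABLE` contents are hypothetical and do not correspond to x86, MIPS, ARM, or any actual converter.

```python
# Hypothetical translation table: one source mnemonic maps to one or more
# target instruction templates. "{0}" and "{1}" stand for the source operands.
TRANSLATION_TABLE = {
    "MOV": ["COPY {0}, {1}"],
    "INC": ["LOADI tmp, 1", "ADD {0}, {0}, tmp"],
    "PUSH": ["SUBI sp, sp, 8", "STORE {0}, [sp]"],
}


def convert(source_program):
    """Statically translate a list of source instructions into target instructions."""
    target_program = []
    for line in source_program:
        mnemonic, _, operand_text = line.partition(" ")
        operands = [op.strip() for op in operand_text.split(",") if op.strip()]
        templates = TRANSLATION_TABLE.get(mnemonic)
        if templates is None:
            # A real converter might fall back to emulation; this sketch gives up.
            raise ValueError(f"no translation for {mnemonic!r}")
        target_program.extend(t.format(*operands) for t in templates)
    return target_program
```

For instance, `convert(["MOV r1, r2", "INC r1"])` yields three target instructions, which shows why converted code is generally longer than, and different from, code compiled directly for the target instruction set.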

FIG. 18 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the present invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 18 shows that a program in a high-level language 1802 may be compiled using an x86 compiler 1804 to generate x86 binary code 1806 that may be natively executed by a processor 1816 having at least one x86 instruction set core. The processor 1816 having at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor having at least one x86 instruction set core by compatibly executing or otherwise processing: 1) a substantial portion of the instruction set of the Intel x86 instruction set core, or 2) object code versions of applications or other software targeted to run on an Intel processor having at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor having at least one x86 instruction set core. The x86 compiler 1804 represents a compiler operable to generate x86 binary code 1806 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 1816 having at least one x86 instruction set core. Similarly, FIG. 18 shows that the program in the high-level language 1802 may be compiled using an alternative instruction set compiler 1808 to generate alternative instruction set binary code 1810 that may be natively executed by a processor 1814 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies, Inc. of Sunnyvale, California, and/or the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 1812 is used to convert the x86 binary code 1806 into code that may be natively executed by the processor 1814 without an x86 instruction set core. This converted code is not likely to be the same as the alternative instruction set binary code 1810, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1812 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1806.

The above examples include specific combinations of features. However, the above examples are not limited in this respect, and in various implementations the above examples may include: undertaking only a subset of such features; undertaking a different order of such features; undertaking a different combination of such features; and/or undertaking additional features beyond those explicitly listed. For example, all features described with reference to the example methods may be implemented with reference to the example apparatus, the example system, and/or the example article of manufacture, and vice versa.

Embodiments of the invention may include steps that have been described above. The steps may be embodied in machine-executable instructions that may be used to cause a general-purpose or special-purpose processor to perform the steps. Alternatively, the steps may be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components.

Specific exemplary embodiments have been disclosed in the foregoing specification. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Examples

Example 1 provides a processor comprising: a counter for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value; a PEBS processor circuit for generating a PEBS record and storing the PEBS record in a PEBS memory buffer, the PEBS record comprising at least one stack entry reflecting the state of the processor; and a PEBS enable circuit coupled to the counter and to the PEBS processor circuit, the PEBS enable circuit to enable the PEBS processor circuit to generate a PEBS record and store the PEBS record to the PEBS memory buffer.

Example 2 includes the subject matter of example 1. In this example, the PEBS record further includes architecture metadata for the processor and the register state of the processor.

Example 3 includes the subject matter of any one of examples 1-2. The example further includes: an event selection register to be programmed with an event identifier corresponding to the event; and a programmable PEBS configuration register for specifying the contents of the PEBS record.
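As a purely illustrative model of how a programmable configuration register can specify the contents of a record, the following Python sketch treats the register as a set of flag bits. The bit layout, the `RecordContents` names, and the shape of `processor_state` are assumptions made for this sketch and are not the actual register encoding.

```python
from enum import IntFlag


class RecordContents(IntFlag):
    """Hypothetical bit layout for a PEBS configuration register (illustrative only)."""
    INSTRUCTION_POINTER = 0x1
    REGISTER_STATE = 0x2
    STACK_ENTRIES = 0x4
    ARCH_METADATA = 0x8


def build_record(config, processor_state):
    """Copy into the record only the groups selected by the configuration value."""
    record = {}
    if config & RecordContents.INSTRUCTION_POINTER:
        record["ip"] = processor_state["ip"]
    if config & RecordContents.REGISTER_STATE:
        record["regs"] = dict(processor_state["regs"])
    if config & RecordContents.STACK_ENTRIES:
        record["stack"] = list(processor_state["stack"])
    if config & RecordContents.ARCH_METADATA:
        record["metadata"] = dict(processor_state["metadata"])
    return record
```

Selecting only `INSTRUCTION_POINTER | STACK_ENTRIES` keeps each record small while still capturing the stack entries from which function arguments can later be recovered.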

Example 4 includes the subject matter of any one of examples 1-3. The example further includes: a second counter included in the processor for generating a second count of occurrences of non-precise events in the processor and for overflowing when the second count of occurrences reaches a second specified value; an NPEBS processor circuit for generating an NPEBS record and storing the NPEBS record in the PEBS memory buffer, the NPEBS record including at least one stack entry reflecting a state of the processor; and an NPEBS enable circuit coupled to the second counter and to the NPEBS processor circuit, the NPEBS enable circuit operable to enable the NPEBS processor circuit to generate an NPEBS record and store the NPEBS record to the PEBS memory buffer when the second counter reaches the second specified value.

Example 5 includes the subject matter of any one of examples 1-4. In this example, the event is a non-precise event.

Example 6 includes the subject matter of any one of examples 1-5. In this example, the specified value includes a zero value when the counter is decremented from a positive starting value, the specified value includes a zero value when the counter is incremented from a negative starting value, and the specified value includes a positive value when the counter is incremented from a zero starting value.
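The three counting arrangements of example 6 can be modeled in software as follows. This Python sketch is a simplified model and not a description of the hardware: the `EventCounter` class, the mode names, and the reload of the starting value after each overflow are assumptions made for illustration.

```python
class EventCounter:
    """Toy counter that overflows when its value reaches a specified value."""

    def __init__(self, start, step, overflow_at):
        self.start = start
        self.step = step                # +1 counts up, -1 counts down
        self.overflow_at = overflow_at  # the "specified value"
        self.value = start

    def count_event(self):
        """Count one occurrence; return True if it caused an overflow."""
        self.value += self.step
        if self.value == self.overflow_at:
            self.value = self.start     # assumed: reload so sampling continues
            return True
        return False


def make_counter(mode, period):
    """Build a counter that overflows once every `period` events."""
    if mode == "down_from_positive":
        return EventCounter(start=period, step=-1, overflow_at=0)
    if mode == "up_from_negative":
        return EventCounter(start=-period, step=+1, overflow_at=0)
    if mode == "up_from_zero":
        return EventCounter(start=0, step=+1, overflow_at=period)
    raise ValueError(f"unknown mode {mode!r}")


def count_overflows(mode, period, n_events):
    """Return how many overflows occur while counting n_events occurrences."""
    counter = make_counter(mode, period)
    return sum(counter.count_event() for _ in range(n_events))
```

In this model all three arrangements produce the same sampling period; they differ only in the starting value and in the value at which the overflow is detected.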

Example 7 includes the subject matter of any one of examples 1-6. The example further includes an interface to a second memory, the PEBS memory buffer to be stored into a PEBS tracking file contained in the second memory.

Example 8 includes the subject matter of any one of examples 1-7. In this example, the PEBS memory buffer comprises a cache memory contained in the processor, and the second memory comprises memory external to the processor.

Example 9 includes the subject matter of any one of examples 1-8. In this example, the PEBS memory buffer includes a memory external to the processor and coupled to the processor through a memory controller hub, and the second memory includes a data store external to the processor and coupled to the processor through an input/output (I/O) controller hub.

Example 10 provides a method comprising: configuring a counter contained in a processor for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value; configuring a Precision Event Based Sampling (PEBS) processor circuit for generating a PEBS record after at least one overflow and for storing the PEBS record in a PEBS memory buffer, the PEBS record containing at least one stack entry read from a stack after the at least one overflow; enabling, by a PEBS enable circuit, the PEBS processor circuit to generate and store the PEBS record after the at least one overflow; generating, by the PEBS processor circuit, the PEBS record after the at least one overflow and storing the PEBS record in the PEBS memory buffer; and storing the contents of the PEBS memory buffer to a PEBS tracking file in memory.
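The sequence of steps in example 10 can be sketched as a small software simulation. The sketch below is an assumption-laden model in Python rather than the hardware flow itself: the event dictionaries, the JSON-lines file format, and the function names `run_sampling` and `sample_to_text` are hypothetical.

```python
import io
import json


def run_sampling(events, sample_period, stack_depth, trace_file, pebs_enabled=True):
    """Count events, emit one record per overflow, then store the buffer to a file."""
    remaining = sample_period            # configure the counter
    pebs_buffer = []                     # stands in for the PEBS memory buffer
    for event in events:
        remaining -= 1
        if remaining > 0:
            continue
        remaining = sample_period        # overflow: reload the counter
        if pebs_enabled:                 # record generation must be enabled
            pebs_buffer.append({
                "ip": event["ip"],
                # configured record contents: the top `stack_depth` stack entries
                "stack": event["stack"][:stack_depth],
            })
    for record in pebs_buffer:           # store buffer contents to the tracking file
        trace_file.write(json.dumps(record) + "\n")
    return len(pebs_buffer)


def sample_to_text(events, sample_period, stack_depth):
    """Convenience wrapper that returns the tracking-file contents as a string."""
    out = io.StringIO()
    run_sampling(events, sample_period, stack_depth, out)
    return out.getvalue()
```

With seven events and a sample period of three, the model produces two records, one for the third event and one for the sixth.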

Example 11 includes the subject matter of example 10. In this example, the PEBS record further includes architecture metadata for the processor and the register state of the processor.

Example 12 includes the subject matter of any one of examples 10-11. In this example, configuring the PEBS processor circuit to generate and store the PEBS record includes programming a PEBS configuration register to specify the contents of the PEBS record.

Example 13 includes the subject matter of any one of examples 10-12. In this example, the event is a precise event, and the example further includes: configuring a second counter included in the processor for generating a second count of occurrences of non-precise events in the processor and for generating a second overflow when the second count of occurrences reaches a second specified value; and configuring a non-precision event based sampling (NPEBS) processor circuit to generate an NPEBS record after at least one second overflow, the NPEBS record containing at least one stack entry read from the stack after the at least one second overflow of the second counter, and to store the NPEBS record in the PEBS memory buffer.

Example 14 includes the subject matter of example 13. In this example, the PEBS processor circuit and the NPEBS processor circuit share at least some hardware.

Example 15 includes the subject matter of any one of examples 10-14. The example further includes post-processing the PEBS tracking file, the post-processing including: fetching an instruction pointer from a PEBS record in the PEBS tracking file; mapping the instruction pointer to symbol information; determining a function name and a calling convention of the function pointed to by the instruction pointer; retrieving function arguments from the PEBS tracking file; and breaking down the performance data per function call using the function arguments.
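The post-processing steps of example 15 can be sketched as follows. This Python sketch rests on several assumptions that are not taken from the specification: the symbol table `SYMBOLS`, the two calling-convention labels, the register order in `ARG_REGISTERS`, the record layout, and the simplification that each sample is taken while the arguments are still in their entry locations.

```python
import bisect
from collections import Counter

# Hypothetical symbol information:
# (start address, function name, calling convention, number of arguments)
SYMBOLS = [
    (0x1000, "copy_like", "stack", 2),
    (0x2000, "sort_like", "register", 1),
]
STARTS = [entry[0] for entry in SYMBOLS]
ARG_REGISTERS = ["rdi", "rsi", "rdx"]    # assumed register-passing order


def resolve(ip):
    """Map an instruction pointer to (function name, calling convention, arg count)."""
    index = bisect.bisect_right(STARTS, ip) - 1
    if index < 0:
        raise LookupError(f"no symbol covers address {ip:#x}")
    return SYMBOLS[index][1:]


def arguments(record, convention, argc):
    """Retrieve the function arguments from a record according to the convention."""
    if convention == "stack":
        # Assumed layout: stack[0] is the return address, arguments follow it.
        return tuple(record["stack"][1:1 + argc])
    return tuple(record["regs"][name] for name in ARG_REGISTERS[:argc])


def breakdown(records):
    """Count samples per (function name, argument values) pair."""
    samples = Counter()
    for record in records:
        name, convention, argc = resolve(record["ip"])
        samples[(name, arguments(record, convention, argc))] += 1
    return samples
```

The resulting counts separate, for example, calls of `copy_like` with a 64-byte length from calls with a 4096-byte length, which is the per-argument breakdown of performance data that the example describes.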

Example 16 provides a non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method comprising: configuring a counter contained in the processor for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value; configuring a PEBS processor circuit to generate a PEBS record after at least one overflow and to store the PEBS record in a PEBS memory buffer, the PEBS record containing at least one stack entry read from a stack after the at least one overflow; enabling, by a PEBS enable circuit, the PEBS processor circuit to generate and store the PEBS record after the at least one overflow; generating, by the PEBS processor circuit, the PEBS record and storing the PEBS record in the PEBS memory buffer; and storing the contents of the PEBS memory buffer to a PEBS tracking file.

Example 17 includes the subject matter of example 16. In this example, the PEBS record further includes architecture metadata for the processor and the register state of the processor.

Example 18 includes the subject matter of any one of examples 16-17. In this example, configuring the counter included in the processor to count occurrences of events in the processor and to overflow when the count of occurrences reaches the specified value includes: programming an event selection (ES) control with an event identifier corresponding to the selected event; and configuring the PEBS enable circuit to cause the PEBS processor circuit to generate and store a PEBS record when the count of occurrences of events in the processor reaches the specified value.

Example 19 includes the subject matter of any one of examples 16-18. In this example, the event is a precise event, and the method further comprises: configuring a second counter included in the processor for generating a second count of occurrences of non-precise events in the processor and for overflowing when the second count of occurrences reaches a second specified value; and configuring an NPEBS processor circuit to generate an NPEBS record after at least one second overflow of the second counter, and to store the NPEBS record in the PEBS memory buffer, the NPEBS record containing at least one stack entry read from the stack after the at least one second overflow of the second counter.

Example 20 includes the subject matter of any one of examples 16-19. The example further includes post-processing the PEBS tracking file, the post-processing including: fetching an instruction pointer from a PEBS record in the PEBS tracking file; mapping the instruction pointer to symbol information; determining a function name and a calling convention of the function pointed to by the instruction pointer; retrieving function arguments from the PEBS tracking file; and breaking down the performance data per function call using the function arguments.

Example 21 provides a system, comprising: a system memory, a processor, the processor comprising: a counter for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value; a PEBS processor circuit for generating a PEBS record and storing the PEBS record in a PEBS memory buffer, the PEBS record comprising at least one stack entry reflecting the state of the processor; and a PEBS enable circuit coupled to the counter and to the PEBS processor circuit, the PEBS enable circuit to enable the PEBS processor circuit to generate a PEBS record and store the PEBS record to the PEBS memory buffer.

Example 22 includes the subject matter of example 21. In this example, the PEBS record further includes architecture metadata for the processor and the register state of the processor.

Example 23 includes the subject matter of any one of examples 21-22. The example further includes an event selection register to be programmed with an event identifier corresponding to the event, and a programmable PEBS configuration register to specify the contents of the PEBS record.

Example 24 includes the subject matter of any one of examples 21-23. The example further includes: a second counter included in the processor for generating a second count of occurrences of non-precise events in the processor and for overflowing when the second count of occurrences reaches a second specified value; an NPEBS processor circuit for generating an NPEBS record and storing the NPEBS record in the PEBS memory buffer, the NPEBS record including at least one stack entry reflecting a state of the processor; and an NPEBS enable circuit coupled to the second counter and to the NPEBS processor circuit, the NPEBS enable circuit operable to enable the NPEBS processor circuit to generate an NPEBS record and store the NPEBS record to the PEBS memory buffer when the second counter reaches the second specified value.

Example 25 includes the subject matter of any one of examples 21-24. In this example, the event is a non-precise event.

Example 26 includes the subject matter of any one of examples 21-25. In this example, the specified value includes a zero value when the counter is decremented from a positive starting value, the specified value includes a zero value when the counter is incremented from a negative starting value, and the specified value includes a positive value when the counter is incremented from a zero starting value.

Example 27 includes the subject matter of any one of examples 21-26. The example further includes an interface to a second memory, and the PEBS memory buffer is to be stored into a PEBS tracking file contained in the second memory.

Example 28 includes the subject matter of any one of examples 21-27. In this example, the PEBS memory buffer comprises a cache memory contained in the processor, and the second memory comprises memory external to the processor.

Example 29 includes the subject matter of any one of examples 21-28. In this example, the PEBS memory buffer includes a memory external to the processor and coupled to the processor through a memory controller hub, and the second memory includes a data store external to the processor and coupled to the processor through an input/output (I/O) controller hub.

Example 30 provides a processor comprising: means for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value; means for generating a PEBS record and storing the PEBS record in a PEBS memory buffer, the PEBS record comprising at least one stack entry reflecting the state of the processor; means for enabling the means for generating and storing the PEBS record in the PEBS memory buffer to generate and store the PEBS record in the PEBS memory buffer.

Example 31 includes the subject matter of example 30. In this example, the PEBS record further includes architecture metadata for the processor and the register state of the processor.

Example 32 includes the subject matter of any one of examples 30-31. The example further includes: means for programming an event selection register with an event identifier corresponding to the event; and means for programming a PEBS configuration register to specify the contents of the PEBS record.

Although some embodiments herein relate to data handling and distribution in the context of hardware execution units and logic circuits, other embodiments may be implemented by way of data or instructions stored on a non-transitory machine-readable tangible medium, which, when executed by a machine, cause the machine to perform functions consistent with at least one embodiment. In one embodiment, the functions associated with embodiments of the present disclosure are embodied in machine-executable instructions. The instructions may be used to cause a general-purpose or special-purpose processor that is programmed with the instructions to perform the steps of the at least one embodiment. Embodiments of the invention may be provided as a computer program product or software that may include a machine-readable or computer-readable medium having stored thereon computer-executable instructions that may be used to program a computer (or other electronic device) to perform one or more operations according to an embodiment. Alternatively, the steps of the embodiments may be performed by specific hardware components that contain fixed-function logic for performing the steps, or by any combination of programmed computer components and fixed-function hardware components.

Instructions used to program logic to perform at least one embodiment may be stored in a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions may be distributed via a network or by way of other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to: floppy diskettes, optical disks, compact disc read-only memory (CD-ROM), magneto-optical disks, read-only memory (ROM), random-access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, a non-transitory computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).

Claims

1. A processor, comprising:

a counter for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value;

a PEBS processor circuit for generating a PEBS record and storing the PEBS record in a PEBS memory buffer, the PEBS record comprising at least one stack entry reflecting a state of the processor; and

a PEBS enable circuit coupled to the counter and to the PEBS processor circuit, the PEBS enable circuit to enable the PEBS processor circuit to generate the PEBS record and store the PEBS record to the PEBS memory buffer.

2. The processor of claim 1, wherein the PEBS record further comprises architecture metadata of the processor and a register state of the processor.

3. The processor of claim 1,

further comprising an event selection register to be programmed with an event identifier corresponding to the event; and

further comprising a programmable PEBS configuration register for specifying the contents of the PEBS record.

4. The processor of any one of claims 1-3, further comprising:

a second counter included in the processor for generating a second count of occurrences of non-precise events in the processor and for overflowing when the second count of occurrences reaches a second specified value;

an NPEBS processor circuit to generate an NPEBS record and store the NPEBS record in the PEBS memory buffer, the NPEBS record comprising at least one stack entry reflecting a state of the processor; and

an NPEBS enable circuit coupled to the second counter and to the NPEBS processor circuit, the NPEBS enable circuit operable to enable the NPEBS processor circuit to generate the NPEBS record and store the NPEBS record to the PEBS memory buffer when the second counter reaches the second specified value.

5. The processor of any one of claims 1-3, wherein the event is a non-precise event.

6. The processor of any one of claims 1-3, wherein the specified value comprises a zero value when the counter is decremented from a positive starting value, the specified value comprises a zero value when the counter is incremented from a negative starting value, and the specified value comprises a positive value when the counter is incremented from a zero starting value.

7. The processor of any one of claims 1-3, further comprising an interface to a second memory, the PEBS memory buffer to be stored into a PEBS tracking file contained in the second memory.

8. The processor of claim 7, wherein the PEBS memory buffer comprises a cache memory contained in the processor and the second memory comprises a memory external to the processor.

9. The processor of claim 7, wherein the PEBS memory buffer comprises a memory external to the processor and coupled to the processor through a memory controller hub, and the second memory comprises a data store external to the processor and coupled to the processor through an input/output controller hub.

10. A method of processing, comprising:

configuring a counter contained in a processor for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value;

configuring a PEBS processor circuit for generating a PEBS record after at least one overflow and for storing the PEBS record in a PEBS memory buffer, the PEBS record containing at least one stack entry read from a stack after the overflow;

enabling, by a PEBS enable circuit, the PEBS processor circuit to generate and store the PEBS record after the at least one overflow;

generating, by the PEBS processor circuit, the PEBS record after the at least one overflow and storing the PEBS record in the PEBS memory buffer; and

storing the contents of the PEBS memory buffer to a PEBS tracking file in a memory.

11. The processing method of claim 10, wherein the PEBS record further comprises architecture metadata of the processor and a register state of the processor.

12. The processing method of claim 10, wherein configuring the PEBS processor circuit to generate and store the PEBS record comprises programming a PEBS configuration register to specify the contents of the PEBS record.

13. The processing method of any one of claims 10-12,

wherein the event is a precise event; and

the method further comprises:

configuring a second counter contained in the processor for generating a second count of occurrences of non-precise events in the processor and for generating a second overflow when the second count of occurrences reaches a second specified value; and

configuring an NPEBS processor circuit for generating an NPEBS record after at least one second overflow, and for storing the NPEBS record in the PEBS memory buffer, the NPEBS record containing at least one stack entry read from the stack after the at least one second overflow of the second counter.

14. The processing method of claim 13, wherein the PEBS processor circuit and the NPEBS processor circuit share at least some hardware.

15. The processing method of any of claims 10-12, further comprising post-processing the PEBS tracking file, the post-processing comprising:

fetching an instruction pointer from a PEBS record in the PEBS tracking file;

mapping the instruction pointer to symbol information;

determining a function name and a calling convention of the function pointed to by the instruction pointer;

retrieving function arguments from the PEBS tracking file; and

breaking down the performance data per function call using the function arguments.

16. A machine readable medium comprising code which when executed causes a machine to perform the processing method of any of claims 10-15.

17. A processing system, comprising:

a system memory;

a processor, the processor comprising:

a PEBS enable circuit coupled to the counter and to the PEBS processor circuit, the PEBS enable circuit to enable the PEBS processor circuit to generate a PEBS record and store the PEBS record to the PEBS memory buffer.

18. The processing system of claim 17, wherein the PEBS record further comprises architecture metadata of the processor and a register state of the processor.

19. The processing system of claim 17,

20. The processing system of any of claims 17-19, further comprising:

a second counter included in the processor for generating a second count of occurrences of imprecise events in the processor and for overflowing when the second count of occurrences reaches a second specified value;

21. The processing system of any of claims 17-19, wherein the event is a non-precise event.

22. The processing system of any of claims 17-19, wherein the specified value comprises a zero value when the counter is decremented from a positive starting value, the specified value comprises a zero value when the counter is incremented from a negative starting value, and the specified value comprises a positive value when the counter is incremented from a zero starting value.

23. The processing system of any of claims 17-19, further comprising an interface to a second memory, the PEBS memory buffer to be stored into a PEBS tracking file contained in the second memory.

24. A processor, comprising:

means for counting occurrences of events in the processor and for overflowing when the count of occurrences reaches a specified value;

means for generating a PEBS record and storing the PEBS record in a PEBS memory buffer, the PEBS record comprising at least one stack entry reflecting a state of the processor; and

means for enabling the means for generating and storing PEBS records into a PEBS memory buffer to generate and store the PEBS records into the PEBS memory buffer.

25. The processor of claim 24,

wherein the PEBS record further comprises architecture metadata for the processor and a register state for the processor; and

wherein the processor further comprises:

means for programming an event selection register with an event identifier corresponding to the event; and

means for programming a PEBS configuration register to specify the contents of the PEBS record.