The good old days, when Moore's Law guaranteed a stable and transparent computing performance gain each and every year, are over. Processor clock speed cannot be increased any more and, even if it could, it would not help much: memory is still far behind in terms of speed, it will not catch up with the processor in the near future, and it would be the bottleneck. Programs are performance-greedy and, as they get larger and more complex, they require improved and faster hardware to run properly. The hardware improvements available today include multiple processors, multiple cores, and NUMA architectures. Although all of these are very promising, they are definitely non-transparent for programmers, for at least two reasons: first, programmers need to write multi-threaded code; second, as the limited hardware resources (caches, bus, main memory) are shared among cores and processors, programmers need to constantly monitor how their programs use these resources in order to avoid bottlenecks and speed up performance.

Here we focus instead on monitoring the performance of single-threaded programs: finding problems and inefficiencies in the code, optimizing it, and getting the most out of today's hardware. All the research done concerns only the most recent Intel Core processor families. Most modern, high-performance processors have special on-chip hardware that monitors processor performance. Data collected by this hardware provides performance information on applications, the operating system, and the processor. These data can guide performance-improvement efforts by providing information that helps programmers tune the algorithms used by the applications and the operating system, and the code sequences that implement those algorithms. To facilitate simpler development and maintenance of performance-monitoring tools, an architecturally defined approach for software agents to interact with the PMU has been introduced.
This approach is known as architectural performance monitoring (PerfMon), as opposed to the traditional model-specific performance monitoring. The purpose of architectural PerfMon is to provide a functionally and logically consistent set of capabilities, with a consistent hardware interface, that developers can rely on now and in the future. It provides the ability to non-intrusively measure performance-related data and scenarios.
1.1 What is PerfMon?

PerfMon is the short form of performance monitoring: the ability to non-intrusively measure performance-related data and scenarios. Its main usage is to enable software writers to avoid bad scenarios where possible. Performance-monitoring features provide data that describe how an application and the operating system are performing on the processor; this information can guide efforts to improve performance. Performance monitoring was introduced in the Pentium processor with a set of model-specific performance-monitoring counter MSRs. These counters permit selection of processor performance parameters to be monitored and measured. The information obtained from these counters can be used for tuning system and compiler performance. In the Intel P6 family of processors, the performance-monitoring mechanism was enhanced to permit a wider selection of events to be monitored and to allow greater control over the events being monitored. Next, the Pentium 4 and Intel Xeon processors introduced a new performance-monitoring mechanism and a new set of performance events.
A system-level pre-silicon validation environment should relieve test writers from maintaining or keeping track of the data in test code, to make the task easily manageable for multi-ported systems. Keeping track of the data becomes arduous when different agents in a system interact on the same address segments.

2.1.3 Automated test generation

The test creation and/or generation methodology is critical in building a system-level pre-silicon validation environment capable of generating real-world-like stimuli. A dynamic test generator and checker are more effective in creating very interesting, reactive test sequences. An automated test generation tool should be capable of handling directed testing, pseudo-random testing, and reactive testing. In directed testing, users specify the sequence of events to generate; this is efficient for verifying known cases and conditions. Pseudo-random testing is useful in uncovering unknown conditions or corner cases. Pseudo-random test generation, where transactions are generated from user-defined constraints, can be interspersed with blocks of directed sequences of transactions at periodic intervals to re-create real-life traffic scenarios in a pre-silicon validation environment. Dynamic test generation also facilitates reactive test generation, which implies a change in test generation when a monitored event is detected during simulation.

2.1.4 Robust, high-quality validation intellectual property (IP)

The quality of validation is greatly enhanced with robust, high-quality validation IP, which includes such items as bus functional models (BFMs) and protocol monitors. The project group that develops the RTL must not create the validation IP used to verify the RTL.

2.1.5 Reusing the validation environment

A good validation strategy must allow re-use of the tests and the validation environment for successive revisions.
Since test creation is one of the most time-consuming and labor-intensive parts of the validation process, designers should consider leveraging the validation environment and the test suites already developed on subsequent projects.
2.2 Major Components of a Validation Environment

The major components of a validation environment are listed below.

2.2.1 Bus functional models (BFMs)

BFMs drive the generated input stimulus into the DUT (design under test). Intelligent BFMs provide a transaction-level API (application programming interface) and are designed to handle concurrency and parallelism, which makes them suitable for use in an automated test generation environment. They also offer a high degree of controllability over the model behavior, emulating a real device with real operating characteristics through programmable delay registers and configuration registers.

2.2.2 Bus protocol monitors and checkers

The bus protocol monitors provide dynamic protocol checking and can be used in automated test generation environments. They provide dynamic bus state information, which can be used to provide dynamic feedback to user tests or automated test controllers. The bus protocol checkers check whether transactions are happening according to the protocol or not. The protocol checker checks only the protocol, not the data.

2.2.3 Test stimulus generator

The intelligent test generator utilizes transaction generators to create constraint-based concurrent sequences of transactions at the different interfaces of the DUT. The controller can generate transactions pseudo-randomly, from a user-specified sequence, or as a mix of both. It can also perform specific tasks or dynamically reload input constraints upon a certain event occurring during simulation.

2.2.4 Data checking

The data checker receives data from the DUT output interface. Based on the input stimuli provided, the data checker checks whether the output data is correct or not. A block diagram of the validation environment is shown in Figure 2.1.
[Figure: Test Generator, BFM, DUT, and Data Checker blocks]
Figure 2.1: Components of Validation Environment
2.3 Summary

Validation is a critical and time-consuming activity that determines the correctness of the design, so the validation environment must be reusable. Concurrency, automatic test stimulus generation, and result checking are some of the requirements of a validation environment.
Instruction Queue and Decode Unit

The Instruction Queue and Decode Unit decodes up to four instructions per cycle, or up to five with macro-fusion. The instruction queue is 18 instructions deep. It sits between the instruction predecode unit and the instruction decoders. It sends up to five instructions per cycle and supports one macro-fusion per cycle. It also serves as a loop cache for loops smaller than 18 instructions, enabling some loops to be executed with both higher bandwidth and lower power.

3.1.3 Out-of-order superscalar execution core

The execution core of the Intel Core microarchitecture is superscalar and can process instructions out of order. When a dependency chain causes the machine to wait for a resource (such as a second-level data cache line), the execution core executes other instructions. This increases the overall rate of instructions executed per cycle (IPC). The execution core contains three major components: a Reservation Station, a Reorder Buffer, and a Renamer.

Reservation Station (RS): Queues uops until all source operands are ready, then schedules and dispatches ready uops to the available execution units. The RS has 32 entries. The initial stages of the out-of-order core move uops from the front end to the ROB and RS. In this process, the out-of-order core carries out the following steps:
1. Allocates resources to uops.
2. Binds each uop to an appropriate issue port.
3. Renames sources and destinations of uops, enabling out-of-order execution.
4. Provides data to the uop when the data is either an immediate value or a register value that has already been calculated.

Renamer: Moves uops from the front end to the execution core. Architectural registers are renamed to a larger set of microarchitectural registers. Renaming eliminates false dependencies, known as write-after-write and write-after-read hazards.
Reorder Buffer (ROB): Holds uops in various stages of completion, buffers completed uops, updates the architectural state in order, and manages the ordering of exceptions. The ROB has 96 entries to handle instructions in flight.
Figure 3.1: Architecture Block Diagram

3.2 Cache and Memory Subsystem

The microarchitecture contains an instruction cache, a first-level data cache, and a second-level unified cache in each core. Each physical processor contains several processor cores and a shared collection of subsystems referred to as the uncore, including a unified
third-level cache shared by all cores in the physical processor, and the Intel QuickPath Interconnect links and associated logic. The L1 and L2 caches are write-back and non-inclusive. The shared L3 cache is write-back and inclusive, such that a cache line that exists in the L1 data cache, L1 instruction cache, or unified L2 cache also exists in L3. This minimizes snoop traffic between processor cores.

The microarchitecture implements two levels of translation lookaside buffer (TLB). The first level consists of separate TLBs for data and code. The DTLB0 handles address translation for data accesses; it provides 64 entries for 4KB pages and 32 entries for large pages. The ITLB provides 64 entries (per thread) for 4KB pages and 7 entries (per thread) for large pages. The second-level TLB (STLB) handles both code and data accesses for 4KB pages; it services 4KB-page translations that missed the DTLB0 or ITLB. All entries are 4-way associative.

3.3 Hyper-Threading Technology

Hyper-Threading Technology (HT) provides two logical processors sharing most execution and cache resources in each core. The HT implementation in the new Intel microarchitecture is much improved compared to previous generations: the new microarchitecture has a wider execution engine and more functional execution units, supports higher peak memory bandwidth, has larger instruction buffers, and replicates or partitions almost all the resources needed by the instructions of each hardware thread (replicated: register state, renamed return stack buffer, and large-page ITLB; partitioned: load buffers, store buffers, reorder buffers, and small-page ITLB), with the only exception being the execution units.

3.4 Core Out-of-Order Pipeline

The basic analysis methodology starts with an accounting of the cycle usage for execution.
The out-of-order execution can be considered from the perspective of a simple block diagram (Figure 3.2). After instructions are decoded into the executable micro-operations (uops), they are assigned their required resources. They can only be issued to the downstream stages when there are sufficient free resources, which include (among other requirements):
1. Space in the Reservation Station (RS), where the uops wait until their inputs are available
2. Space in the Reorder Buffer (ROB), where the uops wait until they can be retired
3. Sufficient load and store buffers in the case of memory-related uops (loads and stores)
Retirement and write-back of state to visible registers is only done for instructions and uops that are on the correct execution path. Instructions and uops on incorrectly predicted paths are flushed upon identification of the misprediction, and the correct paths are then processed. Retirement of correct-path instructions can proceed when two conditions are satisfied:
1. The uops associated with the instruction to be retired have completed, allowing the retirement of the entire instruction, or, in the case of instructions that generate a very large number of uops, enough to fill the retirement window
2. Older instructions and their uops on correctly predicted paths have retired

The mechanics of following these requirements ensures that the visible state is always consistent with in-order execution of the instructions. The magic of this design is that if the oldest instruction is blocked, for example waiting for the arrival of data from memory, younger independent instructions and uops whose inputs are available can be dispatched to the execution units and warehoused in the ROB upon completion. They then retire when all the older work has completed. The terms issued, dispatched, executed, and retired have very precise meanings as to where in this sequence they occur and are used in the event names to help document what is being measured.
In the Intel Core processor, the reservation station has 36 entries, which are shared between the hyper-threads when that mode (HT) is enabled in the BIOS, with some entries reserved for each thread to avoid locking. If HT is not enabled, all 36 entries can be available to the single running thread. There are 128 positions in the reorder buffer, which are again divided if HT is enabled, or entirely available to the single thread if HT is not enabled. As on Core processors, the RS dispatches the uops to one of 6 dispatch ports, where they are consumed by the execution units. This implies that on any cycle, between 0 and 6 uops can be dispatched for execution.

The hardware branch prediction requests the bytes of instructions for the predicted code paths from the 32KB L1 instruction cache at a maximum bandwidth of 16 bytes/cycle. Instruction fetches are always 16-byte aligned, so if a hot code path starts on the 15th byte, the front end will receive only 1 byte on that cycle. This can aggravate instruction-bandwidth issues. Instructions are referenced by virtual address and translated to physical address with the help of a 128-entry instruction translation lookaside buffer (ITLB). The x86 instructions are decoded into the processor's uops by the pipeline front end. Four instructions can be decoded and issued per cycle.

If the branch prediction hardware mispredicts the execution path, the uops from the incorrect path that are in the instruction pipeline are simply removed where they are, without stalling execution. This reduces the cost of branch mispredictions. Thus the cost associated with such mispredictions is only the wasted work associated with any of the incorrect-path uops that actually got dispatched and executed, plus any cycles that are idle while the correct-path instructions are located, decoded, and inserted into the execution pipeline.
application and the operating system. These advantages often make hardware performance monitoring the preferred, and sometimes only, choice for collecting processor performance data.

Performance-monitoring hardware typically has two components: performance event detectors and event counters. Users can configure performance event detectors to detect any one of several performance events (for example, cache misses or branch mispredictions). Often, event detectors have an event mask field that allows further qualification of the event: for example, by the processor's privilege mode (user/supervisor), to separate events generated by application code from those generated by operating system code, or by a cache line's specific state (modified, shared, exclusive, or invalid), to filter accesses to the L2 cache. Further configuration is usually possible by enabling event counters only under certain edge and threshold conditions. The edge-detection feature is most often used for events that detect the presence or absence of certain conditions every cycle, such as a pipeline stall. The threshold feature lets the event counter compare the value the event detector reports each cycle to a threshold value and increment the counter only when the reported value exceeds the threshold. The threshold feature is only useful for performance events that can report values greater than one in a cycle; for example, with an Instructions Completed event and a threshold of two, the counter counts the number of cycles in which three or more instructions completed.

4.2 Performance event monitoring

Performance events can be grouped into five categories: program characterization, memory accesses, pipeline stalls, branch prediction, and resource utilization. Program characterization events help define the attributes of a program (and/or the operating system) that are largely independent of the processor's implementation.
The most common examples of these events are the number and type of instructions (for example, loads, stores, floating point, branches, and so on) completed by the program. Memory access events often comprise the largest event category and aid performance analysis of the processor's memory hierarchy. For example, memory events can count references and misses to various caches and transactions on the processor memory bus. Pipeline stall event information helps users analyze how well the program's instructions flow through the pipeline. Processors with deep pipelines rely heavily on branch prediction hardware to keep the pipeline filled with useful instructions.
Branch prediction events let users analyze the performance of branch prediction hardware (for example, by providing counts of mispredicted branches). Resource utilization events let users monitor how often a processor uses certain resources (for example, the number of cycles spent using a floating-point divider).

4.2.1 Performance-monitoring hardware

Performance-monitoring hardware typically has two components: performance event detectors and event counters. By properly configuring the event detectors and counters, users can obtain counts of a variety of performance events under various conditions. Users can configure performance event detectors to detect any one of several performance events (for example, cache misses or branch mispredictions). Often, event detectors have an event mask field that allows further qualification of the event. For example, the Intel Pentium III's event for counting load accesses to the level-2 cache (L2_LD) has an event mask that lets event detectors monitor only accesses to cache lines in a specific state (modified, shared, exclusive, or invalid).

The event detector configuration also allows qualification by the processor's current privilege mode. Operating systems use supervisor and user privilege modes to prevent applications from accessing and manipulating critical data structures and hardware that only the operating system should use directly. When the operating system is executing on the processor, the privilege mode is supervisor; when an application is executing, the privilege mode is user. As such, the ability to qualify event detection by the processor's privilege mode allows counting of events caused only by the operating system or only by an application. Configuring the event detector to detect events in both privilege modes counts all events.
In addition to counting events detected by the performance event detectors, users can configure performance event counters to count only under certain edge and threshold conditions. The edge-detection feature is most often used for performance events that detect the presence or absence of certain conditions every cycle. For these events, an event count of one represents a condition's presence and zero indicates its absence. For example, a pipeline stall event indicates the presence or absence of a pipeline stall on each cycle. Counting these events gives the number of cycles that the pipeline stalled, whereas the edge-detection feature can count the number of stalls (more specifically, the number of times a stall began) rather than the total number of cycles stalled.
With edge detect enabled, the performance counter increments by one only when the number of performance events reported by the event detector on the previous cycle is less than the number currently being reported. So when the event detector reports zero events on one cycle followed by one event on the next cycle, the event counter has detected a rising edge and increments by one. It is usually possible to invert the sense of the edge detection to count falling edges. For stall events, disabling the edge-detection feature counts stall durations, and enabling edge detection counts the number of stalls. Dividing the total stall duration by the number of stalls gives the average number of cycles stalled for a particular stall condition.

The event counter's second major feature is threshold support. This capability lets the event counter compare the value it reports each cycle to a threshold value. If the reported value exceeds the threshold, the counter increments by one. The threshold feature is only useful for performance events that can report values greater than one in a cycle. For example, superscalar processors can complete more than one instruction per cycle. Selecting instructions completed as the performance event and setting the counter threshold to two increments the counter by one whenever three or more instructions complete in one cycle, providing a count of how many times three or more instructions completed per cycle.
Figure 4.1: The general structure of the Pentium 4 event counters and detectors
4.2.2 Performance profiles

Although performance event detectors and counters can easily detect the presence of a performance problem and let the user estimate its severity, it is often necessary to find the locations in the code (whether in the application or the operating system) that are causing the problem. Knowing the source of the performance problem lets programmers alter the high-level algorithms used by the application and/or the low-level code to avoid or reduce the problem's impact. To illustrate how performance counters can help create a profile that identifies the major sources of performance problems, let us first review the goals and techniques used to create time-based profiles.

4.2.2.1 Time-based profiles

A common technique to identify areas upon which to focus tuning efforts is to obtain a time-based profile of the application. A time-based profile estimates the percentage of time an application spends in its major sections. Focusing tuning efforts on the application's most frequently executed sections maximizes the benefit of performance tuning changes made to the code. A time-based profile relies upon interrupting an application's execution at regular time intervals. During each interrupt, the interrupt service routine saves the value of the program counter. Once the application completes, the user can create a histogram that shows the number of samples collected for each program counter value. Assuming that the histogram draws from many program counter samples, it will show the application's most frequently executed sections.

4.2.2.2 Event-based profiles

A technique similar to that for creating a time-based profile can help collect an event-based profile. An event-based profile is a histogram that plots performance event counts as a function of code location.
Instead of interrupting the application at regular time intervals (as is done to create a time-based profile), the performance-monitoring hardware interrupts the application after a specific number of performance events have occurred. So just as a time-based profile indicates the most frequently executed instructions, an event-based profile indicates the most frequently executed instructions that cause a given performance event. To support event-based sampling (EBS), performance-monitoring hardware typically generates a performance monitor interrupt when a performance event counter overflows. To generate an interrupt after N performance events, the performance counter is initialized to a value
of overflow minus N before being enabled. A performance monitor interrupt service routine (ISR) handles these interrupts. The ISR saves sample data from the program (for example, the program counter) and re-enables the performance event counter to cause another interrupt after N occurrences of the performance event. After the application finishes executing, the user can plot the data samples saved by the ISR to create an event-based profile.
Table 5.1: IA32_PERF_GLOBAL_CTRL Programming

5.2 IA32_PERF_GLOBAL_STATUS

This register indicates the overflow status of each of the fixed and programmable counters. The upper bits provide additional status information about the PerfMon facilities. A set bit indicates that an overflow has occurred in the corresponding counter. Overflow status bits in this register are cleared by writing to the IA32_PERF_GLOBAL_OVF_CTRL register. Status bit indications in this register have no effect on interrupts or pending interrupts.
5.3 IA32_PERF_GLOBAL_OVF_CTRL

The IA32_PERF_GLOBAL_OVF_CTRL register provides software the ability to clear status bits set in the IA32_PERF_GLOBAL_STATUS register, described in the preceding section. This is a write-only register. To clear overflow or condition-change status in the global status register, software must write the corresponding bits of this register to binary one.
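A hedged sketch of the read-status/write-to-clear sequence follows. Here `rdmsr()` and `wrmsr()` are stand-in stubs for the privileged MSR accessors (real code runs in ring 0 or through a driver), the MSR addresses follow the Intel SDM, and the simulated status value is invented.

```c
/* Sketch of clearing overflow status via IA32_PERF_GLOBAL_OVF_CTRL.
   rdmsr()/wrmsr() below are simulation stubs, not real MSR access. */
#include <stdint.h>

#define IA32_PERF_GLOBAL_STATUS   0x38E  /* MSR addresses per Intel SDM */
#define IA32_PERF_GLOBAL_OVF_CTRL 0x390

/* Invented state: PMC0 and fixed counter 0 have overflowed. */
uint64_t fake_status = (1ULL << 0) | (1ULL << 32);

uint64_t rdmsr(uint32_t msr) {
    return msr == IA32_PERF_GLOBAL_STATUS ? fake_status : 0;
}
void wrmsr(uint32_t msr, uint64_t v) {
    if (msr == IA32_PERF_GLOBAL_OVF_CTRL)
        fake_status &= ~v;   /* writing 1s to OVF_CTRL clears matching status bits */
}

/* Clear every overflow condition currently indicated in GLOBAL_STATUS. */
void clear_all_overflows(void) {
    uint64_t status = rdmsr(IA32_PERF_GLOBAL_STATUS);
    wrmsr(IA32_PERF_GLOBAL_OVF_CTRL, status);
}
```

Reading the status first and echoing it back clears exactly the bits that were set, without disturbing counters that have not overflowed.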
Figure 5.4: PerfEvtSelX MSR

5.5 PERF_FIXED_CTRX and IA32_PMCX Registers

Each counter register is 48 bits long. Counter registers can be cleared, or pre-loaded with count values as desired. The latter method is often used to set the point at which the counter will overflow, which is useful in event-based sampling. When writing the programmable counters using the wrmsr instruction, bits 32 through 47 cannot be written directly; they are sign-extended based on the value written to bit 31 of the counter register. When using the PEBS facility to reload the programmable counters, the entire 48-bit value is loaded from the DS Buffer Management area without any sign extension. Previous implementations of the Intel Core architecture contained counters that were limited to 40 bits in length; this implementation provides counters 48 bits in length. The counter width can be enumerated using the CPUID instruction.
The table above summarizes the differences between the PMU features of the new Intel architecture, which is the same as the Nehalem core (NHM), and those of previous products in the Intel Core and Pentium 4 processor families. Nehalem adheres to Architectural Performance Monitoring Version 3. The table includes architectural and non-architectural features; architectural features are listed at the top of the table with intercepts highlighted.

Intel processor cores have for many years included a Performance Monitoring Unit (PMU). This unit provides the ability to count the occurrence of micro-architectural events, which expose some of the inner workings of the processor core as it executes code. One usage of this capability is to create a list of events from which certain performance metrics can be calculated. Software configures the PMU to count events over an interval of time and report the resulting event counts. Using this methodology, performance analysts can characterize overall system performance.

The Nehalem core supports event counting with seven event counters. Three of these are fixed-function counters; the events counted by each of these counters are fixed in hardware. Software can determine whether counting is enabled during user or supervisor code execution, or both. The four remaining counters are programmable and can be configured to count a variety of events, with some restrictions on individual counters. Fixed counters are controlled by bit fields in a global control register. Programmable counters are controlled by a separate control register, one for each counter. PMU resources are available, and must be programmed, for each hardware thread (logical processor) if threading is enabled; otherwise they are programmed per core. PMU resources available in each thread do not accumulate to the core when hardware threading is disabled, so the PMU programming model remains consistent in either case.
To successfully program all PMU resources, software must affinitize itself to each processor the operating system exposes. Counter registers are 48 bits wide. Writing a binary one to any reserved bit in any counter or counter-control register is undefined and may cause a general protection fault.
categories: data-load-related stalls, floating-point exceptions, cycles stalled due to long-latency division and/or square-root operations, instruction-fetch-related stalls, and stalls due to jumps and branches.

The aims of software optimization are therefore:
1. To bring the stalled cycles close to 0%, for example by improving code and data locality.
2. To do the same for cycles that are not retiring uops, by minimizing branches or using more predictable branching.
3. To reduce the number of cycles that are retiring uops, by using vector instructions where possible and, of course, by using faster and more efficient algorithms.

Doing so will result in fewer total cycles and therefore a faster application.
To give a more detailed overview of cycle accounting analysis, let us now see how it works in detail for the Intel Core microarchitecture. The first and most important event count for performance evaluation is the number of total clock cycles needed by an application to successfully complete its execution. This metric can be measured by the event CPU_CLK_UNHALTED.CORE (aka UNHALTED_CORE_CYCLES). It is the most important metric
because it is the only one that has to be considered at the end of any optimization process to see whether we did a good job or not. All other events must be taken into account only with the aim of eventually reducing the UNHALTED_CORE_CYCLES event count.

As all the cycles used by an application can be (roughly) divided into cycles not issuing uops and cycles issuing uops, we need a way to calculate them using performance counters. It turns out that the only event we need to monitor in this case is called RS_UOPS_DISPATCHED: the number of uops dispatched by the Reservation Station (RS) to the various execution ports.

One useful feature of the Intel Core microarchitecture's PMU is the counter mask (CMASK): when it is set to a value larger than zero, say n, it tells the counter to count the number of cycles (and NOT the number of events) during which the monitored event occurred at least n times. Therefore, if we wanted to know, for instance, how many cycles the Reservation Station dispatched at least 2 uops (in one single cycle), we would monitor RS_UOPS_DISPATCHED with the CMASK set to 2. In our case we just need to know how many cycles the Reservation Station dispatched any number (greater than 0) of uops, so we set the CMASK to 1. We thus have:

Cycles issuing uops = RS_UOPS_DISPATCHED (CMASK = 1)

Another interesting feature of Intel's PMU is the INV bit, which defaults to 0 but can also be set to 1. When the INV bit is set to 1 (and the CMASK is set to some n greater than 0), cycles are counted only when the monitored event occurs fewer than n times. So, since we need to know how many cycles the RS did NOT issue ANY uops, we also set the CMASK to 1 (because we are still counting cycles, not uops) and the INV bit to 1. Therefore:

Cycles not issuing uops = RS_UOPS_DISPATCHED (CMASK = 1, INV = 1)
So the total number of cycles can be approximated as:

Total cycles ≈ Cycles issuing uops + Cycles not issuing uops

There is no exact equality here because a few situations are not properly captured by this analysis, such as whether the RS is full or empty, or transient situations of the RS
being empty while some in-flight uops are still getting retired. Nevertheless, the following equation should hold within a (small) error:
UNHALTED CORE CYCLES = RS UOPS DISPATCHED (CMASK = 1) + RS UOPS DISPATCHED (CMASK = 1 && INV = 1)

The uops that are issued for execution are not necessarily retired. This happens when the uops are part of a speculative execution that turns out to be wrong: mispredicted branches are a good example. Uops that do not reach retirement do not contribute to the forward progress of program execution. Therefore the number of Cycles issuing uops can be further decomposed into Cycles non-retiring uops and Cycles retiring uops.

Unfortunately, there is no event capable of measuring the number of Cycles non-retiring uops directly. We will derive this metric from available performance events and several assumptions. We define the uops rate as:

uops rate = Dispatched uops / Cycles issuing uops

where the quantity Dispatched uops can be measured with the event RS UOPS DISPATCHED (without CMASK and INV). Thus:

uops rate = RS UOPS DISPATCHED / RS UOPS DISPATCHED (CMASK = 1)

Next we define the total number of uops retired as:

Retired uops = UOPS RETIRED.ANY + UOPS RETIRED.FUSED

and we approximate the number of non-retiring uops by:

Non-retired uops = Dispatched uops − Retired uops
Thus, finally:

Cycles non-retiring uops = Non-retired uops / uops rate

The number of cycles retiring uops is easier and can be calculated as:

Cycles retiring uops = Retired uops / uops rate
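Putting the formulas above together, the decomposition can be sketched as a small routine. The function name and all numbers are illustrative placeholders, not a real counter-reading API:

```python
def decompose_issuing_cycles(dispatched, dispatched_cmask1,
                             retired_any, retired_fused):
    """Split cycles issuing uops into non-retiring and retiring parts,
    following the formulas in the text."""
    uops_rate = dispatched / dispatched_cmask1   # avg uops per issuing cycle
    retired = retired_any + retired_fused        # total retired uops
    non_retired = dispatched - retired           # speculative, never retired
    cycles_non_retiring = non_retired / uops_rate
    cycles_retiring = retired / uops_rate
    return cycles_non_retiring, cycles_retiring

# Invented counts: 12M uops dispatched over 6M issuing cycles, 10M retired.
non_ret, ret = decompose_issuing_cycles(12_000_000, 6_000_000,
                                        9_000_000, 1_000_000)
print(non_ret, ret)  # 1000000.0 5000000.0
```

Note that the two parts sum back to the 6M issuing cycles, as the derivation requires.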
We also define the number of cycles stalled as:

Cycles stalled = Cycles not issuing uops
Therefore:

Cycles stalled = RS UOPS DISPATCHED (CMASK = 1 && INV = 1)

This methodology does not take into account situations where retiring and non-retiring uops may be dispatched into the Out-Of-Order (OOO) engine in the same cycle. Nevertheless, this scenario does not occur very often, and the method yields results that are a very good approximation of what happens in reality. So, finally, the three calculated components should sum up to the total number of cycles, i.e.:

Total cycles = Cycles non-retiring uops + Cycles retiring uops + Cycles stalled
So, for optimization purposes we have to keep in mind that:

If the contribution from Cycles non-retiring uops is high, focusing on code layout and reducing branch mispredictions will be important.

If the contribution from Cycles stalled is high, additional drill-down may be necessary to locate bottlenecks that lie deeper in the microarchitecture pipeline.

If the contributions from Cycles non-retiring uops and Cycles stalled are both insignificant, the focus of performance tuning should be directed to code vectorization or other techniques that improve the retirement throughput of hot spots.
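These three rules can be captured mechanically. The 20% threshold below is an assumption chosen for illustration; it is not a value from the text:

```python
def tuning_focus(cycles_non_retiring, cycles_stalled, total_cycles,
                 threshold=0.20):
    """Return the tuning direction suggested by the dominant cycle
    component. The threshold is an illustrative assumption."""
    if cycles_non_retiring / total_cycles > threshold:
        return "code layout and branch misprediction"
    if cycles_stalled / total_cycles > threshold:
        return "drill down into pipeline bottlenecks"
    return "vectorization / retirement throughput"

print(tuning_focus(3_000_000, 1_000_000, 8_000_000))
# code layout and branch misprediction
```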
We should now understand which part of the architecture is stressed by our program's execution and is therefore causing the stalled cycles we just calculated. One thing to note at this point is that the events that cause stalls can be counted using the PMU, but the count obtained is not the number of cycles lost (stalled) because of the event. I will therefore use the concept of impact when talking about the number of cycles stalled due to a particular kind of event. Impacts are easily obtained by multiplying the cycle penalty of a certain kind of event (the number of stalled cycles caused by one occurrence of that event) by the number of events of that kind counted.
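In code, the impact estimate is just a multiplication. The event count below is invented; the 201-cycle figure is the average L2-miss penalty mentioned later in the text:

```python
def stall_impact(event_count, cycle_penalty):
    """Estimated cycles lost to one stall type: number of events times
    the average per-event cycle penalty (an approximation, as noted)."""
    return event_count * cycle_penalty

# e.g. an invented count of 10,000 L2 misses at a ~201-cycle average penalty:
print(stall_impact(10_000, 201))  # 2010000
```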
Figure 6.2: Performance events and where they monitor the uop flow.

The following items discuss several common stress points of the microarchitecture:

Level-2 Cache Miss Impact

The Intel Core microarchitecture has a two-level caching system, meaning that a miss at the second level involves an access to system memory. The latency of system memory varies with different chipsets, but it is generally on the order of more than one hundred cycles. Server chipsets tend to exhibit longer latency than desktop chipsets. The number of L2 cache miss references can be measured by MEM LOAD RETIRED: L2 LINE MISS. An estimation of the overall L2 miss impact calculated by multiplying system memory latency by the
number of L2 misses is only an approximation, because it ignores the OOO engine's ability to handle multiple outstanding load misses:
L2 miss impact = MEM LOAD RETIRED: L2 LINE MISS * system memory latency
Level-2 Cache Hit Impact

When a Level-1 Cache miss occurs, it does not necessarily mean that the processor will find the data in the second-level cache; the required data may be missing from the second-level cache as well. The number of L2 hits can therefore be measured as the difference between the number of Level-1 Data Cache misses and the number of Level-2 Cache misses, i.e.:

Level 2 Cache Hits = MEM LOAD RETIRED: L1D LINE MISS − MEM LOAD RETIRED: L2 LINE MISS
As in the previous case, to obtain the impact we have to multiply this quantity by the Level-2 Cache access latency:
L2 hit impact = Level 2 Cache Hits * Level 2 Cache latency

This formula, just like the one above, does not take into account the OOO engine's ability to handle multiple outstanding load misses.

L1 DTLB Miss Impact

Another cause of CPU stalls is Data Translation Look-aside Buffer (DTLB) misses that occur at the Level-1 Cache. The number of misses is counted by MEM LOAD RETIRED: DTLB MISS. Therefore:
DTLB miss impact = MEM LOAD RETIRED: DTLB MISS * DTLB miss cycle penalty
LCP Impact

LCP stands for Length-Changing Prefix. When instructions of this type are fetched they require the use of the slow instruction decoder. The event ILD STALL measures the number of times the slow decoder was triggered, so:

LCP impact = ILD STALL * LCP cycle penalty
Store Forwarding Stall Impact

When a store-forwarding situation does not meet the address or size requirements imposed by the hardware, a stall occurs. The delay varies for different store-forwarding stall situations. Consequently, there are several performance events that provide fine-grained specificity to detect the different store-forwarding stall conditions. Three components will be analyzed.

A load blocked by a preceding store to an unknown address can be measured by the event LOAD BLOCK: STA, so:

Load block sta impact = LOAD BLOCK: STA * Load block sta cycle penalty

The event LOAD BLOCK: OVERLAP STORE counts the number of load operations blocked because of an actual data overlap with a preceding store, or because of an ambiguous overlap from page aliasing in which the load and a preceding store have the same offset but into different pages. We have:

Load block overlap store impact = LOAD BLOCK: OVERLAP STORE * Load block overlap store cycle penalty

A load spanning across a cache line boundary can be measured by the event LOAD BLOCK: UNTIL RETIRE, so:
Load block until retire impact = LOAD BLOCK: UNTIL RETIRE * Load block until retire cycle penalty
These three contributions sum up to the total number of cycles lost due to store-forwarding problems:
Store forwarding stall impact = Load block sta impact + Load block overlap store impact + Load block until retire impact
In principle, the sum of these five stall contributions should give a result very close to the total number of stalled cycles calculated before:
Cycles stalled = L2 miss impact + L2 hit impact + DTLB miss impact + LCP impact + Store forwarding stall impact
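A sketch of the full bookkeeping, with every event count and penalty invented purely for illustration (real penalties are machine-specific, as the next paragraph explains):

```python
# All counts and penalties below are invented illustrative values.
l2_misses  = 10_000
l1d_misses = 50_000

l2_miss_impact   = l2_misses * 201                # assumed avg memory latency
l2_hit_impact    = (l1d_misses - l2_misses) * 14  # assumed L2 latency
dtlb_miss_impact = 2_000 * 30                     # assumed DTLB penalty
lcp_impact       = 500 * 6                        # assumed LCP penalty
store_fwd_impact = 1_000 * 5                      # sum of 3 load-block parts

estimated_stalled = (l2_miss_impact + l2_hit_impact + dtlb_miss_impact
                     + lcp_impact + store_fwd_impact)
print(estimated_stalled)  # 2638000
```

This estimate would then be compared against the measured Cycles stalled; any large discrepancy points at the error sources discussed next.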
However, this approach has a few problems. First of all, it implies a simplification, since other kinds of stalls may occur besides the five categories we saw. Secondly, the impact in terms of cycles lost for each stall event may carry an error depending on the particular state the machine is in at the moment the event occurs; for instance, sometimes an L2 miss may cause a 160-cycle delay, other times a 250-cycle delay (we have used an average value of 201). Third, sometimes the sum of the impacts of the different stalls exceeds the total number of cycles not issuing uops, meaning that their impact was overestimated or that some of them overlap. Other times the sum is a little smaller than the total number of cycles not issuing uops, meaning that their impact was underestimated or that some other kind of stall occurred and was not taken into account.

Moreover, there are several components which cannot be counted reliably on the Intel Core microarchitecture. These fall into three main classes: stalls due to instruction starvation, stalls due to dependent chains of multi-cycle instructions (other than divide), and stalls related to Front Side Bus saturation. Finally, almost all event counts are approximations of the real events, although they are very good approximations, since the errors are typically below 3%. Nevertheless, even though quantities may be over- or underestimated, they give good insight into which are the main problems to work on within a particular application.
Events Monitored
Instruction_Retired: All - Number of architectural instructions retired. A macro-fused uop is counted as 2 instructions. A REP-prefixed instruction is counted as a single instruction (not once per iteration).
Instruction_Decoded - Instructions decoded this cycle.
Uops_Retired: All - All uops that actually retired (macro-fused = 1, micro-fused = 2, others = 1).
Uops_Issued: Any - Number of uops issued. Counts the number of uops issued by the Register Allocation Table to the Reservation Station.
Uops_Issued: Fused - Micro-fused uops that are issued; a subset of Uops_Issued: Any.
Uops_Executed: Thread - Number of uops to be executed per thread each cycle.
Uops_Dispatched: Core - Number of uops dispatched to an execution unit.
Branch_Instruction_Executed: All - All (macro) branch instructions executed.
Branch_Instruction_Retired: All - All (macro) branch instructions retired.
Branch_Instruction_Retired: Conditional - Conditional branch instructions retired.
Branch_Instruction_Retired: Not taken - Not-taken branch instructions retired.
Branch_Instruction_Retired: Taken - Taken branch instructions retired.
Branch_Instruction_Retired: Return - Return instructions retired.
Branch_Instruction_Retired: Call - Call instructions retired.
Branch_Misprediction_Retired: All - All mispredicted (macro) branch instructions retired.
Branch_Misprediction_Retired: Conditional - Mispredicted conditional branch instructions retired.
Branch_Misprediction_Retired: Not taken - Mispredicted not-taken branch instructions retired (i.e. were mispredicted and not taken).
Branch_Misprediction_Retired: Taken - Mispredicted taken branch instructions retired (i.e. were mispredicted and taken).
Unhalted core cycles - UNHALTED CORE CYCLES = RS UOPS DISPATCHED (CMASK = 1) + RS UOPS DISPATCHED (CMASK = 1 && INV = 1). RS UOPS DISPATCHED counts dispatched uops only.
Idq_Uops_Not_Delivered: Core - Number of uops not delivered to the RAT upon read from the IDQ. Specifically:
1. Count 0 when: a. the IDQ-RAT pipe is serving the other thread; b. the RAT is stalled for this thread (including uop dropping and clear-BE conditions); c. the IDQ delivers 4 uops.
2. Count 4 − x when the RAT is not stalled and the IDQ delivers x uops to the RAT (x ∈ {0, 1, 2, 3}).
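The counting rule for Idq_Uops_Not_Delivered can be transcribed directly as a per-cycle model. This is a literal transcription of the rule as stated, not something verified against hardware:

```python
def idq_uops_not_delivered(delivered, rat_stalled=False,
                           serving_other_thread=False):
    """Per-cycle increment of IDQ_UOPS_NOT_DELIVERED.CORE, following the
    counting rule stated above."""
    if serving_other_thread or rat_stalled or delivered == 4:
        return 0                 # case 1: count 0
    return 4 - delivered         # case 2: RAT not stalled, x in {0,1,2,3}

print(idq_uops_not_delivered(1))  # 3
```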
; 3 Fixed Counters
; 4 Programmable Counters
; enable perfMon global control before test case
mov ecx, 0x38f
mov eax, 0xf
mov edx, 0x7
wrmsr
;-----------------
; Test Case #1: indirect call
; generates 0x15 = 21
; Increments counter by 1 for every call, 1 for every return
; and 1 for the final jmp.
;-----------------
lea eax, fun_ica
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
jmp end_ica
fun_ica:
ret
end_ica:
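The expected count of 0x15 in the listing above can be verified with simple arithmetic: the test case issues ten calls, each matched by a return, plus the final jmp:

```python
# Each call, each return and the final jmp increments the counter by 1:
calls, returns, final_jmp = 10, 10, 1
expected = calls + returns + final_jmp
print(hex(expected))  # 0x15 (= 21), matching the comment in the listing
```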
; Disable counters after the test
mov ecx, 0x38f
mov eax, 0
mov edx, 0
wrmsr
; jump to test fail if not equal to
cmp eax, 0x ; counter value which is expected
jne fail
; Read counter 1
mov ecx, 0xc2
rdmsr
; jump to test fail if not equal to sum of all test cases
cmp eax, 0x ;
jne fail
; Read counter 2
mov ecx, 0xc3
rdmsr
; jump to test fail if not equal to sum of all test cases
cmp eax, 0x ;
jne fail
; Read counter 3
mov ecx, 0xc4
rdmsr
; jump to test fail if not equal to sum of all test cases
cmp eax, 0x ;
jne fail
pass: &SIGNAL_PASS
fail: &SIGNAL_FAIL
Since the event RESOURCE_STALLS.ANY counts the number of cycles in which uops could not be issued due to a lack of downstream resources (RS or ROB slots, load or store buffers, etc.), the difference is the number of cycles in which no uops were issued because none were available:

Instruction Starvation = UOPS_ISSUED.STALL_CYCLES − RESOURCE_STALLS.ANY

From the values above:

Instruction Starvation = 435 − 0 = 435

This observation is one example of how performance can be monitored; with different performance monitoring events, performance can be analyzed in finer detail. This cycle accounting analysis has shown how it can be done.
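The subtraction above is trivial, but spelling it out makes the interpretation explicit (the two counter values are the ones quoted in the text):

```python
uops_issued_stall_cycles = 435  # cycles in which no uops were issued
resource_stalls_any = 0         # cycles stalled on downstream resources

# Stall cycles not explained by a resource shortage = front-end starvation:
instruction_starvation = uops_issued_stall_cycles - resource_stalls_any
print(instruction_starvation)  # 435
```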
CONCLUSION
The general capabilities of the performance monitoring hardware described in this article have been used extensively to analyze application, operating system, and processor performance. These analyses have helped improve not only application and operating system code but also compilers and next-generation processor designs. However, as discussed previously, the performance-monitoring support offered by most processors is limited: too few counters, lack of support for distinguishing between speculative and non-speculative event counts, imprecise event-based sampling, and lack of support for creating data address profiles. Recent advanced processors provide performance-monitoring capabilities that overcome these limitations, while also fully supporting the simultaneous multithreading capabilities of the new processors. We saw how Cycle Accounting Analysis, used in all our analysis approaches, gives good insight into how a specific application performs by means of a decomposition of cycles and, most importantly, of stalled cycles.