
CHAPTER 1 INTRODUCTION

The good old days, when Moore's Law guaranteed a stable and transparent computing performance gain every year, are over. Processor clock speed can no longer be increased and, even if it could, it would not help much: memory remains far behind in speed, will not catch up with the processor in the near future, and would become the bottleneck. Programs are performance-hungry and, as they get larger and more complex, they require improved and faster hardware to run properly. The hardware improvements available today include multiple processors, multiple cores and NUMA architectures. Although all of these are very promising, they are definitely not transparent to programmers, for at least two reasons: first, programmers need to write multi-threaded code; second, since the limited hardware resources (caches, bus, main memory) are shared among cores and processors, programmers must constantly monitor how their programs use these resources in order to avoid bottlenecks and improve performance. In this work we focus instead on monitoring the performance of single-threaded programs, to find problems and inefficiencies in the code, to optimize it, and to get the most out of today's hardware. All the research presented here concerns only the most recent Intel Core processor families.

Most modern, high-performance processors have special on-chip hardware that monitors processor performance. Data collected by this hardware provides performance information on applications, the operating system, and the processor. These data can guide performance improvement efforts by helping programmers tune the algorithms used by the applications and operating system, and the code sequences that implement those algorithms. To simplify the development and maintenance of performance monitoring tools, an architecturally defined approach for software agents to interact with the PMU has been introduced. This approach is known as architectural Performance Monitoring (PerfMon), as opposed to the traditional model-specific Performance Monitoring. The purpose of architectural PerfMon is to provide a functionally and logically consistent set of capabilities, with a consistent hardware interface, that developers can rely on now and in the future. It gives software the ability to measure performance-related data and scenarios non-intrusively.

1.1 What is PerfMon?
PerfMon is the short form of performance monitoring: the ability to measure performance-related data and scenarios non-intrusively. Its main usage is to let software writers detect and, where possible, avoid bad performance scenarios. Performance-monitoring features provide data that describe how an application and the operating system are performing on the processor; this information can guide efforts to improve performance. Performance monitoring was introduced in the Pentium processor with a set of model-specific performance-monitoring counter MSRs. These counters permit selection of processor performance parameters to be monitored and measured, and the information obtained from them can be used for tuning system and compiler performance. In the Intel P6 family of processors, the performance monitoring mechanism was enhanced to permit a wider selection of events to be monitored and to allow greater control over the events being monitored. The Pentium 4 and Intel Xeon processors then introduced a new performance monitoring mechanism and a new set of performance events.

CHAPTER 2 VALIDATION ENVIRONMENT


The objective of pre-silicon validation is to verify the correctness and sufficiency of the design. This approach typically requires modeling the complete system, where the model of the design under test may be Register Transfer Level (RTL), and other components of the system may be behavioral or bus functional models. The functional models of the validation environment are connected to the design under test (RTL) for validation. The goal is to subject the Design Under Test (DUT) to real-world-like input stimuli. Pre-silicon validation aims to:
Validate design sufficiency.
Validate design correctness.
Verify implementation correctness.
Uncover unexpected system component interactions.

2.1 Requirements of Validation Environment
To achieve the above goals, the environment must handle the following situations with special emphasis.

2.1.1 Concurrency
Most complex chips have multiple ports or interfaces, and in a real system there is concurrent, asynchronous and independent activity at these ports. A system-level validation environment should be able to create and handle such real-world concurrency to qualify as a pre-silicon validation environment. Concurrency needs to be handled in both the test controller and the bus/interface models used. Some models return data when a transaction completes, so the test controller or environment can do data checking. Other models require the expected data to be provided up front so the model can do data checking when the transaction completes.

2.1.2 Results checking
While it may be relatively easy to generate stimulus to the different ports or interfaces of a chip, the difficult part is to implement an automated results or data checking strategy.

A system-level pre-silicon validation environment should relieve test writers from maintaining or keeping track of the data in test code, to make the task easily manageable for multi-ported systems. Keeping track of the data becomes arduous when different agents in a system interact on the same address segments.

2.1.3 Automated test generation
The test creation and/or generation methodology is critical in building a system-level pre-silicon validation environment capable of generating real-world-like stimuli. A dynamic test generator and checker are more effective in creating very interesting, reactive test sequences. An automated test generation tool should be capable of handling directed testing, pseudo-random testing and reactive testing. In directed testing, users specify the sequence of events to generate; this is efficient for verifying known cases and conditions. Pseudo-random testing is useful in uncovering unknown conditions or corner cases. Pseudo-random test generation, where transactions are generated from user-defined constraints, can be interspersed with blocks of directed sequences of transactions at periodic intervals to re-create real-life traffic scenarios in a pre-silicon validation environment. Dynamic test generation also facilitates reactive test generation, which implies a change in test generation when a monitored event is detected during simulation.

2.1.4 Robust, high-quality validation intellectual property (IP)
The quality of validation is greatly enhanced with robust, high-quality validation IP, which includes such items as bus functional models (BFMs) and protocol monitors. The project group that develops the RTL must not create the validation IP used to verify that RTL.

2.1.5 Reusing the validation environment
A good validation strategy must allow re-use of tests and the validation environment across successive revisions. Since test creation is one of the most time-consuming and labor-intensive parts of the validation process, designers should consider leveraging the validation environment and the test suites already developed on subsequent projects.

2.2 Major Components of Validation Environment
The major components of a validation environment are listed below.

2.2.1 Bus functional models (BFM)
BFMs drive the generated input stimulus to the DUT (Design Under Test). Intelligent BFMs provide a transaction-level API (Application Programming Interface) and are designed to handle concurrency and parallelism, which makes them suitable for use in an automated test generation environment. They also offer a high degree of controllability over model behavior, emulating a real device with real operating characteristics through programmable delay registers and configuration registers.

2.2.2 Bus protocol monitors and checkers
The bus protocol monitors provide dynamic protocol checking and can be used in automated test generation environments. They provide dynamic bus state information, which can be used to give dynamic feedback to user tests or automated test controllers. The bus protocol checkers check whether transactions are happening according to the protocol. A protocol checker checks the protocol, not the data.

2.2.3 Test stimulus generator
The intelligent test generator uses transaction generators to create constraint-based concurrent sequences of transactions at the different interfaces of the DUT. The controller can generate transactions pseudo-randomly, from a user-specified sequence, or as a mix of both. It can also perform specific tasks or dynamically reload input constraints when a certain event occurs during simulation.

2.2.4 Data checking
The data checker receives data from the DUT output interface. Based on the input stimuli provided, the data checker checks whether the output data is correct. A block diagram of the validation environment is shown in Figure 2.1.

Figure 2.1: Components of Validation Environment (Test Generator, BFM, Protocol Monitor & Protocol Checker, Data Checker, DUT)

2.3 Summary
Validation is a critical and time-consuming activity that determines the correctness of the design, so the validation environment must be reusable. Concurrency, automatic test stimulus generation and result checking are some of the requirements of a validation environment.

Chapter 3 The Micro Architectures


3.1 Intel Core Micro Architecture
This section gives a very brief introduction to the Intel Core micro architecture, focusing on what is relevant to performance monitoring and optimization.

3.1.1 Pipeline
The Intel Core micro architecture pipeline consists of:
An in-order front end, which fetches instruction streams from memory, with four instruction decoders to supply decoded instructions (uops) to the out-of-order execution core.
An out-of-order superscalar execution core, which can issue up to six uops per cycle and reorder uops to execute as soon as their sources are ready and execution resources are available.
An in-order retirement unit, which ensures that the results of executed uops are processed and the architectural state is updated according to the original program order.
In the following we present the main features of the front end and the execution core.

3.1.2 In-order front end
The front end supplies a stream of decoded instructions (uops) to a six-issue-wide out-of-order engine. It is made of three components: the Branch Prediction Unit (BPU), the Instruction Fetch Unit, and the Instruction Queue and Decode Unit.

Branch Prediction Unit
The Branch Prediction Unit helps the Instruction Fetch Unit fetch the most likely instruction to be executed by predicting the various branch types: conditional, indirect, direct, call, and return, using dedicated hardware for each type. It enables speculative execution and improves its efficiency by reducing the amount of code on the non-architected path (code paths that the processor thought it should execute but later found it should not) that is fetched into the pipeline.

Instruction Fetch Unit
The Instruction Fetch Unit prefetches instructions that are likely to be executed, caches frequently used instructions, and predecodes and buffers instructions, maintaining a constant bandwidth despite irregularities in the instruction stream.

Instruction Queue and Decode Unit
The Instruction Queue and Decode Unit decodes up to four instructions per cycle, or up to five with macro-fusion. The instruction queue is 18 instructions deep. It sits between the instruction predecode unit and the instruction decoders. It sends up to five instructions per cycle and supports one macro-fusion per cycle. It also serves as a loop cache for loops smaller than 18 instructions, enabling some loops to be executed with both higher bandwidth and lower power.

3.1.3 Out-of-order superscalar execution core
The execution core of the Intel Core micro architecture is superscalar and can process instructions out of order. When a dependency chain causes the machine to wait for a resource (such as a second-level data cache line), the execution core executes other instructions. This increases the overall rate of instructions executed per cycle (IPC). The execution core contains three major components: a Reservation Station, a Reorder Buffer and a Renamer.

Reservation Station (RS)
Queues uops until all source operands are ready, then schedules and dispatches ready uops to the available execution units. The RS has 32 entries. The initial stages of the out-of-order core move the uops from the front end to the ROB and RS. In this process, the out-of-order core carries out the following steps:
1. Allocates resources to uops.
2. Binds the uop to an appropriate issue port.
3. Renames sources and destinations of uops, enabling out-of-order execution.
4. Provides data to the uop when the data is either an immediate value or a register value that has already been calculated.

Renamer
Moves uops from the front end to the execution core. Architectural registers are renamed to a larger set of micro-architectural registers. Renaming eliminates the false dependencies known as write-after-write and write-after-read hazards.

Reorder Buffer (ROB)
Holds uops in various stages of completion, buffers completed uops, updates the architectural state in order, and manages the ordering of exceptions. The ROB has 96 entries to handle instructions in flight.

Figure 3.1: Architecture Block Diagram

3.2 Cache and Memory Subsystem
The micro architecture contains an instruction cache, a first-level data cache and a second-level unified cache in each core. Each physical processor contains several processor cores and a shared collection of subsystems referred to as the uncore.

The uncore includes a unified third-level cache shared by all cores in the physical processor, and the Intel QuickPath Interconnect links and associated logic. The L1 and L2 caches are write-back and non-inclusive. The shared L3 cache is write-back and inclusive, such that a cache line that exists in the L1 data cache, the L1 instruction cache or the unified L2 cache also exists in L3. This minimizes snoop traffic between processor cores. The micro architecture implements two levels of translation lookaside buffer (TLB). The first level consists of separate TLBs for data and code. DTLB0 handles address translation for data accesses; it provides 64 entries to support 4KB pages and 32 entries for large pages. The ITLB provides 64 entries (per thread) for 4KB pages and 7 entries (per thread) for large pages. The second-level TLB (STLB) handles both code and data accesses for 4KB pages; it supports 4KB page translations that missed DTLB0 or the ITLB. All entries are 4-way associative.

3.3 Hyper-Threading Technology
Hyper-Threading Technology (HT) provides two logical processors sharing most execution and cache resources in each core. The HT implementation in the new Intel micro architecture is much better than previous generations of HT implementations because the new micro architecture has a wider execution engine, more functional execution units, higher peak memory bandwidth and larger instruction buffers, and it replicates or partitions almost all the resources needed by the instructions of each hardware thread (replicated: register state, renamed return stack buffer, and large-page ITLB; partitioned: load buffers, store buffers, reorder buffers, and small-page ITLB), with the only exception being the execution units.

3.4 Core Out of Order Pipeline
The basic analysis methodology starts with an accounting of the cycle usage for execution. The out-of-order execution can be considered from the perspective of a simple block diagram, as shown in Figure 3.2. After instructions are decoded into the executable micro-operations (uops), they are assigned their required resources. They can only be issued to the downstream stages when there are sufficient free resources. This includes (among other requirements):
1) Space in the Reservation Station (RS), where the uops wait until their inputs are available
2) Space in the Reorder Buffer (ROB), where the uops wait until they can be retired
3) Sufficient load and store buffers in the case of memory-related uops (loads and stores)

Figure 3.2: Flow of Uop

Retirement and write-back of state to visible registers is only done for instructions and uops that are on the correct execution path. Instructions and uops of incorrectly predicted paths are flushed upon identification of the misprediction, and the correct paths are then processed. Retirement of correct-path instructions can proceed when two conditions are satisfied:
1) The uops associated with the instruction to be retired have completed, allowing the retirement of the entire instruction, or, in the case of instructions that generate a very large number of uops, enough have completed to fill the retirement window
2) Older instructions and their uops of correctly predicted paths have retired
The mechanics of following these requirements ensure that the visible state is always consistent with in-order execution of the instructions. The magic of this design is that if the oldest instruction is blocked, for example waiting for the arrival of data from memory, younger independent instructions and uops whose inputs are available can be dispatched to the execution units and warehoused in the ROB upon completion. They will then retire when all the older work has completed. The terms issued, dispatched, executed and retired have very precise meanings as to where in this sequence they occur, and they are used in the event names to help document what is being measured.


In the Intel Core processor, the reservation station has 36 entries which are shared between the hyper-threads when that mode (HT) is enabled in the BIOS, with some entries reserved for each thread to avoid one thread locking out the other, which would make restarting a blocked thread inefficient. If HT is not enabled, all 36 entries can be available to the single running thread. There are 128 positions in the reorder buffer, which again are divided between threads if HT is enabled, or entirely available to the single thread if HT is not enabled. As on Core processors, the RS dispatches the uops to one of 6 dispatch ports where they are consumed by the execution units. This implies that on any cycle between 0 and 6 uops can be dispatched for execution. The hardware branch prediction requests the bytes of instructions for the predicted code paths from the 32KB L1 instruction cache at a maximum bandwidth of 16 bytes/cycle. Instruction fetches are always 16-byte aligned, so if a hot code path starts on the 15th byte, the front end will only receive 1 byte on that cycle. This can aggravate instruction bandwidth issues. The instructions are referenced by virtual address and translated to physical address with the help of a 128-entry instruction translation lookaside buffer (ITLB). The x86 instructions are decoded into the processor's uops by the pipeline front end. Four instructions can be decoded and issued per cycle. If the branch prediction hardware mispredicts the execution path, the uops from the incorrect path which are in the instruction pipeline are simply removed where they are, without stalling execution. This reduces the cost of branch mispredictions. Thus the cost associated with such mispredictions is only the wasted work associated with any of the incorrect-path uops that actually got dispatched and executed, and any cycles that are idle while the correct-path instructions are located, decoded and inserted into the execution pipeline.


Chapter 4 Performance Monitoring Overview


Performance measurement of any high-performance cluster system is critical for the development and deployment of efficient applications on such a system. All modern processors have special hardware to monitor processor performance. This hardware-based performance measurement has many advantages over traditional intrusive methods based on adding code to probe the execution time of portions of a program. For example, data collected by this hardware provides performance information on applications, the operating system, and the processor. These data can guide performance improvement efforts by helping programmers tune the algorithms used, and the code sequences that implement those algorithms.

4.1 Basics of performance monitoring hardware
There are different approaches for collecting processor performance data.
1. Modifying the application to add instrumentation code for collecting data such as instruction traces and memory reference data. This requires either rebuilding the application from source code or modifying its executable; neither is usually favorable (especially for operating system code). These approaches can also disturb the application's behavior, raising questions about the validity of the collected data.
2. Using a simulator to model the processor as it executes the application. This simulation approach can yield detailed data on processor blocks such as pipeline stalls, branch prediction, cache performance, and so on. However, processor manufacturers do not usually provide simulators for advanced processor designs, and third parties do not know enough about the hardware details to build such a simulator.
3. Using performance-monitoring hardware, which has several distinct advantages over the previous approaches. Having the processor itself collect performance data as it executes an application has several benefits. First, the application and operating system remain largely unmodified. Second, the accuracy of the collected event counts is much higher than with loose simulators that cannot reproduce exact hardware behavior. Third, performance monitoring hardware collects data on the fly as the application executes, avoiding slow simulation-based approaches.


Fourth, this approach can collect data for both the application and the operating system. These advantages often make hardware performance monitoring the preferred, and sometimes the only, choice for collecting processor performance data. Performance-monitoring hardware typically has two components: performance event detectors and event counters. Users can configure performance event detectors to detect any one of several performance events (for example, cache misses or branch mispredictions). Often, event detectors have an event mask field that allows further qualification of the event, for example by the processor's privilege mode (user/supervisor), to separate events generated by the application from those generated by operating system code, or by a cache line's specific state (modified, shared, exclusive, or invalid) when filtering accesses to the L2 cache. Further configuration is usually possible by enabling event counters only under certain edge and threshold conditions. The edge detection feature is most often used for events that detect the presence or absence of certain conditions every cycle, such as a pipeline stall. The threshold feature lets the event counter compare the value it reports each cycle to a threshold value and increment the counter only when the threshold is exceeded. The threshold feature is only useful for performance events that report values greater than one in a cycle; for example, for an Instructions Completed event, the number of cycles in which three or more instructions completed (in one cycle) can be counted by using a threshold of two.

4.2 Performance event monitoring
Performance events can be grouped into five categories: program characterization, memory accesses, pipeline stalls, branch prediction, and resource utilization. Program characterization events help define the attributes of a program (and/or the operating system) that are largely independent of the processor's implementation. The most common examples of these events are the number and type of instructions (for example, loads, stores, floating point, branches, and so on) completed by the program. Memory access events often comprise the largest event category and aid performance analysis of the processor's memory hierarchy. For example, memory events can count references and misses to various caches and transactions on the processor memory bus. Pipeline stall event information helps users analyze how well the program's instructions flow through the pipeline. Processors with deep pipelines rely heavily on branch prediction hardware to keep the pipeline filled with useful instructions.


Branch prediction events let users analyze the performance of the branch prediction hardware (for example, by providing counts of mispredicted branches). Resource utilization events let users monitor how often a processor uses certain resources (for example, the number of cycles spent using a floating-point divider).

4.2.1 Performance-monitoring hardware
Performance-monitoring hardware typically has two components: performance event detectors and event counters. By properly configuring the event detectors and counters, users can obtain counts of a variety of performance events under various conditions. Users can configure performance event detectors to detect any one of several performance events (for example, cache misses or branch mispredictions). Often, event detectors have an event mask field that allows further qualification of the event. For example, the Intel Pentium III's event to count load accesses to the level-2 cache (L2_LD) has an event mask that lets event detectors monitor only accesses to cache lines in a specific state (modified, shared, exclusive, or invalid). The event detector configuration also allows qualification by the processor's current privilege mode. Operating systems use supervisor and user privilege modes to prevent applications from accessing and manipulating critical data structures and hardware that only the operating system should use directly. When the operating system is executing on the processor, the privilege mode is supervisor; when an application is executing, the privilege mode is user. As such, the ability to qualify event detection by the processor's privilege mode allows counting of events caused only by the operating system or only by an application. Configuring the event detector to detect events for both privilege modes counts all events. In addition to counting events detected by the performance event detectors, users can configure performance event counters to count only under certain edge and threshold conditions. The edge detection feature is most often used for performance events that detect the presence or absence of certain conditions every cycle. For these events, an event count of one represents a condition's presence and zero indicates its absence. For example, a pipeline stall event indicates the presence or absence of a pipeline stall on each cycle. Counting these events gives the number of cycles that the pipeline stalled. The edge detection feature, however, can instead count the number of stalls (more specifically, the number of times a stall began) rather than the total number of cycles stalled.

With edge detect enabled, the performance counter increments by one only when the previous number of performance events reported by the event detector is less than the current number being reported. So when the event detector reports zero events on a cycle followed by one event on the next cycle, the event counter has detected a rising edge and will increment by one. It is usually possible to invert the sense of the edge detection to count falling edges. For these events, disabling the edge detection feature counts stall durations, and enabling edge detection counts the number of stalls. Dividing the total stall duration by the number of stalls gives the average number of cycles stalled for a particular stall condition. The event counter's second major feature is threshold support. This capability lets the event counter compare the value it reports each cycle to a threshold value. If the reported value exceeds the threshold, the counter increments by one. The threshold feature is only useful for performance events that report values greater than one each cycle. For example, superscalar processors can complete more than one instruction per cycle. Selecting instructions completed as the performance event and setting the counter threshold to two would increment the counter by one whenever three or more instructions complete in one cycle. This provides a count of how many times three or more instructions completed in a cycle.
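To make this arithmetic concrete, the following small C sketch (a minimal example with made-up counts, not data measured from real hardware) shows how the two configurations of the same stall event combine into an average stall duration:

    #include <stdio.h>

    int main(void) {
        /* Illustrative counts only: the same stall event programmed twice,
         * once with edge detect disabled (counts stalled cycles) and
         * once with edge detect enabled (counts stall occurrences). */
        unsigned long long stalled_cycles = 12000;  /* edge detect off */
        unsigned long long stall_events   = 300;    /* edge detect on  */

        double avg_stall_length = (double)stalled_cycles / (double)stall_events;
        printf("average stall duration = %.1f cycles\n", avg_stall_length);
        return 0;
    }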

Figure 4.1: The general structure of the Pentium 4 event counters and detectors

4.2.2 Performance profiles
Although performance event detectors and counters can easily detect the presence of a performance problem and let the user estimate its severity, it is often necessary to find the locations in the code (whether in the application or the operating system) that are causing the performance problem. Knowing the source of the performance problem lets programmers alter the high-level algorithms used by the application and/or the low-level code to avoid or reduce the problem's impact. To illustrate how performance counters can help create a profile that identifies the major sources of performance problems, let us first review the goals and techniques used to create time-based profiles.

4.2.2.1 Time-based profiles
A common technique to identify areas on which to focus tuning efforts is to obtain a time-based profile of the application. A time-based profile estimates the percentage of time an application spends in its major sections. Focusing tuning efforts on the application's most frequently executed sections maximizes the benefits of performance tuning changes made to the code. A time-based profile relies on interrupting an application's execution at regular time intervals. During each interrupt, the interrupt service routine saves the value of the program counter. Once the application completes, the user can create a histogram that shows the number of samples collected for each program counter value. Assuming the histogram draws from many program counter samples, it will show the application's most frequently executed sections.

4.2.2.2 Event-based profiles
A technique similar to that for creating a time-based profile can be used to collect an event-based profile. An event-based profile is a histogram that plots performance event counts as a function of code location. Instead of interrupting the application at regular time intervals (as is done to create a time-based profile), the performance-monitoring hardware interrupts the application after a specific number of performance events have occurred. So, just as a time-based profile indicates the most frequently executed instructions, an event-based profile indicates the most frequently executed instructions that cause a given performance event. To support event-based sampling (EBS), performance-monitoring hardware typically generates a performance monitor interrupt when a performance event counter overflows.

To generate an interrupt after N performance events, the performance counter is initialized to a value of overflow minus N before being enabled. A performance monitor interrupt service routine (ISR) handles these interrupts. The ISR saves sample data from the program (for example, the program counter) and re-enables the performance event counter to cause another interrupt after N occurrences of the performance event. After the application finishes executing, the user can plot the data samples saved by the ISR to create an event-based profile.
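As a rough illustration of the sampling flow just described, the C sketch below shows the counter pre-load and a skeleton interrupt handler. The wrmsr64 helper, the record_sample routine and the way the PMI is hooked up are assumptions made for the example rather than any particular operating system interface; MSR 0xC1 is general-purpose counter 0, the same counter read in the sample test program later in this report.

    #include <stdint.h>

    #define IA32_PMC0     0xC1     /* general-purpose counter 0         */
    #define SAMPLE_AFTER  100000   /* N: take one sample every N events */

    /* Hypothetical ring-0 helper wrapping the x86 WRMSR instruction. */
    static inline void wrmsr64(uint32_t msr, uint64_t value) {
        __asm__ volatile("wrmsr" :: "c"(msr),
                         "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
    }

    /* Pre-load the counter so that it overflows after N more events.
     * Bits 32-47 are sign-extended from bit 31, so writing the 32-bit
     * two's complement of N loads the 48-bit counter with (2^48 - N). */
    static void arm_counter(void) {
        uint32_t preload = (uint32_t)0 - SAMPLE_AFTER;
        wrmsr64(IA32_PMC0, preload);
    }

    extern void record_sample(uint64_t pc);   /* assumed profiling buffer */

    /* Skeleton performance monitor ISR: save the interrupted program
     * counter as one histogram sample, then re-arm for the next N events. */
    void perfmon_isr(uint64_t interrupted_pc) {
        record_sample(interrupted_pc);
        arm_counter();
    }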


Chapter 5 Global Control and Status Registers


There is a set of global control and status registers which control the fixed and programmable counters and provide status indications of the PMU in general. The following sections describe these global registers.

5.1 IA32_PERF_GLOBAL_CTRL
This register globally controls the fixed and programmable counters. If a control bit in this register is clear, all other control register programming for the corresponding counter will be ignored and the counter will not count. Counters that are disabled by this register cannot count, overflow, or subsequently generate overflow interrupts. It is possible that a disabled counter may still generate a PEBS assist. This can occur as follows: while the counter is enabled it can overflow (to zero) and arm the PEBS hardware; the next event (the counter transitions from zero to one) will then cause the PEBS assist to occur, even if an intervening write to this register disables the counter. Note that the state of the IA32_PERF_GLOBAL_CTRL register is preserved across entry and exit to probe mode (halting with an ITP). Writes to this register during probe mode will be lost upon exit from probe mode.
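As a minimal sketch of driving this register from software (assuming ring-0 execution and a simple helper around the x86 WRMSR instruction), the write below enables the four programmable counters and the three fixed counters, matching the EAX = 0xF / EDX = 0x7 values used in the sample test program at the end of this report:

    #include <stdint.h>

    #define IA32_PERF_GLOBAL_CTRL 0x38F

    /* Hypothetical ring-0 helper wrapping the x86 WRMSR instruction. */
    static inline void wrmsr64(uint32_t msr, uint64_t value) {
        __asm__ volatile("wrmsr" :: "c"(msr),
                         "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
    }

    /* Bits 0-3 enable programmable counters PMC0-PMC3 (EAX = 0xF);
     * bits 32-34 enable fixed counters 0-2 (EDX = 0x7). */
    void enable_all_counters(void) {
        wrmsr64(IA32_PERF_GLOBAL_CTRL, ((uint64_t)0x7 << 32) | 0xF);
    }

    /* Clearing the register disables every counter globally. */
    void disable_all_counters(void) {
        wrmsr64(IA32_PERF_GLOBAL_CTRL, 0);
    }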

Figure 5.1: IA32_PERF_GLOBAL_CTRL MSR



Table 5.1: IA32_PERF_GLOBAL_CTRL Programming

5.2 IA32_PERF_GLOBAL_STATUS
This register indicates the overflow status of each of the fixed and programmable counters. The upper bits provide additional status information on the PerfMon facilities. A set bit indicates that an overflow has occurred in the corresponding counter. Overflow status bits in this register are cleared by writing to the IA32_PERF_GLOBAL_OVF_CTRL register. Status bit indications in this register have no effect on interrupts or pending interrupts.

Figure 5.2: IA32_PERF_GLOBAL_STATUS MSR



Table 5.2: IA32_PERF_GLOBAL_STATUS Programming



5.3 IA32_PERF_GLOBAL_OVF_CTRL
The IA32_PERF_GLOBAL_OVF_CTRL register provides software the ability to clear status bits set in the IA32_PERF_GLOBAL_STATUS register, described in the preceding section. This is a write-only register. To clear overflow or condition-change status in the global status register, software must write the corresponding bits in this register as binary one.
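A minimal sketch of the usual read-then-clear sequence is shown below, again assuming ring-0 execution and simple helpers around RDMSR/WRMSR; 0x38E and 0x390 are the architectural addresses of the global status and overflow-control registers.

    #include <stdint.h>

    #define IA32_PERF_GLOBAL_STATUS   0x38E
    #define IA32_PERF_GLOBAL_OVF_CTRL 0x390

    /* Hypothetical ring-0 MSR helpers. */
    static inline uint64_t rdmsr64(uint32_t msr) {
        uint32_t lo, hi;
        __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return ((uint64_t)hi << 32) | lo;
    }
    static inline void wrmsr64(uint32_t msr, uint64_t value) {
        __asm__ volatile("wrmsr" :: "c"(msr),
                         "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
    }

    /* Read which counters overflowed, then clear exactly those status bits
     * by writing the same bit positions to the write-only OVF_CTRL register. */
    uint64_t read_and_clear_overflow_status(void) {
        uint64_t status = rdmsr64(IA32_PERF_GLOBAL_STATUS);
        if (status)
            wrmsr64(IA32_PERF_GLOBAL_OVF_CTRL, status);
        return status;
    }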

Figure 5.3: IA32_PERF_GLOBAL_OVF_CTRL MSR


Table 5.3: IA32_PERF_GLOBAL_OVF_CTRL Programming

5.4 Programmable Counter Control Registers


This section describes the control registers for the four programmable counters.

PerfEvtSelX
The PerfEvtSelX registers control the four programmable counters. Using these control registers, software can select the event to be counted and the conditions under which it is counted. Each counter must be locally enabled by this register, as well as globally enabled, in order to operate correctly. The layout of this register is similar to previous Intel Core architecture implementations; however, the pin control (PC) bit is now reserved. This implementation adds an additional event modifier bit, AnyThr, which controls whether events are counted for the counter's logical processor only, or for all logical processors in the core that contains the counter.
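The sketch below shows one plausible way to compose a PerfEvtSelX value in C, using the architectural bit layout (event select in bits 7:0, unit mask in bits 15:8, USR/OS in bits 16 and 17, AnyThr in bit 21, local enable in bit 22). The wrmsr64 helper and the choice of counter 0 are assumptions for illustration only, and the counter must still be enabled globally in IA32_PERF_GLOBAL_CTRL as described in Section 5.1.

    #include <stdint.h>

    #define IA32_PERFEVTSEL0 0x186   /* event select for programmable counter 0 */

    #define EVSEL_USR    (1u << 16)  /* count in user mode                */
    #define EVSEL_OS     (1u << 17)  /* count in supervisor mode          */
    #define EVSEL_INT    (1u << 20)  /* generate an interrupt on overflow */
    #define EVSEL_ANYTHR (1u << 21)  /* count for all logical processors  */
    #define EVSEL_EN     (1u << 22)  /* local counter enable              */

    /* Hypothetical ring-0 helper wrapping the x86 WRMSR instruction. */
    static inline void wrmsr64(uint32_t msr, uint64_t value) {
        __asm__ volatile("wrmsr" :: "c"(msr),
                         "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
    }

    /* Program counter 0 to count the given event in both privilege modes.
     * 'event' and 'umask' come from the event tables for the target core. */
    void program_counter0(uint8_t event, uint8_t umask, int any_thread) {
        uint32_t sel = (uint32_t)event | ((uint32_t)umask << 8)
                     | EVSEL_USR | EVSEL_OS | EVSEL_EN;
        if (any_thread)
            sel |= EVSEL_ANYTHR;   /* AnyThr: whole core, not just this thread */
        wrmsr64(IA32_PERFEVTSEL0, sel);
    }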


Table 5.4: PerfEvtSelX Programming



Figure 5.4: PerfEvtSelX MSR

5.5 PERF_FIXED_CTRX and IA32_PMCX Registers
Each counter register is 48 bits long. Counter registers can be cleared or pre-loaded with count values as desired. The latter method is often used to set the point at which the counter will overflow, which is useful in event-based sampling. When writing the programmable counters using the wrmsr instruction, bits 32 through 47 cannot be written directly; they are sign-extended based on the value written to bit 31 of the counter register. When the PEBS facility re-loads the programmable counters, the entire 48-bit value is loaded from the DS Buffer Management area without any sign extension. Previous implementations of the Intel Core architecture contained counters which were limited to 40 bits in length; this implementation provides counters 48 bits in length. Counter width can be enumerated using the CPUID instruction.
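The CPUID enumeration mentioned above can be sketched as follows; leaf 0AH reports the number and bit width of the general-purpose and fixed-function counters, so software does not need to hard-code the 48-bit width. The example uses GCC's <cpuid.h> helper.

    #include <stdio.h>
    #include <cpuid.h>   /* GCC/Clang wrapper for the CPUID instruction */

    int main(void) {
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0AH: architectural performance monitoring enumeration. */
        __cpuid_count(0x0A, 0, eax, ebx, ecx, edx);

        unsigned version     = eax & 0xFF;          /* architectural PerfMon version */
        unsigned num_gp      = (eax >> 8) & 0xFF;   /* general counters per thread   */
        unsigned gp_width    = (eax >> 16) & 0xFF;  /* general counter width, bits   */
        unsigned num_fixed   = edx & 0x1F;          /* fixed-function counters       */
        unsigned fixed_width = (edx >> 5) & 0xFF;   /* fixed counter width, bits     */

        printf("PerfMon v%u: %u x %u-bit general counters, %u x %u-bit fixed counters\n",
               version, num_gp, gp_width, num_fixed, fixed_width);
        return 0;
    }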

Figure 5.5: Counter



5.6 PERFORMANCE MONITORING ARCHITECTURE COMPARISON

Table 5.5: Perfmon Architecture Comparison



The table above summarizes the differences between the PMU features of the new Intel architecture (the Nehalem core, NHM) and those of previous products in the Intel Core and Pentium 4 processor families. Nehalem adheres to Architectural Performance Monitoring version 3. The table includes both architectural and non-architectural features; architectural features are listed at the top of the table with intercepts highlighted. Intel processor cores have for many years included a Performance Monitoring Unit (PMU). This unit provides the ability to count the occurrence of micro-architectural events which expose some of the inner workings of the processor core as it executes code. One usage of this capability is to create a list of events from which certain performance metrics can be calculated. Software configures the PMU to count events over an interval of time and report the resulting event counts. Using this methodology, performance analysts can characterize overall system performance. The Nehalem core supports event counting with seven event counters. Three of these counters are fixed-function counters; the events counted by each of these counters are fixed in hardware. Software can determine whether counting is enabled during user or supervisor code execution, or both. The four remaining counters are programmable and can be configured to count a variety of events, with some restrictions on individual counters. Fixed counters are controlled by bit fields in a global control register. Programmable counters are controlled by separate control registers, one per counter. PMU resources are available, and must be programmed, for each hardware thread (logical processor) if threading is enabled; otherwise they are programmed per core. PMU resources available in each thread do not accumulate to the core when hardware threading is disabled, so the PMU programming model remains consistent in either case. To successfully program all PMU resources, software must affinitize itself to each processor the operating system exposes. Counter registers are 48 bits wide. Writing a binary one to any reserved bit in any counter or counter control register is undefined and may cause a general protection fault.
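As a sketch of the fixed-counter control mentioned above (assuming ring-0 execution and the same kind of WRMSR wrapper used in the earlier examples), each fixed counter has a 4-bit field in the IA32_FIXED_CTR_CTRL register that selects supervisor counting, user counting, or both; the register address and field layout used here are the standard architectural ones rather than values taken from the table above.

    #include <stdint.h>

    #define IA32_FIXED_CTR_CTRL 0x38D

    /* Enable bits within each counter's 4-bit control field. */
    #define FIXED_EN_OS   0x1   /* count while in supervisor (ring 0) mode */
    #define FIXED_EN_USR  0x2   /* count while in user mode                */
    #define FIXED_EN_BOTH (FIXED_EN_OS | FIXED_EN_USR)

    /* Hypothetical ring-0 helper wrapping the x86 WRMSR instruction. */
    static inline void wrmsr64(uint32_t msr, uint64_t value) {
        __asm__ volatile("wrmsr" :: "c"(msr),
                         "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
    }

    /* Enable fixed counters 0-2 (instructions retired, core cycles,
     * reference cycles) for both user and supervisor code. */
    void enable_fixed_counters(void) {
        uint64_t ctrl = 0;
        for (int i = 0; i < 3; i++)
            ctrl |= (uint64_t)FIXED_EN_BOTH << (4 * i);
        wrmsr64(IA32_FIXED_CTR_CTRL, ctrl);
    }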


CORE PMU MSR LIST

Table 5.6: Core PMU MSR list


Chapter 6 Cycle Accounting Analysis and uops flow


The Cycle Accounting Analysis is a methodology for analyzing the performance of an application and finding its weak points. It is specific to the Intel Core micro architecture and was developed by David Levinthal of Intel Corporation. Figure 6.1 illustrates the cycle decomposition schema of an application's execution. According to it, the total execution time, i.e. the total execution cycles of an application, can be divided into cycles in which the front end is issuing uops and cycles in which it is not. Improving performance starts with identifying where in the application the cycles are spent and how they can be reduced. As Amdahl's law points out, an application can only be sped up by the fraction of cycles that are being used by the section of code being optimized. To accomplish such a cycle count reduction it is critical to know how the cycles are being used, both to identify those places where there is nothing to be gained and, more importantly, where any effort is most likely to be fruitful. Such cycle usage decomposition is usually described as cycle accounting. The first step of the decomposition is usually to divide the cycles into two groups, productive and unproductive (stalled). This is particularly important, as stalled cycles are usually the easiest to recover. The cycles in which the front end is issuing uops can be further divided into cycles which are retiring uops and cycles which are not. One example of uops issued but not retired occurs when branch mispredictions happen: uops which were issued and executed do not get retired because they belong to a speculative execution which eventually proved incorrect. So basically we have three possible types of cycles: cycles retiring uops, i.e. doing useful work; cycles issuing uops but not retiring them, i.e. doing useless work; and cycles which are not issuing uops at all, i.e. stalled cycles doing no work. In the Core micro architecture the stalled cycles can be further decomposed into five major components to better understand what is responsible for the lost cycles: level-2 cache misses, level-1 cache misses (with a level-2 hit), instruction and data translation look-aside buffer misses (inside the level-1 cache), store-forwarding related stalls, and finally stalls related to the use of length-changing prefix instructions, which involve the use of the slow decoder. For the new micro architecture there are more stall-related events that account for the lost cycles.

Nevertheless these can all be binned into the following categories: data-load-related stalls, floating point exceptions, cycles stalled due to long-latency divisions and/or square root operations executing, instruction-fetch-related stalls, and stalls due to jumps and branches. The aim of software optimization is therefore:
1. To bring the stalled cycles close to 0%, for example by improving code and data locality.
2. To do the same for cycles that are not retiring uops, by minimizing branches or using more predictable branching.
3. To reduce the number of cycles which are retiring uops, by using vector instructions where possible and, of course, by using faster and more efficient algorithms.
Doing so will result in fewer total cycles and therefore a faster application.

Figure 6.1: Cycle Accounting Analysis - Cycle Decomposition.

To give a more detailed overview of Cycle Accounting Analysis, let us now see how it works in detail for the Intel Core micro architecture. The first and most important event count for performance evaluation is the total number of clock cycles needed by an application to complete its execution successfully. This metric can be measured by the event CPU_CLK_UNHALTED.CORE (aka UNHALTED_CORE_CYCLES).


It is the most important metric because it is the only one that has to be considered at the end of any optimization process to see whether we did a good job or not. All other events must be taken into account only with the aim of eventually reducing the UNHALTED_CORE_CYCLES event count. As all the cycles used by an application can be (roughly) divided into cycles not issuing uops and cycles issuing uops, we need a way to calculate them using performance counters. It turns out that the only event we need to monitor in this case is RS_UOPS_DISPATCHED, the number of uops dispatched by the Reservation Station (RS) to the various execution ports. One useful feature of the Intel Core micro architecture's PMU is the counter mask (aka CMASK): when it is set to something larger than zero, say n, it tells the counter to count the number of cycles (and NOT the number of events) during which the monitored event has occurred at least n times. Therefore if we want to know, for instance, for how many cycles the Reservation Station dispatched at least 2 uops (in one single cycle), we would monitor RS_UOPS_DISPATCHED with the CMASK set to 2. In our case we just need to know for how many cycles the Reservation Station dispatched any number (bigger than 0) of uops, so we set our CMASK to 1. So we have:
Cycles issuing uops = RS_UOPS_DISPATCHED (CMASK = 1)
Another interesting feature of Intel's PMU is the INV bit, which defaults to 0 but can also be set to 1. When the INV bit is set to 1 (and the CMASK is set to some n bigger than 0), cycles are counted only when the monitored event occurs fewer than n times. So, in our case, since we need to know for how many cycles the RS did NOT issue ANY uops, we also set the CMASK to 1 (because we are still counting cycles, not uops) and the INV to 1; therefore:
Cycles not issuing uops = RS_UOPS_DISPATCHED (CMASK = 1 && INV = 1)

So the total number of cycles can be expressed as:
Total cycles ≈ Cycles issuing uops + Cycles not issuing uops
There is no equals sign there because a few situations are not properly considered in this analysis.

These include whether the RS is full or empty, and transient situations of the RS being empty while some in-flight uops are getting retired. Nevertheless the following equation should hold within a (small) error:

UNHALTED_CORE_CYCLES = RS_UOPS_DISPATCHED (CMASK = 1) + RS_UOPS_DISPATCHED (CMASK = 1 && INV = 1)
The uops that are issued for execution are not necessarily retired. This happens when the uops are part of a speculative execution that turns out to be wrong: mispredicted branching is a good example. Those uops that do not reach retirement do not help the forward progress of program execution. Therefore the number of Cycles issuing uops can be further decomposed into Cycles non-retiring uops and Cycles retiring uops. Unfortunately there is no event capable of measuring the number of Cycles non-retiring uops directly. We will derive this metric from available performance events and several assumptions. We define the uops rate as:
uops rate = Dispatched uops / Cycles issuing uops
where the quantity Dispatched uops can be measured with the event RS_UOPS_DISPATCHED (without CMASK and INV). Thus:
uops rate = RS_UOPS_DISPATCHED / RS_UOPS_DISPATCHED (CMASK = 1)
Next we define the total number of uops retired as:
Retired uops = UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED
Next we approximate the number of non-retiring uops by:
Non-retired uops = Dispatched uops − Retired uops


Thus, finally:
Cycles non-retiring uops = Non-retired uops / uops rate
The number of cycles retiring uops is easier and can be calculated as:
Cycles retiring uops = Retired uops / uops rate

We also define the number of cycles stalled as:
Cycles stalled = Cycles not issuing uops

Therefore:
Cycles stalled = RS_UOPS_DISPATCHED (CMASK = 1 && INV = 1)
This methodology does not take into account situations where retiring uops and non-retiring uops may be dispatched in the same cycle into the Out-Of-Order (OOO) engine. Nevertheless this scenario does not occur very often, and the method finds results that are a very good approximation of what happens in reality. So finally the three calculated components should sum to the total number of cycles, i.e.:
Total cycles = Cycles non-retiring uops + Cycles retiring uops + Cycles stalled
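To make the decomposition concrete, the C sketch below simply restates the formulas above as code; the event counts passed in main are placeholders, not measurements.

    #include <stdio.h>

    struct cycle_accounting {
        double cycles_retiring;
        double cycles_non_retiring;
        double cycles_stalled;
    };

    /* Apply the cycle accounting formulas to raw event counts. */
    static struct cycle_accounting decompose(
            double rs_dispatched,            /* RS_UOPS_DISPATCHED                 */
            double rs_dispatched_cmask1,     /* RS_UOPS_DISPATCHED, CMASK=1        */
            double rs_dispatched_cmask1_inv, /* RS_UOPS_DISPATCHED, CMASK=1, INV=1 */
            double uops_retired_any,
            double uops_retired_fused) {
        struct cycle_accounting out;
        double uops_rate    = rs_dispatched / rs_dispatched_cmask1;
        double retired_uops = uops_retired_any + uops_retired_fused;
        double non_retired  = rs_dispatched - retired_uops;

        out.cycles_retiring     = retired_uops / uops_rate;
        out.cycles_non_retiring = non_retired / uops_rate;
        out.cycles_stalled      = rs_dispatched_cmask1_inv;
        return out;
    }

    int main(void) {
        /* Placeholder counts, for illustration only. */
        struct cycle_accounting c = decompose(2.0e9, 9.0e8, 2.0e8, 1.6e9, 1.0e8);
        printf("retiring = %.3g, non-retiring = %.3g, stalled = %.3g cycles\n",
               c.cycles_retiring, c.cycles_non_retiring, c.cycles_stalled);
        return 0;
    }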

So, for optimization purposes we have to keep in mind that:
If the contribution from Cycles non-retiring uops is high, focusing on code layout and reducing branch mispredictions will be important.
If the contribution from Cycles stalled is high, additional drill-down may be necessary to locate bottlenecks that lie deeper in the micro architecture pipeline.
If the contributions from Cycles non-retiring uops and Cycles stalled are both insignificant, the focus of performance tuning should be directed to code vectorization or other techniques to improve the retirement throughput of hot spots.

We should now understand which part of the architecture is stressed by our program's execution and is therefore causing the stalled cycles that we just calculated. One thing to note at this point is that events that cause stalls can be counted using the PMU, but the count obtained is not the number of cycles lost (stalled) because of the event. We will therefore use the concept of impact when talking about the number of cycles stalled due to a particular kind of event. These impacts are easily obtained by multiplying the cycle penalty of a certain kind of event (the number of stalled cycles caused by one occurrence of that event) by the number of events (of the same kind) counted.

Figure 6.2: Performance events and where they monitor the uop flow

The following items discuss several common stress points of the micro architecture.

Level-2 Cache Miss Impact
The Intel Core micro architecture has a two-level caching system, meaning that a miss at the second level involves an access to system memory. The latency of system memory varies with different chipsets, but it is generally on the order of more than one hundred cycles; server chipsets tend to exhibit longer latency than desktop chipsets. The number of L2 cache miss references can be measured by MEM_LOAD_RETIRED:L2_LINE_MISS.


An estimate of the overall L2 miss impact, calculated by multiplying the system memory latency by the number of L2 misses, is only an approximation because it ignores the OOO engine's ability to handle multiple outstanding load misses:

L2 miss impact = MEM_LOAD_RETIRED:L2_LINE_MISS * system memory latency

Level-2 Cache Hit Impact
When a level-1 cache miss occurs, it does not necessarily mean that the processor will find the data in the second-level cache; the required data may be missing from the second-level cache as well. The number of L2 hits can therefore be measured as the difference between the number of level-1 data cache misses and the number of level-2 cache misses, i.e.:
Level 2 Cache Hits = MEM_LOAD_RETIRED:L1D_LINE_MISS − MEM_LOAD_RETIRED:L2_LINE_MISS

As in the previous case, to obtain the impact we multiply this quantity by the level-2 cache access latency:

L2 hit impact = Level 2 Cache Hits * Level 2 Cache latency
This formula, just like the one above, does not take into account the OOO engine's ability to handle multiple outstanding load misses.

L1 DTLB Miss Impact
Another cause of CPU stalls is Data Translation Look-aside Buffer (DTLB) misses that occur in the level-1 cache. The number of misses is measured using MEM_LOAD_RETIRED:DTLB_MISS. Therefore:

DTLB miss impact = MEM_LOAD_RETIRED:DTLB_MISS * DTLB miss cycle penalty

LCP Impact
LCP stands for Length-Changing Prefix. When instructions of this type are fetched they require the use of the slow instruction decoder. The event ILD_STALL measures the number of times the slow decoder was triggered, so:
LCP impact = ILD_STALL * LCP cycle penalty

Store Forwarding Stall Impact
When a store-forwarding situation does not meet the address or size requirements imposed by hardware, a stall occurs. The delay varies for different store-forwarding stall situations. Consequently, there are several performance events that provide fine-grained specificity to detect the different store-forwarding stall conditions. Three components will be analyzed.
A load blocked by a preceding store to an unknown address can be measured by the event LOAD_BLOCK:STA. So:
Load block sta impact = LOAD_BLOCK:STA * Load block sta cycle penalty
The event LOAD_BLOCK:OVERLAP_STORE counts the number of load operations blocked because of an actual data overlap with a preceding store, or because of an ambiguous overlap from page aliasing in which the load and a preceding store have the same offset but into different pages. We have:
Load block overlap store impact = LOAD_BLOCK:OVERLAP_STORE * Load block overlap store cycle penalty
A load spanning a cache line boundary can be measured by the event LOAD_BLOCK:UNTIL_RETIRE. So:

Load block until retire impact = LOAD_BLOCK:UNTIL_RETIRE * Load block until retire cycle penalty

These three contributions sum to the total number of cycles lost due to problems with the store-forwarding mechanism:

Store forwarding stall impact = Load block sta impact + Load block overlap store impact + Load block until retire impact

In principle the sum of these five stall contributions should give a result very close to the total number of stalled cycles calculated before:

Cycles stalled = L2 miss impact + L2 hit impact + DTLB miss impact + LCP impact + Store forwarding stall impact
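The C sketch below adds up the five impact terms exactly as in the formula above. The 201-cycle L2 miss penalty is the average quoted later in this chapter; all other penalties and all of the event counts are illustrative assumptions only.

    #include <stdio.h>

    /* Assumed cycle penalties: 201 is the average L2 miss penalty used in
     * this chapter; the remaining penalties are illustrative placeholders. */
    #define L2_MISS_PENALTY      201.0
    #define L2_HIT_PENALTY        15.0
    #define DTLB_MISS_PENALTY     30.0
    #define LCP_PENALTY            6.0
    #define STA_PENALTY            5.0
    #define OVERLAP_PENALTY        6.0
    #define UNTIL_RETIRE_PENALTY  20.0

    int main(void) {
        /* Placeholder event counts. */
        double l2_line_miss  = 1.0e6;   /* MEM_LOAD_RETIRED:L2_LINE_MISS  */
        double l1d_line_miss = 5.0e6;   /* MEM_LOAD_RETIRED:L1D_LINE_MISS */
        double dtlb_miss     = 2.0e5;   /* MEM_LOAD_RETIRED:DTLB_MISS     */
        double ild_stall     = 1.0e4;   /* ILD_STALL                      */
        double blk_sta       = 3.0e4;   /* LOAD_BLOCK:STA                 */
        double blk_overlap   = 1.0e4;   /* LOAD_BLOCK:OVERLAP_STORE       */
        double blk_until_ret = 5.0e3;   /* LOAD_BLOCK:UNTIL_RETIRE        */

        double l2_miss  = l2_line_miss * L2_MISS_PENALTY;
        double l2_hit   = (l1d_line_miss - l2_line_miss) * L2_HIT_PENALTY;
        double dtlb     = dtlb_miss * DTLB_MISS_PENALTY;
        double lcp      = ild_stall * LCP_PENALTY;
        double store_fw = blk_sta * STA_PENALTY
                        + blk_overlap * OVERLAP_PENALTY
                        + blk_until_ret * UNTIL_RETIRE_PENALTY;

        printf("estimated stalled cycles = %.3g\n",
               l2_miss + l2_hit + dtlb + lcp + store_fw);
        return 0;
    }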

This approach has a few problems, however. First of all, it implies a simplification, since other kinds of stalls may occur besides the five categories we saw. Secondly, the impact in terms of cycles lost for each stall event may have an error depending on the particular state the machine is in at the moment the event occurs; for instance, sometimes an L2 miss may cause a 160-cycle delay and other times a 250-cycle delay (we have used an average value of 201). Third, sometimes the sum of all the impacts of the different stalls exceeds the total number of cycles not issuing uops, meaning that their impact was overestimated or that some of them overlap. At other times the sum is a little smaller than the total number of cycles not issuing uops, meaning that their impact was underestimated or that another kind of stall occurred and was not taken into account. Moreover, there are several components which cannot be counted reliably on the Intel Core micro architecture. These fall into three main classes: stalls due to instruction starvation, stalls due to dependency chains of multi-cycle instructions (other than divide), and stalls related to front-side bus saturation. Finally, almost all event counts are approximations of real events, although they are very good approximations since the errors are typically below 3%. Nevertheless, although quantities may be over- or underestimated, they give a good insight into the main problems to work on within a particular application.


Events Monitored
Instruction_Retired:All - Number of architectural instructions retired. A macro-fused uop is counted as 2 instructions. A REP-prefixed instruction is counted as a single instruction (not per iteration).
Instruction_Decoded - Instruction decoders used this cycle.
Uops_Retired:All - All uops that actually retired (macro-fused = 1, micro-fused = 2, others = 1).
Uops_Issued:Any - Number of uops issued. Counts the number of uops issued by the Register Allocation Table to the Reservation Station.
Uops_Issued:Fused - Micro-fused uops that are issued; a subset of Uops_Issued:Any.
Uops_Executed:Thread - Number of uops to be executed per thread each cycle.
Uops_Dispatched:Core - Number of uops dispatched to the execution units.
Branch_Instruction_Executed:All - All (macro) branch instructions executed.
Branch_Instruction_Retired:All - All (macro) branch instructions retired.
Branch_Instruction_Retired:Conditional - Conditional branch instructions retired.
Branch_Instruction_Retired:Not taken - Not-taken branch instructions retired.
Branch_Instruction_Retired:Taken - Taken branch instructions retired.
Branch_Instruction_Retired:Return - Return instructions retired.
Branch_Instruction_Retired:Call - Call instructions retired.
Branch_misprediction_retired:All - All mispredicted (macro) branch instructions retired.
Branch_misprediction_retired:Conditional - Mispredicted conditional branch instructions retired.
Branch_misprediction_retired:Not taken - Mispredicted not-taken branch instructions retired (i.e. were mispredicted and not taken).

Branch_misprediction_retired:Taken - Mispredicted taken branch instructions retired (i.e. were mispredicted and taken).
Unhalted core cycles - UNHALTED_CORE_CYCLES = RS_UOPS_DISPATCHED (CMASK = 1) + RS_UOPS_DISPATCHED (CMASK = 1 && INV = 1). RS_UOPS_DISPATCHED counts dispatched uops only.
Idq_uops_not_delivered:Core - Number of uops not delivered to the RAT upon read from the IDQ. Specifically:
1. Count 0 when:
a. the IDQ-RAT pipe is serving the other thread
b. the RAT is stalled for this thread (including uop dropping and clear-BE conditions)
c. the IDQ delivers 4 uops
2. Count (4 − x) when the RAT is not stalled and the IDQ delivers x uops to the RAT (x in {0,1,2,3})


Assembly Code of a Sample Program


; clear counters
mov ecx, 0x38f        ; IA32_CR_PERF_GLOBAL_CTRL MSR
mov eax, 0
mov edx, 0
wrmsr
mov ecx, 0x309        ; 3 Fixed Counters
wrmsr
mov ecx, 0x30a
wrmsr
mov ecx, 0x30b
wrmsr
mov ecx, 0xc1         ; 4 Programmable Counters
wrmsr
mov ecx, 0xc2
wrmsr
mov ecx, 0xc3
wrmsr
mov ecx, 0xc4
wrmsr

; Events to be counted
; configure br_inst_retired
mov eax, 0x43xxxx
mov edx, 0x0
mov ecx, 0x186        ; event select MSR for General Performance Counter 0
wrmsr
; configure br_inst_retired
mov eax, 0x43xxxx
mov edx, 0x0
mov ecx, 0x187        ; event select MSR for General Performance Counter 1
wrmsr
; configure br_inst_retired
mov eax, 0x43xxxx
mov edx, 0x0
mov ecx, 0x188        ; event select MSR for General Performance Counter 2
wrmsr
; configure br_inst_retired
mov eax, 0x43xxxx
mov edx, 0x0
mov ecx, 0x189        ; event select MSR for General Performance Counter 3
wrmsr


; enable perfMon global control before test case
mov ecx, 0x38f
mov eax, 0xf
mov edx, 0x7
wrmsr
;------------------------------------------------------------------
; Test Case #1: indirect call
; generates 0x15 = 21
; Increments counter by 1 for every call, 1 for every return
; and 1 for the final jmp.
;------------------------------------------------------------------
lea eax, fun_ica
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
nop
call eax
jmp end_ica
fun_ica:
ret
end_ica:

; Disable counters after the test
mov ecx, 0x38f
mov eax, 0
mov edx, 0
wrmsr

; Read counter 0
mov ecx, 0xc1
rdmsr

; The value of each monitored event is given in the results section

; jump to test fail if not equal to the expected value
cmp eax, 0x           ; counter value which is expected
jne fail

; Read counter 1
mov ecx, 0xc2
rdmsr
; jump to test fail if not equal to sum of all test cases
cmp eax, 0x
jne fail

; Read counter 2
mov ecx, 0xc3
rdmsr
; jump to test fail if not equal to sum of all test cases
cmp eax, 0x
jne fail

; Read counter 3
mov ecx, 0xc4
rdmsr
; jump to test fail if not equal to sum of all test cases
cmp eax, 0x
jne fail

pass:
&SIGNAL_PASS
fail:
&SIGNAL_FAIL
CODE


Results - Counter values of the different events


Instruction Retired - 36
Instruction Decoded - 48
Uops retired - 169
Uops issued: Any - 190
Uops issued: Fused - 44
Uops issued: Stall cycles - 435
Uops executed - 171
Uops dispatched core - 171
Uops dispatched stall cycles - 419
Unhalted core cycles - 515
Idq_uops_not_delivered - 1024
Br_inst_exec_all - 21
Br_inst_retired_all - 21
Br_inst_retired_taken - 21
Br_inst_retired_nottaken - 0
Br_inst_retired_call - 10
Br_inst_retired_ret - 10
Br_inst_retired_call - 10
Br_mispred_retired_all - 10
Br_mispred_retired_taken - 10
ILD_Stalls - 22
Resource Stalls - 0


Observations from the above result


Branch Mispredictions, Wasted Work, Misprediction Penalties and Uop Flow
Branch mispredictions introduce execution inefficiencies that are typically decomposed into three components:
1) Wasted work associated with executing the uops of the incorrectly predicted path
2) Cycles lost when the pipeline is flushed of the incorrect uops
3) Cycles lost while waiting for the correct uops to arrive at the execution units
In the Intel Core there are no execution stalls associated with clearing the pipeline of mispredicted uops (component 2). These uops are simply removed from the pipeline without stalling execution or dispatch. This typically lowers the penalty for mispredicted branches. Further, the penalty associated with instruction starvation (component 3) can be measured for the first time in OOO x86 architectures. Speculative OOO execution introduces a component of execution inefficiency due to the uops on mispredicted paths being dispatched to the execution units. This represents wasted work, as these uops will never be retired, and is part of the cost associated with mispredicted branches. It can be found by monitoring the flow of uops through the pipeline. The uop flow can be measured at 3 points in Figure 6.2 shown above: going into the RS with the event UOPS_ISSUED, going into the execution units with UOPS_EXECUTED, and at retirement with UOPS_RETIRED.
Wasted Work/thread = (UOPS_ISSUED.ANY + UOPS_ISSUED.FUSED) − UOPS_RETIRED.ANY
From the above values: Wasted Work/thread = (190 + 44) − 169 = 65
As stated above, there is no interruption in uop dispatch or execution due to flushing the pipeline, so the second component of the misprediction penalty is zero. The third component of the misprediction penalty, instruction starvation, occurs when the instructions associated with the correct path are far away from the core and execution is stalled due to a lack of uops. This can now be explicitly measured at the output of the resource allocation as follows. Using CMASK = 1 and INV = 1 logic applied to UOPS_ISSUED, we can count the total number of cycles in which no uops were issued to the OOO engine:
UOPS_ISSUED.STALL_CYCLES = UOPS_ISSUED.ANY: CMASK=1, INV=1

Since the event RESOURCE_STALLS.ANY counts the number of cycles in which uops could not be issued due to a lack of downstream resources (RS or ROB slots, load or store buffers, etc.), the difference is the number of cycles in which no uops were issued because none were available:
Instruction Starvation = UOPS_ISSUED.STALL_CYCLES − RESOURCE_STALLS.ANY
From the above values: Instruction Starvation = 435 − 0 = 435
The above observation is an example of how performance can be monitored. Likewise, with other performance monitoring events, performance can be analyzed in further detail; this cycle accounting analysis simply showed how it can be done.
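Packaged as code, the two derived metrics follow directly from the counter values in the results table; the C sketch below simply reproduces the arithmetic above.

    #include <stdio.h>

    int main(void) {
        /* Counter values from the results table for the indirect-call test. */
        unsigned long uops_issued_any    = 190;
        unsigned long uops_issued_fused  = 44;
        unsigned long uops_retired_any   = 169;
        unsigned long uops_issued_stalls = 435;  /* UOPS_ISSUED, CMASK=1, INV=1 */
        unsigned long resource_stalls    = 0;    /* RESOURCE_STALLS.ANY         */

        unsigned long wasted_work =
            uops_issued_any + uops_issued_fused - uops_retired_any;        /* 65  */
        unsigned long instruction_starvation =
            uops_issued_stalls - resource_stalls;                          /* 435 */

        printf("wasted work per thread = %lu uops\n", wasted_work);
        printf("instruction starvation = %lu cycles\n", instruction_starvation);
        return 0;
    }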


CONCLUSION
The general capabilities of performance monitoring hardware described in this report have been used extensively to analyze application, operating system, and processor performance. These analyses have helped improve not only application and operating system code but also compilers and next-generation processor designs. However, as discussed previously, the performance-monitoring support offered by most processors has been limited (too few counters, lack of support for distinguishing between speculative and non-speculative event counts, imprecise event-based sampling, and lack of support for creating data address profiles). Recent advanced processors provide performance-monitoring capabilities that overcome these limitations, while also providing full support for the simultaneous multithreading capabilities of the new processors. We saw how Cycle Accounting Analysis, used in all our analysis approaches, gives good insight into how a specific application performs by means of decomposing cycles and, most importantly, stalled cycles.


REFERENCES
1. Intel Reference Manual: Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide.
2. The Basics of Performance-Monitoring Hardware, Brinkley Sprunt, Bucknell University, Electrical Engineering Dept., Moore Ave., Lewisburg, PA 17837; bsprunt@bucknell.edu.
3. Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 Processors.
4. Performance Monitoring Unit Sharing Guide, Peggy Irelan and Shihjong Kuo.
5. Intel Microarchitecture Codename Nehalem Performance Monitoring Unit Programming Guide (Nehalem Core PMU).
6. Hardware-based Performance Monitoring with VTune Performance Analyzer under Linux, Hassan Shojania; shojania@ieee.org.
7. Pentium 4 Performance-Monitoring Features, Brinkley Sprunt, Bucknell University, Electrical Engineering Dept., Moore Ave., Lewisburg, PA 17837; bsprunt@bucknell.edu.

