
Atom: In-Order and HyperThreading

8:30 PM - June 5, 2008 by Pierre Dandumont


The Atom uses a new architecture, but with older technologies. It's the first in-order x86 from
Intel since the Pentium, back in 1993. All other Intel processors (since the P6) use an out-of-order architecture.
In-Order: Say what?


To simplify, think of the processor as receiving instructions one by one and placing them
in its pipeline before executing them. In an in-order architecture, the instructions are executed
in the order in which they arrive, whereas an out-of-order architecture is capable of changing
the order in the pipeline. The advantage is that wasted cycles can be limited. If, for example, you
have a simple calculation instruction, a memory access, then another simple calculation, an
in-order architecture will execute the three operations one after the other, whereas an out-of-order
processor can execute the two calculations in parallel while the memory access is in flight, with
an obvious time saving. Quite surprisingly, whereas in-order architectures generally use a
short pipeline, the Atom has a 16-stage pipeline, which can be a disadvantage in certain cases.
HyperThreading

Intel Atom: simple instructions first

HyperThreading is a technology that appeared with the Pentium 4. It processes two threads
simultaneously using otherwise unused parts of the pipeline. While not as efficient as two true cores,
the technology makes the OS see two logical processors and can increase the computer's overall
performance. On the Atom, with its long pipeline coupled to an in-order architecture,
HyperThreading is very effective, and the technology can significantly increase performance
without impacting the TDP. Intel claims an increase in power consumption of only 10%.
The processing core


For the rest, the Atom is equipped with two ALUs (units capable of performing integer
calculations) and two FPUs (units dedicated to floating-point calculation and very important
for gaming, for example). The first ALU manages shift operations, and the second handles jumps. All
multiplication operations, even integer ones, are automatically sent to the FPUs.
The first FPU is simple and limited to addition, while the second handles SIMD and
multiply/divide operations. Note that the first FPU is used in conjunction with the second
for 128-bit calculations (the two units are 64 bits wide).
Intel Has Optimized the Basic Instructions

If you look at the number of cycles necessary to execute instructions, you realize something:
some instructions are fast and others are (very) slow. A mov or an add, for example, is
executed in one cycle, as on a Core 2 Duo, whereas a multiplication (imul) takes five
cycles, compared to only three on the Core architecture. Worse, a 32-bit floating-point
division takes 31 cycles, compared to only 17 (almost half as many) on a Core 2
Duo. In practice, and Intel willingly admits this, the Atom is optimized to execute basic
instructions quickly, meaning that the processor short-changes performance on complex
instructions. This can be checked simply using Everest (for example), which includes a tool
for measuring instruction latencies.

Atom: Caches and FSB


Intel has chosen a fairly out-of-the-ordinary organization for the Atom, but without sacrificing performance (which is
important with a CPU using an in-order architecture).

24 kB + 32 kB: An Asymmetrical Cache


The Atom's Level-1 cache is 56 kB total: 24 kB for data and 32 kB for instructions. This asymmetry, fairly surprising
for Intel, stems from the structure of the cache. Intel uses eight transistors to store one bit, compared to six transistors in
a standard cache. This technique allows the voltage applied to the cache to maintain its contents to be reduced. It
seems that this move to 8-transistor cells was made late in the game, when the design of the processor was fairly
advanced, which meant that the size of the cache had to be reduced to fit it in, which explains the 24 kB data
cache. This unofficial explanation was advanced by AnandTech in their article introducing the Atom in April.


512 kB Level 2, shrinkable


The Level-2 cache has a capacity of 512 kB and obviously runs at the same frequency as the processor. This 8-way
cache is fairly classic and is close in performance to the one used in the Core 2 Duo (its latency is 16 cycles, compared
to 14 for the Core 2). One of the new functions can deactivate part of the cache automatically: if a program doesn't
require much cache memory, part of it can be shut down. In practice, the cache goes from 8-way to 2-way (thus from
512 kB to 128 usable kB). This technique is a way of shaving off a few precious milliwatts.


The FSB: Two modes of operation


The Atom's FSB is the same one used by Intel since the Pentium 4. It operates in Quad Pumped (QDR) mode with GTL
signaling. An interesting point: the Atom also supports another signaling technology, CMOS mode. GTL is effective (the bus
can reach 1,600 MHz) but power-intensive, whereas CMOS allows the bus voltage to be reduced. Technically, GTL
uses resistors to improve the quality of the signal, but they aren't really necessary except at higher frequencies. With
the Atom and its bus, limited to 533 MHz, it's possible to change to CMOS mode: the resistors are deactivated and
the bus voltage is cut in half. At the moment, only the SCH chipset is capable of handling the FSB in CMOS mode.

Power Management: Tests and Theory


Power consumption is central to this Intel platform, and Intel has made a lot of effort in that department. Aside from
the chipset, which consumes a lot of power in comparison to the processor, the Atom itself has many attractive
functions.

Bus and cache


As we've already said, Intel has put a lot of effort into the bus and the cache: a different mode for the bus was
developed (CMOS mode) and the cache can be disabled in part depending on how it's being used. These functions
reduce power consumption, as do the use of an in-order architecture and 8T SRAM for the L1 cache.

C6 power state

The Atom's power management is very similar to that of the Core 2.

In addition to the low CPU voltage (1.05 V), the Atom also introduces a new sleep state, C6. As a reminder, the C
states (C0 to C6) are low-power states; the higher the number, the less the CPU consumes. In C6, the entire
processor is almost totally disabled: only a small cache of a few kB (10.5 kB) is kept powered to store the state of the
registers. In this mode, the L2 cache is flushed and disabled, the supply voltage falls to only 0.3 V, and only a small
part of the processor remains active, for wake-up purposes. The processor can enter C6 in approximately 100
microseconds, which is quick. In practice, Intel claims, C6 is used 90% of the time, which limits overall power
consumption (obviously, if you launch a program that requires a lot of CPU power, or even watch a Flash video, you
won't be in that state).
We should point out, though, that the two chipsets to be used with these Atoms are power-hungry: the Atom 230
uses an i945GC that consumes 22 W (against 4 W for the CPU) and the Atom N270 ships with an i945GSE that burns
5.5 W (against 2.4 W for the CPU).

In Practice
So is the Atom really low-power in practice? The processor is, yes. For the platform aimed at nettops (low-cost desktop
computers), the answer is yes, but... Why the but? Because the chipset draws a lot of power, and the desktop processor
is listed at a TDP of 4 W, compared to 2.4 W for the mobile versions. Our test motherboard consumes 59 W at idle,
and we reached 62 W under maximum load (with a 3.5" hard disk and a 1 GB DDR2 DIMM). Obviously, these values
are what we measured for the complete platform, not only the motherboard, and they don't take power-supply losses
into account (our test model has an efficiency of approximately 80%). That's both a little and a lot: it's not much for a
desktop computer, of course, but it's a lot in absolute terms. We should add that we recently tested a motherboard
based on a 1.5 GHz Via C7, and that configuration drew less power with the same components: 49 W at idle and 59 W
under load (again measured at the AC outlet).

Conclusion
What conclusion should we draw about the Atom platform? We came away with a mixed impression. The processor
itself is a success: it's affordable, consumes very little power, and while its performance is weak, it's sufficient for its
target market (low-cost PCs intended for Web use). In addition, HyperThreading is a good feature and the platform is
responsive. But for us the disappointment is the associated chipsets. Intel offers only two choices, and they're open to
criticism. The SCH Poulsbo seems efficient and self-contained, but it's not viable in a standard PC due to its MID
orientation (no SATA, for example), whereas the i945GC and i945GSE chipsets are usable in PCs, but they're
throwbacks: they lack features, their 3D performance is disastrous (even as more and more applications use 3D),
and they consume significantly more power than the processor itself.

You get the feeling that Atom is only a trial balloon, one that's a success from some points of view and a failure from
others. Will computer manufacturers and the general public go for it? Undoubtedly, and for two reasons: price
and marketing. The platform will make it possible to offer computers at a very low price, and for now Atom has a good
brand image. The public's reasoning might proceed something like this:

"An Eee PC 900 for $450 (good) with a Celeron (not good) at 900 MHz (not good)"

or

"An Eee PC 901 for $450 (good) with an Atom (good) at 1.6 GHz (good)"

In other words, the Atom version will appeal more to the general public, even if in practice the difference is likely to be
pretty slim.

The Intel Atom Platform

A paradoxical platform: the processor is a success (even if its performance is weak in absolute terms), whereas the
associated chipsets are not worth their salt. Overall, the gains compared to older platforms remain slim, and we hope
that Intel will offer better-suited chipsets in the future.

Pros

The price: $29 for an Atom 230

Low power consumption

HyperThreading, a good feature on this processor


Cons

Weak overall performance

The chipsets

Very poor 3D performance

A mismatched platform

Bonnell microarchitecture
Main article: Bonnell (microarchitecture)
Intel Atom processors are based on the Bonnell microarchitecture,[3][4] which can execute up to two instructions
per cycle. Like many other x86 microprocessors, it translates x86 instructions (CISC instructions) into simpler
internal operations (sometimes referred to as micro-ops, i.e., effectively RISC-style instructions) prior to
execution. The majority of instructions produce one micro-op when translated, with around 4% of instructions
used in typical programs producing multiple micro-ops. The number of instructions that produce more than one
micro-op is significantly lower than in the P6 and NetBurst microarchitectures. In the Bonnell microarchitecture,
internal micro-ops can contain both a memory load and a memory store in connection with an ALU operation,
thus being closer to the x86 level and more powerful than the micro-ops used in previous designs.[28] This
enables relatively good performance with only two integer ALUs, and without any instruction
reordering, speculative execution, or register renaming. The Bonnell microarchitecture therefore represents a
partial revival of the principles used in earlier Intel designs such as the P5 and the i486, with the sole purpose of
enhancing the performance-per-watt ratio. However, Hyper-Threading is implemented in an easy (i.e., low-power)
way to employ the whole pipeline efficiently by avoiding the typical single-thread dependencies.[28]

Hyperthreading



Hyper-Threading Technology
How operating systems can do more and perform better
Intel Hyper-Threading Technology (Intel HT Technology) uses processor resources more
efficiently, enabling multiple threads to run on each core. As a performance feature, Intel HT
Technology also increases processor throughput, improving overall performance on threaded
software.
Intel HT Technology is available on previous-generation Intel Core processors, the 3rd
generation Intel Core processor family, and the Intel Xeon processor family. By
combining one of these Intel processors and chipsets with an OS and BIOS supporting Intel
HT Technology, you can:

Run demanding applications simultaneously while maintaining system responsiveness


Keep systems protected, efficient, and manageable while minimizing impact on
productivity
Provide headroom for future business growth and new solution capabilities
Intensive graphics without compromise
With Intel HT Technology, multimedia enthusiasts can create, edit, and encode graphically
intensive files while running background applications, such as virus protection software,
without compromising system performance.
More tasks, more efficient business
Processors with both Intel HT Technology and Intel Turbo Boost Technology (or Intel Turbo
Boost Technology 2.0, available in the 3rd generation Intel Core processor family) deliver
better performance and can complete tasks more quickly. The combination of technologies
enables simultaneous processing of multiple threads, dynamically adapts to the workload,
and automatically disables inactive cores. This increases processor frequency on the busy
cores, giving an even greater performance boost for threaded applications.
Thanks to Intel HT Technology, businesses can:

Improve productivity by doing more simultaneously without slowing down


Provide faster response times for Internet and e-commerce applications, enhancing
customer experiences
Increase the number of transactions that can be processed simultaneously
Utilize existing 32-bit application technologies while maintaining 64-bit future
readiness

Assessing system readiness


Intel HT Technology is available on a variety of laptop, desktop, server, and workstation
systems. Look for systems with the Intel HT Technology logo and verify with your system
vendor that the system utilizes Intel HT Technology.
System requirements

A processor that supports Intel HT Technology


Intel HT Technology enabled chipset
Intel HT Technology enabled system BIOS
Intel HT Technology enabled/optimized operating system

2.2.8 Intel Hyper-Threading Technology


Intel Hyper-Threading Technology (Intel HT Technology) was developed to improve
the performance of IA-32 processors when executing multi-threaded operating
system and application code or single-threaded applications under multi-tasking
environments. The technology enables a single physical processor to execute two or
more separate code streams (threads) concurrently using shared execution
resources.
Intel HT Technology is one form of hardware multi-threading capability in IA-32
processor families. It differs from multi-processor capability using separate physically
distinct packages with each physical processor package mated with a physical
socket. Intel HT Technology provides hardware multi-threading capability with a
single physical package by using shared execution resources in a processor core.
Architecturally, an IA-32 processor that supports Intel HT Technology consists of two
or more logical processors, each of which has its own IA-32 architectural state. Each
logical processor consists of a full set of IA-32 data registers, segment registers,
control registers, debug registers, and most of the MSRs. Each also has its own
advanced programmable interrupt controller (APIC).
Figure 2-5 shows a comparison of a processor that supports Intel HT Technology
(implemented with two logical processors) and a traditional dual processor system.
Unlike a traditional MP system configuration that uses two or more separate physical
IA-32 processors, the logical processors in an IA-32 processor supporting Intel HT
Technology share the core resources of the physical processor. This includes the
execution engine and the system bus interface. After power up and initialization,
each logical processor can be independently directed to execute a specified thread,
interrupted, or halted.
Intel HT Technology leverages the process and thread-level parallelism found in
contemporary operating systems and high-performance applications by providing
two or more logical processors on a single chip. This configuration allows two or more
threads to be executed simultaneously on each physical processor. Each logical
processor executes instructions from an application thread using the resources in the
processor core. The core executes these threads concurrently, using out-of-order
instruction scheduling to maximize the use of execution units during each clock cycle.

2.2.8.1 Some Implementation Notes

All Intel HT Technology configurations require:

A processor that supports Intel HT Technology


A chipset and BIOS that utilize the technology
Operating system optimizations
See http://www.intel.com/products/ht/hyperthreading_more.htm for information.
At the firmware (BIOS) level, the basic procedures to initialize the logical processors
in a processor supporting Intel HT Technology are the same as those for a traditional
DP or MP platform. The mechanisms that are described in the Multiprocessor Specification,
Version 1.4 to power-up and initialize physical processors in an MP system
also apply to logical processors in a processor that supports Intel HT Technology.
An operating system designed to run on a traditional DP or MP platform may use
CPUID to determine the presence of the hardware multi-threading feature and
the number of logical processors it provides.

Although existing operating system and application code should run correctly on a
processor that supports Intel HT Technology, some code modifications are recommended
to get the optimum benefit. These modifications are discussed in Chapter 7,
Multiple-Processor Management, of the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 3A.

7.3 TASK SWITCHING


The processor transfers execution to another task in one of four cases:

The current program, task, or procedure executes a JMP or CALL instruction to a TSS descriptor in the GDT.

The current program, task, or procedure executes a JMP or CALL instruction to a task-gate descriptor in the GDT or the current LDT.

An interrupt or exception vector points to a task-gate descriptor in the IDT.

The current task executes an IRET when the NT flag in the EFLAGS register is set.

Figure 7-7. Task Gates Referencing the Same Task (task gates in the LDT and IDT both referencing a single TSS descriptor in the GDT)

JMP, CALL, and IRET instructions, as well as interrupts and exceptions, are all mechanisms for redirecting a
program. The referencing of a TSS descriptor or a task gate (when calling or jumping to a task) or the state of the
NT flag (when executing an IRET instruction) determines whether a task switch occurs.
The processor performs the following operations when switching to a new task:

1. Obtains the TSS segment selector for the new task as the operand of the JMP or CALL instruction, from a task gate, or from the previous task link field (for a task switch initiated with an IRET instruction).

2. Checks that the current (old) task is allowed to switch to the new task. Data-access privilege rules apply to JMP and CALL instructions. The CPL of the current (old) task and the RPL of the segment selector for the new task must be less than or equal to the DPL of the TSS descriptor or task gate being referenced. Exceptions, interrupts (except for interrupts generated by the INT n instruction), and the IRET instruction are permitted to switch tasks regardless of the DPL of the destination task-gate or TSS descriptor. For interrupts generated by the INT n instruction, the DPL is checked.

3. Checks that the TSS descriptor of the new task is marked present and has a valid limit (greater than or equal to 67H).

4. Checks that the new task is available (call, jump, exception, or interrupt) or busy (IRET return).

5. Checks that the current (old) TSS, new TSS, and all segment descriptors used in the task switch are paged into system memory.

6. If the task switch was initiated with a JMP or IRET instruction, the processor clears the busy (B) flag in the current (old) task's TSS descriptor; if initiated with a CALL instruction, an exception, or an interrupt, the busy (B) flag is left set. (See Table 7-2.)

7. If the task switch was initiated with an IRET instruction, the processor clears the NT flag in a temporarily saved image of the EFLAGS register; if initiated with a CALL or JMP instruction, an exception, or an interrupt, the NT flag is left unchanged in the saved EFLAGS image.

8. Saves the state of the current (old) task in the current task's TSS. The processor finds the base address of the current TSS in the task register and then copies the states of the following registers into the current TSS: all the general-purpose registers, segment selectors from the segment registers, the temporarily saved image of the EFLAGS register, and the instruction pointer register (EIP).

9. If the task switch was initiated with a CALL instruction, an exception, or an interrupt, the processor will set the NT flag in the EFLAGS loaded from the new task. If initiated with an IRET instruction or JMP instruction, the NT flag will reflect the state of NT in the EFLAGS loaded from the new task (see Table 7-2).

10. If the task switch was initiated with a CALL instruction, JMP instruction, an exception, or an interrupt, the processor sets the busy (B) flag in the new task's TSS descriptor; if initiated with an IRET instruction, the busy (B) flag is left set.

11. Loads the task register with the segment selector and descriptor for the new task's TSS.

12. The TSS state is loaded into the processor. This includes the LDTR register, the PDBR (control register CR3), the EFLAGS register, the EIP register, the general-purpose registers, and the segment selectors. A fault during the load of this state may corrupt architectural state.

13. The descriptors associated with the segment selectors are loaded and qualified. Any errors associated with this loading and qualification occur in the context of the new task and may corrupt architectural state.
NOTES
If all checks and saves have been carried out successfully, the processor commits to the task
switch. If an unrecoverable error occurs in steps 1 through 11, the processor does not complete the
task switch and ensures that the processor is returned to its state prior to the execution of the
instruction that initiated the task switch.

If an unrecoverable error occurs in step 12, architectural state may be corrupted, but an attempt
will be made to handle the error in the prior execution environment. If an unrecoverable error
occurs after the commit point (in step 13), the processor completes the task switch (without
performing additional access and segment availability checks) and generates the appropriate
exception prior to beginning execution of the new task.

If exceptions occur after the commit point, the exception handler must finish the task switch itself
before allowing the processor to begin executing the new task. See Chapter 6, Interrupt 10,
Invalid TSS Exception (#TS), for more information about the effect of exceptions on a task
when they occur after the commit point of a task switch.

14. Begins executing the new task. (To an exception handler, the first instruction of the new task appears not to have been executed.)
The state of the currently executing task is always saved when a successful task switch occurs. If the task is
resumed, execution starts with the instruction pointed to by the saved EIP value, and the registers are restored to
the values they held when the task was suspended.

When switching tasks, the new task does not inherit its privilege level from the suspended task. The new task
begins executing at the privilege level specified in the CPL field of the CS register, which is loaded from the TSS.
Because tasks are isolated by their separate address spaces and TSSs and because privilege rules control access
to a TSS, software does not need to perform explicit privilege checks on a task switch.

Table 7-1 shows the exception conditions that the processor checks for when switching tasks. It also shows the
exception that is generated for each check if an error is detected and the segment that the error code references.
(The order of the checks in the table is the order used in the P6 family processors. The exact order is model specific
and may be different for other IA-32 processors.) Exception handlers designed to handle these exceptions may be
subject to recursive calls if they attempt to reload the segment selector that generated the exception. The cause of
the exception (or the first of multiple causes) should be fixed before reloading the selector.
Addressing Modes

3.7.5 Specifying an Offset


The offset part of a memory address can be specified directly as a static value (called
a displacement) or through an address computation made up of one or more of the
following components:

Displacement: an 8-, 16-, or 32-bit value.

Base: the value in a general-purpose register.

Index: the value in a general-purpose register.

Scale factor: a value of 2, 4, or 8 that is multiplied by the index value.

The offset that results from adding these components is called an effective
address. Each of these components can have either a positive or negative (2's
complement) value, with the exception of the scale factor. Figure 3-11 shows all
the possible ways that these components can be combined to create an effective
address in the selected segment.
The uses of general-purpose registers as base or index components are restricted in
the following manner:

The ESP register cannot be used as an index register.

When the ESP or EBP register is used as the base, the SS segment is the default
segment. In all other cases, the DS segment is the default segment.


The base, index, and displacement components can be used in any combination, and
any of these components can be NULL. A scale factor may be used only when an
index also is used. Each possible combination is useful for data structures commonly
used by programmers in high-level languages and assembly language.
The following addressing modes suggest uses for common combinations of address
components.

Displacement: a displacement alone represents a direct (uncomputed) offset
to the operand. Because the displacement is encoded in the instruction, this form
of address is sometimes called an absolute or static address. It is commonly
used to access a statically allocated scalar operand.

Base: a base alone represents an indirect offset to the operand. Since the
value in the base register can change, it can be used for dynamic storage of
variables and data structures.

Base + Displacement: a base register and a displacement can be used
together for two distinct purposes:

To index into an array when the element size is not 2, 4, or 8 bytes: the
displacement component encodes the static offset to the beginning of the
array, and the base register holds the result of a calculation to determine the
offset to a specific element within the array.

To access a field of a record: the base register holds the address of the
beginning of the record, while the displacement is a static offset to the field.
An important special case of this combination is access to parameters in a
procedure activation record, the stack frame created when a procedure is
entered. Here, the EBP register is the best choice for the base register,
because it automatically selects the stack segment. This is a compact encoding
for this common function.

Figure 3-11. Offset (or Effective Address) Computation

Offset = Base + (Index * Scale) + Displacement

(Index * Scale) + Displacement: this address mode offers an efficient way
to index into a static array when the element size is 2, 4, or 8 bytes. The
displacement locates the beginning of the array, the index register holds the
subscript of the desired array element, and the processor automatically converts
the subscript into an index by applying the scaling factor.

Base + Index + Displacement: using two registers together supports either
a two-dimensional array (the displacement holds the address of the beginning of
the array) or one of several instances of an array of records (the displacement is
an offset to a field within the record).

Base + (Index * Scale) + Displacement: using all the addressing
components together allows efficient indexing of a two-dimensional array when
the elements of the array are 2, 4, or 8 bytes in size.

Features of 8086 Microprocessor


1) The 8086 has a 16-bit ALU; this means 16-bit numbers are directly processed by the 8086.

2) It has a 16-bit data bus, so it can read data from or write data to memory or I/O ports either 16 bits or 8 bits at a time.

3) It has 20 address lines, so it can address up to 2^20 = 1,048,576 bytes (1 MB) of memory (words, i.e., 16-bit numbers, are stored in consecutive memory locations). This 1 MB memory size makes multiprogramming feasible, and several multiprogramming features have been incorporated into the 8086 design.

4) The 8086 includes a few features that enhance multiprocessing capability (it can be used with math coprocessors like the 8087, the I/O processor 8089, etc.).

5) It operates on a +5 V supply and a single-phase (single-line) clock. (The clock is generated by a separate peripheral chip, the 8284.)

6) The 8086 comes in different versions: the 8086 runs at 5 MHz, the 8086-2 at 8 MHz, and the 8086-1 at 10 MHz.

7) It comes in a 40-pin package built in HMOS technology, with around 29,000 transistors in its circuitry.

8) It has a multiplexed address and data bus like the 8085, which reduces the pin count considerably.

9) Higher throughput (speed), achieved by a concept called pipelining.
