Overview of document
--------------------
- a guide for programmers writing their own NES/FC emulator software
- provides many code optimization tips (with focus placed on the x86-based
personal computing platform)
- provides lists of features to implement in an emulator intended for
public-domain release
- created in an effort to improve the quality of the user's NES gaming
experience
Topics discussed
----------------
General PPU emulation
Pixel rendering techniques
Merging playfield & object pixels
Frame store optimizations
+---------------------+
|General PPU emulation|
+---------------------+
Most likely, your emulator's performance will hinge on the speed at which
it can render NES graphics. It's pretty easy to write a slow PPU render
engine, since overall there's a good deal of work that has to be done.
Accurate emulation of the PPU is difficult, due to all the trickery various
NES games use to achieve special video effects (like split screen
scrolling) that are not possible by "clean" or conventional means. In
reality, all these "tricks" are simply accomplished by writing to the
appropriate PPU (or related) registers at the right time during the
rendering of a frame (picture).
On a hardware level, the CPU & PPU in the NES run simultaneously. This is
why a game can be coded to write to a PPU register at a certain time during
a frame, so that the (on-screen) effect occurs at a specific location on
the screen. Thus, the first instinct one has for writing a NES emulator is
to execute both the CPU & PPU engines alternately on every (NES) clock
cycle. This gives very accurate emulation, BUT doing so is also VERY
processor intensive, mostly because of the overhead of transferring program
control to so many hardware emulation routines in so little time (1 CPU
clock cycle). As a result, emulators coded like this turn out to be the
slowest ones.
PPU info
--------
NES graphics consist of a single scrollable playfield, and 64
objects/sprites. The screen resolution is 256*240 pixels, and while games
can control the graphics on a per-pixel basis, this is usually avoided
since it's pretty difficult. Instead, the PPU makes displaying graphics
easier for the programmer by dividing the screen up into tiles, each of
which indexes an 8*8 pixel bitmap to appear in that particular spot. Each
object defines 1 or 2 tiles to be displayed at a randomly-accessible xy
coordinate on the screen. There are also 8 palette tables in the PPU that
bitmap data can refer to (playfield & object bitmap data each have 4
palettes). Each palette has 3 indexable colors, as tile bitmaps only
consist of 2 bits per pixel (the 00 combination is considered
transparency). A single transparency color palette register is also
defined, and is only used as the on-screen color when all overlapping
playfield/object pixels (due to objects being placed on the playfield) are
defined as transparent.
Object attribute memory (sprite RAM, or "OAM", which contains private tile
index and palette select information) is evaluated every single scanline
(y-coordinate entries are examined), and in-range objects have their tile
bitmaps loaded into the PPU in between scanlines. The contents are then
merged with the playfield's pixel data in real-time.
For example, say the CPU engine is executing instructions. Then, on clock
cycle 13000 (relative to the last VINT), a write to the PPU's scroll
registers is made (which causes a split-screen effect). Now, first the PPU
engine translates 13000 CC's into X/Y coordinates (in this case, on-screen
scanline 93, roughly pixel #126 (the equations to do these calculations
will be revealed later)). Ideally*, all pixels before this point will now
be rendered to a buffer, using the data in the PPU registers prior to the
write. Now the screen area before the write occurred has been rendered
accurately, and the screen will progressively continue to be updated in
this fashion as more mid-frame writes occur. If no more occur, when the CPU
arrives at the # of clock cycles per frame, the rest of the image (if any)
can be rendered.
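
The "render up to the write point, then apply the write" idea can be
sketched in C as follows. This is only a minimal illustration: the Ppu
structure, its single "scroll" field (standing in for all PPU register
state), and the function names are hypothetical, and the "pixel" stored is
just the register value so that the effect is easy to observe. The timing
constants (3 pixels per CPU clock, 341 pixels per scanline, 21 blank
scanlines) are the ones this document uses.

```c
#include <stdint.h>

enum { PPU_PER_CPU = 3, LINE_LEN = 341, BLANK_LINES = 21,
       SCREEN_W = 256, SCREEN_H = 240 };

typedef struct {
    uint32_t drawn;    /* visible pixels drawn so far, as y*256 + x */
    uint8_t  scroll;   /* hypothetical stand-in for PPU register state */
    uint8_t  frame[SCREEN_W * SCREEN_H];
} Ppu;

/* Number of visible pixels the PPU has finished at a CPU cycle count
   measured from the last VINT. */
static uint32_t visible_pixels_at(uint32_t cpu_cc)
{
    uint32_t ppu_cc = cpu_cc * PPU_PER_CPU;
    int32_t  line   = (int32_t)(ppu_cc / LINE_LEN) - BLANK_LINES;
    uint32_t x      = ppu_cc % LINE_LEN;

    if (line < 0)         return 0;                    /* still in V-blank */
    if (line >= SCREEN_H) return SCREEN_W * SCREEN_H;  /* frame complete */
    if (x > SCREEN_W)     x = SCREEN_W;                /* in H-blank */
    return (uint32_t)line * SCREEN_W + x;
}

/* Draw every pixel not yet drawn, up to cpu_cc, with the CURRENT state. */
static void ppu_catch_up(Ppu *p, uint32_t cpu_cc)
{
    uint32_t target = visible_pixels_at(cpu_cc);
    while (p->drawn < target) {
        p->frame[p->drawn] = p->scroll;   /* placeholder for real rendering */
        p->drawn++;
    }
}

/* A mid-frame register write: render with the old value first. */
static void ppu_write_scroll(Ppu *p, uint32_t cpu_cc, uint8_t value)
{
    ppu_catch_up(p, cpu_cc);
    p->scroll = value;
}
```

Calling ppu_catch_up() once more at the end of the frame renders whatever
is left, which matches the "rest of the image" step described above.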
scroll registers
Note that any PPU mapped memory (which means name, pattern, attribute &
palette tables) can only be changed while objects & the playfield are
disabled (unless cartridge hardware provides a way to do this through the
CPU memory map). Since the screen is blanked to black during this time
(regardless of the current transparency color the palette is programmed
with), these writes do not affect how the screen is rendered, and
consequently, updating the screen can be postponed.
Collision flag
--------------
Games without hardware for scanline counting often poll this bit to find
out when to write to a PPU register in order to produce a split screen, or
a pattern table swap/bankswitch. The collision flag is set when the first
non-transparent pixel of object 0 collides with a playfield pixel that is
also non-transparent. Since the screen position of the first colliding
pixel can be determined at any time (and therefore, the exact CPU clock
cycle at which the collision is expected to occur), when a game requests
the status of this flag for the first time, a routine in the PPU engine
can calculate at which clock cycle this flag will be set (calculations
will be shown later). Subsequent requests for the collision flag's status
would then only require the engine to compare the current CPU clock cycle
to the calculated collision clock cycle. Whenever a mid-frame change occurs
(whether it affects the playfield, or objects), the clock cycle at which
the collision flag will go off will have to be recalculated (unless it has
already gone off).
The PPU renders 3 pixels in one CPU clock. Therefore, by multiplying the
CPU CC figure by 3, we get the total amount of pixels that have been
rendered (including non-displayed ones) since the VINT.
341 pixels are rendered per scanline (although only 256 are displayed).
Therefore, by dividing PPUCC by this, we get the # of completely rendered
scanlines since the VINT.
21 blank scanlines are rendered before the first visible one is displayed.
So, to get a scanline offset into the actual on-screen image, we simply
subtract the amount of non-displayed scanlines. Note that if this yields a
negative number, the PPU is still in the V-blank period.
PPUCC = CPUCC * 3
Scanline = PPUCC div 341 - 21;    Y- coordinate
PixelOfs = PPUCC mod 341;         X- coordinate
CPUcollisionCC = ((Y+21)*341+X)/3
Note that if the PixelOfs equation yields a number higher than 255, the PPU
is in the H-blank period.
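
As a quick check of these equations, here is a small C sketch of both
directions of the conversion. The type and function names are made up for
the example; only the constants 3, 341 and 21 come from the text. The
scanline is treated as the Y coordinate and the pixel offset as the X
coordinate, which is how the CPUcollisionCC equation uses them.

```c
#include <stdint.h>

typedef struct {
    int scanline;   /* Y: negative while the PPU is still in V-blank */
    int pixel;      /* X: above 255 while the PPU is in H-blank */
} PpuPos;

/* CPU clock cycles since VINT -> on-screen position. */
static PpuPos cpu_cc_to_pos(uint32_t cpu_cc)
{
    uint32_t ppu_cc = cpu_cc * 3u;
    PpuPos pos;
    pos.scanline = (int)(ppu_cc / 341u) - 21;
    pos.pixel    = (int)(ppu_cc % 341u);
    return pos;
}

/* On-screen position -> CPU clock cycle (integer division rounds down). */
static uint32_t pos_to_cpu_cc(int x, int y)
{
    return (uint32_t)((y + 21) * 341 + x) / 3u;
}
```

With the earlier example of a write at CPU cycle 13000, cpu_cc_to_pos()
gives scanline 93 and pixel 126, and pos_to_cpu_cc(126, 93) gives 13000
again.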
the VBL flag must get set in the middle of the LDA instruction. Since there
are 2 instructions in the main loop, there's about a 50% chance of the read
value from $2002 being pushed on the stack with the vbl bit set. A
work-around for emulators that can't handle this mid-instruction taboo, is
to set the vbl bit slightly before the NMI routine is invoked.
Other notes
-----------
- some games rely on the proper implementation of the collision and
dropping object flags in register $2002. This is usually done to implement
up to 3 independent horizontally-tiled scrollable playfields. Make sure
these flags are set at the right time, and stay set until scanline 20 of
the next frame (relative to /NMI).
+--------------------------+
|Pixel rendering techniques|
+--------------------------+
3 rendering techniques are described in this section. They are all real-time
techniques. An unreleased version of this document discussed a tile
Basic
-----
This method, which is the most straightforward, is to store the PPU's
52-color matrix as constant data in the VGA palette registers (or
otherwise, other palette registers used for an 8-bit per pixel graphics
mode). Before a pixel can be drawn, the pixel color is calculated (via
pattern table & palette select data). The PPU palette registers are looked
up in some way or another, and the contents of the palette register element
are written to a virtual frame buffer as the pixel data. This technique is
the easiest to implement, and provides the most accurate PPU emulation.
However, since every pixel drawn requires an independent palette look-up,
this method is naturally very slow.
that the lookup table grows from 2^2*1=4 bytes for a single pixel lookup, to
2^4*2=32 bytes for a 2-pixel lookup, to 2^8*4=1024 bytes for a 4-pixel
lookup. Each of the palette's 4 colors is also mirrored across these tables,
and this has to be maintained. Since I've never tried this optimization
technique, I can't tell you how effective it is (or when it stops being
effective).
Another way to increase the speed of this approach is to change the bit
ordering of the pattern tables stored in memory to favor this rendering
algorithm. For example, store the bitmap for any scanline of a tile as
eight 2-bit packed pixels (one 16-bit word), instead of the two 8-bit bit
planes used by default. Doing this allows the tile rendering routine to
easily extract the 2-bit number for indexing the 4 color palette associated
with the particular tile. Of course, once the pattern tables are stored
this way, the data has to be converted whenever pattern table memory is
read or written. Since this happens much less often than rendering (even in
games that use CHR-RAM), it's a good idea.
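
A minimal C sketch of such a conversion is shown below. It assumes the
standard NES pattern layout (16 bytes per tile, bytes 0-7 holding bit plane
0 and bytes 8-15 holding bit plane 1, with bit 7 as the leftmost pixel).
The packed layout chosen here (leftmost pixel in the top 2 bits of a 16-bit
word) and the function names are just one possible choice, not something
this document prescribes.

```c
#include <stdint.h>

/* Pack one tile scanline: plane0 holds bit 0 of each pixel, plane1 holds
   bit 1. The leftmost pixel ends up in bits 15..14 of the result. */
static uint16_t pack_row(uint8_t plane0, uint8_t plane1)
{
    uint16_t packed = 0;
    for (int bit = 7; bit >= 0; bit--) {
        unsigned px = ((plane0 >> bit) & 1u) | (((plane1 >> bit) & 1u) << 1);
        packed = (uint16_t)((packed << 2) | px);
    }
    return packed;
}

/* Convert a whole 16-byte planar tile into 8 packed scanlines. */
static void pack_tile(const uint8_t chr[16], uint16_t out[8])
{
    for (int row = 0; row < 8; row++)
        out[row] = pack_row(chr[row], chr[row + 8]);
}

/* What the renderer gains: one shift and one mask per pixel (x = 0 is the
   leftmost pixel). */
static unsigned packed_pixel(uint16_t row, int x)
{
    return (row >> (14 - 2 * x)) & 3u;
}
```

A write to CHR-RAM would call pack_row() for the affected scanline, so the
cost is paid per pattern table write rather than per rendered pixel.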
bits for each horizontally sequential pixel are stored in 8-bit increments).
This chunk of fetched pixel data can then be masked (so that other pixel
data from the chunk is not used), an indexed "VGA palette select" value can
be added to the value, and finally can then be written out to the virtual
frame buffer in one single store operation. The "VGA palette select" value
is fetched via the VGA palette select table, which corresponds to the 8
classic PPU palettes (4*2 elements in the table; therefore, a tile's
attribute data (either PF or OBJ) is used as the index into this table).
This table indicates which 4-color group of 64 groups in the VGA palette to
use for color selection for the group of pixels being written. The idea is
that when a mid-frame palette change occurs (or at any time, for that
matter), the affected PPU palette in this table is changed to point to where
the new palette modifications will be made in the VGA's palette. The
corresponding VGA palette entries will also have to be updated appropriately
(generally, VGA palette updates will be made in a ring-buffer fashion. A
pointer which keeps track of the first available 4 palette entries will be
incremented when any entries in a 4-color PPU palette are changed).
Basically, this method offers the fastest possible way to render NES
graphics, since data is fetched from pattern table memory and written
directly to the virtual frame buffer. The number of pixels processed
simultaneously can be as high as 8 (with MMX instructions). However, the #
of mid-screen PPU palette modifications possible is limited to 64 (or 32
for PF and 32 for OBJs, if one of the bits in every pixel needs to be used
to distinguish a playfield pixel from an object), although multiple
consecutive modifications to a single 4-color PPU palette only count as one
actual modification.
The pattern tables have already been reorganized so that the bitmap data for
4 scanlines of tile data can be loaded into an MMX register, and used in the
most efficient way possible. Pixel data for 4 sequential scanlines under the
same horizontal coordinate is stored in a single byte, with the 2 MSBs
containing the lowest logical scanline coordinate. Sequential bytes, up to
the 8th one, contain the pixel data for every successive horizontal
position. Finally, the first 8 bytes of a tile's pattern table data contain
the full bitmap data for the first 4 scanlines of the tile, and the next 8
bytes contain the last 4 scanlines.
####################################
;register assignments
;--------------------
;EAX: destination pixel pointer
;EBX: points to the palette to be used for this tile (essentially determined
;     by the attribute table lookup)
;ESI:

;load the packed pattern data that ESI points to
        movq    mm3,[esi]
        mov     ecx,-4;         negative loop count (4 scanlines)

;the constants below have to be stored in memory since MMX instructions don't
;allow the use of immediate data as an operand.
@1:     movq    mm0,_C0x8;      contains C0C0C0C0C0C0C0C0h
        movq    mm1,_00x8;      contains 0000000000000000h
        movq    mm2,_40x8;      contains 4040404040404040h

;signed byte compares turn the 2 MSBs of each packed byte into 3 masks
        pcmpgtb mm0,mm3
        pcmpgtb mm1,mm3
        pcmpgtb mm2,mm3
        psllq   mm3,2;          move the next scanline's pixels into the MSBs

;to perform color lookup, a precalculated palette table is used & ANDed with
;the resulting masks of the last operation. Since XOR operations are used to
;combine the results, this requires the elements in the palette table to be
;XORed with adjacent values, so that they'll be cancelled out at the end of
;the logic processing here. The required precalculated XOR combination of
;each color element is shown in the comments below by the corresponding
;element. Note that each lookup is 8 bytes wide; this requires the same
;palette data for a single element to be mirrored across all 8 sequential
;bytes.
        pand    mm0,[ebx+00];   2^3
        pand    mm1,[ebx+08];   3^0
        pand    mm2,[ebx+16];   0^1
        pxor    mm0,[ebx+24];   1
        pxor    mm1,mm2
        pxor    mm0,mm1

;shift the 8 pixels by the counts in MM4/MM5 & merge them into the buffer
        movq    mm1,mm0
        psllq   mm0,mm4
        psrlq   mm1,mm5
        por     mm0,[eax]
        movq    [eax+8],mm1
        movq    [eax  ],mm0

;loop maintenance
        add     eax,LineLen;    advance to the next scanline position
        inc     ecx
        jnz     @1
###################################
To use the renderer, point EAX to the beginning of your render buffer (due
to how the fine horizontal scrolling works, tiles must be rendered next to
each other, incrementing along the horizontal tile axis). Without some ugly
extra logic, the render buffer will have to be increased in size by 8 pixels
per scanline, to accommodate the extra tile pattern fetch required
whenever the fine horizontal scroll value is not equal to zero. Once the
routine has been executed enough times to fill your render buffer, consider
the starting horizontal coordinates of the rendered playfield to be offset
by 8 pixels, due to a required "spilloff area" for when the first tile
pattern for that line needs to be shifted off the screen.
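
The mask-and-XOR color lookup used by the routine can be hard to follow in
MMX form, so here is a scalar C model of what happens in a single byte
lane. It is only a sketch for checking the logic: the function names are
invented, the table holds one byte per element instead of eight mirrored
bytes, and the signed compare is emulated by sign-extending the byte by
hand. The XOR combinations (2^3, 3^0, 0^1, 1) are the ones listed in the
routine's comments.

```c
#include <stdint.h>

/* Build the 4 precalculated table elements from a 4-color palette. */
static void build_xor_table(const uint8_t color[4], uint8_t table[4])
{
    table[0] = color[2] ^ color[3];   /* ANDed with the "C0 > b" mask */
    table[1] = color[3] ^ color[0];   /* ANDed with the "00 > b" mask */
    table[2] = color[0] ^ color[1];   /* ANDed with the "40 > b" mask */
    table[3] = color[1];              /* always XORed in */
}

/* Model of one byte lane: the pixel's 2-bit value sits in the 2 MSBs of b;
   the lower 6 bits belong to other scanlines and must not matter. */
static uint8_t lookup_lane(uint8_t b, const uint8_t table[4])
{
    int s = (int)(b ^ 0x80u) - 0x80;            /* sign-extend, as pcmpgtb sees it */
    uint8_t m0 = (-64 > s) ? 0xFF : 0x00;       /* C0h is -64 as a signed byte */
    uint8_t m1 = (0   > s) ? 0xFF : 0x00;
    uint8_t m2 = (64  > s) ? 0xFF : 0x00;
    return (uint8_t)((m0 & table[0]) ^ (m1 & table[1]) ^
                     (m2 & table[2]) ^ table[3]);
}
```

For every possible byte value, lookup_lane() returns color[b >> 6], so the
three compares plus the XOR chain behave like a 4-entry table lookup
without any per-pixel indexing.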
Branch prediction
-----------------
Pentium MMX and later processors have improved branch prediction hardware
over the original Pentium, and consequently can correctly detect a branch
condition pattern, so long as the condition does not stay the same for more
than 4 times in a row. The new system is based on keeping track of the last
4 known conditions for any branch that may be allocated in the BTB. Those 4
bits are used to index a 16-element table to fetch 2 bits that indicate the
predicted branch condition (strongly taken, taken, not taken, strongly not
taken), which is then written back after using saturated addition to
increment or decrement the value, based on the actual branch condition that
came from the program.
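
To get a feel for this scheme, the following C sketch models a single
branch entry exactly as described above: 4 history bits select one of 16
two-bit saturating counters. It is a toy model for experimentation, not a
description of any particular processor's internals; the names, the counter
encoding (0..3, with 2 and 3 meaning "predict taken") and the 64-branch
warm-up are assumptions made for the example.

```c
#include <stdint.h>

typedef struct {
    uint8_t history;       /* last 4 outcomes, 1 = taken, newest in bit 0 */
    uint8_t counter[16];   /* 2-bit saturating counters, values 0..3 */
} BranchEntry;

static int predict(const BranchEntry *e)
{
    return e->counter[e->history & 15] >= 2;
}

static void update(BranchEntry *e, int taken)
{
    uint8_t *c = &e->counter[e->history & 15];
    if (taken) { if (*c < 3) (*c)++; }    /* saturating increment */
    else       { if (*c > 0) (*c)--; }    /* saturating decrement */
    e->history = (uint8_t)(((e->history << 1) | (taken ? 1 : 0)) & 15);
}

/* Feed the entry a loop-closing branch that is taken (loop_len - 1) times
   and then falls through, over and over. Mispredictions are counted only
   after the first 64 branches, to let the counters settle. */
static int count_mispredicts(int loop_len, int branches)
{
    BranchEntry e = {0};
    int misses = 0;
    for (int i = 0; i < branches; i++) {
        int taken = (i % loop_len) != (loop_len - 1);
        if (i >= 64 && predict(&e) != taken) misses++;
        update(&e, taken);
    }
    return misses;
}
```

In this model a loop that runs 4 times per pass is predicted perfectly once
warmed up, while a loop that runs 8 times per pass mispredicts its exit on
every pass, because the history "taken 4 times" is followed by both
outcomes.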
- Don't modify the above algorithm to draw a full 8-scanline tile. Instead,
use another loop counter to have the renderer code reused when more
4-scanline tile blocks have to be drawn.
+---------------------------------+
|Merging playfield & object pixels|
+---------------------------------+
The most efficient way to effectively combine playfield & object data into
your final rendered frame, is to always first, render your playfield (or a
section of it, in the case of dealing with a split screen) directly to the
image buffer itself. At this point, to effectively merge object pixels with
the playfield's, each pixel in your image buffer must have an extra 2 bits
associated with it, one of which will represent the transparency status for
a playfield pixel, and the other the same, except for object pixels (when
drawn later).
Naturally, after rendering the playfield, the image buffer won't have any
pixels with the transparency status for object pixels marked as false. But
now, as objects are rendered, whether a given pixel is actually drawn
depends on these two transparency status bits, the object's own
transparency status, and its priority. Starting from object 0 (highest
priority) and working up to 63, object bitmaps are "merged" with the
playfield, in the fashion that the following few lines of pseudo-code will
show:
IF(SrcOBJpixel.xpCond=FALSE)THEN
  IF((DestPixel.OBJxpCond=TRUE)AND((DestPixel.PFxpCond=TRUE)OR(SrcOBJpixel.Pri=foreground)))THEN
    DestPixel.data := SrcOBJpixel.data
  FI
  DestPixel.OBJxpCond := FALSE
FI
So, as you can see, the destination's OBJxpCond is marked as false, even if
the object's pixel is not meant to be drawn. This is to prevent the pixels
of lower priority (numerically higher-numbered) objects from being drawn in
those locations.
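
The same merge rule, written out as a small C function, may be easier to
drop into a test harness. The structures below are hypothetical (a real
renderer would pack the two status bits into the pixel byte), but the
condition is a direct transcription of the pseudo-code.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t data;
    bool pf_transparent;    /* PFxpCond: the playfield pixel is transparent */
    bool obj_transparent;   /* OBJxpCond: no object pixel has landed here yet */
} DestPixel;

typedef struct {
    uint8_t data;
    bool transparent;       /* xpCond */
    bool foreground;        /* Pri = foreground */
} ObjPixel;

/* Merge one object pixel; call for objects in the order 0, 1, ... 63. */
static void merge_obj_pixel(DestPixel *dst, ObjPixel src)
{
    if (!src.transparent) {
        if (dst->obj_transparent && (dst->pf_transparent || src.foreground))
            dst->data = src.data;
        /* Claimed even when nothing was drawn, so that lower priority
           objects cannot draw here either. */
        dst->obj_transparent = false;
    }
}
```

For example, if object 0 has background priority and sits over an opaque
playfield pixel, a foreground object 1 at the same spot stays hidden, since
object 0 has already claimed the pixel.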
This may raise the question, "Why do you render objects in the order of
0->63 (effectively requiring 2 bits for transparency status), when you can
render them in the opposite direction (which only requires 1 bit for
transparency status)?" The answer is because of what happens on a priority
clash (see the "PPU pixel priority quirk" section of the "2C02 technical
reference" document). Rendering objects in order of 0->63 is the only way to
emulate this PPU feature properly (and some games DO depend on the
functionality of this, as it provides a way to force the playfield to hide
foreground priority object pixels). Otherwise (for 63->0), it would be
necessary to merge objects to an image buffer filled with the current
transparency color, and then, merge playfield data with the buffer as well.
Granted, this technique will only require 1 transparency (background
priority) status bit per pixel, but since merge operations are slow, and
this technique requires way more of them, this technique is inferior to the
aforementioned one.
Other tips
----------
- Depending on your implementation of pixel rendering, you may be able to
store the 2 transparency status bits inside the pixel data itself. For
example, if only 52 combinations of a rendered pixel are being generated,
the upper 2 bits in the pixel's byte can be used for this storage. This may
mean that you'll have to mirror your video buffer's palette register RGB
information 4 times, but is otherwise a good idea. For 8-bit color VGA
modes, a legacy mask register (3C6h) allows the programmer to mask out any
bits of the written pixel data that are unrelated to color generation.
- Don't use branching to avoid drawing a pixel out somewhere. First of all,
it only allows you to process 1 pixel at a time, which is slow. Second, CPUs
have a hard time predicting branches based on random data (or at minimum,
data that produces a branch pattern which is too long to be stored in the
CPU's branch target buffers). Finally, sequences of SIMD arithmetic and
logical operations can be used to merge multiple bytes of data
simultaneously (especially with MMX instructions).
- Avoid unaligned memory access to any data area used by your rendering
routines. Each unaligned store incurs a minimum penalty of 3 clocks on a
486, and many more clocks on modern processors. Generally, the shift &
merge code required to align data which may be stored on any bit boundary
is not going to take more than 5 clocks on any processor. (The MMX-coded
example previously shown demonstrates how to do the shift & merge
operation.)
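
As an illustration of the "no branches" tip, here is a C sketch that merges
8 object pixels into 8 destination pixels at once using only arithmetic and
logical operations on a 64-bit word. It assumes, purely for the example,
that an object pixel byte of 0 means "transparent"; a real renderer would
derive the mask from its own transparency and priority bits instead. The
function names are made up.

```c
#include <stdint.h>

/* Returns 0xFF in every byte lane of v that is non-zero, 0x00 elsewhere. */
static uint64_t nonzero_lanes(uint64_t v)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    uint64_t t = (v & low7) + low7;   /* bit 7 set if any of bits 0..6 is set */
    t = (t | v) & ~low7;              /* bit 7 set if the lane is non-zero */
    return (t >> 7) * 0xFFu;          /* spread that bit across the lane */
}

/* Branch-free select: take obj where it is opaque, dst everywhere else. */
static uint64_t merge8(uint64_t dst, uint64_t obj)
{
    uint64_t mask = nonzero_lanes(obj);
    return (obj & mask) | (dst & ~mask);
}
```

The same AND/OR select maps directly onto MMX (pand, pandn, por), with the
mask produced by a packed compare.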
+-------------------------+
|Frame store optimizations|
+-------------------------+
One of the simplest approaches to render emulation is to draw the entire
playfield to the video buffer, and then place applicable object data in the
buffer afterwards (this makes object/playfield pixel decisions easier to
- Writing to a non-linear frame buffer is slow. At least on one card I
tested, there was a 333% increase in video buffer write speed after
switching from the legacy buffer at address 000A0000 to the linear frame
buffer (LFB). I understand that basically any PCI video card has LFB
capabilities, but they may be inaccessible due to its BIOS or drivers. I
guess that this is really a responsibility of the OS, but either way: use
the LFB any way you can.
Now you should see that it's just not a good idea to render graphics
directly to the video buffer (although I don't think anyone would do this,
anyway). Old versions of this document discussed using a virtual frame
buffer, which was basically a buffer allocated in regular memory used to
render graphics to (instead of directly to the video buffer). When the
virtual buffer was full, it would then be copied to the video buffer in a
large, sequential operation (just the way the video card likes it!).
However, this method is actually quite inefficient, as the next paragraph
explains.
486's started the on-chip x86 CPU cache trend, with a whole 8K bytes shared
between both data and code. Intel 486DX4 models had 16K bytes. Pentiums had
separate 8K byte caches, one each for data & code. 6th generation x86
processors again doubled the on-chip cache size (although they maintained
the separate code/data cache architecture started by the Pentium). The
point is, the size of the (level-1) cache is basically the amount of memory
that the CPU can randomly access in the smallest amount of time possible.
For even a 486, this means up to 8K bytes of cacheable data structures,
which can actually be quite a bit of memory, if the software is written
carefully.
The trick to effective use of the cache is all in how software is written.
The best thing to do is to write software algorithms which work with an
amount of temporary memory smaller than the size of the CPU's level-1
cache. Even computational algorithms which appear to require a large amount
of memory can sometimes be broken down into sub-algorithms, in order to
reduce the required amount of temporary memory. While taking this approach
does incur a little load/store overhead, it's more important that your data
stay in the cache any way it can. Following these guidelines will go a long
way toward making your software perform efficiently on any CPU with an
internal cache.
Now, here's the first problem: the target of the store operation to the VFB
is unlikely to be in the cache. This means that the CPU ends up actually
*reading* main memory after your first 4-byte pixel store. Of course, now
you can write to this line for free, but main memory access is slow, and
considering what we're doing here (which is exclusively store operations),
it's kind of ridiculous that the programmer has no way of telling the
processor that the merge operation (and moreover the reading of main
memory)
Anyway, you get the idea: after every few stores to the VFB occur, a new
line from the VFB will be read in from main memory (or, the level-2 cache,
if it's in there). But guess what? This isn't even the worst part of it. As
you keep filling the VFB, your CPU's cache overflows, since your CPU's L1
cache is smaller than the VFB you're working on. This means that not only
will your VFB-rendering eventually push any lines out of the cache which
aren't used directly by the render routine (causing lost cycles for even
local routines that may need them immediately after the render), but after
the render when you go to copy the VFB to the video memory, the entire
buffer has to be loaded back into the CPU's cache.
Scanline stores
---------------
By reducing the size of the VFB from full size down to a few scanlines (or
even just one), most or all of the caveats mentioned above can be avoided.
Since typically a VFB scanline is 256 bytes (in the example for the NES's
PPU), this makes the memory requirement small enough to ensure good
performance even on a 486.
Of course, this creates a new problem for writing the PPU render engine:
tiles can no longer be rendered completely (unless you're using an
8-scanline VFB, but the rest of this topic assumes you're using only a
single scanline VFB). Some overhead caused by only rendering a single
scanline of a tile at a time can be avoided by pre-calculating pointer work
for each sequential tile, and storing it in an array, so that calculations
can be reused for the tile's other scanlines. A similar technique can be
done for object pointer calculations as well.
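
One way to arrange this pointer pre-calculation is sketched below in C. It
is a simplified illustration that ignores scrolling and uses the default
planar pattern format; the TileRef structure, the function names, and the
idea of passing one palette number per tile are all assumptions made for
the example.

```c
#include <stdint.h>

enum { TILES_PER_LINE = 32 };

typedef struct {
    const uint8_t *pattern;    /* first byte of the tile's 16-byte bitmap */
    uint8_t        pal_base;   /* palette select, pre-shifted into bits 3..2 */
} TileRef;

/* Run once per row of tiles (every 8th scanline): do the name table and
   palette look-ups a single time and keep the results. */
static void cache_tile_row(TileRef cache[TILES_PER_LINE],
                           const uint8_t *names,      /* 32 tile indices */
                           const uint8_t *palettes,   /* 32 values, 0..3 */
                           const uint8_t *pattern_table)
{
    for (int t = 0; t < TILES_PER_LINE; t++) {
        cache[t].pattern  = pattern_table + names[t] * 16;
        cache[t].pal_base = (uint8_t)(palettes[t] << 2);
    }
}

/* Run for each scanline of that row: only the row offset (0..7) changes. */
static void render_scanline(const TileRef cache[TILES_PER_LINE],
                            int row, uint8_t line[256])
{
    for (int t = 0; t < TILES_PER_LINE; t++) {
        uint8_t lo = cache[t].pattern[row];       /* bit plane 0 */
        uint8_t hi = cache[t].pattern[row + 8];   /* bit plane 1 */
        for (int x = 0; x < 8; x++) {
            int bit = 7 - x;                      /* bit 7 is the leftmost pixel */
            uint8_t px = (uint8_t)(((lo >> bit) & 1) | (((hi >> bit) & 1) << 1));
            line[t * 8 + x] = (uint8_t)(cache[t].pal_base | px);
        }
    }
}
```

The cache for one row of tiles is small (32 entries), so it stays resident
in the level-1 cache together with the single-scanline VFB.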
First, dealing with object pixels which overlap areas of any playfield
scanline will be very difficult (without the use of at least a scanline
buffer), since the playfield tile rendering is usually performed
+-------------------------+
|Smooth audio reproduction|
+-------------------------+
This chapter describes ways to improve NES sound emulation.
overview
--------
Very few NES emulators out there emulate sound channel operations to the
precision that the NES does it at, and the result is that emulation of some
high-frequency rectangle and noise waves that many NES games produce on a
frequent basis will end up sounding like there are artifacts in the audio
(i.e., two or more apparent frequencies present, even though only one
frequency is supposed to be heard). Increasing sample playback frequencies
can fix this problem, but in the end, sampling frequencies on sound cards
found in PC's and such can only go so high.
solutions
---------
Sound blasters have hardware in place to overcome this transparently to
the user, whenever digital capture of an audio signal is desired. The proof
is in sampling NES music at 44100 Hz, 16 bits/sample: there is no
distinguishable difference between how the real-time generated analog audio
from the NES sounds when compared to the digitally captured sample track.
They're either using primitive RC integrator function circuits on the
inputs of their ADCs to approximate a time-accumulated average voltage
between ADC samples, or they are sampling the signal many times faster than
the output PCM sample rate (some 2^n multiple), and using digital averaging
hardware to produce each "downsampled" PCM result. Here's more, courtesy of
an NESdev veteran:
"What I'm suggesting is that you do the above at a high sampling rate, some
power-of-2 multiple of the output rate, for example, 4*44100 = 176400
samples per second. You would add every four samples together, and divide
by four (downsample), and that would be your output sample.
EXAMPLE 1
Oversample Results: 1, 1, 1, 1
Downsampled Output: (1 + 1 + 1 + 1) / 4 = 4 / 4 = 1
EXAMPLE 2
Oversample Results: 1, 1, -1, -1
Downsampled Output: (1 + 1 + -1 + -1) / 4 = 0 / 4 = 0
EXAMPLE 3
Oversample Results: 1, -1, 1, 1
Downsampled Output: (1 + -1 + 1 + 1) / 4 = 2 / 4 = 0.5
So your output samples will not always be a simple 1 or -1. You're really
raising the sampling rate, and then converting the results back to the
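
The averaging step described in the quote can be sketched in C as shown
below. The 4x oversampling factor is the one used in the quoted example;
the function name and the use of 16-bit signed samples are assumptions made
for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define OVERSAMPLE 4   /* oversampled rate = 4 * output rate, as in the quote */

/* Average each group of OVERSAMPLE input samples into one output sample.
   Returns the number of output samples written. */
static size_t downsample(const int16_t *in, size_t in_count, int16_t *out)
{
    size_t out_count = in_count / OVERSAMPLE;
    for (size_t i = 0; i < out_count; i++) {
        int32_t sum = 0;
        for (int k = 0; k < OVERSAMPLE; k++)
            sum += in[i * OVERSAMPLE + k];
        out[i] = (int16_t)(sum / OVERSAMPLE);
    }
    return out_count;
}
```

Because the sum is kept in a 32-bit integer before dividing, a rectangle
wave that toggles part-way through an output sample produces an
intermediate level rather than snapping to one extreme.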
- When the wavelength.whole counter goes negative (count expires), this not
only means that the rectangle wave output has toggled somewhere in the
middle of the PCM sample timeframe, but also that volume output will scale
based on how many cycles the channel output was positive during the PCM
sample timeframe. To calculate this, the leftover value in the wavelength
counter can be used.
- If the leftover wavelength value represents the wave while positive, then
the wavelength.whole value can be negated; otherwise, add 40 and 45/77ths
to it.
- To calculate the final PCM output sample, simply scale the channel's
volume level by the ratio between the adjusted wavelength counter, and 40
and 45/77ths.
- Caveat: output rectangle waveforms may not change state more than once
per produced PCM sample, and this makes accurate emulation of wavelengths
less than 40 and 45/77 clock cycles not directly possible with this
algorithm. However, wavelengths that go below this value may be raised from
here by the absolute difference of the two values, to produce an output
wave pattern similar to the actual one that would be produced. Generally
though, these frequencies cannot be heard by humans, and therefore accurate
implementation is not as important, if necessary at all.
other notes
-----------
- Always represent non-integer-based counters (like ones that have to
increment by numbers like 40 and 45/77ths) with rational
whole-numerator-denominator grouped integers, rather than using floating
point numbers to represent the ratio. While floating point numbers can be
very precise, due to how rational number bit patterns repeat forever,
calculations are never 100% guaranteed accurate, and this makes successive
calculations based on calculated data a bad idea. However, whole-numerator
- Make sure you use cycle count information passed to sound hardware
emulation routines from the CPU core to update sound channel outputs at
the correct times in the emulated frame. That means that sound channel
operation updates should *not* be on a per-frame basis, even though this
technique works for the majority of NES game music code. Many writes to
sound channel registers are effective almost immediately after the write,
and apparently, some NES games actually take advantage of timed sound port
code to produce some really neat sounding effects. Also, for emulators that
support more than the regular amount of 6502 clock cycles per frame, sound
hardware should ignore any clock cycles greater than 29780 and 2/3rds,
relative to when the game's main sound animation routine was last triggered
(assuming that PPU-based NMIs are used for sound animation, but sometimes
the 2A03's frame counter is used for this).
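
The whole-numerator-denominator counters recommended in the first note
above can be as simple as the following C sketch. The structure and
function names are invented for the example; the step of 40 and 45/77ths is
the figure used throughout this chapter (CPU clock cycles per 44100 Hz
output sample).

```c
#include <stdint.h>

typedef struct {
    int32_t whole;   /* integer part */
    int32_t num;     /* numerator of the fraction, always 0 <= num < den */
    int32_t den;     /* denominator, e.g. 77 */
} Rational;

/* Add (whole + num/den) to the counter; num must be less than c->den. */
static void rational_add(Rational *c, int32_t whole, int32_t num)
{
    c->whole += whole;
    c->num   += num;
    if (c->num >= c->den) {   /* carry the fraction into the whole part */
        c->num -= c->den;
        c->whole++;
    }
}
```

Adding 40 and 45/77ths to a zeroed counter 44100 times (one second of
output samples) leaves exactly 1789772 and 56/77ths in the counter, with no
accumulated rounding error no matter how long the emulator runs.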
+------------------------------------------------+
|6502 instruction decoding & execution techniques|
+------------------------------------------------+
- Instruction component-based emulation. This core model breaks all 6502
opcodes down into just two components: addressing mode, and ALU operation.
Since addressing modes and ALU operations are combined to make all 6502
opcodes, it seems to make sense to emulate 6502 opcodes on this basis. As a
result, only essential 6502 core routines will need to be coded, and this
will not only save big on code memory, it will make implementation easier.
Also, this technique is only slightly slower than the opcode-handled
approach, due to the extra jump in the instruction decoding process, but
this is made up for in the host CPU's cache performance, due to more
efficient use of code structures. In general, this technique will yield the
best well-rounded performance for any PC platform.
- Dynamic 6502 opcode recompilation. In this CPU core model, 6502 opcodes
are decoded, but instead of emulating the behaviour of the CPU with
subroutines, platform-specific CPU machine code based on the decoded
instruction is generated and executed to do that instead. Eventually all
6502 opcodes will be translated & cached in the emulator's memory map,
provided adequate processing time is given to the core to trample through
all the 6502 code it may ever execute. The throughput of executing
recompiled 6502 instructions can actually be higher than doing so on a real
6502 itself, provided the programmer does a good job of implementing
optimizations in the recompiled instructions (i.e., including flag
maintenance code in most recompiled instructions is not necessary, since
only branch and add/subtract instructions rely on the flags. Another
optimization may be possible through the use of clock cycle tables for 6502
code segments (code that's defined between branch targets or PC xfer
instructions), in order to eliminate clock cycle maintenance instructions
in some of the recompiled code as well). Caveats of this CPU core model
(besides very complicated implementation of the architecture) include the
requirement for large amounts of RAM (a few or more megabytes), and other
complexities that arise when a 6502 program frequently modifies its own
code (stored in RAM) which has already been translated & cached by the CPU
engine. For multitasking dozens, even hundreds of NES applications on a
single, state of the art computer however, dynamic recompilation is the
only way to go.
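
To make the instruction component-based model from the list above more
concrete, here is a small C sketch. It covers only three addressing modes
and four ALU operations, keeps flags in 6502 order for simplicity, and does
no cycle counting; the structure and function names are invented for the
example. The opcode values are the standard 6502 encodings for LDA, ORA,
AND and EOR.

```c
#include <stdint.h>

typedef struct {
    uint8_t  a, p;           /* accumulator; flags (N = bit 7, Z = bit 1) */
    uint16_t pc;
    uint8_t  mem[0x10000];   /* flat 64K memory, for the sketch only */
} Cpu;

typedef uint16_t (*AddrMode)(Cpu *);       /* returns the operand address */
typedef void     (*AluOp)(Cpu *, uint8_t); /* consumes the operand value */

static uint16_t am_imm(Cpu *c) { return c->pc++; }
static uint16_t am_zp(Cpu *c)  { return c->mem[c->pc++]; }
static uint16_t am_abs(Cpu *c)
{
    uint16_t lo = c->mem[c->pc++];
    uint16_t hi = c->mem[c->pc++];
    return (uint16_t)(lo | (hi << 8));
}

static void set_nz(Cpu *c, uint8_t v)
{
    c->p = (uint8_t)((c->p & ~0x82) | (v & 0x80) | (v ? 0 : 0x02));
}
static void op_lda(Cpu *c, uint8_t v) { c->a = v;  set_nz(c, c->a); }
static void op_ora(Cpu *c, uint8_t v) { c->a |= v; set_nz(c, c->a); }
static void op_and(Cpu *c, uint8_t v) { c->a &= v; set_nz(c, c->a); }
static void op_eor(Cpu *c, uint8_t v) { c->a ^= v; set_nz(c, c->a); }

typedef struct { AddrMode mode; AluOp op; } Decode;

/* 12 opcodes built from only 3 + 4 routines. */
static const Decode decode_table[256] = {
    [0xA9] = { am_imm, op_lda }, [0xA5] = { am_zp, op_lda }, [0xAD] = { am_abs, op_lda },
    [0x09] = { am_imm, op_ora }, [0x05] = { am_zp, op_ora }, [0x0D] = { am_abs, op_ora },
    [0x29] = { am_imm, op_and }, [0x25] = { am_zp, op_and }, [0x2D] = { am_abs, op_and },
    [0x49] = { am_imm, op_eor }, [0x45] = { am_zp, op_eor }, [0x4D] = { am_abs, op_eor },
};

/* Execute one instruction; returns 0 for opcodes this sketch lacks. */
static int cpu_step(Cpu *c)
{
    Decode d = decode_table[c->mem[c->pc++]];
    if (!d.mode) return 0;
    d.op(c, c->mem[d.mode(c)]);   /* the "extra jump" mentioned above */
    return 1;
}
```

Store and read-modify-write instructions would need a slightly different
operation signature (taking the address rather than the value), but the
two-component split stays the same.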
Other tips
----------
- Some NES games rely on the extra dummy store cycle that RMW instructions
perform on a 6502. This is usually done to pulse a bit in the mapper port,
with a single RMW instruction. Other 6502 "features" (even undocumented
opcodes) may also be assumed to be implemented in the host CPU for an NES
game (or sometimes game genie codes/patches), so don't skip over any
details during your implementation of a core. For more info, check out the
"2A03 technical reference" document.
- Implement a clock cycle counter into your 6502 engine, which will be
maintained by every 6502 instruction executed. This counter will mainly be
used by the PPU to figure out how timed writes will affect how the output
image will be rendered. However, if it is also used as a terminal counter,
then when the count expires, program control can be transferred to the
handler that originally requested the count operation (like for generating
the PPU VINT/NMI signal). Also, don't forget that you can manage any number
of "virtual cycle counters", without ever having to make the CPU core
maintain more than one physical one. NES hardware may have several
IRQ-generating counters going simultaneously, but the order in which each
will cause an IRQ is always known to the emulator, which is why the cycle
count register only has to be programmed with the count value for the next
IRQ to occur (after which, the next count to expire can be loaded into the
cycle count register).
Furthermore, have your emulator store the 6502 flags in the format that
they're stored in on the x86 CPU. This way, the flags do not have to be
reformatted after every instruction, thus saving time. The only time the
flags will have to be converted to/from the 6502 order is when the 6502
instructions PHP, PLP, BRK #xx, and RTI, or hardware interrupts, are
executed. Since these happen much less often than the more common
arithmetic and logical instructions, it's more efficient to handle the
flags in this way.
- The 6502 apparently has about 12 opcodes which jam the machine
(processor). These opcodes are ideal for implementing emulator-specific
custom 6502 instruction set extensions for trap/debug purposes.
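
The "virtual cycle counters" tip above can be sketched in C like this. It is
a minimal illustration under assumed names (VTimer, timer_start, cpu_spend,
and the two example handlers are all hypothetical): any number of virtual
counters are kept in a small table, while the CPU core only ever decrements
the single countdown variable.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TIMERS 8

typedef void (*TimerFn)(void);

typedef struct {
    uint64_t when;  /* absolute cycle at which this counter expires */
    TimerFn  fn;    /* handler to call; NULL marks a free slot      */
} VTimer;

static VTimer   timers[MAX_TIMERS];
static uint64_t now;        /* cycles executed so far                  */
static int64_t  countdown;  /* the one "physical" counter in the core  */

/* Example handlers standing in for NMI and mapper IRQ generation. */
static int nmi_count, irq_count;
static void on_nmi(void) { nmi_count++; }
static void on_irq(void) { irq_count++; }

/* Load the physical counter with the distance to the nearest expiry. */
static void reload(void)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < MAX_TIMERS; i++)
        if (timers[i].fn && timers[i].when < best)
            best = timers[i].when;
    if (best == UINT64_MAX)
        countdown = INT64_MAX;                 /* nothing pending */
    else
        countdown = (best <= now) ? 0 : (int64_t)(best - now);
}

/* Create a virtual counter that fires after delay cycles; -1 if full. */
int timer_start(uint64_t delay, TimerFn fn)
{
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (!timers[i].fn) {
            timers[i].when = now + delay;
            timers[i].fn   = fn;
            reload();
            return i;
        }
    }
    return -1;
}

/* Called by every emulated 6502 instruction with its cycle cost. */
void cpu_spend(int cycles)
{
    now       += (uint64_t)cycles;
    countdown -= cycles;
    while (countdown <= 0) {
        for (int i = 0; i < MAX_TIMERS; i++) {
            if (timers[i].fn && timers[i].when <= now) {
                TimerFn fn = timers[i].fn;
                timers[i].fn = NULL;           /* one-shot: free the slot */
                fn();
            }
        }
        reload();
    }
}
```

The per-instruction cost is one subtract and one compare, no matter how many
virtual counters are active; the table is only scanned when one expires.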
+--------------------------+
|Emulation address decoding|
+--------------------------+
Emulation address decoding is taking a formed 6502 address, plus the 6502's
read/write status, and running it through (most of the time) static logic to
determine the access method, and the emulator-equivalent memory address that
this 6502 address corresponds to, in order to emulate access to that memory.
Using 1:1 address decode look-up tables for both the read & write 6502
memory maps is the fastest and most accurate way to determine where an NES
memory area is, and what address it maps to. Generally, a byte should be
used as a single element in the memory maps to represent the type of memory
area (up to 256 types for each table), and you'll have 128KB of them (64KB
for reads plus 64KB for writes), since the 6502's R/W line is also used
during address calculations. Even though this technique _seems_ to waste a
lot of memory, the memory decode tables are most commonly accessed in
parallel with memory areas containing NES ROM and RAM structures, and this
means that the data cached in the emu's host CPU (due to simulated 6502
memory bus transfers) will usually never amount to more than twice what it
would be without the tables. This is a small price to pay to ensure that
adapting your 6502 core engine to any foreign NES/FC architecture/technology
is as easy as adding a few new memory area type handlers to your emulator's
core, and then building a new address decoder table.
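
A minimal C sketch of the decode tables described above, assuming a plain
32KB PRG-ROM cart with no mapper. The area type names, build_nrom_map(),
cpu_read() and cpu_write() are all hypothetical, and the PPU registers are
stubbed with a plain array rather than real port handlers.

```c
#include <stdint.h>

/* Hypothetical memory area types; one byte each, so up to 256 per table. */
enum { AREA_OPEN = 0, AREA_RAM, AREA_PPU_REG, AREA_PRG_ROM };

/* One byte per 6502 address and per R/W state: 2 x 64KB = 128KB. */
static uint8_t decode[2][0x10000];   /* [0] = read map, [1] = write map */

static uint8_t ram[0x800];           /* 2KB internal RAM          */
static uint8_t prg[0x8000];          /* 32KB PRG-ROM              */
static uint8_t ppu_regs[8];          /* stub for $2000-$2007      */

/* Build both tables once, when the cart is loaded. */
void build_nrom_map(void)
{
    for (uint32_t a = 0; a < 0x10000; a++) {
        uint8_t rd = AREA_OPEN, wr = AREA_OPEN;
        if (a < 0x2000)
            rd = wr = AREA_RAM;       /* RAM, mirrored up to $1FFF        */
        else if (a < 0x4000)
            rd = wr = AREA_PPU_REG;   /* PPU ports, mirrored up to $3FFF  */
        else if (a >= 0x8000)
            rd = AREA_PRG_ROM;        /* ROM: the write map stays open    */
        decode[0][a] = rd;
        decode[1][a] = wr;
    }
}

uint8_t cpu_read(uint16_t a)
{
    switch (decode[0][a]) {
    case AREA_RAM:     return ram[a & 0x7FF];
    case AREA_PPU_REG: return ppu_regs[a & 7];
    case AREA_PRG_ROM: return prg[a & 0x7FFF];
    default:           return 0;      /* placeholder for unmapped reads */
    }
}

void cpu_write(uint16_t a, uint8_t v)
{
    switch (decode[1][a]) {
    case AREA_RAM:     ram[a & 0x7FF] = v;   break;
    case AREA_PPU_REG: ppu_regs[a & 7] = v;  break;
    default:           break;                /* ignored */
    }
}
```

Supporting a new cart type then comes down to adding area types and handler
cases, and filling the two tables differently.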
+----------------------+
|Hardware port queueing|
+----------------------+
pros
----
- program control transfers are avoided when common hardware ports are
accessed by the CPU core. This in turn reduces code & data cache misses, and
especially branch mispredicts, in the physical CPU running the emulation
software.
- dynamically adding hardware devices to the CPU core's virtual memory map
will be easier, due to the architectural enhancements that hardware port
queueing requires the CPU core to support.
- less code will be produced in the emulator software's image file, due to
there being fewer hardware port emulation handlers present.
- large overhead penalties that are incurred when hardware emulation routine
loops (like for rendering pixels, creating audio samples, etc...) have to be
broken (due to the CPU core writing out to the hardware handler at that
moment in the simulated frame), can be avoided. This is important for 2
reasons:
1. your NES emulator core engines can now be designed to operate in one big
loop, without having to worry about intervention from other hardware devices
during the same virtual NES emulation time, unless it's absolutely
necessary. This means that say, the PPU engine can render a complete frame
at any instant (as opposed to having to depend on data sent to the PPU
engine in real-time via the CPU core), thanks to hardware port queueing.
2. no matter how your NES-written 6502 code abuses the PPU, APU, MMC, etc.
hardware in the NES, the core engines of all these devices can now be
designed to use a nearly constant number of CPU clock cycles on the physical
processor running your emulator's software, thanks to the simple loop design
of emulator core devices, in combination with branchless code solutions to
if/else constructs and the like.
cons
----
- uses some extra data structures/memory
overview
--------
The hardware port queueing concept is only beneficial for those hardware
devices that do not interact with (i.e., change or affect the operation of)
the CPU core, outside of readable ports like $2002. So, for example, you
wouldn't want to buffer writes to the cart mapper hardware if it's affecting
a PRG-bank (due to the fact that the write is supposed to affect CPU
emulation immediately), but the opposite is true for CHR-bank changes. So,
this is essentially the criterion that you must base your decisions on when
deciding which hardware ports should be queued.
Hardware devices that generate interrupts on the CPU are a little easier to
deal with, since interrupt sources almost always come from some sort of
on-going counter in the NES (the MMC3's scanline counter is a slight
exception, since it relies on the clocking of A13 on the virtual PPU).
Execution of the events that are to occur on the terminal count clock cycle
can be queued to the CPU by creating an instance of a virtual cycle counter
by the hardware emulation routine that needs it.
implementation
--------------
The "port queueing" idea really revolves around assigning back & forward
pointers to _all_ hardware-related (PPU, in this example) memory addresses
that can be modified by the CPU. These pointers then link into a 1+2 way
list that represents the queued data for that memory address. This means a
list for:
- each standard PPU register ($2000-$2007, though you might not need to do
all of them (keep reading...))
- each palette memory element
- each OAM element
- each name table element*
- each pattern table element
- any bankswitching regs
- each element in CHR-RAM, if it exists*
- etc...
When the CPU core decodes a write to a port like $4014, it examines that
port's status as a queued port, and also decodes the pointer to the last
allocated link in the list of queued writes for that port. If queueing is
enabled for this port, the CPU core will use the pointer info, along with
memory allocation info and the current cycle count, to insert a new link
into that list, containing the CPU write data.
A relative clock cycle tag value allows hardware emulation routines reading
the value later on to determine when the next related write to this port
occurs.
Fwd/back pointers are used in each element in the list for 2-way travel.
This is required, since it is often necessary for the hardware to know the
last-known value of any memory it may have access to.
A third, one-way pointer in each element in the list will be used to link
all nodes created from the same core engine in your emulator together. This
makes deallocation of all those links very easy, with list length being a
direct function of the number of hardware writes that occurred that frame
(so, generally not that much). Note that links with the "last allocated
link" field = 0 are *not* to be deallocated, since these represent links
that must be present for the next frame's calculations.
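
The three-pointer link described above might look like this in C. This is a
sketch under stated assumptions, not the definitive layout: the names (Link,
Port, port_queue_write, and so on) are hypothetical, and the link that must
survive into the next frame is simplified into a "base" link embedded in the
port itself, so that it is never on the per-frame chain.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Link {
    struct Link *prev, *next;  /* 2-way list for one port/memory element   */
    struct Link *frame_next;   /* 1-way chain of all links made this frame */
    uint32_t     cycle;        /* clock cycle tag of the queued write      */
    uint8_t      value;        /* the CPU write data                       */
} Link;

typedef struct {
    Link *last;  /* last allocated link for this port           */
    Link  base;  /* value carried in from the previous frame    */
} Port;

static Link *frame_chain;      /* head of the 1-way deallocation chain */

void port_init(Port *p, uint8_t value)
{
    p->base = (Link){0};
    p->base.value = value;
    p->last = &p->base;
}

/* Called by the CPU core instead of a port handler; -1 if out of memory. */
int port_queue_write(Port *p, uint32_t cycle, uint8_t value)
{
    Link *n = malloc(sizeof *n);
    if (!n)
        return -1;
    n->prev  = p->last;
    n->next  = NULL;
    n->cycle = cycle;
    n->value = value;
    p->last->next = n;
    p->last = n;
    n->frame_next = frame_chain;   /* remember it for end-of-frame cleanup */
    frame_chain = n;
    return 0;
}

/* What the hardware engine sees on this port at a given cycle. */
uint8_t port_value_at(const Port *p, uint32_t cycle)
{
    const Link *l = &p->base;
    while (l->next && l->next->cycle <= cycle)
        l = l->next;
    return l->value;
}

/* Call for every port first: the final value becomes the next base. */
void port_end_frame(Port *p)
{
    p->base.value = p->last->value;
    p->base.next  = NULL;
    p->last = &p->base;
}

/* Then free everything queued during the frame in a single pass. */
void frame_free_all(void)
{
    while (frame_chain) {
        Link *n = frame_chain->frame_next;
        free(frame_chain);
        frame_chain = n;
    }
}
```

The PPU engine can then render a whole frame in one loop, calling
port_value_at() (or walking next pointers itself) as its own cycle position
advances.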
Writing to ports like $2004 and $2007, which are designed to have data
streamed into them, requires some additional logic on the CPU core's part
to calculate the link list address (since there's an additional lookup,
and an address increment required). This would normally be done with a
hardware port handler, but this approach would be frowned upon, since the
For handling CPU reads from hardware ports, it's a simple matter of
determining whether or not the port handler has to be called. For example,
when $2002 is read, its status often doesn't change until a certain
(independent) clock cycle in the frame has been reached. In this case, the
port would be read for the first time, and the handler would be invoked.
The handler would then calculate the next clock cycle at which $2002's
status is expected to change, and create a virtual cycle counter instance,
programmed to execute another $2002-related handler when the cycle count
expires. Meanwhile, the handler changes the CPU memory map layout so that
subsequent reads from this port simply cause the CPU core to read from a
regular memory address, where the last known port value is stored, thus
avoiding unnecessary calls to $2002's read handler until the virtual
counter goes off.
For handling CPU reads from ports like $2004 and $2007, the CPU core simply
has to return the last-known value of the element being accessed from the
array queues.
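
The $2002 read scheme above can be sketched in C as follows. Everything here
is an assumption for illustration: a function pointer stands in for the
memory map entry, a single rearm_at value stands in for the virtual cycle
counter, and flag_on_at is an arbitrary cycle number. The side effects of a
real $2002 read (clearing bit 7, resetting the address latch) are left out.

```c
#include <stdint.h>

typedef uint8_t (*ReadFn)(void);

static uint64_t now;                  /* current CPU cycle in the frame     */
static uint64_t flag_on_at = 20000;   /* arbitrary cycle where bit 7 sets   */
static uint64_t rearm_at;             /* stands in for a virtual counter    */
static uint8_t  status_cache;         /* last known $2002 value             */
static int      handler_calls;        /* for demonstration only             */

static uint8_t read_2002_handler(void);

/* The "regular memory" path: no hardware emulation code is run at all. */
static uint8_t read_2002_cached(void)
{
    return status_cache;
}

/* Stands in for the $2002 entry in the CPU read memory map. */
static ReadFn map_2002 = read_2002_handler;

/* Slow path: work out the value and when it can next change. */
static uint8_t read_2002_handler(void)
{
    handler_calls++;
    status_cache = (now >= flag_on_at) ? 0x80 : 0x00;
    if (now < flag_on_at) {
        rearm_at = flag_on_at;         /* value is stable until then   */
        map_2002 = read_2002_cached;   /* remap the port to plain data */
    }
    return status_cache;
}

/* Called by the CPU core as cycles elapse. */
void advance(uint64_t cycles)
{
    now += cycles;
    if (map_2002 == read_2002_cached && now >= rearm_at)
        map_2002 = read_2002_handler;  /* counter expired: handler again */
}

uint8_t cpu_read_2002(void)
{
    return map_2002();
}
```

However many times a game polls the port before the flag changes, the
handler itself only runs once.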
+--------------------------+
|Threading NES applications|
+--------------------------+
Lately, x86-based PCs have become so blazingly fast that emulating just
one virtual NES on a modern PC would seem to be a waste of processing
power. With that said, modern PCs have enough processing power to emulate
dozens of virtual NES machines, but there is one big problem with
multitasking NES applications: they were never designed to be threaded.
Instead, an entire frame's worth of NES CPU clocks have to be wasted for
each NES application, in order to consider the application's frame
calculations complete, whether or not this may be true (and if not, a
slowdown will occur). The following hints and tips suggest ways to reduce
wasted time in virtual 6502 emulation normally lost due to spin-wait, poll,
or cycle count loops.
- For ports frequently used in polling loops (like $2002), the port handlers
can do a basic poll loop comparison against the code at the current location
of the PC, to determine if the port is being polled, and the condition under
which the loop will be exited. Since flags like vblank, >8 sprites, and
primary object collision all happen at a static moment in an emulated
frame, it's easy to make the PPU handler advance the CPU's cycle counter
directly to the clock cycle at which these flags will meet the loop exit
condition, thus saving virtual 6502 CPU time.
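
A minimal C sketch of the poll loop comparison described above. It only
recognizes one common loop shape (LDA $2002 followed by a BPL back to the
LDA, i.e. waiting for bit 7), and the function name and parameters are
assumptions. Only whole loop iterations are skipped, so the loop still exits
by itself on a normally emulated iteration.

```c
#include <stdint.h>

/*
 * mem     : the 64KB 6502 address space as seen by the CPU
 * pc      : address of the instruction about to execute
 * now     : current cycle count
 * flag_at : cycle at which bit 7 of $2002 will turn on
 * Returns the number of cycles that may be skipped (0 = don't skip).
 */
uint64_t poll_skip(const uint8_t *mem, uint16_t pc, uint64_t now,
                   uint64_t flag_at)
{
    /* AD 02 20 = LDA $2002, 10 FB = BPL back to the LDA */
    static const uint8_t pattern[5] = { 0xAD, 0x02, 0x20, 0x10, 0xFB };

    for (int i = 0; i < 5; i++)
        if (mem[(uint16_t)(pc + i)] != pattern[i])
            return 0;                  /* not the loop we know about */
    if (now >= flag_at)
        return 0;                      /* flag already set: run normally */

    /* LDA abs = 4 cycles, taken BPL = 3, +1 if the branch crosses a page */
    unsigned per_loop = 7 + ((((uint16_t)(pc + 5) ^ pc) & 0xFF00) ? 1 : 0);

    return ((flag_at - now) / per_loop) * per_loop;
}
```

The $2002 read handler would add the returned value to the CPU's cycle
counter before returning to the core.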
+----------------------------+
|Emulator features to support|
+----------------------------+
This section merely contains some innovative and interesting suggestions for
features to support in new NES emulators being developed.
This not only allows gamers to play NES games on your emulator with an
original controller/lightgun/etc. (rather than having to use the keyboard),
but also allows unused buttons on a SNES controller to have customizable
functionality during gameplay (game/state change, suspend, fast forward,
save/load machine state, and reset functions would be most handy).
- Fully adjustable virtual APU framerate emulation. For games that use this
interrupt source, changing the frequency of this signal will change the
playback speed of the game's audio.
- Slowdowns in NES games should be eliminated by either providing the user
a way to adjust the number of CPU clocks to execute per PPU frame, or by
threading the game's NMI handler. Besides, if the player wants to slow down
the game action, they should be able to do it by activating a slow-motion
button, as opposed to being forced to slow down simply whenever the game's
frame calculations get a little too heavy for a standard 29780 2/3 cc-based
frame.
- Provide a way to let the user custom mix audio generated by any NES sound
hardware used by the game, into 6 audio tracks for playback through a 5.1
sound system.
- Allow the user to specify a custom size and additional scroll offset to
apply to the displayed PPU playfield (rather than just defaulting it to
256*240, 0:0+ScrollCtrs) in your emulator. This not only allows gamers to
crop the edges of an NES game's playfield that has messy graphics around
there, but it also allows the gamer to extend the size of the playfield to
include displaying the contents of 1 or 3 other nametables simultaneously,
which is very useful for games like Pinball, Wrecking Crew, Super Mario
Bros., Duck Tales, Metroid, Jackal, and Gauntlet, to name a few. An option
should also be provided to prevent PPU scroll counters (X or Y) from being
used in the final playfield scroll offset calculation, but rather have them
applied to the offset of the object frame (this causes the objects to move
around the screen, rather than having the playfield do that while objects
stay relatively in the middle of the playfield).
- Provide a graphics filter for virtual OAM set swapping. This technique is
used when the game needs to display more objects than the PPU hardware
supports per frame. Games alternate between two (or more) OAM sets between
frames, and this does let the gamer see the extra objects, but not without
having to settle for a large amount of flickering sprites. A primitive
technique for filtering OAM set swaps is to extend the number of sprites
displayed on any frame to include one or more from previous frames.
Normally, only the last frame's OAM set needs to be saved to eliminate
serious flicker from sprites in games like Mega Man 2, but sometimes two or
more old OAM sets are necessary. In this case, it's better to implement a
sophisticated OAM set pattern search engine that eliminates the high
overhead of re-rendering a same typed & placed sprite appearing in 2 or
more OAM sets.
- Provide rewind play motion and record NES movie support. These two work
together, along with save states, to produce NES movies of only your finest
play performances in a favorite game.
- Support an on-line text & art gallery. Users should be able to look through
a collection of bitmap-formatted images relating to NES stuff (this may be
screenshots, scanned pages of instruction booklets, label art, etc.). Just
think of how the "Super Mario All Stars" game selection menu looks, and now
pretend that there are many more selections, and they span off in two
dimensions. Now you're talking about an interesting new feature to implement
in an NES emulator.
user sees fit, possibly increasing or decreasing the size of the monitor
matrix. And of course, personal emulation settings can be stored for each
game state, so that for example, only selected NES game states will be
animated during the time the NES arcade emulator runs (suspended game states
can simply be displayed that way on monitors in the virtual NES arcade).
+-------------------------------------------------+
|New object-oriented NES file format specification|
+-------------------------------------------------+
This section details a new, extremely easy to use standard for digital data
storage of NES ROM images and related information, which provides as much
object-orientation for the individual files as possible.
So basically, the idea here is to use existing file formats to store all
information related to a single game, within a private directory on your
filesystem amongst others, making up your electronic NES game library. All
like file types may have similar extensions, while having different
filenames, usually relating to the specific description of what the file
represents (i.e., files relating to save state info may have a title that
describes the location or game status of the state, or patch files may
describe the operation of the patch during emulation, etc). As other
relevant file formats (like *.jpeg, *.gif, *.bmp, etc.) have been long
established computer standards, only file formats relating to NES operation
are defined here.
*.WRAM
*.VRAM
*.XRAM  extra RAM (other than CHR RAM) used on the game cart
*.SRAM
This list isn't complete (as 2A03, 2C02, and MMC memory structures will
always be emulator-specific), but it should give you an idea of how to
separate files relating to raw dumps of large internal memory structures
used inside the NES, in order to improve the portability of the ROM files,
large RAM structures, save state dumps, patches, hacks, and such.
*.PRG and *.CHR: the digital contents of program & character ROMs found on
the NES game board. It would be nice to see these files maintain at all
times a 2^n count of bytes, except when other PRG/CHR ROMs have to be
appended to their respective files, due to the possibility that an NES game
may use two or more differently-sized ROMs to make up a larger one (before
1987, this was mostly done to increase a game's ROM size with more chips,
since it seems that ROMs larger than 32KB were just either very expensive,
or not available back then). The filename always relates to the name of the
game, including if it's been hacked, country it's from, or whatever. *.CHR
files that are produced for save state purposes when NES game carts use
CHR-RAM, use a related save state's description as a filename.
*.INES: a 16 byte file containing the iNES header equivalent of what a *.MMC
text file would normally represent. This file only exists because digital
storage of NES game ROMs is currently dominated by the dated iNES format.
Support is not recommended in new emulators (if you're not part of the
solution, you're part of the problem, right?).
*.WRAM, *.VRAM, *.XRAM: these all define files which contain mirror images
of the RAM chips they represent in the NES being emulated. The filename for
all of them relates to the save state description.
extension. Filename here always relates to the effects the patch has during
emulation. Filesize is used to determine the length of the patch (minus 4 to
exclude the offset value).
*.PRGHACK, *.CHRHACK: these files contain plain text lists that define the
patch files to apply to game emulation, when this specific HACK file is
chosen to be applied for the emulated game. The filename relates to the
group of patches you've chosen for this file (normally, this doesn't matter,
but it's useful for storing multiple hack profiles (ones that make the game
easier, harder, weird, behave like an NSF file, change graphics, etc)). Use
ASCII carriage return and/or line feed codes (13 and 10) to separate listed
patch types in the file.
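
Reading such a list could be sketched in C as below. The function name, the
ENTRY_MAX limit, and the idea of copying entries into a fixed array are all
assumptions; the only thing taken from the description above is that entries
are separated by any mix of CR (13) and LF (10) codes.

```c
#include <stddef.h>
#include <string.h>

#define ENTRY_MAX 64   /* hypothetical limit on one entry's length */

/* Copies each non-empty line of text into out[]; returns the entry count. */
size_t hack_list_parse(const char *text, char out[][ENTRY_MAX], size_t max)
{
    size_t n = 0;
    while (n < max) {
        text += strspn(text, "\r\n");         /* skip CR/LF separators  */
        size_t len = strcspn(text, "\r\n");   /* length of this entry   */
        if (len == 0)
            break;                            /* end of the list        */
        size_t copy = len < ENTRY_MAX - 1 ? len : ENTRY_MAX - 1;
        memcpy(out[n], text, copy);           /* overlong entries are cut */
        out[n][copy] = '\0';
        n++;
        text += len;
    }
    return n;
}
```

Accepting CR, LF, or both in any order means the same HACK file works no
matter which platform's text editor wrote it.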
notes
-----
- when more than one file of type *.PRG, *.CHR (when not RAM-based), or
*.SRAM is stored in a single game's directory, the emulator is responsible
for making sure the gamer may select the active RAM/ROM(s) to use during
emulation, since game emulation can only be based on one source of these.
- the emulator must have the ability to detect & present all the different
HACK files available in a game's directory, since the effects of only a
single HACK file may be applied to a selected game ROM in there.
author.
- Any game ROMs that do not have a matching MMC-typed filename in the same
directory should cause the emulator to refuse to emulate the game ROMs.
- This format _does_ complicate the transportation of NES ROM files a bit
for general emulator users/gamers, but in the end, there's only the PRG,
MMC, and optional CHR and SRAM-typed files required for transport (so, 2..4
files max). This is hardly difficult for even a basic user to comprehend.
EOF