You are on page 1of 2

Exchange ESE, Failed Backups, the Checkpoint Depth,

and You
A question came up recently on a mailing list I hang around. A poster wrote: "I'm looking for
a concise explanation of when/how the transaction logs are committed. Specifically, is there
another way to force logs to be committed other than the backup or stopping the IS? (E2K3)"
As with many things in Exchange, when you get down to the bits and the bytes, things are not
always as they seem. My response, which some of you might be interested in, went more-or-
less less like the below (I've expanded it a little, and added a few paragraphs on checkpoint
exhaustion). Also note that this only applies to streaming backups. Shadow-copy backups are
different.
A concise explanation? Hrmmm.
Well, first, doing a backup does NOT force logs to be committed. It means that the backup
process waits until a checkpoint occurs, flushes the current transaction log, and during the
backup additional checkpoints are not allowed to occur (that is, nothing is allowed to be
flushed to the ESE database until after the backup is complete – leading to an increasing
checkpoint depth, and in extreme situations, checkpoint exhaustion – discussed a little later).
The ESE buffers are not flushed. The "checkpoint depth" is how many checkpointed logfiles
must be processed before the ESE database is current with the transaction log files.
So...in the normal case, data is written in a serialized fashion to the in-memory ESE cache
and to the log buffers, as updates occur in an Exchange database. Logs are written to disk as
logs fill up, or as transaction commits occur. (This may mean that a log can have nothing but
a checkpoint record in it – but that is not the normal case.) In the ESE buffers, I/O is
accumulated and prioritized by a process known as the “lazy writer”. The lazy writer scans
the ESE buffers for dirty (modified) pages and builds an optimized list to flush those to disk.
As that list is flushed to disk, the pages are marked as “clean”, and the checkpoint is marked
as having advanced (on a transaction by transaction basis, not a log by log basis). Whenever
the transaction in a log are fully advanced, then the checkpoint file and the database header
are updated.
During a backup, the lazy writer is paused.
The ONLY time you are assured that logs are fully committed is during clean shutdown, or
after running soft recovery.
Now, consider the situation when a backup is in process and it gets hung (i.e., waiting on a
tape, or there was a fatal write error on the tape and the backup application crashed). What
happens? There are three possible answers.
1) The backup application caught the exception (i.e., the error that caused the crash) and
cleaned up after itself. In this case, it appears as if the bad backup never happened and
everything just goes back to normal. This is the best possible case.
2) The backup application didn't catch the exception, but "notices" when it next runs that
there was a backup in process and isn't one now, but the backup state is still "dirty". Then, the
backup application cleans up the old "dirty" backup state, and proceeds as normal. This is the
middle of the road case.
3) The backup application didn't catch the exception, and when it goes to start the next
backup, it gets an error from Exchange - "backup already in process". Then, because of that
error, the backup aborts. This is the worst situation. To clear the "backup in process" flag
requires a restart of the Microsoft Exchange Information Store service.
In any of these cases, the checkpoint depth is continuing to advance - and the log files are
accumulating, but nothing is being committed to the database. This is a problem! Eventually,
the Information Store will say "enough!" and dismount the storage group. That's right - it will
dismount the entire storage group.
This is called "Checkpoint depth exhaustion". After 1,008 logfiles have been generated,
Exchange automatically shuts down the storage group. There is an implementation limit of
1,024 logfiles in a checkpoint, and this shutdown prevents the Exchange database from being
corrupted.
After the storage group is dismounted, MSExchangeIS can simply be restarted and the
storage group remounted (this will cause all of those transaction logfiles to be replayed - it
may take quite awhile!).
This can also happen in very very high update situations (such as running many move
mailbox operations all at once time).
Your best bet is to monitor for this. If you are using OpsMgr 2007, this is already one of the
issues checked for. If you want to add this check manually, add the performance counter
"Database ==> Instances\Log Generation Checkpoint Depth".
For more information on this issue, see KB 905801 and KB 819771.
Until next time...
As always, if there are items you would like me to talk about, please drop me a line and let
me know!
Published Monday, April 28, 2008 11:33 AM by michael
value:Filed%20un

Filed under: Administration, Exchange


Comments
Saturday, May 03, 2008 4:22 AM by subject: exchange
# Weekend reading
How to customize OWA in Exchange Server 2007 Top 10 Microsoft Exchange Server 2003
registry hacks What
Tuesday, May 06, 2008 6:29 AM by jameswells
# re: Exchange ESE, Failed Backups, the Checkpoint Depth, and You
Thanks for posting the perfmon counter! We've had a few problems around this (mostly gone
since we went to VSS backups) - no one, including PSS, ever mentioned a counter that can be
monitored!

You might also like