

How ephemeral is content on the internet? Can a library reasonably expect to collect each new version of its website or blog as it would collect every issue of a magazine or journal? Large-scale projects such as the Internet Archive send out crawlers to gather snapshots of much of the web. This massive collection of archived websites may include content of interest to your patrons. But if you want to control exactly when and what is archived, relying on someone else to do the archiving isn’t ideal.

Though large-scale web archiving and preservation are outside the mission of most libraries, it is possible for even the smallest operations to maintain an archive of a group of sites. This article outlines a simple workflow for Macs that uses free software and requires only the most basic technical expertise. There is no programming involved, and you won’t have to touch a command-line interface. This is web archiving for the rest of us.

We developed this process in response to an assignment at the Simmons College Graduate School of Library and Information Science. At Simmons, Katharine is a student and an editorial fellow, and Nick is a student while he works as a library assistant in preservation services at the Massachusetts Institute of Technology (MIT) Libraries. We have put this process to the test on relatively static webpages, blogs, and even the social networking site Twitter.

12 | SEPTEMBER 2009 »

Katharine Dunn and
Nick Szydlowski
web archiving for the rest of us

Workflow

There are just three things you need to do to create an archive of websites. (Preservation is another matter; we’ll discuss this briefly at the end of the article.)

1. Harvesting: First you must acquire, or harvest, the content you are collecting. Because many websites change frequently, you will need to reacquire a current version of the site at some designated interval.

2. Version Control: Once you have harvested multiple versions of a site, you will need to keep track of the versions so you can determine which iterations are different enough from one another to be worth keeping.

3. Presentation: The reason to archive sites is to use them again, right? How you choose to make your new web archive available to your patrons will depend on what resources you have at your disposal. But it is possible to view the files on a computer (PC or Mac), burn them to a CD, or even place them on a server for remote viewing.

Harvesting

In order to efficiently acquire entire sites, we chose to use a piece of software called SiteSucker. This software was designed to rapidly download entire websites and place the files on your hard drive in a folder structure that mirrors the way the site was arranged on its server. SiteSucker is donation-ware available for download. PC and Linux users will find that the free software HTTrack offers similar functionality.

To download a site using SiteSucker, simply type the URL of the site into the box marked “Web URL” and press the “Download” button. SiteSucker will follow the links on each page, downloading every file that a user can access to a user-designated set of folders on your hard drive. Clicking on the “Settings” icon gives you much greater control over the downloaded files. For example, it is possible to specify which file extensions the program should download and which it should ignore. Use the settings to limit or expand the number of files downloaded and to ensure that you get the files you need. It is possible to save your settings so that each time you download a page, SiteSucker performs the same tasks.

[Figure: A SiteSucker screen]

When you’re learning to use SiteSucker, it is best to start with a target webpage that consists primarily of plain HTML rather than a site where content is held in a database and has pages generated by scripts. If you have trouble determining the difference, look for pages that end with the file extensions .htm or .html; these are more likely to be plain HTML. SiteSucker cannot follow links that appear within JavaScript, so some types of sites will not render correctly when harvested with this program. We discuss the details of using SiteSucker on database-backed sites later.

Version Control

In the software development community, the term “version control” refers to the process of managing the updates of files and documents, such as software code, that may be edited or changed many times by multiple users. Version control software such as CVS or Subversion makes it possible to return to any past version of a file. In essence, version control gives you intellectual control over the content you collect, and if you have born-digital content, it can be essential when performing traditional library and archive tasks such as cataloging, appraisal, and providing access. Version control helps us reach one of the end goals of web archiving: making past versions available to users.

When we first started working on this project, we attempted to use Subversion, a popular and free version control package, to track different iterations of the sites we downloaded. However, as nonprogrammers unaccustomed to using a command-line interface, we struggled to produce usable results with Subversion. As a tool created for the software development community, it is perhaps best suited for those with some programming skills.

In response to these difficulties, we developed a two-tiered version control method using programs created by Apple. The first of these is Time Machine, which comes preinstalled with Mac OS X 10.5 (Leopard) or higher. The second, FileMerge, is part of the free Apple Xcode developer package. Both programs feature intuitive graphic interfaces, and neither requires any special skills. Using these programs, it is possible to maintain control of the different versions of the websites you’re tracking and to quickly appraise new versions to determine if they are different from previous iterations.

Step 1: Managing backups with Time Machine. As mentioned, Time Machine comes preinstalled on Macs with operating system versions 10.5 or higher. Most Macs purchased since late 2007 should come with this program, which creates incremental backups, meaning that each new backup does not overwrite previous versions. This feature allows you to track down and restore files that, for example, you may have accidentally deleted from your hard drive. The program can be set up to back up your entire hard drive, multiple files, or a single file onto an external drive.

It is possible to achieve the same end results without using Time Machine, but using Time Machine has two significant benefits:

• You will save on disk space because all archived versions of files will be on an external drive rather than on your computer. Time Machine will manage the versions efficiently rather than using disk space to store multiple copies of the same file.

• Because an external drive holds incremental backups, you can delay appraisal decisions. This will allow you to batch these decisions, creating a more flexible and efficient workflow.

To begin, you should run a backup using Time Machine after each web harvest. First, connect the external hard drive you will be using to store your backed-up versions. Time Machine may immediately start a backup. If not, click on the clock icon in the menu bar on the upper right of your screen and scroll to “Back Up Now.” You can continue working while the computer is performing a backup. It is wise to make sure your external hard drive has plenty of space, since when the disk becomes full, Time Machine will begin to erase the oldest versions to make room for new ones.

[Figure: Two screen shots of Time Machine in action. On April 6, we entered Time Machine and searched “sahughes,” the folder name of our website. Pressing the back arrow on the lower right-hand side took us to the files as they existed the first time we backed them up, on March 21 (first screen shot). The second screen shot is of the files on April 6 (“Today (Now)”).]

After collecting a number of versions of your website, you may want to compare these versions and make appraisal decisions about which are worth keeping and which you can put in the trash. To retrieve older versions of your website from Time Machine, make sure your external hard drive is plugged in and turned on. Click the clock icon in the menu bar and choose “Enter Time Machine.” A Finder window (which shows your files and folders) will drop down in front of an identical stack of windows on an “outer space” backdrop.

Enter the name or partial name of your file or folder in the Spotlight search box on the upper right of the Finder window. Time Machine searches for the files and presents the most recent version in the Finder window at the front. To go to earlier versions, press the back arrow on the lower right of the screen. You can see every version saved along with the date it was saved. Select the versions you’re interested in and copy and paste each one to the desktop. Rename each folder, adding the date of the files at the end, e.g., “Library_Website_June29.” Do not press the “Restore” button on the bottom right of the screen. If you do so, the version of the files that you are looking at will replace the latest version, and you may lose some files.

The folders containing the files are now clearly labeled and ready to be compared. Time Machine is limited in this way; it holds versions but cannot show you what changes have been made. To do this you need the program FileMerge.

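The workflow above needs no code, but readers who are comfortable with a little scripting could sketch the copy-and-rename step in Python. This is only an illustration under stated assumptions: the function names `dated_name` and `copy_snapshot` are hypothetical, and the ISO date suffix (2009-06-29) is a substitute for the “June29” style used above, chosen because it sorts chronologically.

```python
import shutil
from datetime import date
from pathlib import Path


def dated_name(site_name: str, harvest_date: date) -> str:
    """Build a folder name such as 'Library_Website_2009-06-29' (assumed naming scheme)."""
    return f"{site_name}_{harvest_date.isoformat()}"


def copy_snapshot(source: Path, dest_root: Path, site_name: str, harvest_date: date) -> Path:
    """Copy one harvested folder into dest_root under a dated folder name."""
    target = dest_root / dated_name(site_name, harvest_date)
    # copytree raises FileExistsError if the target already exists,
    # so an earlier snapshot with the same date is never overwritten.
    shutil.copytree(source, target)
    return target
```

Calling `copy_snapshot(Path("sahughes"), Path.home() / "Desktop", "Library_Website", date(2009, 6, 29))` would leave the original folder untouched and create a labeled copy ready for comparison.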
Step 2: Comparing versions of the site with FileMerge. FileMerge is included in Apple’s Xcode developer package, which is freely available. The program compares two similar folders and shows any differences between them. When you run the application, it will prompt you for two folders or files, labeled “right” and “left.” Begin by selecting the first and second versions of the site you have archived. When you press the “Compare” button, FileMerge will display a folder structure, showing all files that appear in either folder. Files that are identical in both folders will appear in gray. Files that are present in only one folder, or that are not exactly the same in each folder, will appear in black. The check boxes at the upper right of the window allow you to choose which categories of file will be displayed.

You may wish to make an appraisal decision based simply on how many files have been added, removed, or changed from one version of the site to another. Depending on the goals of your archive, you may decide that a new version of the site merits retention only if it includes more than 10 new files, or that any change at all represents a new version that should be kept.

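The counting rule described above can also be sketched in a few lines of Python using the standard library’s `filecmp` module. This is not how FileMerge works internally; it is a rough stand-in for the same kind of tally. The function names `summarize_changes` and `worth_keeping` and the default threshold are assumptions made for illustration.

```python
import filecmp
from pathlib import Path


def count_files(path: Path) -> int:
    """Count one file, or every file beneath a folder."""
    if path.is_file():
        return 1
    return sum(1 for p in path.rglob("*") if p.is_file())


def summarize_changes(old: Path, new: Path) -> dict:
    """Tally files added, removed, and changed between two snapshot folders."""
    counts = {"added": 0, "removed": 0, "changed": 0}
    pending = [filecmp.dircmp(old, new)]
    while pending:
        cmp = pending.pop()
        counts["removed"] += sum(count_files(Path(cmp.left) / n) for n in cmp.left_only)
        counts["added"] += sum(count_files(Path(cmp.right) / n) for n in cmp.right_only)
        # shallow=False compares file contents rather than size and timestamp.
        _, mismatch, _ = filecmp.cmpfiles(cmp.left, cmp.right, cmp.common_files, shallow=False)
        counts["changed"] += len(mismatch)
        # Descend into folders that exist in both snapshots.
        pending.extend(cmp.subdirs.values())
    return counts


def worth_keeping(counts: dict, min_new_files: int = 10) -> bool:
    """Example policy: keep a version only if it adds more than min_new_files files."""
    return counts["added"] > min_new_files
```

An archive that treats any change as significant could instead test whether `sum(counts.values()) > 0`.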
[Figure: Two screen shots of FileMerge. We asked the program to compare the April 6 and April 2 versions of the sahughes files. The program lists individual files and notes differences between the two versions. Here, we’ve highlighted the file “travel.html,” which has four changes. We then double-clicked on travel.html, and a window popped up, showing the newer and the older versions of the file in columns next to each other.]

If you are familiar with HTML, you can also use FileMerge to view more specifically what has changed about a page without having to read the entire contents of the page. To do this, select the page in FileMerge and choose “Comparison” from the drop-down menu “View.” FileMerge will display the code of the two versions side by side, with each change highlighted and enumerated. Using this information, you may find it easier to make a precise determination about whether changes are significant or minor. Of course, what constitutes a significant change will depend on the nature of your project. If you determine that a version of the site is not different enough to keep, simply delete it from your computer and/or external hard drive.

Presentation

Once you have decided to retain a particular version of a site, you’re ready to make that version available to your patrons. From a technical standpoint, you can serve the files by any method you would use for other HTML files: You can host them on a local computer or network, lend them out on a CD-R, or even place them on a public web server. Using any of these methods, it is important to direct the user first to the index file (either index.html, index.htm, or index.php). This can be achieved by written instructions or by linking directly to the file from a menu. Care should be taken to avoid violating the copyright of content owners. Though the law in this area remains somewhat unsettled, a useful and brief guide is the Oakland Archive Policy (http://www2.sims.ber removal-policy.html), which has been adopted by the Internet Archive.

Notes on Blogs and Other Database-Backed Sites

SiteSucker, the software we’ve used to harvest websites, was primarily designed to work on sites based on plain HTML. When you harvest a plain HTML site, the end product should mirror what was originally on the web server: the same files in the same folder structure. However, more and more sites these days are designed using software that holds web content in a database and generates pages dynamically, based on the user’s request. This includes sites created using either blogging software or a content management system (CMS) such as Joomla.

In order to mirror these sites exactly as they exist on their servers, SiteSucker would have to download the database software from the server and run it on your computer. For many reasons, this is not possible. What SiteSucker will do instead when presented with this type of site is to follow every link on every page. In database-backed sites, this can produce thousands of possibilities because each piece of content can often be displayed in a variety of contexts.

SiteSucker’s settings can help limit your archive to a reasonable number of files, either by setting a hard limit on the number of files downloaded or by limiting the number of links the program will follow in succession, using the levels setting. Limiting levels to two will return only the homepage and files linked directly from it.

Sites created on many blogging and CMS platforms do not render correctly when downloaded by SiteSucker. This is often because the formatting information for these sites is contained in scripts that SiteSucker cannot harvest or interpret. Whether a particular platform will work with SiteSucker depends on the details of the software itself. Fortunately, some of the most popular blogging platforms, including WordPress and Blogger, rendered correctly in our tests. For blogs created in other software, we were able to archive the content, but the formatting information was not retained. SiteSucker can also be used to harvest the main page of a Twitter account, though it cannot reach archived posts.

Preservation

Archiving websites can be extremely useful, but simply owning the files doesn’t guarantee they will continue to be usable. If you want to maintain your web archive into the future, you will need to think about long-term preservation. This is a complex issue, even for relatively stable file formats such as HTML. But it can be simplified by timely and thoughtful action. Two great sources of information on the topic are the Digital Curation Centre in the U.K. (www.d and Australia’s PADI (Preserving Access to Digital Information). An excellent first step is to maintain consistent and detailed metadata about the files you are archiving. Simply knowing the format and the creation date of a file may make the difference between a usable file and an unusable one, 20, 50, or 100 years from now.

Katharine Dunn (katharine.dunn is a graduate student in library and information science at Simmons College in Boston. Dunn, who is also a freelance magazine writer, works as the school’s editorial fellow. Nick Szydlowski is a library assistant in preservation services at the MIT Libraries. He is currently pursuing an M.L.S. from the Simmons College Graduate School of Library Science. He can be reached at
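
As a supplement to the preservation advice above, the suggestion to record the format and date of each archived file could be sketched as a small Python script that writes a CSV manifest for one snapshot folder. The function names, the column names, and the use of the file’s last-modified time as a stand-in for its creation date are all assumptions made for illustration, not part of the workflow described in the article.

```python
import csv
import mimetypes
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["path", "format", "bytes", "modified_utc"]


def describe_file(path: Path, root: Path) -> dict:
    """Record basic metadata for one archived file."""
    stat = path.stat()
    # Guess the format from the file extension; this does not inspect contents.
    mime, _ = mimetypes.guess_type(path.name)
    return {
        "path": path.relative_to(root).as_posix(),
        "format": mime or "unknown",
        "bytes": stat.st_size,
        # Last-modified time, used here as a stand-in for a creation date.
        "modified_utc": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
    }


def write_manifest(root: Path, manifest: Path) -> int:
    """Write one CSV row per file under root and return the number of rows."""
    rows = [describe_file(p, root) for p in sorted(root.rglob("*")) if p.is_file()]
    with manifest.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping the manifest alongside each dated snapshot folder, rather than inside it, avoids listing the manifest as part of the archived site.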


Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.