Abstract
As a result, preserving and cataloging the earliest electronic records consisted of two intertwined problems: the task of finding and copying the data off magnetic media before the media deteriorates, and the challenge of reading older and sometimes obscure formats that are no longer in widespread use[1].
Archivists are now on the brink of a far more disruptive change than the transition from paper to electronic media: the transition from personal to cloud computing. In the very near future an archivist might enter the office of a deceased writer and find no electronic files of personal significance: the author's appointment calendar might be split between her organization's Microsoft Exchange server and Yahoo Calendar; her unfinished and unpublished documents stored on Google Docs; her diary stored at the online LiveJournal service; correspondence archived on the Facebook walls of her close friends; and her most revealing, insightful and critical comments scattered as anonymous and pseudonymous comments on the blogs of her friends, collaborators, and rivals.
Although there are numerous public and commercial projects underway to find and preserve public web-based content, these projects will not be useful to future historians if there is no way to readily find the information that is of interest. And of course, none of the archiving projects are able to archive content that is private or otherwise restricted, as will increasingly be the case for personal information that is stored in the cloud.
Introduction
This paper introduces and explores the problem of finding and archiving a person's Internet footprint. In Section 2 we define the term Internet footprint and provide numerous examples of the footprint's extent. In Section 3 we

Invited paper, presented at the First Digital Lives Research Conference: Personal Digital Archives for the 21st Century, London, England, 9-11 February 2009. Corresponding Author: slgarfin@nps.edu
Forensic Analysis
Figure 1: The first page of output from the bulk extractor program; the actual output runs more than 40 pages.
Provable References A provable reference could be indicated by the presence of a username/password combination which maps directly to a specific website and can be validated by testing to see whether the account can still be accessed.
Reliable References A reliable reference could be indicated by the presence of an alias and URL/cookie combination that does not include a password, preventing the researcher from actually testing the account.
Passing References A passing reference could be indicated by the presence of a URL or cookie which points to a social networking site or Internet e-mail site. The difference here is that there is only one indicator of a reference to a website which could hold historically interesting material.
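The three categories lend themselves to automated triage. The following sketch shows one way recovered artifacts might be sorted by which indicators they carry; the Artifact fields and the function are illustrative assumptions, not the schema or API of any existing forensic tool.

```python
# Hypothetical triage of recovered artifacts into the three reference
# categories described above. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Artifact:
    url: Optional[str] = None       # URL pointing at a site of interest
    cookie: Optional[str] = None    # cookie tying the drive to a site
    username: Optional[str] = None  # alias or account name
    password: Optional[str] = None  # recovered credential

def classify(a: Artifact) -> str:
    if a.username and a.password and a.url:
        return "provable"    # credentials map to a site and could be tested
    if a.username and (a.url or a.cookie):
        return "reliable"    # alias plus URL/cookie, but no password to test
    if a.url or a.cookie:
        return "passing"     # only a single indicator pointing at a site
    return "unclassified"

print(classify(Artifact(url="http://mail.example.org",
                        username="jdoe", password="hunter2")))        # provable
print(classify(Artifact(url="http://sns.example.org",
                        username="jdoe")))                            # reliable
print(classify(Artifact(cookie="session=abc123")))                    # passing
```

In practice the "provable" branch would only establish that an account *could* be validated; actually testing it raises the legal questions discussed in Section 5.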
3.4 Unexpected Complications
Figure 2: Postings to Craigslist may one day provide fascinating contemporaneous documents of the careers of writers or artists.
integrity assurances.
The third category, Passing References, will require significant time and effort on the part of the historian, and it is anticipated that the level of automation will decrease. Since the historian is provided little information to go on, exhaustive manual searches of both local and deep/hidden content will be required. For public content, traditional search engines, like Google and Yahoo, and webcrawlers, like Webcrawler.com and DataRover, could be utilized. Because local search engines index mostly based on hyperlinks which include location information, they typically exclude high-quality local content available in the Deep Web[40]. Deep Web crawling may be accomplished through the use of tools such as Deep Web Crawler and LocalDeepBot. Additionally, Hidden Web Agents may be used as well; these agents can search and collect information on pages outside the Publicly Indexable Web (PIW)[32].
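The limitation noted above, that hyperlink-driven crawlers miss form-gated content, can be illustrated in a few lines of Python using only the standard library. The markup is invented for the example; this is not the behavior of any of the crawlers named above.

```python
# A hyperlink-following crawler only discovers pages reachable through
# <a href> links, so content behind a search form never enters its
# frontier -- this is the "Deep Web" gap that Hidden Web Agents target.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets a naive crawler would follow."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Invented page: two plain links plus a search form gating further content.
page = """
<a href="/about.html">About</a>
<form action="/search"><input name="q"></form>
<a href="/blog/">Blog</a>
"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about.html', '/blog/'] -- /search results stay invisible
```

A Hidden Web Agent, by contrast, must recognize the form, synthesize queries for it, and harvest the result pages, which is why that class of tool requires far more site-specific logic.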
While there are many different ways to archive web content, each has significant technical problems. There are several fundamental problems in making an archival copy of a web page:
- Because web pages can appear differently on different computers, it is not clear what should be archived: a picture of the web page, or the HTML code of the web page?
- Web sites such as Facebook and LiveJournal may show web pages differently depending on who is logged in. Should the web page be archived as it appears to the author, to a person in the author's circle of friends, to an un-friended registered user, or as it appears if no one is logged in?
- Alternatively, web sites may display pages differently at different times of day, or change their theme to take into account current events. If there are significant time-dependent changes, should multiple copies be archived?
Once the archivist decides what should be archived, the next question to answer is how it should be archived. The naive approach for archiving web content is to print it. Archivists generally frown on this approach, because all it does is exchange one set of problems for another. Instead of printing to paper, the web page could be
Legal Issues
There are primarily two legal issues that could arise during the collection of Internet works proposed in this paper: violations of copyright law, and violations of computer crime statutes such as the US Computer Fraud and Abuse Act or the UK Computer Misuse Act. There are also a number of ethical issues that might arise as well.
No Site Content may be modified, copied, distributed, framed, reproduced, republished, downloaded, scraped, displayed, posted, transmitted, or sold in any form or by any means, in whole or in part, without the Company's prior written permission, except that the foregoing does not apply to your own User Content (as defined below) that you legally post on the Site. Provided that you are eligible for use of the Site, you are granted a limited license to access and use the Site and the Site Content and to download or print a copy of any portion of the Site Content to which you have properly gained access solely for your personal, non-commercial use, provided that you keep all copyright or other proprietary notices intact. Except for your own User Content, you may not upload or republish Site Content on any Internet, Intranet or Extranet site or incorporate the information in any other database or compilation, and any other use of the Site Content is strictly prohibited[18].
Figure 4: This section of Facebook's Terms of Use would seem to prohibit the archiving of a person's Facebook profile for historical purposes.
5.2 Computer Crime
5.3 Ethical Issues
Computer systems have the potential to record more information, retain it for a longer period of time, and make it available to more individuals than is possible with paper works. More than ever, every effort should be made to clearly differentiate between what is public and what is private information. This is especially the case when collecting from online information systems, since there is the chance that the information collected may belong to another person (in the case of a mistaken identity), or may involve other people (in the case of a social network website).
The problem of mistaken identity is especially problematic for online data collection. There is little chance when going through a person's office that the archivist will accidentally pick up and catalog a diary belonging to a person who has the same name but who lives in another country, but this is exactly what can happen when downloading an originator's online diary.
Conclusion
It is no longer sufficient to simply analyze local computers and associated media when attempting to catalog a person's life works. Increasingly, communication, personal documents and published works are migrating to the web. Social networking sites contain photos, videos and personal communication. Blog sites contain personal ramblings and commentaries, both named and anonymous. E-mail and chat, as well as personal videos, are also migrating to the web. The archivist of the present must be technically savvy and able to use the myriad of forensic analysis, web searching and cataloging tools in order to be efficient and create a complete set of works.
Many of the approaches discussed in this paper need not be confined to the archivist profession. Individuals can apply these approaches to themselves to determine the extent of their own digital shadow. These approaches may also be useful in civil litigation for e-discovery, and even in law enforcement.
References
[7] Ziv Bar-Yossef, Andrei Z. Broder, Ravi Kumar, and Andrew Tomkins. Sic transit gloria telae: towards an understanding of the web's decay. In WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 328-337. ACM, New York, NY, USA, 2004. ISBN 1-58113-844-X.
[8] Ron Bekkerman and Andrew McCallum. Disambiguating web appearances of people in a social network. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 463-470. ACM, New York, NY, USA, 2005. ISBN 1-59593-046-9.
[9] Susan L. Bryant, Andrea Forte, and Amy Bruckman. Becoming wikipedian: transformation of participation in a collaborative online encyclopedia. In GROUP '05: Proceedings of the 2005 international ACM SIGGROUP conference on Supporting group work, pages 1-10. ACM, New York, NY, USA, 2005. ISBN 1-59593-223-2.
Acknowledgements
Our thanks to Jeremy Leighton John at the Digital Lives research project for suggesting that we explore this relevant and interesting topic and for providing valuable feedback on this paper.
[11] Moira Burke and Robert Kraut. Taking up the mop: identifying future wikipedia administrators. In CHI '08: CHI '08 extended abstracts on Human factors in computing systems, pages 3441-3446. ACM, New York, NY, USA, 2008. ISBN 978-1-60558-012-X.
[12] Mary Elaine Califf and Raymond J. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. J. Mach. Learn. Res., 4:177-210, 2003. ISSN 1533-7928.
[13] Fred Cohen. Risks of believing what you see on the Wayback Machine (archive.org). RISKS Digest, 25, January 7 2008. http://seclists.org/risks/2008/q1/0000.html.
[14] Susan Crawford. The computer fraud and abuse act, May 19 2008. http://scrawford.net/blog/the-computer-fraud-and-abuse-act/1172/.
[15] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv., 40(2):1-60, 2008. ISSN 0360-0300.
[17] John P. Elwood. Admissibility in federal court of electronic copies of personnel records, May 30 2008. http://www.usdoj.gov/olc/2008/electronic-personnel-records.pdf.
[19] Simson Garfinkel. Forensic feature extraction and cross-drive analysis. In Proceedings of the 6th Annual Digital Forensic Research Workshop (DFRWS), Lafayette, Indiana, August 2006. http://www.dfrws.org/2006/proceedings/10-Garfinkel.pdf.
[28] Adam Jatowt, Yukiko Kawai, and Katsumi Tanaka. Detecting age of page content. In WIDM '07: Proceedings of the 9th annual ACM international workshop on Web information and data management, pages 137-144. ACM, New York, NY, USA, 2007. ISBN 978-1-59593-829-9.
[31] Judge Alex Kozinski calls for probe into his porn postings. Los Angeles Times, June 13 2008.
[32] Juliano Palmieri Lage, Altigran S. da Silva, Paulo B. Golgher, and Alberto H. F. Laender. Automatic generation of agents for collecting hidden web pages for data extraction. Data Knowl. Eng., 49(2):177-196, 2004. ISSN 0169-023X.
[33] Lawrence Lessig. The Kozinski mess, June 12 2008. http://www.lessig.org/blog/2008/06/the_kozinski_mess.html.
[34] Malorie Lucich. Re: Facebook pages of dead people, January 16 2009. Personal communication.
[35] Andrew Martin. Whole Foods executive used alias. The New York Times, July 12 2007. http://www.nytimes.com/2007/07/12/business/12foods.html.
[36] Frank McCown, Norou Diawara, and Michael L. Nelson. Factors affecting website reconstruction from the web infrastructure. In JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, pages 39-48. ACM, New York, NY, USA, 2007. ISBN 978-1-59593-644-8.
[37] Declan McCullagh. Finnish line, September 1996. http://w2.eff.org/Misc/Publications/Declan_McCullagh/hw.finnish.line.090696.article.
[38] Elinor Mills. Condé Nast to buy Wired News, July 11 2006. http://news.cnet.com/Conde-Nast-to-buy-Wired-News/2100-1030_3-6093028.html.
[39] Jose E. Moreira, Maged M. Michael, Dilma Da Silva, Doron Shiloach, Parijat Dube, and Li Zhang. Scalability of the nutch search engine. In ICS '07: Proceedings of the 21st annual international conference on Supercomputing, pages 3-12. ACM, New York, NY, USA, 2007. ISBN 978-1-59593-768-1.
[40] Dheerendranath Mundluru and Xiongwu Xia. Experiences in crawling deep web in the context of local search. In GIR '08: Proceedings of the 2nd international workshop on Geographic information retrieval, pages 35-42. ACM, New York, NY, USA, 2008. ISBN 978-1-60558-253-5.
[41] Maureen Pennock and Brian Kelly. Archiving web site resources: a records management view. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 987-988. ACM, New York, NY, USA, 2006. ISBN 1-59593-323-9.
[42] Herman Chung-Hwa Rao, Yih-Farn Chen, and Ming-Feng Chen. A proxy-based personal web archiving service. SIGOPS Oper. Syst. Rev., 35(1):61-72, 2001. ISSN 0163-5980.
[43] Craig Richmond. Why mirroring is not a backup solution, January 2 2009. http://hardware.slashdot.org/article.pl?sid=09%2F01%2F02%2F1546214.
[44] Arnaud Sahuguet and Fabien Azavant. Building lightweight wrappers for legacy web data-sources using W4F. In VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases, pages 738-741. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999. ISBN 1-55860-615-7.
[45] Adobe Systems. Adobe Acrobat 4.0 for Macintosh readme, March 15 1999.
[46] Jordi Turmo, Alicia Ageno, and Neus Català. Adaptive information extraction. ACM Comput. Surv., 38(2):4, 2006. ISSN 0360-0300.
[47] John Updike. Cut the unfunny comics, not Spiderman. The Boston Globe, October 27 1994.
[48] Fernanda B. Viégas, Martin Wattenberg, and Kushal Dave. Studying cooperation and conflict between authors with history flow visualizations. In CHI '04: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 575-582. ACM, New York, NY, USA, 2004. ISBN 1-58113-702-8.
[49] Xiaoyun Wang and Hongbo Yu. How to break MD5 and other hash functions. In Ronald Cramer, editor, EUROCRYPT, volume 3494 of Lecture Notes in Computer Science, pages 19-35. Springer, 2005. ISBN 3-540-25910-4.
[50] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Comput. Surv., 35(4):399-458, 2003. ISSN 0360-0300.