Adobe Acrobat’s ability to ‘capture’ a website in PDF has one noteworthy flaw that you need to keep in mind.

 

Links on the captured website function the way they are __coded__ rather than the way they behaved on the live site. I wish I could find a more eloquent way to describe that. Perhaps this example will help.

A homepage (index1.html) has the text "Frequently Asked Questions" hyperlinked to faq1.html using the coding <a href="faq/faq1.html">. Because this is a relative URL, the expected result occurs: clicking the link moves to the FAQ page within the PDF file. However, if the coding uses an absolute URL, i.e. <a href="http://www.example.org/faq/faq1.html">, clicking the link in the PDF file sends the user out to the Internet instead of to another page in the same PDF file.
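The distinction can be seen outside of Acrobat entirely. As a minimal sketch (using Python's standard urllib.parse and the hypothetical example.org addresses above, nothing Acrobat-specific): a relative href is resolved against the page that contains it, while an absolute href already names a fixed location on the live web.

```python
from urllib.parse import urljoin

# Base address of the captured homepage (hypothetical, per the example above)
base = "http://www.example.org/index1.html"

# A relative link is resolved against the containing page...
relative = urljoin(base, "faq/faq1.html")

# ...while an absolute link ignores the base entirely.
absolute = urljoin(base, "http://www.example.org/faq/faq1.html")

print(relative)  # http://www.example.org/faq/faq1.html
print(absolute)  # http://www.example.org/faq/faq1.html
```

On the live site both forms land on the same page; the difference is that a capture tool can remap the relative link to a page inside the PDF, while the absolute link stays pointed at the Internet.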

 

This behavior will deliver whatever content is at http://www.example.org/faq/faq1.html at the time the PDF is used. This could easily be different content, because the live webpage has been updated. The page might also simply no longer be available on the Internet, presenting the researcher with a "404 File not found" message. Other explanations for this latter behavior include a changed mapping for the live file (e.g., http://www.example.org/faq.html) or a change in the domain name (e.g., www.example.org becomes www.example.net).

 

Whether the file is different, no longer exists, or has been moved, the point is that the link in the PDF-captured object directs you outside of the object itself, when its original behavior was clearly to navigate the user to another page at that same website.

 

Such is the nature of the web. In our rush to it, we have made it possible to accomplish many things using different approaches. All are valid. The preservation choices are not as forgiving.

 

I have other qualms. Contact me offline, if you want to hear them.

 

My best to all of you,

 

Ricc

 

Riccardo Ferrante

Information Technology Archivist & Electronic Records Program Director

------------------------------------------------------------------------------------------------------------------

Smithsonian Institution Archives - 900 Jefferson Dr. S.W. MRC 414 - Washington, DC 20013

------------------------------------------------------------------------------------------------------------------

[Email] [log in to unmask]edu - [Phone] 202.357.1420 - [Fax] 202.357.2395

 

 

The Smithsonian Institution Archives is relocating to new offices. Records management, reference services and history of the Smithsonian queries are unavailable until we reopen in early fall 2006. In order to serve you better, please check our website for updates and specific information http://siarchives.si.edu


From: Archives & Archivists [mailto:[log in to unmask]] On Behalf Of Rick Barry
Sent: Wednesday, July 26, 2006 4:17 PM
To: [log in to unmask]
Subject: Re: Capturing websites

 

In a message dated 7/26/2006 3:01:38 A.M. Eastern Standard Time, [log in to unmask] (Jessica Tanny) writes:

In 2002 there was an interesting conversation on the archives listserv regarding archiving websites. At that time someone mentioned using Adobe Acrobat as a way to capture a website (Mount Holyoke has a "how to" guide online: http://www.mtholyoke.edu/lits/csit/documentation/archiving/archiving_websites.htm).

I believe the current address is http://www.mtholyoke.edu/lits/ris/documentation/archiving/archiving_websites.htm

 

I back up my own Website www.mybestdocs.com by regularly creating and naming a blank file using the date in the file name, e.g., <bu-mbd060725> in a backup directory on my C-Drive, then publishing my live Website to that directory/file using MS FrontPage, which is fine for my purposes. I keep it there and also simply copy that file to a DVD in case -- I should say for when -- my PC hard drive dies someday when I least expect it. It is then possible at any time to open that file in FrontPage and republish it as a whole or examine/edit any page/sub-page within the Website. I don't do that, because I use such backups for archival snapshots of my Website reflecting major changes in content or design. But it would be possible to do so, something that anyone concerned about maintaining the integrity of archival versions of a Website would want seriously to consider.
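The dated naming convention above (e.g., <bu-mbd060725> for a snapshot made 25 July 2006) is easy to generate automatically. A minimal sketch in Python, assuming the same YYMMDD pattern and a hypothetical "backups" directory (the helper name is mine, not part of the original workflow):

```python
from datetime import date
from pathlib import Path

def backup_name(site_prefix: str, day: date) -> str:
    """Build a dated backup name like 'bu-mbd060725' (prefix + YYMMDD)."""
    return f"bu-{site_prefix}{day.strftime('%y%m%d')}"

# Example: a snapshot taken 25 July 2006
name = backup_name("mbd", date(2006, 7, 25))
print(name)  # bu-mbd060725

# Create the (hypothetical) backup directory to publish the site into
Path("backups", name).mkdir(parents=True, exist_ok=True)
```

Generating the name from the clock rather than typing it by hand avoids the occasional transposed digit in a long series of snapshots.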

 

For institutional purposes, I would recommend an open source product such as Heritrix http://archive-crawler.sourceforge.net/, the crawler used by the Internet Archive, rather than a proprietary software product such as FrontPage, Acrobat, etc. Alternatively, a good enterprise content management (ECM) system that is DoD 5015.2-certified would also work and provide at least the minimum recordkeeping functionality, if not more.

 

Regards,
 
Rick

Rick Barry
www.mybestdocs.com
Cofounder, Open Reader Consortium
www.openreader.org

A posting from the Archives & Archivists LISTSERV List sponsored by the Society of American Archivists, www.archivists.org. For the terms of participation, please refer to http://www.archivists.org/listservs/arch_listserv_terms.asp.

To subscribe or unsubscribe, send e-mail to [log in to unmask] In body of message: SUB ARCHIVES firstname lastname *or*: UNSUB ARCHIVES To post a message, send e-mail to [log in to unmask]

Or to do *anything* (and enjoy doing it!), use the web interface at http://listserv.muohio.edu/archives/archives.html

Problems? Send e-mail to Robert F Schmidt <[log in to unmask]>