Web Archiving

Introduction

'Archiving' is an ambiguous term: it can mean the backup of digital resources, the long-term preservation of those resources, or both. This document covers the physical archiving of your Web site, the last in a series of steps that follows the selection and appraisal of Web resources. Archiving will form part of a 'preservation policy'.

Approaches

Before archiving it is important to consider approaches to preserving your Web site:

What to do now: This class covers quick-win solutions: actions that can be performed now to get results, or to rescue and protect the resources you have identified as being most at risk. Actions include domain harvesting, remote harvesting, use of the EDRMS, use of the Institutional Repository, and '2.0 harvesting'. These actions may be attractive because they are quick, and some of them can be performed without involving other people or requiring changes in working practices. However, they may become expensive to sustain if they do not evolve into a strategy.
Strategic approaches: This class covers longer-term strategic solutions which take more time to implement, involve some degree of change, and affect more people in the Institution. These include approaches adapted from Lifecycle Management and Records Management, and also approaches which involve working with external organisations to do the work (or some of it) for you. The pay-off may be delayed in some cases, but the more these solutions become embedded in the workflow, the more Web archiving and preservation become a matter of course, rather than something which requires reactive responses or constant maintenance, both of which can be resource-hungry.

Domain Harvesting

Domain harvesting can be carried out in two ways: (1) your Institution conducts its own domain harvest, sweeping the entire domain (or domains) using appropriate Web-crawling tools; or (2) your Institution works in partnership with an external agency which does the domain harvesting on its behalf.

Domain harvesting is only ever a partial solution to the preservation of Web content. Firstly, the systems which currently exist have limitations. You may gather too much, including pages and content that you do not need to preserve. Conversely, you may miss things which ought to be collected, such as hidden links, secure and encrypted pages, content on external domains, database-driven content, and the databases themselves. Secondly, simply harvesting the material and storing a copy of it may not address all the issues associated with preservation.
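The limitations described above can be illustrated with a small sketch. It is not a harvesting tool, only an illustration of how a crawler might decide what is in scope, assuming a domain harvest limited to a hypothetical domain 'example.ac.uk'. The function names, the sample page and the URLs are all invented for this example. It shows that links on external domains are skipped, and that a link which exists only in script (a 'hidden link') is never seen by a parser that reads static anchors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope(url, domains):
    """Return True if url is http(s) and its host is a listed domain or a subdomain of one."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def partition_links(page_url, html, domains):
    """Split the links found on a page into in-scope and out-of-scope URLs."""
    collector = LinkCollector()
    collector.feed(html)
    kept, skipped = [], []
    for href in collector.links:
        absolute = urljoin(page_url, href)
        (kept if in_scope(absolute, domains) else skipped).append(absolute)
    return kept, skipped


# Hypothetical page: two in-domain links, one external link, and one link
# that exists only inside a script handler and so is invisible to the parser.
SAMPLE_HTML = """
<a href="/prospectus/">Prospectus</a>
<a href="http://blog.example.ac.uk/news">News blog</a>
<a href="http://www.flickr.com/photos/example/">Photos</a>
<span onclick="location.href='/hidden/report.html'">Report</span>
"""

kept, skipped = partition_links("http://www.example.ac.uk/", SAMPLE_HTML, ["example.ac.uk"])
print("in scope:", kept)
print("out of scope:", skipped)
```

Neither list contains the scripted link to the report, and the externally hosted photographs are excluded even though they may be part of the Institution's Web presence. A real Web-crawling tool is far more capable than this sketch, but the same scoping decisions have to be made.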

Migration

Migration of resources is a form of preservation. Migration means moving resources from one operating system to another, or from one storage system to another. This may raise questions about emulation and performance: can the resource be successfully extracted from its old system, and will it behave in an acceptable way in the new system?
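One part of the extraction question can be checked mechanically. The following is a minimal sketch, assuming the resources are ordinary files being copied from one storage location to another; the function names and the sample content are hypothetical. It copies each file and compares checksums of the original and the copy. It confirms only that the bytes arrived intact, not that the resource behaves acceptably in the new system, which still needs to be assessed separately.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_and_verify(source_dir, target_dir):
    """Copy every file from source_dir to target_dir.

    Returns the relative paths of any files whose copy does not match
    the original (an empty list means every copy matched).
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    mismatches = []
    for source in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(source_dir)
        target = target_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        if sha256_of(source) != sha256_of(target):
            mismatches.append(str(relative))
    return mismatches


# Demonstration with throwaway directories and invented content.
with tempfile.TemporaryDirectory() as old, tempfile.TemporaryDirectory() as new:
    (Path(old) / "pages").mkdir()
    (Path(old) / "pages" / "index.html").write_text("<h1>Home</h1>")
    print(migrate_and_verify(old, new))
```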

Getting Other People to Do it for You

There are a number of third-party Web harvesting services which may have a role to play in harvesting your Web site:

UKWAC: The UK Web-Archiving Consortium [1] has been gathering and curating Web sites since 2004. To date, UKWAC's approach has been very selective, determined by written selection policies which are in some ways quite narrow; it currently covers only UK HE/FE. However, it is now possible to nominate your Institutional Web site for capture with UKWAC.
The Internet Archive: The Internet Archive [2] is unique in that it has been gathering pages from Web sites since 1996. It holds a lot of Web material that cannot be retrieved or found anywhere else. There are a number of issues to consider when using the Internet Archive. To date it lacks any explicit preservation principle or policy and may not have a sustainable business model, so its use cannot guarantee the preservation of your resources. The Wayback Machine also has technical limitations, e.g. gaps between capture dates, broken links, database problems, failure to capture some images, and no guarantee of capture to a reliable depth or quality. The National Archives use a model where they contract out collection to the Internet Archive, but also maintain the content themselves.
HANZO: Hanzo Archives is a commercial Web-archiving company [3]. They claim to be able to help institutions archive their Web sites and other Web-based resources, and they offer a software-as-a-service solution for Web archiving.

It is possible for ownership to be shared at multiple levels. For instance, one can depend on a national infrastructure or service to do the actual preserving, but still place responsibility on the creator or the institution to make use of that national service.
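If you rely on a third-party archive, it may be worth checking how evenly your pages have been captured. The gaps between capture dates mentioned above for the Wayback Machine can be found with a few lines of code. This is a sketch under stated assumptions: it assumes you have already obtained a list of capture timestamps for a page in the 14-digit form YYYYMMDDhhmmss, the function names and the 90-day threshold are arbitrary choices for this example, and the sample timestamps are invented.

```python
from datetime import datetime, timedelta


def parse_capture(timestamp):
    """Parse a 14-digit capture timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def find_gaps(timestamps, max_gap_days=90):
    """Return (start, end, days) for consecutive captures more than max_gap_days apart."""
    captures = sorted(parse_capture(t) for t in timestamps)
    gaps = []
    for earlier, later in zip(captures, captures[1:]):
        delta = later - earlier
        if delta > timedelta(days=max_gap_days):
            gaps.append((earlier.date().isoformat(), later.date().isoformat(), delta.days))
    return gaps


# Invented capture timestamps for one page, for illustration only.
SAMPLE_CAPTURES = ["20050114093000", "20050301120000", "20060715081500", "20060820101010"]

for start, end, days in find_gaps(SAMPLE_CAPTURES):
    print(f"No capture between {start} and {end} ({days} days)")
```

Any period reported here is one in which changes to the page will not be recoverable from that archive, which may influence whether you also need to harvest the site yourself.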

References

[1] UKWAC, <http://www.webarchive.org.uk/>
[2] The Internet Archive, <http://www.archive.org/>
[3] HANZO, <http://www.hanzoarchives.com/>
