Brian Kelly presented a paper on "Approaches To The Preservation Of Web Sites" at the Online Information 2002 conference, held at Olympia, London on 3-5 December 2002. The talk took place in the "Archiving the web: tackling digital preservation" session, which ran from 16:00 to 17:30 on Tuesday 3rd December 2002.
The paper is available from the University of Bath repository.
The author has been involved in studies of the Web sites provided under UK and EU funding programmes. The studies have shown that a number of the Web sites ceased to exist once the funding finished, thereby denying access to project deliverables. Funding bodies within the UK's Higher and Further Education communities are justifiably concerned at the loss of potentially valuable scholarly resources.
The author will present the findings of a study to investigate the digital preservation of Web sites, by reviewing the scale of the UK Web, reporting on a project to mirror a number of Web sites and making recommendations for the preservation of Web sites.
The paper will give an indication of the scale of the problem by measuring the extent of the UK Web space using a number of publicly available resources, including search engines such as AltaVista and Google, the Netcraft survey of Web servers, data provided by the Internet Archive and OCLC's Web Characterization Project, and studies of the "Invisible Web".
The reasons for variations found in estimating the size of the Web will be discussed and the implications for a large-scale Web-preservation strategy explored.
The paper will outline the reasons for Web site preservation and will attempt to answer the question "What are we preserving and why?"
The paper will then review the experiences of a project to mirror a small number of Web sites, using a harvesting approach based on Web mirroring tools.
The limitations of this approach will be described, including both the technical challenges and legal issues. The technical issues include defining the extent of the Web site, attempting to mirror dynamic content and the limitations of a client-side view of Web sites. The legal pitfalls in digital preservation will be summarised.
Alternative approaches to Web site preservation will be described, including making use of the Internet Archive's large-scale archives of Web sites.
Advice will be provided on ways in which Web site developers can facilitate mirroring of their Web sites by adopting various best practices, such as URI naming policies, avoiding the disclosure of backend technologies and adopting accessible design principles to ensure that the Web site is available to automated software agents (as well as to people with disabilities).
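As an illustration of the URI advice, the following minimal Python sketch flags URIs that may be awkward to mirror. It is not taken from the paper: the function name uri_warnings, the list of file extensions and the two checks are assumptions chosen for the example, and a real URI naming policy would probably cover more cases.

```python
from urllib.parse import urlsplit

# Hypothetical list of file extensions that reveal the server-side technology.
BACKEND_EXTENSIONS = {".asp", ".aspx", ".php", ".jsp", ".cgi", ".pl", ".cfm"}


def uri_warnings(uri: str) -> list[str]:
    """Return reasons why a URI may be hard to mirror or may break later."""
    parts = urlsplit(uri)
    warnings = []

    # Take the extension of the last path segment, if it has one.
    last_segment = parts.path.lower().rsplit("/", 1)[-1]
    dot = last_segment.rfind(".")
    extension = last_segment[dot:] if dot != -1 else ""

    if extension in BACKEND_EXTENSIONS:
        warnings.append(f"exposes backend technology ({extension})")

    # Content selected by a query string is usually generated dynamically.
    if parts.query:
        warnings.append("depends on a query string")

    return warnings


if __name__ == "__main__":
    for uri in (
        "http://www.example.org/project/report.asp?id=42",
        "http://www.example.org/project/reports/final/",
    ):
        print(uri, uri_warnings(uri))
```

Under these assumptions the first example URI is flagged twice, once for the .asp extension and once for the query string, while the second produces no warnings.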
The paper will conclude by outlining approaches for a comprehensive strategy for the preservation of Web sites.
Brian Kelly is UK Web Focus - an advisory post on Web issues for the UK's Higher and Further Education communities.
Brian is based at UKOLN - a national focus for digital information management, located at the University of Bath.
Brian has been involved in Web development and support activities since 1993, initially as an early adopter of the Web while based at the University of Leeds. He then worked as a network trainer at Netskills, University of Newcastle. He moved to his current post at UKOLN in October 1996.
Brian is a regular columnist on Web issues in the Ariadne e-journal. He has written articles and papers and given presentations at a wide range of events, including the international World Wide Web conferences.
Brian was involved in a study on the digital preservation of Web sites during the summer of 2002.