A JISC/BRITISH LIBRARY Workshop as part of the Electronic Libraries Programme (eLib)
Organised by UKOLN
27th and 28th November 1995 at the University of Warwick
British Library R&D Report 6238
This report may be reproduced by photocopying for consultation by
interested parties inside and outside higher education institutions.
© The British Library Board 1996
© Joint Information Systems Committee of the Higher Education Funding
Bodies 1996
The opinions expressed in this report are those of the contributors and
not necessarily those of the British Library or JISC.
RDD/C/185
British Library R&D Reports are published by the British Library
Research and Development Department and may be purchased as photocopies or
microfiche from:
the British Thesis Service,
British Library Document Supply Centre,
Boston Spa,
Wetherby,
West Yorkshire,
LS23 7BQ.
This report of the workshop was prepared by The Marc Fresko Consultancy
Telephone 0181 645 0080 E-mail marc@easynet.co.uk
INTRODUCTION
PRESENTATIONS
SYNDICATE DISCUSSIONS
ANNEXES
This workshop was sponsored by JISC and the British Library, and organised by UKOLN. JISC is sponsoring many projects related to electronic libraries. Some of these projects are already creating digital resources, such as teaching materials resource banks and a range of electronic journals. In the near future, the Arts and Humanities Data Service expects to let contracts which will create data resources; and a call for proposals for World Wide Web pre-print services has recently been announced. JISC is active on behalf of the university community in negotiating licences for a large range of datasets, including electronic journals. The British Library has many reasons to be interested: as a major research library, as the leading source of funding for library-related research and development, and as a legal deposit library. Its declared intention is to develop collections in electronic formats and to deliver an increasing number of services to its users electronically. Like JISC, the British Library Research and Development Department is supporting electronic library research in a number of institutions.
The workshop was held shortly after the CPA/RLG Task Force in the US issued a draft report on digital archiving. The contents of this draft acted as a touchstone for the workshop, shaping ideas and prompting discussion (particularly on the applicability of its recommendations in the UK). We are grateful to the Task Force for the work which went into producing this draft.
This level of activity serves to underline the importance of digital information, and the rate at which it is being generated. The effectiveness of our efforts on preservation, including our early steps in this workshop, will have a marked impact on the long term credibility of the use of digital media for academic purposes.
The workshop was organised in the UK, with highly-valued contributions from the USA. The discussion centred on UK needs and structures, but we recognise that activity in this area increasingly takes place in an international context.
The aims of the workshop were to:
We believe that these aims were met. Attendees listened attentively to several original and well-informed presentations from experienced practitioners. These presentations are reproduced in the body of this report. Resulting from these presentations, and from the lively discussions which followed them, eighteen action points were identified (see "Action Points" below). These points will form a realistic framework for us to structure our early activities to promote understanding and knowledge of this important field.
However, the Action Points presented below are but a beginning. Much of the discussion thus far has inevitably been very generic, too generic in some cases for specific conclusions to be drawn. The quality of our debates will be improved by increasing specificity, by developing and using some stratification and models of the realm of preservation; and the value of "learning by doing" must equally be recognised.
Eighteen potential action points emerged during the workshop. All were discussed at some stage, and received a measure of agreement; however, it is, naturally, too soon for all the points to have unanimous or unqualified support. Accordingly, they are presented here in a "raw" format, before full development. The British Library and JISC expect to take these points as a basis for initiating further activity, once they have been appropriately refined.
This report is an account of the presentations at the Long Term Preservation of Electronic Materials Workshop held as part of the Electronic Libraries Programme (eLib). Most of the accounts were assembled by the author using notes taken during the workshop and copies of speakers' notes, slides and handouts; they are not formal papers submitted by the presenters. Two exceptions are noted in the body of the report.
I am grateful to all participants and contributors for their co-operation in supplying materials. In all cases, significant questions and answers discussed after the formal presentations have been integrated into the accounts of the presentations. All credit for the information in this report belongs to the speakers; all blame for errors or inaccuracies remains mine.
Marc Fresko
marc@easynet.co.uk
Associate University Librarian for Technical and Networked Information Services, Rutgers University Libraries
This section is an edited version of a paper supplied by Peter Graham. Preliminary forms of this material were presented at the ALCTS Institute: The Electronic Library (October, 1993 and 1994) and at a task force meeting of the Coalition for Networked Information (November, 1993). In a different form this paper was published as "Requirements for the digital research library" in College and Research Libraries (July, 1995).
ABSTRACT
This paper is about what it means to be a research library in the
electronic age. It draws on the traditional definition of libraries,
particularly research libraries. It goes on to describe several components
of libraries in the electronic environment, covering on the way
preservation challenges which are new to electronic resources. The
emphasis throughout is on the need for custodians of a digital research
library to provide for the preservation of information at every step, in
contrast to the print environment where these issues may be put off for
some time. The paper concludes by describing the commitments needed to
effect such a library, some of which are wholly new.
INTRODUCTION: DIGITAL RESEARCH LIBRARIES
The topic of this paper is the long-term preservation of electronic
information (note: not electronic materials). Both archives and research
libraries have this as their concern. It should be noted in passing that
our concern is not merely technological, but social. Paul de Man is said
to have remarked "Technology burns history, leaving no material
residue." Most of the following addresses the research library rôle
in preserving our culture, but will have some relevance to the archival
community as well.
What in fact is a digital research library? The answer merges the histories, capabilities and missions of research librarianship and of computing science to produce a new service meeting long term needs.
The mission of research libraries is to acquire information, organise it, make it available and preserve it. This has been their significant, distinctive and successful rôle with print and other artefactual materials for the past several hundred years. An implicit mission of computing science has been to make the benefits of computing technology of use to society at large. These missions, needs and capabilities must now come together to assure the continuity of scholarship. It will take conscious, planned efforts within both librarianship and computing to make this happen.
The primary requirement for a digital research library is that from the start it be committed to organising, storing and providing electronic information for periods of time longer than human lives. A library is not simply a network full of databases nor is it simply a building full of books. A digital research library is a collection of electronic information organised for the long term.
Many libraries of all kinds around the world now provide an increasing volume of scholarly information to their clients to meet current information needs. However, research libraries have only begun to take on the provision, organisation and preservation of information for the long term, that is, with the same long-term commitment they made for print materials.
THE NECESSITY FOR A SOLUTION
Until long term commitments are made, many currently proposed
solutions will have only temporary effects. For example, discussion of
cataloguing network resources will remain tentative, for until resources
being catalogued have a permanent network presence (whether at fixed or
virtual locations), the cataloguing that points to them must also have an
ephemeral quality. Similarly the expensive products of recent valuable
digitising demonstration projects will be at risk after only a few years
if tools and commitments are not in place for the preservation of what has
been achieved.
More important, the willingness of the scholarly community to give serious weight to electronic information depends upon scholarly trust in such information being dependably available, with its authenticity and integrity maintained. This trust is bound up with the future of electronic journals in the academic tenure process. The ability of the academy to count on long term, secure existence of electronic scholarly work will be an important determinant of the prestige and success of academic electronic publishing. Thus both libraries and universities have a stake in helping electronic publishing to succeed, and therefore have an interest in establishing secure digital research libraries.
Reader needs will continue to be what they long have been. Readers will want information to be reliably locatable, so that when they go there (whether personally or on the net) they can expect to find what they are looking for. Readers will want information easily accessible: the cataloguing must be clear and accurate, and the information must be promptly retrievable. In the electronic environment the needs for access tools will be more evident, and readers will expect appropriate and standard software to be readily available. Readers will expect information that was placed in the library's care a long time ago to be available; and they will expect the integrity of the information they get from the library to be assured.
Implementation of a digital research library will require two major specific components and three kinds of new commitments. The two major components are the electronic repository, and the access tools and policies. The three major commitments are organisational, fiscal and institutional. In the following, the two components are discussed at more length, yet as technical problems they are probably the easiest to solve; they will only cost money. The institutional commitments will be much more difficult to achieve.
In spite of the greater space it will take, the description of the components will be in a cursory form. Each could be developed in great detail, but at the moment the outline and overall programme are most important. Early implementations will test many of these assumptions and will add more requirements to the list. Work needs to begin.
COMPONENTS OF A DIGITAL RESEARCH LIBRARY
A digital research library will be manifest to its users as
collections of information existing in various places and accessible
through the use of widely available tools. A locus of information may be
called the electronic storage repository.
Over time, we will learn how collection development plays out in an access environment as well as in an ownership environment. It is sometimes loosely proposed (seldom by librarians) that libraries need not acquire electronic information, for it will be available somewhere on the network. Such proposals ignore the obvious truth that some institution must still, in the end, take responsibility for information, and that this has always been a definition of the library responsibility.
There will be many electronic storage repositories, responding both to requirements of redundancy and to the individual needs of institutions. In contrast to print collections, it is unlikely that there will be a high degree of content duplication across large electronic repositories, since for most purposes existence in a single place allows world-wide access. Aside from their actual contents, however, repositories that are part of a digital research library will have many common characteristics. Some of these are described below.
Megadocument Contents
Even an initial repository will comprise many gigabytes of information,
growing quickly to millions of electronic documents. Disk storage is cheap
and the possible resources are plentiful.
Sources and Potential Participants
It is easy to cite numbers of electronic scholarly resources that now
exist. A few are noted here only as examples:
These are only examples. Nothing, of course, should be selected automatically; collection development policies should be adapted and followed. The continuing substantial costs of providing electronic information will require that electronic collection decisions be made even as carefully and parsimoniously as for print.
Backup Mechanisms
Backup/restore procedures must be in place. They must be automated and
economical, for libraries are never likely to have expensive labour
available in quantity. Backups must be multi-generational, using remote
storage, with regular disaster simulations and tests.
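
As an illustration only, a multi-generational retention rule of the kind described above might be sketched as follows in Python. The function name `generations_to_keep` and the retention counts (seven daily, four weekly and twelve monthly generations) are assumptions made for this sketch; the paper does not prescribe a scheme.

```python
from datetime import date, timedelta


def generations_to_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain (hypothetical three-generation rule).

    daily:   keep every backup taken in the last `daily` days
    weekly:  keep the newest backup in each of the most recent `weekly`
             ISO weeks that contain a backup
    monthly: keep the newest backup in each of the most recent `monthly`
             calendar months that contain a backup
    """
    keep = set()
    newest_per_week = {}
    newest_per_month = {}
    for d in sorted(backup_dates):
        if d > today:
            continue  # ignore anything dated in the future
        if (today - d).days < daily:
            keep.add(d)
        # Dates are visited in ascending order, so later dates overwrite
        # earlier ones and each key ends up holding its newest backup.
        iso_year, iso_week = d.isocalendar()[:2]
        newest_per_week[(iso_year, iso_week)] = d
        newest_per_month[(d.year, d.month)] = d
    for key in sorted(newest_per_week)[-weekly:]:
        keep.add(newest_per_week[key])
    for key in sorted(newest_per_month)[-monthly:]:
        keep.add(newest_per_month[key])
    return keep


# One backup per day for 100 days, ending on the second day of the workshop.
today = date(1995, 11, 28)
backups = [today - timedelta(days=n) for n in range(100)]
retained = generations_to_keep(backups, today)
print(len(backups), "backups taken,", len(retained), "retained")
```

A rule of this kind only decides which generations to keep. Remote storage and the regular disaster simulations called for above remain matters of procedure rather than of code.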
Staged Access
"Staging" refers to the prioritised use of different
mechanical methods of storing data as it waits to be recalled. All data
does not need to be immediately available on the fastest and most
expensive storage media. Alternatives for providing immediate online
access to the enormous potential volume of scholarly information need to
be provided. What can be off-line, and how can it be retrieved?
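
A minimal sketch of one possible staging policy follows, in Python. It assumes that the only criterion is the time since a document was last requested; the tier names and the thresholds of 30 and 365 days are hypothetical and are not drawn from the paper.

```python
from datetime import date

# Hypothetical tiers, fastest and most expensive first: (name, maximum idle days).
TIERS = [
    ("online disk", 30),
    ("nearline optical jukebox", 365),
]
OFFLINE_TIER = "offline tape"


def storage_tier(last_accessed, today):
    """Choose a storage tier from the number of days since the last access."""
    idle_days = (today - last_accessed).days
    for name, max_idle_days in TIERS:
        if idle_days <= max_idle_days:
            return name
    # Anything idle for longer than every threshold is staged off-line.
    return OFFLINE_TIER


today = date(1995, 11, 28)
print(storage_tier(date(1995, 11, 20), today))
print(storage_tier(date(1995, 3, 1), today))
print(storage_tier(date(1990, 1, 1), today))
```

A real policy would probably weigh other factors as well, such as document size and the cost of recall, but the principle of ranking media by speed and cost is the same.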
Data Structure Standards
In a repository, does information simply exist as is (as first
created) or is complementary information associated with it? Widely
differing possibilities include SGML (Standard Generalised Markup
Language) headers, ICPSR codebooks, picture captions, hypertext links and
early software versions for use with data files. OCLC and NCSA jointly
held a recent conference at which a number of core elements of metadata
were defined (the "Dublin Core").
Redundancy
It will be important to establish standards for the number of
repository locations necessary to assure long-term existence of electronic
information and access to it. The issues concern longevity of information
but also the dynamic interplay between costs of network bandwidth and
response time, and costs of storage. Geographic location, nationalism and
regionalism will likely play a rôle (at least intercontinentally,
and probably intracontinentally). Major institutions may separately or
consortially establish repositories. It is not yet clear how much
redundancy of their components will be desirable among them. In addition,
it seems likely that many library consortia will be formed on the basis of
joint contracts with information vendors, also leading to information
redundancy. One location won't do for a major electronic document or set;
will two, or three? How many?
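
The question "How many?" can be given a rough arithmetic shape, although only under strong assumptions. The Python sketch below assumes that each repository loses a given document independently and with the same probability, which geography and shared vendors may well contradict. The function name `copies_needed` and the figures used are purely illustrative.

```python
def copies_needed(loss_probability, acceptable_risk):
    """Smallest number of independent copies whose joint loss risk is acceptable.

    loss_probability: assumed chance that any one repository loses the document
    acceptable_risk:  largest tolerable chance that every copy is lost
    With n independent copies the chance of losing all of them is
    loss_probability ** n.
    """
    if not 0 < loss_probability < 1:
        raise ValueError("loss_probability must lie strictly between 0 and 1")
    if not 0 < acceptable_risk < 1:
        raise ValueError("acceptable_risk must lie strictly between 0 and 1")
    copies = 1
    risk = loss_probability
    while risk > acceptable_risk:
        copies += 1
        risk *= loss_probability
    return copies


# Illustrative figures only: a one-in-two chance of loss at each site,
# and a tolerance of one in ten for losing every copy.
print(copies_needed(0.5, 0.1))
```

The calculation says nothing about bandwidth, response time or storage cost, which the paragraph above identifies as equally important parts of the decision.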
Preservation modes
The electronic repository must be preserved. Preservation of
information needs to be looked at from at least three points of view:
medium preservation, technology preservation and intellectual
preservation. The problem, and what is new about preservation in the
electronic environment, is that electronic information must now be dealt
with separately from its medium. This can be illustrated by an analogy,
one which is very oversimplified, as readers will be aware: if a book is
placed on a closet shelf, and the closet door is closed for 500 years,
then at the end of that time one can, broadly speaking, open that door and
read the book. With an electronic resource one does not have that
confidence after ten years, and for several reasons.
Medium Preservation
The artefact or medium can decay. Medium preservation is the
concern for preserving the medium on which information is stored, such as
tapes, disks, optical disks, CD-ROMs and the like. Backup is appropriate,
as is copying to other devices of the same kind, a technique which we know
of as "refreshing"; we speak of refreshing a tape by copying its
contents to another similar tape. In the current climate of protection of
intellectual property rights, copyright concerns must be recognised
(recent proposed USA legislation has many flaws, but it does recognise
this need).
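
Refreshing is only useful if the new copy is known to match the old. The Python sketch below copies one binary stream to another and then reads both back to compare them. The in-memory streams stand in for the old and new media, and the function name `refresh` is an assumption of this sketch.

```python
import io


def refresh(source, target, chunk_size=64 * 1024):
    """Copy every byte from `source` to `target`, then verify the copy.

    Both arguments are seekable binary streams standing in for the old and
    new media. Returns the number of bytes copied and raises IOError if the
    read-back comparison finds any difference.
    """
    source.seek(0)
    target.seek(0)
    target.truncate()  # discard anything already on the target
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        target.write(chunk)
        copied += len(chunk)

    # Verification pass: re-read both streams from the start and compare.
    source.seek(0)
    target.seek(0)
    while True:
        original = source.read(chunk_size)
        duplicate = target.read(chunk_size)
        if original != duplicate:
            raise IOError("refreshed copy differs from the original")
        if not original:
            break
    return copied


old_medium = io.BytesIO(b"catalogue records " * 1000)
new_medium = io.BytesIO()
print(refresh(old_medium, new_medium), "bytes refreshed and verified")
```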
Technology Preservation
More problematic than medium decay are the rapid changes in the means
of recording, in the storage formats and in the software that allows
electronic information to be of use. We need to be aware of technology
obsolescence as even more of a problem than medium decay, and undertake
steps of technology preservation. Rather than simply refreshing,
we also need to speak of migration: of migrating information forward
through technology stages as they become available and as the old
technologies cease being supported by vendors and the user community.
Intellectual Preservation
There remains a third preservation requirement, intellectual
preservation, which addresses the integrity and authenticity of the
information as originally recorded. Preservation of the media and of the
software technologies will serve only part of the need if the information
content has been corrupted from its original form, whether by accident or
design. The need for intellectual preservation arises because the great
asset of digital information is also its great liability: the ease with
which an identical copy can be made, quickly and flawlessly, is paralleled
by the ease with which an undetectable change may be made.
Here are some of the intellectual preservation questions that arise for a researcher using electronic information: How can I be sure that what I am viewing is what I want to see? How do I know that the document I have found is the same one that you read and made reference to in your footnote? How can I be sure that the document I now read has not been changed since the last time I read it? Note that in this instance backup is not the issue; rather, it is how we know which version we have or don't have.
There are at least three kinds of possible changes:
1. Accidental change: for example, data loss during transfer, accidents
during updating, saving the wrong version.
2. Intended change (well meaning):
a. New versions or drafts (authorial texts, legislative bills);
b. Structural changes: updating Books In Print or a telephone directory;
c. Interactive documents, e.g. hypertexts with note-taking capabilities.
3. Intended change (fraud): e.g. of one's own work to cover one's tracks or change evidence; or of another's work. Possible examples: political papers, laboratory notebooks, historical rewriting, legal documents, contracts.
Whatever technique is used must provide generality, flexibility, ease of use, privacy protection where desired, openness of documents where desired, low cost - and functionality over long periods of time on the human scale.
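
One family of techniques that addresses part of this requirement is the cryptographic digest: a short fingerprint recorded when a document is deposited and recomputed whenever the document is read. The Python sketch below is an assumption-laden illustration rather than the technique the paper has in mind. The choice of SHA-256 and the names `record_manifest` and `audit` are the editor's.

```python
import hashlib


def digest(content):
    """Return a hexadecimal SHA-256 fingerprint of a document's bytes."""
    return hashlib.sha256(content).hexdigest()


def record_manifest(documents):
    """Record a fingerprint for every document at the time of deposit."""
    return {name: digest(body) for name, body in documents.items()}


def audit(manifest, documents):
    """Compare the current documents against the fingerprints recorded earlier."""
    report = {}
    for name, recorded in manifest.items():
        if name not in documents:
            report[name] = "missing"
        elif digest(documents[name]) == recorded:
            report[name] = "unchanged"
        else:
            report[name] = "changed"
    return report


deposited = {"bill-draft-1": b"Section 1. Short title.", "notebook": b"Run 7: 3.2 mV"}
manifest = record_manifest(deposited)

# Some time later, one document has been altered by a single character.
later = {"bill-draft-1": b"Section 1. Short title.", "notebook": b"Run 7: 8.2 mV"}
print(audit(manifest, later))
```

A digest shows that a change has occurred, not whether it was accidental, well meaning or fraudulent. The manifest must itself be protected from alteration, which is an organisational matter as much as a technical one.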
So far, this paper has dealt with the electronic repository component of the digital research library. The second major component comprises the Access Tools and Policies.
Usage and Retrieval Mechanisms
The digital research library must of course support the full panoply
of present access tools (for example online catalogues and OPACs, FTP,
gopher, and certainly the World Wide Web and its multiple browser
clients). The digital research library must also prepare to support the
new access tools that are likely to appear regularly, in particular the
implementations of Uniform Resource Names (URNs) and Uniform Resource
Characteristics (URCs).
The granularity of documents needs to be addressed: how may one retrieve only part of a document when the full document may be of substantial size (for example the full text of Moby-Dick or of a legal code; or a presentation of many images from which one is desired). Must documents be pre-coded (or pre-marked) to allow such granular access, or can access-time mechanisms be made available?
We need to be able to provide documents which change rapidly, for example ANSI standards, monthly statistical reports and draft document versions. Therefore we will need to develop techniques for dynamic documents and consequent archiving and labelling, as well as flags indicating obsolescence or suppression (or conversely indicating status as an authorised version). A form of SGML may be appropriate in some cases, for example the format proposed by the TEI (Text Encoding Initiative). The URNs and URCs referred to above are likely to be part of the solution to this problem.
Cataloguing
Providing access to voluminous information is an intellectual problem
that historically has been solved in the print environment by abstracting
and indexing services and by library cataloguing. We have developed
extensive rules and procedures to ensure consistency and accuracy. These
tools, adapted to suit new needs, will work for electronic information as
well. They should be linked to the new retrieval mechanisms so that users
can smoothly navigate from location of information to retrieval of it
without having to shift their mode of use. Early mechanisms will probably
link catalogue records to documents using tools such as the WWW, the
Uniform Resource Locator, and the MARC 856 field. SGML may offer other
possibilities for linking of certain documents through its document
description techniques. In any case, there eventually will need to be
consensus both on the display of physical electronic locations in
bibliographic records and on representation of virtual locations.
If the digital research library's catalogue system works well, users will be able to search for information, locate bibliographic records for desiderata, and use those records directly to draw the desired information to their workstation. Where an authentication technique is available, we must provide means for including and testing the certification. Standards for such cataloguing and remote access still need to be developed, particularly for providing catalogue access to non-owned materials.
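
The link from catalogue record to document can be sketched as follows in Python. The dictionary layout is a deliberately simplified, hypothetical stand-in for a MARC record and not a real MARC serialisation. It assumes only that field 245 subfield a carries the title and field 856 subfield u carries the network address. The repository address is invented for the example.

```python
# A simplified catalogue record: tag -> list of fields, each field a
# dictionary of subfield code -> value. Hypothetical layout for illustration.
record = {
    "245": [{"a": "Moby-Dick"}],
    "856": [{"u": "http://repository.example.org/texts/moby-dick"}],
}


def electronic_locations(record):
    """Return every network address held in the record's 856 fields."""
    return [field["u"] for field in record.get("856", []) if "u" in field]


title = record["245"][0]["a"]
for address in electronic_locations(record):
    print(title, "->", address)
```

A record for a purely print item would simply return an empty list, so the same search and display routines could serve both kinds of holding.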
Remote Access
A digital research library should from the outset be intended for access
from multiple remote locations. Presumably the bibliographic utilities,
such as OCLC and RLIN, will play their accustomed rôle.
Internet-wide access should generally be possible to a digital research
library. In initial pilot implementations it may be advisable for a few
libraries to plan and develop a closed set of catalogue and access
mechanisms to their own individual libraries' electronic collections. In
the process they will create catalogue records that allow access to
electronic information of their own. We will need procedures for
dissemination of such catalogue records. It is not only a technical
matter, but a policy matter, for such catalogue records will then provide
access to local holdings for non-local readers. There are compensation,
capacity and intellectual property considerations here.
Fees and freedom
In practice these are often linked issues. Standards and techniques
will be necessary to solve a knot of interconnected problems surrounding
access and ownership, including:
COMMITMENTS
Much of what has been described so far is merely technical. The
outlines of solutions are becoming clear even if the details remain to be
worked out (and the non-trivial matters of cost have admittedly not been
dealt with). More difficult will be the social compacts, that is the
agreements on standards, on intellectual property and on access modes.
Most difficult of all to achieve, if electronic preservation and access are to be accomplished on any significant scale, will be the long term commitments to these goals by institutions. Nothing makes clearer that a library is an organisation, rather than a building or a collection, than the requirement for institutional commitment so that electronic information can have more than a fleeting existence. In this context, we should welcome the recent statement by the CPA/RLG Task Force on Archiving of Digital Information in their draft report that "the key that unlocks the path to the digital environment is not technological but organisational".
Three kinds of commitment will be necessary:
Commitment to organisational change
The organisation of libraries is already changing as electronic
information increasingly becomes part of their charge. Most research
libraries now have substantial systems departments. Some libraries locate
the responsibility for electronic information distinctly from that for
print. Other libraries see the forms as inseparable and include electronic
responsibilities along with artefactual responsibilities in assignments
for collection development, cataloguing and public service.
What is new will be the permanent assignment of staff responsibility for the long term maintenance of electronic information within a library. There is no obvious artefactual parallel for this responsibility: the various departments of circulation, stack maintenance, preservation and physical plant now share it for print. Nor are there present parallels in academic computing centres, where staffs typically focus on technological advance and availability, leaving data to the users. The electronic preservation responsibility will be focused as it will require technical expertise likely to be located in a single functional area.
It is by no means clear that this functional area will be what we used to call the library's systems department. As libraries move more into the electronic environment the traditional tripartite division of libraries into public services, technical services and collection development will continue but in more fluid arrangements. People who combine bibliographic understanding, problem-solving abilities and process orientation have often been found in technical services but also elsewhere in libraries. Such librarians will take on the demanding new technical, collection and service responsibilities for long term support of digital collections. It is also becoming clear that the traditional computing community is fertile with ideas, analysis and skills that will be important to electronic library goals, for example in the work of the Internet Engineering Task Force work groups on Uniform Resource Identifiers and in the work of Jerome Saltzer of MIT.
Fiscal Commitment
The permanent existence of a digital research library will require
assured continuity in operational funding. Almost any other library
activity can survive a funding hiatus of a year or more. Acquisitions,
building maintenance, and preservation can be suspended, or an entire
staff can be dispersed and a library shut down for several years, and the
artefactual collections will more or less survive. But digital
collections, like the online catalogue, require continual maintenance if
they are to survive more than a very brief interruption of power,
environmental control, backup, technological advance and related technical
care.
Our online catalogues are cheap compared to what the digital research libraries will cost. The hardware and software maintenance costs for online catalogues have reached a rough steady state, and the capital costs for new OPACs are decreasing relative to the capabilities provided. The catalogue size will continue to increase, but catalogue records are small relative to the information to which they refer. Digital research libraries, however, as a proportion of the library's supply of information, will grow for the foreseeable future, and the quantity of information requiring care will become considerable (and much larger than the catalogue). Unit costs of storage are likely to continue falling for some time, which will make the financial burden manageable. And, in the US at least, staffing costs are not expected to increase, simply because overall staff growth in most libraries is likely to be consciously restrained for the foreseeable future; reassignments, however, are likely.
Long term funding will be required to assure long term care. Libraries and their parent institutions will need to develop new fiscal tools and use familiar fiscal tools for new purposes. Public institutions, usually constrained to annual funding, will have particular difficulties; existing procedures for capital or plant funding may provide precedents. One familiar technique is the endowment. It has been difficult to obtain private funding for endowments and services rather than books and mortar, but it is possible (it will not be hard in the current environment to create "virtual bronze plaques" that can be guaranteed to be around for some years). Institutions might also build endowments out of operating funds over periods of time. In the UK at least there appears to be some evidence that the government may actually want to listen to needs. In the US there is almost none.
Some revenue streams associated with digital research libraries may be practical. Consortial arrangements may allow for lease or purchase of shares in a digital research library. Shorter term access might be provided to other institutions on a usage basis. Access could be sold to certain classes of users, for example businesses, non-local clienteles, or specific information projects. New relations with publishers, presently difficult to perceive through the mists rising from intellectual property, might result in fee income for storage of electronically published materials for the copyright lifetime, during which publishers collect usage fees. With commitment and imagination in the long term, fiscal tools will be found.
Institutional Commitment
All these are instrumental means of accomplishing the greatest
requirement, that of conscious, planned institutional commitment to
preserve that part of human culture which will flower in electronic form.
The advent of electronic information will by no means create museums of
the book out of libraries, as so often alleged. Instead, it will force the
realisation of the distinct rôle of research libraries in preserving
information rather than simply artefacts. Our traditional museums have
always preserved artefacts (often beautiful) that embody information,
while libraries have preserved information that has been embedded in
artefacts which are only occasionally of aesthetic interest in themselves.
The advent of electronic information will accentuate the difference
between these rôles, as libraries and their parent institutions
take on the responsibility for preservation of information in non-
artefactual form.
For the past century most research libraries have been associated with universities, and this connection seems likely to continue in the immediate future. But whatever the institutional parent of a major library, an institution wishing to benefit from electronic information will have to make a constitutional commitment to providing resources. Michael Buckland, of the University of California at Berkeley, has distinguished between a library's rôle and its mission. Buckland suggests that although the rôle of a library is to facilitate access to information, its mission is to support the mission of its parent institution. The implication is that if a university or a state wishes to continue relying on support from its library, it will have to make commitments to support the library's rôle. In the electronic environment, this means new and long-standing financial commitments which the library and its governors together must identify and establish.
The commitment will have to be clearly and publicly made if scholars and other libraries are to have confidence that a given digital research library is indeed likely to exist for the long term. Guidelines or standards will be desirable that define what is meant by a long term commitment. Such standards will define which electronic repositories of data can qualify to be termed part of a digital research library. Just as donors of books, manuscripts and archives now look for demonstration of long term care and commitment, so too will scholars and publishers as they create electronic information and require for it a home.
CONCLUSION
Establishing a digital research library continues the research library
rôle. For a major library or a university to do so should be
considered as natural as acquiring the next book or cataloguing the next
journal. Not to do so will be an abdication of that responsibility. The
skills and understandings of both the library and computing communities
will be essential in carrying out this goal of preserving the human record
in the electronic environment.
The tasks call not so much on new knowledge or new technologies as upon our collective informed commitment; that is, upon will. It is clear that the new knowledge being created in electronic form will not survive by itself if nothing is done. Specific action is necessary. It is extremely gratifying for an American to see a conference such as the present event convened, apparently with some level of support from the academic community and the government, to plan for long-term preservation of electronic information. As so often in human history, it is concerted human action which answers the pessimists; technology need not burn history, for we as librarians shall see to it that there persists an informing residue.
REFERENCES
Following is a short list of references which may be useful.
PROF. DENISE LIEVESLEY
Director, The Data Archive, University of Essex
ABSTRACT
This paper presents a number of factors and changes which are shaping
data archives, and the challenges facing them. It goes on to discuss The
Data Archive as an instance of an established archive, and to explore
important issues of policy and strategy.
THE IMPORTANCE OF DATA ARCHIVES
Data archives are playing, and will continue to play, a growing rôle
in our electronic and cultural institutions. Their importance is
increasing in parallel with the explosive growth in the volume of
electronic data being created. At the same time, the cost of collecting
data is increasing, and perceptions are changing: the value of using and
re-using data is becoming accepted, as is the importance of making
research "transparent". Combining this with the specialist
skills and equipment needed to preserve data makes a strong case for the
importance of data archives.
The very nature of data makes it unique; it is inexhaustible, non-renewable and non-substitutable. That is to say, data cannot be exhausted no matter how much it is used (by contrast with information held in almost any non-digital form); it frequently can only be captured at a specific time, and once that chance has passed the potential data has vanished forever; and nothing other than data can serve the function of data.
A TIME OF GREAT CHANGE
In common with other functions in the academic sector, archiving is
seeing many changes:
Technological Change
Usage of the Internet continues to rise, bringing issues (among
others) of distributed computing, requirements for a wider range of
software, and new media.
Organisational Change
The university sector is growing, and is increasingly obtaining
external funding for academic research. The population of users is
becoming more diverse, and a range of "data brokers" are
appearing in line with their increasing demands for data. And
correspondingly, data providers are increasingly interested in giving
access to their own data.
Cultural Change
New generations of researchers have greater expectations for access to
digital data. The data itself increasingly is being gathered in the
private sector, as data collection agencies are privatised. Privatisation
and the trend towards "open government" both bring the need for
fair competition.
Internationalisation
Simultaneously, users, data and archives increasingly have
perspectives which are not limited to national horizons. This has
complicated the environment for data sharing, since legislation on access,
data protection, confidentiality and copyright varies from state to state;
European legislation is also becoming relevant. The internationalisation
of data has additionally created a new culture among some data users: for
example, it has fostered the ethos of data being a public good.
THE CHALLENGE
The challenge facing data archives is how to anticipate the new whilst
continuing to provide a service. This challenge is giving rise to a great
deal of debate, centring on questions such as:
Various new services are being considered in response to the challenge; these in turn raise issues, including:
THE DATA ARCHIVE AT THE UNIVERSITY OF ESSEX
An instance of a significant archive is the Data Archive at the
University of Essex. Funded by the ESRC, JISC and the University of Essex,
it exists in order to promote wider and more informed use of data in
teaching and research, and to preserve the data so that it remains
accessible over time. Its functions are:
THE COMMUNITIES SERVED
Archives serve a number of communities. These can be divided broadly
into two groups: data producers (including data owners and funders as well
as those who actually produce and deposit the data) and data users (eg
teachers and researchers).
Producers
It is vital to obtain and retain the support of data producers. In
summary, we can do this by offering numerous benefits to producers,
namely:
Note that the Archive (and the ESRC) will normally support effective information services rather than dataset acquisition. It will only support acquisition if there is high demand, a cost-effective mechanism for maintenance and support, and perhaps a key leverage rôle for its support.
Users
The benefits to users are more clear-cut. They are:
This range of benefits should raise a question in our minds: what are users' priorities? Indeed, do we know enough about these priorities? What are their requirements for acquisition, media and formats, speed of service, timeliness, quality, and charges?
ACQUISITIONS POLICY
A selection policy is essential. It is not practical to accept all
potential data without some selection - there is simply too much data
available. The policy must be based on users' likely needs and demands,
and also:
Of course, an integral part of an acquisitions policy is a list of rejection criteria. These should include:
Quality
Quality of data is difficult to pin down. Quality can only be measured
in terms of fitness for purpose, and different purposes have different
needs. Equally, the method of informing users about quality needs to be
determined. A related issue is achieving a balance between measuring or
improving data, and getting data out faster; experience shows that
different European archives are reaching different conclusions.
Finally, we have to remember to consider the legal liability we may have for distributing defective data. At present, the position is not entirely clear.
THE IMPORTANCE OF DOCUMENTATION
Documentation is best produced by depositors, but they need
persuasion, help and guidance to encourage them to produce and deposit it.
The format(s) in which documentation is kept need to be considered;
appropriate format and media are important for ease of use and of
delivery.
The preservation of documentation is also critical: data without its documentation is greatly reduced in worth, if not entirely worthless. Documentation is therefore a resource in its own right, and must be preserved as carefully as the data it describes.
Documentation does not stop at objective description of the data; it can also include contextual data. For example, the value of a data set on a family income survey can be greatly enhanced by the availability of documentation on the income tax structure in force at the time of the survey. Deposit of contextual documentation should be encouraged, though such encouragement may not be easy.
DATA PRESERVATION
The Aims of Preservation
The preservation must include the following:
Preservation Challenges
The key challenges facing data preservation are mainly to do with
growth and change. They are:
Preservation Strategy
A preservation strategy must take these factors and challenges into
account. The strategy needs to incorporate a data management strategy,
which may include the relationship between data preservation and
distribution (eg the data being preserved may not be the same as that
being distributed). The strategy must also recognise the wide range of
access rates which different data sets might experience.
Factors other than data and preservation must be taken into account. The opportunities for staff training and development are key to the success of an organisation, for example. And flexibility is definitely required in an archive: "Change is usually stressful and often unwelcome - but less so if provision for it has been built into the culture and ethos of the entire organisation" (Osborne and Gaebler, "Re-inventing Government"). Finally, nurturing a sense of "service orientation" is critical.
CONCLUSION
Partnerships are essential for the success of data archives. These
partnerships can take a number of forms; they can, for example, allow the
sharing of expertise through existing archives; involve the use of third
party distributors (eg the Manchester centre); exploit the expert
knowledge of data sources and needs; and result in the building of new
facilities (eg R-Cade).
DANIEL GREENSTEIN
Director, Arts & Humanities Data Service Executive
ABSTRACT
This paper is founded on the recognition that structures and policies
of the Arts & Humanities Data Service will gradually develop. It sets
out the structure which is envisaged for the AHDS, and some of the key
functions which it will perform. Finally, a possible stratified collection
policy is outlined.
BACKGROUND
The Arts and Humanities Data Service is a new national service funded
by the Joint Information Systems Committee (JISC) of the Higher Education
Funding Councils. The AHDS's mission is to co-ordinate access to, and
facilitate the creation and use of, electronic resources in the arts and
humanities by offering a range of services to higher education
institutions and their members. This paper is presented by the
Executive-designate of the AHDS. The appointment and the establishment of
the AHDS are both recent, and it follows that the ideas below are at a
relatively early stage in their development.
STRUCTURE OF THE AHDS
Over the next three years, the Executive expects to develop a
structure of three organisational "branches". Each branch will
include one distinct type of service provider. The three types of service
provider will be:
The current level of funding permits only the establishment of data supply services. User support and network service providers will be established only with additional funding (not necessarily from the JISC) or through association with existing services (not necessarily funded by the JISC), and only after feasibility studies have been conducted in these two relatively under-evaluated areas. The Executive is currently commissioning these studies for completion by October 1996.
Data Supply
These services will focus on particular types of data irrespective of
their origins and use within the academic community. They will focus on
collecting, describing/documenting, cataloguing, and preserving
electronic information in their respective domains. They will promote
relevant standards for data creation, description, and preservation, and
will develop guides to good practice for would-be data creators. They will
negotiate access to similar data whether commercially produced or stored
at other public or semi-public sites.
Five services are envisaged. They will focus on:
Data Integration and Catalogue Access
At least one service provider will be established to implement on a
system-wide basis network access mechanisms to AHDS data and metadata
(catalogues, training materials, etc.) wherever these are stored. It will
integrate the facilities and data stored by disparate services (of the
data supply and user support types) in a distributed system which will
provide a genuinely seamless user environment.
Other Organisational Structures
Essential collaboration between service providers of each type will be
assured through a Service Providers' Forum in which managers from each
service provider will participate.
RATIONALE FOR THE ORGANISATIONAL MODEL
The model seeks efficiencies in two ways.
It concentrates expertise which is necessary to collect, catalogue, document, and preserve different types of electronic information (viz. databases, texts, image, time-based, and GIS/site-mapping data). An alternative model which would loosely tie data supply services to academic disciplines would introduce inefficiencies. For example, three data services - one supplying historians, one supplying archaeologists and art historians, and one supplying literary and linguistic scholars - would all have to develop the expertise necessary to create, catalogue, preserve, and describe database materials which emerge in substantial number from each of the disciplines. Concentrating that methodological expertise in one supplier makes economic sense.
The model also conforms to the logic of humanities research and teaching. For example, historians as end users will not be interested in databases exclusively. They will require access to material supplied by the textual, image, time-based, and site-mapping data services. Philosophers, art historians, literary scholars, etc. will have similarly wide ranging data needs. According to this model, subject-based user support services will provide their respective communities with relevant training and other materials for data known to the AHDS irrespective of how, or by which service supplier, they are held. The end user will therefore have greatest contact with his/her user support service, which will travel freely across the range of different data suppliers. The end user need not know that there are several underlying data suppliers. An alternative model would staff each data-supplying service with personnel sufficient to support the whole range of humanities disciplines. This model, however, would introduce substantial overlaps and inefficiencies.
COLLECTION POLICY FOR ELECTRONIC TEXTS
A policy is essential. It will emerge progressively and
collaboratively as the AHDS develops its ideas together with both
providers and users. It is to be hoped that the policy will evolve towards
a stratified and prioritised approach. One possible policy approach for
texts can be summarised as building a series of collections, defined
roughly and tentatively as:
Of course, there may be other holdings in addition to these collections. The key to success will be to recognise the value of different digital resources and to prioritise them by means of extensive negotiations.
ABSTRACT
This paper introduces the current situation regarding legal deposit,
and explores The British Library's proposals for extensions to encompass
electronic and other non-print publications. The scope of proposed
legislation, access mechanisms and possible repositories are discussed.
LEGAL DEPOSIT
The requirement in the UK for legal deposit stems from a 1911 Act of
Parliament. This Act specifies that The British Library, along with the
other specified deposit libraries, is entitled to copies of publications.
The Act applies only to printed publications, though its intent, as
generally understood, is to ensure the preservation and availability of
publications in general.
Of course, publications are now being produced on many media other than paper. The limitation to paper will, if continued, mean that the proportion of publications being deposited will decrease; and potentially some particularly important reference works might not be deposited where they are published solely on (say) CD-ROM. This clearly would cause a problem for those interested in preserving our intellectual heritage.
PROPOSED LEGISLATION
The British Library's approach to this problem is to press for new
legislation which requires deposit for publications on non-print media.
The legislation should cover both current media and media yet to be
devised.
The British Library is recommending that new primary legislation should not be limited to any medium or media. Rather, the Statute should provide for subsidiary legislation to allow for the coverage of future new media.
A proposal is currently being drafted, with the intention of submitting it for consideration by the current Parliament.
WHAT IS A "PUBLICATION"?
With the huge variety of electronic formats, media and business
models, the meaning of the term "published" is no longer as
obvious as with printed media. Therefore we need to agree on a definition
for this term in the context of a requirement for legal deposit.
The British Library suggests as a working definition that any hand-held item which is offered for sale should be considered as "published". Further, sale is not a necessary element of publication, so some free items would also be considered as "published". Obviously there is scope for some grey areas with this kind of definition, and so The Library proposes the establishment of an independent public body to decide whether specific borderline items should or should not be subject to legal deposit. This body would, we assume, be constituted of unpaid citizens, some of whom might be drawn from professional institutions, The Royal Society, The British Academy, etc.
REPOSITORIES
The proposed legislation recognises that no single institution need
cope with all possible deposit media. However, it seems reasonable to
think that the existing deposit libraries may form a basis for the new
deposit scheme. Each deposit library is currently formulating for the
Secretary of State its views on how responsibilities might be divided.
Legal deposit repositories will have to be extremely selective in their choice of electronic publications to accept for deposit if the scale of the deposit operation is to remain manageable. We have identified three classes of non-paper publications:
The technological problems associated with the first two can be regarded as largely soluble. However, the technological and economic context associated with the third is in too great a state of flux for legislation to be appropriate at this stage.
ACCESS
As we all recognise, electronic media bring with them considerable
intellectual property issues. For example, one deposited copy of a
valuable electronic publication might be made available to a wide
population by network; this obviously would be a concern to its publisher.
There are also financial considerations. Whereas the marginal cost of depositing one copy of a printed publication is generally negligible, the costs associated with depositing an electronic item may be substantial. There may be costs resulting from the need to provide documentation and from the removal of copy protection, for example. Consequently a repository can be expected to seek a trade-off between immediate access and long term availability.
Currently, our discussions centre on the levels of public access to deposited items. Given the extreme positions of (1) consultation at one terminal only and (2) unlimited and free remote access, The Library's working party has chosen a compromise. The current proposals are considering:
PRESERVATION
We recognise the difficulties of preserving digital documents, and we
recognise also that we have not yet determined how best to achieve
preservation. There are two fundamentally different approaches, namely (1)
continual migration to new media and formats and (2) provision of original
support environments by hardware and software emulation.
One particularly complex area is that of online databases. As examples of the complexity, we can point out that online databases exist in a multitude of formats; they are constantly and rapidly being changed; and they are made available in a number of forms (which may or may not correspond to our understanding of "publishing" as described above). This makes them too challenging to be considered for legal deposit today, though one day all should be taken within its scope. It is already too late to preserve some publications, yet it is too early for legal deposit to be practical. We shall have to wait until the economics of this form of publishing are less turbulent and better understood.
Associate Professor, School of Information and Library Studies, University of Michigan
ABSTRACT
This paper summarises the findings of a Task Force in the USA which
has recently produced a draft report on the Archiving of Digital
Information. The need for a national infrastructure of digital archives is
argued. Critical issues (operating environment, migration strategies,
intellectual property and finances) are examined. The paper ends with a
summary of the draft report's recommendations.
INTRODUCTION
We tend to dwell on the problems of digital preservation; we can
easily overlook some of the unique benefits that digital storage can
bring. By way of introduction, it is interesting to reflect on some of
these benefits, lest we take them too much for granted. As an example, the
author's notebook computer stores ten years' worth of work, between
600 and 700 large documents. The entire collection was recently migrated
from another computer in about one hour. It is routinely backed up in a
quarter of that time. These are performance levels which we simply cannot
hope to emulate with paper or non-digital formats. Benefits such as this
are inherent in digital formats; they will allow us to perform some
functions more easily than before, and others which we could not
previously perform at all.
DRAFT REPORT OF THE TASK FORCE ON ARCHIVING OF DIGITAL INFORMATION
The author was a member of the task force established by The
Commission on Preservation and Access (CPA) and The Research Libraries
Group (RLG). The task force issued a draft report on 25th August 1995
for comment. This draft will be used as a basis for this presentation.
Information on obtaining and commenting on the draft is given at the end of the paper.
THE TASK FORCE
The task force was charged with the following duties:
THREE KEY CONCEPTS
Three key concepts developed by the task force are used throughout the
report. These are defined below.
Digital Archives
"Repositories of digital information that are responsible for
storing and ensuring, through the exercise of various migration
strategies, the long term accessibility of the nation's social, economic,
cultural and intellectual heritage instantiated in digital form."
Note that this definition distinguishes digital archives from libraries. Whereas libraries have access as a main objective, archives' priorities centre around storage and preservation.
Migration
"A set of organised tasks designed to achieve the periodic transfer
of digital material from one hardware/software configuration to another,
or from one generation of computer technology to a subsequent generation."
The Task Force adopted this definition instead of the concept of "refreshing" which had been used in its original terms of reference, because refreshing was felt to be insufficient in scope.
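The distinction between refreshing and migration can be illustrated with a minimal sketch. It is not taken from the Task Force report: the function names, the use of a SHA-256 checksum and the choice of a character-encoding change as the "migration" are all assumptions made purely for illustration. Refreshing copies the bit stream unchanged, whereas migration re-expresses the same content for a different software configuration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest used to confirm that a copy is bit-identical."""
    return hashlib.sha256(data).hexdigest()


def refresh(data: bytes) -> bytes:
    """Refreshing (hypothetical helper): copy the bits unchanged to a new medium."""
    copy = bytes(data)
    if checksum(copy) != checksum(data):
        raise IOError("refreshed copy does not match the original")
    return copy


def migrate(data: bytes, old_encoding: str, new_encoding: str) -> bytes:
    """Migration (hypothetical helper): re-encode the content for a new environment."""
    text = data.decode(old_encoding)
    migrated = text.encode(new_encoding)
    # The bytes change, so the check is on the content rather than the bit stream.
    if migrated.decode(new_encoding) != text:
        raise ValueError("content was altered during migration")
    return migrated


# Illustrative document held in an older single-byte encoding.
original = "rôle of research libraries".encode("latin-1")
refreshed = refresh(original)
migrated = migrate(original, "latin-1", "utf-8")
```

In this sketch the refreshed copy is byte-for-byte identical to the original, while the migrated copy differs at the byte level but decodes to the same text. Real migrations between file formats or application environments would be considerably more involved.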
Digital Preservation
"Retaining the ability to display, retrieve, manipulate and use the
digital information in the face of constantly changing technology."
THE NEED FOR A DEEP INFRASTRUCTURE
The Task Force concluded that a national infrastructure is called for.
This should include a number of recognised repositories. Recognition would
be achieved by certification by an independent authority. A fail-safe
mechanism would be needed, for example to "rescue" data if an
archive closes.
Other mechanisms will be needed, for example to direct data producers who find themselves unable to maintain particular data sets, and for archives to proactively seek out data sets in danger of being "orphaned".
CRITICAL ISSUES
The following critical issues have been identified and are examined
below:
The operating environment will be conditioned by the diversity of attributes which describe it. Archives will have to contend with:
The Task Force takes the view that owners, creators and copyright holders have the initial responsibility for archiving their data sets. This is not to say that they will look after every object which they should care for; they represent only a first line of defence. Pressure should be applied in this area, particularly on publishers.
Certified archives will have the right and the responsibility to carry out aggressive rescues of endangered data sets.
Migration Strategies for Digital Information
Migration strategies will have to be developed. The nature of these
strategies will depend on the relevant application environments, on the
formats involved, and on the degrees of functionality sought.
Migration will take into account the need to change media, change formats, and in some cases incorporate standards. It is quite possible that this will be achieved by specialist processing centres or bureaux, possibly consortium-owned.
One action which would greatly ease the task of migration would be the incorporation of migration paths into new software.
Intellectual Property
The relevant legislation is the US Copyright Act Section 108, which
protects intellectual property while allowing libraries to make copies of
protected material for preservation purposes. However, this legislation
did not anticipate the need to copy digital documents for the same
purpose; consequently legislative changes will be called for before data
can be migrated with impunity.
Similarly, other preservation-related activities may require the permission of copyright holders, under present rules. These activities are:
Clearly, this would be an onerous responsibility. Requiring owner authorisation for each preservation action could undermine the effectiveness of an archives network.
The Task Force therefore proposes that digital archives would not be required to seek authorisation to create a copy or to store, migrate and manage that copy. Intellectual property owners would retain control over the making of new copies in other circumstances, however.
We also propose that any work which is not protected intellectual property can be accessed, used and disseminated according to the terms of the archive; but for any work which is thus protected, the actions would require agreement of the rights holder.
Costs and Financing
The questions of cost, and relative cost, are complex. A large number of interrelated factors contribute to the cost of digital archiving, namely the costs of:
Cost models therefore must include consideration of the functions which are included. They must also allow for predictions of change in technologies and costs over long time periods. Trends caused by the archives themselves will also affect the costs: for example, once operations are routine and predictable, we can anticipate that unit costs would greatly decrease; and there may be some economies of scale.
The costs of an archive must of course be matched by its revenues. A model will need to allow for potential income from tax and accounting incentives, user fees and subscriptions.
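As a rough illustration of the kind of cost model described above, the following sketch projects annual costs under a falling unit cost and compares them with revenue. It is not drawn from the draft report: the function names, the assumption of a fixed annual percentage decline and every figure used are hypothetical, chosen only to show the shape of such a calculation.

```python
def projected_costs(initial_unit_cost, annual_decline, units, years):
    """Cost per year, assuming unit cost falls by a fixed fraction each year."""
    return [
        units * initial_unit_cost * (1 - annual_decline) ** year
        for year in range(years)
    ]


def annual_balance(costs, revenues):
    """Revenue minus cost for each year; negative values indicate a shortfall."""
    return [revenue - cost for cost, revenue in zip(costs, revenues)]


# Hypothetical figures: 2 units held, unit cost 100.0 falling by half each year,
# and a flat annual income of 120.0 from fees and subscriptions.
costs = projected_costs(initial_unit_cost=100.0, annual_decline=0.5, units=2, years=3)
balance = annual_balance(costs, revenues=[120.0, 120.0, 120.0])
```

With these invented figures the archive runs a shortfall in the first year and a surplus thereafter. A realistic model would also need to account for the costs of periodic migration, growth in holdings and the other factors listed above.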
RECOMMENDATIONS TO CPA AND RLG
The Task Force is currently reviewing its draft recommendations, with
a view to strengthening some in response to comments. At this point, the
recommendations include the following:
NEXT STEPS
The next steps will be to reconvene the Task Force, early in 1996. The
final report will then be issued, and we look forward to the
recommendations being implemented, hopefully in 1996.
ADDITIONAL INFORMATION
Copies of the draft report are available from:
WWW at URL: http://ukoln.bath.ac.uk/mirror/archtf/archtf.html
FTP server: ftp://lyra.stanford.edu/
Task Force discussion list: comments can be addressed to ARCHTF-L, the open listserv sponsored by the Task Force. To subscribe to the listserv send the message subscribe archtf-l to listserv@yalevm.cis.yale.edu
Director of Collections and Preservation at the British Library
This section is an edited version of a paper supplied by Mirjam Foot
ABSTRACT
This paper poses two basic questions. First: what is a preservation
policy for digital material? Second: does it differ from a preservation
policy for "conventional" library and archive material and, if
so, in what way? It examines many detailed issues related to these two
fundamental questions, including the influences of other factors
(collection purpose, format, medium etc) on preservation policy.
PRESERVATION
Definitions
In the UK, preservation as it applies to conventional library materials is usually defined as "all managerial and financial considerations including storage and accommodation provision, staffing levels, policies, techniques and methods involved in preserving library and archive materials and the information contained therein" (NPO Glossary). The Oxford English Dictionary puts it more succinctly as the art of "keeping safe", "keeping alive", "maintaining" and "retaining". In a digital context, we have to look anew at this definition, as an extra dimension has to be taken into account.
According to the OED, policy is "a course or general plan of action". In other words, a preservation policy at its most basic is a plan of action for safe keeping. Such a plan of action should address the questions of what needs to be preserved, why, for what purpose, and for how long.
In order to address these questions we shall have to look at the function and purpose of the collections themselves, and of those of the institutions in which they are kept. For example:
It is not necessary to spell out for this audience that although the answers to the questions of what, why and for how long differ with the aims and purpose of the institution or the collection in question, they are also influenced by the nature of the material itself. If we consider a broadly-based international collection, comprising original sources and secondary material, basic research material and ephemera, we will encounter a wide variety of formats. These can include:
Different formats and different media demand different technical solutions as well as different storage conditions, but the aim and the purpose of a library itself and its functions determine its preservation policy which should cover all formats and all media. This policy then steers the preservation programme, which sets out the order in which collections or items will be preserved and the method by which this should be done.
Relationship Between Purpose and Preservation Needs
If we talk about the purpose of a collection as a determining element in
its preservation needs, we need to look at other library and archive
functions that are closely linked to preservation, such as acquisition,
retention and access. How strong these links are and what their relative
importance is depends on the purpose of the library or archive in
question, as well as on the nature of the material.
Although the aims and purposes of the various kinds of libraries and archives vary enormously, they all have some basic objectives in common. All libraries and archives acquire material (or have at one stage in their existence done so), mostly with the aim of making it available at some time or other; and all want to retain some of it for a shorter or longer period of time, some in perpetuity. If we assume that all research libraries want to make their collections available for use now or in the future, they will have to ensure that those collections can be used and are in a fit state to be used. This "fit state" applies both to the information contained in these collections and, in many cases (and certainly for conventional material), to their actual format, their physical entity.
DIGITAL MATERIALS
When talking about digital materials, there is an extra dimension that
needs to be preserved, and that is the dimension of access. For
conventional materials the human body provides its own access mechanism.
Moreover this is renewed with each new generation. For digital material
this is not the case. Eyes alone are not much use when faced with any of
the formats or media in which digital data is presented. Unless we have a
separate usable and maintainable access mechanism, we simply cannot use
the acquired data.
The question of what to preserve is answered in part by the reason why an item is acquired. If it is acquired in order to serve a community of undergraduates for one or at the most two years, there may be reasons for acquiring multiple copies, but little reason for preserving the copies once they have served their very limited purpose. If on the other hand an item is acquired for permanent addition to and retention in a collection, its preservation becomes as important as its acquisition. Short term use may still call for a short term conservation fix; it does not call for a controlled long term preservation policy.
Selection
If we consider a national deposit library as a library of "last
resort" for publications which otherwise may disappear, and as a
place where the entire "published archive" of a nation is kept
together and is recorded, the principles of selection and acquisition of
material are the same whether we talk about conventional or electronic
formats.
The way in which these formats are selected and acquired will vary. As a matter of principle, all publications, whether conventional or electronic, should at least be considered for acquisition in a deposit collection. In practice selectivity is forced upon us by constraints in resources, storage space, handling capacity and funding. Selectivity may also be influenced by technical capacity. It has been said that the selection of electronic publications should be limited to those that can be acquired, handled and stored locally by the library. However, in a digital environment one could equally well argue that giving access to publications that reside elsewhere also fulfils one of the major purposes of any library, namely to make information available to its users (although it is not a deposit function).
Dynamic documents (such as frequently-updated online databases) pose an acquisition problem that we do not face with conventional texts. Although one may argue for selective acquisition that is frequent enough to preserve all information contained in such a publication during its lifetime, prohibitive costs may well compel a much greater selectivity aimed at only acquiring representative samples (however difficult it may be to decide what is representative).
Format and Medium
The format in which the information is presented should not influence its
selection or non-selection, as a format that cannot be easily handled may
be converted to one that the library or archive can handle. This may be
problematic, but it should be attempted; time and effort should be spent
to achieve it. Nor should the medium be regarded as a selection criterion.
Here again, the information content may be transferred to another medium
that can be accommodated. Selection criteria relating to the intrinsic
value or importance of the material to be acquired will be the same for
conventional and for electronic material. In libraries where maximum
access to the most up-to-date material is the prime objective, selection
criteria may well be guided by medium or format.
RETENTION VS. PRESERVATION: CONVENTIONAL MATERIALS
The question of retention is inextricably linked with preservation. It
is technically possible, at least for conventional material,
to preserve an item virtually for ever (provided it has not been neglected
beyond rescue to begin with). The decision of whether or not an item will
be retained needs to be made, as well as the decision of whether an item
needs to be retained in its original format or in surrogate form. In many
cases, the format is as important as, sometimes more important than, the
information it contains. Format alone can provide information over and
above its contents and there are library and archive users who have a real
need to consult the material in its original format. For many users a
surrogate will suffice and can at times be preferred. The decision whether
to retain the original once a surrogate has been made is not clear cut.
RETENTION VS. PRESERVATION: ELECTRONIC MATERIALS
While for conventional material we can still make the distinction
between retention and preservation, for electronic material such a
distinction no longer applies. The main reason is the lack of longevity of
the storage media for electronic information, coupled with the imminent
obsolescence of their retrieval hardware and software. Simply "leaving
things as they are" is not an option for digital collections. The
choice whether to retain the document as an artefact, or to retain the
information it contains, or both, is less of a real choice with electronic
material. If we try and keep electronic publications as artefacts (i.e.
exactly as received from the publisher) they will eventually become
inaccessible and their contents will be lost. On the other hand, if we
attempt to retain the content, many aspects of the visual presentation and
perhaps even of the "functionality" of the electronic document
will be lost. We may also lose what Peter Graham has called the "integrity
and authenticity of the information as originally recorded".
Experience so far seems to indicate that in the long run the intellectual
content of an electronic publication is all we can retain and we shall
have to accept (at least for the time being) that certain interactive,
dynamic and presentational aspects of the original cannot be retained.
In parallel with conventional publications, the off-line digital publication as a physical object is itself an expression of a part of our culture. It could therefore be argued that we must try to retain at least a representative sample of such physical objects and of their retrieval mechanism, in the knowledge that once the latter have broken down or can no longer be replaced, we will end up not as a functioning library or archive but as a museum of dead digital dodos.
RELATIONSHIP BETWEEN ACCESS AND PRESERVATION
The need for access has already been mentioned several times. Many
libraries and archives take the amount of use that is made of their
collections as an indication of their preservation needs. One can argue
that the nature and purpose of the use, rather than the amount of use an
item may get or is expected to get, is of paramount importance when making
retention and preservation decisions. To give low use as a reason for
neglect or non-preservation is dangerous. Some material may not be in
immediate demand nor in frequent demand, but it may be needed by someone
at some stage to increase knowledge or improve understanding. If we
believe this, then the model proposed for digital preservation by Donald
Waters as the "just-in-time" model (versus the just-in-case
model of conventional preservation) is one that should be used only in
awareness of its limitations. The increasing tendency in some parts of the
library world away from collections in favour of access reduces
our long term ability to fulfil the research needs of future
generations.
Nevertheless, the question of why an item or a collection should be preserved is closely linked to considerations of use and considerations of access. Only if we want to create a time capsule is there any point in preserving material to which access is withheld and even then, a time capsule is only of value if people know what it contains or if it is opened one day.
Most libraries and archives have a rôle that is wider than that of guardian of the cultural heritage. They have the duty to make their collections available to those who need to use them, now and in many cases also in the future. Providing access to the collections while preserving them for future use can, at least for conventional material, be seen as two conflicting aims. There are indeed kinds of access that defeat or prevent future use, in the same way as there are preservation methods that inhibit instant access. Nevertheless, such conflicts can be resolved and if the need for, and the purpose of, access are considered carefully, the dilemma between access and preservation is not quite so acute. Per contra, for digital material we can argue that access can assist preservation.
Unlike conventional items, electronic items do not deteriorate through use, but if they are not used for a long period, they may prove not to work any longer (because of deterioration of the mechanism and/or technical obsolescence). While not in itself sufficient, a high level of systematic access helps to check the usability of electronic publications.
The kind of use, the kind of access that is needed, influences preservation decisions and preservation methods. It has already been pointed out that with electronic material we may not have the choice to preserve both content and physical integrity. We do, however, have the choice whether to preserve electronic documents in digital format, on-line or off-line, and whether we "convert" them (for the purpose of long term retention) to non-electronic media. These choices will to some extent be steered by the medium and format of the publication, but also by the type of access that is needed. In many cases electronic publications cannot be preserved as originally received, whether this is because the medium will not survive, or because the technical environment becomes obsolete, or for intrinsic reasons (for instance networked publications by definition cannot be acquired and stored in their original medium so have to be converted to another). If access is needed to the content only, irrespective of any other functional considerations, the cowardly way out may be to convert from electronic media to paper or microfilm. However, such a strategy may only be valid for publications which are not true electronic documents but are just non-interactive static documents distributed on an electronic medium. For dynamic, interactive documents and multimedia, such conversion is not an option.
If we want to preserve publications as electronic publications there are two basic options for their archival storage: either off-line, storing them as physical objects, or on-line, on a database. These options provide different kinds of access. In the case of storage off-line, access needs to be provided first of all through reference in a catalogue, then by fetching the object and putting it into a suitable reading device. On-line storage implies on-line access, and a reference in a catalogue will give an on-line storage location, allowing direct access to the publication. If distributed or networked access is necessary, the on-line storage option will be preferable.
Having discussed what to preserve and why, the vexed question of how to preserve may well be asked. This has not been covered here, partly because the author does not feel qualified to do so, at least not for digital material, and partly because the "how" is not really part of a preservation policy.
ECONOMIC AND MANAGEMENT CONSIDERATIONS
It is however relevant to mention two more considerations that will
influence a preservation policy, for any sort of material, namely economic
and management considerations. Although human intellect, human
understanding, historical and technical knowledge, common sense, energy
and a will to succeed are all vital, no preservation policy, no
preservation programme, however well conceived, stands a chance of being
implemented without sufficient funding. But preservation is only one of
many library and archive functions that cry out for funding. In order to
find a proper balance between the funding of preservation and other
functions, we must again consider how they are related. Historically,
libraries have looked at the balance of funding between acquisitions and
preservation, between access and preservation, and sometimes between
public services and preservation. In recent decades the balance of funding
between computing and telecommunication services and preservation has also
been considered. However, when we talk about the preservation of
electronic material, the latter distinction may well disappear.
Lack of resources has always stood in the way of the successful implementation of a preservation policy or strategy and will certainly do so no less for electronic material. Perhaps the situation is even worse. At least once one has conserved a book, one can be reasonably satisfied of its continued existence (provided the item is properly stored and not over-handled). Similarly, once one has made a microfilm, provided the film and its production methods are of archival quality and it is stored in the right conditions, the contents of a book or manuscript will be preserved for about 300 years. However, this is not the case with electronic material. Long term access to such material requires an ongoing commitment to reformat, refresh or migrate data, and only if libraries and archives are willing to commit long term funding and long term effort should they embark on the acquisition and maintenance of electronic collections. To do otherwise is irresponsible. Planning for long term preservation of electronic material is made even more difficult because of the rapid changes in technology and the impossibility of predicting what the state of technology will be, even in the medium term.
THE IMPORTANCE OF COLLECTIONS
Notwithstanding all these uncertainties and all these problems, there
is one thing that remains certain, and that is the importance of the data
itself, or, in old-fashioned terminology, the library and
archive collections and their continued existence. The collections form a
library and archive's most valuable and most important asset, and the
provision of access to those collections their most important duty. The
argument has been presented in the past that in an electronic environment
a library will become an information broker, an institution that does not
own the data but simply enables access to them. If that is our future rôle,
someone will have to ensure that the data remains accessible and usable.
Technology will help. It will continue to improve and to become more and more useful and affordable. We must seize it when appropriate, but we must not think that it provides an answer to all our problems; at least thus far it has failed to do so. The answer may lie in the human ingenuity to develop and use it, but we must also endeavour to make the best possible use of the available resources; we must ensure that we do not duplicate efforts; we must combine to work together, to share the responsibility for preserving our cultural heritage, and we must be selective, in the full knowledge that selectivity is almost certain to damage future research. It is therefore the more important to be selective in the context of a national or international preservation strategy.
It seems fitting to close with words from Northrop Frye: "Society, like the individual, becomes senile in proportion as it loses its continuous memory". In an electronic age these words are not merely a warning, they are a threat.
Technical Director, Cimtech Ltd.
ABSTRACT
This paper discusses the practical implications of the tasks required
to preserve digital works. It presents two main options (on-line and
off-line storage), and relates these to The British Library's needs.
Finally, estimated costs are stated.
BACKGROUND
This paper stems from a consultancy study carried out by Cimtech for
The British Library. The study examined the issues which surround the
preservation of digital materials. It started with a literature review,
then moved on to a review of the preservation process, developed a
statement of objectives, reviewed the preservation options, and considered
the resource requirements.
OBJECTIVES FOR THE LIBRARY
The starting point is expressed neatly by David Martin: "Any
document which is published within the UK shall be eligible to be
designated for legal deposit". Though few would disagree with this
basic idea, in the electronic age it does require understanding of the
scope of the term "published". It is proposed that in this
context we consider:
DIGITAL PUBLICATION MANAGEMENT REQUIREMENTS
The Library's requirements can simply be categorised in conventional
form as falling under the following headings:
Selection
Identify publications/publishers; sign agreements; enforce deposit;
maintain list of classes for deposit; update list of exclusions.
Accessioning
Log receipt; assign accession number; check documentation; count copies;
check permissions; check media; forward copies to deposit libraries; pass
on.
First Handling
Check media; send out; virus check; read documentation; load data; run
tests; repeat for copies; check keys to usage restrictions; download data;
technical notes; pass on.
Record Creation
Link accession record, publication, documentation, notes;
view and inspect; create bibliographic record and profile; record storage
location of data and documentation.
Initial Preservation
Label publication; store data online and back-up, or download and store
off-line and record location.
Access/reader services
Provide users with access to publications, manually at standalone
workstations, on-line at local workstations or at deposit library
workstations via a wide area network.
Some of these tasks are familiar to libraries from the handling of books.
Others are unique to digital material; of these some (eg checking
documentation) can represent enormous levels of human effort.
OPTIONS FOR MANAGING DIGITAL MATERIALS
The Cimtech consultancy study identified two options for ways in which
The British Library can handle digital materials.
Option 1: On-Line
This option has the following features:
Option 2: Off-Line
This option has the following features:
Although this approach is followed at the Library of Congress to an extent, it is not practical for very high volumes. It also raises problems of media, security and standards.
THE LIBRARY'S DIGITAL PRESERVATION REQUIREMENTS
The main requirements are to ensure that data is not lost, and to
ensure that the data can be interpreted in the future.
Ensure Data is Not Lost
This can be taken to mean that the data is preserved for "digital
archaeologists" of future generations to decode. This would mean that
no effort is made to make the data accessible or usable for immediate or
medium term access.
Preservation of this type can be effected by copying the data to CD-R (Compact Disk Recordable) platters. The platters would be stored in controlled conditions, and the data would be refreshed by copying to new platters (or other media) every ten years.
Making suitable assumptions, we estimate a cost of approximately £47 p.a. to archive an item in this way over 25 years.
This approach presumes that some issues can be overcome. For example, some data cannot be easily copied from some media (eg some existing CD-ROMs); and current CD-R technology does not automatically perform read-after-write checking (and so something needs to be done to ensure the integrity of CD-R copies).
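One way to compensate for the lack of read-after-write checking is to compare a digest of the master data with a digest of the copy once it has been written. The sketch below is illustrative only and is not taken from the Cimtech study: the choice of SHA-256, the function names and the temporary files standing in for a master and its CD-R copies are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(master: Path, copy: Path) -> bool:
    """True when the copy has the same digest as the master."""
    return sha256_of(master) == sha256_of(copy)


def demo() -> tuple:
    """Hypothetical demonstration: verify one good copy and one damaged copy."""
    with tempfile.TemporaryDirectory() as tmp:
        master = Path(tmp) / "master.bin"
        good = Path(tmp) / "good_copy.bin"
        bad = Path(tmp) / "bad_copy.bin"
        payload = b"deposited publication data" * 1000
        master.write_bytes(payload)
        good.write_bytes(payload)
        # Simulate a single-byte write error on the second copy.
        bad.write_bytes(payload[:-1] + b"X")
        return copy_is_intact(master, good), copy_is_intact(master, bad)


if __name__ == "__main__":
    print(demo())
```

A check of this kind only detects a faulty copy; it does not repair it, so a failed comparison would still require the copy to be rewritten.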
In the absence of complete, issue-free solutions to the problems, the challenge is to start managing the data now, on the assumption that the answers for long term preservation will emerge naturally.
Ensure Data Can be Interpreted in Future
There are three ways in which we can manage data to make sure that future
generations will be able to make use of it:
COSTS
The Cimtech study developed the following cost estimates for
preservation:
Annual cost to manage and preserve a paper monograph: £5 p.a.
Annual cost to manage and preserve a CD-ROM off-line: £95 p.a. (this
is in addition to the £47 p.a. estimate for refreshing the media, as
explained above). Note that one CD-ROM holds the equivalent of about
twenty paper monographs. The cost includes an allocation of the costs of
providing PC workstations for access.
In the longer term, costs would increase.
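The figures above can be combined to compare the two media on a like-for-like basis. The following is a minimal sketch using only the numbers quoted in this paper; treating "about twenty" as exactly 20, and dividing the CD-ROM cost by that number to obtain a per-monograph equivalent, are assumptions made here for illustration and not a calculation reported by the study.

```python
# Figures quoted in the paper (pounds per annum).
PAPER_MONOGRAPH_PA = 5.0    # manage and preserve one paper monograph
CDROM_OFFLINE_PA = 95.0     # manage and preserve one CD-ROM off-line
MEDIA_REFRESH_PA = 47.0     # refresh the media (CD-R copying), per item
MONOGRAPHS_PER_CDROM = 20   # assumption: "about twenty" taken as exactly 20
HORIZON_YEARS = 25          # period over which the refresh estimate was made


def cdrom_total_pa() -> float:
    """Annual cost of one CD-ROM: management plus media refresh."""
    return CDROM_OFFLINE_PA + MEDIA_REFRESH_PA


def cdrom_cost_per_monograph_equivalent() -> float:
    """Annual CD-ROM cost spread over the monographs it could hold."""
    return cdrom_total_pa() / MONOGRAPHS_PER_CDROM


def refresh_cost_over_horizon() -> float:
    """Media refresh cost for one item over the whole estimation period."""
    return MEDIA_REFRESH_PA * HORIZON_YEARS


if __name__ == "__main__":
    print("CD-ROM total p.a.:", cdrom_total_pa())
    print("Per monograph equivalent p.a.:", cdrom_cost_per_monograph_equivalent())
    print("Paper monograph p.a.:", PAPER_MONOGRAPH_PA)
    print("Refresh cost over horizon:", refresh_cost_over_horizon())
```

On these assumptions a CD-ROM costs £142 p.a. in total, or about £7.10 p.a. per monograph equivalent, against £5 p.a. for a paper monograph. The comparison holds only if a disc is actually filled to that capacity.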
CONCLUSIONS
There are no clear solutions yet which answer The British Library's
needs for long term digital preservation without significant drawbacks.
For digital publications which are "similar" to paper documents, the pragmatic approach will be to convert all unformatted text into ASCII format for preservation; and to convert formatted text to a portable, platform-independent format such as Adobe's PDF (Acrobat) format. Ideally, a completely open format will be adopted.
This approach carries some risk of charges that publications are being republished and/or corrupted by the changes introduced in this preservation process. Both these are to be avoided as much as possible.
Clearly, the state of the technological art in the field of digital preservation means that we have to tread very carefully when taking long term decisions. The high costs and risks point to the need to be very selective in preserving digital works. Copying of some to paper or microfiche may remain the most desirable option.
STRATEGIES
This is an enormously diverse subject, which cannot adequately be
covered in a discussion of only one hour. The description or definition of
"strategy" in this context was the subject of some discussion,
and there was no attempt to develop a formal, complete definition.
The principal conclusion concerned the importance of establishing and maintaining a momentum.
Notwithstanding the relevance of adopting a strategy, it was felt that some actions should be initiated as soon as possible, so that there is not an inordinate delay while a thorough strategy is produced.
The higher education sector has its own needs; it will have to create its own solutions, rather than relying entirely on other institutions such as The British Library. Major public libraries may, however, have an important rôle to play.
Other conclusions were:
COLLECTION
Many ideas and issues, but fewer concrete conclusions, emerged from
this group. Two divergent views were represented in the group, namely:
PRESERVATION POLICY
As with the syndicate on Collection, some of the discussion centred on
the (unanswered) question of which objects should be preserved; in
particular, the issue of whether unusable digital objects (eg those
needing obsolete IT components) should be preserved for future
generations.
Five positive suggestions were developed:
PRACTICAL IMPLICATIONS
This syndicate identified three major headings for practical
implications, namely Management, Resources and Technology. The limited
time available restricted discussion to Management and Resource issues.
There was considerable debate on the meaning of the term "publishing", with a conclusion that the debate is more of concern to national libraries than to the higher education sector.
Key issues were:
Chair: Lynne Brindley, British Library of Political and Economic Science, UK
12.00 - 13.00 Arrival and Registration
13.00 - 14.00 Lunch
14.00 - 14.15 Introduction - Lynne Brindley
14.15 - 15.00 Preserving the Digital Library - Peter Graham, Rutgers University Libraries, US
15.00 - 15.30 Tea
15.30 - 16.00 Strategies for Managing Electronic Archives - Denise Lievesley, ESRC Data Archive, UK
16.00 - 16.30 Collection Policies - Daniel Greenstein, Arts & Humanities Data Service Executive, UK
16.30 - 17.30 Syndicate A: Strategies and Syndicate B: Collection
17.30 - 17.45 Report from Syndicate A
17.45 - 18.00 Report from Syndicate B
18.00 - 19.00 Free
19.00 for 19.30 Dinner
After dinner: Legal Deposit: The British Library Experience - Sir Anthony Kenny, Chairman of The British Library Board, UK
Chair: Nigel Macartney, British Library, UK
09.00 - 10.00 Preserving Digital Information - Margaret Hedstrom, University of Michigan, US
10.00 - 10.30 Preservation Policies - Mirjam Foot, British Library Collections and Preservation, UK
10.30 - 11.00 Practical Implications - Tony Hendley, CIMTECH Limited, UK
11.00 - 11.20 Coffee
11.20 - 12.20 Syndicate C: Preservation Policy and Syndicate D: Practical Implications
12.20 - 12.35 Report from Syndicate C
12.35 - 12.50 Report from Syndicate D
12.50 - 13.30 Summary of outcomes and conclusions - Lynne Brindley
Roy Baker, University of London Computer Centre
r.baker@ulcc.ac.uk
Lynne Brindley, British Library of Political & Economic Science
brindley@lse.ac.uk
David Buckle, OCLC Europe
david.buckle@oclc.org
Lou Burnard, Oxford University Computing Services
lou.burnard@vax.ox.ac.uk
Terry Cannon, British Library Research & Development Department
terry.cannon@bl.uk
Reg Carr, Leeds University Library
lib6rpc@library.novell.leeds.ac.uk
Julia Chruszcz, University of Manchester Computer Centre
julia@mcc.ac.uk
Ann Clarke, British Library
ann.clarke@bl.uk
Alice Colban, JISC Secretariat
a.colban@jisc.ac.uk
Margaret Croucher, British Library Research & Development Department
margaret.croucher@bl.uk
Marilyn Deegan, de Montfort University
marilyn@vax.ox.ac.uk
Richard Field, University of Edinburgh
r.field@ed.ac.uk
Mirjam Foot, British Library Collections & Preservation Department
mirjam.foot@bl.uk
Marc Fresko, Imaging & Information Technology Consultant
marc@easynet.co.uk
Hazel Gott, UKOLN
h.a.gott@bath.ac.uk
Peter Graham, Rutgers University Libraries
psgraham@gandalf.rutgers.edu
Daniel Greenstein, Arts & Humanities Data Service
d.greenstein@kcl.ac.uk
Rhidian Griffiths, The National Library of Wales
wrg@aber.ac.uk
Margaret Hedstrom, University of Michigan
hedstrom@umich.edu
Tony Hendley, CIMTECH Limited
Bjørn Henrichsen, Norwegian Social Science Data Services
nsd@nsd.uib.no
Andrew Jordan, University of Huddersfield
a.p.h.jordan@hud.ac.uk
Sir Anthony Kenny, The British Library Board
anthony.kenny@bl.uk
Geraldine Kenny, National Preservation Office, British Library
geraldine.kenny@bl.uk
Denise Lievesley, ESRC Data Archive
denise@essex.ac.uk
Nigel Macartney, British Library Research & Development Department
nigel.macartney@bl.uk
John Mahoney, British Library Research & Development Department
john.mahoney@bl.uk
Ann Matheson, National Library of Scotland
fb285am@admin.nls.uk
Simon Musgrave, ESRC Data Archive
simon@essex.ac.uk
Bernard Naylor, University of Southampton Library
b.naylor@soton.ac.uk
Seamus Ross, The British Academy
seamus@britac.ac.uk
Chris Rusbridge, The Electronic Libraries Programme
c.a.rusbridge@warwick.ac.uk
Anne Thurston, University College London
a.thurston@a1.sas.ac.uk
Frank Wright, Ordnance Survey