Michael Day
UKOLN, University of Bath, Bath BA2 7AY, United Kingdom
m.day@ukoln.ac.uk
http://www.ukoln.ac.uk/
© Springer-Verlag
Paper delivered at: ECDL2001, 5th European Conference on Research and Advanced Technology for Digital Libraries, Darmstadt, Germany, 5 September 2001.
Published as: Michael Day, "Metadata for digital preservation: a review of recent developments". In: P. Constantopoulos and I. T. Sølvberg, (eds.), Research and Advanced Technology for Digital Libraries: 5th European Conference, ECDL 2001, Darmstadt, Germany, September 4-9, 2001, Proceedings. Lecture Notes in Computer Science, 2163. Berlin: Springer-Verlag, 2001, pp. 161-172. ISBN 3-540-42537-3. Table of contents available at: http://link.springer.de/link/service/series/0558/tocs/t2163.htm
This paper is a review of recent developments relating to digital preservation metadata. It introduces the digital preservation problem and notes the importance of metadata for all proposed preservation strategies. The paper reviews some developments in the archives and records domain, describes the taxonomy of information object classes defined by the Reference Model for an Open Archival Information System (OAIS) and outlines some library-based projects.
The long-term preservation of information in digital form is one of the most important problems faced by the cultural heritage professions in the early twenty-first century. Hedstrom [1] has defined digital preservation as "the planning, resource allocation, and application of preservation methods and technologies necessary to ensure that digital information of continuing value remains accessible and usable." Using this definition, it is clear that the digital preservation problem is not just a technical problem, but an organisational one as well. It may, in fact, be easier to solve many of the technical issues relating to the preservation of digital information than to create organisational and managerial structures to support their consistent application. Hedstrom's definition also stresses that preservation is about maintaining access to information - not just, for example, about the various technical options for long-term storage.
It has been clear for some time that the preservation of information in digital form will require more than just the preservation of the digital bits and bytes themselves. It has been widely assumed that if digital information is to remain understandable over time, there will be a need to preserve information about the technological and other contexts of a digital object's creation and use. In the past, this was sometimes assumed to mean the concurrent preservation of all of the relevant documentation that might be associated with a digital object. Following other trends in digital library terminology, a more sophisticated understanding of this concept is now known under the name of 'preservation metadata.' This paper attempts to review some recent initiatives that relate to preservation metadata for digital objects.
In technical terms, the successful long-term preservation of digital information will be dependent upon organisations identifying and implementing suitable preservation strategies [2]. If one ignores strategies that involve converting digital information into non-digital forms (e.g. printouts or microforms), at the moment there are three main strategies: technology preservation, software emulation and data migration [3]. None of these options provides a single perfect solution and it is assumed that different digital information types may require different strategies to be adopted. In any case, whichever particular digital preservation strategy is adopted, preservation metadata is likely to be a key part of its implementation. Clifford Lynch [4] describes the function of some of this metadata:
Within an archive, metadata accompanies and makes reference to each digital object and provides associated descriptive, structural, administrative, rights management, and other kinds of information.
Lynch's comments, however, give us a clue that preservation metadata must do more than support the implementation of any particular preservation strategy. Day [5] has suggested, for example, that metadata could be used to help ensure the authenticity of digital objects and to manage user access based on intellectual property rights information, as well as for more traditional metadata applications like resource description and discovery. This paper will now look at some metadata developments in the archives and recordkeeping domain before proceeding to some recent library-based projects.
Some parts of the archives and records professions have been seriously considering digital preservation issues for some time. In the United States, for example, an awareness of the need for the preservation of economic data stored on punched cards and magnetic tape first became apparent in the early 1960s [6]. Shortly afterwards, some of the larger national archives had started to consider what were then generically known as machine-readable records, and a few set up separate divisions to deal with them.
Most of the first generation of machine-readable records were data sets stored on punched cards or magnetic tape. As a result, appraisal and custody regimes tended to follow a traditional pattern based on the physical records being transferred into the custody of an archival repository at the end of their active life cycle. Over time, however, a rapid growth in the use of computers and the ever-changing nature of the records that were being created resulted in a widespread reassessment of archival theory and practice [7]. For example, in the new digital environment, it was no longer sufficient for archivists to make decisions about the retention or disposal of records at the end of their active life. By that time it might be too late to ensure their preservation in any useful form. O'Shea [8] has commented that the ideal time for archivists' attention to be given to digital records "is as part of the systems development process at the point systems are being established or upgraded, i.e. even before the records are created." The Australian archives community has in particular adopted a 'continuum' approach to records management. By the early 1990s, projects began to look at embedding recordkeeping requirements in the design of office systems. Examples of these are the National Archives of Canada's IMOSA (Information Management and Office Systems Advancement) project [9] and the Public Record Office's EROS (Electronic Records in Office Systems) Programme [10]. Examples of this type of activity can also be found in commercial contexts, most notably in the pharmaceutical industry [11].
The reassessment of archival theory and practice triggered by electronic records has also begun to influence archivists' understanding of archival description. Under the older model, archival description took place after the physical transfer of records to a repository. Traditional archival descriptions document the context of their creation as well as containing information on their accumulation, custodial history and arrangement. McKemmish and Parer [12] argue that these descriptions essentially act as cataloguing records, "surrogates whose primary purpose is to help researchers find relevant records." However, with a record continuum perspective, archival description can instead be envisaged "as part of a complex series of recordkeeping processes involving the attribution of authoritative metadata from the time of records creation." This metadata is commonly known as 'recordkeeping metadata', defined by McKemmish and Parer as "standardised information about the identity, authenticity, content, structure, context and essential management requirements of records." At least some of this data could be automatically captured at the time the record is created.
Since the 1990s, a variety of research projects and practically based initiatives have been concerned with the development of recordkeeping metadata schemes and standards. The most influential of these will be described here.
The first recordkeeping research project to develop a detailed concept of metadata for recordkeeping was the Functional Requirements for Evidence in Recordkeeping project. This project was undertaken between 1994 and 1997 by the School of Information Sciences at the University of Pittsburgh and funded by the US National Historical Publications and Records Commission [13]. The core aim of the project was to "develop viable recordkeeping functional requirements through an analysis of the professional literature and via consultation with experts in the management of archives and records" [14]. What emerged was the idea of an electronic recordkeeping system that could support the capture, maintenance and continued usability of records.
One of the Pittsburgh Project's outcomes was the development of a 'Metadata Specification for Evidence' based on a model known as the Reference Model for Business Acceptable Communications (BAC). The BAC metadata specification proposed that digital records should carry a six-layer structure of metadata. These would contain a 'Handle Layer' that would include a unique identifier and basic resource discovery metadata. However, the specification also included other layers that would be able to store detailed information on terms and conditions of use, data structures, provenance, content, and the use of the record since its creation. This metadata would be directly linked to each record and would be able to describe the content and context of the record as well as enabling the decoding of its structure for future use [15]. The metadata was intended to carry all the necessary information that would allow the record to be used - even when the individuals, computer systems and information standards under which it was created no longer existed [16].
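The layered structure described above can be illustrated with a minimal sketch. The layer keys below are hypothetical names paraphrased from the description in this paper (only the 'Handle Layer' is named in the text), and the completeness check is an illustrative assumption rather than part of the BAC specification.

```python
# Hypothetical keys for the six layers, paraphrased from the description above.
# Only "handle" corresponds to a layer name given in the text.
BAC_LAYERS = (
    "handle",                # unique identifier and basic discovery metadata
    "terms_and_conditions",  # terms and conditions of use
    "data_structure",        # information needed to decode the record
    "provenance",            # origin and context of the record
    "content",               # description of the record's content
    "use_history",           # use of the record since its creation
)


def missing_layers(record_metadata: dict) -> list:
    """Return the layers that a record's metadata does not yet carry."""
    return [layer for layer in BAC_LAYERS if layer not in record_metadata]


# An illustrative record whose use history has not been captured.
example_record = {
    "handle": {"identifier": "rec-0001", "title": "Example memorandum"},
    "terms_and_conditions": {"access": "unrestricted"},
    "data_structure": {"format": "plain text", "encoding": "ASCII"},
    "provenance": {"creator": "Example Department"},
    "content": {"subject": "example"},
}

print(missing_layers(example_record))
```

A check of this kind reflects the idea that the metadata travels with the record, so that a record lacking one of the layers can be identified without reference to the system that created it.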
At approximately the same time as the Pittsburgh Project was developing its functional requirements for recordkeeping, another project was looking at the 'Preservation of the Integrity of Electronic Records.' This project was funded by the Social Sciences and Humanities Research Council of Canada and was based at the School of Library, Archival and Information Studies at the University of British Columbia (UBC), in collaboration with the US Department of Defense. It ran from 1994 to 1997. The project was primarily concerned with the preservation of the completeness, reliability and authenticity of electronic records.
While the Pittsburgh Project was heavily influenced by the reappraisal of archival thinking occasioned by developments like the record continuum model [17], the UBC project looked to base its understanding of electronic records on more traditional archival concepts. The project adopted concepts of 'reliability' and 'authenticity' that had already been used within diplomatic theory and archival science. This resulted in a restatement of the importance of archival custody once records have become inactive. Duranti [18] noted that "the authenticity of inactive records traditionally has been protected by physically transferring them to an archival institution or programme and, once transferred, by arranging and describing them." The replacement of traditional archival description by the automated capture of contextual metadata (as proposed by the Pittsburgh Project) was therefore rejected [19]. Duranti [20] argued that automatically captured metadata are inadequate, because they "do not contain 'historical' context, but only the contextual data contemporary to records creation, and because they only record the limited contextual fabric that a document has within the electronic system in which it exists." The UBC project's research team developed a set of eight templates that were intended to help identify the necessary components of records in all recordkeeping environments. These templates may be seen as potentially forming the basis of a metadata scheme for records, but one that is more firmly based in the traditional custodial view of recordkeeping than the specification developed by the Pittsburgh Project.
The InterPARES (International Research on Permanent Authentic Records in Electronic Systems) project is another project led by the School of Library, Archival and Information Studies at the University of British Columbia. The project is concerned with a wide range of issues relating to the reliability and authenticity of electronic records. Work has been undertaken by a series of task forces dealing with authenticity, preservation, appraisal and strategy. The project's task force on authenticity has the task of identifying the elements of electronic records that need to be preserved to ensure their authenticity. The task force first developed a template for analysing electronic records. Gilliland-Swetland and Eppard [21] note that the template "is a model of an ideal record that, based upon prior archival knowledge of record types, contains all the possible known elements that a record may contain." In common with the UBC Project (upon which the work is based), the identification of these elements has been guided by the general principles of diplomatic theory and archival science. It is accepted that no one single record would include all of the elements identified in the template. The project has also, therefore, developed a typology of electronic records to help to identify which 'core' elements would be applicable to all electronic records.
Since the Pittsburgh Project, it is the Australian archives and records community that has led the way in the development of metadata schemas for recordkeeping. A research project (the Recordkeeping Metadata Project) based in the School of Information Management and Systems at Monash University has developed a general framework known as the Australian Recordkeeping Metadata Schema (RKMS). The project, amongst other things, has attempted to specify and standardise the whole range of recordkeeping metadata that would be required to manage records in digital environments [22]. It has also been concerned with supporting interoperability with more generic metadata standards like the Dublin Core and relevant information locator schemes like the Australian Government Locator Service (AGLS) scheme. The RKMS defines a highly structured set of metadata elements that conforms to a data model based on that developed for the Resource Description Framework (RDF). The schema is designed to be extensible and can inherit metadata elements from other schemas.
In addition to the conceptual frameworks and elements developed as part of the RKMS, both the National Archives of Australia (NAA) and the State Records Authority of New South Wales have published metadata standards for recordkeeping. These are, respectively, the Recordkeeping Metadata Standard for Commonwealth Agencies [23] and the NSW Recordkeeping Metadata Standard [24]. The Victorian Electronic Records Strategy (VERS) has also defined a metadata scheme for self-documenting records. This scheme [25] is designed to be compatible with the Recordkeeping Metadata Standard developed by the NAA despite being based on a different conceptual model.
Apart from the ongoing Australian efforts to define recordkeeping metadata frameworks and standards, the one other important development has been the Reference Model for an Open Archival Information System (OAIS). This resulted from a request from the International Organization for Standardization (ISO) that the Consultative Committee for Space Data Systems (CCSDS) should co-ordinate the development of standards in support of the long-term preservation of digital information obtained from observations of the terrestrial and space environments. Although the OAIS model has been primarily developed by and for the space data community, its developers hope that the model has a much wider application.
The specification defines a high-level reference model for an OAIS, which is defined as an organisation of people and systems that have "accepted the responsibility to preserve information and make it available for a designated community" [26]. The OAIS model is not just concerned with metadata. It defines and provides a framework for a range of functions that are applicable to any archive - whether digital or not. These functions include those described within the OAIS specification as ingest, archival storage, data management, administration and access. Amongst other things, the OAIS model aims to provide a common framework that can be used to help understand archival challenges and especially those that relate to digital information. This is its real value: providing a common language that can facilitate discussion across the different communities interested in digital preservation. For example, one key concept in the OAIS model is that of an Archival Information Package (AIP) consisting of a digital object together with all of its associated metadata.
As part of this framework, the OAIS model identifies and distinguishes between the different types of metadata that will need to be exchanged and managed within an OAIS. Within the draft recommendation, the broad types of metadata that will be needed are defined as part of a 'Taxonomy of Information Object Classes.' Within this taxonomy, an AIP is perceived as encapsulating two different types of information, some Content Information and any associated Preservation Description Information (PDI) that will allow the understanding of the Content Information over an indefinite period of time. The Content Information is itself divided into the Data Object itself - which would typically be a sequence of bits - and the technical Representation Information that would give meaning to this sequence. Descriptive Information that can form the basis of finding aids (and other services) can be based on the information that is stored as part of the PDI, but is logically distinct.
The OAIS taxonomy also sub-divides the PDI into four distinct groups. These are based on general concepts described in the 1996 report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access (CPA) and the Research Libraries Group (RLG). The task force [27] wrote that "in the digital environment, the features that determine information integrity and deserve special attention for archival purposes include the following: content, fixity, reference, provenance and context." Accordingly, the OAIS taxonomy divides PDI into Reference Information, Context Information, Provenance Information and Fixity Information.
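The taxonomy described above can be sketched as a small set of nested data structures. This is only an illustration: the class and field names are assumptions chosen to mirror the OAIS terms used in this paper, and the OAIS model itself is a reference model that does not prescribe any particular encoding.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ContentInformation:
    data_object: bytes                # typically a sequence of bits
    representation_information: str   # gives meaning to that sequence


@dataclass
class PreservationDescriptionInformation:
    reference: dict = field(default_factory=dict)   # assigned identifiers
    context: dict = field(default_factory=dict)     # relationships to the environment
    provenance: list = field(default_factory=list)  # history of the content
    fixity: dict = field(default_factory=dict)      # authentication mechanisms


@dataclass
class ArchivalInformationPackage:
    content: ContentInformation
    pdi: PreservationDescriptionInformation


# An illustrative package; all of the values are invented for the example.
aip = ArchivalInformationPackage(
    content=ContentInformation(
        data_object=b"Hello, archive",
        representation_information="plain text, ASCII encoded",
    ),
    pdi=PreservationDescriptionInformation(
        reference={"identifier": "example-0001"},
        provenance=["created by depositor", "ingested by archive"],
    ),
)

pdi_groups = [f.name for f in fields(PreservationDescriptionInformation)]
print(pdi_groups)
```

Descriptive Information is deliberately left out of the package in this sketch, since the model treats it as logically distinct from the PDI even where it is derived from it.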
The OAIS model defines Reference Information as the information that "identifies, and if necessary describes, one or more mechanisms used to provide assigned identifiers for the Content Information." Reference Information, therefore, would be a logical place to record unique identifiers. It could also be used to store basic descriptive-type information that could be used as the basis for resource discovery, although that would not be its main purpose within the PDI.
Context Information is defined as information that "documents the relationships of the Content Information to its environment." The CPA/RLG report suggests that 'context' should include information on the technical context of a digital object, e.g. to specify its hardware and software dependencies and to record things like hypertext links in a Web document. Context could also include information relating to the mode of distribution of a particular Digital Object (e.g. whether it is networked or provided on a particular storage device) and its wider societal context.
Within the OAIS taxonomy, Provenance Information refers generally to that information that "documents the history of the Content Information." The CPA/RLG report says that the "assumption underlying the principle of provenance is that the integrity of an information object is partly embodied in tracing from where it came. To preserve the integrity of an information object, digital archives must preserve a record of its origin and chain of custody." While Provenance Information is primarily concerned with supporting the integrity of a Data Object, the information that is recorded could also provide information that could be used to help the management and use of Digital Objects stored within a repository (e.g. administrative metadata). It could also store information about the ownership of intellectual property rights that could be used to manage access to the Content Information of which it forms a part.
Fixity Information - in OAIS terms - refers to any information that documents the particular authentication mechanisms in use within a particular repository. The CPA/RLG report comments that if the content of an object is "subject to change or withdrawal without notice, then its integrity may be compromised and its value as a cultural record would be severely diminished." Changes can be either deliberate or unintentional, but both will adversely affect the integrity of a Digital Object.
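One common way of detecting such changes is to record a message digest of the Data Object and to recompute it later. The following is a minimal sketch of that idea; the function names, the dictionary layout and the choice of SHA-256 are assumptions made for illustration, since the OAIS model does not mandate any particular authentication mechanism.

```python
import hashlib


def make_fixity(data: bytes, algorithm: str = "sha256") -> dict:
    """Record which digest algorithm was used and the resulting digest."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {"algorithm": algorithm, "digest": digest}


def verify_fixity(data: bytes, fixity: dict) -> bool:
    """Recompute the digest and compare it with the recorded value."""
    current = hashlib.new(fixity["algorithm"], data).hexdigest()
    return current == fixity["digest"]


original = b"content of a digital object"
fixity_record = make_fixity(original)

# An unchanged object verifies; any alteration, deliberate or not, does not.
unchanged_ok = verify_fixity(original, fixity_record)
altered_ok = verify_fixity(original + b"!", fixity_record)
print(unchanged_ok, altered_ok)
```

Storing the algorithm name alongside the digest matters for preservation purposes, because the mechanism in use within a repository may itself change over time.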
It is important to remember that the OAIS is a reference model and not a blueprint for an archive implementation. The OAIS model and its taxonomy, however, have begun to influence a number of projects that have been developed by the library community. In the last section we will, therefore, turn to look at these projects and some other digital preservation initiatives that have been undertaken by the library community.
Like the archives and records domain, the library community has been aware of digital preservation issues for a long time [28]. The publication of the report of the CPA/RLG Task Force on Archiving of Digital Information was a catalyst for much recent work. A study of RLG member institutions (including libraries, archives and museums) revealed that by 1998 there was a strong awareness that institutions needed to assume responsibility for the preservation of information in digital form. However, this awareness was combined with a general lack of written policies, facilities and expertise [29]. One particular focus of libraries' interest in digital preservation issues has been preservation metadata.
Part of the motivation for looking at preservation metadata is related to the development of digitisation projects. For example, in May 1997 the RLG constituted a working group on the Preservation Issues of Metadata to help identify the kinds of information that would be required to manage a digital master file over time. The primary focus of the working group was the products of digital imaging technologies. The working group published its final report in May 1998 [30]. This defined sixteen metadata elements for digital image files. A more detailed technical metadata standard for digital images is currently under review as a draft NISO (National Information Standards Organization) standard [31].
Other preservation metadata implementations have been developed by national libraries and by research projects. The most influential of these will be reviewed in the following sections.
The National Library of Australia (NLA) has long had a keen interest in digital preservation issues. This is demonstrated by its ongoing support for and hosting of the PADI (Preserving Access to Digital Information) service [32]. In 1996, the NLA established its PANDORA (Preserving and Accessing Networked DOcumentary Resources of Australia) archive as an operational 'proof-of-concept.' With regard to metadata, descriptive metadata for each object in the PANDORA archive was stored in the NLA's own library management system, with individual items being identified by means of Persistent Uniform Resource Locators (PURLs). The project also developed a logical data model (based on entity-relationship modelling) to help identify the particular entities (metadata) that would need to be supported [33].
Later, the NLA also developed a specification entitled Preservation Metadata for Digital Collections, the exposure draft of which was published in October 1999 [34]. This was based on a 'data output model,' i.e. it defined the information that a digital storage system would need to generate in order to facilitate the preservation management of digital content. The NLA metadata element set defined 25 high-level elements (some with sub-elements) at three distinct levels of granularity: the collection, the object and the sub-object (here called files). The metadata specification made no assumptions about the technological strategies that would need to be adopted to preserve the object, e.g. migration or emulation.
The Cedars (CURL Exemplars in Digital Archives) project was funded by the Joint Information Systems Committee as part of Phase III of the Electronic Libraries (eLib) Programme and was managed by the Consortium of University Research Libraries (CURL). The lead sites in the project were the universities of Cambridge, Leeds and Oxford, with expertise being drawn from both computing services and libraries within the three organisations. The project's aims were to address some of the strategic, methodological and practical issues relating to digital preservation. These issues were addressed in three main project strands; one looking at digital preservation strategies and techniques, another concerned with collection development and rights management issues and a third interested in defining the metadata that would be required to adequately preserve digital information objects [35].
The work on preservation metadata got underway in 1998 with a document that reviewed existing metadata initiatives [36]. The project then created a draft metadata specification that was broadly (and explicitly) structured according to the taxonomy of information object classes described in the OAIS reference model [37]. The draft metadata element set was developed both as a scheme that could be tested in the Cedars project's demonstrator archive and as a contribution to the wider international discussion about preservation metadata. The elements identified were defined at a relatively high-level (it was assumed that some elements could be sensibly subdivided into sub-elements) and were intended to be applicable to a wide range of digital objects at any granularity.
The NEDLIB (Networked European Deposit Library) project ran from 1998 to 2000 and was funded by the European Commission as part of its Telematics Applications Programme. The project was a consortium of national libraries, publishers, information technology organisations and a national archive. The project developed an architectural framework for what it called a deposit system for electronic publications (DSEP) - broadly based on the OAIS model. The project also attempted to define the minimum metadata elements that would be necessary for preservation management [38]. Like the Cedars element set, the NEDLIB schema explicitly adopts the OAIS model's terminology and structure. The schema, however, was much smaller (18 elements, 38 sub-elements) than the Cedars element set because it was focussed on only identifying 'core' (or mandatory) metadata elements. It was also primarily concerned with defining metadata that would address the problem of technological obsolescence and not with metadata for descriptive, administrative or legal purposes.
The exposure draft of the NLA's Preservation Metadata Specification was published in 1999; the Cedars and NEDLIB element sets in 2000. It was an appropriate time for a collaborative attempt at synthesis and further development. In 2000, therefore, the OCLC Online Computer Library Center and the Research Libraries Group decided to co-operate on the formation of a Working Group on Preservation Metadata. The group has an international scope and has already produced a review of the state of the art in 'Preservation Metadata for Digital Objects' [39]. Future work will include the development of a metadata framework and the identification of the metadata elements that would be required to support it. The working group will also look at developing some form of test implementation and produce some recommendations on best practice.
This review of developments in digital preservation metadata has, of necessity, covered a wide range of initiatives, but not in great detail. Some trends can be seen. There is a tendency - at least among the library projects - to focus discussion on the terminology defined by the OAIS model. This has been one of the most important results of the development of the model. Another good outcome has been the identification of some weaknesses in the OAIS model that will hopefully inform its future development. The NEDLIB project noted, for example, that while the model had identified separate entities for things like ingest, administration and archival storage, it did not actually say much about preservation itself. The NEDLIB project, therefore, included an explicit preservation entity in its OAIS-based process model for a DSEP [40]. It might also be interesting for other communities to review the OAIS model with regard to their own needs, for example from the point of view of recordkeeping metadata requirements.
Other important issues have not yet been addressed. For example, more time and effort have been expended on developing conceptual metadata specifications than on testing them in meaningful applications. This is not intended as a criticism, but is just a reflection of how experimental the digital preservation area remains. There is also little published on the expertise and skills that would be required to generate preservation metadata and, therefore, on its potential cost. This could be a fruitful area for future research, but again reflects a wider uncertainty about the precise economic and societal costs of the long-term preservation of digital information.
This paper is based on work undertaken for the Cedars project (funded by the Joint Information Systems Committee) and as part of the Metadata Watch activity of the SCHEMAS project (funded by the European Commission as part of its Information Societies Technology (IST) Programme - contract no. IST-1999-10010).
Maintained by: Michael Day of UKOLN, University of Bath.