A review of metadata: a survey of current resource description formats
Work Package 3 of Telematics for Research project DESIRE (RE 1004)
The DESIRE project will use a generic metadata format for the records in the subject-based information gateways. There are a number of options for this format. This study provides background information which allows the implications of using particular formats to be assessed. Part I is a brief introductory review of issues. Part II provides an outline of resource description formats in directory style. This includes generic formats, but also, to give an indication of the range of development, domain-specific formats. The intention is not to be comprehensive, but to give sufficient examples to support understanding of a rapidly developing environment. The focus is on metadata for 'information resources' broadly understood; a variety of other approaches exist within particular scientific, engineering and other areas.
Metadata is data which describes attributes of a resource. Typically, it supports a number of functions: location, discovery, documentation, evaluation, selection and others. These activities may be carried out by human end-users or their (human or automated) agents.
A more formal definition is:
metadata is data associated with objects which relieves their potential users of having to have full advance knowledge of their existence or characteristics.
It is recognised that in an indefinitely large resource space, effective management of networked information will increasingly rely on effective management of metadata. The need for metadata services is already clear in the current Internet environment. As the Internet develops into a mixed information economy deploying multiple application protocols and formats this need will be all the greater. Metadata is not only key to discovery, it will also be fundamental to effective use of found resources (by establishing the technical or business frameworks in which they can be used) and to interoperability across protocol domains.
Part II of this report describes a range of metadata formats. It is unlikely that a single monolithic metadata format will be universally used. This is for a number of more or less well known reasons, not least the investment represented by legacy systems in terms of technology and human effort. In addition, the variety of record formats represents an attempt to meet the diverse requirements of different communities. The communities involved in resource description have invested significant effort in developing specialised structures which enable rich record descriptions to be created for their particular domains. These structures are embodied in systems, and the people who maintain them have developed considerable detailed knowledge and skills of a specialist nature. For these reasons it is unlikely that one format will fulfil their diverse requirements.
There is a variety of types of metadata. There is traditional descriptive information of the kind found in library catalogues, which typically includes such attributes as author, title, some indication of intellectual content and so on. There is information that might help a client application make a decision based on format (where certain local browser equipment is available) or on location (to save bandwidth). There are different types of user: a user as customer wishes to know the terms under which an object is available; a user as researcher may wish to have some extended documentation about a particular resource, its provenance for example. There are different types of resource. Some resources may have a fugitive existence, existing to satisfy some temporary need and only ever minimally described if at all; some are important and valuable scholarly or commercial resources, where the value of extensive description is recognised. Some resources may be simple; some may be complex in various ways. There will be many different information providers, some commercial 'yellow pages' type services, some scholarly or research-oriented services, in different organisational configurations with different target audiences and products. Metadata may be closely coupled with the object it describes as an intrinsic part of its composition; or it may have no intrinsic link with it at all. And so on ...
Thus, the nature of the problem to be solved suggests a variety of solutions. In the following sections we examine some characteristics of the environment in which network information of interest to European researchers is being created and some of the factors which are influencing the development of metadata services.
The discipline or control exercised over the production of collections of resources will improve as the web becomes a more mature publishing environment. There will be managed repositories of information objects. Such repositories may be managed by information producing organisations themselves, universities for example, by traditional and 'new' commercial publishers, or by other organisations (the Arts and Humanities Data Service in the UK, for example, or industrial and other research organisations, archives, image libraries, and so on). This is not to suggest that the existing permissive electronic publishing environment will not continue to exist in parallel. One concern of a managed repository will be that its contents are consistently disclosed and that descriptions are promulgated in such a way that potential users, whoever they might be, are alerted to potentially relevant resources in that repository.
Different repositories will have different requirements and priorities. Examples are a social science data archive, a university web site, a commercial publisher's collection of electronic journals, an archival finding list, and so on. Objects on a university web-site may be briefly and simply described. A data archive may need extensive documentation.
There will be a variety of metadata creators. These fall into three broad categories: 'authors', repository managers, and third party creators. As its importance becomes more apparent, 'authors' are likely to create descriptive metadata: a major incentive for this will be agreement about the use of the META tags in HTML documents for embedding metadata which will be harvested by programs. Descriptive data will be similarly embedded in other objects by those responsible for their creation. Metadata will also be created by repository managers, who have some responsibility for a resource and the data that describes it. Third party creators (including, for example, the information gateways being developed in DESIRE) create metadata for resources which they themselves may not manage or store.
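To illustrate the kind of author-embedded metadata referred to above, the following sketch shows Dublin Core-style elements carried in HTML META tags. The 'DC.'-prefixed names reflect one convention that has been discussed for this purpose rather than an agreed standard, and the resource and values are invented:

    <html>
    <head>
    <title>Report on coastal erosion</title>
    <!-- author-supplied descriptive metadata, harvestable by robots -->
    <meta name="DC.title"   content="Report on coastal erosion">
    <meta name="DC.creator" content="A. N. Author">
    <meta name="DC.subject" content="coastal erosion; geomorphology">
    <meta name="DC.date"    content="1997-01">
    </head>
    <body> ... </body>
    </html>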
Metadata may sit separately from the resources it describes; in some cases, it may be included as part of the resource. Embedded HTML META tags are probably the simplest example of the latter case, but embedding is also common in some of the domain-specific SGML frameworks described in the review section. For example, a TEI header needs to accompany conformant TEI documents. However, independent TEI headers may also exist, describing documents which may be physically remote.
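By way of illustration, a minimal TEI header contains a file description with title, publication and source statements. The sketch below follows the structure set out in the TEI Guidelines, with invented content:

    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>An electronic edition of an example text</title>
        </titleStmt>
        <publicationStmt>
          <p>Distributed by an example text archive</p>
        </publicationStmt>
        <sourceDesc>
          <p>Transcribed from a printed edition</p>
        </sourceDesc>
      </fileDesc>
    </teiHeader>

An independent header of this kind can describe a remote document, while an embedded header travels with the text it describes.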
Metadata, once created, may be shared with others. Take, for example, author-created metadata embedded in HTML documents. This may be collected by robot or other means. Value will be added to this data at various stages along whatever use chain it traverses: by a local repository manager, by subject-based services like the ones under consideration here, by crawler-based indexing services, and by various other intermediary services. These intermediary services might include librarians and others who now invest in current awareness and SDI (selective dissemination of information) services, as well, perhaps, as current abstracting and indexing services. Many authors may only provide basic information: typically they will not be conversant with controlled subject descriptor schemes, will not record all intellectual or formal relationships with other resources, and so on.
A different use chain might be traversed by fuller metadata associated with the scholarly edition of an electronic text, for example. Full documentary metadata would be available to assist in the analysis and use of the text, but a subset might be output to a general purpose discovery service. There might be a link back to the fuller metadata from the shorter record.
A number of factors, including the perceived value of a resource, will determine the relative balance between author-produced, added value and third-party original descriptions in different scenarios. The metadata ecology and economy is still in development.
The level of created metadata structure (however it is designed) and the level of intellectual input deemed necessary will depend on the perceived value of the resources and the environment of use.
Webcrawlers tend to describe individual web pages. Newer approaches based on manual description have initially tended to focus on servers, and not describe particular information objects on those servers or the relationships between objects. Most subject information gateways, such as those in the UK eLib project, fall into this category. Neither approach is complete as users are interested in resources at various levels of granularity and aggregation which may not be satisfied by either of these simplified approaches. There also exist a number of emerging approaches specialised for a particular community of users. Quite often, these are rich in terms of content and structure: they are created to represent the objects in a collection and the relationships between them. Examples from the archives, museums, and other communities are given below.
A tripartite division along these lines is further elaborated below. Web indexes based on robot extraction of (currently unstructured) metadata are cheap to create because they are automatic. Documentation of a particular collection by specialists is expensive. 'Information gateway' services add value through intellectual effort, and are correspondingly expensive. These factors will drive the creation of author-produced metadata and more sophisticated automatic extraction techniques. However, the creation of full, structured metadata will remain expensive, wherever along the use chain that cost falls.
Programs will collect and manipulate data in a variety of ways, passing it off to other applications in the process. Data may be collected and served up in various environments, converted on the fly to appropriate formats: it may be collected by robot, added to a local 'catalogue', or pulled into a subject-based service. The metadata we have been talking about refers to network information resources. This will need to be integrated at some level with the large, albeit highly fragmented, metadata resource for the print literature. There may also be metadata about people, about courses, about research departments and about other objects. Programs might periodically look for resources that match a particular user profile, might search for people with a particular research interest, and so on.
These developments will take place in a rapidly changing distributed environment in which directory protocols (e.g. whois++, LDAP), search and retrieve protocols (e.g. Z39.50), Harvest, and a variety of other approaches will be deployed. These will be hidden from the user by the web, itself likely to be transformed by the integration of Java and distributed object technologies.
Here we briefly examine characteristics of the metadata formats considered in this study, taking as a framework the broad categories which structure the format descriptions in Part II.
One can suggest an approximate grouping along a metadata spectrum which becomes successively richer in terms of fullness and structure. For purposes of analysis, we propose three bands within this spectrum, which allows us to sketch some shared characteristics across groups of formats. Any one format may not have all the characteristics of the band in which it is placed, but this grouping has proved beneficial in identifying the differences and similarities between formats.
|                        | Band One           | Band Two           | Band Three               |
| Record characteristics | Simple formats     | Structured formats | Rich formats             |
|                        | Proprietary        | Emerging standards | International standards  |
|                        | Full text indexing | Field structure    | Elaborate tagging        |
| Record formats         | Lycos              | Dublin Core        | ICPSR                    |
|                        | Altavista          | IAFA templates     | CIMI                     |
|                        | Yahoo etc          | RFC 1807           | EAD                      |
|                        |                    | SOIF               | TEI                      |
|                        |                    | LDIF               | MARC                     |
Band One includes relatively unstructured data, typically automatically extracted from resources and indexed for searching. The data has little explicit semantics and does not support searching by field.
Currently, this data is created by web crawlers. Many services exist based on such data, and several global services are in heavy use. If a user is looking for a known item, such services can be reasonably effective. Because they are global in scope and operate on limited descriptions, they are less effective for discovery. A user may find many resources, but may have to sift through them and will miss many potentially relevant resources because they are not indexed with appropriate terms. Nor, in many cases, is the metadata full enough to allow the user to make relevance judgements in advance of actually retrieving the resource. Typically, crawlers are not selective about the resources they index: they often aim for comprehensiveness at some level within their target area, whether that is the world or some part of it. For these reasons, they have some limitations as discovery services. These issues are well known and such services are seeking to enhance the metadata on which they operate: different services have different conventions to allow authors of web pages to include various categories of metadata which can then be collected. There is also some discussion about a common representation for the exchange of such metadata between global indexes and other services, and about the harvesting of fuller metadata. We do not look in detail at such indexes here as they are the subject of a future working paper in the Indexing and Cataloguing component of DESIRE.
Band two includes data which contains full enough description to allow a user to assess the potential utility or interest of a resource without having to retrieve it or connect to it. The data is structured and supports fielded searching. Typically these records are simple enough to be created by non-specialist users, or not to require significant discipline-specific knowledge. Descriptions tend to be of discrete objects and do not capture multiple relationships between objects. Typically, but not essentially, descriptions are manually created, or are manual enhancements of automatically extracted descriptions, and they include a variety of descriptive and other attributes. They may be created to be loaded directly into a discovery service or to be harvested.
Services in this area include OCLC's NetFirst (based on its own internal format) and the UK Electronic Libraries Programme subject-based information gateways (some of which use their own internal format; some use IAFA templates). Often, these services involve some selectivity in what they describe and may have more or less explicit criteria for selection. For these reasons, they may be expensive to create, again driving an interest in author- or publisher-generated description and in automatic extraction techniques such as those piloted by Essence as part of the Harvest software.
Our third band includes fuller descriptive formats which may be used for location and discovery, but which also have a role in documenting objects or, very often, collections of objects. Typically, they are associated with research or scholarly activity, require specialist knowledge to create and maintain, and cater for specialist domain-specific requirements. They are expressive enough to capture a variety of relationships at different levels. Developments described below include the Inter-university Consortium for Political and Social Research (ICPSR) SGML codebook initiative to describe social science data sets, the Encoded Archival Description, the Content Standard for Digital Geospatial Metadata and the Computer Interchange of Museum Information.
It should be clear that these are not watertight categories, especially as implementations may vary. GILS and CIMI object descriptions might be considered to be in the middle band for example.
Against this background one can note some trends, especially across the boundaries of these bands. Author or site produced metadata will become more important for many purposes. This may be harvested unselectively, or only from selected sites. An important motivation for this is to overcome some of the deficiencies of current crawlers without a provider incurring the cost of record creation. In some respects, the crawlers will assume some of the characteristics of the middle band.
At the same time, communities using the richer 'documentation' formats will wish to disclose information about their resources to a wider audience. How best to achieve this will have to be worked out: perhaps 'discovery' records will be exported into other systems. These trends suggest that the middle band will become more important as a general-purpose access route, maybe with links to richer domain-specific records in some cases.
There is currently no widely-used standard for data in band one, although amongst implementors of systems based on harvesting of simple metadata there are moves to develop an exchange format based on basic level SOIF. There is also a trend noted above to enhance the data collected by these services in various ways, making them better suited to discovery.
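For illustration, a SOIF record wraps a set of attribute-value pairs in a template keyed by a URL. In the sketch below the attribute names and values are invented, and the number in braces is the byte length of the value that follows:

    @FILE { http://www.example.ac.uk/report.html
    Title{14}: Example report
    Author{11}: A. N. Other
    Description{35}: A short example of a SOIF template.
    }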
The middle band metadata used in discovery services tends to be based on simple record structures influenced by RFC-822 style attribute-value pairs. Formats here do not contain elaborate internal structure, do not easily represent hierarchical or other aggregated objects, nor, typically, do they express the variety of relationships which might exist between objects. This is usually by design: there is a necessary trade-off between simplicity and expressiveness. Also, their purpose is to be hospitable to the non-specialist description of information objects of different types and from different domains, and so they are not concerned with the very specific requirements of any one domain. Of the discovery service formats which we examine here, IAFA templates are perhaps the most detailed. There are templates for different types of object (document, user, logical archive, etc.), and some consideration has been given to 'clusters' of data which are likely to be repeated across records and to variants within records.
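As a sketch, an IAFA document template is a similarly flat set of attribute-value pairs. The field names below indicate the general style of such templates, including the 'cluster' convention (the Author-* fields) and the variant suffix (-v1), rather than a definitive element set:

    Template-Type:   DOCUMENT
    Title:           Example report on coastal erosion
    URI-v1:          http://www.example.ac.uk/report.html
    Description:     A short descriptive abstract of the resource.
    Keywords:        coastal erosion, geomorphology
    Author-Name-v1:  A. N. Author
    Author-Email-v1: author@example.ac.uk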
There has been some interesting recent discussion about the future direction of the Dublin Core in this context. The Dublin Core is a simple resource description format. It could be extended in two ways. Firstly, it could be extended to accommodate elements which contain other types of metadata: terms and conditions, archival responsibility, administrative metadata and so on. Secondly, it could be extended to support resource description at different levels of fullness and within different communities. The IAFA document template is an example of one such format, USMARC another. We would argue that it is undesirable either that there be one single format for resource description or that a single format be indefinitely expanded to accommodate all future requirements. The need to retain a Dublin Core optimised for its target use, together with the need to exchange a variety of types of metadata, led to the proposed Warwick Framework (which is described in Part II). This is a container architecture for the aggregation of metadata objects and their interchange. However, such an architecture is not yet in place and implementation details are far from clear. It is therefore inevitable that there will be a continuing tension between simplicity and the need to provide more expressiveness or functionality.
Although the bulk of the formats in this range follow an attribute-value pair structure, it has been agreed that an SGML DTD will be developed for the Dublin Core. At the 'documentation end' of discovery it is likely that other formats will be found. MARC is a notable one which will be further considered below, but the encoding of choice is now likely to be SGML as in CIMI object descriptions.
Because of some similarity of construction and content across formats in this band, conversion between them, though inevitably lossy, is feasible.
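The kind of conversion involved can be sketched in a few lines of code. The hypothetical Python fragment below maps a handful of IAFA-style field names onto Dublin Core element names and simply drops anything for which no mapping is defined, which is where the loss occurs; both the field names and the mapping are illustrative rather than an agreed crosswalk:

    # Hypothetical crosswalk from IAFA-style template fields to Dublin Core.
    # Fields with no mapping are dropped, so the conversion is lossy.
    IAFA_TO_DC = {
        "Title": "Title",
        "Author-Name-v1": "Creator",
        "Keywords": "Subject",
        "Description": "Description",
        "URI-v1": "Identifier",
    }

    def crosswalk(record):
        """Convert an IAFA-style record (field -> value) to Dublin Core elements."""
        return {IAFA_TO_DC[field]: value
                for field, value in record.items()
                if field in IAFA_TO_DC}

    iafa_record = {
        "Template-Type": "DOCUMENT",   # no Dublin Core equivalent: dropped
        "Title": "Example report",
        "Author-Name-v1": "A. N. Author",
        "URI-v1": "http://www.example.ac.uk/report.html",
    }
    print(crosswalk(iafa_record))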
The documentation band contains some very full frameworks for the description of multiple aspects of objects and collections of objects. In some cases, these frameworks treat metadata objects as only one type of information object among others: they are concerned with 'information content' as well. Typically, work is proceeding within an SGML context and the example of the Text Encoding Initiative has been quite influential. Within the social science, museum, archive and geospatial data communities work is progressing on establishing DTDs. These may relate to collection level description and item level description, and allow various levels of aggregation and linkage appropriate to the domain. They cater for a very full range of attributes appropriate to documenting data sets or other resources. They can be distinguished from the formats in the middle band by fullness (they go into more detail), structure (they contain richer structuring devices), and specialism (they may be specific to the relevant domain).
It seems likely that specialist users will want to search such data directly, but that to make data more visible to more general 'discovery' tools, there may be export of data in some of what we have called 'discovery' formats. Indeed, the Dublin Core has been explicitly positioned as a basis for semantic interoperability across richer formats, although it has not been widely used in this context.
Middle band discovery services are being delivered through emerging distributed searching and directory approaches on the Internet, notably whois++, LDAP, and Dienst. There is some use of Z39.50 also, notably for GILS.
Band three documentation approaches are in early stages. However, there has been some discussion of using Z39.50 for search and retrieve in several cases. In particular, there has been some interest in the Z39.50 profile for access to digital collections <URL:http://lcweb.loc.gov/z3950/agency/profiles/collections.html>.
Standards-based resource discovery services are also in early stages. Examination of the descriptions collected in Part II of this report will show that many formats are still under development or are not widely implemented.
In Band 3, the 'documentation category', in particular, communities of users are working towards consensus and in some cases robust interoperating implementations are some time away.
In Band 2, the 'discovery category', IAFA/whois++ templates are in use in several projects, and are deployed in whois++ directory services. Dublin Core is being piloted in several projects, but an agreed syntax is only now being defined. RFC-1807 is used within the NCSTRL project <URL:http://www.ncstrl.org>. SOIF is widely used as the internal format for Harvest, but there are no agreed 'content' definitions. LDIF is in a similar position, lacking an agreed set of schemas for resource description. LDIF and SOIF have attracted much interest as a result of Netscape's decision to base its directory server and catalog server products on LDAP and Harvest respectively.
Of course, an exception to this shallowness of implementation experience is provided by MARC and MARC-like formats. There are many millions of MARC records worldwide, and there are elaborate organisational and technical infrastructures in place for creating and sharing them. MARC is special in this context because of its long established use and its centrality in the library community for describing print resources. There are several initiatives attempting to integrate descriptions of print and electronic resources through the use of MARC and some of these are described in the entries for Pica+ (not a MARC format, but a close analogue), MARC, UKMARC and USMARC. Some library organisations have a vested interest in using MARC for the description of network resources as it simplifies meshing existing systems with new requirements. It should be noted that MARC records are only standardised at a certain level. ISO 2709 standardises a physical encoding for records. However, each national or other format defines its own set of designators, and different rules determine the format of the data content. Several national formats have made changes to accommodate electronic resources. It is likely that conversion into and out of MARC will always be an issue that may have to be addressed by service providers in some contexts.
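For illustration, a USMARC-style description of a networked resource, shown here in a human-readable tagged form rather than in the ISO 2709 transfer encoding, might include fields along the following lines. Tags 100, 245 and 856 are the USMARC designators for personal author, title statement and electronic location; the indicators, subfield layout and content shown are purely illustrative:

    100 1  $a Author, A. N.
    245 10 $a Example report on coastal erosion $h [computer file]
    856 4  $u http://www.example.ac.uk/report.html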
The majority of existing Z39.50 applications involve searching of MARC based resources. However, this may gradually change as other profiles are introduced.
It is clear then, that the organisational, service and technology contexts of resource discovery services are not stable and that risk-free selection of approaches or confident prediction of future scenarios is not possible. Importantly, there is no single driving agency for these developments. Vested interests, competitive advantage, integration with legacy systems or custom and practice will always mean that there are differences of approach.
Choices made within DESIRE must acknowledge this wider context of change.
A number of crosswalks (high level mapping tables for conversion) are now available and can be referenced at UKOLN's metadata web pages at <URL:http://www.ukoln.ac.uk/metadata/interoperability/>.