Metadata: an overview of current resource description practice
Work Package 3 of Telematics for Research project DESIRE (no. 1004)

Some characteristics of investigated metadata formats

Here we briefly examine characteristics of the metadata formats considered in this study, taking as a framework the broad categories which structure the format descriptions in Part II.

One can suggest an approximate grouping along a metadata spectrum which becomes successively richer in terms of fullness and structure. For purposes of analysis, we propose three bands within this spectrum, which allows us to sketch some shared characteristics across groups of formats.

Band 1	Band 2	Band 3
Unstructured indexes	Dublin Core	ICPSR
	IAFA	FGDC
	RFC 1807	CIMI
		TEI
	SOIF	EAD
	LDIF	...

Environment of use

Band one includes relatively unstructured data, typically automatically extracted from resources and indexed for searching. The data has little explicit semantics and does not support searching by field.

Currently, this data is created by web crawlers. Many services exist based on such data, and several global services are in heavy use. If a user is looking for a known item, they can be reasonably effective. Because they are global in scope and operate on limited descriptions, they are less effective for discovery. A user may find many resources, but may have to sift through them, and will miss many potentially relevant resources because they are not indexed with appropriate terms. Nor, in many cases, is the metadata full enough to allow the user to make relevance judgements in advance of actually retrieving the resource. Typically, crawlers are not selective about the resources they index: they often aim for comprehensiveness at some level within their target area, whether that is the world or some part of it. For these reasons, they have some limitations as discovery services. These issues are well known and such services are seeking to enhance the metadata on which they operate: different services have different conventions to allow authors of web pages to include various categories of metadata which can then be collected. There is also some discussion about a common representation for the exchange of such metadata between global indexes and other services, and the harvesting of fuller metadata. We do not look in detail into such indexes here as they are the subject of another working paper in the Indexing and Cataloguing component of DESIRE (number ..).

Band two includes data which contains a description full enough to allow a user to assess the potential utility or interest of a resource without having to retrieve it or connect to it. The data is structured and supports fielded searching. Typically these records are simple enough to be created by non-specialist users, or do not require significant discipline-specific knowledge. Descriptions tend to be of discrete objects and do not capture multiple relationships between objects. Typically, though not necessarily, descriptions are manually created, or are manual enhancements of automatically extracted descriptions, and they include a variety of descriptive and other attributes. They may be created to be loaded directly into a discovery service or to be harvested.

Services in this area include OCLC's NetFirst (based on its own internal format) and the UK Electronic Libraries Programme subject-based information gateways (some of which use their own internal format; some use IAFA templates). Often, these services involve some selectivity in what they describe and may have more or less explicit criteria for selection. For these reasons, they may be expensive to create, again driving an interest in author- or publisher-generated description and automatic extraction techniques such as those piloted by Essence as part of the Harvest software.

Our third band includes fuller descriptive formats which may be used for location and discovery, but also have a role in documenting objects or, very often, collections of objects. Typically, they are associated with research or scholarly activity, require specialist knowledge to create and maintain, and cater for specialist domain-specific requirements. They are expressive enough to capture a variety of relationships at different levels. Developments described below include the ICPSR SGML codebook initiative to describe social science data sets, the Encoded Archival Description, the Content Standard for Digital Geospatial Metadata and Computer Interchange of Museum Information.

It should be clear that these are not watertight categories, especially as implementations may vary. GILS and CIMI object descriptions, for example, might be considered to belong to the middle band.

Against this background one can note some trends, especially across the boundaries of these bands. Author- or site-produced metadata will become more important for many purposes. This may be harvested unselectively, or only from selected sites. An important motivation for this is to overcome some of the deficiencies of current crawlers without a provider incurring the cost of record creation. In some respects, the crawlers will assume some of the characteristics of the middle band.

At the same time, communities using the richer 'documentation' formats will wish to disclose information about their resources to a wider audience. How best to achieve this will have to be worked out: perhaps 'discovery' records will be exported into other systems. These trends suggest that the middle band will become more important as a general-purpose access route, maybe with links to richer domain-specific records in some cases.

Format issues

Metadata formats

There is currently no widely-used standard for data in band one, though there are moves to develop a shared format for exchange, perhaps based on SOIF. There is also a trend noted above to enhance the data collected by these services in various ways, making them better suited to discovery.

The middle band metadata used in discovery services tends to be based on simple record structures influenced by RFC-822 style attribute-value pairs. Formats here do not contain elaborate internal structure, do not easily represent hierarchical or other aggregated objects, nor, typically, do they express the variety of relationships which might exist between objects. This is usually by design: there is a necessary trade-off between simplicity and expressiveness. Also, their purpose is to be hospitable to the non-specialist description of information objects of different types and from different domains, and so they are not concerned with the very specific requirements of any one domain. Of the discovery service formats which we examine here, IAFA templates are perhaps the most detailed. There are templates for different types of object (document, user, logical archive, etc.), and there has been some consideration given to 'clusters' of data which are likely to be repeated across records and to variants within records.
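
As an illustration of how little machinery an RFC-822 style attribute-value record needs, the following minimal sketch parses such a record into named, repeatable fields. The attribute names in the sample are illustrative assumptions only and are not taken from any format specification; the continuation-line rule (indented lines extend the previous value) follows the general RFC-822 convention.

```python
def parse_record(text):
    """Parse 'Attribute: value' lines into a dict mapping names to value lists.

    A line that starts with whitespace continues the previous value.
    Repeated attributes accumulate, so a field may hold several values.
    """
    record = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and last is not None:
            record[last][-1] += " " + line.strip()
            continue
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not an attribute-value line: {line!r}")
        last = name.strip()
        record.setdefault(last, []).append(value.strip())
    return record


# Hypothetical sample record; the attribute names are assumptions.
SAMPLE = """\
Template-Type: DOCUMENT
Title: Metadata: an overview of current
  resource description practice
Keyword: metadata
Keyword: resource discovery
"""

record = parse_record(SAMPLE)
```

The result is a flat mapping from attribute names to lists of values. Nothing in the structure expresses hierarchy or relationships between objects, which reflects the trade-off between simplicity and expressiveness described above.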

There has been some interesting recent discussion about the future direction of the Dublin Core in this context. The Dublin Core is a simple resource description format. It could be extended in two ways. Firstly, it could be extended to accommodate elements which contain other types of metadata: terms and conditions, archival responsibility, administrative metadata and so on. Secondly, it could be designed for resource description of different levels of fullness and within different communities. The IAFA document template is an example of one such format, USMARC another. We would argue that it is undesirable either that there be one single format for resource description or that a single format be indefinitely expanded to accommodate all future requirements. The need to retain a Dublin Core optimised for its target use together with the need to exchange a variety of types of metadata led to the proposed Warwick Framework (which is described in Part II). This is a container architecture for the aggregation of metadata objects and their interchange. However, such an architecture is not yet in place and implementation details are far from clear. It is therefore inevitable that there be a continuing tension between simplicity and the need to provide more expressiveness or functionality.
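
The container idea can be pictured with a very small sketch. It is not a rendering of the Warwick Framework proposal itself, whose implementation details are, as noted, unsettled. The class names, the scheme labels and the example URL are all hypothetical, chosen only to show separately maintained metadata sets travelling together while a simple description stays simple.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """One typed metadata set carried inside a container (hypothetical)."""
    scheme: str      # assumed label, e.g. "dublin-core"
    elements: dict   # element name -> value


@dataclass
class Container:
    """Aggregates independent packages that describe one resource."""
    resource: str
    packages: list = field(default_factory=list)

    def add(self, package):
        self.packages.append(package)

    def find(self, scheme):
        """Return every package of the given scheme, ignoring the rest."""
        return [p for p in self.packages if p.scheme == scheme]


# Illustrative use: descriptive and rights metadata kept as separate sets.
container = Container("http://example.org/report")
container.add(Package("dublin-core", {"Title": "An example report"}))
container.add(Package("terms-and-conditions", {"Access": "unrestricted"}))
descriptive = container.find("dublin-core")
```

A client interested only in discovery can select the descriptive package and ignore the others, so the simple format need not grow to absorb every other kind of metadata.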

Although the bulk of the formats in this range follow an attribute-value pair structure, it has been agreed that an SGML DTD will be developed for the Dublin Core. At the 'documentation end' of discovery it is likely that other formats will be found. MARC is a notable one which will be further considered below, but the encoding of choice is now likely to be SGML as in CIMI object descriptions.

Because of some similarity of construction and content across formats in this band, conversion between them, though inevitably lossy, is feasible.
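
A minimal sketch of why such conversion is feasible yet lossy: a crosswalk table maps source attributes to target elements, and anything without a counterpart is dropped. The attribute names on both sides and the mapping itself are assumptions made for illustration, not an agreed crosswalk between any two formats.

```python
# Hypothetical crosswalk from template-style attributes to simpler elements.
CROSSWALK = {
    "Title": "Title",
    "Author-Name": "Creator",
    "Keyword": "Subject",
    "Description": "Description",
}


def convert(record, crosswalk=CROSSWALK):
    """Map a record (name -> list of values) through a crosswalk.

    Returns the converted record and a sorted list of the attributes
    that had no target element and were therefore lost.
    """
    converted = {}
    dropped = []
    for name, values in record.items():
        target = crosswalk.get(name)
        if target is None:
            dropped.append(name)
        else:
            converted.setdefault(target, []).extend(values)
    return converted, sorted(dropped)


# Illustrative source record; the values are invented for the example.
SOURCE = {
    "Title": ["An example report"],
    "Author-Name": ["A. Author"],
    "Keyword": ["metadata", "resource discovery"],
    "Admin-Handle": ["rec-0001"],
    "Last-Revision-Date": ["1997-01-01"],
}

converted, dropped = convert(SOURCE)
```

Reporting the dropped attributes makes the loss explicit, which matters when a service needs to judge whether a converted record is still adequate for its purpose.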

The documentation band contains some very full frameworks for the description of multiple aspects of objects and collections of objects. In some cases, the frameworks describe metadata objects as only one type of information object: they are concerned with 'information content' also. Typically, work is proceeding within an SGML context and the example of the Text Encoding Initiative has been quite influential. Within the social science, museums, archives and geospatial data communities work is progressing on establishing DTDs. These may relate to collection level description, item level description, and allow various levels of aggregation and linkage appropriate to the domain. They cater for a very full range of attributes appropriate to documenting data sets or other resources. These can be distinguished from the range in the middle band by fullness (they go into more detail), structure (they contain richer structuring devices), and specialism (they may be specific to the relevant domain).

It seems likely that specialist users will want to search such data directly, but that to make data more visible to more general 'discovery' tools, there may be export of data in some of what we have called 'discovery' formats. Indeed, the Dublin Core has been explicitly positioned as a basis for semantic interoperability across richer formats, although it has not been widely used in this context.

Protocol issues

Middle band discovery services are being delivered through emerging distributed searching and directory approaches on the Internet, notably Whois++, LDAP, and Dienst. There is some use of Z39.50 also, notably for GILS.

Band three documentation approaches are in early stages. However, there has been some discussion of using Z39.50 for search and retrieve in several cases. In particular, there has been some interest in the Z39.50 profile for access to digital collections <URL: http://lcweb.loc.gov/Z3950/agency/profiles/collections.html>.

Implementations

Standards-based resource discovery services are also in early stages. Examination of the descriptions collected in Part II of this report will show that many formats are still under development or are not widely implemented.

In Band 3, the 'documentation category', in particular, communities of users are working towards consensus and in some cases robust interoperating implementations are some time away.

In Band 2, the 'discovery category', IAFA/Whois++ templates are in use in several projects, and are deployed in Whois++ directory services. Dublin Core is being piloted in several projects, but an agreed syntax is only now being defined. RFC-1807 is used within the NCSTRL project <URL: http://www.ncstrl.org>. SOIF is widely used as the internal format for Harvest, but there are no agreed 'content' definitions. LDIF is in a similar position, lacking an agreed set of schemas for resource description. LDIF and SOIF have attracted much interest as a result of Netscape's decision to base its directory server and catalog server products on LDAP and Harvest respectively.

Of course, an exception to this shallowness of implementation experience is MARC and MARC-like formats. There are many millions of MARC records worldwide, and there are elaborate organisational and technical infrastructures in place for creating and sharing them. MARC is special in this context because of its long established use and its centrality in the library community for describing print resources. There are several initiatives attempting to integrate descriptions of print and electronic resources through the use of MARC and some of these are described in the entries for Pica+ (not a MARC format, but a close analogue), MARC, UKMARC and USMARC. Some library organisations have a vested interest in using MARC for the description of network resources as it simplifies meshing existing systems with new requirements. It should be noted that MARC records are only standardised at a certain level. ISO 2709 standardises a physical encoding for records. However, each national or other format defines its own set of designators and different rules determine the format of the data content. Several national formats have made changes to accommodate electronic resources. It is likely that conversion into and out of MARC will always be an issue that may have to be addressed by service providers in some contexts.
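
The distinction between the shared physical encoding and format-specific content can be sketched as follows. The code assumes the ISO 2709 layout as commonly used by MARC formats: a 24-character leader, a directory of fixed-width entries whose widths are given by the leader's entry map, and field and record terminator characters. The filler values in the leader and the two tags in the usage note are illustrative assumptions, and field content is deliberately treated as opaque bytes.

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator


def build_record(fields):
    """Assemble a minimal record from (tag, bytes) pairs (illustrative only)."""
    directory = b""
    data = b""
    for tag, value in fields:
        body = value + FIELD_END
        # Entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += f"{tag:0>3}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = 24 + len(directory) + 1   # leader + directory + its terminator
    total = base + len(data) + 1     # plus the record terminator
    # Positions 5-9 and 17-19 hold assumed filler; 20-23 is the entry map.
    leader = f"{total:05d}nam  22{base:05d}   4500".encode("ascii")
    return leader + directory + FIELD_END + data + RECORD_END


def parse_record(raw):
    """Read the structural layer only: return (tag, content bytes) pairs."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])        # base address of the data
    len_w = int(leader[20])          # width of the length-of-field part
    start_w = int(leader[21])        # width of the starting-position part
    impl_w = int(leader[22])         # width of the implementation-defined part
    entry_w = 3 + len_w + start_w + impl_w
    directory = raw[24:base - 1]     # exclude the directory terminator
    fields = []
    for i in range(0, len(directory), entry_w):
        entry = directory[i:i + entry_w].decode("ascii")
        tag = entry[:3]
        length = int(entry[3:3 + len_w])
        start = int(entry[3 + len_w:3 + len_w + start_w])
        body = raw[base + start:base + start + length]
        fields.append((tag, body.rstrip(FIELD_END)))
    return fields
```

For example, `parse_record(build_record([("245", b"Example title"), ("856", b"http://example.org/")]))` recovers the two tagged fields without any knowledge of what the tags mean. Interpreting the tags, indicators and subfields is exactly the part that each national or other format defines for itself.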

The majority of existing Z39.50 applications involve searching of MARC-based resources. However, this may gradually change as other profiles are introduced.
