Metadata: an overview of current resource description practice
Work Package 3 of Telematics for Research project DESIRE (no. 1004)

Metadata and its uses

Metadata is data which describes attributes of a resource. Typically, it supports a number of functions: location, discovery, documentation, evaluation, selection and others. These activities may be carried out by human end-users or their (human or automated) agents.

It is recognised that in an indefinitely large resource space, effective management of networked information will increasingly rely on effective management of metadata. The need for metadata services is already clear in the current Internet environment. As the Internet develops into a mixed information economy deploying multiple application protocols and formats, this need will be all the greater. Metadata is not only key to discovery, it will also be fundamental to effective use of found resources (by establishing the technical or business frameworks in which they can be used) and to interoperability across protocol domains.

Part II of this report describes a range of metadata formats. It is unlikely that some monolithic metadata format will be universally used. This is for a number of more or less well-known reasons. There is a variety of types of metadata. There is traditional descriptive information of the kind found in library catalogues, which typically includes such attributes as author, title, some indication of intellectual content and so on. There is information that might help a client application make a decision based on format (where certain local browser equipment is available) or on location (to save bandwidth). There are different types of user: a user as customer wishes to know the terms under which an object is available; a user as researcher may wish to have some extended documentation about a particular resource, its provenance for example. There are different types of resource. Some resources may have a fugitive existence, existing to satisfy some temporary need and only ever minimally described if at all; some are important and valuable scholarly or commercial resources, where the value of extensive description is recognised. Some resources may be simple; some may be complex in various ways. There will be many different information providers, some commercial 'yellow pages' type services, some scholarly or research-oriented services, in different organisational configurations with different target audiences and products. Metadata may be closely coupled with the object it describes as an intrinsic part of its composition; or it may have no intrinsic link with it at all. And so on ...

Thus, the nature of the problem to be solved suggests a variety of solutions. In the following sections we examine some characteristics of the environment in which network information of interest to European researchers is being created and some of the factors which are influencing the development of metadata services.

Control and the publishing environment

The discipline or control exercised over the production of collections of resources will improve as the web becomes a more mature publishing environment. There will be managed repositories of information objects. Such repositories may be managed by information producing organisations themselves, universities for example, by traditional and 'new' commercial publishers, or by other organisations (the Arts and Humanities Data Service in the UK, for example, or industrial and other research organisations, archives, image libraries, and so on). This is not to suggest that the existing permissive electronic publishing environment will not continue to exist in parallel. One concern of a managed repository will be that its contents are consistently disclosed and that descriptions are promulgated in such a way that potential users, whoever they might be, are alerted to potentially relevant resources in that repository.

Different repositories will have different requirements and priorities. Examples are a social science data archive, a university web site, a commercial publisher's collection of electronic journals, an archival finding list, and so on. Objects on a university web site may be briefly and simply described. A data archive may need extensive documentation.

A variety of metadata creators and sources

There will be a variety of metadata creators. These fall into three broad categories: 'authors', repository managers, and third party creators. As the importance of metadata becomes more apparent, 'authors' are likely to create descriptive metadata: a major incentive for this will be agreement about the use of the META tags in HTML documents for embedding metadata which will be harvested by programs. Descriptive data will be similarly embedded in other objects by those responsible for their creation. Metadata will also be created by repository managers, who have some responsibility for a resource and the data that describes it. Third party creators (including, for example, the information gateways being developed in DESIRE) create metadata for resources which they themselves may not manage or store.
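As a minimal sketch of how a program might harvest author-created metadata from META tags, the following Python fragment uses the standard library HTML parser. The sample page, the class name MetaHarvester and the attribute values "author" and "keywords" are illustrative assumptions; the report does not prescribe any particular set of names or any particular harvesting program.

```python
from html.parser import HTMLParser


class MetaHarvester(HTMLParser):
    """Collect name/content pairs from META tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # A name may be repeated (several authors, say), so keep a list.
            self.metadata.setdefault(name.lower(), []).append(content)


def harvest(html_text):
    """Return a dict mapping each META name to its list of content values."""
    parser = MetaHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


# Illustrative page only; the names and values are assumptions.
SAMPLE_PAGE = """
<html><head>
<title>Sample page</title>
<META NAME="author" CONTENT="A. N. Author">
<META NAME="keywords" CONTENT="metadata, resource description">
</head><body><p>Text.</p></body></html>
"""

record = harvest(SAMPLE_PAGE)
```

A real harvester would also need to fetch pages, cope with malformed markup and agree on which META names to recognise; none of that is attempted here.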

Metadata may sit separately from the resources it describes; in some cases, it may be included as part of the resource. Embedded HTML tags are probably the simplest example of the latter case, but embedding is also common in some of the domain-specific SGML frameworks described in the review section. For example, a TEI header needs to accompany conformant TEI documents. However, independent TEI headers may also exist, which describe documents which may be physically remote.

Metadata, once created, may be shared with others. Take, for example, author-created metadata embedded in HTML documents. This may be collected by robot or other means. Value will be added to this data at various stages along whatever use chain it traverses: by a local repository manager, by subject-based services like the ones under consideration here, by crawler-based indexing services, by various other intermediary services. These intermediary services might include librarians and others who now invest in current awareness and SDI (selective dissemination of information) services, as well, maybe, as current abstracting and indexing services. Many authors may only provide basic information: typically they will not be conversant with controlled subject descriptor schemes, nor will they record all intellectual or formal relationships with other resources, and so on.

A different use chain might be traversed by fuller metadata associated with the scholarly edition of an electronic text, for example. Full documentary metadata would be available to assist in the analysis and use of the text, but a subset might be output to a general purpose discovery service. There might be a link back to the fuller metadata from the shorter record.
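The relationship between a full record and the subset passed to a discovery service can be sketched as follows. This is an illustration under stated assumptions: the field names, the function to_discovery_record, the key "full_metadata" and the example.org URL are all hypothetical and are not drawn from any of the formats reviewed in this report.

```python
def to_discovery_record(full_record, full_record_url,
                        fields=("title", "author", "subject")):
    """Return a short record holding only `fields`, plus a link back.

    The default field list is an assumption made for this sketch.
    """
    short = {name: full_record[name] for name in fields if name in full_record}
    # Pointer from the short record to the fuller documentary metadata.
    short["full_metadata"] = full_record_url
    return short


# Hypothetical full record for a scholarly electronic text.
full = {
    "title": "An electronic edition",
    "author": "A. N. Editor",
    "subject": "textual scholarship",
    "encoding_description": "Transcribed from the printed edition.",
    "revision_history": ["first release"],
}

short = to_discovery_record(full, "http://example.org/headers/edition-1")
```

The short record carries enough for discovery, while the documentary detail (here the encoding description and revision history) stays with the full record and is reached through the link.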

A number of factors, including the perceived value of a resource, will determine the relative balance between author-produced, added value and third-party original descriptions in different scenarios. The metadata ecology and economy is still in development.

Structure and fullness

The level of created metadata structure (however it is designed) and the level of intellectual input deemed necessary will depend on the perceived value of the resources and the environment of use.

Webcrawlers tend to describe individual web pages. Newer approaches based on manual description have initially tended to focus on servers, and not describe particular information objects on those servers or the relationships between objects. The subject information gateways fall into this category. Neither approach is complete as users are interested in resources at various levels of granularity and aggregation which may not be satisfied by either of these simplified approaches. There also exist a number of emerging approaches specialised for a particular community of users. Quite often, these are rich in terms of content and structure: they are created to represent the objects in a collection and the relationships between them. Examples from the archives, museums, and other communities are given below.

A tripartite division along these lines is further elaborated below. Web indexes based on robot extraction of (currently unstructured) metadata are automatic and therefore cheap to create. Documentation of a particular collection by specialists is expensive. 'Information gateway' services add value through intellectual effort, and are correspondingly expensive. These factors will drive the creation of author-produced metadata and more sophisticated automatic extraction techniques. However, the creation of full, structured metadata will remain expensive, wherever along the use chain that cost falls.

A distributed environment

Programs will collect and manipulate data in a variety of ways, passing it off to other applications in the process. Data may be collected and served up in various environments, converted on the fly to appropriate formats: it may be collected by robot, added to a local 'catalogue', or pulled into a subject-based service. The metadata we have been talking about refers to network information resources. This will need to be integrated at some level with the large, albeit highly fragmented, metadata resource for the print literature. There may also be metadata about people, about courses, about research departments and about other objects. Programs might periodically look for resources that match a particular user profile, might search for people with a particular research interest, and so on.

These developments will take place in a rapidly changing distributed environment in which directory protocols (e.g. Whois++, LDAP), search and retrieve protocols (e.g. Z39.50), Harvest, and a variety of other approaches will be deployed. These will be hidden from the user by the web, itself likely to be transformed by the integration of Java and distributed object technologies.
