Caroline Arms, Andy Powell, Mogens Sandfaer
Each OAI repository can be thought of as a collection: a collection of metadata records and/or "full content" items. The collection may optionally be partitioned into one or more sub-collections, known as sets. OAI repositories need not contain "full content" items. For example, a subject gateway (a database of metadata about remote Internet resources) is a repository containing a collection of records but no items. A pre-print archive is an example of a repository that contains both a collection of items and a collection of records.
This document considers the description of OAI repositories and sets and makes recommendations for mechanisms to encode such collection descriptions within the OAI Metadata Harvesting Protocol (OAI-MHP). The intention is to develop a generic collection description mechanism that can be used across all OAI repositories, i.e. one that is applicable to all domains, and that is rich enough to support:
Ideally, the descriptions should be simple enough that they will be provided by the majority of repositories.
This document does not consider the more general issues of collection description, nor does it consider how the OAI-MHP can be used to harvest collection description metadata about arbitrary external collections, though that is a perfectly valid use of the protocol.
There is one existing collection description mechanism within the OAI-MHP. The response to an Identify request may contain a list of description containers, which provide an extensible mechanism for communities to describe their repositories.
The 1.1 specification contains two sample XML Schemas - oai-identifier and eprints:
Note that the oai-identifier schema does not provide a true collection description; it merely describes the format of identifiers used by the repository. The eprints schema provides a collection description covering both the items (dataPolicy) and the records (metadataPolicy). However, the eprints schema is fairly minimal. For example, there is no indication of the subject coverage of the repository.
In general, responses to the Identify request describe the repository as a whole - there is no agreed mechanism for separately describing the collection of items and the collection of records within the same repository.
The current version of the protocol provides no mechanism for describing the sets within a repository, other than providing the setName as part of the response to a ListSets request.
An analysis of the usage of sets by currently registered OAI repositories is available in Appendix A.
Out of 49 repositories, 39 are using sets. Of these, 13 appear to partition their collection by subject area, 13 by genre, and 9 by source of records.
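As a quick check on the figures above, the following sketch turns the counts into percentages. The counts are taken directly from the text; the variable and function names are illustrative only.

```python
# Counts quoted in the text (49 repositories surveyed, 39 using sets).
TOTAL_REPOSITORIES = 49
USING_SETS = 39
BY_PARTITION = {"subject area": 13, "genre": 13, "source of records": 9}


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


using_sets_pct = share(USING_SETS, TOTAL_REPOSITORIES)
partition_pct = {k: share(v, USING_SETS) for k, v in BY_PARTITION.items()}

# Repositories using sets that the text does not attribute to one of the
# three named partitioning criteria.
unattributed = USING_SETS - sum(BY_PARTITION.values())

print(using_sets_pct, partition_pct, unattributed)
```

Roughly four in five repositories use sets, and the three named criteria account for 35 of the 39; the text does not attribute the remaining 4 to a single criterion.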
In order to share descriptions of repositories and sets, a collection description schema needs to be agreed. Five possibilities are suggested here:
Of these, there is some benefit in using simple Dublin Core because of its use elsewhere in the protocol. EAD has a strictly archival background and may not be applicable across the whole range of OAI repositories. The RSLP schema may be considered to be over-complex for use within OAI. The eprints schema may not be considered complex enough. UDDI primarily focuses on 'service' description, though it may also provide a framework for describing collections.
It is worth noting that mappings between the DC, EAD and RSLP schemas already exist.
As described above, the OAI-MHP already provides a mechanism for describing the repository as a whole, using the description container within the response to an Identify request. All of the schemas listed above could be encoded within the description container, provided a suitable XML schema is made available.
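To illustrate how a harvester might discover which schemas a repository uses in its description containers, here is a minimal sketch using Python's standard library. The sample Identify response is hypothetical: the repository name, the admin address, and the namespace URIs for Identify and eprints are assumptions modelled on the 1.1 naming pattern, not copied from the specification.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the Identify response (follows the OAI 1.1 pattern).
IDENTIFY_NS = "http://www.openarchives.org/OAI/1.1/OAI_Identify"

# Hypothetical Identify response carrying two description containers.
SAMPLE_IDENTIFY = """
<Identify xmlns="http://www.openarchives.org/OAI/1.1/OAI_Identify">
  <repositoryName>Example Repository</repositoryName>
  <baseURL>http://an.oa.org/OAI-script</baseURL>
  <protocolVersion>1.1</protocolVersion>
  <adminEmail>mailto:admin@an.oa.org</adminEmail>
  <description>
    <dc xmlns="http://purl.org/dc/elements/1.1/">
      <subject>Marine biology</subject>
    </dc>
  </description>
  <description>
    <eprints xmlns="http://www.openarchives.org/OAI/1.1/eprints">
      <metadataPolicy><text>Free to harvest.</text></metadataPolicy>
    </eprints>
  </description>
</Identify>
"""


def description_payloads(xml_text):
    """Return (namespace, local name) for the payload of each description."""
    root = ET.fromstring(xml_text.strip())
    payloads = []
    for desc in root.findall(f"{{{IDENTIFY_NS}}}description"):
        for child in desc:
            if child.tag.startswith("{"):
                # ElementTree writes qualified names as "{namespace}local".
                ns, _, local = child.tag[1:].partition("}")
            else:
                ns, local = "", child.tag
            payloads.append((ns, local))
    return payloads


for ns, local in description_payloads(SAMPLE_IDENTIFY):
    print(local, ns)
```

A harvester could use the namespace of each payload to decide whether it understands that collection description schema, and ignore the containers it does not recognise.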
To provide descriptions of the sets within a repository the protocol will need to be enhanced. For example, it would be possible to add an optional setDescription container within the response to a ListSets request. A response might look like this:
    <?xml version="1.0" encoding="UTF-8"?>
    <ListSets xmlns="http://www.openarchives.org/OAI/1.1/OAI_ListSets"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.openarchives.org/OAI/1.1/OAI_ListSets
                                  http://www.openarchives.org/OAI/1.1/OAI_ListSets.xsd">
      <responseDate>2001-06-01T19:20:30-04:00</responseDate>
      <requestURL>http://an.oa.org/OAI-script?verb=ListSets</requestURL>
      <set>
        <setSpec>Oceanside</setSpec>
        <setName>Oceanside University of Nebraska</setName>
        <setDescription>
          <dc xmlns="http://purl.org/dc/elements/1.1/"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://purl.org/dc/elements/1.1/
                                  http://www.openarchives.org/OAI/1.1/dc.xsd">
            <subject>Marine biology</subject>
            <publisher>University of Nebraska</publisher>
            <rights>Metadata may be used without restrictions as long as the oai identifier remains attached to it.</rights>
          </dc>
        </setDescription>
      </set>
    </ListSets>
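A harvester could read such a response along the following lines. This is a sketch only: setDescription is the enhancement proposed here, not part of the current protocol, and the sample below is a trimmed copy of the response above (the XML declaration and the xsi attributes are omitted for brevity). The function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Namespaces taken from the sample response.
NS = {
    "oai": "http://www.openarchives.org/OAI/1.1/OAI_ListSets",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed copy of the proposed ListSets response.
SAMPLE = """
<ListSets xmlns="http://www.openarchives.org/OAI/1.1/OAI_ListSets">
  <responseDate>2001-06-01T19:20:30-04:00</responseDate>
  <requestURL>http://an.oa.org/OAI-script?verb=ListSets</requestURL>
  <set>
    <setSpec>Oceanside</setSpec>
    <setName>Oceanside University of Nebraska</setName>
    <setDescription>
      <dc xmlns="http://purl.org/dc/elements/1.1/">
        <subject>Marine biology</subject>
        <publisher>University of Nebraska</publisher>
        <rights>Metadata may be used without restrictions as long as the oai identifier remains attached to it.</rights>
      </dc>
    </setDescription>
  </set>
</ListSets>
"""


def describe_sets(xml_text):
    """Map each setSpec to its setName and Dublin Core description fields."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for set_el in root.findall("oai:set", NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=NS)
        name = set_el.findtext("oai:setName", default="", namespaces=NS)
        fields = {}
        # The description is optional, so a set without one is still valid.
        dc = set_el.find("oai:setDescription/dc:dc", NS)
        if dc is not None:
            for child in dc:
                local = child.tag.split("}", 1)[-1]  # drop "{namespace}"
                # DC elements are repeatable, so collect values in a list.
                fields.setdefault(local, []).append((child.text or "").strip())
        result[spec] = {"name": name, "description": fields}
    return result


sets = describe_sets(SAMPLE)
print(sets["Oceanside"]["description"]["subject"])
```

Because the container is optional, repositories that do not supply a setDescription still yield an entry, just with an empty description.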
However, it is possible to view each repository and/or set as two collections - a collection of items and a collection of metadata records. It should be noted that, in the discussion above, the use of description and setDescription containers does not provide a mechanism for separately describing these two collections within the repository and/or set. In order to support separate descriptions, separate containers would be required. For example: itemsDescription, recordsDescription, setItemsDescription and setRecordsDescription.
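The sketch below shows one way a set entry with separate containers might be assembled. The container names setItemsDescription and setRecordsDescription come from the suggestion above; placing them in the ListSets namespace and using simple Dublin Core inside each are assumptions made for illustration, as are the function names.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/1.1/OAI_ListSets"
DC_NS = "http://purl.org/dc/elements/1.1/"


def dc_block(fields):
    """Build a <dc> element from a mapping of DC element name to value."""
    dc = ET.Element(f"{{{DC_NS}}}dc")
    for name, value in fields.items():
        ET.SubElement(dc, f"{{{DC_NS}}}{name}").text = value
    return dc


def build_set(spec, name, items_fields, records_fields):
    """Build a <set> with separate item and record descriptions."""
    set_el = ET.Element(f"{{{OAI_NS}}}set")
    ET.SubElement(set_el, f"{{{OAI_NS}}}setSpec").text = spec
    ET.SubElement(set_el, f"{{{OAI_NS}}}setName").text = name
    # Hypothetical containers: one per collection within the set.
    items = ET.SubElement(set_el, f"{{{OAI_NS}}}setItemsDescription")
    items.append(dc_block(items_fields))
    records = ET.SubElement(set_el, f"{{{OAI_NS}}}setRecordsDescription")
    records.append(dc_block(records_fields))
    return set_el


example = build_set(
    "Oceanside",
    "Oceanside University of Nebraska",
    items_fields={"subject": "Marine biology"},
    records_fields={
        "rights": "Metadata may be used without restrictions as long as "
                  "the oai identifier remains attached to it."
    },
)
print(ET.tostring(example, encoding="unicode"))
```

Splitting the description this way lets the rights statement for the metadata records sit apart from the subject coverage of the items, which a single setDescription container cannot express.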
Rather than providing collection descriptions in-line within the protocol, an alternative approach might be to provide some mechanism for linking to an external collection description about the repository and/or sets.
It is recommended that further work be carried out to propose answers to the following questions:
Appendix A: Usage of sets by currently registered OAI repositories
By Caroline Arms, Library of Congress
As of 2001-10-22
| OAI id | Repository name | # of sets | Apparent semantics |
|---|---|---|---|
| celebration | A Celebration of Women Writers | 0 | |
| aps | American Philosophical Society | 5 | discipline |
| arXiv | arXiv | 4 | discipline |
| bmc | BioMed Central | 60 | topic |
| CDLCIAS | California International and Area Studies Digital Repository | 11 | discipline (region of study) |
| caltechcstr | Caltech Computer Science Technical Reports | 8 | discipline (1), decade |
| caltecheerl | Caltech Earthquake Engineering Research Laboratory Technical Reports | 1 | |
| caltechETD | Caltech Electronic Theses and Dissertations | 1 | genre |
| cimi | CIMI Metadata Harvesting Working Group Demonstration Repository | >100 | source of records |
| citebase | Cite-Base services | 2 | source of records |
| cogprints | CogPrints | 50 | topic |
| cbold | Comparative Bantu Online Dictionary (CBOLD) | 1 | genre of content |
| CSTC | Computer Science Teaching Center | 1 | |
| CDLDERM | Dermatology Digital Repository | 34 | topic |
| DUETT | DUETT - Dissertations and other Documents of the Gerhard-Mercator-University Duisburg | 1 | genre (dissertations and theses) |
| eldorado | Elektronisches Dokumenten-, Archivierungs- und Retrievalsystem der Universitaet, Dortmund | 16 | discipline |
| elra | European Language Resources Association | 10 | genre |
| formations | Formations | 23 | discipline |
| cav2001 | Fourth International Symposium on Cavitation | 21 | topic (1), session at conference |
| hsss | Hochschulschriftenserver (HSSS) der SLUB Dresden | 29 | genre (3), discipline |
| HUBerlin | Humboldt University of Berlin, GERMANY, Document Server | 36 | genre (6), discipline |
| scout | Internet Scout Project OAI Repository | 0 | |
| lcoa1 | Library of Congress Open Archive Initiative Repository 1 | 4 | source of records, genre of content |
| ldc | Linguistic Data Consortium | | ERROR (403) |
| | LTRS | 50 | topic |
| | M.I.T. Theses | 2 | source of records |
| | NACA | 0 | |
| etdcat | OCLC Online Computer Library Center Theses and Dissertations Repository | 0 | |
| | Perseus Digital Library | 15 | source of records |
| physdoc | PhysNet, Oldenburg, Germany, Document Server | 1 | genre (PhD theses) |
| | Resource Discovery Network | 0 | |
| RIACS | RIACS - Research Institute for Advanced Computer Science - Eprint Archive | | ERROR (500) |
| sceti | Schoenberg Center for Electronic Text and Image | 0 | |
| MONARCH | Technical University of Chemnitz - MONARCH | 14 | genre of content (level of thesis, article, etc.) |
| EKUTuebingen | The Eberhard Karls University of Tuebingen | 0 | |
| lacito | The LACITO Archive | 3 | genre of content |
| dfki | The Natural Language Software Registry | 1 | |
| | The Oxford Text Archive | 1 | |
| dlpscoll | The University of Michigan. University Library. Digital Library Production Service. | 15 | source of records, genre of content |
| CDLTC | Tobacco Control Digital Repository | >100 | topic |
| | Tropicos | 1 | genre |
| thesis | Universidad de las Americas - Puebla: Digital Thesis | 1 | genre (theses) |
| uiLib | University of Illinois Library | 6 | source of records |
| tkn | University of Tennessee Libraries | 8 | source of records |
| VTETD | Virginia Tech Electronic Thesis and Dissertation Collection | 1 | genre (theses) |
| | AISRI (American Indian Studies Research Institute) | 1 | genre of content |
| anlc | Alaska Native Language Center | 0 | |
| idli | University of Illinois at Urbana-Champaign, Digital Library Initiative | 3 | source of records |
| | Chemistry Preprint Server | 10 | discipline |