Issues with current use of simple DC
From DigiRepWiki
Background
The ePrints UK project developed a set of guidelines for describing eprints using simple DC. The guidelines are available at http://www.rdn.ac.uk/projects/eprints-uk/docs/simpledc-guidelines/. The use of simple DC means that eprint metadata can be easily transferred using the OAI-PMH. However, there are some issues and problems for application developers caused by the limitations of simple DC.
This document analyses each of the recommendations in the guidelines and summarises the areas where the use of simple DC leaves a significant weakness in the metadata.
Analysis
Note: properties marked with an asterisk (*) are mandatory. All other properties are optional.
Current guidance | Issues/problems |
---|---|
dc:title (*)
The title of the eprint. Preserve the original wording, order and spelling of the eprint title. Only capitalize proper nouns. Punctuation need not reflect the usage of the original. Subtitles should be separated from the title by a colon. For example:
If necessary, repeat this element for multiple titles. |
|
dc:creator (*)
An author of the eprint. Personal names should be listed surname or family name first, followed by forename or given name or initial followed by a full stop. Separate the surname (or family name) from the forenames, given names or initials with a comma. Titles (Dr., Prof., etc.) should precede the forenames, generational suffixes (Jr., Sr., etc.) should follow the family name. When in doubt, give the name as it appears, and do not invert. For example:
In the case of organizations where there is clearly a hierarchy present, list the parts of the hierarchy from largest to smallest, separated by full stops. If it is not clear whether there is a hierarchy present, or unclear which is the larger or smaller portion of the body, give the name as it appears in the eprint. For example:
Only encode organisations in this element to indicate corporate authorship, not to indicate the affiliation of an individual. The inclusion of personal and corporate name headings from authority lists constructed according to AACR2 [8], e.g. the Library of Congress Name Authority File (LCNA), is also acceptable. In cases of lesser responsibility, other than authorship, use dc:contributor. If the nature of the responsibility is ambiguous, recommended best practice is to use dc:publisher for organizations, and dc:creator for individuals. If necessary, repeat this element for multiple authors. |
|
dc:subject (*)
The topic of the eprint. In general, choose the most significant and unique words for keywords, avoiding those too general to describe a particular eprint. If the subject of the eprint is a person or an organization, use the same form of the name as you would if the person or organization were an author, but do not repeat the name in the dc:creator element. For free-text keywords either encode multiple terms with a semi-colon separating each keyword; or repeat the element for each term. There are no requirements regarding the capitalization of keywords though internal (within archive) consistency is recommended. Where terms are taken from a standard classification scheme: encode each term in a separate element. Encode the complete subject descriptor according to the relevant scheme. Use the capitalisation and punctuation used in the original scheme. Where subject terms are taken from LCSH, the subfields of the subject heading should be separated by double dash (--) and spaces should be omitted. For example (using free-text keywords and LCSH):
|
|
dc:description (*)
A summary of the content of the eprint, typically in the form of an abstract. |
|
dc:publisher (*)
Eprint-specific Recommendation: The publisher of the eprint, typically either the author's institution or a commercial publisher. In the case of organizations where there is clearly a hierarchy present, list the parts of the hierarchy from largest to smallest, separated by full stops. If it is not clear whether there is a hierarchy present, or unclear which is the larger or smaller portion of the body, give the name as it appears in the eprint. For example:
Personal names should be listed surname or family name first, followed by forename or given name or initial followed by a full stop. Separate the surname (or family name) from the forenames, given names or initials with a comma. Titles (Dr., Prof., etc.) should precede the forenames, generational suffixes (Jr., Sr., etc.) should follow the family name. When in doubt, give the name as it appears, and do not invert. For example:
The inclusion of personal and corporate name headings from authority lists constructed according to AACR2 [8], e.g. the Library of Congress Name Authority File (LCNA), is also acceptable. |
|
dc:contributor
A contributor to the eprint (but not one of the primary authors). For example, a supervisor, editor, technician or data collector. Personal names should be listed surname or family name first, followed by forename or given name or initial followed by a full stop. Separate the surname (or family name) from the forenames, given names or initials with a comma. Titles (Dr., Prof., etc.) should precede the forenames, generational suffixes (Jr., Sr., etc.) should follow the family name. When in doubt, give the name as it appears, and do not invert. For example:
In the case of organizations where there is clearly a hierarchy present, list the parts of the hierarchy from largest to smallest, separated by full stops. If it is not clear whether there is a hierarchy present, or unclear which is the larger or smaller portion of the body, give the name as it appears in the eprint. For example:
Only encode organisations in this element to indicate a corporate contribution, not to indicate the affiliation of an individual. The inclusion of personal and corporate name headings from authority lists constructed according to AACR2 [8], e.g. the Library of Congress Name Authority File (LCNA), is also acceptable. |
|
dc:date (*)
The 'last-modified' date of the eprint and/or the date of its accession into the archive. The date should be formatted according to the W3C encoding rules for dates and times [9] (a profile based on ISO 8601 known as W3C-DTF), for example:
If necessary, repeat this element to provide both the last-modified date and the date of accession. The last-modified date will be assumed to be the more recent of the two dates. If only one date is provided, it will be assumed that the last-modified date and the date of accession are the same. |
|
dc:type (*)
The type of eprint. Recommended best practice is to take the value of this element from the following list:
For example:
If necessary, repeat this element to encode multiple types. If necessary, repeat this element to indicate the peer-reviewed status of the eprint, using one of the following values:
For example:
|
|
dc:format
The media-type of the eprint. Recommended best practice is to select a term from the IANA registered list of Internet Media Types (MIME types) [10]. For example:
Repeat this element if the eprint is available in multiple formats. |
|
dc:identifier (*)
Eprint-specific Recommendation: A URI or bibliographic citation for the eprint, typically the URI of the 'jump-off page' for the eprint, as served by the archive. For example:
If possible, repeat this element to provide a full bibliographic citation for the eprint. For example:
If possible, also repeat this element to provide an OpenURL [11] for the eprint, using the form below. For example:
(Note that lines in these two examples have been wrapped for readability.) |
|
dc:source
The URI, title or bibliographic citation for a resource from which the eprint is derived. In general, this element should not be used. |
|
dc:language (*)
Eprint-specific Recommendation: The language in which the eprint is written. Use the language codes defined in RFC 3066 [12], for example:
If necessary, repeat this element to indicate multiple languages. |
|
dc:relation (*)
The URI of each available format of the eprint. If necessary, repeat this element for multiple formats. Also repeat this element if the eprint is available from other locations, for example from the publisher's Web site. For example:
|
|
dc:coverage
The geographic location or temporal period that the eprint is about. Recommended best practice is to select the value from a controlled vocabulary (for example, the Getty Thesaurus of Geographic Names [13] or TGN) and that, where appropriate, named places or time periods be used in preference to numeric identifiers such as sets of co-ordinates or date ranges. If necessary, repeat this element to encode multiple locations or periods. |
|
dc:rights
A human-readable statement about the rights held in and over the eprint, the URI of a Creative Commons [14] licence or the URI of a machine-readable statement. For example:
|
|
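The dc:date guidance in the table above (repeat the element for the last-modified and accession dates, and treat the more recent value as last-modified) can be sketched as follows. This is a minimal illustration, not part of the guidelines: the sample record, its values and the function names are hypothetical, and only the date-only forms of W3C-DTF (YYYY, YYYY-MM, YYYY-MM-DD) are handled. Padding a missing month or day with 1 is an assumption made purely so that values of different granularity can be compared.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical simple DC record; the values are invented for illustration.
RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint: with a subtitle</dc:title>
  <dc:date>2003-05-12</dc:date>
  <dc:date>2004-01</dc:date>
</oai_dc:dc>"""

# Date-only W3C-DTF granularities: YYYY, YYYY-MM, YYYY-MM-DD.
W3CDTF_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_w3cdtf_date(value):
    """Return (year, month, day), or None if the value is not a W3C-DTF date.

    Missing month/day are padded with 1 (an assumption, for comparison only).
    Month and day ranges are not validated in this sketch.
    """
    match = W3CDTF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day = match.groups()
    return (int(year), int(month or 1), int(day or 1))


def interpret_dates(record_xml):
    """Apply the guideline rule: the more recent dc:date is last-modified."""
    root = ET.fromstring(record_xml)
    parsed = []
    for element in root.findall(f"{{{DC_NS}}}date"):
        date = parse_w3cdtf_date(element.text or "")
        if date is not None:
            parsed.append(date)
    if not parsed:
        return None
    # With a single date, both roles resolve to the same value.
    return {"last_modified": max(parsed), "accession": min(parsed)}


print(interpret_dates(RECORD))
```

Note that the record itself never says which date is which; the consumer has to infer the roles from the ordering, and any value that does not follow W3C-DTF is silently skipped here.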
Summary
A summary of the major issues follows:
- It is difficult to differentiate ‘works/expressions’ from ‘manifestations/items’.
  - It is therefore difficult to use the metadata as a basis for bringing together information about different manifestations of the same work/expression, e.g. for citation analysis purposes.
- In particular, it is difficult to determine whether dc:identifier is being used to identify the work/expression or a particular manifestation/item of the work. In the ePrints UK guidelines for using simple DC to describe eprints, dc:identifier identifies the ‘work/expression’ and dc:relation identifies the ‘manifestation/item’. However, dc:relation may also be used for other resources (e.g. cited works), so the metadata record is ambiguous. In any case, the guidelines are not widely implemented.
  - It is therefore difficult for software applications to move reliably from the metadata record to the full text.
- It is not possible to determine whether subject terms are taken from a controlled vocabulary or not (e.g. is ‘Physics’ a free-text keyword or a term taken from Dewey?).
  - It is therefore difficult to base subject-browse interfaces on a controlled vocabulary hierarchy.
- It is not possible to disambiguate authors with the same name, or to reconcile instances of the same author being given different forms of name.
  - It is therefore difficult to build browse-by-author interfaces.
- Dates are ambiguous, because of their formatting, because the type of date is not known, or both.
  - It is therefore difficult for software applications to make decisions based on dates in the metadata.
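
The dc:relation and dc:subject problems summarised above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the record, its example.org URIs, the function names and the file-suffix heuristic are hypothetical, and the heuristic is the kind of guesswork a harvester is forced into rather than a recommended technique.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record: one jump-off page and three undifferentiated relations.
RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://eprints.example.org/123/</dc:identifier>
  <dc:relation>http://eprints.example.org/123/01/paper.pdf</dc:relation>
  <dc:relation>http://publisher.example.com/article/456</dc:relation>
  <dc:relation>http://other.example.net/cited-work</dc:relation>
  <dc:subject>Physics</dc:subject>
  <dc:subject>Quantum theory--Mathematics</dc:subject>
</oai_dc:dc>"""

# Assumed list of suffixes; simple DC gives no better signal than the URI text.
FULL_TEXT_SUFFIXES = (".pdf", ".ps", ".doc")


def values(record_xml, element_name):
    """Return the stripped text of every dc:<element_name> in the record."""
    root = ET.fromstring(record_xml)
    return [el.text.strip() for el in root.iter(DC + element_name) if el.text]


def candidate_full_text_uris(record_xml):
    """Guess which dc:relation values are full-text files, by suffix only."""
    return [
        uri for uri in values(record_xml, "relation")
        if uri.lower().endswith(FULL_TEXT_SUFFIXES)
    ]


def looks_like_lcsh(term):
    """Guess LCSH from the double-dash subfield separator in the guidelines."""
    return "--" in term


print(candidate_full_text_uris(RECORD))
print({term: looks_like_lcsh(term) for term in values(RECORD, "subject")})
```

The suffix test finds the PDF, but it cannot tell the publisher's copy from the cited work, and it would miss a full text served from a URI without a file extension. Likewise the double-dash test says nothing about ‘Physics’, which could equally be a free-text keyword or a term from a classification scheme.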