Metadata: an overview of current resource description practice
Work Package 3 of Telematics for Research project DESIRE (no. 1004)

ICPSR SGML Codebook Initiative

Environment of Use

Constituency of use

The Inter-university Consortium for Political and Social Research (ICPSR) established a committee in May 1995 to develop a structured standard to describe social science data sets. The committee was a response to a perceived need amongst the social science archive community for an international codebook standard (a codebook generally contains information on the structure, contents, and layout of a datafile or data set).

Documentation

Information documenting the proposed DTD (Document Type Definition) and the content of the codebook standard can be found at <URL: http://www.lib.umich.edu/codebook.html>.

Ease of creation

The standard is still being formulated. The committee will meet in October 1996 to agree on a final draft, with the intention that implementations will begin before the end of the year.

Progress toward international standardisation

The ICPSR is an international organisation with membership from 325 colleges and universities in North America and several hundred institutional members in Australia, Denmark, France, Germany, Great Britain, Hungary, Israel, the Netherlands, Norway, South Africa and Sweden. The codebook committee was established to be representative of all the archives and includes a representative from CESSDA (Council of European Social Science Data Archives), as well as representatives from Canada, Denmark, Norway and Germany. The elements for the codebook were chosen by reviewing a series of guidelines and standards in use by the social science survey, research, archive, and technical communities. The lists below include some of the materials that were examined:

Guidelines that prescribe what the codebook itself should contain (content standards):

• Roistacher: 1980, A Style Manual for Machine-Readable Data Files

• Geda: 1980, Data Preparation Manual (ICPSR)

• Collins, Patrick and Jane Powers, 1991, The preparation of data standards for machine-readable data.

• National Data Archive on Child Abuse and Neglect (Cornell University)

• US Bureau of the Census, Statistical Research Division, Statistical Design and Methods Extension to Cultural and Demographic Data Metadata: CDDM draft standard 1995.

• Federal Geographic Data Committee content standards for digital geospatial metadata

Standards that define how to describe the study:

• Standard Study Description: developed by and for data archives, Council of European Social Science Data Archives.

• ICPSR Study Description "Template" Manual

• Essex Study Description outline (based on the Standard Study Description)

Standards that establish rules for producing records for cataloguing:

• MARC

• ISBD-CF: The International Standard Bibliographic Description for Computer Files

• GILS: Government Information Locator Service

• ISO: International Organization for Standardization: ISO 690-2

• Dublin Core: OCLC/NCSA Metadata Workshop recommendations

Descriptions of codebook elements produced as a by-product of computerised interviewing software:

• Health and Welfare Canada

• Computer Assisted Survey Methods, University of CA, Berkeley

Standards that establish rules for tagging the contents of the codebook text:

• OSIRIS

• TEI: Text Encoding Initiative DTD for SGML

• EAD: Encoded Archival Description DTD for SGML

Other comments

The standard is still in the development phase, but the indications are that the initiative has wide support amongst the social science data archives. The ICPSR also hopes that data producers and granting agencies will adopt the standard.

Format Issues

Content

There are 5 main sections in the proposed structure:

• Codebook header

• Study description

• Data files description

• Record and variable description

• Other study-related materials

Each of the 5 main sections contains further sub-sections and elements.
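As a rough illustration of this outline, the five sections can be held in a simple container. The sketch below is not derived from the proposed DTD: the class name `Section`, its attributes and the helper `empty_codebook` are assumptions made for the example, and only the section names come from the list above.

```python
from dataclasses import dataclass, field

# Section names come from the proposed structure; everything else
# (class, attribute and function names) is hypothetical.
SECTIONS = (
    "Codebook header",
    "Study description",
    "Data files description",
    "Record and variable description",
    "Other study-related materials",
)


@dataclass
class Section:
    number: int
    name: str
    # Maps a sub-section name to the element names it holds.
    subsections: dict[str, list[str]] = field(default_factory=dict)


def empty_codebook() -> list[Section]:
    """Return the five top-level sections, numbered from 1, with no content."""
    return [Section(i, name) for i, name in enumerate(SECTIONS, start=1)]


codebook = empty_codebook()
# Section 2 (index 1) carries the Citation sub-section.
codebook[1].subsections["Citation"] = ["title", "subtitle", "parallel title"]

for section in codebook:
    print(section.number, section.name, sorted(section.subsections))
```

A flat list of numbered sections is used here only because the text refers to sections by number; the real standard may well nest elements more deeply.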

Basic descriptive elements

The basic bibliographic elements of the data set are described in section 2 Study description under the sub-section Citation:

• Title statement of data set

    • title

    • subtitle

    • parallel title

    • common abbreviation

    • study number - producer

    • study number - archive
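To make the idea of tagged citation elements more concrete, the following sketch serialises a title statement as markup. It is only an illustration: the tag names (`titleStatement`, `studyNumberArchive` and so on) and the sample values are invented for the example and are not taken from the draft DTD, and XML-style tags produced by Python's standard library stand in for SGML.

```python
import xml.etree.ElementTree as ET


def title_statement(fields: dict[str, str]) -> str:
    """Wrap each field in its own tag inside a hypothetical <titleStatement>."""
    root = ET.Element("titleStatement")  # hypothetical element name
    for tag, text in fields.items():
        child = ET.SubElement(root, tag)
        child.text = text
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


# Made-up sample values, for illustration only.
markup = title_statement({
    "title": "Example Household Survey",
    "subtitle": "Wave 1",
    "studyNumberArchive": "0000",
})
print(markup)
```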

Subject description

The description of subject is dealt with in section 2 - Study description under the sub-section Study scope:

• Subject information

    • keywords

    • topic classification

URIs

None

Resource format and technical characteristics

The format of the data set is dealt with in section 3 Data files description:

• Type of file - text, numerical, graphic, program source, etc.

Host administrative details

These are provided for in section 2 - Study description under the sub-section Citation:

• Distributor statement for data set

    • documentation distributor

    • contact persons

    • depositor

    • date of deposit

    • date of distribution

Administrative metadata

All administrative information is provided in section 1 - Codebook header. Sub-sections here include:

• Title statement for documentation

• Responsibility statement for documentation

• Production statement for documentation

• Distributor statement for documentation

• Series statement for documentation

• Version statement for documentation

• Bibliographic citation of documentation

Provenance

The source of the data set is provided in section 2 Study description under sub-section Citation, elements include:

• Production statement for data set

    • producer

    • date of production

    • place of production

Terms of availability/copyright

This information is provided in Section 2 - Study description under sub-section Data access:

• Data set availability

    • original archive where study stored

    • collection note

    • extent of collection

    • completeness of study stored

    • number of files

• Data use statement

    • restrictions

    • access authority

    • citation requirement

    • disclaimer

    • analysis conditions

    • other reanalysis conditions note

Encoding

An SGML DTD has been proposed. Codebooks encoded into SGML could also be used for the production of data definition statements for use by statistical analysis software such as SAS or SPSS. There is also a proposal to produce a TEI compliant base tag set.
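The step from variable descriptions to data definition statements might look roughly like the sketch below, which emits SPSS `DATA LIST` and `VARIABLE LABELS` commands for a fixed-column file. The `Variable` class, the function name, the file name and the two sample variables are all assumptions made for illustration; the proposal does not describe how such statements would actually be generated, and parsing the SGML itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    name: str
    start: int  # first column, 1-based
    end: int    # last column, inclusive
    label: str


def spss_data_list(filename: str, variables: list[Variable]) -> str:
    """Build SPSS syntax for reading a fixed-column file and labelling variables."""
    def columns(v: Variable) -> str:
        # A single-column variable is written as "3" rather than "3-3".
        return f"{v.start}" if v.start == v.end else f"{v.start}-{v.end}"

    cols = "\n".join(f"  {v.name} {columns(v)}" for v in variables)
    # Single quotes inside a label are doubled, as SPSS string syntax requires.
    labels = "\n  /".join(
        f"{v.name} '{v.label.replace(chr(39), chr(39) * 2)}'" for v in variables
    )
    return (
        f"DATA LIST FILE='{filename}' FIXED /\n{cols}.\n"
        f"VARIABLE LABELS\n  {labels}."
    )


# Hypothetical variables, as they might be read from a codebook's
# record and variable description.
syntax = spss_data_list(
    "survey.dat",
    [
        Variable("age", 1, 2, "Respondent's age"),
        Variable("sex", 3, 3, "Sex of respondent"),
    ],
)
print(syntax)
```

An equivalent generator for SAS would follow the same pattern with `INPUT` and `LABEL` statements; only the output template changes, not the codebook content.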

Multilingual issues

Details of language can be found in Section 2 Study description:

• Documentation statement

    • Language(s) of written materials

Ability to represent relationships between objects

There are fields for citing bibliographic information about and/or links to related materials and studies.

Fullness

Full - provides a very rich and comprehensive description of data sets.

Protocol Issues

No protocols have been specified for this format as yet, but the committee is looking at the possibility of using Z39.50.

Implementations

This is a proposed standard. The developers have applied the DTD to some sample codebooks, but these are not yet in use.
