1.1. Background
This report investigates the use of classification
schemes to aid retrieval in a network environment, specifically
with regard to the Internet. The library community, over many
years, had appeared to favour subject indexing systems (the use
of a controlled vocabulary to assign indexing terms to documents)
over the use of traditional classification schemes (grouping documents
into a hierarchical structure of subject categories). During the
first period of the development of networked information services,
many specialists, especially those from the computing community,
also questioned the value of library subject description systems
in principle, pointing to the accomplishments of full-text indexing
software.
The increasing use of the Internet and the World
Wide Web (WWW) for the storage and retrieval of vast amounts of
information has, however, changed this perception. Two distinct
ways of finding resources on the Internet emerged (Dodd 1996,
p. 276). One approach consisted of the development of robot-based
search engines which could be used for powerful keyword searches
of the contents of the WWW. These are extremely useful tools,
although they have a tendency to return large amounts of irrelevant
information. The other approach started with producing 'hotlists'
which would encourage users to browse the WWW. The production
of hierarchical browsing tools sometimes led to the adoption of
library classification schemes to provide the subject hierarchy.
At least one general discovery service, Yahoo! <URL:http://www.yahoo.com/>,
devised their own 'home-grown' classification scheme (or ontology)
to give structured hierarchical access to the resources which
they had indexed. Quality-controlled subject services, which gave
access only to selected Internet resources, also understood that
a browsing structure based on subject classification would be
a desirable complement to a search-engine-type service. Most subject
services of this type, and almost all of the Electronic Libraries
(eLib) Programme access to network resources services and the
proposed DESIRE test-bed services currently use a classification
scheme which can be browsed. A list of Internet sites that use
library classification systems or subject headings can be found
in Beyond bookmarks (McKiernan 1996) <URL:http://www.iastate.edu/~CYBERSTACKS/CTW.htm>.
This report describes the advantages of resource
classification for subject-based information gateways on the Internet,
analyses the advantages and disadvantages of different
types of classification systems, and then reviews some important
individual schemes.
1.2. Advantages and disadvantages of classification
The use of classification schemes offers one solution
to providing improved access to WWW resources. Web sites have
been created to act as a guide to other Web sites selected according
to some pre-specified criteria, e.g. they are judged to be good
quality resources or relevant to a particular subject-area. Some
of these sites typically consist of an alphabetical list of subjects,
and selected Web resources are listed below each one.
Examples include Argus Clearinghouse <URL:http://www.clearinghouse.net/>
and the WWW Virtual Library <URL:http://www.w3.org/pub/DataSources/bySubject/Overview2.html>.
In this context, it can be understood why classification schemes
have begun to be used to give added-value subject access to Web
sites. A site that organises knowledge with a classification scheme
demonstrates several advantages over sites which do not (cf. Svenonius
1983):
- Browsing: classified subject lists can easily
be browsed in an online environment. Browsing is particularly
helpful for inexperienced users or for users not familiar with
a subject and its structure and terminology. In addition, the
structure of the classification scheme can be displayed in different
ways as a navigation aid. The classification notation does not
even need to be displayed on the screen so an inexperienced user
can have the advantage of using a hierarchical scheme without
the distraction of the notation itself.
- Broadening and narrowing searches: classification
schemes are hierarchical and therefore can be used to broaden
(i.e. for improved recall) or narrow a search when required. Queries
can be limited to individual parts of a collection (filtering)
and the number of false hits reduced (i.e. for improved precision).
- Context: the use of a classification scheme gives
context to the search terms used. For example, the problem of
homonyms (words which have the same form and spelling but a different
meaning) can be partly overcome.
- Potential to permit multilingual access to a
collection: since classification systems often use notations independent
from a specific language, indices in different languages can offer
multilingual access to the same resources without any further
changes to the collection. A searcher could enter search terms
in a given language and those terms would then relate to the relevant
parts of the classification system (as a switching language) and
be used to retrieve resources in any given language on the subject.
- The partitioning and manipulation of a database:
large classified lists can be divided logically into smaller parts
if required.
- The use of an agreed classification scheme could
enable improved browsing and subject searching across databases.
- An established classification system is not usually
in danger of obsolescence. The larger schemes now undergo
continuous revision, although they are normally also formally
published in numbered editions. Some classifications may have
to be changed when a new edition of a scheme is published, but
it is unlikely that every single resource will have to be re-classified.
- They have the potential to be well-known: regular
users of libraries will be familiar with at least part of one
or more of the traditional library schemes. Members of a subject
community are likely to be familiar with their (subject-specific)
schemes as well. An Internet service which uses them will
therefore have an advantage over one that uses its own classification
or none.
- Many classification schemes are available in
machine-readable form.
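The broadening and narrowing of searches described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a decimal-style notation in which a longer notation denotes a narrower class. The records, the notations and the function names `search` and `broaden` are hypothetical and are not taken from any published schedule or service.

```python
# Hypothetical records as (notation, title) pairs. The notations imitate a
# decimal scheme in which a longer notation denotes a narrower class; they
# are illustrative and not taken from any published schedule.
RECORDS = [
    ("5", "General science gateway"),
    ("51", "Mathematics resources"),
    ("512", "Algebra tutorials"),
    ("516", "Geometry archive"),
    ("53", "Physics preprints"),
]


def search(records, notation):
    """Return titles classed at `notation` or anywhere below it."""
    return [title for code, title in records if code.startswith(notation)]


def broaden(notation):
    """Move one level up the hierarchy by dropping the last digit."""
    return notation[:-1] if len(notation) > 1 else notation


narrow_hits = search(RECORDS, "512")          # narrow search: fewer false hits
broad_hits = search(RECORDS, broaden("512"))  # broadened search: better recall
```

Because the hierarchy is carried by the notation itself, the notation never has to be shown to the user: a browsing interface could display only the class captions and still use the prefix relationship behind the scenes.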
Classification schemes, however, are sometimes
subject to criticism:
- The division of logical collections of material:
classification schemes often split up collections of related material.
This can be partly overcome with good cross-references.
- The illogical subdivision of classes: some popular
schemes do not always subdivide classes in a logical manner (Buchanan
1979, pp. 32-34; Rowley 1987, pp. 188-189). This can make them
difficult to use for browsing purposes.
- Assimilating new areas of interest: classification
schemes, since they are usually updated through formal processes
by organised bodies, often have difficulty in reacting to new
areas of study.
There are several different types of classification
systems, varying in scope, methodology and other characteristics.
Detailed descriptions cannot be given here, but it is useful
to know these different types in order to understand the terminology
of this report and when deciding which scheme to use.
Classification systems - by facet:
- by subject coverage: general or subject specific
- by language: multilingual or individual language
- by geography: global or national
- by creating/supporting body: representative of
a long-term committed body or a home-grown system developed by
a couple of individuals
- by user environment: libraries with container
publications or documentation services carrying small focused
documents (e.g. abstract and index databases)
- by structure: enumerative or faceted
- by methodology: a priori construction according
to a general structure of knowledge and scientific disciplines
or using existing classified documents
(The categories are not mutually exclusive: a classification
can fit into more than one category.)
The facet structure above shows what types of classification
scheme are theoretically possible. In reality, the most frequently
used types of classification schemes are: a) universal; b) national
general; c) subject specific schemes, most often international;
d) home-grown systems; e) local adaptations of all types.
The term 'universal' is used for schemes
which aim to include all subjects and are geographically global and
multilingual in scope. Part 2 of the report deals with some of
the best-known individual schemes as examples.
The first practical universal classification schemes
were developed in the late nineteenth century as a response to
the problem of organising libraries in the context of rapidly
growing knowledge and an increase in the number of printed books.
Universal schemes aim both to be comprehensive and to expand
and contract to fit the state of knowledge at any time.
The most widely used universal classification schemes
are those which have been developed for the use of libraries since
the late nineteenth century, notably the Dewey Decimal Classification
(DDC), the Universal Decimal Classification (UDC) and the classification
scheme devised by the Library of Congress (LCC).
Use of a universal, multidisciplinary classification
scheme in an Internet context results in the following advantages
(in addition to the general advantages of using a classification
scheme, see 1.2 above):
- They can cover all subject areas: The use of
an agreed universal classification scheme as a global top-level
structure could enable improved browsing and subject searching
across services and collections from different subject areas.
In theory, the use of an agreed universal scheme at many sites
would allow for the widest interoperability. But it should be
remembered that this is normally not the most important criterion
when choosing a scheme for a certain service (cf. 4. Conclusions).
- They are widely supported: For the universal
schemes, there is a global interest in support, development and
survival of the scheme. DDC, UDC and LCC have been repeatedly
revised since their first publication and are updated by responsible
international bodies.
- They might be known to more users than other
types of classifications: regular users of libraries will be familiar
with at least part of one or more of these schemes. An
Internet service which uses them will therefore have an advantage
over one that uses its own classification or none.
- They have an especially good potential to permit
multilingual access to a collection: DDC was first published in
English and UDC in French, but both have been widely translated.
Full editions of UDC have been made available in English, German,
Russian and Spanish, and abridged versions are available in other
languages (Langridge 1973, p. 89; McIlwaine and Buxton 1995, pp.
7-8). DDC has been translated into 30 languages and is currently
used in 135 countries (Thompson, Shafer & Vizine-Goetz 1997).
This means that the tools already exist for multilingual access
to Internet sites organised with these schemes.
- The major universal classification schemes are
now all available in machine-readable form (see parts 2.1 - 2.3).
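The use of a classification notation as a switching language for multilingual access can be illustrated with a short Python sketch. It rests on stated assumptions: the per-language indexes, the notations, the collection and the function name `lookup` are all hypothetical, and a real service would draw its indexes from the translated editions of the scheme rather than from a hand-written table.

```python
# Hypothetical per-language indexes: each maps a search term to a notation.
# The notation acts as the language-independent switching language.
INDEXES = {
    "en": {"algebra": "512", "geometry": "516"},
    "de": {"algebra": "512", "geometrie": "516"},
    "nl": {"algebra": "512", "meetkunde": "516"},
}

# Hypothetical collection: notation -> list of (language, title) pairs.
COLLECTION = {
    "512": [("en", "Algebra tutorials")],
    "516": [("en", "Geometry archive"), ("de", "Grundlagen der Geometrie")],
}


def lookup(term, language):
    """Translate a term into a notation, then fetch resources in any language."""
    notation = INDEXES.get(language, {}).get(term.lower())
    if notation is None:
        return []
    return COLLECTION.get(notation, [])


# A Dutch search term retrieves English and German resources on the subject.
dutch_hits = lookup("meetkunde", "nl")
```

Adding a further language in this sketch only requires a further index; the classified collection itself is left unchanged, which is the property the text describes.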
Universal classification schemes, however, are subject
to several criticisms:
- False ontology: there is a general concern that
universal schemes impose a false order upon knowledge. For example
it was believed in the early 1970s that DDC still reflected its
origins in a small North American university library (Foskett
1973, p. 39). The structure of enumerative schemes (most universal
schemes are basically enumerative) is often perceived as subjective,
and critics find many examples of inconsistency and illogicality.
For this reason, library classification theory began to move
away from enumerative schemes in the mid twentieth century. Examples
of the alternative 'faceted' or 'analytico-synthetic' classification
schemes are Ranganathan's Colon classification (Ranganathan 1965)
and Bliss's Bibliographic classification, both of which are hardly
ever used; later editions of DDC and UDC are, however, faceted
to a limited extent.
- Bad at assimilating new areas of interest: universal
classification schemes often have a special difficulty in reacting
quickly to new areas of study because they are updated with the
time-consuming participation of broad international multidisciplinary
bodies. Researchers on the University of Illinois Digital Library
Initiative project comment that most digital repositories contain
"concepts and vocabularies too new or dynamic for controlled-vocabulary-based
human indexing" (Schatz et al. 1996, p. 33). Similarly,
all classification systems are poor at handling new concepts and
vocabularies, but universal classification schemes tend to have
more disadvantages in this area when compared with subject-specific
schemes.
Most of the advantages and disadvantages of universal
classification schemes apply also to national general schemes
(cf. 2.4. National general schemes), but they have additional
characteristics that make them perhaps not the best choice for
an Internet service that claims to be relevant for a wider user
group than one limited to certain national boundaries.
Some of those characteristics are discussed here,
relating to use of the scheme in the Internet environment:
- Although national general schemes offer coverage
of all subject areas, they are in general not well known outside
of their place of origin. For an international audience, a universal
scheme would probably serve better.
- Support for a national scheme will be broad in
the country itself, and a national institution has the responsibility
for development. Support for the scheme outside of this national
user group is limited (e.g. use of the Nederlandse Basisclassificatie
(BC) by German libraries which use the Pica system).
- Within the country the national scheme may be
better known than universal schemes, e.g. the BC is used by Pica
libraries in the Netherlands (mostly academic libraries), and
SAB is used by almost all the public libraries in Sweden.
When the Koninklijke Bibliotheek chose
to use the Nederlandse Basisclassificatie for an Internet subject
service (the Nederlandse Basisclassificatie Web, NBW), this
was done mainly because the subject specialists already used the
scheme for classification of printed works. If NBW outgrows
its national boundaries, for instance in the DESIRE context or
through the participation of non-Dutch institutions, conversion
to another scheme will deserve serious consideration in order to make
wider interoperability possible.
- Multilingual capability is not a primary concern
for national schemes, apart from countries with multiple languages.
- National schemes are likely to have a geographic
bias, e.g. the classification of languages in the BC is not only
Eurocentric, but biased towards the Dutch context: Frisian - as
a language spoken by a minority in Holland - has a separate class,
while Asiatic languages have only three: Japanese, Chinese and
'other' Asiatic languages. This bias could be a serious drawback
in an international context.
Most subject-specific schemes have been devised
with a particular user-group in mind. Typically they have been
developed for use with indexing and abstracting services, special
collections or important journals and bibliographies in a scientific
discipline. They do have the potential to provide a structure
and terminology much closer to the discipline and can be more
up-to-date, compared to universal schemes.
Examples of specific schemes are Engineering Information
(Ei) for engineering, the National Library of Medicine (NLM) Classification
for medicine and the British Catalogue of Music Classification.
In subject areas like medicine, agricultural science and engineering,
where there are international and widely recognised schemes available,
subject services normally will prefer these or use them in combination
with a universal scheme.
Subject specific schemes do have some drawbacks:
- They make co-operation between subject services
from different subject areas more difficult. Elaborate conversion
programs will be needed in order to exchange resources or to point
to them in another service.
- If they have a very small user-base it can be
very difficult for the numerous users from other subject areas
to learn the structure of the scheme.
- Collections of subject-specific resources are
likely to include some fringe topics which will not be adequately
covered within the specialist scheme itself (Langridge 1991, p.
16).
It is therefore advisable that only well-established
subject specific classification schemes should be used to describe
Internet resources.
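The conversion programs mentioned among the drawbacks above would, at their simplest, rest on a crosswalk table between two schemes. The following Python sketch shows the idea under stated assumptions: the source codes, the target notations, the records and the function name `convert` are invented for illustration and do not correspond to any real scheme or service.

```python
# Hypothetical crosswalk from a subject-specific scheme to a universal one.
# Both sets of codes are invented for illustration only.
CROSSWALK = {
    "ENG.10": "62",     # engineering in general
    "ENG.10.2": "621",  # a narrower engineering class
}

# Hypothetical records as (source code, title) pairs.
RECORDS = [
    ("ENG.10", "Engineering gateway"),
    ("ENG.10.2", "Mechanical engineering index"),
    ("ENG.99", "History of technology"),  # fringe topic with no mapping
]


def convert(records, crosswalk):
    """Split records into those that can be re-classed and those that cannot."""
    converted, unmapped = [], []
    for code, title in records:
        target = crosswalk.get(code)
        if target is None:
            unmapped.append((code, title))
        else:
            converted.append((target, title))
    return converted, unmapped


converted, unmapped = convert(RECORDS, CROSSWALK)
```

Records left in `unmapped` would need manual attention, which is one reason why exchange between services using different schemes is more costly than exchange between services sharing one scheme.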
Some Web sites have tried to organise knowledge on
the Internet by devising their own classification scheme. Yahoo!,
created in 1994, lists Web sites using their own universal classification
scheme or 'ontology', which contains 14 main categories. Each
Web site collected for Yahoo! is listed under one of 20,000
categories or sub-categories (Steinberg 1996), the scheme being
developed over time by the 20 people doing the classification
work.
A study by Vizine-Goetz (1996a) showed that out of
Yahoo!'s 50 most popular categories, all but four mapped
perfectly to explicit DDC or LCC numbers or ranges. The results
"... indicate that DDC and LCC have sufficiently wide topic
coverage for classifying Internet resources". The structure
of Yahoo! would, however, require encoding to take advantage of the
relationships between classes, which are handled by notations in
traditional library schemes; such encoding is an important prerequisite
for automatic routines and improved navigation.
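What such an encoding might involve can be sketched in Python. This is an assumption-laden illustration, not a description of how Yahoo! or any library scheme is actually encoded: the category paths are invented, and the dotted notation and the function names `assign_notations` and `is_narrower` are hypothetical.

```python
def assign_notations(paths):
    """Give each category path a dotted notation that extends its parent's."""
    notations = {}     # path tuple -> notation
    child_counts = {}  # parent tuple -> number of children numbered so far
    for path in sorted(paths):
        parts = tuple(path.split("/"))
        for depth in range(1, len(parts) + 1):
            node = parts[:depth]
            if node in notations:
                continue
            parent = node[:-1]
            child_counts[parent] = child_counts.get(parent, 0) + 1
            prefix = notations[parent] + "." if parent else ""
            notations[node] = prefix + str(child_counts[parent])
    return {"/".join(node): code for node, code in notations.items()}


def is_narrower(code, other):
    """True if `code` denotes a class somewhere below `other`."""
    return code.startswith(other + ".")


# Invented category paths, not taken from any real directory.
codes = assign_notations(["Science/Biology", "Science/Physics", "Arts/Music"])
```

Once each category carries a notation, the relationship between two classes can be tested mechanically, for example `is_narrower(codes["Science/Physics"], codes["Science"])`, without consulting the category names at all.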
Home-grown schemes do have some theoretical advantages
over library universal classification schemes:
- Home-grown schemes are relatively flexible and
easy to change. For example, in 1995 Yahoo! was adding
categories and making other changes to the ontology every day
(Steinberg 1996).
- Home-grown schemes can very quickly absorb new
areas of interest. Universal and enumerative schemes cannot just
add new classification numbers when they are required; attention
has to be given to keeping the numeric arrangement logical and
easy to understand. This process can be very drawn-out.
On the other hand, home-grown schemes have a number
of disadvantages:
- They amplify the problems of classification subjectivity
and can lead to a lack of consistency. Steinberg (1996) notes
that Yahoo!'s more or less consistent point of view "comes
from having the same 20 people classifying every site, and by
having those people crammed together in the same building where
they are constantly engaged in a discussion of what belongs where".
Other people using the same scheme or ontology might come to very
different solutions.
- They are unlikely to be as well-known to users
as universal classification schemes.
- If the scheme is self-devised, it might need
frequent revision with little chance of co-operation. The economic
cost of this will fall entirely on the originator of the scheme.
Page maintained by: UKOLN Metadata Group
Last updated: 14-May-1997