Collection Level Description

A review of existing practice

...an eLib supporting study


2. What is a Collection?

2.4 An Internet / Web perspective

What is a 'collection' on the Web? The World Wide Web Consortium (W3C) currently define a 'Web Collection' as:

A portion or section of a Web site, consisting of two or more Web pages, that represents a non-trivial, self-contained resource, but is still maintained by the same publisher of the overall Web site.

Examples: Web journal, electronic monograph, photo gallery ...

[W3CTERM]

This definition is somewhat restrictive. In the general case, a collection of Web resources may encompass material from more than one publisher. We can define two broad classes of Web collections:

  1. collections of items that are accessible using the Web
  2. Web-accessible collections of information (metadata) about collections of items that are accessible using the Web

The first includes complete Web sites (for example, a university or corporate Web site), selected parts of such sites (for example, a departmental Web site) and smaller collections of items (such as all the Web pages that make up a document). Collections of this kind typically have a long lifetime. Others may not: consider, for example, the collection of stories that make up a particular news bulletin made available across the Web.

The second includes manually created catalogues of Internet resources, often referred to as 'subject services', 'subject based information gateways' or, more recently, 'portals'. It also includes robot-generated databases of Web pages, often referred to as 'Web indexes' or 'Web search engines'.

Individual resources on the Web today have little structural relationship with each other. Resources may be linked, but there is no way of knowing the nature of the relationship between them. This leads to several practical limitations.

One well-known limitation is the inability to easily print a document that has been split into multiple linked HTML pages. Any general-purpose utility that attempts to automatically create a single document for printing is liable to fail, because it cannot tell links to the rest of the document apart from arbitrary links to other resources.

Another limitation is the inability to provide a richer mechanism for browsing. For example, PowerPoint users will be familiar with moving to the next and previous slides in a variety of ways (pressing the space bar, the left or right mouse button, the left or right arrow keys, the N or P keys, etc.). This is not easily possible with simple HTML resources.

A third limitation is the difficulty of grouping resources on a Web server for processing in different ways. For example, it is difficult to specify that an indexing robot should not retrieve a group of resources, or that an off-line browser can download a group of resources. It may be possible to make use of the underlying directory structure to carry out such tasks, but this is not a general-purpose or, in many cases, scalable solution.
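
The limits of grouping by directory structure can be illustrated with a minimal sketch. It assumes the Robots Exclusion Protocol (robots.txt), which this study does not name, as one familiar mechanism that excludes resources by path prefix, and parses it with Python's standard urllib.robotparser module. The host www.example.org, the paths, the robot name and the helper allowed_urls are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The only "group" it can express is a path prefix,
# i.e. a directory on the server.
ROBOTS_TXT = """\
User-agent: *
Disallow: /reports/draft/
"""

# Hypothetical pages that together form one logical document. The appendix
# is stored outside the directory that holds the rest of the draft.
DRAFT_REPORT = [
    "http://www.example.org/reports/draft/intro.html",
    "http://www.example.org/reports/draft/chapter1.html",
    "http://www.example.org/appendices/draft-appendix.html",
]


def allowed_urls(robots_txt, urls, agent="example-robot"):
    """Return the URLs that the given robot may still retrieve."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


# Pages of the draft that an obedient robot would still retrieve.
still_fetchable = allowed_urls(ROBOTS_TXT, DRAFT_REPORT)
print(still_fetchable)
```

In this sketch the appendix remains retrievable even though it is logically part of the excluded document, because the exclusion rule describes a directory rather than the collection itself. Covering it would require a further rule for every location in which part of the collection happens to be stored.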

There have been various attempts to provide a way of describing collections of items on the Web, including 'Web Collections', 'Channel Definition Format (CDF)', 'Meta Content Framework (MCF)' and more general site maps. These are described in more detail in the Web Collections section of this study. However, none of these approaches has been widely adopted.

Andy Powell, UKOLN