The web was initially developed based on three architectural components: transport (HTTP), addressing (URLs) and data formats (HTML). As the web grew, limitations in the original architecture became apparent. This paper gives an overview of the development of these standards, reviews the introduction of new web standards, and describes the development of a new architectural component: metadata. The paper concludes by describing some of the difficulties in deploying the new standards and ways of overcoming these difficulties.
For many people the World Wide Web (the web) refers to the Netscape or, increasingly, the Internet Explorer browser and the information which the browser can access. Tim Berners-Lee, the father of the web, was the main developer of standards on which the web is based. It is these standards which many web purists would regard as the essential features of the World Wide Web.
The web originally consisted of three key standards: HTML, the HyperText Markup Language, which provided the data format for native resources on the Web; HTTP, the transfer protocol for the Web; and URLs, the addressing mechanism for locating Web resources. Since the early 1990s, when the Web was first developed, these standards have been developed further, and several new Web standards have been developed or are under development. This paper reviews developments to Web standards, especially the standards whose development is being coordinated by the World Wide Web Consortium (W3C).
Serious users of the web will be familiar with HTML. Many information providers on the web will also be familiar with HTML's deficiencies, such as the difficulty of controlling the appearance of web pages, proprietary HTML extensions and the browser wars, the difficulty of reusing information stored in HTML, and the difficulty of maintaining large websites.
The HTML 4.0 recommendation [1] primarily addresses deficiencies in HTML 3.2's accessibility support (e.g. improving access to web sites by people with disabilities by providing hints to voice browsers for the visually impaired). In addition it provides better integration with style sheets (described below). Web authors expecting a range of new tags in HTML 4.0 to provide more control over the appearance of HTML documents will be disappointed, as the intention is for HTML to define the structure of a document and for style sheets to describe how that structure is to be displayed.
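As a small sketch of the kind of accessibility markup HTML 4.0 encourages (the attributes shown are part of HTML 4.0; the content itself is invented for illustration):

  <img src="logo.gif" alt="University logo">
  <table summary="Course fees for 1999, listed by department">
    ...
  </table>
  <label for="email">Email address:</label>
  <input type="text" id="email" name="email" accesskey="e">

The alt text and table summary give non-visual browsers something to speak, while label and accesskey make forms usable without a mouse.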
Cascading style sheets (CSS) help to address some of the difficulties mentioned above. CSS 2.0 [2] provides comprehensive control over the appearance of HTML documents. Use of external style sheet files can also help with the maintenance of web sites: the separation of content from appearance means that the look-and-feel of a web site can be maintained without having to edit the files containing the content, and a single corporate style sheet file, or a small, manageable number of them, can easily be edited to change a website's design.
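As an illustration (the file name corporate.css and the rules are invented for this sketch), each page links to a shared external style sheet:

  <link rel="stylesheet" type="text/css" href="/styles/corporate.css">

and the style sheet holds rules such as:

  body { font-family: Arial, sans-serif; background-color: #ffffff }
  h1   { color: #003366 }

Editing the rules in this single file changes the appearance of every page which links to it, without touching the HTML content.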
The development of a Document Object Model (DOM) [3] for HTML will enable interactive web sites to be developed more easily. The release of the DOM recommendation should help to avoid the problems caused by the differing client-side scripting implementations provided by the mainstream browser vendors. The browser vendors should be encouraged to support the DOM by its architectural strengths, which have been discussed thoroughly within the W3C DOM Working Group.
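As a rough sketch of the sort of scripting the DOM makes portable (using the ECMAScript binding; the page content is invented), a script can locate and modify part of a document through the standard DOM Level 1 interfaces rather than a browser-specific object model:

  <p>Loading catalogue...</p>
  <script type="text/javascript">
    // Find the first <p> element via the standard DOM interfaces
    var para = document.getElementsByTagName("p").item(0);
    // Replace its text with a newly created text node
    para.replaceChild(document.createTextNode("Catalogue loaded."),
                      para.firstChild);
  </script>

The same script should behave identically in any browser which implements the recommendation.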
Although HTML 4.0, CSS 2.0 and DOM 1.0 provide the underlying standards for the development of attractive, maintainable and interactive websites, they do not address HTML's limited support for structured and reusable documents. XML (Extensible Markup Language) [4] has been designed to enable arbitrary document structures to be defined.
Although end users of the web appreciate the web's hyperlinking mechanism, the hypertext community have criticised the web's limited hyperlinking functionality. Providers of large web sites are also becoming aware of the difficulties in maintaining hyperlinks which are embedded in HTML documents.
The development of XML provided an opportunity for the web's hyperlinking deficiencies to be addressed. XLink [5] provides additional hyperlinking functionality, including links that lead users to multiple destinations, bidirectional links and links with special behaviours. In addition external link databases will ease the maintenance of hyperlinks.
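The XLink syntax was still under development at the time of writing, but an extended link might look something like the following sketch (the element names, prefix and URLs are illustrative only):

  <reviews xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
    <loc xlink:type="locator" xlink:href="http://www.example.org/cds/123"
         xlink:label="cd"/>
    <loc xlink:type="locator" xlink:href="http://www.example.org/reviews/123"
         xlink:label="review"/>
    <go xlink:type="arc" xlink:from="cd" xlink:to="review" xlink:show="new"/>
  </reviews>

The link ties the CD and its review together without either document containing an embedded anchor, so the relationship can be stored and maintained in a separate link database.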
XPointer [6] addresses HTML's limited support for pointing into documents. With XPointer it will be possible to link to any portion of an XML document, even if the author has not provided an internal anchor, including ranges which span more than a single element.
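For example, an XPointer can be appended to a URL as a fragment identifier along the following lines (the syntax follows the XPointer drafts and may change; the document and names are invented):

  http://www.example.org/catalogue.xml#xpointer(id('track3'))
  http://www.example.org/catalogue.xml#xpointer(/catalogue/cd[2]/title)

The first addresses an element by its ID; the second addresses the <title> of the second <cd> element, even though the author provided no anchor there.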
As XML documents which merge elements from different sources become deployed, there is a need to address potential name clashes between XML elements. For example, an XML document containing details of a CD collection is illustrated below:
  <title>My CD Collection</title>
  <p>Here is a list of my CDs.</p>
  <artist>Oasis</artist>
  <title>Be Here, Be Now</title>
  ...

Figure 1: XML Document Illustrating Name Clash Problem
How can an XML parser differentiate between the title of the document and the title of the CD? XML Namespaces [7] have been introduced to enable such clashes to be resolved, as illustrated in Figure 2.
  <cd xmlns:cd='http://www.cd.org/schema/'>
  <title>My CD Collection</title>
  <p>Here is a list of my CDs.</p>
  <cd:artist>Oasis</cd:artist>
  <cd:title>Be Here, Be Now</cd:title>
  ...

Figure 2: XML Namespace Solution to Name Clash
In Figure 2 the (fictitious) URL 'http://www.cd.org/schema/' identifies a machine-readable definition of the element set for CDs, which includes the elements <title> and <artist>. These CD elements are identified by use of the cd: prefix, as shown.
XML is now the preferred format for new data formats which are being developed by W3C. For example P3P (Platform for Privacy Preferences) [8], SVG (Scalable Vector Graphics) [9], SMIL (Synchronized Multimedia Integration Language) [10] and MathML (Mathematical Markup Language) [11] are all XML applications.
Version 1.0 of HTTP (the HyperText Transfer Protocol) [12] suffered from design flaws and implementation problems. Many of the problems have been addressed by HTTP/1.1, such as support for virtual hosts and improved support for caching. However HTTP/1.1 is insufficiently flexible or extensible to support the development of tightly-integrated web applications. HTTP/NG [13] has been proposed as a radical redesign using object-oriented technologies which aims to address these concerns. However due to the complexities of this design, the future of HTTP/NG is uncertain. An extension framework for HTTP/1.x has recently been announced [14] which may provide an interim solution.
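For example, the mandatory Host header in an HTTP/1.1 request is what allows a single server to host several virtual sites on one IP address, and response headers such as Cache-Control give far finer control over caching than HTTP/1.0 offered (the host name and values below are illustrative):

  GET /depts/music/ HTTP/1.1
  Host: www.example.org

  HTTP/1.1 200 OK
  Cache-Control: max-age=3600
  Content-Type: text/html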
Most experienced web users will have encountered the dreaded 404 error message indicating that a resource has not been found. URLs such as the fictitious http://www.bristol-poly.ac.uk/depts/music/ are liable to change due to changes in the name of the organisation, internal reorganisation or reorganisation of the underlying web directory structure.
URNs (Uniform Resource Names) [15] have been proposed as a solution to some of the deficiencies of URLs. Other alternatives include DOIs (Digital Object Identifiers) [16] and PURLs (Persistent URLs) [17]. However, wide-scale deployment of these technologies does not appear likely in the near future, in part due to the organisational, as opposed to technical, requirements needed for their deployment. The pragmatic solution is to recognise that URLs don't break - people break them - and that URLs should be designed to have a long life-span.
Metadata can be regarded as the missing architectural component from the initial implementation of the web. During the mid 1990s there were several web developments - resource discovery, web site mapping and digital signatures - which were all aspects of metadata.
In order to coordinate such metadata developments the W3C set up a Metadata Coordination Group [18] which developed RDF (the Resource Description Framework) [19].
Figure 3: Web Architecture
RDF provides a general framework for the deployment of metadata applications. It is being used for a number of applications, such as Dublin Core metadata, digital signatures, site maps, content rating and intellectual property rights. RDF aims to provide a means to make statements about properties of Web resources. It includes a number of components: a formal data model, an XML syntax, and a (proposed) Schema language for describing RDF vocabularies.
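As a sketch of the XML syntax (using the Dublin Core vocabulary; the resource URL and property values are invented), a simple RDF description of a web page might look like this:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="http://www.example.org/report.html">
      <dc:title>Annual Report</dc:title>
      <dc:creator>Example Organisation</dc:creator>
      <dc:date>1999-05-01</dc:date>
    </rdf:Description>
  </rdf:RDF>

Because the statements share a common data model, a resource description, a digital signature and a content rating can all be processed by the same RDF tools.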
Although much thought has been given to the development of a rich set of interoperable web standards, the deployment of applications may not necessarily be easy. For example, although CSS 1.0 has been available since December 1996, use of CSS does not appear to have taken off widely. This is due, in part, to the lack of support, or buggy support, for CSS in the mainstream browsers. We are also seeing a slowing-down in the release of W3C standards, due partly to the ever-increasing complexity and interdependencies of new standards, but also due to concerns over patent claims, as mentioned in Tim Berners-Lee's keynote talk at the WWW8 conference (see Brian Kelly's conference report [20]).
As Jakob Nielsen has described in his Alertbox column on The Increasing Conservatism of Web Users [21], uptake of new technologies is also affected by the increasing reluctance of users to deploy new browsers.
A number of protocol solutions have been suggested to address such concerns. For example Transparent Content Negotiation [22] has been proposed as a protocol solution to the deployment of new data formats, but unfortunately it is not widely deployed.
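The underlying idea builds on HTTP's existing negotiation headers: the browser declares the formats it can handle and the server returns the most appropriate variant (a rough sketch; the resource and values are illustrative, and Transparent Content Negotiation adds further headers for listing the available variants explicitly):

  GET /report HTTP/1.1
  Host: www.example.org
  Accept: text/xml;q=1.0, text/html;q=0.8
  Accept-Language: en-gb, en;q=0.7

A server which supports negotiation could then return an XML variant to browsers able to render it and an HTML variant to the rest.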
Increasingly we are finding that web server management applications and toolkits provide support for browser-agent negotiation, in which the server chooses a variant based on the browser's User-Agent identification. Browser-agent negotiation is used, for example, by W3C's CSS Gallery [23].
Browser-agent negotiation is, however, an application solution, and not part of the underlying web protocols. A recent W3C Note on CC/PP [24] has been submitted which describes an RDF application for defining browser functionality in a machine-understandable way.
In addition to these protocol developments, it may also be possible to deploy new technologies through the use of sophisticated content management systems. Content management systems typically provide backend data management and processing capabilities, which contrasts with the simple file-serving approach of the first-generation web.
An alternative approach is the use of proxy intermediaries. With this approach the server serves files as normal, but the files are then processed and reformulated by a proxy server. This approach has been used, for example, to reformat HTML resources for display on a PDA (Personal Digital Assistant).
One of the strengths of the web, particularly in its early days, was its simplicity. With knowledge of a few HTML tags or access to a simple authoring tool, it was possible to make information available globally. We are now reaching the limits of this first-generation web. Better performance and richer functionality, however, may require the deployment of complex, and possibly expensive, software packages. Large-scale service providers may be in a position to deploy such software. Whether this will result in a two-tier web is a scenario which, no doubt, concerns many.
Brian Kelly is UK Web Focus, a national web coordination post for the UK Higher Education community. Brian is based at UKOLN, University of Bath. Brian attended the first, fifth, sixth, seventh and eighth WWW conferences and has been a member of the programme committee for several of these conferences. Brian gave a short paper on Subject-Based Information Gateways in the UK at WWW8.