Introduction
On 25 November 1998 the WebWatch robot trawled the entry points for UK academic Web sites. This report is an analysis of the findings. This is the third Web crawl of the UK HEI entry points and completes a series of three snapshots of this community. The first crawl is available from the reports area of the WebWatch pages [1] and the second was published in the Journal of Documentation [2].
The input file of URLs obtained from NISS for the previous crawl was used. Of the 170 sites in this list, 150 sites were successfully crawled. Network/connection errors, out of date URLs and so on account for the 20 unexplored sites.
Size Metrics
Figure A7-1 shows a histogram of the total size of entry points. Total size is defined as the size of the HTML page plus its inline images. Linked resources that modern browsers may also download, including external style sheets, external client-side scripts, resources requiring 'plugins' and background images, are not included.
Sizes range from around 5Kb (<URL: http://www.rcm.ac.uk/>) to around 200Kb (<URL: http://www.kiad.ac.uk/>). The second large outlier, at 192Kb, corresponds to <URL: http://www.scot.ac.uk/>.
Hyperlinks
Figure A7-2 shows the number of hyperlinks within each site. These are obtained from the A element and from image map AREA elements. This data may include duplicate URLs where more than one hyperlink to the same URL exists.
Note that the outlier corresponds to <URL: http://www.rhbnc.ac.uk/>.
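The link count described above can be illustrated with a minimal sketch. It is not the WebWatch robot's own code: the class name LinkCounter, the function count_links and the sample markup are hypothetical, and it simply assumes that a hyperlink is any A or AREA element carrying a non-empty HREF attribute. Duplicates are kept in the total, as in the figure, and the number of distinct URLs is reported alongside.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Collect HREF values from A and AREA elements, keeping duplicates."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def count_links(html):
    """Return (total hyperlinks, distinct URLs) for one HTML page."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return len(parser.links), len(set(parser.links))


# Hypothetical entry point: two links to the same URL, one image map
# area, and a named anchor that is not a hyperlink.
sample = """
<a href="/about/">About</a>
<A HREF="/about/">About us</A>
<map name="nav"><area shape="rect" coords="0,0,10,10" href="/courses/"></map>
<a name="top">anchor only</a>
"""

total, distinct = count_links(sample)
print(total, distinct)
```

Reporting both numbers makes the effect of duplicate URLs on the totals visible.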
HTTP Servers
Figure A7-3 shows a pie chart of the server software encountered during the crawl. This information is based upon the HTTP Server header returned by the web server.
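As an illustration of how Server header values might be grouped into the slices of such a chart, the sketch below takes the product name of the first token in the header. This grouping rule is an assumption, since the report does not state how the robot grouped servers, and the function name server_family is hypothetical. The sample header values are taken from the servers encountered in this crawl.

```python
from collections import Counter


def server_family(server_header):
    """Return the product name of the first token in a Server header value."""
    tokens = (server_header or "").split()
    if not tokens:
        return "Unknown"
    # "Apache/1.3.3" -> "Apache"; a bare "WebSTAR" is returned unchanged.
    return tokens[0].split("/")[0]


headers = [
    "Apache/1.3.3 (Unix)",
    "Apache/1.2.4 FrontPage/3.0.2",
    "Microsoft-IIS/4.0",
    "CERN/3.0A",
    "WebSTAR",
]

tally = Counter(server_family(h) for h in headers)
for family, count in tally.most_common():
    print(family, count)
```

A real analysis would probably also need a mapping from product names to vendors (for example, treating all Netscape products as one slice), which is left out here.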
The Other category consists of the servers listed in Table A7-1.
Server | Count
Borderware | 1
Lotus Domino | 1
Novell | 2
OSU | 2
SWS-1.0 | 1
WebSTAR | 4
WinHttpd | 1
Table A7-1 - Components of the 'Other' slice from Figure A7-3
A more detailed table of the servers found is shown in Table A7-2.
Server | Count
Apache/1.0.0 | 1
Apache/1.0.3 | 1
Apache/1.1.1 | 1
Apache/1.1.3 | 1
Apache/1.2.0 | 2
Apache/1.2.1 | 2
Apache/1.2.1 PHP/FI-2.0b12 | 1
Apache/1.2.4 | 7
Apache/1.2.4 FrontPage/3.0.2 | 1
Apache/1.2.5 | 12
Apache/1.2.6 | 6
Apache/1.2b10 | 2
Apache/1.2b7 | 1
Apache/1.3.0 (Unix) | 7
Apache/1.3.0 (Unix) Debian/GNU | 1
Apache/1.3.0 (Unix) PHP/3.0 | 1
Apache/1.3.1 (Unix) | 6
Apache/1.3.2 (Unix) | 1
Apache/1.3.3 | 2
Apache/1.3.3 (Unix) | 6
Apache/1.3.3 Ben-SSL/1.28 (Unix) PHP/3.0.5 mod_perl/1.16 | 1
Apache/1.3.3 UUOnline/1.4 (Unix) | 1
Apache/1.3a1 | 1
Apache/1.3b3 | 1
Apache/1.3b5 | 1
BorderWare/2. | 1
CERN/3.0 | 8
CERN/3.0A | 3
HTTPS/2.12 | 1
Lotus-Domino/4.5 | 1
Microsoft-IIS/2.0 | 3
Microsoft-IIS/3.0 | 4
Microsoft-IIS/4.0 | 15
Microsoft-Internet-Information-Server/1.0 | 1
NCSA/1. | 2
NCSA/1.4. | 1
NCSA/1.5.1 | 3
NCSA/1.5.2 | 7
Netscape-Communications/1.1 | 1
Netscape-Communications/1.12 | 1
Netscape-Enterprise/2.01 | 3
Netscape-Enterprise/2.0a | 2
Netscape-Enterprise/3.0 | 4
Netscape-Enterprise/3.0F | 2
Netscape-Enterprise/3.0K | 1
Netscape-Enterprise/3.5-For-NetWare | 1
Netscape-Enterprise/3.5.1 | 4
Netscape-FastTrack/2.0 | 1
Netscape-FastTrack/2.01 | 1
Netscape-FastTrack/2.0a | 1
Netscape-FastTrack/2.0c | 1
Novell-HTTP-Server/2.5R | 1
Novell-HTTP-Server/3.1R | 1
OSU/1.9b | 1
OSU/3.2 | 1
SWS-1.0 | 1
WebSTAR | 2
WebSTAR/1.2.5 ID/13089 | 1
WebSTAR/2.0 ID/44693 | 1
WinHttpd/1.4a (Shareware Non-Commercial License) | 1
Total | 150
Table A7-2 - Table of All Servers Encountered
Of these servers, 40% used HTTP/1.0 and 60% used HTTP/1.1.
The Queso [3] software was used to get an idea of platforms. The high level results are summarised in Table A7-3. A more detailed breakdown is presented in Figure A7-7.
OS | Estimated Min | Estimated Max
Unix | 97 | 108
OS/2 | 0 | 5
MacOS | 6 | 11
Netware | 3 | 3
Windows NT/95/98 | 20 | 20
Other | 7 | 7
Unknown | 6 | 6
Table A7-3 - Operating Systems as Reported by Queso
Note that the 'Other' category in Table A7-3 corresponds to the Queso output category 'Cisco...' (Figure A7-7), and the 'Unknown' category corresponds to the Queso output categories 'Unknown OS', 'Firewalled host/port or network congestion' and 'Dead Host, Firewalled port or Unassigned IP'.
Note that the estimated minimum and maximum values in Table A7-3 may be skewed because of the Queso unknowns referred to above.
Operating System | Count
BSDi or IRIX | 1
Berkeley: Digital, HPUX, SunOs4, AIX3, OS/2 WARP-4, others... | 5
Berkeley: HP-UX B.10.20 | 1
Berkeley: IRIX 5.x | 3
Berkeley: usually Digital Unix, OSF/1 V3.0, HP-UX 10.x | 14
Berkeley: usually HP/UX 9.x | 1
Berkeley: usually SunOS 4.x, NexT | 5
Cisco 11.2(10a), HP/3000 DTC, BayStack Switch | 7
Dead Host, Firewalled Port or Unassigned IP | 2
FreeBSD, NetBSD, OpenBSD | 1
IBM AIX 4 | 2
IRIX 6.x | 2
Linux 1.3.xx, 2.0.0 to 2.0.34 | 5
Linux 2.0.35 to 2.0.9 | 1
MacOS-8 | 6
Novell Netware TCP/IP | 3
Reliant Unix from Siemens-Nixdorf | 1
Solaris 2.x | 60
Standard: Solaris 2.x, Linux 2.1.???, MacOS | 5
Windows 95/98/NT | 20
Firewalled Solaris 2.x | 1
Firewalled host/port or network congestion | 3
Unknown OS | 1
Total | 150
Metadata Profile
The attributes of the HTML <META> element were examined for known metadata conventions. Table A7-5 shows the results.
Metadata | Number of occurrences | No. of sites
PICS | 1 | 1
HTTP-EQUIV="Refresh" | 9 | 9
Reply-To | 3 | 3
Search Engine | 190 | 95
Dublin Core | 102 | 11
HTTP-EQUIV="(Dublin Core)" | 8 | 1
Table A7-5 - Types of Metadata Encountered
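A profile of this kind can be sketched as follows. The matching rules are assumptions made for illustration and are not taken from the WebWatch software: PICS is assumed to appear as HTTP-EQUIV="PICS-Label", search engine metadata as NAME="keywords" or NAME="description", and Dublin Core as names beginning with "DC.". The names classify_meta and MetaProfiler are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


def classify_meta(attrs):
    """Map one META element's attributes to a metadata convention, or None.

    The rules below are illustrative assumptions, not the robot's rules.
    """
    name = (attrs.get("name") or "").lower()
    http_equiv = (attrs.get("http-equiv") or "").lower()
    if http_equiv == "pics-label":
        return "PICS"
    if http_equiv == "refresh":
        return 'HTTP-EQUIV="Refresh"'
    if http_equiv == "reply-to" or name == "reply-to":
        return "Reply-To"
    if http_equiv.startswith("dc."):
        return 'HTTP-EQUIV="(Dublin Core)"'
    if name.startswith("dc."):
        return "Dublin Core"
    if name in ("keywords", "description"):
        return "Search Engine"
    return None


class MetaProfiler(HTMLParser):
    """Count META elements on one page by metadata convention."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            kind = classify_meta(dict(attrs))
            if kind:
                self.counts[kind] += 1


# Hypothetical page head used only to exercise the classifier.
sample = """
<meta name="keywords" content="university, courses">
<meta name="description" content="Home page">
<META NAME="DC.Title" CONTENT="Home page">
<meta http-equiv="Refresh" content="5; URL=/home/">
<meta name="generator" content="Some editor">
"""

profiler = MetaProfiler()
profiler.feed(sample)
profiler.close()
print(dict(profiler.counts))
```

Counting the pages on which each convention appears at least once, rather than every element, would give the per-site column.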
Technologies
Scripting
29 pages used the <SCRIPT> element to include a client-side script block. Of these, 23 pages included the attribute-value LANGUAGE="JavaScript".
All HTML elements were searched for the set of defined JavaScript event handlers. The results are shown in Table A7-6.
Handlers | Count | Sites
onChange | 1 | 1
onClick | 13 | 4
onLoad | 10 | 8
onMouseOver | 320 | 36
Table A7-6 - Event Handlers Encountered
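The search for event handlers can be sketched along the following lines. The HANDLERS set is an assumed, partial list of JavaScript event handler attributes rather than the exact set used in the crawl, and the names HandlerCounter and profile, together with the example URLs, are hypothetical. The function returns both columns of the table: total occurrences and the number of sites using each handler.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed, partial list of handler attribute names (lower case).
HANDLERS = {
    "onabort", "onblur", "onchange", "onclick", "onerror", "onfocus",
    "onload", "onmouseout", "onmouseover", "onreset", "onselect",
    "onsubmit", "onunload",
}


class HandlerCounter(HTMLParser):
    """Count event handler attributes on every element of one page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive in lower case, so onMouseOver matches.
        for name, _value in attrs:
            if name in HANDLERS:
                self.counts[name] += 1


def profile(pages):
    """pages maps a site URL to its HTML; return (occurrences, sites)."""
    occurrences, sites = Counter(), Counter()
    for html in pages.values():
        parser = HandlerCounter()
        parser.feed(html)
        parser.close()
        occurrences.update(parser.counts)
        sites.update(parser.counts.keys())  # each site counted once
    return occurrences, sites


# Hypothetical pages used only to exercise the counter.
pages = {
    "http://a.example/": '<body onLoad="init()">'
                         '<a href="/x/" onMouseOver="swap(1)">X</a>'
                         '<a href="/y/" onMouseOver="swap(2)">Y</a></body>',
    "http://b.example/": '<a href="/z/" onMouseOver="swap(3)">Z</a>',
}

occurrences, sites = profile(pages)
print(occurrences["onmouseover"], sites["onmouseover"])
```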
Java
Two Java applets were referenced by the site <URL: http://www.uwic.ac.uk/>.
The site <URL: http://www.luton.ac.uk/> referenced a plugin using the OBJECT element.
Frames and "Splash Screens"
A total of 21 sites used framesets to provide a framed interface to the institutional entry point.
A total of 10 sites used HTTP-EQUIV="refresh" to provide a client-side redirect from a "splash screen" to the entry point.
Cachability
Table A7-7 shows a summary of the cachability of crawled resources.
Cachable resources | 72.5% of HTML pages, 80.9% of images
Non-cachable resources | 4.4% of HTML pages, 0.2% of images
Table A7-7 - Cachability of Resources Encountered
Additionally, 40% of HTML pages and 45% of images were served with the HTTP/1.1 ETag header.
A resource is defined as cachable if:
* It contains an Expires header showing that the resource has not expired
* It contains a Last-Modified header with a modification date more than one day before the robot crawl
* It contains the Cache-control: public header
A resource is defined as not cachable if:
* It contains an Expires header showing that the resource has expired
* It contains a Last-Modified header with a modification date coinciding with the day of the robot crawl
* It contains the Cache-control: no-cache or Cache-control: no-store headers
* It contains the Pragma: no-cache header
The cachability of a resource was not determined if the resource used the HTTP/1.1 ETag header, since this would have required additional testing at the time of the trawl, which was not carried out.
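The rules above can be expressed as a short classification function. The sketch below is illustrative only. The order in which the rules are applied is an assumption (ETag first, then the non-cachable rules, then the cachable rules), as is the crawl time of midday UTC on 25 November 1998. The names parse_http_date, classify and CRAWL_TIME are hypothetical, and the result "unknown" is used for resources that match none of the rules.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed time of the crawl; the report gives only the date.
CRAWL_TIME = datetime(1998, 11, 25, 12, 0, tzinfo=timezone.utc)


def parse_http_date(value):
    """Parse an HTTP date header; return None if missing or malformed."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def classify(headers, crawl_time=CRAWL_TIME):
    """Return 'undetermined', 'not cachable', 'cachable' or 'unknown'."""
    h = {key.lower(): value for key, value in headers.items()}
    if "etag" in h:
        return "undetermined"
    directives = {d.strip().lower() for d in h.get("cache-control", "").split(",")}
    expires = parse_http_date(h.get("expires"))
    modified = parse_http_date(h.get("last-modified"))

    # Non-cachable rules.
    if expires is not None and expires <= crawl_time:
        return "not cachable"
    if modified is not None and modified.date() == crawl_time.date():
        return "not cachable"
    if directives & {"no-cache", "no-store"}:
        return "not cachable"
    if "no-cache" in h.get("pragma", "").lower():
        return "not cachable"

    # Cachable rules.
    if expires is not None:
        return "cachable"  # Expires lies in the future at this point
    if modified is not None and crawl_time - modified > timedelta(days=1):
        return "cachable"
    if "public" in directives:
        return "cachable"
    return "unknown"


print(classify({"Last-Modified": "Mon, 02 Nov 1998 09:00:00 GMT"}))
print(classify({"Last-Modified": "Wed, 25 Nov 1998 08:30:00 GMT"}))
print(classify({"ETag": '"abc"', "Cache-Control": "public"}))
```

Checking the non-cachable rules first means that a resource carrying conflicting headers is treated as not cachable, which is the more cautious choice for a cache.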
Comparisons with Previous Crawls
Server Profiles
As shown in Figure A7-4, the Apache and Microsoft servers have shown increasing adoption. The Netscape server has fluctuated (perhaps due to a period of experimentation). The NCSA and CERN servers have shown a decrease in usage.
The growth of Apache and Microsoft servers has also resulted in a decrease of the 'Other' category, i.e. sites are subscribing to the more popular servers.
A chart showing the growth of various servers is shown in Figure A7-5. This chart shows the contribution of growth for the period Oct 1997 - Jul 1998 and Jul 1998 - Nov 1998. Note that negative growth is interpreted as decline.
Size of Entry Points
A set of sites was isolated, for which reliable measurements of size exist for two previous web crawls. The results are shown in Figure A7-6.
Note that the majority of sites have not undergone great fluctuations in size. The outlier corresponds to <URL: http://www.scot.ac.uk/>. The pages for this site have changed because the site has become part of a larger institution.
"Splash Screens"
The number of institutional entry points which make use of "splash screens" or redirects has shown a steady increase, from five sites (October 1997) to seven sites (July 1998) to ten sites in the current trawl (November 1998).
Hyperlink Profiles
The domains referenced by hyperlinks in the three crawls have been dominated by ac.uk, and this domain has shown an overall increase. Table A7-8 shows the contribution of different types of domain name as a percentage of all hyperlinks in the sites.
Domain | October 1997 | July 1998 | November 1998
Total .uk | 97.31% | 97.13% | 98.00%
ac.uk | 96.63% | 95.94% | 97.68%
net | 0.30% | 0.16% | 0.11%
com | 0.82% | 0.61% | 0.63%
org | 0.34% | 0.08% | 0.18%
Other | 0.15% | 0.08% | 0.10%
IP address | 0.00% | 0.12% | 0.04%
Badly formed URL | 1.10% | 1.72% | 0.91%
Table A7-8 - Domains Referenced in Hyperlinks
Note that in Table A7-8 the ac.uk data is a subset of the .uk data.
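The categories in Table A7-8 suggest a classification along the following lines. This is a sketch under stated assumptions rather than the method actually used: a URL is treated as badly formed if no host name can be extracted from it, links are assumed to be absolute URLs, and the names classify_url and domain_shares and the example URLs are hypothetical. The category "other .uk" is introduced here so that the .uk total can be computed as the sum of ac.uk and the remaining .uk hosts.

```python
import ipaddress
from collections import Counter
from urllib.parse import urlsplit


def classify_url(url):
    """Assign an absolute URL to one of the domain categories."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return "Badly formed URL"
    if not host:
        return "Badly formed URL"
    try:
        ipaddress.ip_address(host)
        return "IP address"
    except ValueError:
        pass  # not a numeric address, so treat it as a host name
    host = host.rstrip(".")
    if "." not in host:
        return "Badly formed URL"
    if host.endswith(".ac.uk"):
        return "ac.uk"
    if host.endswith(".uk"):
        return "other .uk"
    tld = host.rsplit(".", 1)[-1]
    return tld if tld in ("net", "com", "org") else "Other"


def domain_shares(urls):
    """Return each category's share of all hyperlinks, as a percentage."""
    counts = Counter(classify_url(u) for u in urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    shares = {kind: 100.0 * n / total for kind, n in counts.items()}
    # ac.uk is a subset of .uk, so the .uk total includes it.
    shares["Total .uk"] = shares.get("ac.uk", 0.0) + shares.get("other .uk", 0.0)
    return shares


urls = [
    "http://www.rcm.ac.uk/",
    "http://www.ukoln.ac.uk/web-focus/",
    "http://www.example.co.uk/",
    "http://www.example.com/",
    "http://192.0.2.1/",
    "http:/www.example.ac.uk/",   # missing slash, so no host name
    "http://www.example.org/",
    "http://www.example.de/",
]

for kind, share in sorted(domain_shares(urls).items()):
    print(f"{kind}: {share:.2f}%")
```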
Use of Metadata
In each crawl, we have looked for search-engine (SE) type metadata and Dublin-Core (DC) metadata. The findings for the three crawls are shown in Figure A7-7.
Figure A7-7 shows that the use of Dublin Core metadata has increased considerably over the three crawls, from one site in October 1997 to 11 sites in November 1998.
References
1. A Survey of UK Academic Library Web Sites, <URL: http://www.ukoln.ac.uk/web-focus/webwatch/reports/hei-lib-may1998/>
2. How Is My Web Community Doing? Monitoring Trends in Web Service Provision, Journal of Documentation, Vol. 55 No. 1, January 1999, pp 82-95
3. Queso, <URL: http://www.apostols.org/projectz/queso/>