Initial evaluation of the project requirements suggested that we could adapt an existing robot for our use (rather than writing one from scratch); we subsequently decided that Harvest was a good fit.
Our initial web crawls used the Harvest gatherer with modified summarisers. Later refinements of our requirements led to more substantial alterations of the software, until we had rewritten many of its modules and very little of the original Harvest remained in use.
Limits were still imposed by the original software that remained in use, and by the fact that our rewritten components had to fit around a structure optimised for web indexing. We therefore decided to write a robot optimised for the kinds of task required of WebWatch, drawing on our previous experience of needs and problems in doing so.
Unfortunately, no other existing robot seemed to fit the needs of the project.
Before each crawl of a community, it is useful to define which resources, and which aspects of those resources, we will monitor. We do not explicitly look at the textual content of pages, although checking for conformance to HTML or XML standards may require some text to be considered.
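For illustration, such a definition might be captured in a simple profile read before a crawl. The sketch below is written in Perl (the language used in our post-processing) and is purely hypothetical: the attribute names and structure are assumptions for illustration, not the WebWatch robot's actual configuration.

    #!/usr/bin/perl -w
    use strict;

    # Hypothetical crawl profile: which resources to visit and which
    # aspects of each resource to record. All names are illustrative.
    my %profile = (
        community  => 'example UK academic sites',
        start_urls => [ 'http://www.example.ac.uk/' ],
        max_depth  => 3,
        attributes => [
            'http-status',     # server response code
            'content-type',    # MIME type reported by the server
            'content-length',  # size in bytes
            'last-modified',   # server-supplied modification date
            'doctype',         # used for HTML/XML conformance checking
            'link-count',      # number of hyperlinks found on the page
        ],
    );

    print "Monitoring: ", join( ', ', @{ $profile{attributes} } ), "\n";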
So far we have been using Perl to convert the robot's SOIF records into appropriate CSV records. These have then been analysed in packages such as Excel, Minitab and SPSS.
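A much-simplified sketch of such a conversion script is shown below. It assumes single-line SOIF attribute values and a fixed set of output columns; the real records contain many more attributes and may have multi-line values, so this illustrates the approach rather than reproducing the script we actually use.

    #!/usr/bin/perl -w
    use strict;

    # Simplified SOIF-to-CSV converter. Assumes each attribute fits on
    # one line ("Name{size}: value") and that records are delimited by
    # "@TYPE { url" and "}" lines, as in Harvest gatherer output.
    my @fields = qw(url content-type content-length last-modified);
    print join( ',', @fields ), "\n";

    my %record;
    while ( my $line = <> ) {
        chomp $line;
        if ( $line =~ /^\@\S+\s*\{\s*(\S+)/ ) {          # start of a record
            %record = ( 'url' => $1 );
        }
        elsif ( $line =~ /^([\w-]+)\{\d+\}:\s*(.*)$/ ) { # attribute line
            $record{ lc $1 } = $2;
        }
        elsif ( $line =~ /^\}/ ) {                       # end of record: emit a CSV row
            print join( ',',
                map { defined $record{$_} ? qq("$record{$_}") : '' } @fields ), "\n";
        }
    }

Run as, for example, "perl soif2csv.pl gatherer.out > crawl.csv" (the file names here are illustrative); the resulting CSV file can then be loaded into the analysis packages mentioned above.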
This work will initially be piloted on a number of small communities (e.g. JISC service providers).
In order to get a random sample of server logs we would need to ask webmasters to provide access to their logs (possibly after anonymising them). We will need to formulate an acceptable use policy and possibly write tools that anonymise aspects of the logs before they are submitted to us.
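As an indication of what such a tool might do, the sketch below replaces the client host field of each Common Log Format entry with a one-way hash, so that distinct hosts remain distinguishable without being identifiable. The log format, hashing scheme and salt are illustrative assumptions rather than an agreed policy.

    #!/usr/bin/perl -w
    use strict;
    use Digest::MD5 qw(md5_hex);

    # Illustrative log anonymiser: replaces the client host (the first
    # field of a Common Log Format line) with a salted one-way hash.
    my $salt = 'site-chosen-secret';   # would be chosen by the webmaster

    while ( my $line = <> ) {
        if ( $line =~ /^(\S+)(\s.*)$/s ) {
            my ( $host, $rest ) = ( $1, $2 );
            my $token = substr( md5_hex( $salt . $host ), 0, 12 );
            print "host-$token$rest";
        }
        else {
            print $line;               # pass anything unexpected through unchanged
        }
    }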
From the logs we hope to extract information on user agents, user platforms and the distribution of "hits" across communities.
The log files will be analysed initially with off-the-shelf software.
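Purely as an illustration of the kind of information involved, a few lines of Perl are enough to tabulate user agents from logs in the common "combined" format, where the user-agent string is the final quoted field. This is a sketch under that assumption, not the analysis tool we expect to use.

    #!/usr/bin/perl -w
    use strict;

    # Sketch: count user agents in "combined" format logs, where the
    # user-agent string is the last quoted field on each line.
    my %agents;

    while ( my $line = <> ) {
        if ( $line =~ /"([^"]*)"\s*$/ ) {
            my $agent = $1;
            $agent =~ s{[\/ ].*$}{};   # crude: keep the product name only
            $agents{$agent}++;
        }
    }

    foreach my $agent ( sort { $agents{$b} <=> $agents{$a} } keys %agents ) {
        printf "%6d  %s\n", $agents{$agent}, $agent;
    }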
Our trawls to date have been community-oriented, and the conclusions and results of the subsequent analyses have been fed back into those communities.
We have also formed a technical advisory group for WebWatch, consisting of experienced and interested individuals.