Once a web server has more than one page, a search engine is needed, for the sake of webmaster and user alike. Finding a search engine that suits your requirements is not particularly easy. In the past, searches were accomplished with flat indexes generated by a tool that followed your directory structure. Over the past few years search tools have moved to being spider based, generating the index by following links on web servers. This change has removed one problem area, since spider-based tools can easily index a group of specified servers, and has focussed attention on another, often ignored, one: controlling a robot or spider's access to your web-based information.
Looking at UK HE sites in July/August 1999 (results published in Ariadne <http://www.ariadne.ac.uk/issue21/webwatch/> in September 1999), Brian Kelly found the following search engines were being used:
Name | Total | Details
ht://dig | 25 | Bath - Birkbeck - Bradford - Brighton - Bristol - Brunel - City - Coventry - Durham - Glasgow - Goldsmiths - Imperial - Keele - Kent - Leicester - London School of Hygiene - LSE - Manchester - Oxford - Portsmouth - Salford - UMIST - Worcester - York
eXcite | 19 | Aberdeen - Birmingham - Central Lancashire - Edge Hill - Exeter - Glasgow Caledonian - Kingston - Loughborough - UMIST - MMU - Nottingham - Northern College - Oxford Brookes - Sunderland - St George's - Thames Valley - UWE - Wolverhampton - Worcester
Microsoft | 12 | Aberystwyth - Canterbury Christ Church - Essex - Liverpool John Moore - Manchester Business School - NTU - Middlesex - Paisley - Scottish Agricultural College - Southampton Institute - UWIC - Westminster
Harvest | 8 | Anglia - DMU - Cranfield - Liverpool - Queen's University Belfast - Reading - Southampton - Swansea
Ultraseek | 7 | Cambridge - Edinburgh - Newport - Royal Holloway - Sussex - Ulster - UNL
SWISH / SWISH-E | 5 | KCL - Lancashire - London Guildhall - Sheffield Hallam - UCE
Thunderstone's Webinator | 4 | Newcastle - UEA - NWI - Sheffield
Netscape (Compass/Web Publisher) | 3 | Bangor - LMU - UCL
wwwwais (formerly available from <http://www.eit.com/>) | 3 | Cardiff - Hull - UWCM
FreeFind (Remote Index) | 2 | Northampton - St Mary's College
Muscat | 1 | Surrey
Maestro | 1 | Dundee
AltaVista (Product) | 1 | MMU
AltaVista (Public Service) | 1 | Derby
WebStar | 1 | SOAS
WebFind | 1 | TASC
Other (Not known, home-grown) | 6 | University of London - Open University - South Bank - Surrey Institute - Queen Margaret University College - UNN
None (or not easily found) | 59 | Abertay - Aston - Bath Spa - Bolton - Bournemouth - Bretton - Bucks - CSSD - Cheltenham - Chester - Chichester - COGS - Dartington - East London - Falmouth - Glamorgan - Glasgow School of Arts - Greenwich - Harper Adams - Heriot-Watt - Herts - Huddersfield - Institute of Education - Kent Institute - King Alfred's - Lampeter - Lincolnshire - Liverpool Hope - London Business School - London Institute - Napier - Newman - North Riding - Northern School - Norwich - Plymouth - Ravensbourne - Ripon - RGU - Roehampton - Royal Academy - Royal College of Art - Royal College of Music - Royal Northern College of Music - Royal Scottish Academy of Music and Drama - Royal Veterinary College - St Andrews - St Mark - Stirling - Strathclyde - Swansea Institute - Trinity College of Music - Trinity College - Warwick - Welsh College - Westhill College - Westminster College - Writtle - Wye
Web site or search engine not available at time of survey | 1 | Staffordshire
Total | 160 |
It would be interesting to know when these search engines were installed and whether the software is being maintained. Questions that spring to mind include:
· Is the version of Muscat used by Surrey the free version that was available for a time?
· Are the users of Excite happy with its security, and with the fact that development seems to have ceased?
· Are users of local search engines that don't use robots.txt happy with what other search engines can index on their sites (you have got a robots.txt file, haven't you)?
There are several very different kinds of mechanism that a search facility can use, and it is this mechanism that determines the functionality of the search and the extent of the index. Broadly, search back ends are either direct indexes, created by scanning the local or networked file structure (for instance Microsoft Index Server), or robot/spider-generated indexes, which cover linked files only and are controlled by the robots.txt file on the web server. The former reflects the files present, whereas the latter reflects the active structure.
Because the first strategy is not 'Internet mediated' it is suitable for networked or one-off servers, and it is the technology used by previous generations of current tools, such as SWISH (both ++ and -E). The index will probably have to sit on the web server itself, which can be problematic if it becomes large and well used.
The second strategy allows much more elastic indexing of a number of servers. In general the approach is easy to administer and problem free - controls over which servers are indexed, and how frequently, lie with the indexing software, but there are also several means for the server administrator to control how the server is indexed (see below). Problems can arise when indexing from built-in or add-on indexing software, such as Microsoft Site Server, Netscape Catalog Server or WebStar indexing software, which may be vendor specific. This is because of differences in the ways web servers respond when the indexing robot approaches them (the API of the web server). The APIs of the above servers are subtly different from, say, Apache's, and the indexing software may have been written with their particular API in mind, so it may baulk at unexpected server responses. This problem is more likely to arise when you are indexing a large number of servers (and so encounter more types of server software in the process).
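To make the spider/robot model concrete, here is a minimal sketch in Python (standard library only). It is purely illustrative: the seed URL, page limit and single-host restriction are invented assumptions, and real products add parsing, ranking, scheduling and persistence on top of this basic loop.

    # A minimal sketch of a spider-based indexer (illustrative only).
    # Assumes Python 3 and the standard library; the seed URL and page
    # limit below are invented examples.
    from html.parser import HTMLParser
    from urllib import request, robotparser
    from urllib.parse import urljoin, urlparse

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=50):
        host = urlparse(seed).netloc
        robots = robotparser.RobotFileParser()
        robots.set_url(urljoin(seed, "/robots.txt"))
        robots.read()                              # fetch and parse robots.txt once
        queue, seen, index = [seed], set(), {}
        while queue and len(index) < max_pages:
            url = queue.pop(0)
            if url in seen or not robots.can_fetch("*", url):
                continue                           # skip visited or disallowed URLs
            seen.add(url)
            try:
                with request.urlopen(url) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue                           # unreachable page - move on
            index[url] = page                      # a real indexer would tokenise here
            extractor = LinkExtractor()
            extractor.feed(page)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host:
                    queue.append(absolute)         # stay on the same server
        return index

    # Example (hypothetical server name):
    # pages = crawl("http://www.example.ac.uk/", max_pages=100)

The point to notice is that only pages reachable by links from the seed are indexed, and the robots.txt check runs before every fetch - which is why the controls described below matter.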
There will be some directories that you do not want your indexer to look at and index. When you are using an indexer such as older versions of SWISH or Harvest, you have to add specific controls telling the indexer where to go and where not to go. When using a spider- or robot-based indexer, indexing is controlled through a number of means that will be observed by the Internet indexers such as AltaVista, Go and HotBot, as well as by your local indexer. Obviously, if you can satisfy all your indexing requirements with one set of controls, it will save you work in the long run. These controls are:
· the robots.txt file
· description and keywords in metadata tags of individual files
· robots metadata tag giving noindex and nofollow information (and combinations) in individual files
These controls are observed to a greater or lesser degree by search engines at large. Search Engine Watch will give you all the information you need about this (http://www.searchenginewatch.com/webmasters/features.html). All 'proper' search engines will observe a robots.txt file and do what it says, and almost all will observe the robots metadata tag. You cannot depend upon the description and keywords metadata tags being used - Google, NorthernLight, Lycos and Excite ignore them - but if they work for your local search facility and make it more valuable, they are worth pursuing. No support for Dublin Core metadata should be assumed.
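As a minimal illustration, here is what these controls look like in practice; the paths, description and keywords are invented examples rather than recommendations.

    # robots.txt, placed at the root of the web server - read by well-behaved robots
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /staff-only/

    <!-- per-page controls, placed in the <head> of an individual HTML file -->
    <meta name="robots" content="noindex,nofollow">
    <meta name="description" content="Prospectus for the Department of Examples">
    <meta name="keywords" content="prospectus, admissions, open days">

A robot that honours these controls will fetch nothing under /cgi-bin/ or /staff-only/, and a page carrying the robots tag above will neither be indexed nor have its links followed, even though robots.txt allows it to be fetched.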
At another level, access to branches of a web server can be limited by the server software. Combining access control with the use of metadata can give full information to those within the access domain and some limited information to those outside.
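On an Apache server, for example, a branch of the document tree can be restricted to a local address range while the rest of the site remains world readable - a sketch assuming a hypothetical /internal branch and a campus network of 192.168.0.0/16:

    # httpd.conf fragment (Apache 1.3-style directives); paths and addresses are examples
    <Directory "/usr/local/apache/htdocs/internal">
        Order deny,allow
        Deny from all
        Allow from 192.168.
    </Directory>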
Finally, a useful lesson: if you don't want people to read files, they shouldn't be on the web server at all. Adding a new indexing facility should remind people to 'spring clean' their files and remove all the information that is no longer pertinent.
There have been security alerts in the past, most notably with the free version of Excite (January 1998), for which a patch is now available. Bear in mind potential security problems stemming from the underlying OS, particularly if you are running an indexing facility on a separate server whose OS is not the one you use most often. Windows NT is a particular minefield for the uninitiated.
First make a shopping list
It is essential that you start off the exercise with a clear idea of what you are looking for. Things to think about may include the following:
· Do I want to (and am I able to) run this on the web server, on a separate machine, or have someone else host it?
· What platform do I want to use (is there the expertise, or are there the facilities, for using a different platform)?
· How many servers do I want to index (a ballpark figure for the number of pages to be indexed is useful here too)?
· Is the data to be indexed subject to frequent change, and if so in part or as a whole?
· What manpower and/or money is available for the project?
· What types of file do I want indexed (just HTML, or also PDF, Office files, etc.)?
· What type of search facilities do I want to offer (keyword, phrase, natural language, constrained searches)?
The answers to some of these may be obvious but you may have to discuss others. Starting with a well-defined list will clarify where you may make compromises if need be.
Categories of search products
There are four broad types of search facility, suitable for different circumstances, as follows:
· free services hosted remotely
· products built into (or added onto) your web server software
· free search facilities
· commercial search facilities
Free services hosted remotely
These services may limit the number of pages indexed (500, 5,000 or unlimited in the examples below), and indexes will probably be deleted if they are not used for between 5 and 7 days. Access to the indexing is prey to Internet traffic and server availability (note also that these services are all in the US, so they may generate incoming transatlantic traffic for UK users). You may get advertising appearing on your search results page. They may be a stop-gap solution for small UK HE institutions. See:
· Atomz search (http://www.atomz.com/) chargeable on indexes above 500 entries or above 5,000 requests/month
· FreeFind (http://www.freefind.com/)
· Pinpoint (http://pinpoint.netcreations.com/)
· Thunderstone (http://index.thunderstone.com/texis/indexsite/)
· Tippecanoe (http://www.tippecanoe.com/) from August 1999
· Whatuseek IntraSearch (http://www.whatuseek.com/)
Products built into (or added onto) your web server software
Several types of server software come with a built-in search facility, so before you go any further it is worth double-checking, especially if you are using a Microsoft or Netscape server, WebStar version 3 or higher, WebTen 2.1.10 (both Macintosh - WebTen now comes with ht://dig) or WebSite Pro (Windows NT). In addition, more sophisticated add-on search facilities are available for the Microsoft and Netscape server products.
Free search facilities
Some of the products listed below will index either a single server or a group of servers; those known to index a group are marked with **. If a product has no asterisks this does not necessarily mean that multiple servers can't be indexed - the available information simply might not mention it.
Search engine | Version | Platforms | Memory and disk space | Searchable document formats | Notes
Alkaline | 1.3 July 99 | Linux (Intel/Alpha), FreeBSD, SGI Irix, Solaris, BSDI, BSD/OS, Win NT | | HTML, ASCII; filters for PDF; user-defined filters may be made | Free to non-commercial sites
Excite For Web Servers (EWS) http://www.excite.com/navigate/ | Oct 96; patched version is 1.1.1 | Solaris 2.4 or x86, SGI Irix 5.3, HP-UX 9.05, IBM AIX 3.2 (v1.0 only), BSDI 2.0, Linux, SunOS | | HTML, ASCII | Known security bug needing patches dated 14 January 1998. NT version n/a Aug 99
**freeWAIS-sf http://ls6-www.cs.uni-dortmund.de/ir/projects/freeWAIS-sf/ | 2.2.12 April 99 | Only tested on SunOS 5.6 and Linux | Not known | HTML, ASCII | Not supported, but newsgroup information usually good. SF-Gate provides a web interface (http://ls6-www.cs.uni-dortmund.de/ir/projects/SFgate/)
Glimpse & Webglimpse http://webglimpse.net/ | 4.12.5 and 1.7.5 respectively, July 1999 | UNIX of various sorts, with more coming | | HTML, ASCII | New effort (March 99) not connected with original developers. Free for non-commercial use. Webglimpse is the spider, Glimpse the indexing software.
**Harvest http://www.tardis.ed.ac.uk/harvest/ | 1.5 1997; last patch produced June 1998 | UNIX | | a wide variety | Original development ended but co-operative development of a newer version is ongoing. Works with local gatherers feeding a central broker rather than a spider model. New version no longer uses Glimpse for indexing.
**ht://Dig | 3.1.2 March 1999 | Sun Solaris 2.5 SPARC (using gcc/g++ 2.7.2), Sun SunOS 4.1.4 SPARC (using gcc/gcc 2.7.0), HP/UX A.09.01 (using gcc/g++ 2.6.0), IRIX 5.3 (SGI C++ compiler, unknown version), Debian Linux 2.0 (using egcs 1.1b) | disk space approx 12KB per document for the wordlist database, 7.5KB without the wordlist | HTML, ASCII | You will need a Unix machine, a C compiler and a C++ compiler, with libstdc++ installed and Berkeley 'make'. Will index multiple servers understanding the HTTP 1.0 protocol. Developer site at http://dev.htdig.org/
ICE http://www.objectweaver.de/ice/ | 1.5b3r1 Sept 1998; new release summer 99 | Anything running Perl | | | Requires Perl and runs as a CGI gateway. Email support from author.
Isearch http://www.cnidr.org/ (no longer maintained) | v1.42 available Aug 98 | Unix machines from Linux PCs to Crays | | wide range of document types, with facilities to add new types | No support; development is through a second website, with an active mailing list. See also the Advanced Search Facility project, supplying a resource location system free of charge: http://www.etymon.com/asf/
Lycos/Inmagic Site Spider http://www.lycos.com/software/software-intranet.html | | Windows NT | | | Free of charge. Commercially supported version available from Inmagic (see below)
SWISH++ http://www.best.com/~pjl/software.html | 3.0 July 1999 | UNIX, Windows NT | disk space approx 1-5% of the size of the HTML data | ASCII, HTML; Office files extracted and PDF files filtered before indexing | Needs Unix supporting the mmap(2) system call, a C++ compiler, and a version of STL (Standard Template Library). Unsupported, but information available on discussion lists and newsgroups
SWISH-E (SWISH Enhanced) http://sunsite.berkeley.edu/SWISH-E/ | 1.3.1 or 1.3.2, depending on platform, Jan 1999 | UNIX, Linux (Intel), Windows (all 32-bit varieties) | disk space approx 1-5% of the size of the HTML data | ASCII, HTML | Information and compiled versions available from http://www.geocities.com/CapeCanaveral/Lab/1652/software.html#swishe. Support via discussion lists, newsgroups and the website
Thunderstone Webinator http://www.thunderstone.com/webinator/ | 2.5 | Lots of Unix flavours, Linux (Intel), Windows NT x86 | | ASCII, HTML; other formats in commercial versions | V2.5 incompatible with indexes of previous versions. Technical support via a mailing list. The free version is limited to 10,000 pages per index
Tippecanoe Systems http://www.tippecanoe.com/ | - | - | - | - | New version due late 99. Offering a free service from summer 99
Webglimpse - see Glimpse, above | - | - | - | - | -
Commercial products (not a complete list)
All of these products will cost real money but many will negotiate a price, so do not be put off from asking about prices or immediately write off using a commercial product. The money spent may well be saved by staff having no development work to do and having access to ready technical support. Many of these products have a limited-time trial version for you to assess before you commit yourself to buying, but you may have to pre-register with them to get access to trial software. Information on web sites varies enormously, but check the basic facts there before you go any further with assessment.
Commercial products are marketed primarily to companies, not to academic institutions, and information about them reflects this. It may not be readily apparent how or if the software will work in your particular environment until you investigate, particularly if you are seeking to index a group of independent servers that are not on an intranet, or are wishing to produce indexes of several subgroups of information.
Some of these products will support metadata but the information is not readily available so no information about metadata has been recorded. Support of Dublin core metadata is almost non-existent.
Search engine | Version/Price | Platforms | Searchable document formats | Notes
ALISE http://www.alise.com/ | 2.0; starting at US$2000 | Visual Basic | |
AltaVista Intranet Search and Developers Kit http://www.altavista-software.com/ | Price scales on size of index. Academic discount. | Alpha NT, Windows NT, Digital UNIX, Sun Solaris | Over 200 |
Excalibur http://www.xrs.com/ | 6.7 | Windows/NT, Sun/Solaris, IBM/AIX, Hewlett-Packard/HP-UX, Silicon Graphics/IRIX, Digital Alpha/UNIX, Digital Alpha/NT | Over 200 | Spider and WebExpress products may be suitable
FastSearch http://www.fast.no/ | April 99 | Solaris; Intel: NT, Linux, BSD (FreeBSD), Solaris; Alpha: Digital Unix, NT | | Comes with a hardware option for generating and searching extremely large indexes. Demos available
InQuizit http://www.inquizit.com/ | | Windows 95, NT, UNIX | |
**Infoseek Ultraseek Server http://software.infoseek.com/ | 3.1 (Aug 99); large discount for academic use | Sun Solaris 2.5 and above, Linux, Windows NT 4.0 and above | Many | Numerous awards
**Inmagic/Lycos Site Spider http://www.inmagic.com/ | | Windows NT | Wide range |
InText http://intext.com/ | | Windows NT, UNIX | Wide range |
Limit Point http://www.limit-point.com/ | Boolean Search 2.2, summer 99, US$297 | Macintosh | |
Maxum Phantom http://wwww.maxum.com/ | 2.2 July 99; education price US$296.25 | Windows NT, Macintosh | Wide range | Full documentation, FAQ and mailing list support. Email support as well.
Mondosearch http://www.mondosearch.com/ | 3.31 | Windows NT | | Indexes frames
Muscat http://www.muscat.com/ | | | Many | Almost no technical information
Open Text http://wwww.opentext.com/ | | | | A knowledge management system rather than an indexing product
Oracle http://www.oracle.com/ | | | | Indexer for sites generated by an Oracle database
Quadralay WebWorks Search http://www.quadralay.com/ | | | | Not currently available Aug 99
PCDocs/Fulcrum SearchServer http://www.pcdocs.com/ | | Windows NT | Over 200 | Supports Korean and Asian languages and Java
Site Surfer http://www.devtech.com/SiteSurfer/ | 1.0 Feb 99, about US$250 | Any with version 1.1.5 or higher of a Java runtime to build the applet | Many | Java applet 1.1 or later, so will only work with Java-enabled browsers. Will also give site maps and indexes
Thunderstone Webinator http://www.thunderstone.com/webinator/ | 2.5 | Solaris SPARC, Linux Intel, SGI Irix 5/6, Unixware, Solaris x86, BSDI 4.0, SGI Irix 4, AT&T SVR4 386, SunOS 4, SCO 5, DEC Alpha Unix 4, HP-UX 10, Windows NT x86, IBM AIX 4.2; other OSs may be available on demand | Two commercial versions give different file support - see web information. PDF plug-in available at extra cost. | Technical support via a mailing list. The free version is limited to 10,000 pages per index
Verity Information Server http://www.verity.com/ | Very expensive! | Windows NT, major UNIX systems | Any document format including databases |
WiseBot http://www.tetranetsoftware.com | 2.0 | Windows 95/98/NT | | Java search engine. Free trial version.
Search engine software that is available free of charge is generally either a cut-down version of a commercial product that is limited to producing a small index (Lycos Site Spider), or a product that might require quite advanced expertise to set up correctly and keep running smoothly (there are, of course, exceptions). Maintenance is a problem area - for a server manager to install and configure a search engine only to find its development discontinued, or the product turned into a commercial one, is a blow. Many free products are for Unix platforms, since this is where the expertise and enthusiasm for free software lies. Excite generated much interest from the less technical managers of Unix systems, but it became apparent that it was not well maintained (a security hole was identified later, and it has not been updated since) and that the support offered was on a commercial basis.
The Perl-based search engines suffer from the disadvantage that the whole index needs to be loaded before a search can be done, and these products might have a limited life once more engines written in Java are available. Java-based search engines have the problem that users must be running Java-enabled browsers to use them, and many users prefer to disable Java because of security problems. Several other Perl- and Java-based search engines are available besides those listed here - see the list at http://www.searchtools.com/tools/tools.html
While they require some technical expertise, SWISH-E and ht://dig do accomplish the job with no direct cost and little day-to-day intervention. SWISH has diverged into two versions, SWISH-E and SWISH++, and both are (at present) being actively updated.
The only way to find commercial products that really are suitable for your needs is to pay close attention to making your 'shopping list', investigate the available information about the products' capabilities, then talk to the local contact. We were seeking a product that was well supported, had a good interface, ran under Unix, was essentially self-managing, and could index a large number of diverse web servers. The product that appeared most suitable was Ultraseek. We were able to download a trial version (restricted to one month's use) and use it to confirm its suitability before buying a licence for the product. I would suggest that if you cannot use a product on a trial basis first, you shouldn't buy it.
If your shopping list does not match ours then the final choice will probably not be the same.
With these figures in mind, here are brief case studies of sites using three different search engines, all of which are more or less satisfied with their performance (the current maintainer of the Oxford search engine is not the person who set it up, and finds a certain amount of obscurity in its settings).
Microsoft
Platform: Windows NT
Number of servers searched: 16
Number of entries: approx 11,500
File types indexed: Office files, html and txt. Filters available for other formats
Index updating: Configured with the Windows task scheduler. Incremental updates possible.
Constrained searches possible: Yes
Configuration: follows robots.txt but can take a 'back door' route as well. Obeys robots meta tag
Logs and reports: Creates reports on crawling progress. Log analysis not included but can be written as add-ons (asp scripts)
Pros: Free of charge with Windows NT.
Cons: Needs high level of Windows NT expertise to set up and run it effectively. May run into problems indexing servers running diverse server software. Not compatible with Microsoft Index server (a single server product). Likes to create several catalog files, which may create network problems when indexing many servers.
ht://Dig - University of Oxford
Platform: Unix
Number of servers searched: 131
Number of entries: approx 43,500 (indexing a maximum of 9 levels down on any server)
File types indexed: Office files, html and txt. Filters available for other formats
Index updating: Configured to reindex after a set time period. Incremental updates possible.
Constrained searches possible: Yes but need to be configured on the ht://dig server
Configuration: follows robots.txt but can take a 'back door' route as well.
Logs and reports: none generated in an obvious manner, but probably available somehow.
Pros: Free of charge. Wide number of configuration options available.
Cons: Needs high level of Unix expertise to set up and run it effectively. Index files are very large.
Ultraseek
Platform: Unix
Number of servers searched: 232
Number of entries: approx 188,000
File types indexed: Many formats, including PDF, html and txt.
Index updating: Intelligent incremental reindexing dependent on the frequency of updates of the files - can be limited to time periods and/or days of the week. Manual incremental updates easily done.
Constrained searches possible: Yes - easily configured by users, and can be added to the configuration as a known constrained search, thereby taking a shortcut in processing.
Configuration: follows robots.txt and meta tags. Configurable weighting given to terms in title and meta tags. Thesaurus add-on available to give user-controlled alternatives to search terms entered (especially suitable for obscure local names)
Logs and reports: Logs and reports available for every aspect of use - search terms, number of terms, servers searched, etc.
Pros: Very easy to install and maintain. Gives extremely good results in a problematic environment. Technical support excellent.
Cons: Relatively expensive.
If you have a search solution that is more than about 18 months old, then it is time to review it. How often does it stop working, how long does it take to fix, and have you tested it for accuracy recently (is it doing what you think it's doing, and how often is the index being updated)? New products and new versions of existing products have made facilities available that you are probably not using currently. Look at how you could use searching to improve your site - offering constrained searches over certain bodies of information could help users find appropriate information immediately. Moving to a spider- or robot-based indexer could change the way you run your website and how it is indexed by the major external search engines, and make your index more reliable for users.
For further information see:
BotSpot http://www.botspot.com/
Search Engine Watch http://www.searchenginewatch.com/
Search Tools http://www.searchtools.com/tools/tools.html
Web Compare http://webcompare.internet.com/
Web Developers Virtual Library http://WWW.Stars.com/