It is a useful lesson that if you don't want people to read files, they should not be on the web server at all. Adding a new indexing facility is a good prompt to 'spring clean' your files and remove any information that is no longer pertinent.
There will be some directories that you do not want your indexer to look at and index. With a spider- or robot-based indexer, indexing can be controlled by a number of standard means, and these are observed by Internet indexers such as AllTheWeb, Google, and HotBot as well as by your local indexer. Obviously, if one set of controls can satisfy all your indexing requirements at once, it will save you work in the long run.
These controls are:

- the robots.txt file at the top of the server, which all 'proper' search engines will fetch and obey;
- the robots meta tag in individual pages, which they will likewise observe (see http://www.searchenginewatch.com/webmasters/article.php/2167891).

Both are sketched below.
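As an illustration (the directory names below are hypothetical, not a recommendation), a robots.txt file placed at the top of the server might look like this:

    # Applies to all well-behaved robots
    User-agent: *
    # Keep these branches out of all indexes
    Disallow: /private/
    Disallow: /drafts/

An individual page can instead opt out through a robots meta tag in its <head>:

    <!-- Ask robots not to index this page or follow its links -->
    <meta name="robots" content="noindex, nofollow">

Remember that both mechanisms are requests, not enforcement: a badly behaved robot is free to ignore them.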
At another level, access to branches of a web server can be limited by the server software itself. Combining such access control with metadata lets you give full information to those within the access domain and only limited information to those outside.
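A minimal sketch of that server-side control, assuming the Apache HTTP Server with 2.2-style directives (the directory path and domain are illustrative only):

    # Serve this branch only to clients in the local domain
    <Directory "/var/www/internal">
        Order deny,allow
        Deny from all
        Allow from .cam.ac.uk
    </Directory>

Requests from outside that domain then receive a 403 (Forbidden) response; other servers provide equivalent mechanisms.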