Parasite: Mining Structural Information on the Web
Based on heuristic assumptions including:
Hypertext Linking
- A linked page is likely to be on the same topic as the page that links to it (especially for Yahoo-style directory resources)
Directory Structure
- A page whose URL lies in a directory below a personal home page (PHP) is likely to be authored by the person identified in the PHP
Page Structure
- Links "near" each other on a page are likely to have similar topics
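
As a rough illustration of how two of these heuristics might be applied, here is a minimal Python sketch. It is not the Parasite implementation. The function names, the example URLs, and the `window` parameter are hypothetical. "Below a personal home page" is assumed to mean same host plus a path under the home page's directory. "Near" is assumed to mean within a few positions in the page's ordered list of links, since the source does not define it.

```python
from urllib.parse import urlsplit


def is_below_home_page(url, home_page_url):
    """Directory-structure heuristic (assumed form): True if `url` is on the
    same host as the home page and its path lies in or under the home page's
    directory. Assumes the home page URL ends in '/' or in a file name."""
    u, h = urlsplit(url), urlsplit(home_page_url)
    if u.netloc.lower() != h.netloc.lower():
        return False
    # Keep everything up to and including the last '/', dropping any file name.
    home_dir = h.path[: h.path.rfind("/") + 1] or "/"
    return u.path.startswith(home_dir)


def nearby_link_pairs(links, window=2):
    """Page-structure heuristic (assumed form): pair each link with the next
    `window` links in page order, treating those pairs as topically similar."""
    pairs = []
    for i, first in enumerate(links):
        for second in links[i + 1 : i + 1 + window]:
            pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    home = "http://cs.example.edu/~smith/index.html"
    print(is_below_home_page("http://cs.example.edu/~smith/papers/web.html", home))
    print(nearby_link_pairs(["a.html", "b.html", "c.html"], window=1))
```

Both functions only propose candidates. Each heuristic is a likelihood, not a guarantee, so the results would normally be combined with other evidence before being trusted.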