Metadata: an overview of current resource description practice Work Package 3 of Telematics for Research project DESIRE (no. 1004)
The Summary Object Interchange Format (SOIF) was designed as part of the Harvest Architecture developed at the University of Colorado at Boulder. It is documented in the Harvest User Manual.
Records in SOIF format are designed to be generated by Harvest gatherers and then used for user searches by Harvest brokers <URL: http://harvest.cs.colorado.edu/>. They provide a summary of the resources that a Harvest gatherer has found. The Harvest distribution contains a number of stock gatherer programs that can generate SOIF summaries from plain text, SGML (including HTML), PostScript, MIF and RTF formats.
In March 1996, Netscape Communications announced that they were also going to use SOIF records in their catalog server product and a number of other search engine manufacturers are said to be looking at supporting it. The SOIF templates can be generated by hand by archive maintainers or authors.
The vast majority of SOIF templates in use today are generated automatically by robots acting as Harvest gatherers. The format is a simple attribute-value based record and there is only a relatively small number of common SOIF attribute names, so it is easy to create SOIF records by hand if desired. As each Harvest Broker can support any attributes that are required by the data it provides access to, it is possible for other attributes outside of the common set to be used in local systems.
SOIF is really an internal record format of Harvest and related systems and has not been placed on any formal standards track. At the moment it is just a de facto standard.
SOIF templates are designed for a very specific purpose (summarising indexed resources) but the basic format is capable of being locally extended to handle other tasks if needed. However there does not appear to be any concept of nesting of elements in the SOIF format.
SOIF is based on simple attribute-value pair elements. A single SOIF stream can contain multiple SOIF templates, each of which has a URL for the resource that it refers to and a number of different elements for holding the other metadata. Each element has an attribute name, the length of the value in curly brackets, a colon delimiter and then the value itself.
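The element layout described above can be sketched in a few lines of Python. This is only an illustration, not the Harvest implementation: the `@FILE` template type, the tab after the colon, the closing brace and the example URL are assumptions that go beyond what this text states, and the function names `soif_element` and `soif_template` are hypothetical.

```python
def soif_element(name: str, value: bytes) -> bytes:
    """Encode one element as name{length}:<TAB>value<NEWLINE>.

    The length is the number of bytes in the value. The tab after the
    colon is an assumption about the Harvest layout.
    """
    return name.encode("ascii") + b"{%d}:\t" % len(value) + value + b"\n"


def soif_template(url: str, elements: list[tuple[str, bytes]],
                  template_type: str = "FILE") -> bytes:
    """Encode one template: a header line carrying the URL, then the elements."""
    header = f"@{template_type} {{ {url}\n".encode("ascii")
    body = b"".join(soif_element(name, value) for name, value in elements)
    return header + body + b"}\n"


record = soif_template(
    "http://www.example.org/report.html",  # hypothetical resource URL
    [("Title", b"Annual report"), ("Author", b"A. N. Other")],
)
print(record.decode("ascii"))
```

Several templates produced this way can simply be concatenated to form a single SOIF stream.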
The basic descriptive (bibliographic) elements in a SOIF template are:
Abstract
Author
Description
Keyword
Title
The common SOIF element set does not contain any subject description elements in the traditional library sense. It does however have a Type attribute name that describes what sort of resource the SOIF template refers to. The example types given in the Harvest User Manual are:
Archive
Audio
Awk
Backup
Binary
C
CHeader
Command
Compressed
CompressedTar
Configuration
Data
Directory
DotFile
Dvi
FAQ
FYI
Font
FormattedText
GDBM
GNUCompressed
GNUCompressedTar
HTML
Image
Internet-Draft
MacCompressed
Makefile
ManPage
Object
OtherCode
PCCompressed
Patch
Perl
PostScript
RCS
README
RFC
SCCS
ShellArchive
Tar
Tcl
Tex
Text
Troff
Uuencoded
WaisSource
Most of these types are related to different types of computer file formats and languages, which reflects SOIF's intended use in indexing network accessible objects.
There is a URL at the top of every template, which is the URL of the resource that the SOIF template summarises. There is also a URL-References attribute that can be used to hold any URL references that are present within HTML objects being summarised. Lastly, the contact information for the gatherer that generated the SOIF template is provided for in four additional attributes. The example SOIF template mentioned in the Harvest User Manual also has separate Site, File and Path elements to allow replicated copies of the object to be located.
The Type attribute detailed above tells us something about the type of the resource being summarised. There is a File-Size attribute that tells us how many bytes are in the summarised object. SOIF also allows the actual object to be embedded within the template using the Full-Text attribute. The example SOIF template mentioned in the Harvest User Manual also includes a Required element that specifies hardware and software requirements.
Note that as the SOIF format includes the length of each value after the attribute name, it is possible to embed any binary object in the template if desired (although that means that it may not be possible to edit it by hand).
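To illustrate why the explicit length makes binary embedding safe, the sketch below reads each value by its byte count rather than by scanning for a line ending. It is a minimal reader under stated assumptions (a tab after the colon, a newline after each value, a closing brace ending the template); `parse_elements` and `ELEMENT_HEAD` are hypothetical names, not part of Harvest.

```python
import re

# Matches the start of an element: name{length}:<TAB>
ELEMENT_HEAD = re.compile(rb"([^{}\s]+)\{(\d+)\}:\t")


def parse_elements(body: bytes) -> dict[str, bytes]:
    """Read elements from a template body, taking each value by byte count."""
    elements = {}
    pos = 0
    while pos < len(body):
        head = ELEMENT_HEAD.match(body, pos)
        if head is None:
            break  # closing brace, or anything else that is not an element
        size = int(head.group(2))
        start = head.end()
        elements[head.group(1).decode("ascii")] = body[start:start + size]
        pos = start + size + 1  # skip the newline that follows the value
    return elements


# A binary value that contains a newline and text that looks like an element.
payload = b"\x00\x01\nTitle{3}:\tbad\n\xff"
body = b"Title{4}:\tDemo\nFull-Text{%d}:\t" % len(payload) + payload + b"\n}\n"
parsed = parse_elements(body)
print(parsed["Title"], len(parsed["Full-Text"]))
```

Because the reader jumps over exactly the declared number of bytes, the element-like text inside the embedded payload is never interpreted as a real element.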
The common SOIF element set provides no fields for this purpose. However the example SOIF template mentioned in the Harvest User Manual includes a MaintEmail element that is the email address of the maintainer of the object.
The common SOIF element set provides the following elements to contain information concerned with the administration of the template:
Gatherer-Host
Gatherer-Name
Gatherer-Port
Gatherer-Version
Last-Modification-Time (of the object)
MD5 (checksum of the object)
Refresh-Rate (how many seconds after the Update-Time before the SOIF template should be regenerated; default of 1 month)
Time-to-Live (how many seconds after the Update-Time the SOIF template is still valid for; default 6 months)
Update-Time (the time that this SOIF template was last updated; this is a required element and has no default)
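The relationship between Update-Time, Refresh-Rate and Time-to-Live can be sketched as follows. This is an illustration only: it assumes that Update-Time is a count of seconds since the Unix epoch and that a "month" is 30 days, neither of which is stated here, and the status labels and the function name `template_status` are hypothetical.

```python
MONTH = 30 * 24 * 60 * 60  # assumption: a "month" is taken to be 30 days


def template_status(elements: dict[str, str], now: int) -> str:
    """Classify a template as 'fresh', 'stale' or 'expired' at time `now`."""
    update_time = int(elements["Update-Time"])  # required, has no default
    refresh_rate = int(elements.get("Refresh-Rate", MONTH))
    time_to_live = int(elements.get("Time-to-Live", 6 * MONTH))
    age = now - update_time
    if age > time_to_live:
        return "expired"  # no longer valid
    if age > refresh_rate:
        return "stale"    # still valid, but due for regeneration
    return "fresh"


template = {"Update-Time": "1000000"}
print(template_status(template, now=1000000 + 2 * MONTH))
```

With only Update-Time present, the defaults apply, so a template that is two months old is due for regeneration but has not yet expired.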
Other SOIF elements that are not mentioned in the common element set but which are in day-to-day use for holding administrative metadata are:
CheckedEmail (the email of the person who checked the SOIF template if hand generated)
EnteredBy (the name of the person who entered the template)
Entered (the date that the template was entered into the database by hand)
SOIF's information on the source of the data is held in the four gatherer information attributes detailed above and also the URL of the resource that the template summarises. Some templates also include a Version element that gives the version of the resource that the SOIF template summarises.
The common SOIF element set provides no fields for this purpose. However the example SOIF template mentioned in the Harvest User Manual has a CopyPolicy element that specifies the copyright and access policy of the resource.
The contents of the common SOIF element set are described in the Harvest User Manual.
SOIF templates are in a simple attribute-value pair format. The length of the value is explicitly represented in each element, allowing binary objects to be embedded in the template.
The physical transfer of the SOIF record between the gatherer and the broker (or the broker and another broker) in Harvest is often a simple byte stream containing the raw SOIF template.
There is no specific multi-lingual support in SOIF.
The URL-References attribute allows links that are embedded in summarised HTML objects (and any other objects that can contain URLs, such as VRML files) to be held separately in the template. There is no inter-template linking mechanism in the common SOIF element set.
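As a rough illustration of how a gatherer might fill URL-References for an HTML object, the sketch below collects the `href` values of anchor tags using Python's standard `html.parser` module. It is not the Harvest summariser: joining the links with newlines is an assumption, and `LinkCollector` and `url_references` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def url_references(html: str) -> str:
    """Return the links found in `html`, one per line (an assumed layout)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.links)


page = '<p><a href="http://a.example/">A</a> <a name="x">x</a> <a href="b.html">B</a></p>'
print(url_references(page))
```

The resulting string could then be stored as the value of a URL-References element alongside the other metadata.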
The SOIF common element set is very simple and was designed for a specific purpose (summarising gathered resources), so it has a fairly low fullness.
The SOIF templates can be carried over any transport protocol that supports a suitable application protocol.
Databases of SOIF templates can be searched via a variety of mechanisms. The most common is a broker CGI script that can be accessed via a normal WWW browser. Other brokers use WAIS front ends. It would be possible to produce a Z39.50 front end, though it is not known if this has been done.
The original implementation of SOIF templates was in the Harvest system, which is still freely available. Netscape Communications is using it in its Catalog Server product and other commercial indexing and search engine vendors are believed to be looking at supporting it.