A review of metadata: a survey of current resource description formats
Work Package 3 of Telematics for Research project DESIRE (RE 1004)
The Summary Object Interchange Format (SOIF) was designed as part of the Harvest Architecture developed at the University of Colorado at Boulder. It is documented in Appendix B of the Harvest User Manual <URL:http://harvest.transarc.com/afs/transarc.com/public/trg/Harvest/user-manual/node151.html>.
Records in SOIF are designed to be generated by Harvest gatherers and then used for user searches by Harvest brokers <URL:http://harvest.cs.colorado.edu/>. They provide a summary of the resources that a Harvest gatherer has found. The Harvest distribution contains a number of stock gatherer programs that can generate SOIF summaries from plain text, SGML (including HTML), PostScript, MIF and RTF formats.
In March 1996, Netscape Communications announced that it would also use SOIF in its Catalog Server product, and a number of other search engine manufacturers are said to be looking at supporting it. SOIF records can also be generated by hand by archive maintainers or authors.
The vast majority of SOIF records in use today are generated automatically by robots acting as Harvest gatherers. The format is a simple attribute-value based record with only a relatively small number of common SOIF attribute names, so it is easy to create SOIF records by hand if desired. As each Harvest broker can support any attributes that are required by the data it provides access to, attributes outside the common set can be used in local systems. SOIF does not mandate any particular attributes; within the Harvest software it is possible to configure a customised record format (or template) which will be used within that particular implementation.
SOIF is really an internal record format of Harvest and related systems and has not been placed on any formal standards track. At the moment it is just a de facto standard.
SOIF 'templates' are designed for a very specific purpose (summarising indexed resources), but the basic format can be extended locally to handle other tasks if needed. However, there does not appear to be any concept of nesting of elements in the SOIF format.
SOIF is based on simple attribute-value pair elements. A single SOIF stream can contain multiple SOIF 'templates', each of which has a URL for the resource that it refers to and a number of different elements for holding the other metadata. Each element has an attribute name, the length of the value in curly brackets, a colon delimiter and then the value itself.
The basic descriptive (bibliographic) attributes in SOIF are:
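The element layout described above can be illustrated with a minimal Python sketch that writes one template. The concrete punctuation used here is an assumption based on the grammar in the Harvest User Manual as best recalled: a `@TYPE { URL` header line, elements of the form `Name{size}:<tab>value`, and a closing `}`. The function name `soif_template`, the default template type `FILE` and the example URL are hypothetical and chosen only for illustration.

```python
def soif_template(url, attributes, template_type="FILE"):
    """Serialise one SOIF template to bytes (illustrative sketch only).

    Assumed layout: '@TYPE { URL', then 'Name{size}:<TAB>value' per
    element, then a closing '}'. The size is the value length in bytes.
    """
    parts = [f"@{template_type} {{ {url}\n".encode("utf-8")]
    for name, value in attributes.items():
        # Accept either text or raw bytes; the length is counted in bytes.
        data = value if isinstance(value, bytes) else str(value).encode("utf-8")
        parts.append(f"{name}{{{len(data)}}}:\t".encode("utf-8") + data + b"\n")
    parts.append(b"}\n")
    return b"".join(parts)


if __name__ == "__main__":
    record = soif_template(
        "http://www.example.org/report.html",  # hypothetical resource
        {"Title": "A review of metadata", "Type": "HTML"},
    )
    print(record.decode("utf-8"))
```

Because several templates can share one stream, the output of repeated calls could simply be concatenated.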
· Abstract
· Author
· Description
· Keyword
· Title
The common SOIF element set does not contain any subject description elements in the traditional library sense. It does however have a Type attribute name that describes what sort of resource the SOIF record refers to. The example types given in the Harvest User Manual are:
· Archive
· Audio
· Awk
· Backup
· Binary
· C
· CHeader
· Command
· Compressed
· CompressedTar
· Configuration
· Data
· Directory
· DotFile
· Dvi
· FAQ
· FYI
· Font
· FormattedText
· GDBM
· GNUCompressed
· GNUCompressedTar
· HTML
· Image
· Internet-Draft
· MacCompressed
· Makefile
· ManPage
· Object
· OtherCode
· PCCompressed
· Patch
· Perl
· PostScript
· RCS
· README
· RFC
· SCCS
· ShellArchive
· Tar
· Tcl
· Tex
· Text
· Troff
· Uuencoded
· WaisSource
Most of these types are related to different types of computer file formats and languages, which reflects SOIF's intended use in indexing network accessible objects.
There is a URL at the top of every template that is the URL of the resource to which the SOIF record relates. There is also a URL-References attribute that can be used to hold any URL references that are present within HTML objects being summarised. Lastly, the contact information for the gatherer that generated the SOIF record is also provided in four additional attributes. The example SOIF template mentioned in the Harvest User Manual also has separate Site, File and Path elements to allow replicated copies of the object to be located.
The Type attribute detailed above tells us something about the type of the resource being summarised. There is a File-Size attribute that tells us how many bytes are in the summarised object. SOIF also allows the actual object to be embedded within the template using the Full-Text attribute. The example SOIF template mentioned in the Harvest User Manual also includes a Required element that specifies hardware and software requirements.
Note that as the SOIF format includes the length of each value after the attribute name, it is possible to embed any binary object in the template if desired (although that means that it may not be possible to edit it by hand).
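To show why the explicit length matters, the following Python sketch reads a SOIF stream by byte count rather than by scanning for line ends, so a value containing newlines, a `}` or non-text bytes is recovered intact. The stream layout (`@TYPE { URL` header, `Name{size}:<tab>value` elements, closing `}`) is an assumption based on the Harvest User Manual grammar as best recalled, and `parse_soif` is a hypothetical helper, not part of Harvest.

```python
import re

# Assumed SOIF layout: '@TYPE { URL' header, 'Name{size}:<TAB>value' elements.
HEADER = re.compile(rb"@([A-Za-z0-9_-]+)[ \t]*\{[ \t]*(\S+)[ \t]*\n")
ATTR = re.compile(rb"([A-Za-z0-9_-]+)\{(\d+)\}:\t")


def parse_soif(stream):
    """Parse a byte stream holding zero or more SOIF templates."""
    templates, pos = [], 0
    while True:
        # Skip whitespace between templates.
        while pos < len(stream) and stream[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(stream):
            return templates
        header = HEADER.match(stream, pos)
        if header is None:
            raise ValueError(f"expected a template header at byte {pos}")
        template = {
            "type": header.group(1).decode("ascii"),
            "url": header.group(2).decode("utf-8"),
            "attributes": {},
        }
        pos = header.end()
        while not stream.startswith(b"}", pos):
            attr = ATTR.match(stream, pos)
            if attr is None:
                raise ValueError(f"expected an element at byte {pos}")
            size = int(attr.group(2))
            start = attr.end()
            # Take exactly 'size' bytes, whatever they contain.
            template["attributes"][attr.group(1).decode("ascii")] = stream[start:start + size]
            pos = start + size
            if stream.startswith(b"\n", pos):
                pos += 1  # newline that terminates the element
        pos += 1  # step over the closing brace
        templates.append(template)


if __name__ == "__main__":
    payload = b"line one\n}\x00\xff"  # newline, brace and binary bytes
    stream = b"".join([
        b"@FILE { http://www.example.org/data.bin\n",  # hypothetical URL
        b"Full-Text{" + str(len(payload)).encode("ascii") + b"}:\t" + payload + b"\n",
        b"}\n",
    ])
    parsed = parse_soif(stream)
    print(parsed[0]["attributes"]["Full-Text"] == payload)
```

Values are kept as bytes deliberately, since an embedded object need not be text in any particular encoding.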
The common SOIF element set provides no fields for this purpose. However the example SOIF template mentioned in the Harvest User Manual includes a MaintEmail element that is the email address of the maintainer of the object.
The common SOIF element set provides the following elements to contain information concerned with the administration of the template:
· Gatherer-Host
· Gatherer-Name
· Gatherer-Port
· Gatherer-Version
· Last-Modification-Time (of the object)
· MD5 (checksum of the object)
· Refresh-Rate (how many seconds after the Update-Time before the SOIF template should be regenerated; default of 1 month)
· Time-to-Live (how many seconds after the Update-Time the SOIF template is still valid for; default 6 months)
· Update-Time (the time that this SOIF template was last updated; this is a required element and has no default)
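The relationship between Update-Time, Refresh-Rate and Time-to-Live can be sketched as a small Python check. This is an illustration under stated assumptions, not Harvest's own logic: Update-Time is assumed to be a count of seconds since the Unix epoch, the defaults of "1 month" and "6 months" are approximated with 30-day months, and the function name `template_status` and its return strings are hypothetical.

```python
MONTH = 30 * 24 * 60 * 60  # assumption: one month taken as 30 days, in seconds


def template_status(attributes, now):
    """Classify a template as 'fresh', 'needs refresh' or 'expired'.

    'attributes' maps SOIF attribute names to values; 'now' is the current
    time in seconds (assumed to be on the same scale as Update-Time).
    """
    # Update-Time is required and has no default, so a KeyError is intended
    # if it is missing.
    update_time = int(attributes["Update-Time"])
    refresh_rate = int(attributes.get("Refresh-Rate", MONTH))
    time_to_live = int(attributes.get("Time-to-Live", 6 * MONTH))

    age = now - update_time
    if age > time_to_live:
        return "expired"        # no longer valid
    if age >= refresh_rate:
        return "needs refresh"  # still valid, but due for regeneration
    return "fresh"


if __name__ == "__main__":
    import time

    example = {"Update-Time": str(int(time.time()) - 45 * 24 * 60 * 60)}
    print(template_status(example, int(time.time())))
```

With the default values, a template updated 45 days ago falls between the refresh interval and the time to live, so it is still usable but due for regeneration.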
Other SOIF elements that are not mentioned in the common element set but which are in day-to-day use for holding administrative metadata are:
· CheckedEmail (the email of the person who checked the SOIF template if hand generated)
· EnteredBy (the name of the person who entered the template)
· Entered (the date that the template was entered into the database by hand)
SOIF's information on the source of the data is held in the four gatherer information attributes detailed above and also the URL of the resource that the template summarises. Some templates also include a Version element that gives the version of the resource that the SOIF template summarises.
The common SOIF element set provides no fields for this purpose. However the example SOIF template mentioned in the Harvest User Manual has a CopyPolicy element that specifies the copyright and access policy of the resource.
The contents of the common SOIF element set are described in Appendix B.2 (List of common SOIF attribute names) in the Harvest User Manual <URL:http://harvest.transarc.com/afs/transarc.com/public/trg/Harvest/user-manual/node153>
SOIF records are in a simple attribute-value pair format. The length of the value is explicitly represented in each element, allowing binary objects to be embedded in the template.
The physical transfer of the SOIF record between the gatherer and the broker (or the broker and another broker) in Harvest is often a simple byte stream containing the raw SOIF template.
There is no specific multi-lingual support in SOIF.
The URL-References attribute allows links that are embedded in summarised HTML objects (and any other objects that can contain URLs, such as VRML files) to be held separately in the template. There is no inter-template linking mechanism in the common SOIF element set.
The SOIF common element set is very simple and designed for a specific purpose (summarising gathered resources), so SOIF records have a fairly low fullness.
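As an illustration of how a value for URL-References might be produced, the Python sketch below collects the links in an HTML document using only the standard library. It is not the Harvest gatherer's implementation: the choice of `href` and `src` attributes, the resolution of relative links against a base URL, and the newline-separated output are all assumptions, and `LinkCollector` and `url_references` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href/src attribute values, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def url_references(html, base_url):
    """Return the document's links as one newline-separated string
    (the separator is an assumption made for this sketch)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return "\n".join(collector.links)


if __name__ == "__main__":
    page = '<a href="/b.html">b</a> <img src="c.gif">'
    print(url_references(page, "http://www.example.org/a/"))
```

The resulting string could then be stored as the value of a URL-References element alongside the other summary attributes.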
SOIF records can be carried over any transport protocol that supports a suitable application protocol.
Databases of SOIF records can be searched via a variety of mechanisms. The most common is a broker CGI script that can be accessed via a normal WWW browser. Other brokers use WAIS front ends. It would be possible to produce a Z39.50 front end, though it is not known if this has been done.
The original implementation of SOIF was in the Harvest system, which is still freely available. Netscape Communications is using it in its Catalog Server product, and other commercial indexing and search engine vendors are believed to be looking at supporting it.