Creating a Search from Scan Results

By Janifer Gatenby, Geac Computers

Version 2, July 25th 1999

The requirement

Scan returns results that consist of brief records [JCZ1] representing rows from an ordered list. The results can be presented to an end user, who can browse forward and optionally backwards, then select a line for further information or processing. When a line is selected, the system formats a search request in order to request an actual record [JCZ2]. Typical examples are scans (browses) on AUTHOR, SUBJECT and TITLE.

Database Models

The method of constructing a search from scan results needs to accommodate several database models. The possible models, referred to below as models 1 to 4 in the order listed, are:

· The scan index is derived from a database that is totally separate from the bibliographic database. Example: an authority database at LC and a bibliographic database somewhere else [JCZ3].

· The scan index is derived from a database that is separate from the bibliographic database but it contains pointers to the linked bibliographic records. The same data occurs in both databases. Example: an Integrated Library Management System with an authority database linking to a bibliographic database of full MARC records that contains authorised data, including authors and subjects.

· The scan index is derived from a database, e.g. an authority database, that is inter-linked with the bibliographic database: records in the bibliographic database contain links to the associated database and vice versa, with no repetition of data. Example: an Integrated Library Management System with an authority database linking to a bibliographic database. To construct a full record, it is necessary to integrate data from both databases.

· The scan index is derived from the bibliographic database. Example: a title index.

To do a follow-on search for the first model, where there are no database links, the origin uses the same USE attributes for the search as it used for the scan and uses the data from TERM as the search data.

This method may be applied to all models; however, where there are database links, it is not the optimal approach. The problem with using TERM for the follow-on search is that the resulting search may not be precise enough. There are a number of reasons for this. Firstly, the term may have been truncated and so lack significant words that are important for precision. Secondly, the target may not support position attributes such as FIRST IN FIELD or the structure attribute PHRASE, so the search is constructed in an imprecise way and can retrieve unexpected records even when a single, seemingly unique line has been selected from a SCAN [JCZ4]. This is exacerbated when the TERM itself has multiple occurrences, e.g. for a title that consists only of common words, such as "Psychology".

What is required is a means of using database links, where they exist, to improve the precision of the follow-on search.
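The TERM-based follow-on search described above can be sketched as follows. This is only an illustration, not part of the proposal: the class and function names are hypothetical, and the title scan (Bib-1 Use attribute 4) is an assumed example. It shows that the origin simply reuses the scan attributes with the returned TERM, so nothing in the request distinguishes one "Psychology" record from another.

```python
from dataclasses import dataclass

BIB1 = "1.2.840.10003.3.1"  # Bib-1 attribute set OID, as quoted in the examples


@dataclass(frozen=True)
class AttributesPlusTerm:
    """Hypothetical in-memory stand-in for an attributes-plus-term operand."""
    attribute_set: str
    attributes: tuple  # ((attributeType, attributeValue), ...)
    term: str


def follow_on_from_term(scan_attributes, scan_term):
    """Reuse the scan's attributes unchanged and search on the returned TERM."""
    return AttributesPlusTerm(BIB1, tuple(scan_attributes), scan_term)


# Assumed example: a title scan, Use attribute (type 1) = 4 (Title).
query = follow_on_from_term([(1, 4)], "PSYCHOLOGY")
print(query)
```

Because the query carries only the term and the original attributes, its precision depends entirely on how the target interprets them.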

Scan elements

Which data elements of the scan results can an origin use in order to construct a follow-on search? The scan results contain TERM, which represents the data that was matched against the scan attributes and is normally the data used by the target for sequencing the scan results. The scan results also include DISPLAY TERM, which gives the display version of the term, e.g. data in upper and lower case, including diacritical marks and initial articles. The other elements that could carry significant information are ALTERNATIVE TERM and OTHER TERM INFO.

When database models 2, 3 or 4 are in use, it is desirable to send some retrieval information in the SCAN results.

The Proposal

The proposal is to include this retrieval information in alternativeTerm [JCZ5] in the form of AttributesPlusTerm. The alternative terms given should relate to the term itself and its occurrences. This means that the "term" to be given in alternativeTerm does not refer to database records related to the term such as bibliographic occurrences of an authority record.

Examples:

1. An authority scan, e.g. author or subject, performed on an index of an authority database (auth.file) produces a scan entry with a term occurrence of 1. There are actually 3 bibliographic records associated with this authority record. In alternativeTerm of the scan response, there is one entry containing the identifier of the authority record (4544).

attributeSet 1.2.840.10003.3.1 (Bib1)

attributeType 1 (Use attribute)

attributeValue 12 (Local number)

term 4544 (local number of the term)

The target may define its own attribute values for internal numbers, particularly if it needs to distinguish between local bibliographic and authority numbers. As the target is able to supply these in the scan response, the origin does not need to know them in advance. Therefore internally defined attributes do not pose a problem for interoperability.

The authority file may be located in a separate database from the bibliographic file or it may be in the same database. For the purposes of retrieval, the authority file should be regarded as a separate database even where it is not. The origin needs to know the names of both databases.

Where an authority file is linked to a bibliographic file as per database models 2 and 3, it is possible to:

· Scan the authority file, then search the authority file, e.g. to retrieve a MARC authority record

· Scan the authority file, then search the bibliographic file, e.g. to retrieve the MARC bibliographic record or records associated with the authority record
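The two follow-on searches listed above might look like this from the origin's side. This is a minimal sketch under assumptions: `SearchRequest` and `follow_on_search` are hypothetical names, the database names are the ones used in the examples (auth.file and bib.file), and it is assumed that the target accepts the authority number as a search term against either database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributesPlusTerm:
    """Hypothetical stand-in for one alternativeTerm entry."""
    attribute_set: str
    attribute_type: int
    attribute_value: int
    term: str


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical, simplified search request: one database, one operand."""
    database: str
    query: AttributesPlusTerm


def follow_on_search(alt_term, database):
    """Send the alternativeTerm entry back unchanged against the chosen database."""
    return SearchRequest(database, alt_term)


# The alternativeTerm entry from Example 1: Bib-1, Use = 12 (Local number), term 4544.
alt = AttributesPlusTerm("1.2.840.10003.3.1", 1, 12, "4544")

auth_search = follow_on_search(alt, "auth.file")  # retrieve the MARC authority record
bib_search = follow_on_search(alt, "bib.file")    # retrieve the linked bibliographic records
print(auth_search)
print(bib_search)
```

The only thing the origin varies is the database name, which is why it needs to know the names of both databases.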

Alternative term is most often used to supply see references of authority records. In this case, the attribute set, attribute type and attribute value are usually equivalent to those used for the scan request. Under this proposal, the alternative term in the scan response may include a mixture of "see references" and internal reference numbers. The origin will need to distinguish between the two, e.g. so that it can display see references to the user while masking internal numbers or displaying them differently [JCZ6].

2. A title scan performed on an index of a bibliographic database (bib.file) produces a scan entry with a term occurrence of 5 [JCZ7]. In alternativeTerm, there are 5 entries containing identifiers of 5 bibliographic records (111, 222, 333, 444 and 555). The first two entries are shown here; the remaining three follow the same pattern.
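One possible way for an origin to separate the two kinds of alternative term is sketched below. The proposal does not define this rule; it is a heuristic that assumes see references carry the same attributes as the scan request, while internal numbers carry different ones. The function name and the see-reference heading are hypothetical illustration data.

```python
BIB1 = "1.2.840.10003.3.1"  # Bib-1 attribute set OID


def split_alternative_terms(scan_attributes, alternative_terms):
    """Heuristic: same attributes as the scan means see reference, otherwise retrieval key."""
    see_references, retrieval_keys = [], []
    for attributes, term in alternative_terms:
        if attributes == scan_attributes:
            see_references.append(term)                 # show to the user
        else:
            retrieval_keys.append((attributes, term))   # keep for the follow-on search
    return see_references, retrieval_keys


# Assumed author scan: Use attribute (type 1) = 1003 (Author).
scan_attributes = (BIB1, 1, 1003)
alternative_terms = [
    ((BIB1, 1, 1003), "Clemens, Samuel Langhorne"),  # hypothetical see reference
    ((BIB1, 1, 12), "4544"),                         # internal number, Use = 12
]

see_references, retrieval_keys = split_alternative_terms(scan_attributes, alternative_terms)
print(see_references)
print(retrieval_keys)
```

The heuristic breaks down if a see reference is returned with attributes that differ from the scan request, or if the scan itself was made on the local number index.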

attributeSet 1.2.840.10003.3.1 (Bib1)

attributeType 1 (Use attribute)

attributeValue 12 (Local number)

term 111 (local number of one occurrence of the term)

attributeSet 1.2.840.10003.3.1 (Bib1)

attributeType 1 (Use attribute)

attributeValue 12 (Local number)

term 222 (local number of one occurrence of the term)

Whether or not to include alternative terms in the scan response is entirely at the discretion of the target. The target may have an internally defined threshold, e.g. 3 or 5; if the number of occurrences of a term exceeds this figure, the target does not supply alternative terms for that entry.
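The threshold rule could be applied on the target side roughly as follows. This is a sketch only: the function name, the dictionary layout and the default threshold of 5 are assumptions chosen to match Example 2, not a prescribed implementation.

```python
BIB1 = "1.2.840.10003.3.1"  # Bib-1 attribute set OID


def alternative_terms_for(record_ids, threshold=5):
    """Return one alternativeTerm entry per record, or none above the threshold."""
    if len(record_ids) > threshold:
        return []  # too many occurrences: the origin falls back to searching on TERM
    return [
        {
            "attributeSet": BIB1,
            "attributeType": 1,    # Use attribute
            "attributeValue": 12,  # Local number
            "term": str(record_id),
        }
        for record_id in record_ids
    ]


entries = alternative_terms_for([111, 222, 333, 444, 555])
print(len(entries))
print(alternative_terms_for([1, 2, 3, 4, 5, 6]))
```

With the assumed threshold of 5, the five records of Example 2 are all returned, while a sixth occurrence suppresses the list entirely.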

Version   Date       Author            Description
1         9.04.99    Janifer Gatenby
2         25.07.99   Janifer Gatenby   Change other term info / url to alternative term / attributes plus term

[JCZ1] I have problems with this. Scan does not return brief records; scan is not modelled as returning brief records. Scan returns terms that can be used as the term part of a subsequent query. These terms have some meta-data attached, but they cannot be seen as "records" in the Z39.50 sense.

[JCZ2] Again, the turn-round action is not to "request" an actual record; it is to execute a search, which may return one or many records. A scan on "bible" in a large database, for instance, should retrieve terms ("Bible. English", "Bible. O.T. English" etc.) that each might find thousands of records. I certainly wouldn't expect to get back thousands of instances of the term Bible, each pointing to a single record.

[JCZ3] I don't see how any kind of meta-information added to scan can address this model. It is purely a client issue if it wants to apply the results of a scan on an LC database against a local database. The results are by definition unpredictable.

[JCZ4] But in that case, the server is doing its scan incorrectly!!!!!! I don't think this is a problem that we need Z39.50 mechanisms to solve. It seems to me that the presence of a count in the scan response is a promise that when the term is used in a search on the same database(s), then that many records will be in the result set. If this is not true, then either scan or search is broken.

[JCZ5] I think this is changing the semantics of alternative term, at least as I understood it. My understanding has been that alternative term carries a reference, probably a see reference. We have implemented behaviour that is very close to that required and have done so using OtherInfo rather than alternative term. I continue to think that OtherInfo is the appropriate vehicle for this kind of thing, and that we should be concentrating on developing an appropriate structure for use in OtherInfo.

[JCZ6] How is the client to distinguish between these alternative terms that are see references and should be shown to the user and alternative terms that are optimizations and should not be shown to the user?

[JCZ7] I don't think there is ANY sense in which the record control number for the 1998 edition of Romeo and Juliet is some kind of alternative to the name Shakespeare. And I certainly wouldn't expect to get what is essentially the result set of the corresponding search as part of the scan result. We are scanning here, not searching! We use this kind of mechanism to allow the server to say, in effect, "here is the term; a search on it retrieves n records, and if you want to search for these n records you will get them more quickly and efficiently if you send this other thing (my heading number) rather than the term itself. If you send me the term, however, I will, of course, search for it, and you should get identical results". We could put this "preferred synonym" term in term, and the display term in display term, but the problem is that the "preferred synonym" term uses a different attribute combination from the one used in the original scan request, and scan response has no mechanism for indicating this. What we send now is:

Term: SHAKESPEARE WILLIAM 1565 1616 (This is the actual string in the index)

Display Term: Shakespeare, William, 1565-1616 (This is the actual string in the record with subfields removed)

Count: 25362 (this is the number of records retrieved by a search on the term)

OtherInfo:

Preferred Synonym Term+Attributes:

AttributeList (same as original except structure is :DBKey)

Term: 123456789 (this is the database's key for the term - searching on this rather than term avoids a join and is thus significantly more efficient)

(well …we don't actually send the attribute list, since our client already knows it for our server, but we very easily could)

This suggests that maybe we need an other info structure that can carry term+attribute structures and specify their purpose. I need a "preferred synonym", Janifer needs a list of "things pointed to by this term", I'm sure someone else has another use.
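Purely as an illustration of what such a structure could look like, here is a sketch in Python. Nothing in it is defined by Z39.50 or by this proposal: the `Purpose` values, the class names and the attribute labels are all hypothetical, and the attribute tuple is a placeholder for "same as original except structure is DBKey".

```python
from dataclasses import dataclass, field
from enum import Enum


class Purpose(Enum):
    """Hypothetical purposes for a term+attributes entry carried in OtherInfo."""
    PREFERRED_SYNONYM = "preferredSynonym"  # more efficient equivalent of the term
    POINTED_TO = "pointedTo"                # a thing pointed to by the term


@dataclass(frozen=True)
class TermInfo:
    purpose: Purpose
    attributes: tuple  # placeholder labels, not real attribute values
    term: str


@dataclass
class ScanEntry:
    term: str
    display_term: str
    count: int
    other_info: list = field(default_factory=list)


entry = ScanEntry(
    term="SHAKESPEARE WILLIAM 1565 1616",
    display_term="Shakespeare, William, 1565-1616",
    count=25362,
    other_info=[
        TermInfo(Purpose.PREFERRED_SYNONYM, (("structure", "DBKey"),), "123456789"),
    ],
)


def best_search_term(scan_entry):
    """Prefer the server's synonym when present; otherwise fall back to the term."""
    for info in scan_entry.other_info:
        if info.purpose is Purpose.PREFERRED_SYNONYM:
            return info.term
    return scan_entry.term


print(best_search_term(entry))
```

An origin that does not understand a given purpose can ignore the entry and search on the term itself, which should give identical results.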