PRIDE Requirements and Success Factors
Work Package 2 of the Telematics for Libraries project PRIDE (LB 5624)
Search and Retrieve systems are the workhorses of Information Retrieval (IR). They provide access to various document databases, performing selection according to user requests. Three important characteristics distinguish different Search and Retrieve systems:
The rest of this section is organised as follows: first, different types of search requests are described, then a number of existing search and retrieve systems and protocols are presented.
Historically, the first method for IR (Information Retrieval) was based on the so-called Boolean model. In this model, stored documents are identified by sets of keywords (terms), and requests for information are expressed as Boolean combinations of terms. The goal of the retrieval process is to select those stored documents that are identified by the exact combination of terms specified in the request. This old model is implemented in many search systems still in use today.
Scientific research in IR concentrates on two different approaches to the IR problems.
The first is based on statistical models. These models are similar to the Boolean model; the main distinction is the use of weighted terms, which allows indications of importance to be assigned to terms. Such well-known IR models as the Vector Space Model (VSM) and the Probabilistic Model belong to this class.
The second approach is based on natural language processing methods. This approach naturally draws on dictionaries, thesauri and knowledge bases that provide semantic specifications for the words of a text.
In this model the stored documents are described by sets of keywords or phrases known as index terms. Requests for information are expressed as Boolean combinations of index terms using the operators AND, OR and NOT. The goal of the retrieval system is to select those stored documents that are identified by the exact combination of search terms specified in the query.
One advantage of the Boolean model is its simple search method. The method is based on auxiliary inverted index files, and the search operations reduce to simple list manipulations, as the sketch below illustrates.
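By way of illustration, the following minimal sketch (with an invented three-document collection and query) shows Boolean retrieval over an inverted index using exactly such set manipulations:

    # Minimal sketch of Boolean retrieval over an inverted index.
    # The tiny document collection is invented for illustration.
    docs = {
        1: "information retrieval with boolean queries",
        2: "vector space model for information retrieval",
        3: "library search and retrieve systems",
    }

    # Build the inverted index: term -> set of document identifiers.
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    def AND(a, b): return a & b
    def OR(a, b): return a | b
    def AND_NOT(a, b): return a - b

    # Evaluate the query "information AND retrieval AND NOT vector".
    hits = AND_NOT(AND(index.get("information", set()),
                       index.get("retrieval", set())),
                   index.get("vector", set()))
    print(sorted(hits))   # -> [1]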
Two disadvantages of the Boolean model should be noted:
Refinements have been introduced into the Boolean model to provide more discriminating output. In this fuzzy-set retrieval model, the terms assigned to a document are allowed to carry weights in decreasing order of presumed importance (though such weights are not assigned to the query terms). The retrieved documents can then be ranked in decreasing order of the weights of the matching query terms. [6]
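A small sketch of this ranking refinement, with invented per-document term weights; retrieved documents are ordered by the summed weights of the matching query terms:

    # Sketch of weighted (fuzzy-set) Boolean output ranking.
    # Per-document term weights are invented for illustration.
    weights = {
        1: {"information": 0.9, "retrieval": 0.8},
        2: {"information": 0.4, "retrieval": 0.3, "vector": 0.9},
    }
    query_terms = ["information", "retrieval"]

    def score(doc_terms):
        """Rank by the sum of weights of the matching query terms."""
        return sum(doc_terms.get(t, 0.0) for t in query_terms)

    ranked = sorted(weights.items(), key=lambda kv: score(kv[1]), reverse=True)
    print([doc_id for doc_id, _ in ranked])   # -> [1, 2]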
The free-form search model is similar to the keyword-based Boolean model, but it brings in some intelligence. While the Boolean model guarantees that documents are selected in strict accordance with a well-defined formula, the free-text model is fuzzy: it may employ synonyms, assign term weights at its own discretion, react to the user's feedback, and so on. The search system tries to work out what the user actually wants to find.
The two approaches mentioned above, statistical models and natural language processing, apply to this category of search techniques. Natural language processing is still considered rather heavyweight, as it requires dictionaries, thesauri and similar tools without guaranteeing reliable improvements over the Boolean model. [5]
Statistical models are further subdivided into Vector-Space models (VSM) and Probabilistic models.
IR systems often search document descriptions rather than the raw documents. In the case of VSM, a document description is a tuple of real numbers indexed by document terms [7]. Terms may be words, phrases, or other textual units from the documents; the numbers are weights assigned to the terms. The following factors are normally used to assign weights:
It is now common practice to assume that a user query to an IR system is expressed in natural language. VSM treats user queries as documents, builds their descriptions and compares them against the descriptions of the document collection. For this, a mathematical apparatus has been developed that operates on document descriptions in terms of matrices and matrix operations. Relevance feedback from users can be used by a VSM system to adjust term weights in the collection descriptions.
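The following sketch illustrates the VSM mechanics under common simplifying assumptions (tf-idf term weighting and cosine similarity are one typical choice; the toy collection is invented):

    import math
    from collections import Counter

    docs = {
        1: "information retrieval with boolean queries",
        2: "vector space model for information retrieval",
        3: "library search and retrieve systems",
    }
    N = len(docs)

    # Document frequency of each term across the collection.
    df = Counter(t for text in docs.values() for t in set(text.split()))

    def vector(text):
        """tf-idf description of a text: term -> weight."""
        tf = Counter(text.split())
        return {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        norm = math.sqrt(sum(w * w for w in u.values())) * \
               math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    # The query is treated as a (short) document and compared to the rest.
    q = vector("information retrieval")
    ranked = sorted(docs, key=lambda d: cosine(q, vector(docs[d])), reverse=True)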
Another type of statistical model is the probabilistic model. Here the goal is to estimate the probability that a given document is relevant to the user with respect to a given query. A document's weight is the sum of term weights whose values are based on the probabilities of the terms occurring in relevant and non-relevant documents.
This approach shows good effectiveness where training data are available from which to estimate these probabilities. In other circumstances, especially when complex terms are used, the probabilities may be difficult to estimate accurately. One way to estimate the relevance probability is to use relevance feedback. See [7] for more details.
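One classical formulation is the Robertson/Sparck Jones relevance weight, sketched below with invented counts; a document's score is then the sum of such weights over its matching query terms:

    import math

    def rsj_weight(N, n, R, r):
        """Robertson/Sparck Jones relevance weight for one term.

        N: documents in the collection, n: documents containing the term,
        R: known relevant documents, r: relevant documents with the term.
        The 0.5 terms smooth the estimates when the counts are small.
        """
        return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                        ((n - r + 0.5) * (R - r + 0.5)))

    # Invented training data: 1000 docs, the term occurs in 50;
    # of 10 known relevant documents, 8 contain the term.
    w = rsj_weight(N=1000, n=50, R=10, r=8)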
These systems organise attributes and values in a hierarchical structure. A search request thus consists of a series of choices, descending the topic hierarchy from more abstract topics to more specific ones. It should be noted that although such a structure is a directed graph, it is not necessarily a tree, because it may contain different paths to the same topic or document. For example, information on car audio systems can be made accessible via "Cars -> Car Audio" as well as "Electronics -> Audio -> Car Audio".
Hierarchical search systems require a certain amount of maintenance: all documents need to be classified according to the schema used, and sometimes the topology itself needs to be changed.
A typical implementation of a hierarchical document retrieval system is a set of hypertext documents, each representing one or more nodes and containing links to other, more detailed, nodes and/or to the actual documents.
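A small sketch of such a topic structure, encoding the car-audio example above as a directed graph with two navigation paths to the same topic:

    # Sketch of a topic graph that is a DAG rather than a tree:
    # "Car Audio" is reachable along two different paths.
    topics = {
        "Top":         ["Cars", "Electronics"],
        "Cars":        ["Car Audio"],
        "Electronics": ["Audio"],
        "Audio":       ["Car Audio"],
        "Car Audio":   [],          # leaf node: would link to actual documents
    }

    def paths(node, goal, prefix=()):
        """Enumerate all navigation paths from node to goal."""
        prefix = prefix + (node,)
        if node == goal:
            yield " -> ".join(prefix)
        for child in topics[node]:
            yield from paths(child, goal, prefix)

    print(list(paths("Top", "Car Audio")))
    # ['Top -> Cars -> Car Audio', 'Top -> Electronics -> Audio -> Car Audio']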
The best-known example of a hierarchical search system is Yahoo!.
ANSI/NISO Z39.50-1995 - Information Retrieval (Z39.50): Application Service Definition and Protocol Specification [1] is an American national standard for information retrieval. This document specifies a set of rules and procedures for the behaviour of two networked computer systems communicating in order to perform database searching and information retrieval tasks.
The Z39.50 standard was developed to overcome the traditional problems associated with searching multiple databases by defining a standard language by which computer systems can interrogate remote databases and retrieve information records.
The latest version of the Z39.50 standard (Version 3) was approved in 1995 by the US National Information Standards Organization (NISO). This version is also being adopted as an international standard, replacing the Search and Retrieve (SR) standard approved by the International Organization for Standardization in 1991. The new ISO standard will be known as ISO 23950.
Z39.50 conforms to the Client/Server model of computing where two computers interact in a relationship where one computer (the Client) requests services from the other (the Server). The Server responds to Client requests by performing functions on behalf of the Client and returning the function's status and other requested information. In Z39.50 parlance, those parts of the Client and Server that are directly concerned with the handling of Z39.50 communications with the other system are known as the Origin and Target respectively.
A searcher enters a query into the local system using that system's standard interface and query language. The language and the search mechanism correspond to the Boolean model. The Z39.50 Origin then translates the local query into the standard format defined by Z39.50 and sends it to a remote Z39.50 Target system, which further translates the standard Z39.50 query into a search that can be performed against its associated database. The database search results are translated by the Z39.50 Target into Z39.50 response messages that are sent back to the Z39.50 Origin that then passes them up for display to the searcher via the local user interface.
The major advantage of this system is that the searcher only needs to learn the interface and query language of their local system in order to be able to search any number of remote Z39.50-enabled databases. It also provides an enabling technology for multi-database searching.
The Z39.50 Origin and Target communicate using a stateful network connection where both systems remain connected whilst exchanging protocol messages (known as APDUs or "Application Protocol Data Units"). These messages represent Z39.50 service requests and responses. The basic services are INIT to establish a Z39.50 association; SEARCH to query the remote database; and PRESENT to retrieve records from the database.
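As a hedged illustration, a minimal origin-side session using the freely available PyZ3950 library is sketched below; the target address, database name and the exact ZOOM-style API calls are assumptions for the example, not part of the standard:

    # Hedged sketch: a Z39.50 origin session using the PyZ3950 library's
    # ZOOM-style API (API details and the target address are assumptions).
    from PyZ3950 import zoom

    # INIT: establish the Z39.50 association with the target.
    conn = zoom.Connection('z3950.example.org', 210)   # hypothetical target
    conn.databaseName = 'Default'                      # hypothetical database
    conn.preferredRecordSyntax = 'USMARC'

    # SEARCH: a type-1 (RPN) query; Bib-1 attribute 1=4 means "title".
    query = zoom.Query('PQF', '@attr 1=4 "information retrieval"')
    results = conn.search(query)

    # PRESENT: retrieve the first few records from the result set.
    for i in range(min(5, len(results))):
        print(results[i])

    conn.close()   # CLOSE: shut down the association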
Z39.50 provides a means to carry out services which are not part of the Z39.50 standard itself, such as document ordering via ILL. This is achieved through the Extended Services facility.
Z39.50-1995 adds further services to allow browsing of indexes (SCAN), sorting of records in a query result (SORT), and explicit closing of the network connection (CLOSE). Z39.50-1995 also supports various protocol enhancements such as concurrent operations (initiating several operations on the same connection without having to serialise them).
Z39.50 itself does not mandate any particular networking protocol to provide the communications between systems. However, the Internet is the most common communication network used to transmit Z39.50 messages, although some systems use the OSI networking protocols.
Z39.50 not only standardises the messages to be transferred and the sequence in which they can be sent, but also the structure and semantics of the search and the format of returned records. It does this through the use of predefined Query Types, Attribute Sets and Transfer Syntaxes respectively. Z39.50 is therefore extensible: new Query Types, Attribute Sets and Transfer Syntaxes can be defined within the standard as and when different user communities require them.
The Z39.50 Explain facility is a convention that allows a client system (origin) to obtain details of the target implementation, "including databases, attribute sets, diagnostic sets, record syntaxes, and element specifications supported" [1]. It is covered in section 2.5.2.
Normally, Z39.50 targets only provide information that resides in their local databases; they do not propagate search requests. Hence a typical scenario for a distributed Z39.50 search is based on a flat model: a number of Z39.50 servers are queried in parallel, some of them possibly for more than one database. Version 3 servers may support the 'concurrent-operations' option, which allows a number of databases on a target to be searched in parallel.
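The flat model can be sketched as a simple parallel fan-out; the search_target helper below is hypothetical and stands in for a real origin-side search call:

    # Sketch of the flat model for distributed Z39.50 search:
    # query several targets in parallel and merge the results.
    from concurrent.futures import ThreadPoolExecutor

    targets = [("z3950.example.org", 210, "Books"),
               ("opac.example.edu", 210, "Catalogue")]   # hypothetical targets

    def search_target(host, port, database, pqf_query):
        """Hypothetical helper: run one search and return its records."""
        ...

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_target, h, p, db,
                               '@attr 1=4 "information retrieval"')
                   for h, p, db in targets]
        merged = [rec for f in futures for rec in (f.result() or [])]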
However, there is no requirement that a Z39.50 target must always search its local databases only. A UNIverse server [3,4], for example, propagates incoming queries to a number of Z39.50 targets, collates the results and presents them as if they were coming from a single database. As a UNIverse server is itself a Z39.50 target, this approach allows for distributed search hierarchies of arbitrary depth.
There are a number of freely available implementations of the Z39.50 protocol. One of them, the YAZ toolkit [2], developed by the Danish company Index Data, may be of particular interest to the PRIDE consortium. The YAZ toolkit supports both the Z39.50 v3 and ISO 10163 SR protocols, and both the TCP/IP transport method and the OSI upper layers (using a separate XTI/mOSI implementation) over RFC 1006. Both the Origin (client) and Target (server) roles of the protocol are supported. The toolkit is written in C and its source code is freely available (see the licensing terms in [2]). Within the context of the PRIDE project, Index Data is a subcontractor to Fretwell-Downing Data Systems.
Z39.50-1995 specification [1], <URL:http://lcweb.loc.gov/z3950/agency>
YAZ toolkit documentation [2], <URL:http://www.indexdata.dk/yaz/yaz.shtml>
HTTP, the HyperText Transfer Protocol, is the protocol used to transfer resources between a Web browser and a Web server. Built over TCP/IP, HTTP is a stateless protocol that has evolved through several stages, described below.
The original version of HTTP, as defined in a 1991 memo, is a very simple protocol designed for transferring HTML. No client information is transferred with the query. Later versions of HTTP are backwards compatible with it. HTTP/0.9 can run over any connection-oriented service (usually TCP).
In 1992, HTTP/1.0 was introduced as an upgrade to the original version. The abstract of the defining memo describes HTTP/1.0:
"HTTP is a protocol with the lightness and speed necessary for a distributed collaborative hypermedia information system. It is a generic stateless object-oriented protocol, which may be used for many similar tasks such as name servers, and distributed object-oriented systems, by extending the commands, or "methods", used. A feature if HTTP is the negotiation of data representation, allowing systems to be built independently of the development of new advanced representations."
Additions include MIME types, a wider range of methods, object meta-information headers and client headers.
HTTP/1.1 became a standard in 1997. Features introduced include persistent connections, request pipelining, chunked transfer-coding, the mandatory Host header (supporting virtual hosts), byte-range requests and improved caching controls.
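For illustration, a minimal HTTP/1.1 exchange over a raw TCP socket is sketched below (www.example.org is a placeholder host); it shows the mandatory Host header and explicit connection handling:

    # Sketch: a minimal HTTP/1.1 request over a raw socket.
    import socket

    host = "www.example.org"            # placeholder host
    request = (f"GET / HTTP/1.1\r\n"
               f"Host: {host}\r\n"      # required in HTTP/1.1 (virtual hosts)
               f"Connection: close\r\n\r\n")

    with socket.create_connection((host, 80)) as sock:
        sock.sendall(request.encode("ascii"))
        response = b""
        while chunk := sock.recv(4096):
            response += chunk

    print(response.split(b"\r\n", 1)[0])   # e.g. b'HTTP/1.1 200 OK'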
Current work is focused on re-engineering the HTTP protocol architecture around a simple, extensible, distributed object-oriented model. See the future work section.
Many servers, browsers, proxies and caches now support many of the HTTP/1.1 features. Future technologies such as HTTP-NG may initially be deployed through proxies.
Introduced by Netscape, cookies were defined in an RFC in 1997. Cookies are chunks of state information that allow a stateful session to be maintained across HTTP requests and responses. Other proprietary mechanisms exist for tracking state information.
Cookies work by an origin server returning an extra response header (Set-Cookie) to a client. A browser may return a Cookie request header to the origin server if it chooses to continue a session. The origin server may ignore it, or use it to determine the current state of the browsing session, and may then resend a Set-Cookie response header (with the same or different information). A session ends when the server returns a Set-Cookie with an immediate expiration.
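A minimal client-side sketch of this exchange, using Python's standard http.client against a placeholder host (the cookie value in the comment is invented):

    # Sketch of the cookie exchange from the client side: read the server's
    # Set-Cookie response header, then return it as a Cookie request header.
    from http.client import HTTPConnection

    conn = HTTPConnection("www.example.org")      # placeholder host
    conn.request("GET", "/")
    resp = conn.getresponse()
    cookie = resp.getheader("Set-Cookie")         # e.g. "session=abc123; Path=/"
    resp.read()   # finish the response so the connection can be reused

    if cookie:
        # Continue the session: echo the name=value pair back to the server.
        conn.request("GET", "/next", headers={"Cookie": cookie.split(";")[0]})
        conn.getresponse().read()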
PEP, the Protocol Extension Protocol, is an extension mechanism for HTTP designed to bridge the gap between formal specification and private agreement and to provide a mechanism for the extension of Web applications. Such applications include browsers, servers and proxies.
PEP associates an extension with a URI and uses various new header fields to carry the extension identifier and related information between the parties involved in an extended transaction. PEP is designed to be compatible with HTTP/1.0 and HTTP/1.1. It is intended that the PEP mechanism is included in future versions of HTTP.
TCN, Transparent Content Negotiation, is a mechanism layered above HTTP for automatically selecting the most appropriate content to return to a client when a URL is accessed. Content negotiation has been increasingly recognised as an important potential member of the Web infrastructure. RFC 2295 lists the following expected benefits for a "sufficiently powerful system":
MUX is a session management protocol that separates the transport layer from the upper-level application protocols. Communication between these layers takes place through multiplexed data streams built on top of a stream-oriented transport. Support for multiple application-level protocols eases the transition to future Web protocols, and client applets may communicate with servers, using private protocols, over the same connection as the HTTP transactions.
The deficiencies of HTTP/1.0 are patched in HTTP/1.1. However, pipelining does not adequately support rendering of inlined objects, fairness between protocol flows, or graceful abortion of HTTP transactions without closing the TCP connection. These issues may be resolved by using MUX, for example, to multiplex multiple lightweight HTTP transactions onto the same transport connection.
SMUX has been implemented as part of the HTTP-NG project (though it is still at an experimental stage) to address various transport/application issues.
As HTTP was designed as a medium for publishing documents to the public, security was not originally a major concern of the architecture. The evolution of the Web has identified a need for suitable security facilities and access controls.
The stateless nature of HTTP/1.0 contrasts with protocols such as FTP, where user authentication occurs at login and persists until logout. HTTP/1.0 was designed so that servers did not have to retain any information from one connection to the next; this means that authentication information, if required, must be included with each request. As long as the same authentication applies to a set of documents (for example, below some directory path in the URL), the browser may store and re-issue the appropriate credentials with each request. Any such system that reuses credentials is vulnerable to replay attacks, but one-time password systems such as S/Key would not be appropriate for the large number of requests issued for a typical set of web pages. In HTTP/1.0 the username/password combination is encoded but not encrypted, making it vulnerable to network sniffing.
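The weakness is easy to demonstrate: the Basic scheme merely base64-encodes the credentials, as the following sketch (with invented credentials) shows:

    # HTTP Basic authentication only base64-encodes the credentials;
    # anyone sniffing the network can trivially decode them.
    import base64

    credentials = base64.b64encode(b"alice:secret").decode()   # invented values
    print(f"Authorization: Basic {credentials}")  # Authorization: Basic YWxpY2U6c2VjcmV0
    print(base64.b64decode(credentials))          # b'alice:secret' -- no secrecy at all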
The persistent connections offered by HTTP/1.1 are still not the kind of session over which strong authentication can be deployed, although HTTP/1.1 introduces an improved method called digest authentication. This method involves calculating an MD5 digest of the username, the password and a value returned by the server. Though this improves the situation, digest authentication still does not provide strong authentication of the kind described in the TCP section. S-HTTP (Secure HTTP) has seen little use.
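A sketch of the digest calculation in its original RFC 2069 form, with invented credentials and nonce; only the final MD5 response crosses the network, never the password itself:

    # Sketch of the digest computation (the original RFC 2069 form).
    from hashlib import md5

    def h(s: str) -> str:
        return md5(s.encode()).hexdigest()

    username, realm, password = "alice", "pride", "secret"   # invented values
    nonce = "dcd98b7102dd2f0e8b11d0f600bfb0c093"             # supplied by the server
    method, uri = "GET", "/index.html"

    ha1 = h(f"{username}:{realm}:{password}")
    ha2 = h(f"{method}:{uri}")
    response = h(f"{ha1}:{nonce}:{ha2}")   # sent back in the Authorization header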
HTTP-NG should support a more secure model for authentication and encryption.
The Secure Socket Layer (SSL) protocol developed by Netscape is the most commonly used security enforcement protocol for HTTP. SSL is covered in section 2.7.10.5.
HTTP standardisation is IETF based, though a great deal of information can be found at the W3C site.
It is almost certain that PRIDE will be using the Web and HTTP. Issues that may arise with this include:
The W3C briefing package summary for HTTP-NG mentions the following drawbacks of HTTP tackled by NG:
Concerns such as these are shaping the design of the new protocol architecture. The approach to HTTP-NG will be a layered HTTP framework providing a Web distributed-object system. Transport layers will address the interaction with other Internet protocols, and such layering will allow the implementation framework to be modularised. The distributed-object system allows application-specific interfaces to be defined and provides a model for the deployment of applications; it provides the core functionality of the distributed-object systems defined by DCOM, CORBA and Java RMI, and it is hoped that it will be directly usable by these technologies.
A number of HTTP-NG related IETF drafts have recently been submitted (1998).
W3C HTTP pages, <URL:http://www.w3.org/Protocols/>
W3C HTTP-NG pages, <URL:http://www.w3.org/Protocols/HTTP-NG/>
Inter-Library Loan (ILL) is the generic term used for describing the process of supplying copies of, or loaning, items between libraries, generally to satisfy a local user's request. A number of system models used to manage the ILL process are being studied. The ILL protocol (ISO 10160/1) was developed in the context of managing the entire document request lifecycle within a distributed environment. It is conceptually similar to EDI agreements and includes provision for: definition of required data elements, definition of a set of messages and their relationships, and a syntax for structuring the messages.
The ILL protocol is basically concerned with three sets of parameters:
The ILL protocol is not currently widely implemented, with most use centred in Canadian libraries. In addition, the forty-five ILL vendors and service providers listed below have committed to implementing the protocol. Ten of these (marked with *) have successfully sent ILL requests between each other.
A-G Canada Ltd.
* Ameritech Library Services
Bath Information & Data Services (BIDS) (U.K.)
British Library (U.K.)
CARL Corp.
CILLA Project (Australia)
CISTI (Canada)
COPAC (U.K.)
CPS Systems
Committee on Institutional Cooperation
DDE-ORG Systems Ltd. (India)
* DRA
EBSCO Document Services
EDDIS Project (U.K.)
Elias
ELib (U.K.)
Endeavor Information Systems
EOS International
ExLIBRIS
Finsiel (Italy)
Florida Center for Library Automation
* Fretwell Downing (U.K.)
Gaylord Information Systems
GEAC
* Innovative Interfaces
JEDDS Project (Australia)
The Library Corp.
LIBRIS (Sweden)
* MnSCU/PALS
National Library Board of Singapore
National Library of Australia
* National Library of Canada
National Library of Medicine
* OCLC
OVID Technologies
PICA (Netherlands)
Perkins & Associates
* Pigasus Software, Inc.
QuickDOC
Relais International
* Research Libraries Group
SIRSI
TKM Software (Canada)
Triangle Research Library Network
* University of Quebec (Canada)
VTLS
WLN
The protocol is defined in ISO 10160 and 10161.
The protocol may be compared to the basic messaging functionality of the OCLC, RLIN, WLN and DOCLINE ILL systems (the four U.S. bibliographic utilities). Each system has a defined set of messages (an ILL request, a renewal request, etc.), fields for each message (author, title, patron, verified, etc. for an ILL request), and a proper sequence for the messages. However, these systems use proprietary methods to transmit ILL transactions, and as a result unfilled requests cannot currently be forwarded from one system to another.
US libraries are following the development of the British Library's ILL interface, as this may encourage them and others to follow suit and expand the number of implementation sites.
ILL Maintenance Library at the National Library of Canada, <URL:http://www.nlc-bnc.ca/iso/ill/main.htm>
ILL Protocol Implementers Group (IPIG) home page, <URL:http://www.arl.org/access/naildd/ipig/ipig.shtml>
<URL:http://www.arl.org/access/naildd/ipig/res/portugal.shtml>
Electronic Data Interchange (EDI) is a set of protocols for conducting highly structured inter-organisation exchanges, such as making purchases or initiating loan requests. In the book trade, books are sold not only in specialist shops but also alongside other products in other stores, so an EDI system implemented between libraries and booksellers must also be compatible with other EDI systems. As a result, the EDItEUR group (the Pan-European Book Sector EDI Group), which aims to co-ordinate the development, promotion and implementation of EDI in the books and serials sectors, has run a project (EDILIBE II) to allow EDI communication between libraries and booksellers. This system supports EDIFACT (Electronic Data Interchange For Administration, Commerce And Transport, ISO 9735) syntax messages using X.400 communication to process orders and queries.
The EDILIBE II project was run in conjunction with the following libraries and booksellers:
EDIFACT is a syntax definition consisting of a set of messages, data elements and code lists which provide an enabling framework for EDI application designers to meet the needs of a particular trading community. As EDIFACT is not static and constantly has new elements added to it, EDItEUR uses a subset of messages named EANCOM. The 1994 release of the EANCOM messages used by EDItEUR is based on the D.93A EDIFACT Directory.
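Purely by way of illustration, the sketch below assembles a skeletal EDIFACT ORDERS message in the D.93A style; the segment contents are invented and the message is neither complete nor validated, it merely shows the segment/element/component syntax:

    # Illustrative skeleton of an EDIFACT ORDERS message (D.93A style).
    # Segment contents are invented; a real EANCOM message has many more
    # mandatory segments. Segments end with ', elements are separated
    # by +, components by :.
    segments = [
        "UNH+1+ORDERS:D:93A:UN",        # message header: type/version/directory
        "BGM+220+ORDER123+9",           # beginning of message: an order number
        "DTM+137:19990122:102",         # date/time of the message
        "LIN+1++9780140449136:EN",      # line item identified by an EAN/ISBN
        "QTY+21:2",                     # ordered quantity
        "UNT+6+1",                      # trailer: segment count + message ref
    ]
    message = "'".join(segments) + "'"
    print(message)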
Implementing an EDI interface for the PRIDE directory would clearly be of use for access to the above libraries.
The EDILIBE II project has also changed the prospects for realising EDI on the library side. At the beginning of the project there was no indication that the PICA library system would be introduced in numerous federal states in Germany. In the meantime PICA, as a project participant, has developed an EDI module, so a library using the PICA system can now exchange standardised data with the book trade. Hence there is nothing to prevent the start of EDI communication between libraries and the book trade.
Since the UK book trade's TRADACOMS standard will migrate to EDIFACT, and EDIFACT is already in use between libraries and booksellers, there is a good prospect of more libraries supporting the EDIFACT standard.
EDItEUR website: <URL:http://www.editeur.org>
EDIFACT standards including the D.93A Directory can be found at:
<URL:http://www.harbinger.com/resource/edifact/>
CORBA (Common Object Request Broker Architecture) is a standard for distributed objects developed by the Object Management Group (OMG), a consortium of software vendors and end users. CORBA provides the mechanisms by which objects transparently make requests and receive responses, as defined by OMG's ORB (Object Request Broker). The CORBA ORB is an application framework that provides interoperability between objects, possibly built in different languages and running on different machines in heterogeneous distributed environments. The CORBA Object Query Service is a general service for querying networked information services. As well as IR services, CORBA provides Trader (distributed directories), Interface Repository and Type Management facilities.
CORBA queries are predicate-based and may return collections of objects. They can be specified using object derivatives of SQL and/or other styles of object query languages, including direct manipulation query languages.
The Query Service can be used to return collections of objects that may be:
By using a very general model and by using predicates to deal with queries, the Query Service is designed to be independent of any specific query languages. Therefore, a particular Query Service implementation can be based on a variety of query languages and their associated query processors. However, in order to provide query interoperability among the widest variety of query systems and to provide object-level query interoperability, a Query Service provider must support one of the following two query languages: ANSI SQL-92 Query or OQL-93.
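The following sketch paraphrases this idea in Python; the class and method names are illustrative only and do not reproduce the actual OMG IDL interface:

    # Hedged paraphrase of the Query Service idea: an evaluator accepts a
    # query string in a declared query language and returns a collection
    # of matching objects. Names here are illustrative, not the OMG IDL.
    from abc import ABC, abstractmethod
    from typing import Any, List

    class QueryEvaluator(ABC):
        # The spec requires support for at least one of these languages.
        SUPPORTED = ("SQL-92", "OQL-93")

        @abstractmethod
        def evaluate(self, query: str, ql_type: str) -> List[Any]:
            """Run query, expressed in ql_type, and return matching objects."""

    class CatalogueEvaluator(QueryEvaluator):
        def __init__(self, records: List[dict]):
            self.records = records

        def evaluate(self, query: str, ql_type: str) -> List[Any]:
            if ql_type not in self.SUPPORTED:
                raise ValueError(f"unsupported query language: {ql_type}")
            # A real implementation would parse and execute the query here;
            # this stand-in does a naive substring match.
            return [r for r in self.records if query.lower() in str(r).lower()]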
The CORBA CosQuery service has only one known implementation to date, in the oil and gas exploration domain.
The CosQuery service was considered as an interface for the ZORBA initiative. However, as CosQuery provides no way to discover the information model of the service being queried, it did not meet the requirements for meta-searching.
As there appear to be no relevant implementations of CosQuery since the specification was published in 1995, it seems unlikely that it will be of relevance to PRIDE in the future.
The OMG homepage, <URL:http://www.omg.org>