UKOLN AHDS Porting The UnCover Service



Context

In 1999 ingenta bought the US-based UnCover Corporation and set about moving the operation to the UK. UnCover had evolved over the space of about 10 years and the service had been fixed and added to in an ad hoc manner in response to customer requirements, with the result that there were now very few people who fully understood it. There were three main issues to be addressed: (1) moving the bibliographic data (i.e. information about journal articles) into a database in the UK and implementing a stopgap application to provide access to this data; (2) moving the user-level subscription and accounting data into a database; and (3) reimplementing the application.

This case study discusses the choices which were available at various stages of the project and why decisions were made. It also discusses whether, with the benefit of hindsight, any of the decisions could have been improved.

Background to the Problem

UnCover had been set up to provide universities and commercial companies (mostly in the US) with access to journal articles. The system worked by providing a bibliographic database which contained basic information such as the article title, the authors, journal title, volume, issue, page numbers, etc., which could be searched using the usual methods. If the searcher wanted a copy of the complete article then the system would provide a FAX copy of this at a charge which included the copyright fee for the journal's publisher, a fee for the provider of the copy (which was one of a consortium of academic libraries) and a fee for UnCover.

Additionally, UnCover provided journal alerting services, customised presentation, prepaid deposit accounts, and other facilities.

Ingenta bought the company, partly to give it a presence in the US and partly to get a bibliographic database with good coverage of academic journals going back about 10 years.

Over the space of about a year the entire system was moved to the UK from where it now runs.

The Bibliographic Database

The first task was to move the bibliographic backfile and then to start taking and adding the regular weekly updates which UnCover produced. The database consisted of about a million articles per year, though the early years (i.e. from 1988 to about 1992) were somewhat smaller. Ingenta had a good deal of experience in using the BasisPlus database system, which originated as a textual indexing system but had acquired various relational features over the years. It has many of the standard facilities of such a system, e.g. word and phrase indexing, mark up handling, stopwords, user defined word break characters and so on. Some thought had been given to alternative DBMSs (and this is discussed further below) but given the short timescale it would have been too risky to switch systems at this point. BasisPlus had the additional advantage that ingenta already had an application which could use it and which would require only small modifications to get working.

The application was written to access several databases simultaneously. Each database contained the data for a single year's worth of journal articles, and if a particular search was required to cover several contiguous years (as most were) then the application automatically applied the search to each year database in turn and then concatenated the results for display in reverse chronological order. There were disadvantages to this method, notably the near impossibility of sorting the results into relevance ranked order, but by and large, it worked well.
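The per-year search described above can be sketched as follows. This is only an illustration: the `search_years` function, the dictionary of per-year search callables and the demo records are all hypothetical, and nothing here reflects the real BasisPlus interface.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: each year maps to a function that searches that
# year's database and returns the matching records.
YearSearch = Callable[[str], List[dict]]


def search_years(databases: Dict[int, YearSearch], query: str,
                 first_year: int, last_year: int) -> List[dict]:
    """Apply the query to each year database, newest first, and concatenate."""
    results: List[dict] = []
    for year in range(last_year, first_year - 1, -1):
        search = databases.get(year)
        if search is None:
            continue  # no database exists for this year
        results.extend(search(query))
    return results


def make_year_db(records: List[dict]) -> YearSearch:
    """Build a toy year database that does a substring match on the title."""
    def search(query: str) -> List[dict]:
        return [r for r in records if query.lower() in r["title"].lower()]
    return search


demo_dbs = {
    1990: make_year_db([{"year": 1990, "title": "Peace studies"}]),
    1991: make_year_db([{"year": 1991, "title": "War and peace revisited"}]),
}
demo_hits = search_years(demo_dbs, "peace", 1990, 1991)
```

Because each year's results are simply appended in turn, the combined list is ordered by year and not by any overall score, which is why relevance ranking across years was so hard to provide.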

Ingenta obtained some samples of the data and set about analysing it and building a test database. This was fairly straightforward and didn't pose any serious problems, so the next step was to start offloading the data from UnCover a year at a time and building the production databases. It soon became obvious that data which purported to be from (say) 1990 contained articles from anywhere between 1988 and about 1995. Persuading the UnCover team to fix this would probably have delayed the build, so it was decided to collect all the available data and then write a program to scan it searching for articles from a specified year which could then be loaded into the current target year database. Experience indicated that it's better to fix these sorts of problems yourself rather than try to persuade the other party to undertake what for them is likely to be a significant amount of unwelcome work.

The decision was taken quite early in the project to index the text without specifying any stopwords. Stopwords are commonly used words such as "the", "a", "and", "it", "not", etc. which are often not indexed because they are thought to occur too frequently to have any value as searching criteria and because the millions of references would make the indexes excessively large. When stopwords are in use, the result is that trying to search for the phrase "war and peace" will also find articles containing the word "war" followed by ANY word, followed by "peace", e.g. "war excludes peace". At first this seems sensible, but experience had shown that some of the stopwords also occur in other contexts where disabling searching is an acute disadvantage, so for example it becomes impossible to search for "interleukin A" without also finding thousands of references to interleukin B, interleukin C, etc. which are not wanted. In fact it turned out that specifying no stopwords had a comparatively small inflationary effect on the indexes (about 20%) and a negligible effect on the performance.
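The effect of stopwords on phrase searching can be illustrated with a toy positional index. This is a simplified model written for this discussion, not a description of how BasisPlus builds its indexes; the function names, the stopword list and the sample titles are all assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, stopwords=frozenset()):
    """Map each indexed word to {doc_id: [positions]}.

    Stopwords are not indexed, but they still occupy a position.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            if word not in stopwords:
                index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase, stopwords=frozenset()):
    """Return ids of documents containing the phrase.

    A stopword in the phrase acts as a wildcard, because it was never indexed.
    """
    terms = [(off, w) for off, w in enumerate(tokenize(phrase))
             if w not in stopwords]
    if not terms:
        return set()
    first_offset, first_word = terms[0]
    hits = set()
    for doc_id, positions in index.get(first_word, {}).items():
        for pos in positions:
            start = pos - first_offset
            if all((start + off) in index.get(w, {}).get(doc_id, [])
                   for off, w in terms):
                hits.add(doc_id)
                break
    return hits


STOPWORDS = frozenset({"the", "a", "and", "it", "not"})
docs = {
    1: "War and peace",
    2: "War excludes peace",
    3: "Interleukin A receptors",
    4: "Interleukin B receptors",
}
with_stop = build_index(docs, STOPWORDS)
no_stop = build_index(docs)
```

In this model, searching `with_stop` for "war and peace" returns documents 1 and 2, and searching it for "interleukin a" returns documents 3 and 4. The same searches against `no_stop` return only document 1 and only document 3 respectively.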

Another important decision was to rethink the way author names were held in the system. UnCover had input names as:

Surname, Forename Initial
e.g. Smith, Robert K

This was very difficult to index in a way which would provide flexible name searching, particularly since bibliographic databases generally use Surname, Initials (e.g. Smith, RK), though we were generally reluctant to discard any data. It was decided to keep several author name fields: one with the names in their original format, a second to be used for display, a third for searching and a fourth for matching with another database. A more detailed description of the methodology used is given in the QA Focus advisory document on merging databases [1].
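As a rough sketch of the idea of holding several forms of each name, the code below derives the Surname, Initials form from the UnCover form and builds four fields from it. The exact content of the display, search and matching fields is not described here, so the choices made in `author_fields` (lower-casing for search, surname plus first initial for matching) are purely illustrative assumptions.

```python
import re


def to_surname_initials(name):
    """Convert 'Surname, Forename Initial' to 'Surname, INITIALS'."""
    surname, _, forenames = name.partition(",")
    parts = [p for p in re.split(r"[\s.\-]+", forenames) if p]
    initials = "".join(p[0].upper() for p in parts)
    surname = surname.strip()
    return f"{surname}, {initials}" if initials else surname


def author_fields(name):
    """Build four hypothetical name fields from one input name."""
    display = to_surname_initials(name)
    surname, _, initials = display.partition(", ")
    # Assumed matching key: letters of the surname plus the first initial.
    match_key = re.sub(r"[^a-z]", "", surname.lower()) + initials[:1].lower()
    return {
        "original": name,           # kept exactly as supplied
        "display": display,         # Surname, Initials
        "search": display.lower(),  # assumed: case-folded for searching
        "match": match_key,         # assumed: loose key for cross-database use
    }
```

For example, `author_fields("Smith, Robert K")` keeps "Smith, Robert K" as the original and produces "Smith, RK" for display.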

This operation of analysing the data, designing the BasisPlus database structure (which was simply a further modification of several we had done in the past), writing the program to take the UnCover data and convert it for input to Basis and finally building the 12 production databases took about three months elapsed time.

The Stopgap Application

The immediate requirement was for an application which would allow the databases to be searched, the results displayed and emailed, and documents ordered and delivered. There was not an initial requirement to replace the entire UnCover service, since this would continue to run for the time being. An application was available which had originally been written for the BIDS services and was reasonably easily adaptable. Because the BIDS services had used an almost identical database structure, the searching and display mechanisms could be moved with only minor modification. In addition the services had used the results display to drive a search of another database called the PubCat (or Publishers Catalogue) which contained bibliographic information on articles for which ingenta held the full text. If the user's search found one of these, then the system would offer to deliver it, either for free if the user had a subscription to the journal or for a credit card payment.

The major addition at this stage was to provide access to the UnCover document delivery service. The PubCat could only deliver electronic (PDF) versions of documents for issues of those journals held by ingenta (or for which ingenta had access to the document server) and inevitably, these tended to be the more recent issues. UnCover could also deliver older material as FAXes, and to enable this it was necessary to construct a call to the UnCover server providing it with ordering details and to receive an acknowledgement. The HTTP protocol was used for this since it had the right structure and the amount of information passing back and forth was relatively small. In addition, a record of each transaction was kept at the ingenta end for reconciliation purposes.

There were a number of teething problems with the UnCover link, mainly caused by inadequate testing, but by this point there was a reasonably stable database and application.

Switching the Data Feed

The first real problem emerged shortly after the system went live, as it became obvious that the feed of bibliographic data from UnCover was going to stop as the UnCover operation in the US was wound down. In retrospect this should have been apparent to the developers involved and should have been allowed for, or at least thought about.

The data feed was to be replaced by the British Library's Inside Serials database (BLIS). In fact there were good reasons for doing this. The journal coverage of Inside Serials is much wider than UnCover and overall, the quality control was probably better. In addition, the coverage is more specifically academic and serious news journals, whereas UnCover had included a significant number of popular journals.

Nonetheless, the problems involved in cutting off one feed and starting another are fairly significant, mainly because issues of a journal arrive at the various database compilers by a variety of routes and therefore find their way into the data feeds at different times. It was not possible to simply stop the UnCover feed one week and then start updating with BLIS because this would have meant that some articles would previously have been in BLIS, but not yet in UnCover (and therefore would never get into the composite database) while others would have already arrived via UnCover, only to be loaded again via BLIS.

The solution adopted was to adapt the system which formatted the BLIS data for loading so that for each incoming article, it would interrogate the database to find out whether it had already been loaded. If it had, then it would merge the new entry with the existing entry (since BLIS had some extra fields which were worth incorporating), otherwise it simply generated a new entry. Also, immediately after stopping the UnCover updates (at the end of January) the previous 10 weeks' worth of BLIS updates were applied. It was hoped that this would allow for disparities in the content of the two data feeds. In fact it was impossible to predict the extent of this disparity and the 10 week overlap was simply a best guess. It has since been discovered that arrival rates of some journals can vary even more dramatically than we thought and in retrospect it would have been preferable to have made this overlap somewhat longer (perhaps twice as long, but even then it's unlikely that all the missing articles would have been collected).

The other problem was the ability of the updating mechanism to correctly match an incoming article with one which already existed in the database. There are two standard approaches to this difficult problem and these are discussed in some detail in Appendix 1.
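The merge-or-insert step described above can be sketched as follows. The in-memory list, the `apply_update` function and the rule of only filling fields that are currently empty are assumptions made for illustration; the matching test itself is passed in as a function, since that is the subject of Appendix 1. The ISSN shown is a placeholder.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def apply_update(database: List[Record], incoming: Record,
                 same_article: Callable[[Record, Record], bool]) -> str:
    """Merge the incoming record into a matching one, or add it as new."""
    for existing in database:
        if same_article(existing, incoming):
            for field, value in incoming.items():
                # Assumed policy: keep what is already loaded and only
                # fill in fields that are missing or null.
                if existing.get(field) is None and value is not None:
                    existing[field] = value
            return "merged"
    database.append(dict(incoming))
    return "inserted"


def same_issn_and_page(a: Record, b: Record) -> bool:
    """Deliberately crude matching rule, used only for this demonstration."""
    return a["issn"] == b["issn"] and a["start_page"] == b["start_page"]


db = [{"issn": "1234-5679", "start_page": "10", "dewey": None}]
first = apply_update(
    db, {"issn": "1234-5679", "start_page": "10", "dewey": "510"},
    same_issn_and_page)
second = apply_update(
    db, {"issn": "1234-5679", "start_page": "25", "dewey": "510"},
    same_issn_and_page)
```

Here the first update is merged into the existing entry, filling in the previously empty classification field, while the second finds no match and is loaded as a new entry.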

In addition to this synchronisation problem, the two databases were rather different in structure and content, in the format of author names and journal titles, and in the minor fields, which all these databases have, but which exhibit a bewildering, and sometimes incomprehensible, variety. For those fields which were completely new (e.g. a Dewey Classification) it was simply necessary to fix the databases to add a new field which would get populated as the new fields started to arrive and would have null values otherwise or have some value preloaded. Other fields, and certain other aspects of the content, required the BLIS data to be somehow fixed so that the application (and ultimately of course, the user) would see a consistent set instead of having to deal with a jarring discontinuity. The subject of normalising data from several databases is dealt with in the document on merging databases [1]. The process was less troublesome than it could have been, but this was mostly good luck rather than judgement.

The most difficult aspect of BLIS from a presentational point of view is that the journal names are all in upper case. This may sound trivial, but displaying long strings of capitals on the screen now looks overly intrusive, and would in any case have jarred uncomfortably with the UnCover presentation. It was therefore necessary to construct a procedure which would convert the string to mixed case, but deal correctly with words which are concatenated initials (e.g. IEEE, NATO).
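A minimal sketch of such a case conversion is given below. The lists of initialisms and of small words are hypothetical and far shorter than would be needed in practice, and the procedure actually used may well have worked differently.

```python
# Hypothetical lists; a production version would need much longer ones.
ACRONYMS = {"IEEE", "NATO"}
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def to_mixed_case(title):
    """Convert an upper-case journal title to mixed case."""
    out = []
    for i, word in enumerate(title.split()):
        if word in ACRONYMS:
            out.append(word)                # keep initialisms as they are
        elif i > 0 and word.lower() in SMALL_WORDS:
            out.append(word.lower())        # small words stay lower case
        else:
            out.append(word.capitalize())   # first letter upper, rest lower
    return " ".join(out)
```

For example, `to_mixed_case("IEEE TRANSACTIONS ON INFORMATION THEORY")` gives "IEEE Transactions on Information Theory". The sketch does not handle initialisms with attached punctuation or the second half of hyphenated words, both of which would need extra rules.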

Subscription and Accounting Data

In addition to the bibliographic database, UnCover also held a large amount of data on its business transactions and on the relationships with its customers and suppliers, and this also needed to be transferred. Because the service was available 24 hours a day and was in constant use, it would have been infeasible (or at least, infeasibly complex) to transfer the actual service to the UK in stages. It was therefore necessary to nominate a period (over a weekend) when the US service would be closed down, the data transferred and loaded into the new database, and the service restarted on the Monday morning.

The first task was to select a database system to hold the data, and ORACLE was chosen from a number of possible candidates. There were good reasons for this:

It had originally been intended to keep all the data (i.e. including the bibliographic data) in a single database, so as well as transferring the subscription and accounting data, it would have been necessary to dump out the bibliographic data and load this as well. It became obvious at an early stage that this was a step too far. There were doubts (later seen to be justified) about the ability of the ORACLE InterMedia system to provide adequate performance when searching large volumes of textual data, and the minimal benefits did not justify the extra work involved and the inherent risks, so the decision was taken to keep the two databases separate, though inevitably this meant that there was a significant amount of data in common.

The database structure was the result of extensive study of the UnCover system and reflected an attempt to produce a design which was as flexible as possible. This is a debatable aim and there was, accordingly, a good deal of debate internally about the wisdom of it. It had the advantage that it would be able to accommodate new developments without needing to be changed, for example, it had been suggested that in the future it might be necessary to deal with objects other than journal articles (e.g. statistical data). By making the structure independent of the type of object it was describing, these could easily have been accommodated. In the short term however it had several disadvantages. Making the structure very flexible led to at least one area of it becoming very inefficient, to the extent that it was slow to update and very slow to interrogate. Moreover, a structure which is flexible admits not only of flexible use, but also flexible interpretation. The structure was difficult for the application designers to understand, and led to interpretations of its meaning which not only differed from that intended, but also from each other.

Samples of the various data files were obtained from UnCover and scripts or programs written to convert this data into a form which could be input to ORACLE. Ultimately the data to be loaded was a snapshot of the UnCover service when it closed down. Once the service had been restarted in the UK, the system would start applying updates to the database, so there would be no possibility of having a second go. This was therefore one of the crucial aspects of the cutover and had it gone wrong, it could easily have caused the whole exercise to be delayed.

In addition to the UnCover data, the source of document delivery was being changed from the UnCover organisation to CISTI (for deliveries in North America) and the British Library (for deliveries elsewhere). This required that the system know which journals were covered by the two services in order that it did not take an order for a document which the delivery service had no possibility of fulfilling. It also needed certain components of the price, which had to be calculated on the fly for each article. A similar problem to the article matching arose here. It was necessary to take the relevant details of an article (i.e. journal title, ISSN, publication year, volume and issue) from one source and match them against another source to find out whether the relevant document delivery service could deliver the article. Although this worked reasonably well most of the time, it did initially produce a significant number of errors and, since the documents were paid for, complaints from users which were extremely time consuming to resolve.

Reimplementing the Service

This was easily the most complex part of the operation. In addition to the ability to search a database and order documents, UnCover provided a number of additional services (and packages of services) which needed to be replicated. These included:

The work started by identifying "domain experts" who were interviewed by system designers in an attempt to capture all the relevant information about that domain (i.e. that aspect of the service) and which was then written up as a descriptive document and formed the basis of a system design specification. This was probably a useful exercise, though the quality of the documents produced varied considerably. The most common problems were failure to capture sufficient detail and failure to appreciate the subtleties of some of the issues. This led to some of the documents being too bland, even after being reviewed and reissued.

The descriptive documents were converted into an overall system design and then into detailed specifications. The system ran on a series of Sun systems running Unix. The application software was coded mostly in Java, though a lot of functionality was encapsulated in ORACLE triggers and procedures. Java proved to have been a good decision as there was a sufficiently large pool of expertise in this area. The Web sessions were controlled by WebLogic and this did cause a number of problems, though probably no more than would be expected when dealing with a piece of software most people had little experience of.

Inevitably the main problems occurred immediately after the system went live. Given the timescale involved it was impossible to run adequate large scale system tests and the first few weeks were extremely traumatic with the system failing and having to be restarted, alerting services producing inexplicable results and articles which had been ordered failing to arrive.

Unfinished Business

It had originally been the intention to look for an alternative to BasisPlus as the main bibliographic DBMS. Given that ORACLE was being used for other data, it would have been reasonable to have switched to this. Sometime before, there had been a review of the various possibilities and extensive discussions with the suppliers. Based on this, a provisional decision was taken to switch to using Verity. This was chosen mainly because it was seen as being able to provide the necessary performance for textual searching, whereas there was some doubt about the ability of the ORACLE InterMedia software to provide a sufficiently rapid response.

Faced with the implementation pressures, the switch to an unknown and completely untried DBMS was quickly abandoned. It was still thought that ORACLE might be a viable alternative and the original database design did include tables for storing this information.

Sometime after the system went live, a large scale experiment was conducted to test the speed of ORACLE InterMedia and the resulting response times showed that the conservative approach had in fact been correct.

Conclusions

Transferring a mature and complex service such as UnCover, and at the same time making major changes to the way it worked, was always going to be risky. Given the scale of the undertaking, it is perhaps surprising that it worked as well as it did, and criticism after the event is always easy. Nonetheless, there have to be things which could have worked better.

There seems to be an unshakeable rule in these cases that the timescale is set before the task is understood and that it is invariably underestimated. In this case, this was exacerbated by the need to bring in a large number of contract staff, who although they were often very competent people, had no experience of this sort of system and who therefore found it difficult to judge what was important and what was not.

Flowing from this, there was a serious communication problem. The knowledge of the working of the UnCover system resided in the U.S. and while there were extensive contacts, this is not a substitute for the close proximity which allows for extended discussions over a long period and for the easy, ad hoc face to face contact which allows complex issues to be discussed and resolved. The telephone and email are poor substitutes for real meetings. The upshot was that some issues took days of emailing back and forth to resolve and even then were sometimes not fully appreciated.

In addition to the difficulties of international communication, the influx of a large number of new staff meant that there was too little time for personal relationships to have built up. There was a tendency for people to work from the specification given, rather than discussing the underlying requirements of the system. The importance of forging close working relationships, particularly on a large and complex project such as this, is hard to overemphasise.

The project control methodology used was based on a tightly controlled procedure involving the writing of detailed specifications which are reviewed, amended, and then eventually signed off and implemented. This method is roughly at the other end of the spectrum from what we might call the informal anarchy method. Plainly it has many advantages, and there is no suggestion that a very informal method could have worked here; the problem was simply too complicated. It does however have its drawbacks, and the main one is its rigidity. The specification, whatever its deficiencies, tends to become holy writ and is difficult to adjust in the light of further knowledge. As with many projects, the increasing pressures resulted in the procedures becoming more relaxed, but it is at least debatable whether a more flexible approach should have been used from the start.

References

  1. Merging Data From Several Sources, QA Focus briefing document. To be published shortly.

Appendix 1: Article Level Matching

Given the bibliographic details of journal articles, there are basically two approaches to the problem of taking any two sets of details and asking whether they refer to the same article.

The details will normally consist of:

Article Title:
Possibly with a translation, if the original title is not in English
Author Names:
In a wide variety of formats and in some cases with only the first 3 or 4 authors included.
Journal Title:
Sometimes with an initial "The" missing.
ISSN:
The International Standard Serial Number, if the journal has one.
Publication Year:
Year of publication.
Volume Number:
Some journals, particularly weekly journals, like New Scientist, no longer include a volume number.
Issue Number:
Journals which only publish once a year sometimes don't use an issue number.
Page Number:
Usually start and end page numbers, but sometimes just the start page is given.

In addition, some bibliographic databases include an abstract of the article. BLIS does not, but this is not relevant to this discussion.

The problems arise because different databases catalogue articles using different rules. There will be differences in the use of mark-up, in capitalisation (particularly in journal names), and most notoriously in the rules for author names, where some include hyphens and apostrophes and some do not, some spell out forenames and others provide only initials, some include suffixes (e.g. Jr., III, IV) and others don't. Also, databases differ in what they include: some for example treat book reviews as a single article within an issue whereas others treat each review separately and others exclude reviews, some include short news articles whereas others don't, and so on. Given these variations, it's plainly impossible to get an exact solution and the real questions are (a) do we prefer the algorithm to err in certain ways rather than others, and (b) how do we measure whether the algorithm is behaving "reasonably"?

One approach is to use information in the article title and author names (probably only the first one or two), along with some other information e.g. journal name and ISSN. This method had been used in the past and while for some purposes it worked reasonably well, the particular implementation depended on a specialised database containing encoded versions of the article title etc, in order to provide acceptable performance. It would either have been necessary to use the same system here or to have written the matching code ourselves (both of which would have meant a great deal of extra work).

There was no possibility of using this solution, so it was decided to try a completely different and computationally much simpler approach which could easily be programmed to run in a reasonable time.

  1. Reduce the journal titles to a canonical form by converting everything to lower case, removing any punctuation and removing common words like "the", "of", "an", etc.
  2. If both articles have an ISSN then match on this. If they match then compare the reduced journal names. If either of these fails then the articles are different, otherwise
  3. Match on volume numbers (null volume numbers match as equal). If they differ then the articles are different, otherwise
  4. Match on issue numbers (null issue numbers match as equal). If they differ then the articles are different, otherwise
  5. Match on start page.
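The five steps above can be sketched in a few lines. This is an illustrative reading of the steps, not the original program, and it rests on several assumptions: the field names and the common-word list are hypothetical, the journal names are still compared when one or both ISSNs are missing, and "null matches as equal" is taken to mean that two null values are treated as equal while a null against a real value counts as a difference. The ISSN and journal title in the demonstration are placeholders.

```python
import re

# Assumed list of common words; the original list is not given in full.
COMMON_WORDS = {"the", "of", "an", "a", "and", "for"}


def canonical_title(title):
    """Step 1: lower case, strip punctuation, drop common words."""
    words = re.sub(r"[^\w\s]", "", title.lower()).split()
    return " ".join(w for w in words if w not in COMMON_WORDS)


def same_article(x, y):
    """Steps 2 to 5: return True only if every comparison succeeds."""
    # Step 2: ISSNs are compared only when both records have one.
    if x.get("issn") and y.get("issn") and x["issn"] != y["issn"]:
        return False
    if canonical_title(x["journal"]) != canonical_title(y["journal"]):
        return False
    # Steps 3 and 4: None == None is True, so two nulls match as equal.
    if x.get("volume") != y.get("volume"):
        return False
    if x.get("issue") != y.get("issue"):
        return False
    # Step 5: the start page decides.
    return x.get("start_page") == y.get("start_page")


a = {"journal": "The Journal of Example Studies", "issn": "1234-5679",
     "volume": "12", "issue": "3", "start_page": "101"}
b = {"journal": "JOURNAL OF EXAMPLE STUDIES.", "issn": "1234-5679",
     "volume": "12", "issue": "3", "start_page": "101"}
c = dict(b, volume=None)  # same article catalogued without a volume number
```

With these records, `same_article(a, b)` is true despite the differences in the journal title, while `same_article(a, c)` is false because only one of the two records carries a volume number. The approach errs on the side of not matching.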

The preference here was to err on the side of not matching, if possible, and an attempt was made to measure the effect of this by looking at articles which had successfully matched and checking that there were no erroneous matches. On this measure, the algorithm worked well. Unfortunately, measuring the opposite effect (i.e. those which should have matched, but did not) is extremely difficult without being able to anticipate the reasons why this might happen. These inevitably come to light later. There were two main ones:

  1. Although the ISSN is allocated rigorously, the allocation of ISSN to journal within the databases is sometimes incorrect. This will often have occurred when a journal has split into two or more separate journals and the new ISSNs are not correctly transcribed. Because ISSN is a property of the journal, the error propagates to every article in that journal. This was probably the main source of serious errors.
  2. UnCover catalogued some journals with a volume and issue number (presumably by allocating a volume number based on the publication year) whereas these were (correctly) catalogued in BLIS with only an issue number.

Appendix 2: About This Case Study

This case study was written by Clive Massey, a former employee of BIDS/ingenta.