
Report on Digital Libraries '94

LARGE-SCALE PERSISTENT OBJECT SYSTEMS FOR CORPUS LINGUISTICS AND INFORMATION RETRIEVAL

Robert P. Futrelle[1] and Xiaolan Zhang

Biological Knowledge Laboratory and Scientific Database Project, College of Computer Science, 161 Cullinane Hall, Northeastern University, Boston, MA 02115, {futrelle, xzhang}@ccs.neu.edu

Abstract

Building high-quality Digital Libraries requires a great deal of up-front analysis to organize the collection. We discuss methods for discovering the structure of documents by first analyzing the nature of the language used in them. We describe "bootstrap" techniques that can discover syntactic and semantic word classes, as well as the higher-order structure of natural language, using nothing but the corpus itself. Classical character-based methods of corpus linguistics cannot deal efficiently with the masses of complex data involved in these analyses. We introduce multidatabase methods for handling corpora as word-object-based data, and we demonstrate results of word classification applied to a 4-million-word corpus from the Journal of Bacteriology. We stress object-oriented database technology because of its powerful representation capabilities. However, performance can be compromised in such systems, so we discuss "first-fast" and "successive deepening" approaches to ensure good interactive response.

Keywords: Information systems, corpus linguistics, word classification, object-oriented database.
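
The idea of discovering word classes "using nothing but the corpus itself" can be illustrated with a small sketch. This is not the authors' method, which the abstract does not specify. It assumes one common distributional approach: describe each word by counts of its immediate left and right neighbors, then compare words by cosine similarity. The function names (context_vectors, cosine, most_similar) and the toy corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens):
    """Map each word to counts of its immediate left and right neighbors."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if i > 0:
            vectors[word][("L", tokens[i - 1])] += 1
        if i + 1 < len(tokens):
            vectors[word][("R", tokens[i + 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=3):
    """Rank all other words by how closely their contexts match `word`."""
    target = vectors.get(word, Counter())
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Toy corpus (an assumption for illustration, not data from the paper).
tokens = (
    "the cell grows fast . the strain grows fast . "
    "the cell divides slowly . the strain divides slowly ."
).split()
vectors = context_vectors(tokens)
print(most_similar("cell", vectors, top_n=1))
```

In this toy corpus, "cell" and "strain" occur in identical contexts, so a purely distributional comparison groups them together without any dictionary or hand-built grammar. A real corpus of millions of words would need wider context windows, frequency thresholds, and the kind of persistent word-object storage the abstract argues for.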

