Provenance
Provenance is a well-established concept in the art world, where the lineage, pedigree or origins of a painting are critical to determining its authenticity and value. It is equally important in science, where the provenance or origin of a particular set of data is essential to determining the likely accuracy, currency and validity of derived information and of any assumptions, hypotheses or further work based on that information. Significant research has been carried out on describing the provenance of scientific data in the molecular genetics databases SWISSPROT and OMIM and in collaborative multi-scale chemistry initiatives. The topic was recently explored in a workshop at the latest Global Grid Forum (GGF6) in relation to Grid data, and the relationship of provenance to the Semantic Web has been noted. The Open Archives Initiative has also carried out some work to describe the provenance of harvested metadata records, and the concept is included as an element in the administrative metadata that forms part of the METS metadata standard. This project aimed to review the state of the art in provenance research, review the observed trends and directions, identify gaps in work in this area and present some conclusions and recommendations for future activities for the JISC.
References
Penn Database Research Group:
Buneman, P., Khanna, S. & Tan, W.-C. (2000). "Data provenance: some basic issues." 20th Conference on the Foundations of Software Technology and Theoretical Computer Science (FSTTCS), New Delhi, India, 13-15 December 2000.
http://db.cis.upenn.edu/DL/fsttcs.pdf
Buneman, P., Khanna, S. & Tan, W.-C. (2001). "Why and where: a characterization of data provenance." 8th International Conference on Database Theory (ICDT), London, 4-6 January 2001.
http://db.cis.upenn.edu/DL/whywhere.pdf
Buneman, P., Khanna, S., Tajima, K. & Tan, W.-C. (2001). "Archiving scientific data." Technical Report, University of Pennsylvania.
http://www.cis.upenn.edu/~wctan/papers/01/archiving-tr.pdf
Buneman, P., Khanna, S., Tajima, K. & Tan, W.-C. (2002). "Archiving scientific data." ACM SIGMOD international conference on Management of data, Madison, Wisconsin, USA, 4-6 June 2002.
http://doi.acm.org/10.1145/564691.564693
http://www.cis.upenn.edu/~wctan/papers/02/sigmod02.pdf
Buneman, P., Khanna, S., Tajima, K. & Tan, W.-C. (2003). "Archiving scientific data." DPC Forum, London, 24 June 2003. (slides only)
http://www.dpconline.org/graphics/events/presentations/pdf/PeterBuneman.pdf
Penn Database Research Group
http://db.cis.upenn.edu/Research/provenance.html
myGrid provenance:
Greenwood, M., Goble, C., Stevens, R., Zhao, J., Addis, M., Marvin, D., Moreau, L. & Oinn, T. (2003). "Provenance of e-science experiments - experience from bioinformatics." Poster, UK e-Science All Hands Meeting, Nottingham, 2-4 September 2003.
http://www.nesc.ac.uk/events/ahm2003/AHMCD/pdf/047.pdf
(An overview of initial work on the provenance of bioinformatics e-science experiments within the myGrid project).
Greenwood, M., Goble, C., Stevens, R., Zhao, J., Addis, M., Marvin, D., Moreau, L. & Oinn, T. (2003). "Provenance of e-science experiments - experience from bioinformatics." In: Cox, S. J., (ed.), Proceedings of the UK e-Science All Hands Meeting 2003, September 2003.
http://www.nesc.ac.uk/events/ahm2003/AHMCD/pdf/047.pdf
Zhao, J., Goble, C., Greenwood, M., Wroe, C. & Stevens, R. (2003). "Annotating, linking and browsing provenance logs for e-science." ISWC 2003 Workshop: Semantic Web Technologies for Searching and Retrieving Scientific Data, Sanibel Island, Florida, USA, 20 October 2003.
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-83/prov_2.pdf
myGrid Provenance Data
http://twiki.mygrid.org.uk/twiki/bin/view/Mygrid/ProvenanceData
List of provenance resources maintained by myGrid.
http://twiki.mygrid.org.uk/twiki/bin/view/Mygrid/ProvenanceResources
Collaboratory for Multi-Scale Chemical Science (CMCS):
Myers, J.D., Chappell, A.R., Elder, M., Geist, A. & Schwidder, J. (2003). "Re-integrating the research record." IEEE Computing in Science & Engineering, 5(3), 44-50.
http://collaboratory.emsl.pnl.gov/presentations/papers/reintegrating.html
Myers, J.D., Pancerella, C., Lansing, C., Schuchardt, K.L. & Didier, B. (2003). "Multi-scale science: supporting emerging practice with semantically derived provenance." ISWC 2003 Workshop: Semantic Web Technologies for Searching and Retrieving Scientific Data, Sanibel Island, Florida, USA, 20 October 2003.
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-83/prov_1.pdf
Pancerella, C., et al. (2003). "Metadata in the Collaboratory for Multi-scale Chemical Science." Proceedings of DC-2003: the 2003 Dublin Core Conference, Seattle, Washington, USA, 27 September - 2 October 2003.
http://purl.oclc.org/dc2003/03pancerella.pdf
Stanford Database Group:
Cui, Y., & Widom, J. (2000). "Practical lineage tracing in data warehouses." 16th International Conference on Data Engineering (ICDE'00), San Diego, Calif., USA, February 2000.
http://dbpubs.stanford.edu/pub/1999-55
Cui, Y., Widom, J., & Wiener, J. L. (2000). "Tracing the lineage of view data in a warehousing environment." ACM Transactions on Database Systems, 25(2), 179-227.
http://dbpubs.stanford.edu/pub/1997-3
Cui, Y. (2001). Lineage tracing in data warehouses. Thesis (PhD), Stanford University.
http://dbpubs.stanford.edu/pub/2001-56
Cui, Y., & Widom, J. (2003). "Lineage tracing for general data warehouse transformations." VLDB Journal, 12(1), 41-58.
http://dbpubs.stanford.edu/pub/2001-5
Widom, J. (2005). "Trio: a system for integrated management of data, accuracy, and lineage." Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, Calif., USA, 4-7 January 2005.
http://www-db.cs.wisc.edu/cidr/papers/P22.pdf
Berkeley Database Research Group:
Woodruff, A. G. (1998). Data lineage and information density in database visualisation. Thesis (PhD), University of California at Berkeley.
http://db.cs.berkeley.edu/papers/UCB-PhD-woodruff.pdf
Woodruff, A., & Stonebraker, M. (1997). "Supporting fine-grained data lineage in a database visualization environment." Thirteenth International Conference on Data Engineering (ICDE 1997), Birmingham, UK, 7-11 April 1997.
http://db.cs.berkeley.edu/papers/icde97-dl.pdf
http://db.cs.berkeley.edu/papers/CSD-97-932.pdf
Birkbeck College, School of Computer Science and Information Systems:
Fan, H., & Poulovassilis, A. (2002). "Tracing data lineage using automated schema transformation pathways." University of London, Birkbeck College, School of Computer Science and Information Systems, Technical Report BBKCS-02-07.
http://www.dcs.bbk.ac.uk/~hao/Publications/bbkcs0207.pdf
Fan, H., & Poulovassilis, A. (2003). "Tracing data lineage using schema transformation pathways." In: Omelayenko, B., & Klein, M., (eds.), Knowledge transformation for the Semantic Web. IOS Press, 64-79.
http://www.dcs.bbk.ac.uk/~hao/Publications/IOS.pdf
Chimera:
Foster, I., Vöckler, J., Wilde, M. & Zhao, Y. (2002). "Chimera: a virtual data system for representing, querying, and automating data derivation." 14th International Conference on Scientific and Statistical Database Management (SSDBM '02), Edinburgh, 24-26 July 2002.
http://www.globus.org/research/papers/VDS02.pdf
Foster, I., Vöckler, J., Wilde, M., & Zhao, Y. (2003). "The Virtual Data Grid: a new model and architecture for data-intensive collaboration." CIDR 2003 - First Biennial Conference on Innovative Data Systems Research, Asilomar, Calif., USA, 5-8 January 2003.
http://www-db.cs.wisc.edu/cidr2003/program/p18.pdf
Other papers:
Bose, R. (2002). "A conceptual framework for composing and managing scientific data lineage." 14th International Conference on Scientific and Statistical Database Management (SSDBM 2002), Edinburgh, UK, 24-26 July 2002. Full text available to members of the IEEE Computer Society; abstract available from: http://csdl.computer.org/comp/proceedings/ssdbm/2002/1632/00/16320015abs.htm
Bose, R. & Frew, J. (2004). "Composing lineage metadata with XML for custom satellite-derived data products." 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), Santorini Island, Greece, 21-23 June 2004.
http://essw.bren.ucsb.edu/~frew/cv/pubs/2004_lineage_XML.pdf
Groth, P., Luck, M., & Moreau, L. (2004). "Formalising a protocol for recording provenance in Grids." UK e-Science All Hands Meeting 2004 (AHM 2004), Nottingham, UK, 31 August - 3 September 2004.
http://www.allhands.org.uk/2004/proceedings/papers/91.pdf
Lanter, D. P. (1991). "Design of a lineage-based meta-data base for GIS." Cartography and Geographic Information Systems, 18(4), 255-261.
Marathe, A. P. (2001). "Tracing lineage of array data." Journal of Intelligent Information Systems, 17(2/3), 193-214.
Pinheiro da Silva, P., McGuinness, D. L., & McCool, R. (2003). "Knowledge provenance infrastructure." IEEE Data Engineering Bulletin, 26(4), 26-32.
http://www.ksl.stanford.edu/people/dlm/papers/provenance-abstract.html
Workshops:
Workshop on Data Derivation and Provenance, Chicago, Illinois, 17-18 October 2002
http://www-fp.mcs.anl.gov/~foster/provenance/
NeSC Workshop on Data Provenance and Annotation, Edinburgh, UK, 1-3 December 2003
http://www.nesc.ac.uk/esi/events/304/
Projects:
Provenance Aware Service Oriented Architecture [EPSRC-funded project].
http://www.pasoa.org/index.html
PASOA aims to investigate the concept of provenance and its use for reasoning about the quality and accuracy of data and services in the context of eScience.
Other links:
Renaud, K. "Data Provenance and Annotation Resource Home Page." University of Glasgow, Department of Computer Science.
http://www.dcs.gla.ac.uk/~karen/Provenance/
The 'principle of provenance' in archival science:
For archivists, the principle of provenance and the related concept of 'original order' inform almost every part of archival theory and practice. Together, they make up the wider archival principle of respect des fonds. It is the insight of archivists that the authenticity and integrity of records depend, at least in part, on tracing their origin and past history. Cook (1993, p. 26) argues that when archivists adhere to the principles of provenance and original order, "the evidential character of archives is protected, whereby the records inherently reflect the functions, programmes and activities of the person or institution that created them, and the transactional processes by which that actual creation took place."
Bearman, D., & Lytle, R. H. (1985). "The power of the principle of provenance." Archivaria, 21, 14-27.
Cook, T. (1984). "From information to knowledge: an intellectual paradigm for archives." Archivaria, 19, 28-49.
Cook, T. (1992). "The concept of the archival fonds: theory, description, and provenance in the post-custodial era." In: Eastwood, T., (ed.), The archival fonds: from theory to practice = Le fonds d'archives: de la théorie à la pratique. Ottawa: Bureau of Canadian Archivists, Planning Committee on Descriptive Standards, 31-85.
Cook, T. (1993). "The concept of the archival fonds in the post-custodial era: theory, problems and solutions." Archivaria, 35, 24-37.
Cook, T. (2001). "Fashionable nonsense or professional rebirth: postmodernism and the practice of archives." Archivaria, 51, 14-35.
Duchein, M. (1977). "Le respect des fonds en archivistique: principes théoriques et problèmes pratiques." Gazette des Archives, 97, 89-114. [English translation: Duchein, M. (1983). "Theoretical principles and practical problems of respect des fonds in archival science." Archivaria, 16, 64-82.]
Duff, W. M., & Harris, V. (2002). "Stories and names: archival description as narrating records and constructing meanings." Archival Science, 2, 263-285.
Gilliland-Swetland, A. J. (2000). Enduring paradigm, new opportunities: the value of the archival perspective in the digital environment. Washington, D.C.: Council on Library and Information Resources.
http://www.clir.org/pubs/abstract/pub89abst.html
Nesmith, T., (ed.), Canadian archival studies and the rediscovery of provenance. Metuchen, N.J.: Scarecrow Press.
Posner, E. (1967). "Max Lehmann and the genesis of the principle of provenance." In: Posner, E., Archives and the public interest. Washington, D.C.: Public Affairs Press, 36-44.
Roper, M. (1992). "The development of the principles of provenance and respect for original order in the Public Record Office." In: Craig, B. L., (ed.), The archival imagination: essays in honour of Hugh A. Taylor. Ottawa: Association of Canadian Archivists, 134-153.