Title: DC-XML and the DCMI Abstract Model
Creator: Pete Johnston
Date Issued: 2005-09-11
Identifier:
Replaces: Not applicable
Is Replaced By: Not applicable
Latest Version:
Status of Document: This is a working document which has no status within DCMI.
Description of Document: This document analyses the current XML binding for Dublin Core metadata with reference to the DCMI Abstract Model. It highlights that the current binding has several shortcomings because the model on which it is based is more limited than the DCAM, and suggests that a binding which implements more features of the DCAM is required.
The Dublin Core metadata standard is not defined in terms of a single data format. Rather, DCMI distinguishes between, on the one hand, a "conceptual model" of what constitutes a Dublin Core metadata "description set", and, on the other, a set of specifications of how instances of that model may be encoded or serialised in various different data formats.
The articulation of a conceptual model for Dublin Core metadata has taken place over an extended period of time. Over that period, various models and frameworks have been discussed and indeed used. Those models and frameworks have exhibited different approaches and have sometimes included different, incompatible concepts. Perhaps most confusingly for the reader, the same or similar labels have sometimes been applied to concepts which, on closer examination, differ to some greater or lesser degree.
However, DCMI has recently taken a significant step to address this with the approval of the DCMI Abstract Model [DCAM] as a DCMI Recommendation. The DCAM provides a description of a conceptual model for Dublin Core metadata, with the intent that this description serves as a reference model to which other specifications (created by DCMI and by other agencies) can refer.
According to the DCAM:
The abstract model of DCMI metadata descriptions is as follows:
- A description is made up of one or more statements (about one, and only one, resource) and zero or one resource URI (a URI reference that identifies the resource being described).
- Each statement instantiates a property/value pair and is made up of a property URI (a URI reference that identifies a property), zero or one value URI (a URI reference that identifies a value of the property), zero or one vocabulary encoding scheme URI (a URI reference that identifies the class of the value) and zero or more value representations of the value.
- The value representation may take the form of a value string or a rich representation.
- Each value string is a simple, human-readable string that is a representation of the resource that is the value of the property.
- Each value string may have an associated syntax encoding scheme URI that identifies a syntax encoding scheme.
- Each value string may have an associated value string language that is an ISO language tag (e.g. en-GB).
- Each rich representation is some marked-up text, an image, a video, some audio, etc. or some combination thereof that is a representation of the resource that is the value of the property.
- Each value may be the subject of a separate related description.
Also:
- A description set is a set of one or more descriptions about one or more resources.
- A DCMI metadata record is a description set that is instantiated according to one of the DCMI encoding guidelines (XHTML meta tags, XML, RDF/XML, etc.)
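The structure quoted above can be sketched as a small set of Python data classes. This is only an illustrative reading of the quoted text, not part of the DCAM itself: the class and field names are assumptions chosen to mirror the DCAM terms, and rich representations and related descriptions are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ValueString:
    """A simple, human-readable string that represents the value."""
    text: str
    syntax_encoding_scheme_uri: Optional[str] = None  # datatype of the string
    language: Optional[str] = None                    # e.g. "en-GB"


@dataclass
class Statement:
    """One property/value pair."""
    property_uri: str
    value_uri: Optional[str] = None                       # zero or one
    vocabulary_encoding_scheme_uri: Optional[str] = None  # class of the value
    value_strings: List[ValueString] = field(default_factory=list)
    # Rich representations and related descriptions are omitted from this sketch.


@dataclass
class Description:
    """One or more statements about one, and only one, resource."""
    statements: List[Statement]
    resource_uri: Optional[str] = None  # zero or one

    def __post_init__(self):
        if not self.statements:
            raise ValueError("a description needs at least one statement")


@dataclass
class DescriptionSet:
    """One or more descriptions about one or more resources."""
    descriptions: List[Description]

    def __post_init__(self):
        if not self.descriptions:
            raise ValueError("a description set needs at least one description")


# Hypothetical instance: one description with a resource URI and a value URI.
example = DescriptionSet([
    Description(
        resource_uri="http://example.org/subject",
        statements=[
            Statement(
                property_uri="http://purl.org/dc/elements/1.1/relation",
                value_uri="http://example.org/value",
                value_strings=[ValueString("A related resource", language="en-GB")],
            )
        ],
    )
])
```

Note that the resource URI belongs to the description and the value URI to the statement, while the language tag and the syntax encoding scheme URI belong to an individual value string.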
The specifications of how instances of that conceptual model are to be encoded are sometimes referred to as "bindings". Each such binding provides a "mapping" between the constructs in the conceptual model and the components and structures used in a data format (or maybe some abstraction of those components and structures, such as the XML InfoSet for XML). For an "encoding" application generating a record, it provides the information that enables the application to determine what syntactic components to use to represent entities and relationships in the conceptual model; and conversely, for a "decoding" application, it specifies how the syntactic components in the data format should be interpreted in terms of the conceptual model.
A binding for Dublin Core metadata may implement the full DCMI Abstract Model or it may implement some subset of that model - i.e. some bindings may not support all the features of the model - but for whatever set of features are implemented, it should provide an unambiguous mapping between the conceptual model and the syntax.
Currently, DCMI provides four "encoding guidelines" specifications which describe formats for representing Dublin Core metadata descriptions:
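- Expressing Simple Dublin Core in RDF/XML [DC-RDF]
- Expressing Qualified Dublin Core in RDF/XML [DCQ-RDF]
- Guidelines for implementing Dublin Core in XML [DC-XML-2003]
- Expressing Dublin Core in HTML/XHTML meta and link elements [DC-XHTML]
All of the above specifications pre-date the DCMI Abstract Model. In addition to these DCMI-provided bindings, DC metadata implementers have developed other bindings for DC metadata descriptions (see e.g. the IPTC News Metadata Framework).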
The current DCMI-recommended convention for representing Dublin Core metadata in XML is specified by the Guidelines for implementing Dublin Core in XML [DC-XML-2003], and it is that specification with which the first part of this document is concerned. For convenience, this document will use the label "DC-XML-2003" to refer specifically to that specification, as distinct from the general notion of an XML binding for Dublin Core metadata, or to other such bindings, current or proposed.
In keeping with the general approach outlined above, the DC-XML-2003 document first describes "abstract models" for two types of Dublin Core "record" and then specifies how records conforming to those two models should be represented in XML.
It is important to note that the DC-XML-2003 specification preceded the development of the DCMI Abstract Model: as a consequence, the models for DC metadata described in the DC-XML-2003 specification differ from the model described by the DCAM.
Appendix C of the DCAM attempts to describe which features of the DCAM are implemented by the DC-XML-2003 specification. It does this by examining examples of XML instances constructed according to the DC-XML-2003 binding. However, that approach is problematic in that it fails to acknowledge the fact that the DC-XML-2003 specification does not purport to provide a binding for the conceptual model of a DC description specified by the DCAM: the DCAM did not exist at that point. Rather, the DC-XML-2003 specification provides a binding for the models specified in the DC-XML-2003 specification itself. To understand the limitations of the DC-XML-2003 binding it is necessary to examine, not examples of XML instances, but the models which it states that it supports, and to compare the features of those models with the features of the DCAM.
The model for "Simple Dublin Core" in DC-XML-2003 is described as:
- A simple DC record is made up of one or more properties and their associated values.
- Each property is an attribute of the resource being described.
- Each property must be one of the 15 DCMES [DCMES] elements.
- Properties may be repeated.
- Each value is a literal string.
- Each literal string value may have an associated language (e.g. en-GB).
Note that there is no formal linkage between a simple DC record and the resource being described. Such a linkage may be made by encoding the URI of the resource as the value of the DC Identifier element, however this is not mandatory.
Note that while the value of a property may be a URI, there is nothing in the simple DC model that indicates this is the case. At their own risk, implementations may choose to guess which values are URIs and which are not.
The model for "Qualified Dublin Core" in DC-XML-2003 is described as:
- A qualified DC record is made up of one or more properties and their associated values.
- Each property is an attribute of the resource being described.
- Each property must be either:
- Properties may be repeated.
- Each value is a literal string.
- Each value may have an associated encoding scheme.
- Each encoding scheme has a name.
- Each literal string value may have an associated language (e.g. en-GB).
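For comparison with the DCAM, the two DC-XML-2003 record models can be sketched with a single data class. The names below are illustrative assumptions rather than anything defined by DC-XML-2003, and the check covers only the "Simple Dublin Core" constraints listed above (one of the 15 DCMES elements, a literal string value, no encoding scheme).

```python
from dataclasses import dataclass
from typing import List, Optional

DC_NS = "http://purl.org/dc/elements/1.1/"

# Local names of the 15 DCMES elements.
DCMES_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


@dataclass
class PropertyValuePair:
    """One entry of a DC-XML-2003 "record" (hypothetical class name)."""
    property_uri: str
    value: str                             # always a literal string
    language: Optional[str] = None         # e.g. "en-GB"
    encoding_scheme: Optional[str] = None  # "Qualified Dublin Core" only


def is_simple_dc(record: List[PropertyValuePair]) -> bool:
    """True if every pair satisfies the "Simple Dublin Core" constraints."""
    return bool(record) and all(
        pair.property_uri.startswith(DC_NS)
        and pair.property_uri[len(DC_NS):] in DCMES_ELEMENTS
        and pair.encoding_scheme is None
        for pair in record
    )
```

The record is simply a list of pairs. There is no field for a URI identifying the resource being described, and none for a URI identifying a value.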
The principal differences between the DC-XML-2003 conceptual models and the DCMI Abstract Model are as follows:
The DCAM takes the approach that the conceptual entity which is encoded is a description set, made up of one or more descriptions, and the result of that encoding process is a record. In DC-XML-2003, the term record is used for the conceptual entity which is encoded, and a record describes a single resource; i.e. DC-XML-2003 uses the term record for what the DCAM calls a description, and DC-XML-2003 does not include the concept of a description set.
Both the DCAM and DC-XML-2003 take the approach that a description (record in DC-XML-2003) describes a single subject resource. In the DCAM, a description may include a resource URI which identifies that subject resource; DC-XML-2003 does not include the notion of a resource URI.
The DCAM has the explicit concept of a statement representing a "property/value pair". DC-XML-2003 makes no explicit reference to a statement though it is perhaps implicit in the description of a record as "made up of one or more properties and their associated values".
In the DCAM, a property is a conceptual resource. A property is referenced in a statement (using a property URI) but the property itself is not part of the description; in DC-XML-2003, the "property" is described as being part of the "record".
In the DCAM, a value is a resource. A value may be referenced in a statement (using a value URI) or represented in a statement (using a value string or rich representation), or the value may be the subject of a second ("related") description, or all three. But the value itself does not appear in the statement.
In contrast, DC-XML-2003 uses the term "value" to refer to a string which forms part of the record i.e. "values" in DC-XML-2003 are strings which represent resources: DC-XML-2003 uses the term "value" for what the DCAM calls a value string.
In the DCAM, a value may be referenced in a statement using a value URI, which identifies that value; DC-XML-2003 does not include the notion of a value URI.
In the DCAM, a value may be represented in a statement using a value string or rich representation. DC-XML-2003 does not include the notion of a rich representation.
DC-XML-2003 does include the concept of what the DCAM calls a value string, though it uses the term "value" to refer to it. It includes the concept that the "value" may be associated with a language tag.
Indeed, the DCAM supports the notion that a statement may provide multiple value representations, or no representation at all; in DC-XML-2003, exactly one "value" (a value string in DCAM terms) is always present in a "property-value pair".
The DCAM distinguishes between a vocabulary encoding scheme (a class of which the value is an instance) and a syntax encoding scheme (a datatype associated with a value string). A statement may contain zero or one references to a vocabulary encoding scheme (using a vocabulary encoding scheme URI). A value string may be associated with a syntax encoding scheme (using a syntax encoding scheme URI).
The DC-XML-2003 model for "Qualified Dublin Core" specifies that a "value" (a value string in DCAM terms) may be associated with an "encoding scheme", but it does not distinguish between a vocabulary encoding scheme and a syntax encoding scheme (nor, of course, does it recognise that the former is associated with the value and the latter with a value string).
In the DCAM, a value may be the subject of a second ("related") description. DC-XML-2003 does not include the concept of a related description.
By explicitly distinguishing "Simple DC Records" and "Qualified DC Records" DC-XML-2003 introduces the idea that there are types or classes of DC metadata description, categorised according to the terms which are referenced in those descriptions and/or the features of the Abstract Model which are deployed in the descriptions.
The abstract models used in the DC-XML-2003 specification differ from the DCMI Abstract Model. In some cases the difference is only that a different label is used for the same (or a very similar) concept.
However, several features of the DCAM are absent from the DC-XML-2003 models, specifically:
- the description set
- the resource URI
- the value URI
- the rich representation
- the related description
As we shall see below, some of these omissions have significant consequences for implementers.
Finally, in one important aspect the DC-XML-2003 models conflate two features which the DCAM treats as distinct:
- the vocabulary encoding scheme and the syntax encoding scheme, which DC-XML-2003 treats as a single "encoding scheme"
There is one area in which the DC-XML-2003 specification presents a significant problem even with respect to its own conceptual models.
In section 6, "Mixing metadata schemas", the DC-XML-2003 departs from its approach of describing a conceptual model and then describing a mapping of that model to the XML syntax. This section includes two examples where fragments of well-formed XML, structured according to the rules of externally defined XML formats are embedded as "sibling" XML elements of XML elements which represent "property value pairs"/statements.
Consider the origins of those XML fragments. They are constructed according to the rules of another XML format. It is highly unlikely that that XML format is designed to represent "property-value pairs"/statements as defined in the DC-XML-2003 models. Instead the format is designed to represent some data which is structured according to the rules of that particular XML format. And conversely the interpretation of that XML fragment is defined by the rules of that external XML format.
However, the example is to be interpreted according to the rules of the DC-XML-2003 binding, and following those rules, the "root" XML element of the fragment should be interpreted as representing a "property-value pair"/statement. The DC-XML-2003 binding specifies that:
While the decoding application can interpret the name of the "root" element as representing the URI of a property, its content is an ordered list of child XML elements. It can not interpret that list of XML elements as a "value" (value string).
Clearly it is not possible to interpret the XML fragment according to the rules of the DC-XML-2003 binding: it can not be mapped to the DC-XML-2003 abstract models (nor to the DCMI Abstract Model).
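The difficulty can be illustrated with a minimal decoder sketch in Python. The sample record, the ex: namespace and the container element name are all hypothetical, and forming the property URI by concatenating the namespace name and the local name is an assumption of this sketch. The decoder maps child elements holding character data to "property-value pairs" and flags any child with element content as unmappable.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: one DC element plus a fragment from another XML format.
RECORD = """\
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:ex="http://example.org/other-format/">
  <dc:title xml:lang="en-GB">An example</dc:title>
  <ex:contact>
    <ex:name>A. Person</ex:name>
    <ex:phone>555-0100</ex:phone>
  </ex:contact>
</record>
"""


def decode(record_xml):
    """Return (statements, unmappable) for the children of the container."""
    statements, unmappable = [], []
    for child in ET.fromstring(record_xml):
        # ElementTree spells a qualified name as "{namespace}local".
        namespace, _, local = child.tag[1:].partition("}")
        property_uri = namespace + local
        if len(child):
            # Element content: there is no string to use as the "value".
            unmappable.append(property_uri)
        else:
            statements.append((property_uri, child.text or ""))
    return statements, unmappable


statements, unmappable = decode(RECORD)
```

Here `statements` holds the dc:title pair, while the ex:contact element ends up in `unmappable`: its name yields a property URI, but its content is a list of child elements rather than a string.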
The very purpose of the DC-XML-2003 document is to describe a binding for DC metadata descriptions: to provide a mapping from the conceptual models described to components within the XML syntax. The examples in section 6 fail to meet this requirement and this suggests that their inclusion represents an error in the specification.
This section examines the implications of the differences and omissions identified in the previous section.
The absence of the concepts of a resource URI or a value URI from the DC-XML-2003 models means that it is not possible to specify the URI of the subject resource or the URI of the value resource. These are fundamental requirements for DC metadata applications but the DC-XML-2003 binding does not support them.
Typically, metadata creators seek to compensate for this omission by adopting conventions involving the use of the "encoding scheme" http://purl.org/dc/terms/URI with a value string. However, the lack of clarity in the DC-XML-2003 models about what an encoding scheme is, and the absence of a distinction between vocabulary and syntax encoding schemes, render this approach problematic, as the following examples illustrate.
Consider the "resource URI" case first i.e. the intent is to provide a URI to identify the subject resource. Suppose the encoded data is:
<dc:identifier xsi:type="dcterms:URI">http://example.org/subject</dc:identifier>
The only interpretation of this fragment supported by the DC-XML-2003 model is:
Statement:
  Property URI: http://purl.org/dc/elements/1.1/identifier
  Encoding Scheme URI: http://purl.org/dc/terms/URI
  Value String: http://example.org/subject
If http://purl.org/dc/terms/URI is interpreted as an indicator of the type of the value i.e. as what the DCAM calls a vocabulary encoding scheme URI, then it conveys the information that the value, the identifier, is a URI (which is what the author wishes to convey).
Now consider the "value URI" case i.e. the intent is to provide a URI to identify the value resource. Suppose the encoded data is:
<dc:relation xsi:type="dcterms:URI">http://example.org/value</dc:relation>
As above, the only interpretation of this fragment supported by the DC-XML-2003 model is:
Statement:
  Property URI: http://purl.org/dc/elements/1.1/relation
  Encoding Scheme URI: http://purl.org/dc/terms/URI
  Value String: http://example.org/value
Now in this case an interpretation which presents http://purl.org/dc/terms/URI as an indicator of the type of the value - a vocabulary encoding scheme URI - contradicts the author's intent. Such an interpretation states that the value resource is a URI, but in this second case that is not the author's intent. The value resource is not itself a URI. It is a resource of some other type, identified by a URI.
Further, there may be examples of this second case where the dc:identifier property is used. The definition of dc:identifier permits any resource to act as an identifier of the subject resource, including a second resource which is itself identified by a URI. Again, in this case the value is not a URI but a resource of some other type, identified by a URI.
In short, if the DC-XML-2003 "encoding scheme" mechanism is interpreted as an indicator of the type of the value (as a vocabulary encoding scheme), the device of using the http://purl.org/dc/terms/URI scheme can not be used to represent value URIs.
Again consider the "resource URI" case first i.e. the intent is to provide a URI to identify the subject resource. Suppose the encoded data is:
<dc:identifier xsi:type="dcterms:URI">http://example.org/subject</dc:identifier>
The only interpretation of this fragment supported by the DC-XML-2003 model is:
Statement:
  Property URI: http://purl.org/dc/elements/1.1/identifier
  Encoding Scheme URI: http://purl.org/dc/terms/URI
  Value String: http://example.org/subject
If http://purl.org/dc/terms/URI is interpreted as the URI of a literal datatype, as what the DCAM calls a syntax encoding scheme URI, then this indicates that the URI http://example.org/subject "represents" the value resource. It does not say explicitly that the value resource itself is a resource of type URI (which is what the author intends). Although it does not result in a contradiction, it does not satisfy the author's intent of saying that the value is a URI, a resource of type URI.
Now consider the "value URI" case i.e. the intent is to provide a URI to identify the value resource. Suppose the encoded data is:
<dc:relation xsi:type="dcterms:URI">http://example.org/value</dc:relation>
As above, the only interpretation of this fragment supported by the DC-XML-2003 model is:
Statement:
  Property URI: http://purl.org/dc/elements/1.1/relation
  Encoding Scheme URI: http://purl.org/dc/terms/URI
  Value String: http://example.org/value
As above, an interpretation which presents http://purl.org/dc/terms/URI as a literal datatype - a syntax encoding scheme URI - indicates that the URI http://example.org/value "represents" the value resource. It does not say explicitly that the value resource itself is a resource of type URI (which in this case would contradict what the author intends). Although it does not result in a contradiction, neither does it satisfy the author's intent of saying that the URI http://example.org/value identifies the value resource.
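The two readings discussed in these examples can be put side by side in a short Python sketch. The function names, the dictionary keys and the prefix table are assumptions made for illustration. The decoder produces the single statement that the DC-XML-2003 model supports, and a second function then lists the two DCAM readings an interpreter might choose between.

```python
import xml.etree.ElementTree as ET

# Prefix bindings assumed for the fragments used in this document.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
XSI_TYPE = "{" + PREFIXES["xsi"] + "}type"


def decode(fragment):
    """Decode one element into the statement the DC-XML-2003 model supports."""
    declarations = " ".join(f'xmlns:{p}="{uri}"' for p, uri in PREFIXES.items())
    element = ET.fromstring(f"<wrapper {declarations}>{fragment}</wrapper>")[0]
    namespace, _, local = element.tag[1:].partition("}")
    statement = {"property_uri": namespace + local, "value_string": element.text}
    qname = element.get(XSI_TYPE)
    if qname is not None:
        # Resolve the QName in xsi:type against the assumed prefix table.
        prefix, _, name = qname.partition(":")
        statement["encoding_scheme_uri"] = PREFIXES[prefix] + name
    return statement


def candidate_dcam_readings(statement):
    """Return the two DCAM readings of a statement that has an encoding scheme."""
    scheme = statement["encoding_scheme_uri"]
    text = statement["value_string"]
    as_vocabulary = {
        "property_uri": statement["property_uri"],
        "vocabulary_encoding_scheme_uri": scheme,  # type of the value
        "value_strings": [{"text": text}],
    }
    as_syntax = {
        "property_uri": statement["property_uri"],
        "value_strings": [{"text": text, "syntax_encoding_scheme_uri": scheme}],
    }
    return as_vocabulary, as_syntax


decoded = decode(
    '<dc:relation xsi:type="dcterms:URI">http://example.org/value</dc:relation>'
)
readings = candidate_dcam_readings(decoded)
```

Neither reading contains a value URI: in both, http://example.org/value remains a value string, which is the gap the examples above describe.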
The examples above illustrate that there is no way of representing the DCAM concept of a value URI in the DC-XML-2003 binding. The resource URI case may seem slightly less clear, in so far as the interpretation of DC-XML-2003 encoding scheme as DCAM vocabulary encoding scheme does support the construction of statements using the dc:identifier property in which the value is a URI (i.e. the first example of the four above).
However, a single description may contain several such statements using the dc:identifier property. According to the DCAM, a description has only a single resource URI; if multiple such statements are provided, it is impossible to determine which of the values should be treated as the resource URI.
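A short Python sketch makes the ambiguity concrete. The record below is hypothetical, as are the tuple layout (property URI, encoding scheme URI, value string) and the function name. The function collects every dc:identifier value typed with http://purl.org/dc/terms/URI as a candidate resource URI.

```python
DC_IDENTIFIER = "http://purl.org/dc/elements/1.1/identifier"
DCTERMS_URI = "http://purl.org/dc/terms/URI"


def resource_uri_candidates(record):
    """List the values that might be intended as the resource URI."""
    return [
        value
        for (property_uri, encoding_scheme_uri, value) in record
        if property_uri == DC_IDENTIFIER and encoding_scheme_uri == DCTERMS_URI
    ]


# Hypothetical record with two URI-typed identifiers and one plain string.
record = [
    (DC_IDENTIFIER, DCTERMS_URI, "http://example.org/subject"),
    (DC_IDENTIFIER, DCTERMS_URI, "http://example.net/mirror/subject"),
    (DC_IDENTIFIER, None, "local-id-42"),
]

candidates = resource_uri_candidates(record)
is_ambiguous = len(candidates) > 1
```

With two candidates and nothing in the record to rank them, a decoder has no basis for choosing the single resource URI that a DCAM description may carry.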
This discussion has sought to illustrate that attempts to represent the DCAM concepts of resource URI and value URI using features of the DC-XML-2003 abstract models are problematic: the concepts are simply not part of those models and therefore the XML binding makes no provision for them.
Recommendation: Given the importance of resource URIs and value URIs in the DCAM, it is a requirement for an XML binding for DC metadata that both resource URIs and value URIs can be unambiguously encoded. This requirement is not met by the DC-XML-2003 specification.
In the DCAM, the concepts of vocabulary encoding scheme and syntax encoding scheme are clearly distinguished. They convey different pieces of information about other components of the statement. That distinction is not present in the DC-XML-2003 models, so any attempt to make such a distinction in the interpretation of an XML instance constructed according to the DC-XML-2003 specification is problematic. As the examples above illustrate, such attempts rely on ad hoc decisions by the interpreter about the intent of the author.
Furthermore, the DCAM supports the notion that a single statement may include a reference to a vocabulary encoding scheme and references to multiple value representations, each of which may include a reference to a syntax encoding scheme. Even if a subset of the DCAM is adopted such that a statement includes only a single value representation, a statement may still reference both a vocabulary encoding scheme and a syntax encoding scheme, whereas the DC-XML-2003 models allow only a single "encoding scheme" for each "value".
Recommendation: The distinction between vocabulary encoding scheme and syntax encoding scheme is significant in the DCAM, and it is a requirement that an XML binding for DC metadata should support that distinction. This requirement is not met by the DC-XML-2003 specification.
Earlier discussion [DC-ARCH-2004] recommended that the XML binding for DC metadata should support the encoding of multiple descriptions (as description sets).
The DC-XML-2003 binding maps "encoding schemes" to XML Schema datatypes. It is important to note that the mapping is not simply from "encoding scheme" to the value of an XML attribute with the name xsi:type. The "semantics" of the xsi:type XML attribute are defined by the XML Schema specification (and built into the behaviour of parsers and other tools that support XML Schema).
The Simple Type Definition (§2.2.1.2) or Complex Type Definition (§2.2.1.3) used in ·validation· of an element is usually determined by reference to the appropriate schema components. An element information item in an instance may, however, explicitly assert its type using the attribute xsi:type. [W3C-XMLS-STR]
In using the xsi:type XML attribute, then, the DC-XML-2003 binding is committed to building on the semantics of XML Schema: the value of an xsi:type attribute is the name of the datatype of the XML element.
The datatyping mechanism in XML Schema [W3C-XMLS-DAT] is designed to provide datatypes for XML document content i.e. for XML element content and for XML attribute values.
The DC-XML-2003 binding maps "encoding schemes" specifically to XML Schema complex types, which are applied to the content of the XML element which represents a "property-value" pair. As already noted, the content of that XML element is what DC-XML-2003 calls the "value" - what the DCAM calls the value string, a representation of the value resource.
The DCAM highlights that in fact "encoding schemes" in Dublin Core metadata are of two distinct types, which have two distinct functions in the DCAM:
- a vocabulary encoding scheme is a class of which the value is an instance; it is associated with the value
- a syntax encoding scheme is a datatype; it is associated with a value string
Given that in DC-XML-2003 the content of the XML element is mapped to the value string, the use of XML Schema datatypes to represent syntax encoding schemes seems consistent with the XML Schema specification. However trying to represent vocabulary encoding schemes using this construct seems more problematic.
Even for the syntax encoding scheme case, however, there is a difficulty. According to the models specified in DC-XML-2003 and in the DCAM, the use of an "encoding scheme"/syntax encoding scheme is optional: it is not required that "values" in DC-XML-2003 are associated with an "encoding scheme", and there is no "default" encoding scheme implied if none is specified in a DC metadata instance.
However, for a processor which applies XML Schema validation, all XML elements are associated with a datatype: the xsi:type attribute provides a means of specifying the datatype in the XML instance, but if no xsi:type attribute is provided in the instance, the processor determines the datatype by referring to the types specified by the relevant XML Schemas. It is not possible for the XML element to have no datatype, yet the DC-XML-2003 models and the DCAM both specify that the use of a (syntax) encoding scheme is optional.
In short, it is not clear that the DC-XML-2003 mapping of "encoding schemes" to XML Schema datatypes meets the requirements of the conceptual models it seeks to support.
A secondary issue, distinct from the mapping itself, is that the datatypes currently used in the XML Schemas provided by DCMI for DC-XML-2003 employ a type derivation strategy which some (many?) XML Schema processors do not handle correctly (the derivation of a complex type with simple content by restriction of a complex type with complex content) [XMLS-NOTES].
Leaving aside the issue of whether this is valid or not, the reason for providing the "base" mixed content type is to enable implementers to derive types which support markup in DC-XML-2003 "values" i.e. to support rich representations. However, the DC-XML-2003 conceptual models do not include support for rich representations, only for value strings. If rich representations are to be supported then that requires a change to the conceptual models and the provision of an appropriate mapping to a syntactic construct.
Two proposals for alternate XML bindings for Dublin Core metadata have been prepared: one which implements a subset of the features of the DCAM, and one which implements the full DCAM.
(To follow)
The approach taken in developing these proposals was that compatibility with the DC-XML-2003 binding was not a requirement.
Even if the decision is taken to limit the features of the DCAM supported by a new binding to some subset of the DCAM which is similar to the features of the DC-XML-2003 binding, the fact that the DC-XML-2003 models conflate in a single concept what the DCAM treats as two distinct concepts - vocabulary and syntax encoding schemes - raises fundamental problems. The DC-XML-2003 binding maps "encoding schemes" to a single syntactic construct (XML Schema complex types referenced as QName values of the xsi:type attribute). Given an XML instance constructed according to the DC-XML-2003 binding, it is impossible to distinguish vocabulary encoding scheme URIs from syntax encoding scheme URIs. It is not possible to interpret an XML instance constructed according to the DC-XML-2003 binding in terms of conceptual features which were not part of the models on which the DC-XML-2003 binding was based. As noted above, users of the DC-XML-2003 binding have resorted to ad hoc interpretations to try to address this, but the result is inconsistency, contradiction and ambiguity.
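The loss of information can be shown with a small Python sketch. The type and function names are hypothetical and the encoder is only an illustration of the single "encoding scheme" slot, not an implementation of the DC-XML-2003 binding. Two DCAM statements that differ only in which kind of encoding scheme they reference produce the same DC-XML-2003 "property-value pair".

```python
from typing import NamedTuple, Optional


class DcamStatement(NamedTuple):
    """A DCAM statement restricted to a single value string."""
    property_uri: str
    value_string: str
    vocabulary_encoding_scheme_uri: Optional[str] = None
    syntax_encoding_scheme_uri: Optional[str] = None


class Dc2003Pair(NamedTuple):
    """A DC-XML-2003 "property-value pair" with its one encoding scheme slot."""
    property_uri: str
    value: str
    encoding_scheme: Optional[str]


def encode(statement: DcamStatement) -> Dc2003Pair:
    """Map a DCAM statement onto the single-slot DC-XML-2003 model."""
    if statement.vocabulary_encoding_scheme_uri and statement.syntax_encoding_scheme_uri:
        raise ValueError("only one encoding scheme can be carried per value")
    scheme = (
        statement.vocabulary_encoding_scheme_uri
        or statement.syntax_encoding_scheme_uri
    )
    return Dc2003Pair(statement.property_uri, statement.value_string, scheme)


RELATION = "http://purl.org/dc/elements/1.1/relation"
URI_SCHEME = "http://purl.org/dc/terms/URI"

as_vocabulary = DcamStatement(
    RELATION, "http://example.org/value",
    vocabulary_encoding_scheme_uri=URI_SCHEME,
)
as_syntax = DcamStatement(
    RELATION, "http://example.org/value",
    syntax_encoding_scheme_uri=URI_SCHEME,
)

# Distinct DCAM statements collapse to the same encoded pair.
collapsed = encode(as_vocabulary) == encode(as_syntax)
```

Because the encoding is many-to-one, no decoder can recover from the encoded pair which of the two statements was intended.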
Constructing a binding which meets the requirements of representing essential features of the DCAM and is backwards compatible with the DC-XML-2003 binding does not seem possible, because the DC-XML-2003 binding is inherently ambiguous in its treatment of features of the DCAM.
This raises the question of how to manage the existence of two incompatible bindings - the DC-XML-2003 binding and a new XML binding (DC-XML-2005) that does explicitly support the DCAM (or some subset of it) - or a transition from DC-XML-2003 to DC-XML-2005:
Possible options include:
[DCAM]
DCMI Abstract Model
http://dublincore.org/documents/abstract-model/
[DC-XHTML]
Expressing Dublin Core in HTML/XHTML meta and link elements
http://dublincore.org/documents/dcq-html/
[DC-RDF]
Expressing Simple Dublin Core in RDF/XML
http://dublincore.org/documents/dcmes-xml/
[DCQ-RDF]
Expressing Qualified Dublin Core in RDF/XML
http://dublincore.org/documents/dcq-rdf-xml/
[DC-XML-2003]
Guidelines for implementing Dublin Core in XML
http://dublincore.org/documents/dc-xml-guidelines/
[XMLS-NOTES]
Notes on the W3C XML Schemas for Qualified Dublin Core
http://dublincore.org/schemas/xmls/qdc/2003/04/02/notes/
[DC-ARCH-2004]
DC Architecture WG meeting report, DC2004
http://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0410&L=dc-architecture&P=4565
[W3C-XMLS-STR]
XML Schema Part 1: Structures Second Edition
http://www.w3.org/TR/xmlschema-1/
[W3C-XMLS-DAT]
XML Schema Part 2: Datatypes Second Edition
http://www.w3.org/TR/xmlschema-2/
[DC-XML-2]
DC-XML: An XML format for Dublin Core metadata
[DC-XML-3]
DC-XML: An XML format for Dublin Core metadata