Conceptual relationships for encoding thesauri, classification systems and organised metadata collections and a proposal for encoding a core set of thesaurus relationships using an RDF Schema
Previous version: 2000-06-06 (DESIRE II report)
This version: 2001-01-24
Latest version URI: http://ilrt.org/discovery/2001/01/rdf-thes/
This document presents a specification for an XML/RDF representation of thesauri. We expect this work to evolve through feedback from other implementors. Our deployment experience to date suggests that the current proposal is both practical and useful; however, a number of open issues remain to be resolved. The proposal may change significantly as we attempt to resolve these issues; implementors should be aware of this.
Note (2001-01-24): this document is currently being revised; the version online at this address may change as we reorganise the structure and content of the proposal (danbri).
Comments and feedback should be directed to the publicly archived rdfthes-dev mailing list (rdfthes-dev@egroups.com).
This paper proposes an RDF representation of various conceptual relationships typical of controlled vocabularies such as thesauri, classification systems and organised metadata collections. The aim is to explore the use of RDF as a common formalism for representing a variety of different thesauri and classification systems within the same overall framework. By doing so, we expect to leverage generic RDF facilities (such as query and storage software components), and also to have a basis for mapping between subject classifications expressed using these various vocabularies.
The approach taken here is to divide the problem into two stages. Firstly we define a simple core RDF representation of concepts such as 'broader term' and 'narrower term' typically used in classification and thesauri systems. Then we extend this with a range of more semantically meaningful relationships expressed in terms of classes of objects. Many vocabulary systems have a tacit or unarticulated semantic model obscured behind relatively uninformative relationships such as 'broader' and 'narrower'. It is usually impossible to mechanically derive a richer set of relationships from a system based around these vague, generic relation types. General hierarchical relationships are frequently used to indicate one of several actual relationships. The relationships 'is a', 'has instantiation', and 'has part', for example, might all be encoded using the less informative 'narrower' relation.
The simpler 'core' relations are best thought of as being relationships between named concepts or terms, rather than as relations between real world (or abstract) entities. In other words, while we might say that "Fido is a dog" using a rich, semantic relationship, we would say that "the-term-Fido has-broader-term the-term-dog". The vague 'broader term' relation in this case subsumes the more informative 'is a' relation. The proposal in this document separates out these two approaches since it is crucial to remain unambiguous about when a node ('resource') in an RDF data model represents a named concept or term rather than some less abstract entity, i.e. the "thing in itself".
A further reason for creating two distinct RDF representations for these vocabulary systems is that RDF itself includes some common core vocabulary elements which overlap in functionality with the semantic modelling facilities required to transform simple flat vocabulary systems into richer knowledge bases. In particular, the RDF specifications define notions of 'Class', 'Property', 'subClassOf', 'type', 'domain' and 'range', which may be applicable to the task described above. By first addressing the need to find a simple RDF representation for the broader/narrower/preferred relationships, i.e. those simple relations which make sense in the context of terms/concepts rather than semantically modelled entities, we should be able to make some initial progress without having to solve the entire problem of 'knowledge modelling in RDF'.
The next section walks through the desired set of relationships, marking candidates for the simple core RDF vocabulary with the note "(For inclusion in the simple core vocabulary)". The following section sketches how a machine-processable RDF representation of the simple term-oriented concepts of a thesaurus might look, and finally, we give a machine-readable RDF Schema for the simple vocabulary.
In the definitions given below, the term Category is used when the relationship applies to classification systems, Term when the relationship applies to thesauri, and Document when the relationship can be used with individual documents.
A HIERARCHICAL RELATIONSHIPS
Label: BroaderTerm
Term: Broader term
Member of: A
Definition: Term one level up in a hierarchy, without specification of the type of hierarchical relationship
Label: NarrowerTerm
Term: Narrower term
Member of: A
Definition: Term one level down in a hierarchy, without specification of type of hierarchical relationship
(For inclusion in the simple core vocabulary)
A1 GENERIC RELATIONSHIP
Label: IsA
Term: is a (instance of)
Member of: A1
Definition: Term/Category is an instance of a term/category one level up in the hierarchy
Label: HasInstantiation
Term: has instantiation
Member of: A1
Definition: Term/Category has an instantiation one level below in the hierarchy
A2 Whole-part relationships
Label: IsPartOf
Term: is part of
Member of: A2
Definition: Document/Category/Term represents an (unspecified) part of a document/category/term one level up in the hierarchy
Label: HasPart
Term: has part
Member of: A2
Definition: Document/Category/Term has an (unspecified) part one level below in the hierarchy
Label: IsSpatialPartOf
Term: is spatial part of
Member of: A2
Definition: Term/Category represents a spatial/geographical part of a term/category one level up in the hierarchy
Label: HasSpatialPart
Term: has spatial part
Member of: A2
Definition: Term/Category has a spatial/geographical subterm/subcategory one level below in the hierarchy
Label: IsConceptuallyPartOf
Term: is conceptually part of
Member of: A2
Definition: Term/Category is a subconcept to a term/category one level up in the hierarchy
Label: HasConceptualPart
Term: has conceptual part
Member of: A2
Definition: Term/Category has a subterm/subcategory one level below in the hierarchy
Label: IsCollectionMemberOf
Term: is collection member of
Member of: A2
Definition: Document/Category/Term is member of a group of documents/categories/terms
Label: HasCollectionMember
Term: has collection member
Member of: A2
Definition: Group of documents/categories/terms has member
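The way the richer hierarchical relationships of section A collapse into the generic broader/narrower pair can be sketched in a few lines of Python. This is only an illustration: the mapping table, the function name `to_core` and the "wheel is part of car" statement are assumptions made for the example, and the direction assigned to each label (upward relations become BroaderTerm, downward ones NarrowerTerm) is our reading of the definitions above rather than part of the proposal.

```python
# Hypothetical mapping from the section A labels to the two generic core
# relations. The direction chosen for each label is an assumption based on
# the "one level up" / "one level below" wording of the definitions.
CORE_EQUIVALENT = {
    "IsA": "BroaderTerm",
    "IsPartOf": "BroaderTerm",
    "IsSpatialPartOf": "BroaderTerm",
    "IsConceptuallyPartOf": "BroaderTerm",
    "IsCollectionMemberOf": "BroaderTerm",
    "HasInstantiation": "NarrowerTerm",
    "HasPart": "NarrowerTerm",
    "HasSpatialPart": "NarrowerTerm",
    "HasConceptualPart": "NarrowerTerm",
    "HasCollectionMember": "NarrowerTerm",
}


def to_core(statement):
    """Collapse a (subject, relation, object) statement to a core relation.

    Relations that are not in the table are returned unchanged.
    """
    subject, relation, obj = statement
    return (subject, CORE_EQUIVALENT.get(relation, relation), obj)


# Two different rich relations end up as the same generic relation,
# which is why the richer meaning cannot be recovered afterwards.
rich = [("Fido", "IsA", "dog"), ("wheel", "IsPartOf", "car")]
core = [to_core(s) for s in rich]
```

Going in the other direction is not possible mechanically: once both statements read BroaderTerm, nothing in the data says which was an 'is a' and which a 'part of'.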
B EQUIVALENCE RELATIONSHIPS
B1 Single directional equivalence
Label: Use
Term: use, see
Member of: B1
Definition: The term/category pointed to should be preferred
Label: UsedFor
Term: used for
Member of: B1
Definition: The term/category pointed to is the non-preferred term/category
(For inclusion in the simple core vocabulary)
Label: IsVersionOf
Term: is version of
Member of: B1
Definition: The document/category/term pointed to is a version of another document/category/term
Label: HasVersion
Term: has version
Member of: B1
Definition: The document/category/term has another version
B2 Bi-directional equivalence
Label: IsSynonymOf
Term: is synonym of
Member of: B2
Definition: The term is a synonym of the one pointed to
Label: IsFormatOf
Term: is format of
Member of: B2
Definition: The document is a result of a format transformation of the one pointed to
C ASSOCIATIVE RELATIONSHIPS
Label: RelatedTerm
Term: related term, see also, similar to
Member of: C
Definition: The document/category/term pointed to is related (in an unspecified way)
(For inclusion in the simple core vocabulary)
Label: IsReferencedBy
Term: is referenced by
Member of: C
Definition: The document is referenced by the document pointed to
Label: References
Term: references
Member of: C
Definition: The document is referencing the document pointed to
Label: IsRequiredBy
Term: is required by
Member of: C
Definition: The document/object is required by/dependent on the document pointed to
Label: Requires
Term: requires
Member of: C
Definition: The document requires/is dependent on the document/object pointed to
Label: IsBasedOn
Term: is based on
Member of: C
Definition: The document/term is based on the document/term/object pointed to
Label: IsBasisFor
Term: is basis for
Member of: C
Definition: The document/term/object is the basis for the document/term pointed to
Label: IsDerivedFrom
Term: is derived from
Member of: C
Definition: The document/term is derived from the document/term pointed to
Label: HasDerivate
Term: has derivate
Member of: C
Definition: The document/term has the derivate pointed to
Label: IsTranslatedFrom
Term: is translated from
Member of: C
Definition: The document/term is translated from the document/term pointed to
Label: HasTranslation
Term: has translation
Member of: C
Definition: The document/term has the translation pointed to
Label: IsInterpretationOf
Term: is interpretation of
Member of: C
Definition: The document is a (creative, artistic) interpretation of the document/object pointed to
Label: HasInterpretation
Term: has interpretation
Member of: C
Definition: The document has a (creative, artistic) interpretation pointed to
Label: IsMappedTo
Term: is mapped to
Member of: C
Definition: The document/category/term is mapped to the document/category/term pointed to
Label: HasMapping
Term: has mapping
Member of: C
Definition: The document/category/term has this document/category/term mapped to it
Label: IsLinkedFrom
Term: is linked from
Member of: C
Definition: The document/category/term is linked to from the document/category/term pointed to
Label: HasLinkTo
Term: has link to
Member of: C
Definition: The document/category/term is linking to the document/category/term pointed to
Label: IsSameLevelNeighbour
Term: is same level neighbour
Member of: C
Definition: The document/category/term is a neighbour on the same level of an organisational structure to the document/category/term pointed to
Label: IsTopologicalNearestNeighbour
Term: is topological nearest neighbour
Member of: C
Definition: The document/category/term is a topologically nearest neighbour in an organisational structure to the document/category/term pointed to
The relationships marked above for inclusion in the simple core vocabulary are candidates for a simple core set of relationships for a thesaurus. They are:
· BroaderTerm
· NarrowerTerm
· Use
· UsedFor
· RelatedTerm
This terminology is taken from ISO 2788: Guidelines for the establishment and development of monolingual thesauri (International Organization for Standardisation, 1986).
The terminology deals with the terms themselves, that is, the lexical representation of concepts. For the creation of an RDF schema for storing structured vocabularies, we decided to differentiate between the lexical representation of a concept and the concept itself. It was felt that the unique resource should be the concept, each concept resource being indicated by one or more term resources. Thus the RDF resource used to represent cats would be indicated by a term whose value was the word "cats". This is represented by the graph in figure 3 below.
Figure 3. RDF graph representation of the concept representing a cat (concept 5)
In figure 3, concept_5 represents the concept of cats. Its indicator is a term (term_7) whose value is the text string "cats". Another term indicating the concept might have the value "chats".
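The concept/term split described for figure 3 can be pictured as two small record types. The following Python sketch is illustrative only: the class and field names are our own, and `term_8` is a made-up identifier for the second term, which the text mentions without naming.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A lexical form; `value` holds the text string, e.g. "cats"."""
    uri: str
    value: str


@dataclass
class Concept:
    """The unique resource; it is indicated by one or more Term resources."""
    uri: str
    indicators: list = field(default_factory=list)


# concept_5 and term_7 come from figure 3; term_8 is a hypothetical
# identifier for the second indicating term with the value "chats".
cats = Concept("concept_5")
cats.indicators.append(Term("term_7", "cats"))
cats.indicators.append(Term("term_8", "chats"))

labels = [t.value for t in cats.indicators]
```

The point of the design is that statements about the concept attach to `concept_5`, however many terms indicate it.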
As a result of the above approach, the RDF schema refers to relationships between concepts rather than between terms, and this is reflected in the vocabulary used below, e.g. broaderConcept rather than broaderTerm.
Whilst the relationships 'broader', 'narrower' and 'related' are still meaningful when considering concepts rather than terms, the relationships 'use' and 'used for' refer only to terms. This is because 'use' and 'used for' indicate which particular term has been chosen to represent the relevant concept when indexing some resource. For the core RDF vocabulary, then, these relationships have instead been represented by a property of the term resources, 'termUsage', which has the values 'preferred' or 'nonPreferred'. The second issue considered was that, since broaderTerm and narrowerTerm are inverses of each other, i.e.
A narrowerTerm B implies
B broaderTerm A,
utilising both relationships when storing or transferring the vocabulary data would be inefficient. We therefore decided to create a relationship 'broaderConcept' for the RDF Schema but not 'narrowerConcept', as the latter is implied; it is the responsibility of any application using the data to deduce the opposite relationship and present it to the user.
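An application might deduce the implied relationship along the following lines. This is a minimal Python sketch that assumes statements are held as (subject, property, object) tuples; the function name is hypothetical, and the identifiers CID_6 and CID_8 are borrowed from the example data given later in this document.

```python
def narrower_from_broader(triples):
    """Return the implied narrowerConcept statement for each stored
    broaderConcept statement, by swapping subject and object."""
    return [
        (obj, "narrowerConcept", subj)
        for (subj, prop, obj) in triples
        if prop == "broaderConcept"
    ]


# Only the broaderConcept statement is stored or transferred.
stored = [("CID_6", "broaderConcept", "CID_8")]

# The opposite direction is computed when it is needed.
implied = narrower_from_broader(stored)
```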
The second relationship between concepts chosen for the schema was 'relatedConcept'. This relationship is symmetric (bi-directional), and hence if the relationship
A relatedConcept B exists, then it is implied that
B relatedConcept A is also true.
Hence we only add one of the two possible pairs to the datastore.
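Because only one of the two pairs is stored, a lookup has to read each stored pair in both directions. A possible sketch in Python, again assuming statements held as (subject, property, object) tuples and using a hypothetical function name:

```python
def related_to(concept, triples):
    """Return every concept related to `concept`, treating each stored
    relatedConcept pair as holding in both directions."""
    found = set()
    for subj, prop, obj in triples:
        if prop != "relatedConcept":
            continue
        if subj == concept:
            found.add(obj)
        elif obj == concept:
            found.add(subj)
    return found


# A single stored pair answers queries from either end.
stored = [("CID_6", "relatedConcept", "CID_15")]
from_subject = related_to("CID_6", stored)
from_object = related_to("CID_15", stored)
```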
A further attribute often used within thesauri is 'top term'. This indicates a term that is at the top of a hierarchy within the thesaurus. Since this is a property that may be deduced by an application from the lack of a broaderConcept property for that concept, this attribute is also left out of the schema.
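The deduction of top terms could look like the sketch below. It assumes the application holds the complete set of concepts and statements for the thesaurus; with partial data, a concept whose broaderConcept statement has not been loaded would wrongly appear as a top concept. The function name and the small data set are illustrative only.

```python
def top_concepts(concepts, triples):
    """Return, in sorted order, the concepts that have no broaderConcept
    statement and are therefore at the top of a hierarchy."""
    has_broader = {subj for (subj, prop, obj) in triples
                   if prop == "broaderConcept"}
    return sorted(c for c in concepts if c not in has_broader)


concepts = ["CID_6", "CID_8", "CID_15"]
triples = [
    ("CID_6", "broaderConcept", "CID_8"),
    ("CID_6", "relatedConcept", "CID_15"),
]
tops = top_concepts(concepts, triples)
```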
broaderConcept and relatedConcept were therefore selected as the only two core relationships between concepts that would be required for a basic RDF vocabulary schema. Other properties are required, however, to allow the encoding of thesauri, taking into account the recommendations of ISO 2788 (International Organization for Standardisation, 1986) and general thesaurus usage. These are listed in the next section, which describes the proposed RDF thesaurus schema.
The Resource Description Framework (RDF) is a W3C (World Wide Web Consortium, 2000) recommendation for representing structured data on the Web. RDF, like both the Web and thesaurus systems, is based around a strategy of managing information as a collection of links between uniquely named entities. RDF's Web-based information model uses the term 'resource' to refer to the entities that it models, and provides an application-neutral framework within which various kinds of entities and relationships can be described. A general introduction to RDF is beyond the scope of this document. The W3C home page for RDF (Swick, 2000) lists a number of introductory tutorials as well as the RDF specifications.
In this document we describe the application of RDF to the description of thesaurus-like data structures. Specifically, we show how the RDF data model can represent a Web of inter-related concepts and terms from one or more thesauri. To do this, we define a simple RDF vocabulary that uses Web identifiers (Uniform Resource Identifiers) to name some relationships and resource types useful for the description of concepts and terms in a thesaurus. It should be noted that we do not here attempt to model the richer semantic relationships that hold between the entities denoted by such concepts, although RDF itself can also be used to represent this kind of information.
The XML/RDF thesaurus schema is set out in Appendix A. An example set of XML/RDF thesaurus data is given in Appendix B.
As described above, the schema consists of two main resources: Concept and Term. Concept resources are related by the properties: 'broaderConcept' and 'relatedConcept'. Concepts have a property 'indicator' which points to one or more term resources. The value of each Term resource will be the actual text string.
As noted above, the Term resources have an optional property called 'termUsage', which can be used with those thesauri that have non-preferred terms linked to preferred terms through the 'use'/'used for' relationships. The value of termUsage must be one of the two resources 'preferred' or 'nonPreferred', both of which are instances of the class TermUsageValue.
A second Term property is 'lang', which can be used to indicate the language of the term; thus a single concept can be 'indicated' by both preferred and non-preferred terms, and by terms from different languages (there is likely to be one preferred term for each language). The thesaurus schema therefore provides a mechanism for storing multilingual thesauri. If an English term and a German term both 'indicate' the same concept resource, it is implied that the two terms are either equivalent, or at least are treated as such for indexing purposes.
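Selecting the label to display for a concept in a given language then amounts to filtering its indicating terms on 'lang' and 'termUsage'. The Python sketch below assumes each term is held as a plain dictionary; the function name and the sample values (including the non-preferred term "felines") are invented for the illustration.

```python
def preferred_label(terms, lang):
    """Return the value of the preferred term in `lang`, or None if the
    concept has no preferred term in that language."""
    for term in terms:
        if term.get("lang") == lang and term.get("termUsage") == "preferred":
            return term["value"]
    return None


# Terms indicating a single hypothetical concept: one preferred term per
# language, plus a non-preferred English term.
terms = [
    {"value": "cats", "lang": "en", "termUsage": "preferred"},
    {"value": "felines", "lang": "en", "termUsage": "nonPreferred"},
    {"value": "chats", "lang": "fr", "termUsage": "preferred"},
]

english = preferred_label(terms, "en")
french = preferred_label(terms, "fr")
german = preferred_label(terms, "de")
```

Since both 'lang' and 'termUsage' are optional, the lookup uses `get` so that terms lacking either property are simply skipped.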
It may be considered necessary to recognise relationships between terms of different languages other than 'exactly equivalent', such as recognising that the equivalent term is broader in meaning, or where a single term in one language can be represented by two or more terms in another. In such a case, separate sets of concepts could be used for the different languages, with a new set of properties devised to indicate the different types of relationship between them, rather than using the 'lang' property.
There are two further optional properties that are permissible for Concept resources: 'scope' and 'conceptCode'.
· The value of the scope property is an instance of the class 'ScopeNote', which also has a 'lang' property, and whose value is the text of the scope note. A scope note is defined in ISO 2788 as "a note attached to a term to indicate its meaning within an indexing language" (International Organization for Standardisation, 1986).
· The property 'conceptCode' can be used for any code that is assigned to the preferred terms in a systematic thesaurus. In ISO 2788, the property 'address code' is defined as a code which links terms in an alphabetical index to their location in the systematic section. They "should have obvious filing values … may consist simply of running numbers … or may comprise a system of hierarchically expressive notation" (International Organization for Standardisation, 1986). Such a code will be unique for each concept in an RDF version of the thesaurus and might perhaps be useful in providing a language neutral method for indexing documents. In other thesauri, there may be non-unique codes, such as notations that associate the terms to broad subject categories, and such codes could also be held as values of the conceptCode attribute. Any unique code associated with the preferred terms in a thesaurus could also be usefully incorporated into the URI of the Concept resources, as this would be an aid in future management of the data (for instance for updates to the database). However, the conceptCode property has also been provided as a means of storing such information if required.
<rdf:RDF xml:lang="en"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#">
  <rdfs:Class rdf:ID="Concept">
    <rdfs:comment>
      A unique concept defined within a thesaurus. Instances
      use the rdfs:isDefinedBy property with a vocabulary
      namespace as its value, to indicate the vocabulary to
      which the concept belongs.
    </rdfs:comment>
    <rdfs:subClassOf
      rdf:resource="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Resource"/>
  </rdfs:Class>
  <rdfs:Class rdf:ID="Term">
    <rdfs:comment>
      Instances of this class represent the written forms of
      Concepts. The string is given by the rdf:value of Term.
    </rdfs:comment>
    <rdfs:subClassOf
      rdf:resource="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Resource"/>
  </rdfs:Class>
  <rdfs:Class rdf:ID="ScopeNote">
    <rdfs:comment>
      The value of this optional resource is a scope note:
      a note attached to a term to indicate its meaning within
      an indexing language.
    </rdfs:comment>
    <rdfs:subClassOf
      rdf:resource="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Resource"/>
  </rdfs:Class>
  <rdfs:Class rdf:ID="TermUsageValue">
    <rdfs:comment>
      The value of the property: termUsage. It can take one of two
      values: 'preferred' or 'nonPreferred'.
    </rdfs:comment>
    <rdfs:subClassOf
      rdf:resource="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Resource"/>
  </rdfs:Class>
  <rdf:Property rdf:ID="broaderConcept">
    <rdfs:comment>
      This schema does not define a property 'narrowerConcept',
      but applications can assume the existence of a property
      narrowerConcept such that if:
      {broaderConcept,ConceptA,ConceptB}, then
      {narrowerConcept,ConceptB,ConceptA} is true.
    </rdfs:comment>
    <rdfs:domain rdf:resource="#Concept"/>
    <rdfs:range rdf:resource="#Concept"/>
  </rdf:Property>
  <rdf:Property rdf:ID="relatedConcept">
    <rdfs:comment>
      The relatedConcept property is symmetric, such that if:
      {relatedConcept,ConceptA,ConceptB}, then
      {relatedConcept,ConceptB,ConceptA} is true.
    </rdfs:comment>
    <rdfs:domain rdf:resource="#Concept"/>
    <rdfs:range rdf:resource="#Concept"/>
  </rdf:Property>
  <rdf:Property rdf:ID="indicator">
    <rdfs:comment>
      A mandatory property of a Concept whose value is
      the Term instance representing a written form of the
      Concept. A Concept may have as an indicator more than
      one Term. A Term may only be an indicator of one
      Concept.
    </rdfs:comment>
    <rdfs:domain rdf:resource="#Concept"/>
    <rdfs:range rdf:resource="#Term"/>
  </rdf:Property>
  <rdf:Property rdf:ID="conceptCode">
    <rdfs:comment>
      An optional property for any code assigned to the
      thesaurus concepts.
    </rdfs:comment>
    <rdfs:domain rdf:resource="#Concept"/>
    <rdfs:range
      rdf:resource="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal"/>
  </rdf:Property>
  <rdf:Property rdf:ID="scope">
    <rdfs:comment>
      This optional property has as its value an instance of
      the resource ScopeNote.
    </rdfs:comment>
    <rdfs:domain rdf:resource="#Concept"/>
    <rdfs:range rdf:resource="#ScopeNote"/>
  </rdf:Property>
  <rdf:Property rdf:ID="lang">
    <rdfs:comment>
      Optional property that can be used to give the language
      of a Term instance. The codes from "ISO 639:1988,
      Code for the representation of names of languages" should
      be used as the values for this property.
    </rdfs:comment>
    <rdfs:domain rdf:resource="#Term"/>
    <rdfs:range
      rdf:resource="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#Literal"/>
  </rdf:Property>
  <rdf:Property rdf:ID="termUsage">
    <rdfs:comment>
      This optional property indicates whether the Term
      instance is the 'preferred' or 'nonPreferred' textual
      expression of the Concept instance that is 'indicated'
      by the Term, for a given language.
    </rdfs:comment>
    <rdfs:domain rdf:resource="#Term"/>
    <rdfs:range rdf:resource="#TermUsageValue"/>
  </rdf:Property>
  <rdf:Description rdf:ID="preferred">
    <rdf:type rdf:resource="#TermUsageValue"/>
  </rdf:Description>
  <rdf:Description rdf:ID="nonPreferred">
    <rdf:type rdf:resource="#TermUsageValue"/>
  </rdf:Description>
</rdf:RDF>
The example below shows the relationships between three concepts, whose term values are 'Interpersonal Attraction', 'Interpersonal Relations' and 'Friends'. A graph representation of the RDF follows the XML representation (excluding the scope property).
<web:RDF xml:lang="en"
  xmlns:thes="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#"
  xmlns:web="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#">
  <web:Description web:about="http://sosig.ac.uk/hasset/terms/TID_3">
    <web:type web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#Term"/>
    <thes:lang>en</thes:lang>
    <web:value>Interpersonal Attraction</web:value>
    <thes:termUsage web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#preferred"/>
  </web:Description>
  <web:Description web:about="http://sosig.ac.uk/hasset/concepts/CID_6">
    <web:type web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#Concept"/>
    <rdfs:isDefinedBy web:resource="http://sosig.ac.uk/hasset/concepts/"/>
    <thes:indicator web:resource="http://sosig.ac.uk/hasset/terms/TID_3"/>
    <thes:conceptCode>768</thes:conceptCode>
    <thes:broaderConcept>
      <web:Description web:about="http://sosig.ac.uk/hasset/concepts/CID_8">
        <rdfs:isDefinedBy web:resource="http://sosig.ac.uk/hasset/concepts/"/>
        <thes:indicator web:resource="http://sosig.ac.uk/hasset/terms/TID_15"/>
        <thes:conceptCode>769</thes:conceptCode>
      </web:Description>
    </thes:broaderConcept>
    <thes:relatedConcept web:resource="http://sosig.ac.uk/hasset/concepts/CID_15"/>
  </web:Description>
  <web:Description web:about="http://sosig.ac.uk/hasset/concepts/CID_15">
    <web:type web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#Concept"/>
    <rdfs:isDefinedBy web:resource="http://sosig.ac.uk/hasset/concepts/"/>
    <thes:indicator web:resource="http://sosig.ac.uk/hasset/terms/TID_21"/>
    <thes:conceptCode>780</thes:conceptCode>
    <thes:scope web:resource="http://sosig.ac.uk/hasset/scopenotes/SN_12"/>
  </web:Description>
  <web:Description web:about="http://sosig.ac.uk/hasset/terms/TID_15">
    <web:type web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#Term"/>
    <thes:lang>en</thes:lang>
    <web:value>Interpersonal Relations</web:value>
    <thes:termUsage web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#preferred"/>
  </web:Description>
  <web:Description web:about="http://sosig.ac.uk/hasset/terms/TID_21">
    <web:type web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#Term"/>
    <thes:lang>en</thes:lang>
    <web:value>Friends</web:value>
    <thes:termUsage web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#preferred"/>
  </web:Description>
  <web:Description web:about="http://sosig.ac.uk/hasset/scopenotes/SN_12">
    <web:type web:resource="http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#ScopeNote"/>
    <thes:lang>en</thes:lang>
    <web:value>To be used only for platonic relationships</web:value>
  </web:Description>
</web:RDF>
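To give an idea of how an application might read such data, the following Python sketch resolves each concept's 'indicator' links to the text of the terms they point to. It uses only the standard library XML parser rather than an RDF toolkit, works on a cut-down fragment of the example, and assumes that the about and resource attributes are written with the RDF namespace prefix; the function name `concept_labels` is our own. It is not a general RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
THES = "http://snowball.ilrt.bris.ac.uk/~pldab/rdf-dot/Thes/Thes.xrdf#"

# A cut-down fragment of the example data: one term and one concept.
SAMPLE = f"""
<web:RDF xmlns:web="{RDF}" xmlns:thes="{THES}">
  <web:Description web:about="http://sosig.ac.uk/hasset/terms/TID_3">
    <thes:lang>en</thes:lang>
    <web:value>Interpersonal Attraction</web:value>
  </web:Description>
  <web:Description web:about="http://sosig.ac.uk/hasset/concepts/CID_6">
    <thes:indicator web:resource="http://sosig.ac.uk/hasset/terms/TID_3"/>
  </web:Description>
</web:RDF>
"""


def concept_labels(xml_text):
    """Map each concept URI to the text values of the terms indicating it."""
    root = ET.fromstring(xml_text)
    values = {}      # term URI -> text string
    indicators = {}  # concept URI -> list of term URIs
    for node in root.iter(f"{{{RDF}}}Description"):
        about = node.get(f"{{{RDF}}}about")
        value = node.find(f"{{{RDF}}}value")
        if value is not None:
            values[about] = value.text
        for ind in node.findall(f"{{{THES}}}indicator"):
            indicators.setdefault(about, []).append(
                ind.get(f"{{{RDF}}}resource"))
    # Terms that are referenced but not described in the data are skipped.
    return {concept: [values[t] for t in term_uris if t in values]
            for concept, term_uris in indicators.items()}


labels = concept_labels(SAMPLE)
```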
The following issues require further attention. Future versions of this document might attempt this (but then again, they might not...)
International Organization for Standardisation. 1986. ISO 2788: Guidelines for the establishment and development of monolingual thesauri, 2nd ed., Geneva: ISO.
Swick, Ralph et al. (Accessed June 2000). W3C Resource Description Framework. http://www.w3.org/RDF/
World Wide Web Consortium. (Accessed June 2000). W3C – The World Wide Web Consortium. http://www.w3.org/