Is He The One? Subject Identification in Topic Maps

Introduction

Topic Map content is about things (subjects), be they tangible things such as a particular car, abstractions such as the class 'car', or even human-centred concepts such as 'love' and 'hate'. It also includes things which exist only electronically, such as online documents. Topics themselves are only representatives: they represent (act as a proxy for) the subject; the subject itself, of course, exists outside the topic map. Equivalently, we sometimes say that a subject is reified by a topic.

The problem the topic map author faces is how to connect a topic with the subject outside the map. Merely naming a topic 'Stalin' does not guarantee that every other person in this universe understands that we mean exactly the Soviet dictator in power during the Second World War.

According to the Topic Map paradigm this problem is resolved using subject addresses and subject indicators. This short feature presents the options Topic Map authors have to refer to the outside world. As notation we will use XTM [XTM 2000].

Identification

The challenge of identifying things is a recurring one. In the universe of Topic Maps, identification is relevant for humans and machines alike. Human authors and map maintainers must be certain about the things they are writing about, and machines need a sufficiently exact way to establish identity whenever they are supposed to aggregate content, say, by merging two or more maps.

This concept can be rephrased as: there shall be only one topic per subject in a map. Often this is called the collocation invariant or - with the charm of a 4-letter word - the SLUO (subject location uniqueness objective).

We distinguish two kinds of subjects: those which have an address and those which do not. We will start with the first, simpler situation and then discuss how to proceed with non-addressable subjects.

Addressable Subjects

Most of you are familiar with URLs. These are addresses of documents on the web. While URLs suffer from a number of conceptual problems, they have become a widespread and convenient way to refer to online objects such as text documents, images, etc.

Given that, every document - actually any network resource - is a potential Topic Map subject, its URL being an address which can be used unambiguously to refer to the document from inside a topic map. Using XTM we can create a topic representing an image:

<topic id="pic-jalta">
   <subjectIdentity>
      <resourceRef xlink:href="http://history1900s.about.com/library/graphics/fdr102.gif"/>
   </subjectIdentity>
   <!-- other topic information -->
</topic>

For obvious reasons there can be only one such resourceRef per topic, which the original XTM standard [XTM 2000] also makes clear syntactically. (Later standard proposals relax this and allow a set.) This single identity is proof enough for a Topic Map processor to merge this topic with any other topic which happens to have exactly the same address, regardless of whether that topic is in the same topic map document or in another merged-in map.
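The merging rule for addressable subjects can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular Topic Map processor is implemented: the `Topic` class, the `merge_by_address` function, the second topic id and the topic names are all made up, and "exactly the same address" is taken to mean plain string equality without any URI normalisation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Topic:
    """Hypothetical minimal topic: an id, an optional subject address, names."""
    topic_id: str
    subject_address: Optional[str] = None
    names: set = field(default_factory=set)


def merge_by_address(topics):
    """Collapse topics that carry exactly the same subject address."""
    survivors = {}  # subject address -> first topic seen with that address
    result = []
    for topic in topics:
        address = topic.subject_address
        if address is None:
            # No address, so this rule gives no reason to merge.
            result.append(topic)
            continue
        survivor = survivors.get(address)
        if survivor is None:
            survivors[address] = topic
            result.append(topic)
        else:
            # Same address means same subject: fold the names together.
            survivor.names |= topic.names
    return result


URL = "http://history1900s.about.com/library/graphics/fdr102.gif"
map_a = [Topic("pic-jalta", URL, {"Yalta photograph"})]
map_b = [Topic("big-three", URL, {"The Big Three"}), Topic("stalin")]

# Merging the two maps leaves one topic for the image plus the unrelated one.
merged_topics = merge_by_address(map_a + map_b)
```

The topic without an address is left alone: the absence of an address is no evidence for or against identity.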

You may observe that this makes no distinction between addressing a single document and - using the same URL - addressing a whole site. Using http://topicmaps.bond.edu.au/ leaves it to the interpretation of the map whether a topic is about the whole site or the index page only. This may be a shortcoming in the current Topic Map formalism.

Note also that it is not so much an issue whether the document can actually be downloaded; here the address only serves as a convenient identification. So not only URLs can serve as subject addresses; URNs (uniform resource names, [RFC2141]) can also uniquely identify a subject even though it is not a downloadable document. URNs are less well-known and assume that a particular name is only valid in a particular namespace. Some of these namespaces have already been registered, for instance for ISBN and ISSN. You are free to apply for a new one or - simpler - to use one privately. URNs are quite practical to use together with Topic Maps whenever you already have an existing namespace, say, for objects in a relational database. Instead of resorting to cryptic URLs to point to these objects you use a private URN namespace which a part of your application resolves.
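How an application might resolve such private URNs can be sketched as follows. Everything specific here is an assumption made for illustration: the namespace identifier `x-shop`, the convention that the namespace-specific string is `<table>:<primary key>`, and the in-memory dictionary standing in for a relational database. The pattern follows the basic `urn:<NID>:<NSS>` shape of [RFC2141] but does not check the full character rules for the namespace-specific string.

```python
import re

# "urn:" <NID> ":" <NSS>; the "urn" prefix and the NID are case-insensitive.
URN_PATTERN = re.compile(
    r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE
)


def parse_urn(urn):
    """Split a URN into (namespace identifier, namespace-specific string)."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    nid, nss = match.groups()
    return nid.lower(), nss


# Made-up stand-in for a relational database: (table, primary key) -> row.
FAKE_DATABASE = {("customer", "42"): {"name": "Example Customer"}}


def resolve(urn):
    """Resolve a URN of the hypothetical private namespace to a row."""
    nid, nss = parse_urn(urn)
    if nid != "x-shop":
        raise LookupError(f"unknown namespace: {nid}")
    table, _, key = nss.partition(":")
    return FAKE_DATABASE.get((table, key))  # None if there is no such row


row = resolve("urn:x-shop:customer:42")
```

Inside the topic map the URN is used just like a URL, as the value of a resourceRef; only the application needs to know how to turn it back into a database lookup.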

Non-addressable Subjects

Most subjects in this universe do not have an address. Human beings do not have an address although it can be argued that most of them have an identity.

While these things might not have a direct address, they might have an indirect one, i.e. the address of a resource which is about the subject in question. So instead of using a URI (URL or URN) as subject address we can use it as an indication of the identity of the subject: the document referred to simply has to be sufficiently authoritative about the subject.

In XTM we would use the <subjectIdentity> element [XTM 2000] in a topic:

<topic id="stalin">
   <subjectIdentity>
      <subjectIndicatorRef xlink:href="http://www.stel.ru/stalin/" />
   </subjectIdentity>
   <!-- other topic information -->
</topic>

Maybe you would expect such URIs to be named subject indicators, but for historical reasons this term is reserved for the document indicating the identity; the URI itself is named subject identifier, which some people may find rather misleading.

As there is no unique choice for such an indicator, the author of the map is free to add more subject indicators [XTM 2000]. In fact, the more such hints are given to the Topic Map processor, the higher the probability that an overlap of subject indicators will cause topics to merge:

<topic id="stalin">
   <subjectIdentity>
      <subjectIndicatorRef xlink:href="http://www.stel.ru/stalin/"/>
      <subjectIndicatorRef xlink:href="http://www.bbc.co.uk/education/modern/stalin/stalihtm.htm"/>
      <subjectIndicatorRef xlink:href="http://www.marxists.org/reference/archive/stalin/"/>
   </subjectIdentity>
   <!-- other topic information -->
</topic>
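Why additional indicators raise the chance of merging can be illustrated with a small Python sketch. It assumes a simplified rule, namely that two topics are merged as soon as they share at least one subject identifier, and that merging is transitive. The function name `merge_by_identifiers` and the topics `josef-stalin` and `uncle-joe` with their example.org URI are hypothetical.

```python
def merge_by_identifiers(topics):
    """Group topics that share at least one subject identifier.

    topics: dict mapping topic id -> set of subject identifier URIs.
    Returns a list of (set of topic ids, set of identifiers) groups.
    """
    groups = []  # invariant: the identifier sets of the groups are disjoint
    for topic_id, identifiers in topics.items():
        ids, uris = {topic_id}, set(identifiers)
        remaining = []
        for group_ids, group_uris in groups:
            if uris & group_uris:
                # Shared identifier: absorb the existing group.
                ids |= group_ids
                uris |= group_uris
            else:
                remaining.append((group_ids, group_uris))
        remaining.append((ids, uris))
        groups = remaining
    return groups


STEL = "http://www.stel.ru/stalin/"
BBC = "http://www.bbc.co.uk/education/modern/stalin/stalihtm.htm"
MARX = "http://www.marxists.org/reference/archive/stalin/"

topics = {
    "stalin": {STEL, BBC, MARX},                    # the topic shown above
    "josef-stalin": {BBC},                          # hypothetical, from a second map
    "uncle-joe": {"http://example.org/uncle-joe"},  # hypothetical, no overlap
}
groups = merge_by_identifiers(topics)
```

Had the first topic carried only the stel.ru indicator, the second topic would not have been recognised as being about the same subject.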

Any kind of document can serve as a subject indicator. Mostly these will be web pages, but pictures, video or audio documents are suitable as well. The Topic Map framework does not make any distinction here.

It is the responsibility of the author to select those indicators which cover the subject in question as their main focus, although this is difficult to formalize. As a counter-example take the image above, picturing the Yalta conference. The image shows Churchill, Stalin and Roosevelt. If this picture were used as subject indicator for Stalin then there is a good chance that another topic, maybe one about the Yalta conference or about Churchill, might use it as well. Those topics would be merged into one, much to the dismay of most historians.

Finally, some care should be taken to keep subject addresses separate from subject identifiers. Both are URIs and as such may address a document or a thing. A subject address, though, is the address of the thing itself, whereas a subject identifier is the address of something about the thing. A particular document can be both a subject and an indicator, but hopefully only for different topics [SAM].
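The separation can be made concrete with a short Python sketch. It assumes a simplified identity test in which addresses are compared only with addresses and identifiers only with identifiers; the `Topic` class, the `same_subject` function and the two conference topics are hypothetical and serve only to illustrate the point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Topic:
    """Hypothetical topic with the two kinds of identity kept apart."""
    topic_id: str
    subject_address: Optional[str] = None         # the subject itself
    subject_identifiers: frozenset = frozenset()  # documents about the subject


def same_subject(a, b):
    """Simplified identity test: never compare an address with an identifier."""
    if a.subject_address is not None and a.subject_address == b.subject_address:
        return True
    return bool(a.subject_identifiers & b.subject_identifiers)


URL = "http://history1900s.about.com/library/graphics/fdr102.gif"

# The image itself is the subject of this topic.
picture = Topic("pic-jalta", subject_address=URL)
# Here the same URL only indicates the subject (the conference).
conference = Topic("yalta-conference", subject_identifiers=frozenset({URL}))
# A topic from another map using the same indicator.
conference_1945 = Topic("conference-1945", subject_identifiers=frozenset({URL}))

picture_is_conference = same_subject(picture, conference)
conferences_match = same_subject(conference, conference_1945)
```

Although all three topics mention the same URI, only the two conference topics are candidates for merging; the topic about the picture stays separate.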

Published Subject Identifiers (PSI)

The follow-up problem is now the selection of suitable subject identifiers. In fact, a dedicated standardization effort is underway to define a framework for creating and publishing subject indicators. This framework should allow anyone to create simple versions of ontologies and catalogues about things and should facilitate map interoperability on a larger scale.

The expectation is that publishers prepare their web sites in such a way that URLs referencing subjects are as stable as possible and every subject has its own unique entry. Currently this is not always the case: under 'Apple' you will usually find several entries in all major online dictionaries: the fruit, the company or maybe even the music label. By publishing these repositories, publishers commit themselves, signalling to topic map authors that the URLs can be used as PSIs.

Every application domain will then have one or more such repositories, just as amazon.com seems to be becoming one for books and music CDs. As there is no official registry of repositories, authors will select repositories according to their perceived trustworthiness and authoritativeness.

Summary

It is perfectly valid to produce a topic map without any hints of identity for the topics. In many cases this may not be the concern of an author in the first place. If, however, such a map has to be merged with others then it is helpful to enrich the topics with either a strong or a weak form of identity.

All in all, the Topic Map framework provides the author with the following options to connect topics (representatives of subjects) with the outside universe:

  • One option is to use occurrences [XTM 2000], which we did not cover here: an occurrence is simply a reference to an outside object, such as a web page, which provides more information about the subject. Any number of such occurrences can be added to a topic. Occurrences may have a type to help the map user distinguish them.
  • If the outside resource is so significant for the subject that it - rather uniquely - identifies the subject, then its URI can be used as subject identifier, which references this resource as subject indicator. Any number of these can be added to increase the likelihood of merging.
  • If, finally, the external resource is the subject, then the URI of that resource becomes a subject address.

References

XTM 2000
XML Topic Maps (XTM) 1.0, Ed.: S. Pepper and G. Moore, 2000
XTM Proposal 2003
The XML Topic Maps (XTM) Syntax 1.1, Ed.: L. M. Garshol and G. Moore, 2003
RFC2141
URN Syntax, R. Moats, 1997
RFC2611
URN Namespace Definition Mechanisms, L. Daigle, R. Iannella, et al., 1999
URN Status
URN NID Assignment Status
PSI TC
Published subjects TC, guidelines and recommendations for how to create, publish, and maintain published subject sets
SAM
The Standard Application Model for Topic Maps, Ed.: L. M. Garshol and G. Moore

