logo
Path Language for Topic Maps: Full speed ahead? < < Home 

Path Language for Topic Maps: Full speed ahead?

prevUpnext

Motivation

In the XML arena XPath is probably the most effective and popular language. It is not a path language perfectly suitable for TMs although still quite a few people work with them on the XTM level. XPath can be used to drill into XML documents to extract information and/or XSLT is used to arrange the content in another form (e.g. xsiteable, TM4J). Still, the rationale behind XTM is that of an interchange format and not so much as a store from which information can be effectively extracted. The - hopefully near - future will bring TMQL as a proper means to retrieve content out of TM backends. In the recent Amsterdam meetings there was agreement on adding path expressions to the future TMQL.

Recently, when working on a draft of the TMQL specification I experimented with existing path languages and realized that a purely path-oriented query language can have enough expressiveness to solve all those use cases for TMQL which return lists. It also does not mean that TMQL should be a path-oriented language. Still it shows the power of the path paradigm which we may want to leverage.

In the following I have collected some thoughts on how to exploit the path paradigm for TMQL. Here I will include aspects of the language itself, but also will sketch some more general usage scenarios to demonstrate why having a powerful path component in TMQL would be, uhm, cool.

A Short History of TM Path Languages

Let us have first a quick look at existing path languages, at least the ones I came across:

XTMPath

XTMPath was one ad-hocish language I invented as part of the Perl XTM package as I grew tired to use the low-level API to extract information from maps. The idea behind was that the topic map in question would be assumed to be in XTM notation. In reality it was of course fully deserialized in memory. Path statements would behave similar to XPath in that they would attend to the - only virtually existing - nodes. A query like

/topic[instanceOf/topicRef/@href = "#beer-brand"]
would so find a topic node within the current map which has the property that the resource reference would point to the topic beer-brand which is local to the map.

Unlike XPath, XTMPath would know about the underlying XML structure it is working on; this has allowed quite a few optimizations and also shortcuts. On the negative side, XTMPath does not scale well in terms of complexity. To extract, for instance, all cities where we know that persons have been born in, we would have to code:

/topic[instanceOf/topicRef/@href = "#city"]
      [/association
            [member
                 [roleSpec/topicRef/@href = "#city"]
                 [topicRef/@href = "."]]
            [member
                 [roleSpec/topicRef/@href = "#who"]
                 [topicRef/@href
                    [instanceOf/topicRef/@href = "#person"]
                  ]
             ]
       ]
	      
This readily surpasses my personal threshold of pain.

TMPath

TMPath by Dmitry Bogatchev is not based on an XMLish tree-like model but directly on a (somewhat simplified) TM model. That consists of associations, topics and characteristics thereof.

TMPath uses first types to select particular instances from a map, such as all persons in topic[person]. From then on, TM-specific navigation can be used to reach out to other parts of a map:

/topic[person]/roleOf[who]/
      association[bornIn]/role[where]/topic[city]
In that case we would select first all persons, find in which associations they play the role who, find then which of these are of type bornIn and then which of them have a role where. From this set of things we only look for those things which are an instance of city.

While there is some syntactic overhead, the queries lend themselves quite nicely to a natural flow of movement through a given map.

AsTMaPath

AsTMaPath has a similar rationale as TMPath in that it also uses a conceptual TM model. It was a late addition to AsTMa?, the query part in the AsTMa* language family.

Like TMPath it can be used to find topic and association components; so, for instance, the occurrence of type website for the topic identified by tmql

tmql / oc (website)

But it can also start from a map and climb over associations:

some-map / * [ . -> instance \ is-a / class = 'person' ]
             -> who \ bornIn / where
             [ . -> instance \ is-a / class = 'city' ]
	      
Of course, the parser/compiler would know about a lot of shorthand notations to make life convenient for authors:
some-map // persons -> who \ bornIn / where // city

Other languages for TMs contain smaller doses of path expressions, such as Toma and also the query language outlined in [Mück, Widhalm].

How could such a language for TMQL look like?

As a starting point we could paraphrase what existing languages already offer. If we assume that we already hold a map in a variable $m, then a statement like

$m // opera
could give us a list of all things, i.e. topics which are - directly or indirectly - instances of the concept opera. 'Direct or indirectly' means that we also want to see things which are instances of subclasses of opera.

Topic characteristics also can be selected, quite similar as the existing path languages allow to:

$m // opera / bn
returns all basenames of all operas. If we ask for the premiere date and assume that is stored as a topic resource data occurrence of type premiere-date we could write
$m // opera / rd (premiere-date)

It is important to realize that we always get returned a list of things, in this case it is a list of all premiere date characteristics of all operas in the map.

This maybe probably not what we want, so we look out only for particular operas:

$m // opera [ ./rd (premiere-date) < '1900' ] 
      / rd (premiere-date)
First, we select all operas; the condition enclosed with [] then filters out those where the premiere is before 1900. Only then we ask again for the premiere dates.

The expression within square brackets can be a comparison of two path expressions like above, or it can also be a simple path expression which is equivalent to 'true' if it evaluates to something non-empty:

$m // opera [ ./rd (premiere-date) ]

We can also navigate along associations, say, to find all composers of operas:

$m // opera -> piece \ composed-by / composer
After we have extracted all operas, we look where these play the role piece in an association of type composed-by. Again we will get a set of things, this time associations. For each of them we select the role composer to receive the topics playing that role there.

Advanced features

In a more complex example let us consider that some operas might be composed not by an individual alone, but by a team of composers. If we now would like to select all operas where all composers are already dead, then we could write

$m // opera 
        [ not . -> piece \ composed-by / composer / rd (status) 
          =
          'alive' ]
Finding all instances of opera, we then see that there is no situation in the map to find an association of type composed-by where that opera plays the role piece and that topic playing the role composer in that very association has a resource data occurrence of type status having the value alive.

As the [] has an 'exists' semantics, with the introduction of negation we can support also an 'forall' semantics.

A TMQL path language will have to return more than one list of values. As an example, assume that we would like to generate a table of operas together with their composers:

$m // opera 
    < . / bn[0], . -> piece \ composed-by / composer / bn[0] >
The sub-expression enclosed by <> brackets (ramification) takes care that for every opera we generate one or more tuples consisting each of two values: The first one is the first basename (whatever that may be) of each particular opera and the second value in the tuple is the first basename of a composer of that particular opera. As an opera might have several of such composers, for each of them such a tuple is generated, always using one and the same basename for the opera itself.

Also the inverse case - selecting particular value from such tuples - has to be covered. In this projection we use the expression .[0] to project the first value in every tuple and ignore all others:

$m // opera < ./bn , ./rd(premiere-date) > .[0]
In the end this is the same as if we would have written shorter
$m // opera / bn
in the first place.

Using ramification with a followup projection may appear as a mute exercise, but they are the necessary means to emulate functionality which can be achieved using variables in the pattern-matching oriented languages.

Additionally, the language will have to allow sorting according to arbitrary criteria. The following query does exactly that, listing all operas alphabetically:

$m // opera order by ./bn@sort

More sophistication can not only be added for intra-map navigation; also on the map level itself we can use expressions to build and manipulate map. As this is not so relevant for this exposition, we will skip details here. The same applies for another open question, namely whether path expressions should be equipped to generate non-list results, such as XML instances or Topic Map instances.

Applications

A powerful path expression sublanguage will open up specific application scenarios where (command) line-oriented control and configuration is natural or at least dominates the mindset.

This, of course, includes most of UNIX/Linux related application management which makes still heavy use of the path paradigm based on the tree-formed file system despite the recent trend towards XML configuration files. But it also includes web applications where URLs carry all information to control the underlying application without the need to send additional information in the HTTP message body.

Let us look at some recent examples we here have done work on:

TMFS (Topic Map File System)

All modern operating systems work with tree-structured file systems where files are stored as binary blobs in directories. Not only are there many different such file systems (with different properties regarding disk usage, speed, recovery speed, etc.), such file systems might also reside on other machines or other media than disks. Usually, the facilities to host meta data within a file system is rather limited.

The TMFS would be a 'virtual file system' in the sense that there are actually no files or directories in which data is stored. Instead, the content would be that of a topic map. As such a topic map could be mounted, i.e. linked into the overall file system and would be manipulated as such: reading, listing of topics would be done as if the topics were files in a directory. Copying, renaming and delete operations would actually be handled by a driver which modifies the underlying topic map.

A topic map path language would loosely fit into the file system paradigm, although navigation in topic map is closer to that through graphs than through trees, so we would see additional navigational means. Still, this can be handled transparently by the virtual file system such that commands like

cd /knowledge/music/opera/
would find an Internet map with a topic music. That could reify another map which covers the WWW with the opera being a topic therein. And that would also reify a map.

Listing there a subdirectory with the name puccini with the UNIX command ls puccini would reveal all characteristics and roles that topic plays in the map. The UNIX command more puccini/.xtm would actually return the contents of the topic puccini within the opera map as XTM text. As all topic characteristics and all associations will be part of such a file system all content can be access (and manipulated) by reading and writing particular files.

While useful by itself, it also opens the venue to overlay topic map content with conventional file content. This can be achieved with a 'stacked file system' which allows to put one file system onto another. In our case we would start with a conventional file system to store our documents and would overlay a TMFS to provide meta data for these documents.

TMWS (Topic Map Web Services)

Knowledge portals already offer topic map content via HTML-based web interfaces. Some already are adapted to expose content via a SOAP-based or a RESTful interface. In these cases the design of the URL space of such services has to follow the structure of the content. Again, we will see slashes '/' used to (logically) separate different levels of depth.

Assuming that a particular map can be retrieved via

http://www.example.com/music/opera/
then we could simply append TMQL path expressions to query that map:
http://www.example.com/music/opera//opera..
     ..[./rd(premiere-date)]
Done via a web interface we would see an HTML-rendered list of results. Done via a web service, the results would be packaged in XML before returning.

All this would not be so straightward had we used a query language style along the lines of SQL or XQuery.

So how can this affect TMQL?

Our TMQL clayman version we created in Amsterdam contains actually sublanguages: the SQL-ish SELECT flavour, the more general FLWR structure and bits and pieces of a path language.

I would see several levels at which a more consistent use of a path language can be useful:

  • I can prove that all TMQL statements which follow the SQL-style can be converted into FLWR structures, but - obviously - not the other way round. I can also prove that all FLWR-styled statements (generating list output) have an equivalent path expression using a similar path language as sketched above. Path expressions do not actually add functionality we do not already have.

    Given this situation, one might argue why TMQL should not officially support pure path expressions and may leave it up to the developer which style to follow in a particular situation. Developers then can profit from the extreme conciseness of a path language, if they wish to do so. Simply compare the slightly heavy-handed

    SELECT string:uppercase ($opera / bn)
    FROM $m
    WHERE
       is-instance-of ($opera : instance, opera: class)
    	      
    with the equivalent path expression
    ( string:uppercase ($m // opera / bn) )

    On other occasions path expression are less suitable, especially when the queries get more complex:

    $m1 // opera 
          < . , $m2 // opera > 
                [ .[0] -> opera \ is-composed / composer 
                  =
                  .[1] -> opera \ is-composed / composer ]
    This query takes two maps, stored in $m1 and $m2 and finds all corresponding operas which have the same composer. This is not immediately obvious. The equivalent FLWR expressions would be a bit more self-explanatory (YMMV):
    FOR $opera1 IN $m1 // opera
       FOR $opera2 IN $m2 // opera
       WHERE
          $opera1 <> $opera2
          AND
          is-composed ($composer: composer, $opera1: opera)
          AND
          is-composed ($composer: composer, $opera2: opera)
    RETURN
       ($opera1, $opera2)
    	      

  • To define TMQL's semantics we certainly can and will use prose. The principal problem with that is to get the wording so precise that the language specification is unambiguous where it has to be while keeping neutral on other issues which are completely implementation-dependent. Quite a few standards have struggled with this and it is not easy to get it right.

    While it may not completely avoidable we could strengthen the language specification using a more formal approach to map the language against a (formal) Topic Map model. Regardless what that is (I still favour a drastically minimized version of TMRM, more on this later), all language constructs of TMQL (SELECT, WHERE, RETURN, ...) will have to be defined in the light of the formal model.

    If - as mentioned above - it can be proven that all TMQL statements have an equivalent path expression then the path sublanguage can effectively function as a language core. Then only for this core the formal definition has to be made, everything else can be covered by transformations into the core language drastically simplifying the overall specification.

    Formalizing the language on this level also provides a means to directly express or indirectly defer potential optimization opportunities for implementers. So, for instance, we could derive from the base semantics that it does not matter in which order the []-conditions are evaluated if several of them are to be applied in sequence. Implementations then are free to decide which conditions to test first to reduce the number of involved objects to reduce the memory footprint or to speed up evaluation.

  • Path expressions also allow us to connect queries with constraints, as governed by the future TMCL. To substantiate this, consider one constraint which mandates that "Every professor must have at least one person which he supervises". Whatever TMCL's surface syntax might be, this would boil down to an equation involving path expressions.

    The way it can be expressed is to find all professors in a map and filter them by those who satisfy the above condition of supervising someone. If the filtered set exactly corresponds to the original of all professors, then the constraint is satisfied. Written with path expressions:

    $m // professor  [ . -> supervisor \ is-supervised-by ]
    =
    $m // professor
    Here we are using the exists semantics of the condition inside []: If only one association of type is-supervised-by is found where only one role supervisor is played by a professor, then the condition is true.

    The beauty of this is not only the potential bridge to TMCL and a common semantic model; equations like this can also - under yet to be researched conditions - be transformed into Description Logics (DL).

    As DL is all about connecting classes and their properties, we have to introduce these first. The obvious classes in our universe are PROFESSOR for the professors themselves. The obvious property is HAS-SUPERVISION-OVER.

    Now we identify these classes with things we find in our map. All things which are of class PROFESSOR are obviously captured by the path expression $m // professor:

    $m // professor  [ . -> supervisor \ is-supervised-by ] = $m // professor
    ^             ^                                           ^             ^
    +-------------+                                           +-------------+
    
       PROFESSOR                                                 PROFESSOR
    
                       ^--------------------------------^
                               HAS-SUPERVISION-OVER
    	      

    This can now be transformed into DL axioms and concept definitions. What the above says is that PROFESSOR is a subconcept of things which do supervision, i.e. have at least one property HAS-SUPERVISION-OVER with a non-empty value:

    PROFESSOR ⊑ EXISTS HAS-SUPERVISION-OVER . ⊤
    	      
    Nice, I would say.


prevUpnext