|
|
Path Language for Topic Maps: Full speed ahead?
Motivation
In the XML arena XPath is probably the most
effective and popular language. It is not a path language perfectly suitable for TMs
although still quite a few people work with them on the XTM level. XPath can be used to drill into
XML documents to extract information and/or XSLT is used to arrange the content in another
form (e.g. xsiteable, TM4J). Still, the rationale behind XTM is that of an
interchange format and not so much as a store from which information
can be effectively extracted. The - hopefully near - future will bring TMQL as a proper
means to retrieve content out of TM backends. In the recent Amsterdam meetings there was
agreement on adding path
expressions to the future TMQL.
Recently, when working on a draft of the TMQL specification I experimented with existing
path languages and realized that a purely path-oriented query language can have enough
expressiveness to solve all those use cases for TMQL which
return lists. It also does not mean that TMQL should be a path-oriented
language. Still it shows the power of the path paradigm which we may want to leverage.
In the following I have collected some thoughts on how to exploit the path paradigm for
TMQL. Here I will include aspects of the language itself, but also will sketch some more
general usage scenarios to demonstrate why having a powerful path component in TMQL would
be, uhm, cool.
A Short History of TM Path Languages
Let us have first a quick look at existing path languages, at least the ones I came across:
- XTMPath
XTMPath was one ad-hocish language I invented as part of the Perl XTM package as I grew tired
to use the low-level API to extract information from maps. The idea behind was that
the topic map in question would be assumed to be in XTM
notation. In reality it was of course fully deserialized in memory. Path statements
would behave similar to XPath in that they would attend to the - only virtually
existing - nodes. A query like
/topic[instanceOf/topicRef/@href = "#beer-brand"]
would so find a topic node within the current map which has the
property that the resource reference would point to the topic
beer-brand which is local to the map.
Unlike XPath, XTMPath would know about the underlying XML structure it is working on;
this has allowed quite a few optimizations
and also shortcuts. On the negative side, XTMPath does not scale well in terms
of complexity. To extract, for instance, all cities where we know that persons have
been born in, we would have to code:
/topic[instanceOf/topicRef/@href = "#city"]
[/association
[member
[roleSpec/topicRef/@href = "#city"]
[topicRef/@href = "."]]
[member
[roleSpec/topicRef/@href = "#who"]
[topicRef/@href
[instanceOf/topicRef/@href = "#person"]
]
]
]
This readily surpasses my personal threshold of pain.
- TMPath
TMPath by Dmitry Bogatchev is not based on an XMLish tree-like model but directly on a
(somewhat simplified) TM model. That consists of associations, topics and
characteristics thereof.
TMPath uses first types to select particular instances from a
map, such as all persons in topic[person]. From then on,
TM-specific navigation can be used to reach out to other parts of a map:
/topic[person]/roleOf[who]/
association[bornIn]/role[where]/topic[city]
In that case we would select first all persons, find in which associations they play
the role who, find then which of these are of type
bornIn and then which of them have a role
where. From this set of things we only look for those things which
are an instance of city.
While there is some syntactic overhead, the queries lend themselves quite nicely to a
natural flow of movement through a given map.
- AsTMaPath
AsTMaPath has a similar rationale as TMPath in that it also uses a conceptual TM
model. It was a late addition to AsTMa?, the query part in the
AsTMa* language family.
Like TMPath it can be used to find topic and association components; so, for instance,
the occurrence of type website for the topic identified by tmql tmql / oc (website)
But it can also start from a map and climb over associations:
some-map / * [ . -> instance \ is-a / class = 'person' ]
-> who \ bornIn / where
[ . -> instance \ is-a / class = 'city' ]
Of course, the parser/compiler would know about a lot of shorthand notations to make
life convenient for authors:
some-map // persons -> who \ bornIn / where // city
Other languages for TMs contain smaller doses of path expressions, such as Toma and also
the query language outlined in [Mück, Widhalm].
How could such a language for TMQL look like?
As a starting point we could paraphrase what existing languages already offer. If we assume
that we already hold a map in a variable $m, then a statement like
$m // opera
could give us a list of all things, i.e. topics which are - directly or indirectly -
instances of the concept opera. 'Direct or indirectly' means that we also
want to see things which are instances of subclasses of opera.
Topic characteristics also can be selected, quite similar as the existing path languages
allow to:
$m // opera / bn
returns all basenames of all operas. If we ask for the premiere date and assume that is
stored as a topic resource data occurrence of type premiere-date we could
write
$m // opera / rd (premiere-date)
It is important to realize that we always get returned a list of things, in this case it is
a list of all premiere date characteristics of all
operas in the map.
This maybe probably not what we want, so we look out only for particular operas:
$m // opera [ ./rd (premiere-date) < '1900' ]
/ rd (premiere-date)
First, we select all operas; the condition enclosed with [] then filters
out those where the premiere is before 1900. Only then we ask again for the premiere dates.
The expression within square brackets can be a comparison of two path expressions like
above, or it can also be a simple path expression which is equivalent to 'true' if it
evaluates to something non-empty:
$m // opera [ ./rd (premiere-date) ]
We can also navigate along associations, say, to find all composers of operas:
$m // opera -> piece \ composed-by / composer
After we have extracted all operas, we look where these play the role
piece in an association of type composed-by. Again we
will get a set of things, this time associations. For each of them we select the role
composer to receive the topics playing that role there.
Advanced features
In a more complex example let us consider that some operas might be composed not by an
individual alone, but by a team of composers. If we now would like to select all operas
where all composers are already dead, then we could write
$m // opera
[ not . -> piece \ composed-by / composer / rd (status)
=
'alive' ]
Finding all instances of opera, we then see that there is no situation in
the map to find an association of type composed-by where that opera plays
the role piece and that topic playing the role
composer in that very association has a resource data occurrence of type
status having the value alive.
As the [] has an 'exists' semantics, with the introduction of negation we
can support also an 'forall' semantics.
A TMQL path language will have to return more than one list of values. As an example, assume
that we would like to generate a table of operas together with
their composers:
$m // opera
< . / bn[0], . -> piece \ composed-by / composer / bn[0] >
The sub-expression enclosed by <> brackets (ramification) takes care
that for every opera we generate one or more tuples consisting each of two values: The first
one is the first basename (whatever that may be) of each particular opera and the second
value in the tuple is the first basename of a composer of that particular opera. As an
opera might have several of such composers, for each of them such a tuple is generated,
always using one and the same basename for the opera itself.
Also the inverse case - selecting particular value from such tuples - has to be covered. In
this projection we use the expression .[0] to project the first value
in every tuple and ignore all others:
$m // opera < ./bn , ./rd(premiere-date) > .[0]
In the end this is the same as if we would have written shorter
$m // opera / bn
in the first place.
Using ramification with a followup projection may appear as a mute exercise, but they are
the necessary means to emulate functionality which can be achieved using variables in the
pattern-matching oriented languages.
Additionally, the language will have to allow sorting according to arbitrary criteria. The
following query does exactly that, listing all operas alphabetically:
$m // opera order by ./bn@sort
More sophistication can not only be added for intra-map navigation; also on the map level
itself we can use expressions to build and manipulate map. As this is not so relevant for
this exposition, we will skip details here. The same applies for another open question,
namely whether path expressions should be equipped to generate non-list results, such as
XML instances or Topic Map instances.
Applications
A powerful path expression sublanguage will open up specific application scenarios where
(command) line-oriented control and configuration is natural or at least dominates the
mindset.
This, of course, includes most of UNIX/Linux related application management which makes
still heavy use of the path paradigm based on the tree-formed file system despite the recent
trend towards XML configuration files. But it also includes web applications where URLs
carry all information to control the underlying application without the need to send
additional information in the HTTP message body.
Let us look at some recent examples we here have done work on:
- TMFS (Topic Map File System)
All modern operating systems work with tree-structured file systems where files are
stored as binary blobs in directories. Not only are there many different such file
systems (with different properties regarding disk usage, speed, recovery speed,
etc.), such file systems might also reside on other machines or other media than
disks. Usually, the facilities to host meta data within a file system is rather
limited.
The TMFS would be a 'virtual file system' in the sense that there are actually no
files or directories in which data is stored. Instead, the content would be that of
a topic map. As such a topic map could be mounted, i.e. linked
into the overall file system and would be manipulated as such: reading, listing of
topics would be done as if the topics were files in a directory. Copying, renaming
and delete operations would actually be handled by a driver which modifies the
underlying topic map.
A topic map path language would loosely fit into the file system paradigm, although
navigation in topic map is closer to that through graphs than through trees, so we
would see additional navigational means. Still, this can be handled transparently by
the virtual file system such that commands like
cd /knowledge/music/opera/
would find an Internet map with a topic music. That could reify
another map which covers the WWW with the opera being a topic
therein. And that would also reify a map.
Listing there a subdirectory with the name puccini with the UNIX
command ls puccini would reveal all characteristics and roles
that topic plays in the map. The UNIX command more puccini/.xtm
would actually return the contents of the topic puccini within
the opera map as XTM text. As all topic characteristics and all associations will be
part of such a file system all content can be access (and manipulated) by reading
and writing particular files.
While useful by itself, it also opens the venue to overlay
topic map content with conventional file content. This can be achieved with a
'stacked file system' which allows to put one file system onto
another. In our case we would start with a conventional file system to
store our documents and would overlay a TMFS to provide meta
data for these documents.
- TMWS (Topic Map Web Services)
Knowledge portals already offer topic map content via HTML-based web
interfaces. Some already are adapted to expose content via a SOAP-based or a RESTful
interface. In these cases the design of the URL space of such services has to follow
the structure of the content. Again, we will see slashes '/' used to (logically)
separate different levels of depth.
Assuming that a particular map can be retrieved via
http://www.example.com/music/opera/
then we could simply append TMQL path expressions to query that map:
http://www.example.com/music/opera//opera..
..[./rd(premiere-date)]
Done via a web interface we would see an HTML-rendered list of results. Done via a
web service, the results would be packaged in XML before returning.
All this would not be so straightward had we used a query language style along the lines of
SQL or XQuery.
So how can this affect TMQL?
Our TMQL clayman version we created in Amsterdam contains actually sublanguages: the SQL-ish
SELECT flavour, the more general FLWR structure and bits and pieces of a path language.
I would see several levels at which a more consistent use of a path language can be useful:
I can prove that all TMQL statements which follow the SQL-style can be converted into
FLWR structures, but - obviously - not the other way round. I can also prove that all
FLWR-styled statements (generating list output) have an equivalent path expression
using a similar path language as sketched above. Path expressions do not actually add
functionality we do not already have.
Given this situation, one might argue why TMQL should not officially support pure path
expressions and may leave it up to the developer which style to follow in a particular
situation. Developers then can profit from the extreme conciseness of a path
language, if they wish to do so. Simply compare the slightly heavy-handed
SELECT string:uppercase ($opera / bn)
FROM $m
WHERE
is-instance-of ($opera : instance, opera: class)
with the equivalent path expression
( string:uppercase ($m // opera / bn) )
On other occasions path expression are less suitable, especially when the queries get
more complex:
$m1 // opera
< . , $m2 // opera >
[ .[0] -> opera \ is-composed / composer
=
.[1] -> opera \ is-composed / composer ]
This query takes two maps, stored in $m1 and $m2 and finds all corresponding operas
which have the same composer. This is not immediately obvious. The equivalent
FLWR expressions would be a bit more self-explanatory (YMMV):
FOR $opera1 IN $m1 // opera
FOR $opera2 IN $m2 // opera
WHERE
$opera1 <> $opera2
AND
is-composed ($composer: composer, $opera1: opera)
AND
is-composed ($composer: composer, $opera2: opera)
RETURN
($opera1, $opera2)
To define TMQL's semantics we certainly can and will use prose. The principal problem
with that is to get the wording so precise that the language specification is
unambiguous where it has to be while keeping neutral on other issues which are
completely implementation-dependent. Quite a few standards have struggled with this
and it is not easy to get it right.
While it may not completely avoidable we could strengthen the language specification
using a more formal approach to map the language against a (formal) Topic Map
model. Regardless what that is (I still favour a drastically minimized version of
TMRM, more on this later), all language constructs of TMQL (SELECT, WHERE, RETURN,
...) will have to be defined in the light of the formal model.
If - as mentioned above - it can be proven that all TMQL statements have an equivalent
path expression then the path sublanguage can effectively function as a language
core. Then only for this core the formal definition has to be made, everything else
can be covered by transformations into the core language drastically simplifying the
overall specification.
Formalizing the language on this level also provides a means to directly express or
indirectly defer potential optimization opportunities for implementers. So, for
instance, we could derive from the base semantics that it does not matter in which
order the []-conditions are evaluated if several of them are to be applied in
sequence. Implementations then are free to decide which conditions to test first to
reduce the number of involved objects to reduce the memory footprint or to speed up
evaluation.
Path expressions also allow us to connect queries with constraints, as governed by the
future TMCL. To substantiate this, consider one constraint which mandates that "Every
professor must have at least one person which he supervises". Whatever TMCL's surface
syntax might be, this would boil down to an equation involving path expressions.
The way it can be expressed is to find all professors in a map and filter them by
those who satisfy the above condition of supervising someone. If the filtered set
exactly corresponds to the original of all professors, then the constraint is
satisfied. Written with path expressions:
$m // professor [ . -> supervisor \ is-supervised-by ]
=
$m // professor
Here we are using the exists semantics of the condition inside
[]: If only one association of type
is-supervised-by is found where only one role
supervisor is played by a professor, then the condition is true.
The beauty of this is not only the potential bridge to TMCL and a common semantic
model; equations like this can also - under yet to be researched conditions - be
transformed into Description Logics (DL).
As DL is all about connecting classes and their properties, we have to introduce these
first. The obvious classes in our universe are PROFESSOR for the
professors themselves. The obvious property is
HAS-SUPERVISION-OVER.
Now we identify these classes with things we find in our map. All things which are of
class PROFESSOR are obviously captured by the path expression
$m // professor:
$m // professor [ . -> supervisor \ is-supervised-by ] = $m // professor
^ ^ ^ ^
+-------------+ +-------------+
PROFESSOR PROFESSOR
^--------------------------------^
HAS-SUPERVISION-OVER
This can now be transformed into DL axioms and concept definitions. What the above
says is that PROFESSOR is a subconcept of things which do
supervision, i.e. have at least one property HAS-SUPERVISION-OVER
with a non-empty value:
PROFESSOR ⊑ EXISTS HAS-SUPERVISION-OVER . ⊤
Nice, I would say.
|