TMQL Tutorial (Preview)
Introduction
The need for a dedicated query language for Topic Maps is not as uncontested as one might think. Serious work on TMQL started roughly a year ago, kicked off by the ISO/IEC SC34/WG3 committee decision to create an industrial-grade language with a low entrance barrier. While the editors are still trying to fix odds and ends in the current TMQL specification, the main concepts are now sufficiently stable to justify asking the public for feedback.
Please note that for the purpose of presentation we have sometimes chosen a particular syntax; some of these choices will appeal to one audience while others may find them at least eyebrow-raising. As the specification is in flux (I so much like euphemistic terms), what follows may well be out of sync with the current TMQL draft. Still, it tries to draw a realistic picture of how the language may appear to a developer.
The following TMQL preview builds on an existing presentation, but it breaks somewhat with the TM tradition of using 'operas' as the running use case. Instead, we will use the more general theme 'music' and assume that we have a TM data store which contains various albums, (female or male) musicians and music groups (which are all artists and have persons as members). All these topics are connected via various associations, such as is-produced-by or is-part-of.
Setting Off
If you have used SQL before, then you will not be completely puzzled by the following query:
SELECT $album WHERE is-produced-by ($album: production, tom-waits: producer)
This query (technically a query expression) will return all albums where Tom Waits is known to be a producer. tom-waits is the identifier of a topic which we happen to know uniquely pinpoints that person in the map we query. The query processor will try to find an association of type is-produced-by and will check whether the topic tom-waits is playing the role producer there.
If so, it will bind the variable $album to the topic playing the role production in that same association. It will work through the whole map in this way and collect all these variable bindings. Finally, the query processor will return a list of these variable bindings. We leave open for the moment what exactly is returned to the application: topic identifiers, names of the topics as strings, or data structures representing the topics.
If we want to make the query more watertight so that it returns albums only (and not something else that was produced), we have to add another constraint to the WHERE clause:
SELECT $album WHERE is-produced-by ($album: production, tom-waits: producer), $album : album
The special binary predicate : now additionally checks whether the thing we have bound to $album is an instance of the class album, at least according to the map we query. An alternative syntax might be $album is-a album to make this more readable. It is worth noting that is-a honors the (transitive) subclass-superclass relationship. If a particular production were an instance of collector-box and that is (directly or indirectly) a subclass of album, such productions would also be returned.
If we are not fixated on Tom Waits and instead want a list of all albums together with their producers, we can extend the wishlist in the SELECT clause:
SELECT $album, $producer WHERE is-produced-by ($album: production, $producer: producer), $album : album
Again the processor will walk through the whole map, find all associations of the given type and bind the playing topics to their respective variables. One particular binding now consists of a pair (a tuple of two components); all these pairs are collected and returned in a list.
Controlling What is Returned
All queries so far return topics to the application in the form of data structures.
If the application were interested in a name of such a topic, it would have to navigate using an API which is outside the scope of TMQL. If it is clear that the application needs only the name (and not the whole topic), we can tweak the SELECT expression again by adding a path expression:
SELECT $album / bn WHERE $album : album
The processor will now do the navigation to the topic names for us, using bn to find the basenames for the thing bound to $album. Unfortunately, this does not cut it yet, for two reasons. First, a topic can have any number of names, so we would actually get a whole list of them for each individual topic. Secondly, these names would also be returned as data structures and not automatically as the strings holding the names.
To fix the first problem, we could choose to only accept names in a particular scope. This is achieved by appending a filter to the path expression:
SELECT $album / bn [ @ en ] WHERE $album : album
To force the processor to stringify the name and so return strings, we add a backtick:
SELECT $album / bn [ @ en ] ` WHERE $album : album
Path expressions are also a convenient way to impose a sorting order on the list of tuples we return:
SELECT $album, $producer WHERE is-produced-by ($album: production, $producer: producer) ORDER BY $album / bn [ @ en ] `
That way we get albums and their producers, but the whole list becomes a sequence sorted according to the English album title. The ordering can also include more than one criterion, as in
SELECT $album, $producer WHERE is-produced-by ($album: production, $producer: producer) ORDER BY $producer / bn [ @ en ] ` desc, $album `
Here we first sort the list of topic pairs according to the name of the producer. For demonstration only, we choose descending order. More importantly, though, for one specific producer name (in the English scope) we sort the sublist containing the different albums according to the album's identifier.
This may not be overly useful by itself, but at least it ensures that the whole returned list always appears in the same order if we keep repeating the same query.
As you would expect, it is also possible to make the list of tuples unique, i.e. to ensure that no two tuples are the same:
SELECT $album ORDER BY $album / bn ` UNIQUE
Also, and this is especially meant for networked (web) applications, it is possible to limit the number of results by asking only for a slice. How this is done exactly is not yet certain; maybe it looks like this:
SELECT $album ORDER BY $album / bn ` OFFSET 10 LIMIT 20
Identifying Things
You may correctly argue that identifying topics by their (internal) map identifier (TMDM calls this for some reason 'source locators') is not an immensely robust idea if that identifier may change any second. In the likely case that there is a subject indicator (a resource which helps to indirectly identify a subject) for a subject, no one can stop you from using that instead. You only have to tell the processor that you are doing so:
SELECT $album WHERE is-produced-by ($album: production, s'http://www.u2.com/ : producer)
Here the identifier prefixed with s' tells the processor that the provided web site indicates the subject. This, of course, only works as long as this subject indicator is also present in your data.
If you are the lucky owner of a subject locator (a URI which refers exactly to the subject in question), then you can use that instead. The album 'How to Dismantle an Atomic Bomb' from U2, for instance, has the ASIN code B0006399FS:
SELECT $producer WHERE is-produced-by (i'urn:x-ASIN:B0006399FS : production, $producer: producer)
Again, your data has to play along.
Controlling Variable Bindings
As we have seen above, variables can be bound to values. Used naively, this can lead to incorrect queries. As an example, let us find all pairs of albums which share the same producer.
In a first attempt we write:
SELECT $album1, $album2 WHERE is-produced-by ($album1: production, $producer: producer), is-produced-by ($album2: production, $producer: producer)
If you have worked with declarative languages before, you might immediately spot the problem: for the TMQL processor, $album1 and $album2 are completely different variables; they might be bound to the same value or to different values, the processor does not care. This does not work for us if we want different albums. The usual escape hatch is to write something like this:
SELECT $album1, $album2 WHERE is-produced-by ($album1: production, $producer: producer), is-produced-by ($album2: production, $producer: producer), $album1 != $album2
Not only is this ugly as hell, in 100% - ε of all cases developers will forget to add this (I will). And it does not look good at all if you have to compare three or more such things. TMQL has a peculiar way to control precisely when variables are allowed to match anything and when they must be bound to something different:
SELECT $album, $album' WHERE is-produced-by ($album : production, $producer: producer), is-produced-by ($album': production, $producer: producer)
Now we have used two variables whose names differ only in the number of primes (') appended. TMQL treats them as two distinct variables, but with the additional semantics that - within one and the same binding - they cannot be bound to the same value.
Association Templates
In the queries above we made use of association templates. Writing
is-produced-by ($album : production, $producer: producer)
inside a query makes the processor try to find matching associations in the queried map. Such associations must be of type is-produced-by and must have exactly two roles, one for production and one for producer. If an association in the map has a third role, say, location, to capture where an album has been produced, then such an association would never match the template.
To allow such associations with additional roles to match, TMQL allows appending an ellipsis:
SELECT $album WHERE is-produced-by ($album : production, $whoever: producer, ...)
Association templates also carry more implicit meaning than is obvious at first. If, for example, the map contained an association of type is-remastered-by which also connects an album with a producer, and is-remastered-by is a subtype of is-produced-by, then such associations would also match the template. Honoring subclassing also applies to roles and their types. If our queried map had an association of type is-remastered-by in which the role (type) for the album is not production but a subclass remastering, such an association would also match the association template.
If you do not care about the role type, you can also omit it for some players:
SELECT $album WHERE is-produced-by ($album : production, $whoever, ...)
Of course, this may be walking on thin ice in some situations (or may make processors slower as they have fewer things to hold on to).
Query Context
You may have wondered how the query processor knows which map to query. This can be defined in the query context, which allows applications to pass all sorts of information into a query. If a map is passed in, it will be - by default - bound to the variable %_. So in all cases we could actually have written more explicitly
SELECT $album FROM %_ WHERE ...
Using parameters in the query context makes queries more flexible and portable, but sometimes one wants to name the map explicitly. The expression
SELECT $album FROM http://example.org/map.xtm WHERE ...
does exactly this and tells the query processor to refer to the named map.
Another feature which may make it into the final standard is to define a list of scoping topics as part of the query context. As an example, the application may choose to pass in [ en, fr, de ] as such a list.
Then the query
SELECT $album / bn ` WHERE $album : album
is evaluated as before, finding all albums and returning the names as strings. But since the scoping list is non-empty, the processor will use this list to first look for a basename in the scope en; if there is none, for a name scoped fr; and then for one in the scope de. If all of this is unsuccessful, then the unconstrained scope is checked; if this fails too, then NULL is returned instead.
Path Expressions
The textual overhead of the SQLish style which we have used so far may not be convenient if queries are trivial. Especially for web applications, where pages have to be filled with lots of content from a TM backend, a much shorter notation is more adequate. To return all albums from the map bound to %_, we can simply write
%_ // album
If we need the English names only, then
%_ // album / bn [ @ en ]
will do it. You can also add constraints, such as that we only want albums which have an English name:
%_ // album [ . / bn [ @ en ] ]
The processor will first extract all instances of album. For each of them (this is signalled with the dot .), it will test whether the condition inside the bracket pair [] can be satisfied. Accordingly, it will take one album, find all its basenames and - in turn - try to satisfy the condition [ @ en ] for each of those. That condition is true if the name is in the scope en. If none of the names is scoped in en, then the sub-expression . / bn [ @ en ] will return no result, which translates into FALSE (exists semantics).
Such conditions can be arbitrarily complex, so it is no problem to reformulate our earlier query so that it returns only the English names of Tom Waits albums:
%_ // album [ . -> production [ * is-produced-by ] / producer = tom-waits ] / bn [ @ en ]
The processor will again start off with all albums and will subject each of them to the test
. -> production [ * is-produced-by ] / producer = tom-waits
Like above, the current album is represented by the dot. It will be used as a navigation
starting point. First the processor will try to find all associations where that album plays
the role production. This list of associations is then filtered down to only those which are
instances of the concept is-produced-by. This filtering is also done with a path expression predicate.
Once the associations have been found, all of them are used as starting point to navigate
along the role producer outwards. The overall result of this is a list of topics which are
the producers of a particular album.
Trivializing the process somewhat, this list is now compared to the single topic which is identified by tom-waits. If that topic is part of the list we just obtained, then the condition is satisfied and the album is kept as a candidate. If not, the next album is tried. The last stage of the path expression again finds the English names for each of the remaining albums.
Chocolate, Vanilla, Caramel
The different language flavours, SQLish and path expressions, can - as we have already seen - be mixed. Not so obvious is the fact that both styles are (almost?) equivalent in terms of expressivity; every SELECT query expression can be transformed into an equivalent path expression. It is up to the developer to choose the most appropriate combination on a case-by-case basis.
Both styles allow returning sequences of tuples of things to the application. This may match your way of thinking in most cases, but there are several application scenarios where you need more complex structures to be returned. One of them is integrating a TM backend into one of these shiny XML application servers (such as Tamino, Cocoon or AxKit), or - more generally - writing taglibs to be used in SAX processors or as XSLT extensions. Here you need a query to return XML content.
While one might argue that SELECT query expressions together with some templating system to generate the XML content can achieve exactly this, such an approach has severe downsides. Firstly, most templating systems are text oriented, so the XML stream would have to be deserialized (parsed) before being post-processed by an application. This is not fast. But more importantly, in the case of nested XML documents - and nesting may be the whole point of XML anyway - the template processor would have to iterate and reissue many queries using the TMQL processor. This results in intense interaction, and this is not fast either.
With TMQL you can create XML content directly without fuss using a third flavour, which - you may have guessed it - is otherwise equivalent to the other styles. This flavour, FLWR, is inspired by XQuery and uses RETURN clauses to specify the output:
return
<albums>{
for $a in %_ // album return
<album>{$a / bn [@ en ]}</album>
}
</albums>
The return value consists of one XML 'document' with a root element <albums>. Nested into
that are all albums in the map. The way this is achieved is by iterating over them in a
'for' loop. It uses a path expression %_ // album to compute first all instances of albums
in the contextual map. Each such album is bound to the (iteration) variable $a and with that
new binding, the body of the loop is evaluated.
This body is defined by a nested RETURN clause. It contains an element <album> which is supposed to contain text content. That content can be constant or - like in our example - can be specified using a TMQL path expression. To signal the latter, you have to wrap the expression inside {} brackets. TMQL query expressions following the FLWR structure can also return lists, as the following example demonstrates:
for $album in %_ // album
return
($album / bn [ @ en ]`, $album / oc [ * homepage ]`)
For each album in the map a pair of strings is returned: The first component contains the
English name of the album (if that exists) and the second component the occurrence of type
homepage.
Often FLWR expressions are easier to read than SQL-style expressions, because you can read them from the top to the end (and not the other way round as in SQL). To prove the point that the styles are mostly interchangeable, we reformulate the query to find all album pairs sharing one and the same producer.
for $album in %_ // album
for $album' in %_ // album
where
exists
is-produced-by ($album : production, $producer: producer)
and
is-produced-by ($album': production, $producer: producer)
return
($album, $album')
As a side note: you may have noticed that the syntax of the WHERE clause inside a SELECT expression is somewhat different from that inside FLWR expressions. There is no technical reason for that; it is simply that we (the editors) have not yet agreed whether to use a more orthodox style (using AND, OR, ...) or the style carried over from tolog (using comma and |, ...). This will hopefully be resolved soon; two syntaxes for the same thing is not good. No.
The structure of FLWR expressions also makes it possible to return something else: topic maps. Just like SQL takes tables and returns a table as result, and XQuery takes XML documents and returns XML content, TMQL can take one or more topic maps and return TM content. At the time of writing, we have no clear understanding yet of which notation to use for this. While existing ones (XTM, LTM, AsTMa=) might be obvious candidates, a newly developed one might integrate more smoothly. But this is another story...
Nested Queries
Nesting of query expressions is quite natural and could potentially be used in quite a few places. To demonstrate how much the language can be stretched, here is an experimental (so not yet agreed on) syntax for nesting within a SELECT clause. To list all groups and their members in the map, the SELECT clause may contain a whole subquery:
SELECT $group /bn `,
" members are: " + join (",",
{
SELECT $person / bn `
WHERE
is-part-of ($person: member, $group: whole)
})
WHERE $group : group
The first column in the returned table is always a string containing the group name(s). The
second column is also a string, but it is computed by first evaluating the nested query
expression. That will return a list of person names. That list is then concatenated via the
predeclared function join, whereby the entries are separated by the
specified string ",". The result will then be concatenated (an infix
+ operator between strings means concatenation) with the " members are: "
text.
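The evaluation order described above can be imitated outside TMQL. The following is only a rough Python sketch on invented toy data (the group and person names, MEMBERS_BY_GROUP and both helper functions are assumptions made for illustration, not part of any TMQL implementation): the nested query yields a list of names per group, join glues them together, and + prepends the fixed text.

```python
# Toy stand-in for a topic map: group name -> names of its members.
# All names here are invented sample data, not taken from a real map.
MEMBERS_BY_GROUP = {
    "group-a": ["alice", "bob"],
    "group-b": [],
}


def members_column(group: str) -> str:
    """Mimic: " members are: " + join (",", { nested SELECT })."""
    # The nested query is evaluated first and yields a list of person names.
    person_names = MEMBERS_BY_GROUP[group]
    # join concatenates the entries using the separator; + concatenates strings.
    return " members are: " + ",".join(person_names)


def run_query() -> list:
    """Return one (group name, members string) row per group, like the outer SELECT."""
    return [(group, members_column(group)) for group in MEMBERS_BY_GROUP]


rows = run_query()
for row in rows:
    print(row)
```

Note that in this sketch a group without members still produces a row; whether a TMQL processor would behave the same way is not settled by the text above.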
You can also use subqueries inside strings. The following example would return a single string where all names of all albums are concatenated:
return
"{
SELECT $album / bn
WHERE $album : album
}"
Note that we omitted the stringification postfix `, because stringification is mandated by the
context. If we want the processor to embed a basename into a string, then it has to be
converted into a text representation anyway.
Using Exist and All Quantification
On some occasions you will have to test whether particular things exist in a map or whether certain things have a relevant property. For illustration, let us ask for all music groups in our map which have at least one female group member:
for $group in %_ // group
where
some $person in $group -> whole / member
satisfies $person : female
return
($group)
While we iterate over all groups in the map, we find for each such group all members using the
path expression $group -> whole / member. If at least one of them satisfies the condition that it is an
instance of female, then the existential SOME clause is satisfied.
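For readers who think in terms of general-purpose languages, the existential test behaves much like Python's built-in any(). The sketch below is an approximation on invented toy data (GROUP_MEMBERS, PERSON_TYPES and the helper names are assumptions, and the transitive subclass handling of TMQL is deliberately left out):

```python
# Invented sample data: members per group and the classes each person belongs to.
GROUP_MEMBERS = {
    "group-a": ["alice", "bob"],
    "group-b": ["carl", "dave"],
    "group-c": [],
}
PERSON_TYPES = {
    "alice": {"female"},
    "bob": {"male"},
    "carl": {"male"},
    "dave": {"male"},
}


def is_instance(person: str, cls: str) -> bool:
    """Simplified stand-in for '$person : cls' (ignores subclassing)."""
    return cls in PERSON_TYPES.get(person, set())


def groups_with_some(cls: str) -> list:
    """Keep a group if at least one of its members is an instance of cls."""
    return [
        group
        for group, members in GROUP_MEMBERS.items()
        if any(is_instance(person, cls) for person in members)
    ]


result = groups_with_some("female")
print(result)
```

A group without any members can never satisfy the test in this sketch, because any() over an empty list is false.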
Conversely, we might be interested to find all boy groups, well, at least those groups where all members are male:
for $group in %_ // group
where
every $person in $group -> whole / member
satisfies $person : male
return
($group)
Functions
Not too surprisingly, the language also sports functions. They can be declared anywhere before a query expression and can then be invoked inside it:
declare function get_female_members ($g) return {
for $m in $g / member
where
$m : female
return ($m)
}
SELECT $group / bn, count(get_female_members ($group))
WHERE
$group : group
Here we have defined a function get_female_members which computes a list of all female
members of a particular group. In the SELECT expression we iterate through all groups and
return a table with the group's name and the number of female members in it. count is just
another function which takes a list and computes, well, the number of entries there. That one,
of course, is predeclared in TMQL, as are many others.
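The division of labour between the declared function and the query can be mirrored in a few lines of Python. This is only an illustrative sketch on invented toy data (GROUP_MEMBERS, FEMALE and select_groups_with_female_count are assumptions); len plays the part of the predeclared count:

```python
# Invented sample data standing in for the queried map.
GROUP_MEMBERS = {
    "group-a": ["alice", "bob", "eve"],
    "group-b": ["carl"],
}
FEMALE = {"alice", "eve"}


def get_female_members(group: str) -> list:
    """Counterpart of the declared function: all female members of one group."""
    return [member for member in GROUP_MEMBERS[group] if member in FEMALE]


def select_groups_with_female_count() -> list:
    """Counterpart of the SELECT: one (group, count) row for every group."""
    return [(group, len(get_female_members(group))) for group in GROUP_MEMBERS]


table = select_groups_with_female_count()
print(table)
```

As in the query, the helper is written once and reused for every group the outer iteration visits.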
Adhoc Ontologies
One of the more controversial issues is the adoption of features which would actually be expected in a constraint language. With these, we can give the query processor more information about the application domain, either by enriching the vocabulary with further topics, by adding new associations, or by introducing associations based on those already existing in the map.
As an example, let us consider that our map contains associations of type has-created where we have recorded which artist (single or group) has created which album, such as "U2 as creator has created an opus 'Rattle and Hum'", or "Leonard Cohen as creator has created an opus 'Various Positions'". It lies in the nature of has-created associations that if a group has created something, then we might also deduce that every member of the group has created that thing as well. Naturally, we do not want to change our map and add all this redundant information to it. Instead we would like to add a rule to the effect of "if an individual belongs to a group and the group has created something, so has the individual".
using MYRULEZ [tolog] for {
has-created-indirect ($creator, $opus) :- {
has-created ($creator: creator, $opus: opus)
|
is-part-of ($creator: member, $group: whole),
has-created ($group: creator, $opus: opus)
}
}
SELECT $creator
WHERE
MYRULEZ:has-created-indirect ($creator, $whatever)
What happens here is that - before the actual query - we introduced an adhoc ontology and gave it the name MYRULEZ. That name can be used as a prefix from then on. In this ontological part we added two rules, following the style introduced in tolog. Without going into any details, these rules introduce a new association type has-created-indirect which is derived from knowledge existing in our map. The second rule, for example, covers the situation that an individual belonging to a group is also a creator of the thing in question, albeit only indirectly.
The obvious benefit of such rules is that they keep redundancy in the maps themselves low without burdening the query itself; it would have been possible to hard-code the rule there:
SELECT $creator
WHERE
has-created ($creator: creator, $opus: opus)
|
(
is-part-of ($creator: member, $group: whole),
has-created ($group: creator, $opus: opus)
)
One open issue is whether TMQL should adopt the tolog-ish way of thinking as the only way to express such additional domain knowledge, or whether TMQL is kept promiscuous in the sense that other constraint languages (the upcoming TMCL, or OWL, for example) could potentially be used. The advantage then would be that the inferencing involved can be adapted to the problem at hand: inferencing with tolog is certainly of a higher computational complexity than that associated with OWL. On the other hand, committing to a predicate structure like the above may allow a tighter integration in TMQL implementations.
Another avenue open to us is to factor out the ontological information completely into a separate document:
using wine [owl] for http://www.w3.org/2001/sw/WebOnt/guide-src/wine.rdf
SELECT $winery
WHERE
$winery is-a wine:Winery
This mechanism has some appeal, especially as it also allows importing functionality written
in programming languages. Here is an example using Python:
using sales [python] for urn:x-in-house-database:sales-numbers
SELECT $artist, sales:total ($artist / bn `, 'uk', '2005')
WHERE
$artist is-a artist
As long as the processor understands how to call Python functions and also understands
which library (package, class) the URI urn:x-in-house-database:sales-numbers stands for, it
will - when the query is actually executed - invoke the function total in that library.
Implementors
Some potential implementors may have had a glimpse at the current TMQL draft. With its about 90 syntax rules and 33 pages (including issues and editor notes), it has about the same complexity as, say, SPARQL. Once the ISO committee has made some decisions on various syntax issues, we may also reduce these numbers somewhat.
What is not so obvious at first sight is that the different language flavours do not add any computational complexity. Both SELECT and FLWR expressions can be mapped into path expressions, so that most of these syntactical variations are swallowed by the parser anyway. This is also true for the majority of shorthand notations introduced; while the numbers may drop over time, each of them can be dealt with in a single line of code. The only exception to the above rule is the processing of FLWR RETURN clauses, but that one is fairly trivial.
While path expressions might be an obvious implementation target, you may also want to map everything into FLWR expressions. Maybe just because you happen to have this shiny new XML database which understands XQuery and you decide to map TMQL onto XQuery.
Feedback
Designing a language for oneself is not always easy, even in cases where you completely agree with yourself. Doing this in a group or as part of a committee does not necessarily make things easier, although discussions with other people help a lot. Creating a new language for a large group of developers with wildly varying backgrounds involves a lot of second-guessing. So if you would like to avoid a lack of feedback being misinterpreted as utter, unconditional bliss and happiness, then please let us know what you think. I promise that there will be no involvement of the Russian mafia.