![]() |
![]() |
![]() |
||||||||||
| SQL as TM Query Language? No, thanks! < < Home | ||||||||||||
|
SQL as TM Query Language? No, thanks!
SQL as TM Query Language? No, thanks!
IntroductionWhile this might be a strange question now, especially since the TMQL[TMQL] specification is around the corner, it is a question which has emerged a number of times in the past. Obviously, the partial answers which have been provided in various mailing list conversations were lost or were not convincing enough. This article tries to remedies this, exhaustively, and for all times. Urbi et orbi. And I even try to say something intelligent at the end. Probably the most sophisticated (and recent?) proposal was that from Kal Ahmed and Graham Moore[TMRQL]. Their approach is to define a set of database tables (actually 20 views together with 6 functions). Given that setup, a topic map stored in these tables can then be queried using standard SQL. To substantiate the fitness of their ideas, they also solved most of the TMQL use cases[TMQLuc], omitting only those which would have to generate XML content or TM content. While one might think that this looks like a "If you have a hammer, then everything looks like a nail." approach, it has a certain appeal: It off-loads the real heavy-lifting onto an SQL engine. SQL is well understood, by both, engine developers and application developers who are users of the language. Creating and operating SQL databases is in most university curricula and coursework certifications. Let us look into some of the arguments in more detail. SQL, a Good Old Friend?More complex concepts like nested queries, aggregating, stored procedures and views, though, are not so pervasive in the industry as one might want to think. There are many casual application developers out there (me included), especially those doing web applications in VB or PHP. And for those, nested SQL queries, cursors and views and even transactions are unfamiliar ground, because they simply do not need them. SQL is also not so standard as one might want. While simple SELECT, UPDATE, INSERT statements are quite similar in their effects (small subtle differences remain), the portability fun begins when things get a bit more serious (anyone here who ported a MySQL application to Oracle recently?). But my critique of the approach goes much deeper. Epicycles in the 21st Century
The fact that I can store a topic map in a tabular model and that - consequently - I can
construe a query covering a particular use case, proves just that. The same procedure can be
repeated by storing XML content into a relational database (which many actually do) and query
that. It may also be possible (no, I did not really try that) to store a topic map in a text
file and use UNIX tools like All this proves that - given the necessary amount of tables - every data structure can be theoretically mapped onto a tabular model. And - given a finite set of use cases - that for each of them an SQL equivalent can be constructed. Slightly digressing, Ptolemeus, a second century A.D. astronomer came up once with the clever idea to explain seemingly strange planetary movements (they make loops when seen from earth). He used cycles moving on cycles [epicycles]. It is a good example of not having the right choice of an orthogonal system. Using many, many tables sounds very much like this. It may be partly blamed on the TMQL use case selection in the first place, that it did not overly stress the semantic aspect, i.e. those queries which involve this little bit of reasoning which includes instance-of and (transitive) subclass relationships to be used. The queries were also meant to exemplary; real-life queries are usually much more convoluted. And the use case document only implicitely covers application scenarios which include the formulation of (adhoc ontological) rules. AdequatenessLet us look at some use cases. With all your SQL knowledge, can you guess what this TMRQL query does? (reformatted for better reading): SELECT name_value FROM tm_assoc JOIN tm_nameValue on r1p=topic_id WHERE r1t=@author ORDER BY dbo.tm_sortName(r1p)OK, without any knowledge of TMQL (please ignore details in the syntax, this is all in flux), can you guess what this does?
SELECT $a / bn
WHERE
$a is-a author
ORDER BY $a / bn @ sort
It extracts the basenames of all things which are instances of 'author' and sorts this list
according to the name in the scope 'sort'. It also works for things which are indirectly
authors, such as 'science-fiction authors' and 'poets' which all subclass the concept
'author'. I doubt the TMRQL version honors this.
Another one: To find all documents in a map, sorted by publication date (ascending), only number 3 to 5, one would write in TMQL something like this: for $d in %_ // document order by $d / rd [* publication-date] return (position(), $d / bn`, $d / rd [* publication-date] `) [3..5]or - if you are one of these beardy Perl-hackers who want to do everything in a single line - you may write: %_ // document < $#, ./bn`, ./rd [*publication-date]` > order by @_[2] asc [3..5] The TMRQL solution for this is:
SELECT dbo.tm_displayName(topic_id), occur_value
FROM tm_occurValue
WHERE
occur_type = @pubdate AND
topic_id IN
( SELECT topic_id
FROM tm_topic WHERE (type_id=@document)
OR (type_id IN
( SELECT to_id
FROM dbo.tm_tc (@document, @subclasses, @superclass, @subclass)
)
)
)
I would not bother that it is much longer than the TMQL solution. But there are numerous
references to external variables and this query has three (!) levels of selects.
There are also impressive examples like the one which uses an SQL cursor. This is necessary because for the transitive property like 'is-influenced-by' in the relational model a temporary table has to be filled first with all the relevant entries. Only then a SELECT statement can use this knowledge. In a semantic environment transitivity, reflexitivity and other properties of associations are inherent. What the example shows is that the relational database does not natively understand this. TMQL does:
using adhoc[tolog] for
is-influenced-transitive ($A : influence, $C : influenced) :-
is-influenced-by ($A: influence, $B : influenced),
is-influenced-transitive ($B: influence, $C : influenced)
SELECT $influenced, "influenced by" , $influence
WHERE
is-author-of ($influencer: author, $influence: opus),
adhoc:is-influenced-transitive ($influence: influence, $influenced: influenced)
And a TMQL implementation might even choose to be 'lazy', i.e. return results as they are
computed. Having to fill a temporary table for TMRQL first would not allow such a thing.
I think, what this shown is that developers have to think in 'tabular terms' whereas the structure they want to query is a graph-like one. Generating Different OutputsOne of the design principles of TMQL is that it should be possible to not only generate tabular results, but also XML structured content and topic map content. The reasoning behind this is that XML application servers are here to stay for a while and that embedding TM search results as XML is very, very, very handy. That a TMQL expression can also generate TM content is a natural consequence of the formalism behind TMQL. So, if it is already there, why not expose it. If you need it (it is cool to generate maps from maps when you have to mediate between different ontologies) it is there, otherwise, simply ignore it. I do not buy the argument, that we should wait and see what bigger database vendors will give us developers in terms of XML integration. Especially not after having seen first attempts. ExtensibilityI am not precisely sure, how - in the TMRQL approach - functional extensions are handled. Semantically speaking, functions, but also predicates, more terminology, other rules, etc. are all part of an application domain. Ideally, one would use an ontology language for it, depending on the expressitivity needs. Probably one would have to write more SQL functions and add other tables. Either using SQL stored procedures or - given the, uhm, elegance of stored procedures, these functions are then written as part of the application.
What I can see in TMRQL is that one table - And Is it fast?One of the open areas is the issue of speed. SQL databases have matured over 25 years now, so the expectations are high. While I think that the jury will be out for some years, here a couple of thoughts. SQL databases are generic. They are not aware what data structure they are carrying, this is defined declaratively by the database schema. An SQL-based solution like TMRQL will have to fine-tune the generic engine by adding indices and/or caching techniques. What makes me sceptical that this works is that access and query patterns for semantic web applications are not along tables for which most of the optimization mechanics of relational databases has been developed. They are along semantic axes (whatever that is), so not only along relationships such as instance-of and subclass-superclass but also application-specific ones. I seriously wonder how a relational database will react on a high number of self-joins there. We also expect that TMQL queries can be sped up considerably if the TMQL processor knows more about the structure of the content. That is then maybe encoded in a TM constraint language or may be even hard-coded in a TMQL processor. It remains unclear, yet, how this can be converted into background knowledge to be usable for TMRQL. Formal UnderpinningA formal foundation can be helpful in many ways, one of them query plan optimization. Relational databases are said to have a clear formal foundation and most people would point to papers about relational algebra. Unfortunately, SQL - driven by industry needs - has long left this boat having introduced object-orientation (still to a limited degree, which keeps an impedance mismatch with OO programming), functions, triggers and much more other fluff. I am not aware of a standardized formal model which covers this all. In fact, this is something which has been worked on for a while for TMQL and my expectation is that the TMQL semantics is - via several steps - not only defined in prose, but also formally based on abstract algebra. As user you may not care, as engine developer, you might appreciate it. Politics, Politics, PoliticsAn often heard argument is "How many non-relational implementations are really of a commercial scale?". This argument is about 15 years old and has proven to be an excellent innovation stopper. Remember object-oriented database in the mid-80ties? This industry has been killed off successfully and now all Java developers use an object-oriented language and talk to a, well, objectish database. At best. Many still use pure SELECTs and somehow manually map the results into objects. To most advanced students the idea to make objects directly persistent in the database is absolutely alien. Not cool. So what about XML databases? Are they also only a temporary disturbance in the mainstream corporate SQL-centric universe? And should all full-text databases be imported into SQL stores? Probably not. There is also a techno-political aspect to using SQL as the technological foundation for a TMQL. SQL is a moving target. It is not moving fast, but it is moving and when and how it does that is beyond the control of the TM community. And there is also another, more serious angle to that. Forcing TMQL to be implemented on top of SQL engines only favours those database vendors which have implemented the necessary SQL machinery. And you can cut off a lot of fingers from your hands and still have more than this number of companies. And even if we choose to do so, how many different implementations of TMRQL could then be actually done? One? What should I do if I have stored my topic maps in a Tamino XML database? Export it to an MS-SQL server first? Brrr :-). Similar arguments also apply to the 'tool support' argument. Yes, there is tool support for SQL, and - cynically speaking - probably one needs that. I am increasingly sceptic when it comes to tools and development environments which help me to make something work. They only increase the distance between me and the problem and cause significant costs when environments have to be switched. This includes also the niceties like SQL query plan analyzers or debuggers. If I cannot understand it without tool support, then there is something intrinsically wrong. Tools are like crutches. They help you to limp along if you cannot walk on your own. I would find it extremely unfair if an ISO standard would force TM developers into such environments; it will definitely put off most of the people I work with. SummaryOverall, I think it is worth developing a dedicated language. TMQL, at its current complexity, does not overwhelm smaller development teams and first results show that it can be implemented more or less efficiently depending on your expectations. I also believe that TMQL is an industrial language, while it will hopefully not trigger regurgitation (yeah, I had to look that up) in purists. The specification will be very declarative, leaving implementation strategies as open as possible. It will be perfectly ok, for example, to compile a TMQL query expression into a SQL select statement. A valid concern is that developing a new language carries several risks.
Bibliography
codd70 (paper)
bn: Relational Model by Codd
oc: http://www.acm.org/classics/nov95/toc.html
tmrql (specification)
bn: TMRQL
oc: http://www.networkedplanet.com/download/TMRQL.pdf
(is-proposed-by-for)
proposer: p-kal-ahmed p-graham-moore
proposed: tmrql
usage: tmql-standard
epicycles
bn: Epicycles
oc: http://csep10.phys.utk.edu/astr161/lect/retrograde/aristotle.html]
tmql (query-language)
bn: TMQL
oc (homepage): http://www.isotopicmaps.org/tmql/
tmqluc (requirements-specification)
bn: TMQL Use Cases
oc: http://www.y12.doe.gov/sgml/sc34/document/0448.htm
|
|||||||||||