Lessons Learned and Future Directions

Developing, maintaining and operating the server over an extended period has taught us lessons along at least two axes. On the technology axis, we detected some performance bottlenecks but also gained experience in hosting relatively complex functionality, such as that provided by TM processing software, inside the Apache/mod_perl environment. On the content axis, we had to learn that covering sizeable and volatile areas like the "Internet" or "XML" with topic maps is a rather time-consuming process.

Technology Experiences

To improve the overall performance of the system, we used a variety of techniques:

  • The obvious one is to cache parts of the output or even completely rendered documents. Both Mason and AxKit support this in a rather sophisticated way. In the case of Mason, caching can be memory-based or use a very fast persistent storage. In either case it is component-based, which allows fine control over the caching period. Only the application developer can reliably determine how long a particular piece of information can and should reasonably be cached.

  • To reduce the impact of loading a topic map from the MySQL backend, we experimented with caching a whole map within mod_perl. The early assumption was that users would spend some time within one and the same map. While this provided a noticeable speedup at first, it introduced considerable complexity into the code. As the number of users increased, the above assumption could only be upheld if we allowed a pool of maps to be cached temporarily. Together with the high memory consumption of mod_perl itself, this eventually proved to be prohibitively expensive.

  • A much simpler change was to utilize a Linux cluster. One of the available modules for Apache (mod_backhand) allows individual web servers to communicate their load to each other. Whenever a request comes in, a front web server decides whether it should handle the request itself or whether it is more favourable to proxy the request to one of the web servers in the cluster.

    The speedup was, as can be expected, linear in the number of machines involved. More interestingly, the best strategy for dispatching requests was a completely random one and not one based on a specific load distribution. Our explanation for this behaviour is that the load information the servers have about each other quickly becomes obsolete, as it is broadcast only once every second. If a typical request lasts roughly a second as well, then the front-end server has no up-to-date assessment of individual loads, and the quality of the balancing decision deteriorates to the point of being inferior to a random choice.
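
The effect of stale load information can be illustrated with a toy simulation. The Python sketch below is not a model of mod_backhand's actual algorithm; the function name, the tick-based clock and all parameter values are assumptions chosen for illustration. It compares random dispatch with a "least loaded" choice based on load figures that are refreshed only once per request lifetime.

```python
import random


def peak_load(strategy, servers=4, ticks=200, per_tick=10,
              duration=10, refresh=10, seed=1):
    """Return the highest number of concurrent requests seen on any one server."""
    rng = random.Random(seed)
    active = [[] for _ in range(servers)]   # finish times of running requests
    reported = [0] * servers                # last broadcast load figures
    peak = 0
    for now in range(ticks):
        # Retire requests that have finished by this tick.
        active = [[end for end in queue if end > now] for queue in active]
        # Load figures are only broadcast every `refresh` ticks.
        if now % refresh == 0:
            reported = [len(queue) for queue in active]
        for _ in range(per_tick):
            if strategy == "random":
                target = rng.randrange(servers)
            else:  # "stale": trust the last broadcast, however old it is
                target = reported.index(min(reported))
            active[target].append(now + duration)
        peak = max(peak, max(len(queue) for queue in active))
    return peak


stale_peak = peak_load("stale")     # every request in a period hits one server
random_peak = peak_load("random")   # requests spread over all servers
print(stale_peak, random_peak)
```

With these assumed parameters the broadcast interval equals the request duration, so the "stale" strategy sends every request of a period to the same server and all 100 in-flight requests pile up there, while random dispatch spreads them across the four servers.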

The TM software packages proved to be fairly reliable but, by themselves, were rather CPU-intensive. A more severe problem, though, was the size of the data structures involved. A single map, especially when combined with all rendering information inside a topic map view, easily consumed several MB of main memory. This, and the fact that we used a fair number of third-party Perl packages, resulted in Apache children using up more than 40 MB each. To avoid a shortage of main memory we had to limit the number of concurrent requests. To mitigate memory leakage we also put an upper limit on the size of an Apache child.
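
The arithmetic behind these two limits can be sketched as follows. Only the 40 MB per-child figure comes from the text; the machine size, the memory reserve and the size limit are assumed values, and both helper functions are hypothetical. In an Apache/mod_perl setup such limits are commonly expressed through the MaxClients directive and a module such as Apache::SizeLimit; the text does not say which mechanism was used here.

```python
def max_children(total_ram_mb, reserved_mb, child_mb):
    """Largest number of children that fit into the RAM left after the reserve."""
    return max(0, (total_ram_mb - reserved_mb) // child_mb)


def should_retire(child_size_mb, limit_mb):
    """True once a child has grown past the size limit and should be replaced."""
    return child_size_mb > limit_mb


CHILD_MB = 40          # per-child footprint reported in the text
TOTAL_RAM_MB = 1024    # assumed machine size
RESERVED_MB = 224      # assumed reserve for the OS, MySQL and other services
SIZE_LIMIT_MB = 60     # assumed bound beyond which a leaking child is recycled

limit = max_children(TOTAL_RAM_MB, RESERVED_MB, CHILD_MB)
print(limit)  # 800 MB of usable RAM / 40 MB per child = 20 children
```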

Content Experiences

The initial vision to invite external authors to write for free clearly failed. Some authors understandably would only consider writing as a commercial venture, while more academic candidates preferred to submit the content as a conference paper. Others who might have been inclined eventually built their own site or blog. As a consequence we had to feed the server with content on our own, which - also with the help of research students - proved to be fairly workable.

Map Content

The more we came to share maps produced by multiple authors, the more it became apparent that everybody has their own idiosyncrasies with regard to knowledge representation. This has led to very different authoring styles: some authors systematically use rather generic types for topics and associations, such as machine or process, even though they were not using one of the available top-level ontologies. Not surprisingly, others had a much more ad hoc approach.

While this diversity is interesting to use in a learning process, higher quality of map content can only be achieved by using baseline ontologies. These would force authors to use a specific vocabulary and a consistent set of association structures.

We also realised that categorising and arranging our maps into some globally valid hierarchical structure is bound to fail. First, there is no single all-encompassing map of The Universe. Second, when we tried to factor this vast domain into smaller elements, vastly different organisations of maps resulted. Paradoxically enough, this resulting "patchwork" structure is exactly what is necessary for efficient joint authoring.

Map Views

Our experience with presenting content via views has been positive throughout, apart from minor annoyances such as the inability to split larger topics into several consecutive slides or to embed images.

The big promise, though, the free sharing of information, has not yet been entirely fulfilled. It is mainly the differences between authors in abstraction level and in the amount of textual content that have made maps less reusable than they could be. This is definitely an axis on which further development and self-discipline have to be exercised.

Future Plans

One of the bigger annoyances on the current site is the user interface. On the one hand, it is complex enough to regularly confuse casual users; on the other hand, it is not powerful enough for serious ontology applications and customization. Accordingly, we will completely redesign it, adding functionality such as fulltext querying, stylesheet selection and facilities for applying map constraints and queries. To simplify the interface, though, all these features will be organized into function bars which are mostly hidden by default.

The most important call for action, though, is for upgrading the TM software itself. In this process we will deploy a new dedicated TM datastore. Not only can it serve complete topic maps to clients, but it is also capable of processing Topic Map query language expressions. This is expected to significantly cut down the complexity of our Mason components. The server is also designed in such a way that topic map data can be distributed over several machines, so that - theoretically - query expressions can be factorized and spread over several nodes in a topic map cluster environment.

The TM server's capability to process more complex queries will not only simplify the components related to the user interface, but will also allow topic maps to be separated from their views more cleanly. Views can actually be formulated completely as a single (alas, longish) query statement which returns an XML structure containing all the information necessary to render every topic in the view.

This new server version will be able to host not only topic maps, but also ontologies and queries in its database. Together with a new driver infrastructure this will allow us to implement virtual topic maps [BaVirtualMaps]. With these we can wrap a topicmappish access layer around existing resources. A DNS server or, say, a relational database can thus be treated as if it were a topic map.
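
To make the idea of such a wrapper more concrete, the following Python sketch exposes the rows of one relational table as topic-like records. It is only an illustration under stated assumptions: the class TableBackedMap, its methods, the record layout and the example host table are all hypothetical, and they do not reflect the interface of the TM software or of the driver infrastructure described in [BaVirtualMaps].

```python
import sqlite3


class TableBackedMap:
    """Read-only, topic-map-like view over one relational table (hypothetical)."""

    def __init__(self, conn, table, id_column, name_column):
        self.conn = conn
        self.table = table
        self.id_column = id_column
        self.name_column = name_column

    def topics(self):
        """Yield one topic-like record per row; nothing is copied or stored."""
        sql = 'SELECT "{0}", "{1}" FROM "{2}" ORDER BY "{0}"'.format(
            self.id_column, self.name_column, self.table)
        for row_id, name in self.conn.execute(sql):
            yield {"id": "{}-{}".format(self.table, row_id),
                   "type": self.table,
                   "name": name}

    def topic(self, topic_id):
        """Return the topic with the given id, or None if there is none."""
        for candidate in self.topics():
            if candidate["id"] == topic_id:
                return candidate
        return None


# Example data: an in-memory table standing in for an existing resource.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, fqdn TEXT)")
conn.executemany("INSERT INTO host (id, fqdn) VALUES (?, ?)",
                 [(1, "www.example.org"), (2, "mail.example.org")])

hosts = TableBackedMap(conn, "host", "id", "fqdn")
names = [t["name"] for t in hosts.topics()]
```

The records are computed on demand from the underlying table, which is what makes the map "virtual": the relational data stays where it is and only the access layer looks like a topic map.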

