XML transformation flow processing

Jérôme Euzenat (INRIA Rhône-Alpes), Laurent Tardif (Fluxmedia)

Parts of this white paper were presented at the 2nd conference on Extreme Markup Languages, Montréal (CA), pp. 61-72, 2001.

Abstract:

The XSLT language is both complex in simple cases (such as attribute renaming or element hiding) and restricted in complex cases (complex information flows require processing multiple stylesheets). We propose a framework which improves on XSLT by providing simple-to-use and easy-to-analyse macros for common basic transformation tasks. It also provides a superstructure for composing multiple stylesheets, with multiple input and output documents, in ways not accessible within XSLT. Having the whole transformation description in an integrated format allows better control and analysis of the complete transformation.

Keywords: XML, XSLT, Transmorpher, Transformations

1 : Introduction and motivation

In electronic documentation, the notion of a transformation is widespread. It is a process that transforms a source document into another document, the target. With the advent of the XML language, this notion now concerns the whole computing discipline.

As there are multiple computing practices, there are multiple needs for a transformation system. We motivate and present here a system that targets increased intelligibility in the expression of transformations. We first motivate this need and then explain why, in our opinion, XSLT falls short of the objectives of simplicity and power. The requirements for such a system are then presented.

1.1 : Motivating example

Consider someone wanting to generate part of a web site concerning bibliographic data. The source of information is a set of XML formatted bibliography documents, containing reference elements described by authors, title or abstract elements. The system aims at providing several different documents:

The first two documents must first have been stripped of abstract and non-public information.

The generation of the first three documents can be naturally expressed by the following schema in which boxes are transformations written in some transformation language (e.g. XSLT) and strip-abstract is the simple suppression of abstract elements, of elements marked as private and of mark attributes.

Figure: A sample transformation flow (flowing from left to right)

The picture represents what we call a transformation flow, i.e. a set of transformations linked by information channels. It is worth noting that a transformation flow does not indicate whether it must be processed in a demand-driven (pull) or data-driven (push) manner.

1.2 : Limitations of XSLT

XSLT [clark1999a] is a very powerful technology for transforming XML documents which has been carefully designed for rendering. It has the advantage of being based on XML itself. Of course, all the manipulations that have been described in the example can be expressed in XSLT. Yet, XSLT suffers from a few shortcomings that make it both too sophisticated and too restricted at once. These shortcomings are:

Complexity

Writing simple transformations (tag translation, tree decoration, information hiding such as the abstract stripping in the example) requires knowledge of XSLT even though they can be expressed in a straightforward manner by the user. There is no simple way to implement these transformations in XSLT.

Lack of intelligibility

Although an XSLT stylesheet is easy to parse, it is not easy to understand, because roughly the same construct, with many different parameters, is used for writing both simple and sophisticated transformations. To XPath [clark1999b], XSLT [clark1999a] adds seven extension functions, including document() for processing multiple documents. The treatment of multi-source transformations through this document() construct contributes to XSLT's lack of intelligibility, as does the strong asymmetry with output specification, which is handled through an explicit output XSLT construct (this could change somewhat with XSLT 1.1 [clark2001a], though not in a totally satisfying way).

As a consequence, building analysis tools is rather complicated. This is one of our deeper motivations. Analyzing transformation flows can be used for many purposes from displaying them (as in the above picture) to assessing the properties preserved by some transformations (and going towards proof-carrying transformations) and optimizing transformations.

Non self-sufficiency

Writing complex transformation flows involving independently designed stylesheets, multi-document dataflow and closure operation requires the use of an external environment (scripting language, shell) and compromises portability of the transformation flows.

Other work has already addressed this issue, namely the AxKit and Cocoon projects, which implement pipelining of stylesheets. However, since they are concerned with demand-driven documents, they do not address the multi-output and complex dataflow issue (i.e. when a document generates several outputs that are themselves subject to independent transformations and can eventually be merged later on). These issues are important for future XML-based information systems.

Limited power

XSLT is not so limited as it may appear. But it has been designed in such a way that some powerful operations are difficult to process. A good example is the closure operation (i.e. applying a stylesheet until its application does not change the document anymore). Such an operation is very powerful and can be written very concisely. It can be used for gathering all the nodes of a particular graph (e.g. flattening a complete web site into one document can be seen as a closure operation).
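The closure operation described above can be sketched outside any particular transformation language. The following Python fragment is only an illustration: the names closure, follow_links and LINKS are hypothetical, the "document" is reduced to a set of page names, and the max_rounds guard is an added assumption to keep the loop from running forever.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def closure(step: Callable[[T], T], document: T, max_rounds: int = 1000) -> T:
    """Apply `step` repeatedly until its application no longer changes the document."""
    for _ in range(max_rounds):
        result = step(document)
        if result == document:
            return result
        document = result
    raise RuntimeError("no fixed point reached within max_rounds")


# Hypothetical link graph standing in for a web site: page -> pages it links to.
LINKS = {"index": {"a", "b"}, "a": {"c"}, "b": set(), "c": {"a"}}


def follow_links(pages: frozenset) -> frozenset:
    """One step: add every page directly linked from the pages gathered so far."""
    reached = set(pages)
    for page in pages:
        reached |= LINKS.get(page, set())
    return frozenset(reached)


# Gathering all the nodes reachable from the entry page is a closure of follow_links.
site = closure(follow_links, frozenset({"index"}))
```

Here each step only adds pages, so the iteration is guaranteed to stop once every reachable page has been gathered; a stylesheet used as the step would need a comparable guarantee.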

Those who need these operations can implement them either outside XSLT (in a non-portable shell) or inside XSLT (with extra contortions).

This limited power issue is sometimes described as a lack of side effects. However, XSLT does provide side effects: a transformation can be applied to a document, and that document can then be read through the XSLT-specific document() function. Moreover, recursive expressions can be written inside XSLT, as shown by [kay2000a][becker2000a].

In order to overcome these problems, we have started to design a system that relies on XSLT and attempts to remain compatible with it but embeds it in a superstructure. This system, called Transmorpher, is the subject of this paper.

1.3 : Requirements

Transmorpher is an environment for processing generic transformations on XML documents. It aims at complementing XSLT in order to:

The guidelines of the proposal are the following:

In the remainder of this paper, the design of Transmorpher is presented. The next section presents its computing model involving the composition of transformations. Then, the built-in abstract basic transformations which can be handled by Transmorpher are presented. The notion of rules for expressing straightforward transformations in a drastic simplification of XSLT is detailed. We end with a quick description of the current implementation and a comparison with other work.

2 : Computing model

Transformation flows are made of sets of transformations connected by channels on their input/output ports. Transformations can in turn be either transformation flows or elementary transformations. Channels carry the information to be transformed (currently, only XML, in the form of SAX events [boag2000a]). Transformations can take several inputs and provide several outputs during one execution. The Transmorpher computing model is thus rather simple.

Transmorpher enables the description of transformation flows in XML. It also defines a set of abstract elementary transformations that are provided with an interface and an execution model. Currently, the available transformations are: generators, serializers, rule set processors, dispatchers, mergers, query evaluators, external processing calls and iterators.

The interpretation of a transformation flow consists of creating the transformations, connecting them through channels and providing input to the source input channels. This interpretation can be triggered at the shell level, or programmed in another application and we are working towards making it usable as a servlet.

Transmorpher is thus made of two main parts: a set of documented Java classes (which can be refined and integrated in other software) and an interpreter of transformation flows. The transformation flows can be specified by programming the class instantiation in Java or by describing it in XML.

2.1 : Processes

The transformation flows are described in an XML document which clearly separates the rules from the processing. The transformation flows are described through a process element. There can be several such processes in one document.

The processes contain a set of subprocesses, whose main types are:

<apply-process name='name'/>

which calls an already defined process,

<apply-ruleset name='name' strategy='strategy' />

which applies a set of rules (equivalent to an XSLT stylesheet) to its input,

<apply-external type='type' file='file' />

which calls an external procedure on the input and must provide the output. This engine could be Perl, XSLT, or whatever is appropriate.

<apply-query name='name' type='type' file='file' />

which evaluates a query on the input and must provide the output. This query engine could be XQL, SQL, or whatever is appropriate.

<repeat times='n' buffer='channels'/>

which applies the contained treatment a particular number of times (or until the input and output are the same). Buffering channels are provided for expressing the information flow.

Other instrumental subprocesses are:

<dispatch type='type'/>

which takes one input and several outputs,

<merge type='type'/>

which takes several inputs and one output,

<generate type='type'/>

which takes no input and one output (generally used to read from outer streams like files),

<serialize type='type'/>

which takes one input and no output (generally used to write to outer streams like files).

Each of these primitives has an id (enabling the identification of subprocesses of the same kind) and in and out attributes (enabling their connection to other processes). The name attribute denotes an element defined within the current transformation (or an imported one). The type attribute identifies a particular implementation of the basic process.
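As a rough illustration of the two instrumental primitives with several ports, the sketch below models a channel as a plain Python list of items standing in for SAX events. The function names broadcast and concat echo the type values used in the sample flow, but their exact behaviour in Transmorpher is an assumption here, not a description of the actual Java classes.

```python
from typing import Iterable, List


def broadcast(events: Iterable[str], n_outputs: int) -> List[List[str]]:
    """Dispatcher sketch: one input, several outputs.

    Assumption: every output channel receives its own copy of every input event.
    """
    buffered = list(events)
    return [list(buffered) for _ in range(n_outputs)]


def concat(*inputs: Iterable[str]) -> List[str]:
    """Merger sketch: several inputs, one output.

    Assumption: the inputs are emitted one after another, in declaration order.
    """
    merged: List[str] = []
    for channel in inputs:
        merged.extend(channel)
    return merged


# Hypothetical channel contents: two sources merged, then sent to two consumers.
r3 = concat(["ref1"], ["ref2", "ref3"])
x11, x12 = broadcast(r3, 2)
```

Because each output is an independent copy, a process consuming x11 cannot disturb what another process reads from x12, which matches the idea that channels denote independence between branches of the flow.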

Adding other basic processes to Transmorpher should be simple because they are simply other elements to add to the DTD.

Below is a Transmorpher transformation flow, in which the processGeneral process corresponds to the flow described by the figure above.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE transmorpher SYSTEM "../../dtd/transmorpher.dtd">
<transmorpher name="generateBiblio"
              xmlns="http://transmorpher.fluxmedia.fr/1.0"
              xmlns:regexp="xalan://fr.fluxmedia.transmorpher.regexp.RegularExpression">

  <ruleset name="stripAbstract">
    <remtag match="abstract" context="reference"/>
    <remtag match="keywords" context="reference"/>
    <remtag match="areas" context="reference"/>
    <remtag match="softwares" context="reference"/>
    <remtag match="contracts" context="reference"/>
    <remtag match="*[@status='hidden']"/>
    <resubst match="conference/@issue" source="([0-9]+)" target="$1e"/>
    <rematt match="status"/>
    <rematt match="isbn" context="book"/>
  </ruleset>

  <query name="troncybrunet" type="tmq" root="bibliography">
    <select match="bibliography/reference[authors/p/@last='Troncy']"/>
    <select match="bibliography/reference[authors/p/@last='Brunet']"/>
  </query>

  <process name="processGeneral" in="R112" out="X3 Y3 Z3">
    <dispatch type="broadcast" id="dispatch 2" in="R112" out="D1 D3"/>
    <apply-ruleset ref="stripAbstract" id="StripAbstract" in="D1" out="X1"/>
    <dispatch type="broadcast" id="dispatchStripped" in="X1" out="X11 X12"/>
    <apply-external type="xslt" id="SortTypeYear" file="biblio/sort-ty.xsl" in="X11" out="X2"/>
    <apply-external type="xslt" id="FormatHTML" file="biblio/form-harea.xsl" in="X2" out="X3"/>
    <apply-external type="xslt" id="SortCategYear" file="biblio/sort-cya.xsl" in="X12" out="Y2"/>
    <apply-external type="xslt" id="FormatBib" file="biblio/form-bibtex.xsl" in="Y2" out="Y3"/>
    <apply-external type="xslt" id="FormatXML" file="biblio/xmlverbatimwrapper.xsl" in="D3" out="Z3"/>
  </process>

  <process name="processByNames" in="R111" out="Z34">
    <apply-query type="tmq" ref="troncybrunet" id="FilterTB" in="R111" out="Z34" />
  </process>

  <main name="ProcessBiblio">
    <generate type="readfile" id="bibexmo" out="R1" file="biblio/bibexmo.xml"/>
    <generate type="readfile" id="je" out="R2" file="biblio/je.xml"/>
    <merge type="concat" id="merge" in="R1 R2" out="R3"/>
    <dispatch type="broadcast" id="dispatch1" in="R3" out="X11 X12"/>
    <apply-process id="generateFormat" ref="processGeneral" in="X11" out="X31 Y31 Z31" />
    <serialize type="writefile" id="writeHTML" in="X31" file="biblio/biblio.html" />
    <serialize type="writefile" id="writeBIB" in="Y31" file="biblio/biblio.bib"/>
    <serialize type="writefile" id="writeXML" in="Z31" file="biblio/biblio-xml.html"/>
    <apply-process id="processTB" ref="processByNames" in="X12" out="W3" />
    <serialize type="writefile" id="writeTB" in="W3" file="biblio/biblio_tb.html"/>
  </main>

</transmorpher>
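To make the intent of the stripAbstract ruleset more concrete, here is one possible reading of three of its rules, written with Python's standard xml.etree.ElementTree module. This is a sketch under assumptions: it supposes that remtag removes the matched elements (within the given context element, if any) and that rematt removes the matched attribute. The function name strip_abstract and the SAMPLE document are hypothetical and not part of Transmorpher.

```python
import xml.etree.ElementTree as ET


def strip_abstract(root: ET.Element) -> ET.Element:
    """Apply an assumed reading of three stripAbstract rules to `root`, in place."""
    # Assumed reading of <remtag match="abstract" context="reference"/>:
    # drop abstract children of reference elements.
    for reference in list(root.iter("reference")):
        for abstract in reference.findall("abstract"):
            reference.remove(abstract)
    # Assumed reading of <remtag match="*[@status='hidden']"/>:
    # drop any non-root element whose status attribute is 'hidden'.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("status") == "hidden":
                parent.remove(child)
    # Assumed reading of <rematt match="status"/>:
    # drop the status attribute wherever it remains.
    for element in root.iter():
        element.attrib.pop("status", None)
    return root


# Hypothetical input: one public reference with an abstract, one hidden reference.
SAMPLE = """<bibliography>
  <reference status="public"><title>T1</title><abstract>A1</abstract></reference>
  <reference status="hidden"><title>T2</title></reference>
</bibliography>"""

stripped = strip_abstract(ET.fromstring(SAMPLE))
```

The procedural version needs explicit traversal and care about mutating the tree while iterating, which is the kind of detail the declarative rules are meant to hide.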

2.2 : Channels

In Transmorpher, the generic processes can have several input and several output ports. These ports are connected to channels that are fed by the output of a process and can be used as input of other processes. Channels are abstractions that enable the expression of the flow of information in a compound transformation; they are not the mark of a particular implementation. The set of channels is called the dataflow.

The channels specify a unit in which processes can read and write. They are named streams which can be visible from outside a process if they are declared as its input or output.

The control inside a process can be deduced from the dataflow. There is no explicit operator for parallelizing or composing transformations: their channels denote composition, precedence and independence of processes.
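The way control follows from the dataflow can be sketched by ordering subprocesses according to the channels they read and write. The fragment below is a minimal illustration, not Transmorpher's actual scheduling: the function name execution_order is hypothetical, the STEPS list copies one branch of processGeneral (deliberately shuffled), and the sketch assumes each channel has a single producer and that the flow has no cycle (so it does not cover repeat with buffering channels).

```python
from typing import Dict, List, Set, Tuple

# (id, input channels, output channels) for one branch of processGeneral.
STEPS: List[Tuple[str, List[str], List[str]]] = [
    ("FormatHTML", ["X2"], ["X3"]),
    ("SortTypeYear", ["X11"], ["X2"]),
    ("dispatchStripped", ["X1"], ["X11", "X12"]),
    ("StripAbstract", ["D1"], ["X1"]),
    ("dispatch 2", ["R112"], ["D1", "D3"]),
]


def execution_order(steps: List[Tuple[str, List[str], List[str]]]) -> List[str]:
    """Order steps so that each one runs after the producers of its input channels."""
    # Which step writes each channel (assumes a single producer per channel).
    producer: Dict[str, str] = {ch: sid for sid, _, outs in steps for ch in outs}
    # A step depends on the producers of its inputs; external inputs have none.
    depends_on: Dict[str, Set[str]] = {
        sid: {producer[ch] for ch in ins if ch in producer} for sid, ins, _ in steps
    }
    ordered: List[str] = []
    done: Set[str] = set()
    while len(ordered) < len(steps):
        ready = [sid for sid, _, _ in steps
                 if sid not in done and depends_on[sid] <= done]
        if not ready:
            raise ValueError("cyclic dataflow: no step is ready")
        for sid in ready:
            ordered.append(sid)
            done.add(sid)
    return ordered


order = execution_order(STEPS)
```

Steps that become ready in the same round share no channel dependency, so they are independent and could run in parallel; precedence appears only where one step consumes a channel another produces.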

Alternative solutions to channels could have been retained in order to deal uniformly with input/output. The solution taken by [drewes2000a] consists of considering each transformation as a function from one (not necess