Tuesday, August 4, 2015

Define.xml and stylesheets

I hesitated for a long time over whether to write this blog entry. What finally triggered it was an entry in the OpenCDISC forum asking: "Can the indentation of items in the sidebar be controlled for all font sizes? Indentations appear normally for small fonts, but become irregular for larger fonts. Technical note: I am viewing define.xml files in Internet Explorer."

First of all, this has nothing to do with OpenCDISC. Most probably, however, the writer used an OpenCDISC tool to generate the define.xml after having generated the SDTM files (which I consider bad practice) and then viewed the result in Internet Explorer.
The writer of that entry doesn't even seem to realize that a stylesheet is used to represent the define.xml as HTML in the browser. For him/her, define.xml is what is seen in the browser.

Define.xml is, however, much more: it contains the metadata of your submission in XML, not in HTML. So it can and should be used to validate the submission data themselves. Unfortunately, most validators don't even do that. The argument (sic): "Unfortunately the industry compliance with define.XML standard is not high enough to rely on user-provided metadata".
Wow!

But today I want to discuss a somewhat different topic: stylesheets.

The define.xml specification comes with a sample XSLT stylesheet that was developed by Lex Jansen (SAS), a member of the CDISC define.xml development team. It is one of the best stylesheets I have ever seen. Even so, we regularly read complaints from people who want it ... different. They do not seem to realize (or don't want to) that this is just a sample stylesheet, and that providing a stylesheet (not necessarily the one developed by Lex) is their own responsibility when submitting data to the FDA. So if they want changes to the stylesheet, they should make them themselves.

Now, what is an XSLT stylesheet?
A stylesheet transforms the XML into something else. The "T" in XSLT stands for "Transformation", doesn't it? In many cases, the transformation is to HTML (as in the define-stylesheet), but stylesheets can also transform XML into PDF, CSV, text files, SQL, or other XML...
So what users see in the browser when they think they are opening the define.xml is, in principle, not the define.xml itself, but the rendering of the HTML that the stylesheet generates from the information in the define.xml.
So, essentially (and don't misunderstand me), what-you-see-is-not-what-you-have.
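
To make this concrete, here is a minimal sketch in Python (standard library only) of how a browser learns which stylesheet to apply: the XML file carries an `xml-stylesheet` processing instruction pointing at the XSLT file. The define.xml content below is a heavily trimmed, hypothetical example, and the stylesheet file name in the `href` is an assumption for illustration.

```python
import re
from xml.dom import minidom

# Hypothetical, heavily trimmed define.xml; the href value is an assumed file name.
DEFINE_XML = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="define2-0-0.xsl"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="STUDY1"/>
</ODM>"""


def stylesheet_hrefs(xml_text):
    """Return the href of every xml-stylesheet processing instruction."""
    doc = minidom.parseString(xml_text)
    hrefs = []
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            match = re.search(r'href="([^"]*)"', node.data)
            if match:
                hrefs.append(match.group(1))
    return hrefs


# The root element is ODM, not html: the file itself contains no HTML at all.
root_name = minidom.parseString(DEFINE_XML).documentElement.tagName
print(root_name)
print(stylesheet_hrefs(DEFINE_XML))
```

Whatever file that `href` points to decides what appears on screen; swap the stylesheet and the same define.xml looks different.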

Now, Lex's stylesheet is an extremely good one, and it displays most of the information in the define.xml XML file in a very user-friendly way - you can trust Lex and the define.xml team.

Transformation means manipulation (in the good sense of the word). One can, however, also use stylesheets to manipulate data in the bad sense of the word. Now, we are all honest people, and we would never, never think about changing the define-stylesheet so that the information seen in the browser does not correctly represent what is in the define.xml XML file itself.

That is where the devil in me starts to speak ...

Let us look at a simple example: the "Key" column shown in the table where the variables of each dataset are defined. It looks like this (here for the DS domain):

[Screenshot: DS variable table with its "Key" column; image not preserved]


The XSLT for it in the define-stylesheet is:

        <xsl:for-each select="./odm:ItemRef">
        ...
        <td class="number"><xsl:value-of select="@KeySequence"/></td>
        ...
        </xsl:for-each>

Let us now make a small change to the stylesheet:

        <!-- added J.Aerts -->
        <!-- note: max() is an XPath 2.0 function, so this needs an XSLT 2.0 processor -->
        <xsl:variable name="MAXKEYSEQUENCE" select="max(./odm:ItemRef/@KeySequence)"/>
        <xsl:variable name="MAXKEYSEQUENCEPLUSONE" select="$MAXKEYSEQUENCE+1"/>
        <!-- end of addition J.Aerts -->
        <xsl:for-each select="./odm:ItemRef">
        ...
        <!-- <td class="number"><xsl:value-of select="@KeySequence"/></td> -->
        <xsl:choose>
            <xsl:when test="@KeySequence != ''">
                <td class="number"><xsl:value-of select="$MAXKEYSEQUENCEPLUSONE - @KeySequence"/></td>
            </xsl:when>
            <xsl:otherwise><td/></xsl:otherwise>
        </xsl:choose>
        ...
        </xsl:for-each>

And what you then see in the browser is:

[Screenshot: the same DS variable table, now with the "Key" values reversed; image not preserved]



Do you see the difference? The values for the "Key" have been reversed! I.e. the lowest key number has become the highest and the highest has become the lowest!
But we did not change anything in the define.xml file itself, did we? We only made a minor change to the stylesheet. Although this is a pretty harmless example, it demonstrates that the output of a stylesheet does not necessarily represent the source XML data.
Again, we are honest people, and we would never, never do something like this, and especially not when submitting data to the FDA.
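
The arithmetic behind the reversal is easy to check by hand. The following minimal Python sketch mimics what the modified stylesheet computes for each row, namely (highest KeySequence + 1) minus the row's own KeySequence. The four key values are hypothetical; they are not taken from any real define.xml.

```python
def displayed_key(key_sequence, max_key_sequence):
    """Mimic the modified stylesheet: show (max + 1) - KeySequence."""
    return (max_key_sequence + 1) - key_sequence


# Hypothetical KeySequence values for a dataset with four key variables.
source_keys = [1, 2, 3, 4]
max_key = max(source_keys)
shown = [displayed_key(k, max_key) for k in source_keys]

for k, s in zip(source_keys, shown):
    print(f"KeySequence in define.xml: {k} -> shown in browser: {s}")
```

The source values never change; only the number written into the HTML table cell does.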

So what do we learn from this?

- stylesheets should be validated. Does a stylesheet really truly visualize the data from the define.xml?
- it is the sponsor's responsibility (not that of Lex or of CDISC) to provide a stylesheet that truly visualizes what is in the define.xml
- the FDA should use its own stylesheets
- what you see in the browser (when a stylesheet is used), is not the define.xml
- the define.xml is a machine readable XML file defining the metadata for a submission and should be used as such
- what you see in the browser is just a human-friendly representation of what is in the define.xml - decisions should not be based on this "view"
- people should stop thinking of define.xml as a replacement for define.pdf
- in submission teams at sponsor companies, there should be at least one or two people with good XML knowledge (it's easy, my students learn it in just 2 x 1.5 hours)
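
As a rough illustration of the first and fifth points, here is a minimal Python sketch that reads the KeySequence values straight from the XML and compares them with what a rendered table shows. Everything in it is an assumption for illustration: the define.xml fragment, the OIDs, and the two hand-written HTML tables that stand in for real stylesheet output. A real check would run the actual stylesheet through an XSLT processor (not part of the Python standard library) and parse its output.

```python
import xml.etree.ElementTree as ET

ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Hypothetical define.xml fragment (OIDs and values invented for illustration).
DEFINE_FRAGMENT = """
<ItemGroupDef xmlns="http://www.cdisc.org/ns/odm/v1.3" OID="IG.DS" Name="DS">
  <ItemRef ItemOID="IT.DS.STUDYID" Mandatory="Yes" KeySequence="1"/>
  <ItemRef ItemOID="IT.DS.DOMAIN" Mandatory="Yes"/>
  <ItemRef ItemOID="IT.DS.USUBJID" Mandatory="Yes" KeySequence="2"/>
  <ItemRef ItemOID="IT.DS.DSDECOD" Mandatory="Yes" KeySequence="3"/>
</ItemGroupDef>
"""

# Hand-written stand-ins for stylesheet output: variable name, then key.
FAITHFUL_HTML = """
<table>
  <tr><td>STUDYID</td><td class="number">1</td></tr>
  <tr><td>DOMAIN</td><td class="number"></td></tr>
  <tr><td>USUBJID</td><td class="number">2</td></tr>
  <tr><td>DSDECOD</td><td class="number">3</td></tr>
</table>
"""
REVERSED_HTML = FAITHFUL_HTML.replace(">1<", ">X<").replace(">3<", ">1<").replace(">X<", ">3<")


def source_keys(define_xml):
    """KeySequence of each ItemRef, in document order ('' when absent)."""
    root = ET.fromstring(define_xml)
    return [ref.get("KeySequence", "") for ref in root.findall("odm:ItemRef", ODM_NS)]


def rendered_keys(html):
    """Text of the second cell of each table row ('' when empty)."""
    table = ET.fromstring(html)
    return [(row.findall("td")[1].text or "") for row in table.iter("tr")]


def rendering_matches_source(define_xml, html):
    """True if the displayed keys equal the keys stored in the XML."""
    return source_keys(define_xml) == rendered_keys(html)


print(rendering_matches_source(DEFINE_FRAGMENT, FAITHFUL_HTML))
print(rendering_matches_source(DEFINE_FRAGMENT, REVERSED_HTML))
```

The comparison treats the XML as the source of truth and the HTML as something to be verified against it, which is the direction of trust argued for above.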

Comments are as always extremely welcome!

3 comments:

  1. Well said Jozef! I agree with all of the above, and it's good that the points are being made in a public forum. Hopefully this will get a lot of views. There's far too much focus put on the stylesheet at the moment.

  2. Great post, Jozef. I agree that Lex did a nice job with the stylesheet. Among other new features, he tackled the thorny job of making the HTML print well. You can find Lex's latest stylesheet on the stylesheet library page on the CDISC wiki.

  3. Jozef, Thank you for the kind words.
    I think you hit the nail on the head. People are confusing the HTML rendition through the XSL stylesheet with the underlying XML many times.
