Saturday, December 29, 2012

Another copy-and-paste frustration

Yesterday and today, I worked on implementing the new SDTM oncology and a number of the new draft SDTM-IG 3.1.4 domains in my SDTM-ELT(TM) software. For each new domain, I created a template define.xml file that can be read by the software.

It was so frustrating!

The new domains come as a PDF file with tables and additional information. During creation of the templates I continously needed to switch between the PDF and between the (smart) XML editor I am using. I needed to copy the variable name, role and datatype and then paste it into the template. I then also needed to add whether the variable is mandatory (required/expected) or not (permissible).

The validation whether everything was correctly implemented was purely visual, comparing a view of the machine-readable template with the tables in the PDF file. Another frustrating task...

In earlier days, the SDTM team also published these tables in an Excel file. Although not a vendor-neutral format, this Excel file at least allowed to automate some things. The way I proceeded was as follows:
First I read the Excel file into OpenOffice Calc (a competitor of Excel). Then I exported the tables as an OpenOffice "odf" file, which is essentially a zip file containing a set of XML files. After renaming and unzipping, one needs to take one of these XML files, and transform it to CDISC define.xml using a stylesheet.
Although each new publication of the SDTM team had differently formatted Excel files, meaning that I needed to adapt the stylesheet to transform to define.xml each time, it was a very good way, and I could generate the templates in a few hours.

The CDISC SDTM team does not publish their tables in a machine-readable format anymore. I do not know why, because I cannot imagine that the team is not using any computer tools to help in generating the tables. So why not publish them in a machine-readable format?

SDTM table definitions in a machine-readable format - that would be cool!

SDTM Supplemental Qualifiers

From the document "Proposal for Alternate Handling of Supplemental Qualifiers" of the CDISC SDTM Team:
"The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the Non-Standard-Variables (NSVs) together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers. This document describes a proposed method by which NSVs and standard variables could be represented together in the parent datasets".

Or in other words: FDA reviewers are uncapable of merging/combining datasets (tables) so that the non-standard variables appear in their original table!

Oh my God: even my undergraduate students would be able to do that!

But why haven't the non-standard-variables not always be kept in their original table? Because they would then not be distinguishable from the standard variables? Yes, it is true that SAS Transport 5 does not have the capability of keeping metadata so that standard-variables and non-standard-variables can be distinguished, except for by naming conventions.

Wait a minute!
Metadata belong in the define.xml isn't it?
So why not just state in the define.xml that a variable is a non-standard variable? For example, suppose we have a non-standard variable "QSLANG" ("Question Language") in the QS (Questionnaires) domain. In the current SDTM implementation it is "banned" to the SUPPQS dataset.

We just keep that variable in the QS dataset and write in the define.xml:

<ItemGroupDef OID="QS" Name="QS" ... def:Label="Questionnaires" ...>
    <ItemRef ItemOID="STUDYID" Mandatory="Yes" OrderNumber="1"/>
    <ItemRef ItemOID="DOMAIN" Mandatory="Yes" OrderNumber="2"/>
    <ItemRef ItemOID="USUBJID" OrderNumber="3"/>
    ....
    <ItemRef ItemOID="QS.QSLANG" OrderNumber=".." def:IsNonStandardVariable="Yes"/>
</ItemGroupDef>

and of course define the corresponding ItemDef with OID "QS.QSLANG":
<ItemDef OID="QS.QSLANG" Name="QSLANG" def:Label="Question Language" DataType="text" Length="2"/>

DONE!

Does this help the reviewer in distinguishing between standard variables and non-standard variables?
Of course not very much. The FDA reviewers usually print out the define.xml (using the stylesheet view) and then inspect the XPT datasets in parallel. Not very modern isn't it?
The simple reason is that the viewer cannot read define.xml files.

Suppose however that the datasets do not come as XPT files, but as CDISC ODM (XML) files.
In CDISC ODM there is always a strong relation between the variable definition (ItemDef) and the data value (ItemData) over the OID. So for example, we would have (extending the above example):

<ItemDef OID="QS.QSLANG" Name="QSLANG" def:Label="Question Language" DataType="text" Length="2"/>

and the value:
<ItemGroupData ItemGroupOID="QS">
    <ItemData ItemOID="STUDYID" Value="STUDYX"/>
    <ItemData ItemOID="DOMAIN" Value="QS"/>
    <ItemData ItemOID="USUBJID" Value="P0001"/>
    ...
    <ItemData ItemOID="QS.QSLANG" Value="en"/>
</ItemGroupData>

This allows us to create software that tabulates the data and e.g. colors it depending whether it is a standard variable or a non-standard variable. This is very easy. It even can be done using an XSLT stylesheet when one wants to display the data in a browser.

It even creates new opportunities, such as:
- the software/stylesheet can easily validate the data, and e.g. color cells in the table (with an additional tooltip) for which there is an SDTM rule violation. For example an empty "QSREASND" can be colored red when the value of "QSSTAT" is "NOT DONE", this as the rule is that QSREASND may not be empty when QSSTAT is populated.
- In the LB (Laboratory) domain, values of "LBSTRESN" can be colored when the value is outside the reference range, i.e. lower than the value from LBSTNRLO or higher than the value of LBSTNRHI.
- rows that contain baseline data (--BLFL=Y) are colored differently
- data is sorted based on more than one criteria

All this is currently NOT possible using the SAS XPT Viewer, because it cannot read the define.xml containing the metadata.

There are much more opportunities: the number of them is only limited by your creativity.

Once we know how the future(?) ODM-based XML format for SDTM (and SEND and ADaM) data will look like, we will pilot such software and demonstrate it.
So ... keep tuned!