Tuesday, October 4, 2016

SDTM-IG in machine-readable format

Each time a new version of the SDTM or SDTM-IG is published, people suffer. The FDA suffers because it needs to adapt its database, software vendors suffer because they need to generate new templates, and builders of validation engines suffer because they must make their own interpretation of the rules. And all of that starts from a ... PDF document.

If I look at the SDTM-IG, I see a lot of structured information. Sections like "Description/Overview", "Specification", "Assumptions" and "Examples" appear for each described domain. "Structured" means that it must be possible to put the information in an XML document and make good parts of it machine-readable, so that tasks such as template building, rule generation and database setup can be automated.
When only the PDF document is used, lots of copy-and-paste needs to be done, leading to a lot of frustration and errors. Fortunately, the situation has somewhat improved, as parts of the IGs can now be downloaded in the form of Excel files and even a define.xml. But even then, lots of information is still only available as PDF text or tables, not to mention the business rules that later go into validation tools (through a very subjective interpretation step) that are out of the control of CDISC.

So I made a very first attempt to do something about that. It is still very primitive, but may be a starting point for a more serious attempt. I limited myself to the EX and EC domains in the Interventions class, which can be found in the "Section 6.1 - EX and EC Domains" portfolio of the SDTM-IG 3.2 PDF document.
I started with the highly structured information. In the XML it is:


It contains all the information one can also find in the SDTM-IG, plus some additional information such as the recommended Define-XML datatype (the IG only mentions the SAS-XPT datatype and mixes up controlled terminology and controlled format). For the controlled terminology, it explicitly states the type and ID of the CDISC-NCI codelist, e.g.:


and if the codelist is "sponsor defined":
 



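As a rough illustration of the idea (not the actual markup used in my XML), the following Python sketch reads a variable specification that carries a Define-XML datatype and a codelist reference. The element and attribute names (`Variable`, `CodeListRef`, `DefineType`, `Type="SponsorDefined"`) are assumptions made up for this example, and the attribute values should be checked against the IG before being relied upon.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only.
SPEC_XML = """
<Domain Name="EX">
  <Variable Name="EXDOSU" Label="Dose Units" XPTType="Char" DefineType="text">
    <CodeListRef Type="CDISC-NCI" ID="C71620" Name="UNIT"/>
  </Variable>
  <Variable Name="EXCAT" Label="Category of Treatment" XPTType="Char" DefineType="text">
    <CodeListRef Type="SponsorDefined"/>
  </Variable>
</Domain>
"""


def read_variables(xml_text):
    """Return one dict per variable with its datatype and codelist reference."""
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.findall("Variable"):
        ref = var.find("CodeListRef")
        variables.append({
            "name": var.get("Name"),
            "define_type": var.get("DefineType"),
            # Type tells a tool whether the codelist is published or sponsor defined
            "codelist_type": ref.get("Type") if ref is not None else None,
            # ID is only present for published CDISC-NCI codelists
            "codelist_id": ref.get("ID") if ref is not None else None,
        })
    return variables


variables = read_variables(SPEC_XML)
for v in variables:
    print(v["name"], v["define_type"], v["codelist_type"], v["codelist_id"])
```

A template builder or a database setup script could consume the same dictionaries instead of relying on text copied from the PDF.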
All this is machine-readable, but of course we also want to have this information in human-readable form. So I wrote a very simple stylesheet which mimics the PDF pretty well. Applying it to the "specification" part gives the following "human-readable" view:
 


looking extremely similar to what is in the PDF:

For some of the variables, we can also add a "rule", like:
 

which is again machine-readable.
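
To show what "machine-readable" can mean for a rule, here is a minimal Python sketch. The `Rule` element, its `Type` and `DependsOn` attributes, the rule ID and the rule itself are all hypothetical; they only illustrate the kind of rule that could be attached to a variable and are not taken from the IG or from my XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule markup attached to a variable (illustrative names only).
RULE_XML = """
<Variable Name="EXDOSU">
  <Rule ID="EX-DEMO-1" Type="RequiredIfPopulated" DependsOn="EXDOSE">
    <Description>EXDOSU should be populated when EXDOSE is populated</Description>
  </Rule>
</Variable>
"""


def is_populated(value):
    """Treat None and blank strings as 'not populated'."""
    return value is not None and str(value).strip() != ""


def check_record(record, variable_xml):
    """Apply the rules of one variable to one record; return violation messages."""
    var = ET.fromstring(variable_xml)
    target = var.get("Name")
    messages = []
    for rule in var.findall("Rule"):
        if rule.get("Type") != "RequiredIfPopulated":
            continue  # other rule types are not handled in this sketch
        trigger = rule.get("DependsOn")
        if is_populated(record.get(trigger)) and not is_populated(record.get(target)):
            messages.append(f"{rule.get('ID')}: {rule.findtext('Description')}")
    return messages


print(check_record({"EXDOSE": "20", "EXDOSU": ""}, RULE_XML))
print(check_record({"EXDOSE": "20", "EXDOSU": "mg"}, RULE_XML))
```

The point is that the validation engine reads the rule from the same document as the specification, so no separate interpretation step is needed.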


For the "Assumptions", I took a similar approach, e.g.:



translated by the stylesheet into the human-readable view again as:
 

again very similar to what is seen in the PDF (but the stylesheet can be further improved here).
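
The step from the XML to the human-readable view can be sketched in a few lines. I use an XSLT stylesheet for this; the Python function below is only a stand-in for it, and the element names (`Variable`, `Assumption`) as well as the assumption text are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document; names and texts are placeholders.
DOC_XML = """
<Domain Name="EX">
  <Variable Name="EXDOSU" Label="Dose Units" DefineType="text"/>
  <Variable Name="EXCAT" Label="Category of Treatment" DefineType="text"/>
  <Assumption Number="1">Placeholder text for the first assumption.</Assumption>
</Domain>
"""


def render_html(xml_text):
    """Render the specification as a table and the assumptions as a numbered list."""
    root = ET.fromstring(xml_text)
    lines = [
        f"<h2>{escape(root.get('Name'))} specification</h2>",
        "<table>",
        "<tr><th>Variable Name</th><th>Variable Label</th><th>Type</th></tr>",
    ]
    for var in root.findall("Variable"):
        cells = (var.get("Name"), var.get("Label"), var.get("DefineType"))
        lines.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in cells) + "</tr>")
    lines.append("</table>")
    lines.append("<ol>")
    for assumption in root.findall("Assumption"):
        lines.append(f"<li>{escape(assumption.text.strip())}</li>")
    lines.append("</ol>")
    return "\n".join(lines)


html_view = render_html(DOC_XML)
print(html_view)
```

Because the view is generated, the machine-readable and the human-readable version cannot drift apart.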

The "Examples" part is mostly narrative text and some tables, without a fixed structure, so I decided to allow XHTML to be embedded in my document. This is exactly what HL7-FHIR is doing (and what we intend to allow for CDISC ODM 2.0), i.e.:





translated into the "human-readable" view by the stylesheet:
 

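Embedding XHTML keeps the narrative easy to process: a tool only needs to look for elements in the XHTML namespace. A minimal Python sketch, assuming a hypothetical `Example` wrapper element and placeholder content:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical wrapper element with an embedded XHTML fragment (placeholder content).
EXAMPLE_XML = f"""
<Example Number="1">
  <div xmlns="{XHTML_NS}">
    <p>Narrative text for the example goes here.</p>
    <table><tr><th>USUBJID</th><th>EXTRT</th></tr></table>
  </div>
</Example>
"""


def extract_xhtml(example_xml):
    """Return the embedded XHTML <div> as a string, or '' if there is none."""
    # Serialize XHTML elements with a default namespace instead of an ns0: prefix
    ET.register_namespace("", XHTML_NS)
    example = ET.fromstring(example_xml)
    div = example.find(f"{{{XHTML_NS}}}div")
    if div is None:
        return ""
    return ET.tostring(div, encoding="unicode").strip()


fragment = extract_xhtml(EXAMPLE_XML)
print(fragment)
```

The extracted fragment can be copied into the generated HTML page unchanged, which is why the stylesheet needs no special logic for the examples.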
What still needs to be done



I haven't added all the information (for EX and EC) to the XML structure yet, and I still need to complete the examples (there is a good number of them). I also want to further improve the stylesheet.
When everything is at an acceptable level, I will publish the XML, the stylesheet and the resulting HTML for download. So... stay tuned!

It should be clear, however, that it is perfectly possible to have the SDTM-IG (or at least a good part of it) in a machine-readable format, which makes it possible to automate tasks that are cumbersome today because they rely on copy-and-paste from PDF files.