Sunday, March 18, 2018

Is SDTM a labyrinth (standard)?

This weekend (starting on Friday) I began implementing SDTM versions 1.5 and 1.6 in our famous SDTM-ETL software, in order to support the generation of SEND datasets better than ever.

It wasn't easy.

The reason is that there are no machine-readable versions of SDTM 1.5 and 1.6. One can download some Excel files from SHARE, but that doesn't help very much (I do not consider Excel to be "machine-readable", and it is not vendor-neutral either).

For the last 7-8 years, each time a new SDTM (or IG) version has become available for public review, I have requested (using the JIRA tracker) that the standard and implementation guide also be published in a machine-readable form. Each time, the final "CDISC Disposition" is "considered for the future". You can guess that I am sick and tired of always getting this answer.

So I started making my own machine-readable versions in XML. No, I am not going to donate them to the SDTM team this time (I have already donated so much to CDISC), as this is something THEY should have done.
This is what it looks like (for SDTM 1.6):


Although you might not like it, I used the old name "SDS", as I consider the new naming "SDTM" very confusing, not reflecting that the SDTM-IG and SEND-IG are different implementations of the same standard. I never understood why CDISC made this confusing name change.

I also changed the designations "Char" and "Num" for the data types. Instead, I assigned the "modern" data types "text", "integer", "float", "datetime" and "durationDatetime", as also used by define.xml 2.0 (one of the few modern standards we have at CDISC).

Adding all the variables and their descriptions (named "CDISC Notes" in the IG but "Description" in the model; just a matter of being consistent) was the easy part of the work.

Then the labyrinth started ...

SDTM-IG 3.2 and SEND-IG 3.0 (both based on model 1.4) contain a lot of "assumptions" like: "The xxx variables are usually not used for this domain...", which has led to validation software (also used by the FDA) flagging such variables as non-compliant with the standard (so that they need to be "apologized for" in the reviewer's guide). In the SEND-IG 3.1 (based on model version 1.5), however, I find:


Does this mean that one can use these variables again without limitations? Are they no longer discouraged?

In my XML file for SDTM 1.4 I still had entries like:



The number of them is 167! I had to copy-paste these from the SEND-IG PDF file manually, as there is no machine-readable version of the IG, from texts like:


For the SDTM-IG 3.2 (also based on model 1.4), the number of such "discouraged" variables is 425.
So I added 167+425=592 variables manually to my XML file, which is read into the software so that it can warn the user that adding such a variable to a specific domain is at least unusual (and will lead to an error or warning in validation software that incorrectly interprets the IG).

For some domains, the number of such "discouraged" variables is even considerably higher than the number of variables listed for that domain in the IG.
SDTM: a standard of prohibitions and exceptions?


This was not the end of the tragedy. SDTM models 1.5 and 1.6 (PDF) contain statements like:


stating that a specific variable may not be used in SDTM implementations (so: only in SEND). Again, not machine-readable. As a software implementor, one must carefully go through each paragraph of the PDF document and read the description to find out.

The very same "description" for "--EXCLFL" also reveals other interesting things. It also states: "Expected to be Y or null". What does "expected" mean here? Is it non-compliant when the value is "N"? I also "expect" my children not to lie to me ...
And when looking into the IG, one finds for the codelist:

stating that the "NY" codelist must be used. However, when looking into the CDISC-CT for this codelist, one finds 4 entries (!):


This is taken from the published PDF which, like the published Excel file, is not machine-readable.
It is ONLY thanks to the XML team (especially Lex Jansen) that the CDISC controlled terminology is nowadays also published as machine-readable XML.

So, how can my software know that --EXCLFL may not be used in human clinical trials, and that for its SEND implementation in BWEXCLFL the "No Yes Response" codelist must be used, though not all of it, only the "Y" value (or null)?

My software can't.
Unless I hardcode it in the program or first subset the "No Yes Response" codelist into a "Yes Only" codelist, and then state in my XML file that the "Yes Only" codelist must be assigned to the variable "BWEXCLFL" and to "--EXCLFL" variables in general.
All this needs to be done manually!
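
The subsetting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four entries of the NY codelist are typed in by hand here rather than read from the published terminology, and the names NY_CODELIST, subset_codelist, YES_ONLY and is_valid_flag are hypothetical, not part of SDTM-ETL or of any CDISC publication.

```python
# Hand-typed copy of the four "No Yes Response" (NY) codelist entries.
# Assumption: in a real system these would be read from the published XML.
NY_CODELIST = {"N": "No", "Y": "Yes", "NA": "Not Applicable", "U": "Unknown"}


def subset_codelist(codelist, allowed):
    """Return a new codelist containing only the allowed coded values."""
    missing = set(allowed) - set(codelist)
    if missing:
        raise ValueError(f"Values not in parent codelist: {sorted(missing)}")
    return {code: decode for code, decode in codelist.items() if code in allowed}


# Hypothetical "Yes Only" subset for --EXCLFL variables such as BWEXCLFL.
YES_ONLY = subset_codelist(NY_CODELIST, {"Y"})


def is_valid_flag(value, codelist=YES_ONLY):
    """Null (None or empty string) is allowed; anything else must be in the codelist."""
    return value in (None, "") or value in codelist
```

With this subset, is_valid_flag("Y") and is_valid_flag(None) pass, while is_valid_flag("N") fails, which is one possible reading of "Expected to be Y or null".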

And how can my software know that "--EXCLFL" should not be used when the value of "--STAT" is "NOT DONE"? Again, it can't, unless I hardcode it, or have it as a simple, but machine-readable rule in my XML.
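
Such a rule does not need to be complicated. Below is a minimal sketch of how it could be held as data and evaluated generically. The rule structure (keys "target" and "must_be_null_when") and the function name violates are hypothetical, and reading "should not be used" as "must be null" is an assumption made for this illustration.

```python
# Hypothetical machine-readable form of the rule:
# "--EXCLFL should not be used when --STAT is NOT DONE".
RULE = {
    "target": "EXCLFL",
    "must_be_null_when": {"variable": "STAT", "equals": "NOT DONE"},
}


def violates(rule, record, prefix):
    """Check one record (dict of variable name -> value) for one domain prefix.

    Returns True when the condition holds and the target variable is populated.
    """
    condition = rule["must_be_null_when"]
    condition_met = record.get(prefix + condition["variable"]) == condition["equals"]
    target_value = record.get(prefix + rule["target"])
    return condition_met and target_value not in (None, "")


# Example for the BW domain: an exclusion flag on a record that was not done.
bad_record = {"BWSTAT": "NOT DONE", "BWEXCLFL": "Y"}
good_record = {"BWSTAT": "NOT DONE", "BWEXCLFL": ""}
```

Because the rule is plain data, the same function works for any domain by replacing the "--" with the domain prefix; nothing about BW is hardcoded.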

It was promised that SHARE would provide such information. It doesn't yet, however. Why?

The reason is simple: SHARE is populated AFTER the SDTM/SEND models and IGs are published. So the people doing this must go through exactly the same painful process as I need to, mostly using copy-paste. I hope to hear otherwise from them.
The background is that the SDTM team is still generating the standards starting from Excel and Word, and that they put information that can easily be structured (like "not to be used for human clinical trials") as narrative (text) in a "description".
As far as I know, the SDTM team is not using databases (essentially SHARE is or should be a database) for developing the standards. They should.
If they did so, generating a machine-readable form of the standard at the end of the process would be a matter of pushing a button.
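
To illustrate the "push of a button" idea: once the variable definitions are held as structured records, serializing them to XML takes only a few lines. The sketch below uses Python's standard xml.etree.ElementTree module. The element and attribute names (Model, Variable, humanClinicalTrials) and the function to_xml are invented for this example; they are not the structure of my own XML files and not a CDISC format.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured records, as they might come out of a database.
# "humanClinicalTrials" holds as data what the PDF only states as narrative text.
VARIABLES = [
    {"name": "--EXCLFL", "dataType": "text", "humanClinicalTrials": "no"},
    {"name": "--STAT", "dataType": "text", "humanClinicalTrials": "yes"},
]


def to_xml(variables):
    """Serialize a list of variable records to an XML string."""
    root = ET.Element("Model")
    for variable in variables:
        # Each record becomes one element; each field becomes one attribute.
        ET.SubElement(root, "Variable", attrib=variable)
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(VARIABLES)
```

A program reading this file can then ask for the humanClinicalTrials attribute directly, instead of searching a "description" paragraph for a sentence that says the same thing.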

Is this an illusion? Four of my undergraduate students proved otherwise.
They generated an XML version of the SDTM-IG 3.2, very highly structured, with an XML element for each variable (and attributes on it), elements for "assumptions", etc. They did this as part of their "bachelor project", which is a relatively small project in the 5th semester of their eHealth studies in Graz.
They also developed a stylesheet to transform the XML into HTML and thus display the SDTM-IG 3.2 in a browser. The result (display in the browser) looks almost 100% identical to what one sees in the PDF published by CDISC.

We will present this as a poster at the European CDISC Interchange in Berlin in April this year.

If only 4 students can do this as part of a relatively small project, why is the SDTM team (with over 100 members?) not capable of doing so too? "Considered for the future" is just not good enough anymore!

A few other things that caught my eye when doing these implementations:
  • In SDTM 1.6, all mentions of "ISO 8601" have been removed. Why?
    Maybe it was argued that that is an implementation issue (these are mentioned in the SEND DART IG)? But then "Char" and "Num" should also be removed, shouldn't they?
  • In a number of cases, important words have been truncated. For example for the variable RPPLSTDY, the "label" is: "Planned Repro Phase Day of Obs Start".
    For a human, it is obvious that "Obs" here means "Observation" and "Repro" means "Reproduction", but again, how can a machine understand this? Or was it again the "40 characters XPT slavery" that made this happen? How can we ever use modern technologies like "machine learning" to come to "smart SDTM systems" if we accept such limitations?
    Ideally, the model should be independent of the transport format. HL7 FHIR shows us how this can easily be done: they have 3 different transport formats (XML, JSON and RDF) for the same model.
  • For many of the (new) variables, I immediately thought: "hey, this would be a resource (or profile) in FHIR". I have been digging a lot into FHIR in the last 1-2 years (who on the SDTM team has ever had a look? - I have my doubts) and I am becoming more and more convinced that our standards should be "FHIR-like", with a model independent of the transport format (SDTM is enormously bound to XPT), and using modern terminologies like LOINC (for tests), SNOMED-CT, NCBI, UMLS, ... and modern technologies like RESTful web services (SHARE is starting to do so).
  • SDTM and IGs very often contain the word "usually". Whereas this would be acceptable in examples or explanations in separate documents, it should never appear in the standard itself.
For me, it becomes more and more obvious that SDTM with its table structure is a "dead end". The tragedy is that we believed SDTM to be a "database". However, it isn't. It's a "view" on a database, and not even a good one. Due to all the redundancies, derived variables, flags etc. that have been added at the request of FDA reviewers to make it "more reviewer friendly", the real data is hidden, and with each new version it becomes worse. Or as my good friend and colleague DIH recently stated during a telephone conference: "mapping to SDTM is attempting to add information that was previously lost in the process". However, many people (including reviewers) still believe that SDTM "is the data".
At the start, SDTM (still named SDS at that time) was developed to be a "universal" model for both human and animal studies. After many years, this proved to be the wrong approach. For example, we now see that the newest versions 1.5 and 1.6 are essentially only meant for animal studies (a lot of "not to be used with human clinical trials" variables), that we now have "general" variables that are only allowed in a single domain (e.g. EGLEAD), and that there are more exceptions ("this variable is generally not used...") than recommendations (required and expected variables in the IG). I guess that the next SDTM-IG will be based on a new version of the model too (want to bet a beer on it?).

So, for implementors, SDTM and its IGs have become a labyrinth of definitions that are not definitions, vague rules and hundreds of exceptions to them, and all this in a format that is not machine-readable.

Time for the "Blue Ribbon Commission" that was announced by the new CEO David Bobbitt to rethink a lot ...