Wednesday, August 9, 2017

Implementing SDTM 1.5 in software: first impressions

Last Monday, during a short break in my vacation (thunderstorms and mountaineering do not go well together), I started implementing SDTM 1.5 in our popular SDTM-ETL mapping software.
The reason is that some of our customers want to start working with SEND 3.1, which is the first implementation of SDTM 1.5. Note that there is no SDTM-IG based on SDTM 1.5 yet, only a SEND-IG.

What were the difficulties I encountered? What were my first impressions of how easy or difficult it is to implement SDTM 1.5 in software?

First of all, there is no really good machine-readable version of SDTM 1.5 (i.e. a define-XML template). There is an Excel file available from SHARE with a list of the variables, and an Excel list of the differences from version 1.4. From the former, I could generate an XML file with the variables, which I could use for the automated generation of the "CDISC Notes" in the software, and which also helped me somewhat in generating a SEND 3.1 define.xml template for SDTM-ETL.
All the other things had to be done using the "good old" methods, i.e. copy-and-paste from the PDF documents. As I am not paid by the hour, you can guess that I didn't like this too much.
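
To give an idea of the kind of conversion meant here, below is a minimal Python sketch that turns a CSV export of a variable list into a small XML file. It is only an illustration: the column names ("Name", "Label", "Notes") and the XML element names are assumptions made for this example, not the layout of the SHARE Excel file and not define-XML.

```python
import csv
import io
import xml.etree.ElementTree as ET


def variables_csv_to_xml(csv_text: str) -> str:
    """Turn a CSV export of a variable list into a small XML document.

    The column names and element names are hypothetical, chosen for this sketch.
    """
    root = ET.Element("Variables")
    for row in csv.DictReader(io.StringIO(csv_text)):
        var = ET.SubElement(root, "Variable", Name=row["Name"].strip())
        ET.SubElement(var, "Label").text = row["Label"].strip()
        ET.SubElement(var, "Notes").text = row["Notes"].strip()
    return ET.tostring(root, encoding="unicode")


# One-row sample, using the label and note text quoted in this post
sample = (
    "Name,Label,Notes\n"
    "--LOBXFL,Last Observation Before Exposure Flag,Should be Y or null.\n"
)
xml_text = variables_csv_to_xml(sample)
print(xml_text)
```

Once the variable list is in XML, picking up the notes for a given variable in the software is a simple lookup instead of another copy-and-paste round.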

Once the SEND 3.1 define.xml template was generated, I could start on the nitty-gritty details. They require carefully reading the specification or IG, interpreting what is written there, and programming it in the software. Interpreting a specification is always dangerous, as can be seen from the very many false positives generated by the validation software used by the FDA (no, not made by our company).

The first problem I encountered is that the list of SDTM variables that are "never used in SEND" (see SEND-IG 3.1) does not come as machine-readable information. So, copy-and-paste was necessary.
SDTM-IG 3.2 (based on SDTM 1.4) provided a list of "not generally used" variables. As there is no SDTM-IG based on SDTM 1.5 yet, I did not implement this yet and, for the moment, simply copied the 1.4 list.

The new "--LOBXFL" ("Last Observation Before Exposure Flag"), which I already criticized in the past as it is essentially a derived variable (derived variables do not belong in SDTM), is something I already implemented in the software a few months ago, as I realized it has a major impact on the software. The user can now choose between writing a mapping script manually and having the values auto-generated when the SDTM datasets are generated. The latter requires an extra step, as the generated SDTM data need to be ordered by subject and test, and compared with RFXSTDTC, which is in another dataset. It must also be said that the text in the SDTM 1.5 specification lacks detail. It says "Operationally-derived indicator used to identify the last non-missing value prior to RFXSTDTC. Should be Y or null." It doesn't state whether this is "per unique test". I presume it is (opening the discussion again about what a "unique test" is). I strongly believe standards specifications should be exact and precise. The definition of "--LOBXFL" in SDTM 1.5 isn't.
Also note that "--LOBXFL" is not even mentioned in the SEND-IG 3.1.
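
For readers who want to see what such an auto-generation step could look like, here is a minimal Python sketch. It is not the SDTM-ETL implementation, and it rests on several assumptions that the specification does not settle: "per unique test" is taken as per subject and per --TESTCD, "non-missing value" is taken as a non-empty --ORRES, "prior to" is taken as strictly earlier, and all dates are ISO 8601 strings of the same precision so that they can be compared as text. Instead of an explicit sort by subject and test, the sketch keeps the latest candidate per subject and test in a dictionary.

```python
def derive_lobxfl(records, rfxstdtc_by_subject, prefix="LB"):
    """Set <prefix>LOBXFL to "Y" on the last non-missing result before RFXSTDTC.

    Assumption: "last" is evaluated per subject and per test code.
    """
    testcd, dtc, orres, flag = (
        prefix + suffix for suffix in ("TESTCD", "DTC", "ORRES", "LOBXFL")
    )
    last = {}  # (subject, test code) -> latest qualifying record so far
    for rec in records:
        rec[flag] = None
        ref = rfxstdtc_by_subject.get(rec["USUBJID"])
        # Skip records without reference date, collection date or result
        if not ref or not rec.get(dtc) or not rec.get(orres):
            continue
        # Assumption: "prior to" means strictly earlier than RFXSTDTC
        if rec[dtc] >= ref:
            continue
        key = (rec["USUBJID"], rec[testcd])
        if key not in last or rec[dtc] > last[key][dtc]:
            last[key] = rec
    for rec in last.values():
        rec[flag] = "Y"
    return records


# RFXSTDTC comes from another dataset (DM), here reduced to a lookup table
dm = {"S1": "2017-03-10"}
lb = [
    {"USUBJID": "S1", "LBTESTCD": "ALT", "LBDTC": "2017-03-01", "LBORRES": "30"},
    {"USUBJID": "S1", "LBTESTCD": "ALT", "LBDTC": "2017-03-08", "LBORRES": "32"},
    {"USUBJID": "S1", "LBTESTCD": "ALT", "LBDTC": "2017-03-09", "LBORRES": ""},
    {"USUBJID": "S1", "LBTESTCD": "ALT", "LBDTC": "2017-03-12", "LBORRES": "35"},
]
flags = [rec["LBLOBXFL"] for rec in derive_lobxfl(lb, dm)]
print(flags)  # [None, 'Y', None, None]
```

Each of the assumptions listed above is a point where two implementers can reasonably come to different results, which is exactly why the definition should be more precise.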

New in SDTM 1.5 is also the "Domain-Specific Variables for the General Observation Class" (see p. 23 of the specification). Although I understand the reasons for these, SDTM was always sold to us as containing "generic" variables, applicable to all kinds of clinical research data. I never believed in that concept. One of the reasons is surely that SDTM still wants to represent everything as 2-dimensional tables, although we all know that "the world is not flat and neither is clinical data".
As this (fortunately) short list of variables is only in the PDF, it required some extra programming with another copy-and-paste activity.

Unfortunately, the list also contains an error, or at least a serious ambiguity. It lists "EXMETHOD" as such a domain-specific variable, stating "these variables are for use only in a specific domain ...". If we take this literally, this would mean that e.g. EGMETHOD and LBMETHOD are not allowed anymore. Really?
Or was it meant that "--METHOD" may only be used in combination with "EX" in the "Interventions" class? That's my interpretation so far. But specifications shouldn't be open to different interpretations, should they?
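
Expressed as a rule in code, my interpretation could look like the following Python sketch. The table and the function name are hypothetical and only encode the single "EXMETHOD" case discussed here: within the "Interventions" class, "--METHOD" is reserved for "EX", while other classes are not affected. It checks only this one restriction, not whether a variable is valid in general.

```python
# Hypothetical rule table: (observation class, variable suffix) -> only allowed domain
DOMAIN_SPECIFIC = {("Interventions", "METHOD"): "EX"}


def passes_domain_specific_rule(domain: str, domain_class: str, variable: str) -> bool:
    """Return True if 'variable' does not violate a domain-specific restriction.

    Assumption: the restriction applies only within the listed observation class.
    """
    # Strip the two-character domain prefix, e.g. EXMETHOD -> METHOD
    suffix = variable[len(domain):] if variable.startswith(domain) else variable
    owner = DOMAIN_SPECIFIC.get((domain_class, suffix))
    return owner is None or owner == domain


print(passes_domain_specific_rule("EX", "Interventions", "EXMETHOD"))  # True
print(passes_domain_specific_rule("CM", "Interventions", "CMMETHOD"))  # False
print(passes_domain_specific_rule("LB", "Findings", "LBMETHOD"))       # True
```

Under the literal reading, the key of the table would be the suffix alone, and LBMETHOD and EGMETHOD would be rejected as well.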

A lesser problem is that the SDTM 1.5 specification also contains two new domains which i.m.o. should only appear in the SDTM-IG: "Subject Disease Milestones" and "Trial Disease Milestones" (TM). For the former, I couldn't even find the two-character domain abbreviation, so how could I implement this? I didn't care too much for now, as these two domains do not appear in the SEND-IG 3.1, so I need to wait until the new SDTM-IG is published.