Saturday, February 2, 2013

SDTM and non-standard variables

The new draft SDTM-IG 1.4 package 2 contains an interesting statement in the "SDS Proposal for alternate Handling of Supplemental Qualifiers" (file: SDS_Proposal-Alternate_Handling_of_Supplemental_Qualifiers.pdf) document. I cite:

"The Supplemental Qualifiers structure was created to address the need to represent NSVs in the SDTM. It consists of a normalized data structure designed to allow for standard representation of what often is a wide variety of sponsor-specific variables.
The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the NSVs together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers".

Where "NSV" means "Non-standard variables".

Essentially the above statement says that reviewers are not able to combine variables in SUPP-- domains with the parent domain and bring them "back" to the original domain.

I am not surprised.

In conversations with FDA representatives, I heard that SDTM data is viewed by reviewers "as is". Usually there is no attempt to generate a database from the submission datasets. Without that, it is not easy to recombine SUPP-- datasets with the parent domain.
Obviously, there is not much database knowledge at the FDA. I have always regarded their attempt to come to a datawarehouse as "a bridge too far". If they are not able to generate a database, how would they be able to generate a datawarehouse?

The proposed solution that is provided in the above mentioned document is not new: the ODM team already proposed something very similar five years ago, but the teams ideas for an XML-ODM based replacement of SAS Transport 5 were vetoed by the FDA.
I did already write about that in the blog "SDTM Supplemental Qualifiers".

Generating a database from the submission datasets is not made easy by the structure of the SDTM. I already wrote a previous blog on this topic titled "Is SDTM a database and if so - is it a good one".

At the beginning, SDTM was developed to contain "collected data only", and only a minimal amount of "derived data" would be present. The idea was that derived data would go into ADaM.
It however soon was found out that derived data would need to go into the datasets themselves. A simple example is "AGE" in the Demographics (DM) dataset.
AGE is in most cases never collected - it is derived (calculated) from BRTHDTC (birth date) and RFSTDTC (reference start date). Both these are present and required in the DM dataset. So why is AGE than still necessary?
IF SDTM were a database representation, "AGE" would even not be allowed, as it violates the "third normal form", stating that "transitive dependency" is not allowed.
However I suspect that, as it was soon found out that the FDA could not, or did not want to generate a database from each submission, "AGE" was included in the SDTM as a variable of the DM domain.

This turned SDTM into a "View" on a database.
Regenerating a real database from a view on it, is however not an easy task.

In the last few years, we have again and over again seen that new variables are added to the SDTM. In many cases these are derived variables.
So for example, SDTM 1.3 / SDTM-IG 3.1.3 adds following variables to the DM domain:
  • RFXSTDTC: date/time of first study treatment [can however also be obtained from the EX dataset]
  • RFXENDTC: date/time of last study treatment [can however also be obtained from the EX dataset]
  • RFICDTC: date/time of informed consent
  • RFPENDTC: date/time of end of participation [can very probably be obtained from SE (subject elements) and/or from DS (disposition)
  • DTHDTC: date/time of death: this is clearly derived - I cannot image this is a standard field on the Demographics form. I haven't seen it in CDASH either
  • DTHFL: death flag - similar
  • ACTARMCD: actual arm code - this really does not belong here. It should be a variable in the SE (subject elements) domain. But it looks as it was put in DM as either reviewers do not inspect the SE datasets and/or cannot combine it with the DM dataset.
The most obvious example of a derived variable is "--TEST". There is a 1:1 correspondence between "--TESTCD" and "--TEST", clearly violating the 3th normal form of database design. The somewhat crazy thing is that in define.xml this is not taken into account: there are separate codelists for "--TESTCD" and for "--TEST". As "--TEST" is a synonym qualifier for "--TESTCD", the correct construct should be like:
<CodeListItem CodedValue="SYSBP">
    <Decode><TranslatedText>Systolic blood pressure</TranslatedText></Decode>
</CodeListItem>

and "--TEST" should even NOT be in the SDTM dataset.
In a tool that can as well look up the metadata as the data itself, the column "--TEST" would either be generated "on the fly" or the test name (value of --TEST) could e.g. appear as a tooltip when the mouse is positioned over the cell with the test code. With the ancient SASViewer this is of course not possible (as it even does not understand define.xml).

I also have always been puzzled about why we both have --STRESC and --STRESN. For --STRESN the IG says: "copied in numeric format from --STRESC". If the value in --STRESN is a copy of the value in --STRESC, why do we need it then? Maybe to explain the reviewer that the value is a numeric one? But that is already stated in the define.xml (ValueList / ItemDef) isn't it?

So why all these derived variables?

I suspect that the reason is very simple: the FDA is not able to do anything with the submitted datasets than to inspect them with the ancient SASViewer. They are not able to combine datasets with each other (generate views on views), they are not able to generate databases from them.
So each time they need a new "view" they just ask CDISC to define a new (derived) variable that is then added to the standard. The more information the FDA will want to retrieve from the datasets, the more new (derived) variables they will ask for to be included in the SDTM standard.
As can be expected, this will lead to even more "huge" datasets (as the number of variables is continously increasing), and we are then surprised that the FDA is complaining about file sizes!

Is there a way out of this?

I think there is once we have SDTM in XML (ODM-based) format. When both define.xml and SDTM datasets use the same model/format (ODM), we can generate software (e.g. viewers) that allow to "look up" information, or "calculate it on the fly".
For example, for the DM domain, suppose we leave "AGE" out of the dataset. The viewer can then take care that "AGE" is calculated "on the fly" and displayed either as an additional column (although it was not in the dataset), or as a tooltip for one of the cells (e.g. tooltip on USUBJID or on BRTHDTC).
Similarly, for the Findings domains, suppose we leave "--TEST" (test name) out of the dataset and only provide "--TESTCD". As both are connected to each other through the "code / decode" in the metadata, the column "--TEST" can automatically be generated when loading the data from the information in the define.xml. Or the value for --TEST can be supplied as a tooltip on the "--TESTCD" cell in the table.

Once we have such an XML-based format, we can start thinking about deprecating a number of SDTM variables, as it will essentially be the tool (e.g. the viewer) that takes care that they are derived, either from within the same dataset (e.g. AGE) as from other datasets (e.g. RFXSTDTC, RFXENDTC) as from the metadata in define.xml (e.g. --TEST - test name).
This will bring us to a much cleaner SDTM with much less redundant information, and considerably smaller files (I estimate a 50% reduction).

The ultimate solution is however SHARE. Once we can allocate each data point as being a SHARE object, things will become much easier.
There has been a lot of discussion about a format for SHARE. I am however more and more convinced that the format is completly unimportant. I am more and more convinced that SHARE objects can be expressed as RDF, as define.xml and ODM, as Excel (if you like that).

But that's a topic for another blog entry ...


SDTM --ORRESU and controlled terminology

A very interesting discussion (partially by e-mail, partially on the web) has been taken place in the last weeks on the question on whether --ORRESU ("Original Units") must be under control of terminology or not.

The question arose from another discussion on why there is a CDISC codelist for units, although there is already an international standard for that (UCUM), which is well used by the hospital world, and for which the use is mandatory in exchange standards for electronic health records (EHRs) such as HL7-CCD and CDA and ISO-21090.
As I was informed by someone from the FDA, the use of UCUM is also mandatory in SPL (Structured Product Labeling).

The discussion was also triggered by the observation that when one extracts lab data from EHRs and use that in clinical research and it leads to an SDTM submission, the value in LBORRESU in some cases leads to a validation error when using OpenCDISC. A typical example is "mm[Hg]" which is the international symbol (UCUM) for "millimiter mercury" as a unit for the property "pressure". OpenCDISC throws an error as "mm[Hg]" is not in the "UNIT" controlled terminology codelist.
Other people remarked that when using paper case report forms, the way the unit is written on the CRF varies very often (when not preprinted). For example, for a mass concentration, they found it written as "Gr/dl", "Gramm per deciliter", "Gramm/dL", "g/dL" etc.. Only the latter is correct CDISC controlled terminology.
So if I find "Gramm per deciliter" on the CRF, what must go into LBORRESU?
The SDTM-IG for LBORRESU states "original units" for the variable label, but the column "Controlled Terms, CodeList or format" states "(UNIT)", meaning that the value must come from the "UNIT" codelist.
Whether this codelist is extensible (by the sponsor) is not mentioned in the SDTM-IG.
In my opinion, the wording "original units" contradicts with the statement that the "UNIT" codelist must be used.
Only in an ideal world, every researcher would exactly use what is in the CDISC "UNIT" codelist when writing (as narrative text) the unit on the paper CRF.

A response came from a distinguished SDTM pioneer stating that one needs to distinguish between the "actual unit" that came from the CRF, and the way it is represented in SDTM, and which needs to come from the UNIT codelist.
So that would mean that if the researcher had written "mill.Hg" on the CRF, this would never appear in VSORRESU, but needs to be "translated" into "mmHg".
Similarly, when the data comes from an EHR, the UCUM unit "mm[Hg]" will never go into VSORRESU but needs to be translated into "mmHg" again.

If we do so, how much is ORRESU then still "original unit"? Isn't such a translation a "derivation"?
What about "translation errors"? Can we trust that the unit is always translated correctly into one from the UNIT codelist? I (and a lot of other in the discussion) have some serious doubts.

What about tests for which there is a unit available, but for which there is no unit present yet in the UNIT codelist for that property? Do we need to wait submitting to the FDA until the moment that our "new term request" is approved?
And isn't it a bit strange that CDISC just choose to ignore/exclude a standard (UCUM) that is very well used, and choose to develop its own list for units? One statement in the discussion was that UCUM is not well known in smaller labs and by a number of LIMS systems, but is that a reason to develop a whole new list, and then to say that the labs and LIMS should learn that new list, instead of stating that they should learn UCUM? It reminds me about the cartoon that you can find at http://xkcd.com/927/.

One very good suggestion was given by a representative of a biotech company, proposing to have two variables for the "original" units, one named e.g. --RPRESU (new), the "reported result unit" which is "as collected", so not under controlled terminology, and --ORRESU (existing), which is the translation into something that is in the "UNIT" codelist. As such, the "as collected" unit would not be lost, and the reviewer at the FDA can use the (--ORRESU) term that is under CDISC controlled terminology because that is familiar to him/here.

So, when the information comes from an EHR, e.g. "mm[Hg]" (UCUM) would go into VSRPRESU and would then be "translated" into "mmHg" (CDISC-CT) which then goes into VSORRESU.
Which of the two (if not both) would then be made "expected" is also an interesting discussion.

Though this doesn't take away my objections agains "the reinvention of the wheel", I think it could be a very good compromise.

Whether derived variables should go into SDTM at all is another discussion. For example, we must submit "AGE" in the demographics domain although that is usually never collected, but derived from BRTHDTH (birth date) and RFSTDTC (reference start date). Whether such derived variables should be in SDTM is surely related to the question "Is SDTM a database (design), and if so - is it a good one?" which I discussed in a previous blog entry.
My next blog entry will probably go about why SDTM has more and more derived variables.