Sunday, July 13, 2014

Why SDTM should NOT contain --TEST as a variable

All the findings domains in the SDTM have both --TESTCD (test code) and --TEST (test name) variables. There is a pure 1:1 relation between --TESTCD and --TEST: for each unique value of --TESTCD there is a single unique value of --TEST. For example for LBTESTCD=GLUC, only LBTEST=Glucose is allowed.

Here is a view from a sample SDTM submission:


Nice isn't it? But did you notice that there is an error? There is an LBTESTCD="FRUCT" with LBTEST="Glucose". Would a reviewer really notice? Can a machine easily find out that this is an error?

Although there is a 1:1 relation between both, CDISC published codelists for both separately. So there is a codelist for LBTESTCD and another for LBTEST. A bit strange isn't it?
The 1:1 relation between individual terms is then established by the "NCI code". For example, both "GLUC" (in codelist LBTESTCD) as well as "Glucose" (in codelist LBTEST) have the same "NCI code" which is "C105585". So if one wants to validate whether a test code and test name really fit together, then one needs to go over the NCI code, and that is essentially what OpenCDISC is doing.
Of course, this leads to trouble when extending a codelist with own terms for --TESTCD and --TEST. Using the "CDISC way" there is no good way in define.xml to state that an added --TESTCD belongs together with an added --TEST value. An example could e.g. be "FRUCTO" and "Fructose" which are currently not in the CDISC-CT, and which need to be added to two separate codelists.
It was also found by my colleague Dave Iberson-Hurst that this approach (linking over the NCI code) has led to versioning issues, i.e. terms changing names between versions without notice!
Also, we found a few cases where this linking mechanism leads to false/wrong values for --TEST.

Although seeing the test name for each given test code is a nice feature for the reviewer, one should ask oneselve whether --TEST should really be submitted in the datasets themselves, as this not only violates the third normal form for good database design (see my previous posts) but also blows up the sizes of the data sets themselves. I estimate that data sets could be approximately 20% smaller when --TEST would not be submitted.

There are two major types of solutions for resolving these issues.

The first is to recognize that this 1:1 relation exists, and that --TEST is essentially metadata (data about data) and that the codelists for --TESTCD and --TEST are essentially one codelist, meaning that they should be merged. This can be done using the classic CodeListItem mechanism in define.xml with a "CodedValue" and a "Decode". For example:


A viewer can then retrieve the "Decode" value from the define.xml and display it in a column --TEST that is generated by the viewer itself (so --TEST is NOT in the submitted data set). In databases, this corresponds to a JOIN between two tables (one with data and one with metadata).

If a company sticks to published CDISC-CT, then a second solution comes into play: web services, i.e. CDISC is publishing the controlled terminology and makes it available as a web service, e.g. using REST or SOAP (this could be done through SHARE). A viewer tool (or any other software)  then retrieves the value of the submitted --TESTCD (e.g. GLUC) and then looks up the value of the corresponding --TEST (test name) using the web service.
One of our students, Mr. Wolfgang Hof has designed and implemented such a web service on a local server at the university. He also implemented it (client side) in our "Smart Dataset-XML viewer": when the user hovers the mouse over a --TESTCD value, the web service is triggered, a remote database queried, and the details about the test are displayed as a tooltip:



Both approaches essentially correspond to the idea that tools should retrieve metadata for data, and that metadata are kept separated from the data themselves (as is also done in good database design).
An error such as in the above example can then not occur anymore...
So if we following this "good practice" principle, we do not need --TEST anymore.

Let's take the next step and throw --TEST out of the SDTM. It is metadata, not data!





No comments:

Post a Comment