Sunday, November 29, 2015

SDTM - Moving away from files

Reviewers at the FDA always complain about file sizes. Until now, they haven't embraced the new CDISC Dataset-XML standard, mostly because the file sizes are usually (but not always) larger than for SAS-XPT. On the other hand, they do not allow us to submit zipped Dataset-XML files, although these can be read by several tools (like the open-source "Smart Dataset-XML Viewer") without unzipping them first. Even worse, each time a new version of the SDTM-IG comes out, a number of unnecessary derived variables (like EPOCH) are added at the request of the FDA, further increasing the file sizes. So they have only themselves to blame ...
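Just to show how small a barrier that ZIP file really is: a tool can stream-read a Dataset-XML file straight out of the archive, without ever writing an unzipped copy to disk. Here is a minimal sketch in Python (the archive and file names are of course just placeholders):

```python
import zipfile
import xml.etree.ElementTree as ET

ODM_NS = "{http://www.cdisc.org/ns/odm/v1.3}"

# Placeholder archive and member names - any zipped Dataset-XML file works the same way.
with zipfile.ZipFile("lb.xml.zip") as archive:
    with archive.open("lb.xml") as stream:           # read directly from the ZIP, no unzipping to disk
        # iterparse streams the file record by record instead of loading it all at once
        for event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == ODM_NS + "ItemGroupData":
                record = {item.get("ItemOID"): item.get("Value")
                          for item in elem.findall(ODM_NS + "ItemData")}
                # ... hand the record to the viewer / filter logic ...
                elem.clear()                          # keep memory use flat
```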

I first started rethinking this "problem" during the development of the "Smart Dataset-XML Viewer". While testing the tool with large SDTM files (like QS and LB), I wondered how reviewers could ever work efficiently with (tens of) thousands of rows in any kind of viewer. Even though we added a number of smart features (like one-click jumping to the corresponding row in the DM dataset - try that with the SAS Viewer...), the amount of information is overwhelming. So we added filtering features ...

Essentially, files are a very inefficient way to handle large amounts of information. If you want to find a particular piece of information, you first need to read the complete file into your tool...
Large amounts of information should reside in databases (relational, XML, or mixed). Databases can easily be indexed for query speed, and tools then only need to load the minimal amount of information required for the task. However, only a small fraction of FDA reviewers use a database (like the Janus Clinical Trials Registry); all the others use ... files.
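To make that concrete, here is a minimal sketch (in Python, with SQLite standing in for whatever database the review environment really uses, and with an invented table layout) of loading LB records once and then answering a reviewer's question through an indexed query, so that only the handful of matching rows is ever read:

```python
import sqlite3

con = sqlite3.connect("submission.db")   # hypothetical review database

# One-time load (in reality done by the submission gateway, e.g. into the Janus-CTR)
con.execute("""CREATE TABLE IF NOT EXISTS lb (
                   usubjid TEXT, lbtestcd TEXT, lbstresn REAL, visitnum REAL)""")
con.execute("CREATE INDEX IF NOT EXISTS idx_lb ON lb (lbtestcd, usubjid)")

# The reviewer's question: ALT values for one subject - with the index this is
# a lookup of a few rows, not a scan of a multi-gigabyte file.
rows = con.execute(
    "SELECT usubjid, visitnum, lbstresn FROM lb "
    "WHERE lbtestcd = ? AND usubjid = ? ORDER BY visitnum",
    ("ALT", "CDISC01.100008"),           # illustrative subject identifier
).fetchall()
```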

So what are files good for? First of all, you need them to store computer programs. You also need them for unstructured information. And you usually need them to transport information between computers (although (RESTful) web services can do the same). As each SDTM submission (but also each ADaM and SEND submission) must be submitted to the FDA as a set of files (using the eCTD folder structure), the first thing the FDA should do (and I think they do) is to load the submission into databases, which can be the Janus-CTR.
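As an aside on the transport argument: a (RESTful) web service can hand a client exactly the slice of data it asks for, instead of a complete file. The endpoint below is purely hypothetical (no such FDA or CDISC service is implied), but it shows the idea:

```python
import json
import urllib.request

# Purely hypothetical endpoint: a service that returns only the requested
# records of a dataset, rather than the whole file.
url = ("https://review.example.org/studies/CDISC01/datasets/LB/records"
       "?lbtestcd=ALT&usubjid=CDISC01.100008&format=json")

with urllib.request.urlopen(url) as response:
    records = json.load(response)

for rec in records:
    print(rec["LBDTC"], rec["LBSTRESN"], rec["LBSTRESU"])
```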

From that point on, reviewers should be forbidden to use the submission files as files. They should only be allowed to retrieve the information they need from the databases or the CTR. That would also make their work more efficient, so that patients get safer new drugs faster.

This would also once and for all end the discussion about file sizes.

The SDTM is completely based on the concept of tables and files. SAS-XPT is still required for electronic submissions. SDTM contains large amounts of unnecessary and redundant information. An example is the "test name" (--TEST), which has a 1:1 relationship with the "test code" (--TESTCD). Test names can, however, easily be looked up, e.g. using RESTful web services, or by a simple lookup in a database or even in the define.xml (see the sketch below). We urgently need to start trimming the SDTM standard and remove all redundant and unnecessary variables, as these lead to errors in the data. We urgently need to move away from SAS-XPT for transport. And the FDA should forbid its reviewers to use "files", and only allow them to use submission data that is in databases.
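To illustrate how little would be lost by dropping --TEST: the test name can be resolved at review time, for example from a codelist in the define.xml that the sponsor delivers anyway (or from a terminology web service). A minimal sketch, assuming the define.xml codelist for LBTESTCD carries Decode elements (which is not guaranteed for every define.xml):

```python
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"

def testcd_decodes(define_path, codelist_name="LBTESTCD"):
    """Map test codes to test names from a define.xml codelist.

    Assumes the codelist items carry Decode/TranslatedText elements;
    adapt the lookup if the sponsor's define.xml is organized differently."""
    tree = ET.parse(define_path)
    decodes = {}
    for codelist in tree.iter(ODM + "CodeList"):
        if codelist.get("Name") != codelist_name:
            continue
        for item in codelist.findall(ODM + "CodeListItem"):
            text = item.find(ODM + "Decode/" + ODM + "TranslatedText")
            if text is not None:
                decodes[item.get("CodedValue")] = text.text
    return decodes

# e.g. testcd_decodes("define.xml").get("ALT") -> "Alanine Aminotransferase"
```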