Sunday, January 8, 2017

Why rule FDAC036 is hypocritical

We have all encountered the message "Variable length is too long for actual data" when validating our SDTM, SEND or ADaM submissions with the "Pinnacle21 Validator". This message appears when a variable in the generated SAS file has been assigned a length that is (by 1 byte or more) larger than the length of the longest actual value of that variable.
For example, if your longest AETERM value has 123 characters and you assigned a (SAS) length of 124, this error will appear in the validation report - usually causing a lot of panic at the sponsor, as it might lead to a rejection of the submission.
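For illustration, here is a minimal sketch (in Java) of the check the rule effectively performs - the values and the "default" length of 200 are of course just an example: the assigned length of a character variable must not exceed the length of its longest actual value.

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class LengthCheck {

        // Returns the length (in bytes) of the longest value: the maximum length
        // that can be assigned to the variable without triggering rule FDAC036.
        static int requiredLength(List<String> values) {
            int max = 1;  // a character variable needs a length of at least 1
            for (String v : values) {
                if (v != null) {
                    max = Math.max(max, v.getBytes(StandardCharsets.US_ASCII).length);
                }
            }
            return max;
        }

        public static void main(String[] args) {
            List<String> aeterm = List.of("HEADACHE", "NAUSEA", "APPLICATION SITE ERYTHEMA");
            int assigned = 200;                   // e.g. a "default" length of 200
            int needed = requiredLength(aeterm);  // 25 in this toy example
            if (assigned > needed) {
                System.out.println("Rule FDAC036 would fire: assigned " + assigned
                        + " bytes, longest actual value is only " + needed + " bytes");
            }
        }
    }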


In my previous blog entry, I already showed that SAS Transport 5 (SAS-XPT) is a very inefficient transport format. At the time it was developed, it was meant to enable the exchange of data between different SAS systems, one running on an IBM mainframe, the other on a VAX computer. Do you still own or use one of these? I don't. The format was quite OK for that purpose in those days, but it remains unclear why the FDA selected it, especially as I showed that CSV (comma-separated values) is, on average, about 7 times as efficient.
The FDA is mandated by law to be vendor-neutral. Although the specification of the SAS-XPT format is public (the famous TS-140 document), it is rather difficult to implement in non-SAS software, so one cannot really call it vendor-neutral. So why did the FDA select this format over CSV? Or was CSV not around yet, or could it not be read by the programs the FDA was using at the time? If you know the answer, or can point me to any public literature on it, please let me know.


The FDA is always complaining about files that are too large, and that is why they came up with the famous rule FDAC036. But isn't that the result of their own choice for the inefficient SAS-XPT format?


But let us also have a look at the SDTM standard itself. Those who have followed the evolution of the standard over the last 15 years or so know that with each new release, the number of variables has increased, leading to ever larger file sizes (also because in SAS-XPT a NULL value takes the same number of bytes as a non-NULL value). Most of these new variables have been added ... at the request of the FDA. Even worse, most of the variables added at the request of the FDA contain redundant information. A typical example is the --DY (study day) variable appearing in almost every domain. Its value can easily be calculated (also "on the fly") from the --DTC (date/time of collection) and the reference start date/time (RFSTDTC) in the DM (Demographics) dataset.




So why do we need to add --DY to most of the datasets (with the danger that it is incorrect) when it can be calculated "on the fly"? The FDA's answer is "in order to facilitate the review process". Does this mean that the FDA's review tools cannot even do the simplest derivations? It can't be that hard - I added this feature to the open source "Smart Dataset-XML Viewer" in just one evening!
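For those who wonder how hard such an "on the fly" derivation really is, here is a minimal sketch in Java (not the viewer's actual code); it assumes complete ISO 8601 dates and applies the SDTM convention that there is no study day 0:

    import java.time.LocalDate;
    import java.time.temporal.ChronoUnit;

    public class StudyDay {

        // Derives --DY from the date parts of --DTC and RFSTDTC.
        static long studyDay(String dtc, String rfstdtc) {
            LocalDate obs = LocalDate.parse(dtc.substring(0, 10));      // date part of --DTC
            LocalDate ref = LocalDate.parse(rfstdtc.substring(0, 10));  // date part of RFSTDTC
            long diff = ChronoUnit.DAYS.between(ref, obs);
            return diff >= 0 ? diff + 1 : diff;  // no day 0: the reference date itself is day 1
        }

        public static void main(String[] args) {
            System.out.println(studyDay("2013-08-29T09:00", "2013-08-26"));  // prints 4
            System.out.println(studyDay("2013-08-20", "2013-08-26"));        // prints -6
        }
    }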


Another famous example is the "EPOCH" variable (rule FDAC021), which can normally (in a well designed study) be derived "on the fly" from the --DTC and/or the visit number. But it looks as if the FDA prefers to add an extra variable to account for badly designed studies instead of requiring that studies be well designed.
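Again, such a derivation is trivial. As a sketch (the method and the epoch names are just illustrative): assuming the subject's epoch start dates are known (e.g. taken from the SE dataset or from the trial design), the epoch of an observation is simply the last one that started on or before the observation date.

    import java.time.LocalDate;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class EpochDerivation {

        // epochStarts maps each epoch name to its start date, in chronological order.
        static String deriveEpoch(LocalDate obsDate, Map<String, LocalDate> epochStarts) {
            String epoch = null;
            for (Map.Entry<String, LocalDate> e : epochStarts.entrySet()) {
                if (!e.getValue().isAfter(obsDate)) {
                    epoch = e.getKey();  // this epoch started on or before the observation date
                }
            }
            return epoch;  // null if the observation precedes the first epoch
        }

        public static void main(String[] args) {
            Map<String, LocalDate> starts = new LinkedHashMap<>();
            starts.put("SCREENING", LocalDate.parse("2013-08-12"));
            starts.put("TREATMENT", LocalDate.parse("2013-08-26"));
            starts.put("FOLLOW-UP", LocalDate.parse("2014-02-24"));
            System.out.println(deriveEpoch(LocalDate.parse("2013-09-01"), starts));  // TREATMENT
        }
    }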


There are very many variables in SDTM that are unnecessary, and could easily be removed from the standard, as they contain redundant information. Even the --TEST (test name) variable could easily be removed, as it can simply be looked up (again "on the fly") in the define.xml.


In this example, LBTEST has been removed from the dataset, but the tool simply looks it up in the define.xml from the value of LBTESTCD.
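For those wondering what such a lookup involves, here is a sketch in Java (not the viewer's actual code); it assumes that the codelist attached to LBTESTCD in the define.xml provides the test names as CodeListItem decodes, and the codelist OID "CL.LBTESTCD" used below is purely hypothetical:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;

    public class DefineLookup {

        // Looks up the decode (e.g. the test name) for a coded value in a define.xml codelist.
        static String decode(Document define, String codeListOid, String codedValue) {
            NodeList codeLists = define.getElementsByTagNameNS("*", "CodeList");
            for (int i = 0; i < codeLists.getLength(); i++) {
                Element cl = (Element) codeLists.item(i);
                if (!codeListOid.equals(cl.getAttribute("OID"))) continue;
                NodeList items = cl.getElementsByTagNameNS("*", "CodeListItem");
                for (int j = 0; j < items.getLength(); j++) {
                    Element item = (Element) items.item(j);
                    if (codedValue.equals(item.getAttribute("CodedValue"))) {
                        // Decode/TranslatedText holds the human-readable test name
                        NodeList tt = item.getElementsByTagNameNS("*", "TranslatedText");
                        return tt.getLength() > 0 ? tt.item(0).getTextContent().trim() : null;
                    }
                }
            }
            return null;
        }

        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document define = dbf.newDocumentBuilder().parse(new File("define.xml"));
            // e.g. retrieve the test name for LBTESTCD = "ALB"
            System.out.println(decode(define, "CL.LBTESTCD", "ALB"));
        }
    }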


I estimate that about 20% of the SDTM variables are redundant, accounting for about 30% of the file size! So even when using the inefficient SAS-XPT format, file sizes could be reduced by about 30% by removing these redundant variables, with the additional advantage of considerably improved data quality (redundancy is a killer for data quality).


Did you ever count how many times the same value of "STUDYID" appears in your submission SAS-XPT datasets? Well, it is in every record, isn't it? The SAS-XPT format requires you to store the same value millions of times. Is that efficient? The reason is that, essentially, the SDTM tables represent a "view" of an SDTM database rather than being a database themselves. In a real database, STUDYID would be stored once in a table of all studies (e.g. for the submission), and all other tables would reference it using a "foreign key", meaning that the other tables do not contain the STUDYID value itself, but a pointer to the value in the "studies" table. And a pointer uses considerably fewer bytes than a (string) value.
The same applies to USUBJID: subject identifiers are defined once (in DM) and should then be referenced (as a foreign key, i.e. a pointer) from all other tables. Instead, SAS-XPT requires you to "hardcode" each USUBJID value as a string (not as a pointer) in the datasets.
For example, the well-known "LZZT 2013 pilot submission" has 121,749 records in the QS dataset for 306 subjects (an average of 398 records per subject). This QS dataset therefore contains the same STUDYID value (12 bytes) 121,749 times and, on average, hard-codes the same USUBJID value (11 bytes) 398 times per subject, instead of using record pointers to DM. What a waste!


Note that in our "Smart Dataset-XML Viewer" we do use pointers in such cases, in order to save memory, applying the principle of "string interning".
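For readers not familiar with the concept, here is a minimal sketch of such an intern pool in Java: all equal strings are mapped to a single canonical instance, and each record just holds a reference (pointer) to it, so a repeated value like the 12-byte STUDYID is kept in memory only once.

    import java.util.HashMap;
    import java.util.Map;

    public class InternPool {

        private final Map<String, String> pool = new HashMap<>();

        // Returns the canonical instance for this value, creating it on first use.
        String intern(String value) {
            return pool.computeIfAbsent(value, v -> v);
        }

        public static void main(String[] args) {
            InternPool pool = new InternPool();
            String a = pool.intern(new String("CDISCPILOT01"));
            String b = pool.intern(new String("CDISCPILOT01"));
            // Both records now reference the same object: the value is stored only once.
            System.out.println(a == b);  // true
        }
    }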


But what if we could organize our datasets hierarchically? For example, ordered by subject and then by visit, so that in each dataset the value of USUBJID would appear only once? And doesn't the "def:leaf" element in the define.xml already connect the STUDYID with the dataset itself, making it unnecessary inside the dataset? That would be considerably more efficient, wouldn't it?


The former (organizing the data per subject and per visit) is exactly what the ODM standard does! The new Dataset-XML (based on ODM) doesn't do this: the CDISC development team decided to keep the old "2-dimensional" (but inefficient) representation in order to make the transition easier for the FDA. Organizing the SDTM/SEND/ADaM data the way ODM originally does would make the transport (file) even more efficient.
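Just to illustrate what such a hierarchical organization means for a flat SDTM table, here is a small sketch with made-up values: the records are simply grouped per subject and per visit, so that each USUBJID (and each visit) is represented only once.

    import java.util.*;

    public class HierarchicalGrouping {

        record Finding(String usubjid, String visitnum, String testcd, String orres) {}

        public static void main(String[] args) {
            List<Finding> flat = List.of(
                    new Finding("CDISC01.100008", "1", "ALB", "40"),
                    new Finding("CDISC01.100008", "1", "ALT", "22"),
                    new Finding("CDISC01.100008", "2", "ALB", "39"));

            // subject -> visit -> the findings of that visit
            Map<String, Map<String, List<Finding>>> grouped = new LinkedHashMap<>();
            for (Finding f : flat) {
                grouped.computeIfAbsent(f.usubjid(), k -> new LinkedHashMap<>())
                       .computeIfAbsent(f.visitnum(), k -> new ArrayList<>())
                       .add(f);
            }
            System.out.println(grouped.keySet());  // the subject identifier appears exactly once
        }
    }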


But should all that matter? My colleagues in bioinformatics laugh at me when I tell them about the FDAC036 rule. In their field, the amount of information is much, much larger, and they are able to exchange it efficiently, e.g. by using RESTful web services to retrieve exactly what is needed.
As I have stated in the past, large amounts of data belong in databases, not in files. A file should only be a means of transporting data between applications. Essentially, when a submission arrives at the FDA, it should immediately be stored in a database (which could e.g. also be a native XML database), and the reviewers should only be allowed to query such databases - they should not be allowed to mess around with files (XPT or any other). But we are still far from such a "best practice" situation, unfortunately.


Conclusion


Rule FDAC036 forces us to "save on every possible byte" when generating our SAS-XPT datasets, in order to keep their sizes from becoming too large (for the gateway?). However, the SAS-XPT format itself is highly inefficient, and file sizes have grown considerably due to ever new FDA requirements that keep adding redundant (SDTM) variables. We are also forced to keep working with the highly inefficient two-dimensional representation, with lots of unnecessary repetition of the same information.


And I have not even mentioned the FDA's prohibition on submitting compressed (zipped) datasets, which would reduce file sizes by a factor of 20 or more.


It's up to you to decide whether FDA rule 036 is hypocritical or not ...