Sunday, April 21, 2013

Alternatives for SAS Transport 5 - 20 years ago

I have often been criticized for criticizing SAS Transport 5 without proposing a better alternative.
This blog contribution is a response to that.

Many, many years ago (I have not been able to find out exactly how many years ago - if you know, please let me know), the FDA decided on SAS Transport 5 (the ".xpt" format) for the submission of datasets. It is not clear to me how and why they came to this decision. I discussed it with several people who were already around at that time, but their accounts are ambiguous.
It is by now clear, however, that this was not a good choice - but that is an easy statement to make many years later.

First of all, is the choice for transport format important?
Essentially, SDTM is a semantic standard: it is about content.
SAS XPT, XML, CSV and others are transport formats, you can transport any kind of data with them.
So, one could transport SDTM data using any transport format, as the two are essentially independent. Stating e.g. that HL7-v3 is a better format for transporting SDTM data because it allows integration with healthcare is nonsense! It is the same as stating that "using the same truck for transporting cows and for transporting oranges allows us to breed cows that produce orange juice".
However, we cannot expect the FDA to accept SDTM (and SEND and ADaM) data in just any transport format - a choice must be made for one (or maybe a few) formats.

So what could have been the alternatives 20 years ago?
First we must take into account that at that time we did not have define.xml for the metadata (XML was still in its early stages and not yet well known): metadata was usually submitted as a text file or PDF.

Let us look at a few alternatives ...

A. CSV
The comma-separated (or similar delimited) format has the advantage that it is very compact. It is, for example, used with a lot of success in HL7-v2 messages (with a vertical bar as field separator).
I recently generated some CSV files from SDTM XPT datasets and found, for example, that the LB dataset from the FDA pilot "LZZT" study was reduced in file size by a factor of 6. "Great", you would say, when you think about the "file size complaints" of the FDA. However, CSV also has some major disadvantages. First of all: how do we declare the variable names and labels? That can be solved by adding two extra lines at the top of the file containing them. But what about associated codelists for coded variables (not to speak of value lists)? That is much more difficult. Of course one can add extra CSV files for the coded values (enumerations) too, but how does one make the link between the two?
I think it can be done somewhat artificially, but it is not ideal.

B. SQL
SQL is a data definition and query language for relational databases. It is standardized as ANSI SQL, has been in place for a very long time and is very widely used. In a previous blog I wrote that SDTM should not be considered a database, but a "View" on a database. So instead of obtaining the database itself, the FDA has decided that we must submit "Views" on the database. This has many disadvantages, as in a "View" it is much more difficult to detect data inconsistencies, and it is difficult (if not impossible) to reconstruct the original database completely. The choice for the "View" was surely related to the choice for SAS Transport 5.
So why not submit the database instead of the view on the database? It would have many advantages, both for the submitters and for the FDA itself.
How can we exchange a database in a vendor-neutral way (one of the requirements)? The answer is "ANSI SQL". The design and metadata of the database could be exchanged using ANSI-SQL "Create Table" statements, with primary keys, foreign keys and so on, thus also defining the relations between the tables. The actual data could then be exchanged using "SQL Insert" statements.
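As a minimal sketch of what I mean (table names, columns and data types here are purely illustrative, not a concrete proposal for a schema), the "Create Table" part of such a submission could look like:

-- hypothetical sketch: the schema itself carries the metadata,
-- and the foreign key makes the relation between the tables explicit
CREATE TABLE DM (
    STUDYID  VARCHAR(20)  NOT NULL,
    USUBJID  VARCHAR(40)  NOT NULL,
    SEX      CHAR(1),
    ARMCD    VARCHAR(20),
    PRIMARY KEY (STUDYID, USUBJID)
);

CREATE TABLE EX (
    STUDYID  VARCHAR(20)  NOT NULL,
    USUBJID  VARCHAR(40)  NOT NULL,
    EXSEQ    INTEGER      NOT NULL,
    EXTRT    VARCHAR(40),
    EXSTDTC  VARCHAR(25),
    PRIMARY KEY (STUDYID, USUBJID, EXSEQ),
    FOREIGN KEY (STUDYID, USUBJID) REFERENCES DM (STUDYID, USUBJID)
);

-- ... and the "Insert" part carrying the actual (here invented) data:
INSERT INTO DM (STUDYID, USUBJID, SEX, ARMCD)
VALUES ('XYZ', 'XYZ-01-001', 'F', 'PLACEBO');

INSERT INTO EX (STUDYID, USUBJID, EXSEQ, EXTRT, EXSTDTC)
VALUES ('XYZ', 'XYZ-01-001', 1, 'PLACEBO', '2013-04-01');

Such files are plain text, vendor-neutral, and can be loaded by essentially any relational database system.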
What would be the advantages? First of all, we see that each new version of SDTM contains more and more derived variables. For example, the information in RFXSTDTC (Date/Time of First Study Treatment) is already present in the EX (Exposure) dataset. RFXSTDTC was only added to DM because the reviewers are unable to combine both datasets (they just use a viewer that does not "understand" SDTM). Using ANSI SQL we could (and MUST - database "normal forms" require this) get rid of redundant information, which would make our data much more consistent.
Once the FDA had received the "SQL Create Table" and "SQL Insert" files, it could easily import them into a database (the first step towards a data warehouse, by the way), with consistency being checked through the foreign keys. They could then create their OWN views on the data (ANSI-SQL "Create View"), such as views very similar to the current SDTM datasets. A small sketch follows below.
This would mean that much less data would need to be submitted (I guess a 50% reduction or more) and that the FDA could easily create any view on the data they like, something they currently cannot do.
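Continuing the same illustrative sketch (again with invented table and column names): the FDA could then rebuild something very close to today's DM dataset, including the derived RFXSTDTC, as a simple view over the submitted tables, instead of having it submitted redundantly:

-- hypothetical sketch: RFXSTDTC is derived from EX instead of being stored in DM
-- (simplified: the earliest exposure start date per subject, assuming full ISO 8601 dates)
CREATE VIEW DM_VIEW AS
SELECT dm.STUDYID,
       dm.USUBJID,
       dm.SEX,
       dm.ARMCD,
       (SELECT MIN(ex.EXSTDTC)
          FROM EX ex
         WHERE ex.STUDYID = dm.STUDYID
           AND ex.USUBJID = dm.USUBJID) AS RFXSTDTC
  FROM DM dm;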

Note that it is still possible to keep the "vertical" table structure as in SDTM (in the database world, this is called the "entity-attribute-value" (EAV) model). It would also allow us to keep the supplemental qualifiers within their parent dataset (a major problem in the current SDTM implementation).
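A purely hypothetical sketch of what such an EAV-style table could look like (column names and the NSV name 'MYNSV' are made up for illustration only):

-- hypothetical sketch: standard variables and non-standard (supplemental) qualifiers
-- are simply rows in the same table, distinguished by the attribute name
CREATE TABLE LB_EAV (
    STUDYID    VARCHAR(20)  NOT NULL,
    USUBJID    VARCHAR(40)  NOT NULL,
    LBSEQ      INTEGER      NOT NULL,
    ATTR_NAME  VARCHAR(40)  NOT NULL,
    ATTR_VALUE VARCHAR(200),
    PRIMARY KEY (STUDYID, USUBJID, LBSEQ, ATTR_NAME)
);

INSERT INTO LB_EAV VALUES ('XYZ', 'XYZ-01-001', 1, 'LBTESTCD', 'ALT');
INSERT INTO LB_EAV VALUES ('XYZ', 'XYZ-01-001', 1, 'LBORRES',  '34');
-- a non-standard variable (the made-up 'MYNSV') simply stays with its parent record:
INSERT INTO LB_EAV VALUES ('XYZ', 'XYZ-01-001', 1, 'MYNSV', 'some non-standard value');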

C. Any other?
Would you have had another choice for a vendor-neutral transport format for dataset submission 20 years ago? Please let me know or write a comment.


My personal choice:
If I had been the FDA twenty (or so) years ago, I would have chosen ANSI SQL as the vendor-neutral data submission format. But I am not the FDA, and I was not yet working in clinical research at that time. I do not know whether it has ever been considered (if you know, let me know).

So why not ANSI-SQL as the new transport format today?
Data exchange between computers, and especially between databases, is usually not done using SQL nowadays. The reason is that we have a better alternative, namely XML. XML is not limited to representing flat, two-dimensional data, but has many more features. In a world "that is not flat", it can be used to represent multi-dimensional data (though SDTM currently remains "flat" - will it still be after SHARE?), and it can also be used to e.g. visualize data (using stylesheets), something that cannot be done with SQL.

Your comments are welcome, as usual!


Why SUPPQUAL sucks

Following up on my previous post on NSVs (Non-Standard Variables), which are banished to SUPPQUAL datasets in SDTM, I was wondering why it would be so hard (for the FDA) to recombine such NSVs with their parent record (or group of records) in the parent domain.
So I tried it myself.

We are currently working hard on an XML-based format for SDTM (and ADaM and SEND) submissions. It is based on ODM "ClinicalData" and "ReferenceData". You can have a sneak preview here. We (one of my Master's students and I) are also working on a "smart" viewer for such data. I will report on the "Smart SDTM Viewer" later.
In the new XML format, NSVs can remain in their parent record, i.e. there is simply an "ItemData" element for the NSV within the original parent "ItemGroupData", which also contains all the standard variables. In the define.xml, the corresponding "ItemDef" metadata then just carries a flag - we use Role="SUPPLEMENTAL QUALIFIER".
Also the "Smart SDTM Viewer" supports this: such columns are simply colored differently.

As we needed test files for the new format and the viewer, I took the recently published "Updated Pilot Submission Package" SAS Transport files (available from the "members only" part of the CDISC website) for the famous LZZT study. I converted them into our XML format and then tried to recombine the SUPPXX datasets with their parent datasets. To do so, I wrote an XSLT 2.0 stylesheet that recombines two XML files into a single one.

Everything went well for the SUPPDM, SUPPDS and SUPPAE datasets. Trouble started with SUPPLB.
The latter has over 64000 records and has a size of 55MB. The parent LB.xpt dataset has almost 60000 records and has a size of 33MB. This means that there is somewhat more than 1 NSV per LB record.
The mere fact that SUPPLB is larger than its parent dataset already shows that this format is not very efficient.
Now I started the transformation and waited, waited ... waited. I went to bed late in the evening and was surprised the following morning to find that the transformation was still running.
Ultimately, it took over 2 days to recombine SUPPLB with LB.

So I looked into my recombination algorithm and tried to improve it. I soon recognized that there was not much room for improvement.

The problem came up again when working on the "Smart SDTM Viewer". It has a feature whereby, when clicking a record in a SUPPXX "tab", or a record in the "RELREC" or "CO" "tab" of the viewer, one is navigated to the parent record with only a few mouse clicks (try that with the SAS Viewer ...). For FDA reviewers, I presume this will be a "killer feature".
Now I wanted to do the opposite, i.e. mark each record in the parent dataset that has one or more corresponding records in a SUPPQUAL, CO or RELREC dataset (e.g. with a different background color). In order to do so, a lookup must be done for each record in the SUPPXX, RELREC and CO datasets.
What is the problem?

If you look at the structure of SUPPQUAL (CO and RELREC are very similar), you see that there is an IDVAR ("identifying variable") and an IDVARVAL ("identifying variable value"). IDVAR can in principle be any variable from the parent dataset. Usually one chooses --SEQ, but other popular identifying variables are --SPID, --CAT, or even --VISIT or --VISITNUM.
Now, when starting the search from the SUPPQUAL record, it is easy, as you know (thanks to IDVAR and IDVARVAL) what to look for in the parent domain.
When starting from the parent dataset and record, however, you do not know (and neither does the computer) what to look for. You will probably start with USUBJID and select all records in the SUPPQUAL dataset for that USUBJID. Then, for each of these, you read what IDVAR and IDVARVAL are and check whether they correspond with the selected source record in the parent dataset. If there is a match, you create the NSV "ItemData" and add it as the last variable in the parent record ("ItemGroupData"). If you have 60,000 records in the LB dataset and 64,000 in the SUPPLB dataset, all these lookups take a lot of time.
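To illustrate the asymmetry, here is a hypothetical sketch in SQL terms (not how my XSLT stylesheet works, and assuming LB and SUPPLB had been loaded into database tables): because IDVAR carries the name of the matching column as a data value, a static join back from SUPPLB to LB has to enumerate every identifying variable that might occur, and has to compare numeric variables such as LBSEQ as text:

-- hypothetical sketch: attaching SUPPLB qualifiers back to their parent LB records
SELECT lb.*, supp.QNAM, supp.QVAL
  FROM LB lb
  JOIN SUPPLB supp
    ON supp.STUDYID = lb.STUDYID
   AND supp.USUBJID = lb.USUBJID
   AND (    (supp.IDVAR = 'LBSEQ'  AND supp.IDVARVAL = CAST(lb.LBSEQ AS VARCHAR(10)))
         OR (supp.IDVAR = 'LBSPID' AND supp.IDVARVAL = lb.LBSPID)
         -- ... plus one extra OR branch for every other IDVAR that may occur
       );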

Conclusion: the way SUPPQUAL was designed makes it easy to do a lookup from a SUPPQUAL record to its parent record. The other way around, however, is much more difficult, and will always take a lot of computer time if it has to be done for each record in large datasets such as the LB dataset.
So it is no wonder at all that FDA reviewers are unable to view NSVs together with the parent data in their standard review tools.

I do not know what was in the minds of the developers of SUPPQUAL at the time, but I would now consider it a major design error.
There could have been better choices, which I will explain in a future blog.