Thursday, November 8, 2012

SDTM databases and the FDA

Two days ago, my (first-year) bachelor students had their "Introduction to Databases" exam.

When looking for a good question on "CREATE VIEW" (or joins in general), I was thinking again about SDTM and whether it is suitable as a database. After all, the FDA has made several attempts to create an SDTM data warehouse, and as everyone with some database skills knows, you cannot (or can hardly) create a data warehouse without having one or more databases.

So I came up with the following exercise:

"Given the following tables:


write an SQL statement to generate the following result table:"

and then I gave them a picture of the result SDTM table for the Laboratory "LB" domain.

(P.S. a correct answer is:
"CREATE VIEW Laboratory AS
SELECT t.STUDYID, t.DOMAIN, t.USUBJID, t.LBSEQ, t.LBREFID, t.LBTESTCD,
       tc.LBTEST, tc.LBCAT,
       v.VISITNUM, v.VISIT, v.VISITDY
FROM Laboratory_test t, Testcode tc, Visit v
WHERE t.LBTESTCD = tc.LBTESTCD
  AND t.VISITNUM = v.VISITNUM;" )
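For readers who want to try the exercise, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the answer above; the column types, the key constraints and all sample rows are assumptions made up for illustration, since the original tables were only shown as a picture.

```python
import sqlite3

# In-memory database; the schema details and the data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Testcode (
    LBTESTCD TEXT PRIMARY KEY,
    LBTEST   TEXT,
    LBCAT    TEXT
);
CREATE TABLE Visit (
    VISITNUM INTEGER PRIMARY KEY,
    VISIT    TEXT,
    VISITDY  INTEGER
);
CREATE TABLE Laboratory_test (
    STUDYID  TEXT,
    DOMAIN   TEXT,
    USUBJID  TEXT,
    LBSEQ    INTEGER,
    LBREFID  TEXT,
    LBTESTCD TEXT REFERENCES Testcode(LBTESTCD),
    VISITNUM INTEGER REFERENCES Visit(VISITNUM)
);

INSERT INTO Testcode VALUES ('BILI', 'Bilirubin', 'CHEMISTRY');
INSERT INTO Visit VALUES (1, 'SCREENING', -7), (2, 'WEEK 1', 7);
INSERT INTO Laboratory_test VALUES
    ('STUDY1', 'LB', 'STUDY1-001', 1, 'REF-1', 'BILI', 1),
    ('STUDY1', 'LB', 'STUDY1-001', 2, 'REF-2', 'BILI', 2);

-- The view from the exam answer: the flat "LB" table is just a join.
CREATE VIEW Laboratory AS
SELECT t.STUDYID, t.DOMAIN, t.USUBJID, t.LBSEQ, t.LBREFID, t.LBTESTCD,
       tc.LBTEST, tc.LBCAT,
       v.VISITNUM, v.VISIT, v.VISITDY
FROM Laboratory_test t, Testcode tc, Visit v
WHERE t.LBTESTCD = tc.LBTESTCD
  AND t.VISITNUM = v.VISITNUM;
""")

rows = conn.execute("SELECT * FROM Laboratory ORDER BY LBSEQ").fetchall()
for row in rows:
    print(row)
```

Each row of the view repeats the test name, category and visit details that are stored only once in the normalized tables.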

This led me to the following thought: if we submit our SDTM datasets as what is essentially a "view" on an SDTM database (see a previous contribution), how can the FDA reconstruct a database from this?

After all, they want to use this kind of data in a data warehouse, so they need to start from databases. If they want to reconstruct the database from the (SAS transport) tables, they need to split the "view" (e.g. the LB SAS dataset) into 3 or more tables.
Or can one start from the "view" to populate a datawarehouse?

Splitting up a "view" table into 3 (or more) tables so that the first, second and third normal forms are obeyed, and doing this in an automated way, does not look very simple to me.
Just suppose that there is one inconsistency in the "view" (SAS dataset), e.g. that for the same LBTESTCD (e.g. "BILI") there is more than one corresponding value for LBTEST (e.g. once "Bilirubin" and once "Billirubin"). What would happen then?
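One possible outcome can be sketched with Python's built-in sqlite3 module. Everything here is an assumption for illustration: the flat table name LB, the sample rows, and the choice to make LBTESTCD the primary key of the reconstructed Testcode table. Under those assumptions the inconsistency is easy to detect, but the automated split itself fails.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical flattened "view" table with one misspelled test name.
CREATE TABLE LB (USUBJID TEXT, LBSEQ INTEGER, LBTESTCD TEXT, LBTEST TEXT);
INSERT INTO LB VALUES
    ('001', 1, 'BILI', 'Bilirubin'),
    ('001', 2, 'ALT',  'Alanine Aminotransferase'),
    ('002', 1, 'BILI', 'Billirubin'),
    ('002', 2, 'ALT',  'Alanine Aminotransferase');
""")

# Step 1: find test codes that map to more than one test name.
names_per_code = defaultdict(set)
for code, name in conn.execute("SELECT DISTINCT LBTESTCD, LBTEST FROM LB"):
    names_per_code[code].add(name)
conflicts = {
    code: sorted(names)
    for code, names in names_per_code.items()
    if len(names) > 1
}
print("Conflicting codes:", conflicts)

# Step 2: try to rebuild the lookup table with LBTESTCD as primary key.
conn.execute("CREATE TABLE Testcode (LBTESTCD TEXT PRIMARY KEY, LBTEST TEXT)")
try:
    conn.execute(
        "INSERT INTO Testcode SELECT DISTINCT LBTESTCD, LBTEST FROM LB"
    )
    split_ok = True
except sqlite3.IntegrityError as exc:
    # The duplicate 'BILI' key violates the primary-key constraint.
    split_ok = False
    print("Split failed:", exc)
```

Without the primary key the insert would succeed, but the lookup table would then contain two rows for "BILI", and joining back would duplicate lab records. Either way, someone has to decide which spelling is the correct one.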

But we do not know whether the FDA really tries to reconstruct the original database, or whether it just uses the SDTM (SAS) tables "as is".

Any comments (especially from FDA people) are of course very welcome.

Saturday, November 3, 2012

Some comments on the "Pink sheet" article



The Pink Sheet, a well-known online journal on regulatory, legislative, legal and business developments, published an article on October 23 titled "FDA Data-Standards Landslide: CDISC’s Model Wins Docket Comment Contest".

Here are some of the highlights followed by my personal comments (which are not necessarily those of the CDISC XML Technology Team - IMPORTANT):

"... Therefore, the agency wishes to replace it with an open set of standards “that will support interoperability,” and FDA asked for comments on CDISC/ODM versus a competing set of standards from Health Level 7."

The question here is whether the ancient SAS-XPT transport format currently used for electronic SDTM, SEND and ADaM submissions should be replaced by either a CDISC ODM-based format or an HL7-v3-based one (probably CDA-like, or a totally new HL7 XML-based format). Please note that the discussion is about a transport format, i.e. the SDTM standard itself would not necessarily need to change. In other words, the SDTM domains and variables would remain the same; only the way they are formatted and transported (the transport format) would change.

"Pharma and bio industry majors, including Novo Nordisk AS, Novartis AG, Sanofi, Amgen Inc., Merck Sharp & Dohme Ltd., Astellas Pharma US Inc., BioClinica and Biogen Idec Inc., lined up behind CDISC."

That is of course good to hear. In the past, sponsor companies did not care too much about transport formats; they just concentrated on semantic standards like the SDTM standard. But they did realize that the XPT format was a "dead end", with very many disadvantages leading to a lot of problems. They now also realize that CDISC ODM may be an excellent modern replacement, and that HL7-v3 is not a suitable format for submissions to the FDA (see e.g. my article "Ten good reasons why an HL7-XML message is not always the best solution as a format for a CDISC standard (and especially not for submission data)").

"Amgen’s comments were typical of the tenor of the support: “CDISC ODM is already well integrated into clinical data systems and there is a broad knowledge of this standard within the BioPharma industry already. It supports both metadata and data exchange, was developed for the exchange of clinical research data and is 21 CFR Part 11 compliant.”
However, the company added, it will be necessary to anticipate and deal with variations in interpretation by different industry users of the standard and between sponsors and FDA, as well as “handling multiple versions within and across submissions for the same compound.”"
 
The latter is of course correct. We will need to extend the ODM standard or develop an extension for it (as we did for define.xml), and we will need to publish an implementation guide. We will also develop an ISO Schematron to clearly define most of the business rules, such as the rules for the use of --STAT and --REASND. So we will not allow different interpretations and "dialects" of the standard to arise.
 
"Novartis' comments argued that 'given that CDISC ODM is kept closer in synch with the CDISC standards themselves and the industry has been watching and moving towards being able to provide CDISC SDTM [CDISC’s Study Data Tabulation Model]-compliant submissions, it would seem to make sense that using CDISC ODM as an exchange standard makes more sense than any other'. Novartis added that work remains to be done on ODM to bring it into compliance with the widespread pharma industry use of relational technology for data storage and analysis, as well as on ODM’s extensions."
 
Of course, we also realize that work still needs to be done on ODM. Some of us are already taking inventory of what should be done better in a next version of ODM, or what should be added. For example, we need even more support for data points coming from electronic health records (see some of my articles in another blog), even though ODM already has such support through its extension mechanism.
 
"Merck’s backing of ODM was more conditional. The company said it currently uses ODM for third-party data but prefers to keep using SAS version 9 or an ASCII file format “because the ODM is ‘performance heavy,’ meaning that it slows system performance.” Therefore, Merck’s comments suggested FDA use the “Define.XML” subset of ODM, although elsewhere in its comments it cautioned against using XML (extensible markup language) itself as a submission standard, since translating all study data into this format would be time-consuming and overly demanding of FDA system resources."
 
The remark that ODM is "performance heavy" is not entirely true. I admit that this was the case 5-10 years ago, but since then, XML parsing software and libraries have evolved enormously and show much better performance. My own observation is that reading and displaying a very large SDTM dataset in XML (given well-written software) performs as well as doing so with the SASViewer for XPT files.
Furthermore, some new technologies such as VTD-XML make it possible to minimize memory usage when parsing very large XML files, such that files with a disk size of up to about 50% of the memory size can still be parsed and displayed.
As for file size, my own experience is that for the same content, XML files with their tags do not have to be larger than SAS-XPT files, as the latter waste a lot of disk space due to their fixed-length records.
 
Merck's comment on the use of define.xml builds further on this. My answer is that of course the new ODM-based format for electronic submissions will be based on define.xml!
For collected data, the ODM element "MetaDataVersion" contains the metadata and the element "ClinicalData" contains the actual data, and the two have to match each other. In the same way, the new format (although very little will be really new) will have the define.xml for the metadata, and another (maybe new) element will contain the actual submission domain data, which has to match the metadata in the define.xml.
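What "matching" means can be illustrated with a small sketch using Python's standard xml.etree module. The XML below is a heavily simplified, ODM-like fragment invented for this example: it has no namespace, it leaves out required parts such as GlobalVariables and the StudyEventData and FormData levels, and all OIDs and values are hypothetical. It is not the future submission format, which has not been defined yet.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical ODM-like document (not schema-valid ODM).
ODM_SNIPPET = """
<ODM>
  <Study OID="ST.1">
    <MetaDataVersion OID="MDV.1" Name="Example">
      <ItemGroupDef OID="IG.LB" Name="LB" Repeating="Yes">
        <ItemRef ItemOID="IT.LBTESTCD" Mandatory="Yes"/>
        <ItemRef ItemOID="IT.LBTEST" Mandatory="Yes"/>
      </ItemGroupDef>
      <ItemDef OID="IT.LBTESTCD" Name="LBTESTCD" DataType="text"/>
      <ItemDef OID="IT.LBTEST" Name="LBTEST" DataType="text"/>
    </MetaDataVersion>
  </Study>
  <ClinicalData StudyOID="ST.1" MetaDataVersionOID="MDV.1">
    <SubjectData SubjectKey="001">
      <ItemGroupData ItemGroupOID="IG.LB" ItemGroupRepeatKey="1">
        <ItemData ItemOID="IT.LBTESTCD" Value="BILI"/>
        <ItemData ItemOID="IT.LBTEST" Value="Bilirubin"/>
        <ItemData ItemOID="IT.LBORRES" Value="0.8"/>
      </ItemGroupData>
    </SubjectData>
  </ClinicalData>
</ODM>
""".strip()

root = ET.fromstring(ODM_SNIPPET)

# OIDs declared in the metadata part.
defined = {item_def.get("OID") for item_def in root.iter("ItemDef")}

# OIDs actually used in the data part.
used = {item.get("ItemOID") for item in root.iter("ItemData")}

# Data points without a matching definition: data and metadata disagree.
undefined = sorted(used - defined)
print("ItemData without ItemDef:", undefined)
```

Here the data point with OID "IT.LBORRES" has no definition in the metadata, which is exactly the kind of mismatch that a single file carrying both metadata and data makes easy to detect.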
 
"Commenters also pointed out that clinical trials typically last so long that the data standards may change while the study is under way, creating a big problem for sponsors; that data standard disharmony creates problems for “secondary data re-use” in future studies; and that therapeutic area data standards are desperately needed".
 
The ODM team has always taken great care that ODM is "backward compatible", i.e. that a valid ODM file of an earlier version is (except for the namespace usage) also a valid ODM file in the new versions. When we do deprecate an element or attribute, we always continue to support it for two intermediate versions. This is in huge contrast with HL7, where versions 2 and 3 of the standard are totally incompatible (0% reuse possible).
And even when things do change, we help with the transition: for define.xml, for example, we will provide an XSLT stylesheet that transforms define.xml v.1.0 files into define.xml v.2.0 files.
 
"Several commenters noted that there is a need for international harmonization of clinical data standards, a fact highlighted even by groups that oppose CDISC/ODM. The Vereniging EN13606 Consortium, a Dutch group that backs a European communication standard called ISO 13606 EHR, and a group at Oxford University that backs other international standards development initiatives both called for better harmonization."
 
As I showed in a previous contribution, it is already possible to include data points from electronic health records (EHRs) in ODM. In that blog contribution, I demonstrated how that can be done for data points from CDA EHRs, but the same is equally true for data points coming from EN13606-based EHRs (such as OpenEHR).
The commenters are right about international harmonization, but what they point to is essentially the "EHR standards war" between HL7 on one side and OpenEHR/ISO13606 on the other side.
It should not be the task of CDISC to solve this conflict, but our submission standard should be able to support both, which it already does.
 
Your comments are as always highly appreciated.