Saturday, July 21, 2012

Is SDTM a database (design), and if so - is it a good one?

SDTM 1.3 and SDTM-IG 3.1.3 have just been publised. A good time to review and discuss what SDTM iin fact is. SDTM consists of related tables, so is it a database (design)?

At the university I am teaching databases and database design, so I do have some authority to make an analysis from a theoretical point of view.

Let us take a typical table from the SDTM: the Vital Signs table. A snapshot view is shown below.


First of all the keys: the combination of STUDYID, USUBJID, VSSEQ can be defined as the primary key. Although the latter is an artificial key, usually generated in the very last stage of the generation of the table.
USUBJID is certainly a foreign key to USUBJID in the DM table, and VISIT a foreign key to the TV (Trial Visits) table. Or is VISITNUM the foreign key to the TV table?
Here is where the problems start ...

In our first database course (second semestor bachelor) we teach the students about the first, second and third "normal forms" (usually abbreviated as NF). A good database design is in the third NF. A bad database design in the second NF, and a very bad database design in the first NF. A database that doesn't obey the first NF is not worth the designation "database".
The difference between a database design in the third NF and one in the second NF is that the former does not have any "transitive dependencies" anymore. A transitive dependency is an A -> B -> C dependency. Let me explain a bit.
Suppose a table with customer number (primary key), first and family name, postal code (zip code) and city.
When I do know the customer number, I do immediately know his first and family name by a simple lookup (that is what we call a dependency). I do also know his postal code. So we have two dependencies:
CUSTOMERNUMBER -> FIRSTNAME,FAMILYNAME and
CUSTOMER -> POSTALCODE

There is however still another dependency which is not on the primary key, i.e.
POSTALCODE -> CITY
as if I do know the postal (zip) code, I do automatically know the name of the city.
So we have:
CUSTOMER -> POSTALCODE -> CITY
This is what is exactly meant with a "transitive dependency", and the third NF states that these should be avoided.

Why?

Let us take an example again: I have a customer John Smith with customer number 12345 with zip code 10001, living in Manhatten, New York, all in the same table. Everything OK.
I now add a new customer (I do an insert into the database table) with customer number 23456 with name Mary Poppins, and postal code 10001 (again) and city Chicago. Agree?
Most of you will immediately protest, as the postal code 10001 does not belong to Chicago! It belongs to Manhatten, New York!
However, a database management system will approve such an insert as it has no way of knowing that there is a relationship between postal code 10001 and Manhatten. If I do know "10001" I do immediately know that the customer is living in Manhatten (remark here that I assume that this is a database of customers living in the USA only).

So, what does the theory of the third NF learn us to do in such a case? It learns us that we should split the table into two tables: The "customer" table will contain the postal (zip) code along with some other information (but not the city), and the postal code should be a foreign key to another table which contains the postal codes and their corresponding cities. When adding a new customer, it is then not possible anymore to create an invalid postal code / city pair, which increases the probability of keeping a consistent database.

Let us know return to our SDTM VS table. Obviously it is a hypervertical table (more or less according to the so-called Entity-Attribute-Value (or EAV) model). Does it obey to the third NF? Or does it have transitive dependencies? Let us have a look.

We immediately see that there is a 1:1 relationship between VSTESTCD and VSTEST: if I do know the value of VSTESTCD, then the value of VSTEST is fixed. For example, when VSTESTCD has the value "SYSBP", the value of "VSTEST" MUST be "Systolic Blood Pressure". It cannot be e.g. "Hearth Rate".
If I do have such a case (combination of VSTESTCD=SYSBP and VSTEST=Hearth Rate) tools like OpenCDISC will give me an error. If I build up a database with such a discrepancy however, the database management system will not protest at all.

The same is true for VISITNUM and VISIT in the VS table. They are dependent on each other: if VISITNUM is given, VISIT is fixed, and vice versa.
If I generate a database and do an insert into the VS table with a value of "1" for VISITNUM and "Visit 1" for VISIT, and then do (for example for another subject) an insert with "1" and "Visit 7", the database management system will not protest at all as it cannot know that there is this 1:1 relationship.

So: SDTM cannot be considered as a good database design, as it does not obey the third NF!

So what is SDTM? Why did the designers of SDTM allow that VSTESTCD and VSTEST have a dependency but are in the same table, and that VISITNUM and VISIT are columns in the same table although that violates the rule of the third NF?

The reason is simple: because the reviewers wanted this so.
A reviewer wants to see VSTESTCD and VSTEST at the same time because he/she maybe doesn't know that "HR" means "Hearth Rate". You might think this is a stupid argument, but think about the LB domain with its large numbers of test codes: a reviewer cannot know them all.
So, essentially, SDTM is a VIEW on a database, as we would implement it for end users that only have read access. For those not common with the concept of database views, here is a link to the Wikipedia entry.
So now you might ask: why don't we send the FDA the SDTM database (tables) that has a correct design (third NF - meaning minimal risk of inconsistencies) and then the FDA loads it into the data warehouse and then provides views on that to the reviewers. That would be ideal, wouldn't it?
The reason is that the FDA does not have a functioning SDTM data warehouse (there have been several attempts/versions of JASON but none of them were really of any use, and the project has now been outsourced to NCI), and that the reviewers use the SDTM tables "as is", i.e. just using the SAS viewing tools (SASViewer).
As far as I know, the received SDTM tables are never used to populate a reviewers database, and where there is no database, views can of course not be provided.

Conclusions: SDTM is not a database (design), it is a VIEW on a database. However, most people (including the FDA reviewers) use it as a database, and do not try to recover the (or a) database from the view.