Monday, November 24, 2014

Follow up to "FDA publishes Study Data Validation Rules"

My good friend and colleague at CDISC, Sam Hume, picked this up, corrected my code and tested it on real Dataset-XML files. Here is his code:

declare namespace def = "";
declare namespace odm="";
for $s in doc('file:/c:/path-here/define.xml')//odm:ItemDef[@Name='ARMCD'] 
    let $oid := $s/@OID
    for $armvalue in doc('DM.xml')//odm:ItemGroupData//odm:ItemData[@ItemOID=$oid]
        where string-length($armvalue/@Value) > 20
            return <error>Invalid value for ARMCD {$armvalue} - it has more than 20 characters</error>

He used oXygen XML Editor and ran the XQuery on a file rather than on a native XML database (I use eXist).

So I tried another one: rule #175: "Missing value for --STAT, when --REASND is provided" with: "Completion Status (--STAT) should be set to 'NOT DONE', when Reason Not Done (--REASND) is populated". Here is my XQuery (running against the eXist native XML database where I loaded the test files):

(: Rule FDAC175 :)
declare namespace def = "";
declare namespace odm="";
declare namespace data="";
(: get the OID for VSSTAT :)
for $s in doc('/db/fda_submissions/cdisc01/define2-0-0-example-sdtm.xml')//odm:ItemDef[@Name='VSSTAT'][1]
let $vsstatoid := $s/@OID
(: get the OID for VSREASND :)
let $vsreasndoid := $s/../odm:ItemDef[@Name='VSREASND']/@OID
(: select the VSREASND data points  :)
for $record in doc('/db/fda_submissions/cdisc01/vs.xml')//odm:ItemGroupData/odm:ItemData[@ItemOID=$vsreasndoid]
(: get the record number :)
let $recnum := $record/../@data:ItemGroupDataSeq
(: and check whether there is a corresponding VSSTAT :)
let $vsstat := $record/../odm:ItemData[@ItemOID=$vsstatoid]
where empty($vsstat)  (: VSSTAT is missing :)
return <error recordnumber="{$recnum}" rule="FDAC175">Missing value for VSSTAT when VSREASND is provided - VSREASND = {data($record/@Value)}</error>

I added some comments so that the code is self-explanatory.
Essentially, the FDA rule is not one rule but two. So I still need to adapt the code somewhat so that it also checks for the presence of "NOT DONE" for VSSTAT. Here is the corrected part:

where empty($vsstat) or data($vsstat/@Value) != 'NOT DONE'
return <error recordnumber="{$recnum}" rule="FDAC175">Missing or invalid value for VSSTAT when VSREASND is provided - VSREASND = {data($record/@Value)}</error>

The data() function is important to retrieve the value from the attribute instead of getting the attribute as a node.
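
To make the difference concrete, here is a minimal sketch that runs without any external file. The inline record and its OID "IT.VS.VSSTAT" are made up for illustration, and no namespace is used. Writing {$item/@Value} in the element content instead would hand the attribute node itself to the constructor, which a conforming XQuery processor either attaches to the new element as an attribute or rejects with a type error when text precedes it.

```xquery
xquery version "1.0";

(: Hypothetical record: element and attribute names follow the queries above;
   no namespace is declared so that the sketch runs without external files. :)
let $item := <ItemData ItemOID="IT.VS.VSSTAT" Value="DONE"/>
(: data() atomizes the attribute node into its value "DONE" :)
let $value := data($item/@Value)
where $value != 'NOT DONE'
return <error>Invalid value for VSSTAT: {$value}</error>
```

This should return <error>Invalid value for VSSTAT: DONE</error>, with the value as plain text inside the message.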

In the next few weeks, I will publish more about this nice way of defining the FDA rules extremely precisely (no room for different interpretations) and in a machine-executable way.
If we can get this done, everybody will be playing by the same rules ... Isn't that wonderful?

Thursday, November 20, 2014

FDA publishes Study Data Validation Rules

The FDA recently published its "Study Data Validation Rules" for SDTM and SEND.
Unfortunately the rules come as a set of Excel files, so they are not vendor neutral (Excel is a product of the company Microsoft), and the rules themselves are neither machine-readable nor machine-executable.

A snapshot from the Excel file shows how the rules are defined:

Rule #67 says that the value of "ARMCD" in the DM, TA and TV datasets should not exceed 20 characters in length.

This one is clear, but others among the more than 300 rules are harder to interpret. What about:

"<Variable Label> (<Variable Name>) variable values should be populated with terms found in '<Codelist Name>' (<NCI Code>) CDISC controlled terminology codelist. New terms can be added as long as they are not duplicates, synonyms or subsets of existing standard terms."?

Anyway - not machine readable nor machine executable.

Now, many of you will say: "Wait a minute Jozef, we cannot expect the FDA to provide validation source code for different languages like Java, C#, etc.".

This is where XML comes in. Dataset-XML was recently developed to replace SAS-XPT so that we can take advantage of what XML offers us.
Now there is a language for validating information in XML files, named Schematron. Schematron is an open, vendor-neutral standard (ISO/IEC 19757-3), and very easy to implement. Unfortunately, it cannot (yet) - as far as I know - validate files that need information from other files, such as from the define.xml file. If you were to "copy" the define.xml file into each Dataset-XML file for the same submission, we could use Schematron. So as soon as Dataset-XML is accepted by the FDA, we could challenge them to provide us their rules for SDTM and SEND in a Schematron file.

Another possibility is to use XQuery. XQuery is an open W3C standard: a query language for XML documents that is used a lot, e.g. to query native XML databases.

Now consider the rule: "the value of 'ARMCD' in the DM dataset should not exceed 20 characters in length". How would this be written in XQuery?
Here is the rule in machine-executable XQuery:

(: Rule FDAC067 :)
declare namespace def = "";
declare namespace odm="";
declare namespace data="";
(: get the OID for ARMCD :)
for $s in doc('/db/fda_submissions/cdiscpilot01/define_2_0.xml')//odm:ItemDef[@Name='ARMCD'][1]
let $oid := $s/@OID
(: select the ARMCD data points :)
for $armrecord in doc('/db/fda_submissions/cdiscpilot01/DM.xml')//odm:ItemGroupData/odm:ItemData[@ItemOID=$oid]
(: get the record number :)
let $recnum := $armrecord/../@data:ItemGroupDataSeq
(: check the string length of the ARMCD value :)
where string-length($armrecord/@Value) > 20
return <error recordnumber="{$recnum}" rule="Rule FDAC067">Invalid value for ARMCD {data($armrecord/@Value)} - it has more than 20 characters</error>

After the opening comment, the first three lines declare the namespaces used in Dataset-XML and define.xml.
The first "for" line takes the define.xml file and extracts the "ItemDef" node for which the "Name" attribute has the value "ARMCD". This is the SDTM variable we are looking for.
The next line extracts the OID of the "ARMCD" variable, which we need to find its values in the Dataset-XML file.
The second "for" line iterates over all the "ItemData" elements in the DM.xml file that have the OID retrieved in the previous step: so all the "ARMCD" data points. For each of them, the record number is taken from the parent "ItemGroupData" element.
The "where" line then checks whether the length of the ARMCD value is larger than 20 (characters) and, if so, the "return" line produces an error message in XML format.
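
Since rule #67 names the DM, TA and TV datasets, the same check could be wrapped in a function and run over all three. The following is only a sketch under assumptions: the function local:check-length is a hypothetical helper, the two namespace URIs are placeholders that must be replaced by the ones declared in your own files, and TA.xml and TV.xml are assumed to sit next to DM.xml in the same collection.

```xquery
xquery version "1.0";

(: Placeholder namespace URIs (assumption): replace them with the ones
   declared in your define.xml and Dataset-XML files. :)
declare namespace odm = "urn:example:odm";
declare namespace data = "urn:example:dataset-xml";

(: Hypothetical helper: reports every value of variable $name in $dataset
   that is longer than $max characters. :)
declare function local:check-length(
    $define as node(), $dataset as node(),
    $name as xs:string, $max as xs:integer) as element(error)*
{
  (: one ItemDef per dataset may carry the same Name, so loop over all OIDs :)
  for $oid in $define//odm:ItemDef[@Name = $name]/@OID
  for $item in $dataset//odm:ItemGroupData/odm:ItemData[@ItemOID = $oid]
  where string-length($item/@Value) > $max
  return
    <error recordnumber="{$item/../@data:ItemGroupDataSeq}" rule="FDAC067">Invalid value '{data($item/@Value)}' for {$name}: it has more than {$max} characters</error>
};

(: TA.xml and TV.xml are assumed file names in the same collection as DM.xml :)
let $base := '/db/fda_submissions/cdiscpilot01/'
for $file in ('DM.xml', 'TA.xml', 'TV.xml')
return local:check-length(doc(concat($base, 'define_2_0.xml')),
                          doc(concat($base, $file)), 'ARMCD', 20)
```

Passing the variable name and the maximum length as parameters means other length rules could reuse the same function, but whether that fits the remaining FDA rules is something I have not checked.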

Now again, I didn't test this completely yet, but given the resources the FDA has (2014 budget is $4.7 billion), I would expect that it would not be too difficult for the FDA to publish their SDTM and SEND rules as either Schematron or XQuery.

If there are no such plans, maybe they can sponsor a project at our university. It would also make a nice master's thesis...

Saturday, November 1, 2014

No to "Null Flavors"

Last week, I attended (part of) the CDISC webinar about the upcoming "public review" batch of the SDTM-IG (v.3.3 - batch 2). It gave me good and bad news. First the bad news:
- even more new domains and many new variables. I am afraid that the CDISC SDTM trainings will soon need to be extended to 3 days instead of the 2 days right now.

The good news is that the SDTM team now proposes that "non-standard" variables (that until now had to be "banished" to SUPPXX data sets) may be kept in the parent domain (where they belong) and be marked in the define.xml by Role="Non-Standard Identifier" or Role="Non-Standard Qualifier" or Role="Non-Standard Timing".
This is something many of us have been asking for for years, essentially since define.xml 1.0 was published. You can read more about this in my prior blog entries "Why SUPPQUAL sucks" and "SDTM and non-standard variables".

Very recently, there was also a webinar given by Diane Wold about the use of "Null Flavors" in CDISC. Now, Diane is one of the people at CDISC whom I highly appreciate, but in my personal opinion, she is completely wrong in this case: "Null Flavors" are evil.

Let me explain. "Null Flavors" were developed by HL7 in HL7-v3 as a mechanism for cases where a value is not known, or cannot be represented by the HL7-v3 framework.
"Null flavors" are highly contested, even within HL7, e.g. see the blog "Smells like I dunno" of Keith Boone, one of the few "HL7-v3 gurus" and author of the best book about HL7-v3 and CDA.
One of the things I have against the "null flavors" is that they force people to categorize the reason why a data point is missing (or not representable in the HL7 framework). This categorization is extremely arbitrary, so it is of essentially no help when comparing data points. I.m.o. people should just write the reason as an extra data point (like --REASND in SDTM) as free text.
Another reason is that they encompass values that DEFINITELY are not null. Examples are "TRC" ("trace" - which is definitely not null), "QS" ("Quantity Sufficient") meaning "a bulk/large amount of material sufficient to fill up until a certain level" (can a large amount be "null"?), "PINF" ("positive infinite") and "NINF" ("negative infinite"), two amounts that every final-year primary school student knows are not null. Even worse, CDISC is abusing "PINF" in the trial design datasets to state "there is no upper limit" (in the number of participants). A very strange way to define this: first state that the maximum number of participants is NULL, and then add a "flavor" saying that it is unlimited. My school math teacher is probably turning in his grave now ...

In Austria, our national Electronic Health Record system is based on HL7-v3 and CDA. But we allow ONLY two "null flavors", which are really about nulls: one expresses that a patient has no Austrian social security number (e.g. tourists), the other expresses that the patient does have an Austrian social security number, but we do not know it, e.g. because he/she forgot to bring the SSN card.
All other 13 "null flavors" are forbidden in the Austrian EHR.

My opinion is clear: we should not copy the errors the HL7 organization made.