Sunday, March 5, 2017

Validating SDTM labels using RESTful web services

About a month ago, I reported about my first experiences with implementing the new CDISC SDTM-IG Conformance Rules. I now made considerable progress, having >60% of the rules implemented. These implementations are available for download and usage from here.

Today I want to elaborate a bit on how I implemented rule CG0303 "Variable Label = IG Label", using RESTful web services. Earlier implementations from others were based on copying/pasting the labels from the SDTM-IG and then hard-coding them in software. This does not only mean a lot of work, it is also error-prone, with the disadvantages that a software update is needed each time an error in the implementation is found. For example, if you search on the forum of the validation Software that the FDA is using for the wording "label mismatch" you will find many hits, especially about false positive errors. In some cases, one even gets an error on a label that looks 100% correct, but the software does not tell you what text for the label it expects. "Let the guessing begin"!
So we definitely need something better. Wouldn't it be better to use the SHARE content, load it into a central database, and query that database using a modern (easy-to-implement) RESTful web service?

That is exactly what we did. All SDTM-IG Information (from different IG versions) and all CDISC controlled terminology that is electronically available was loaded into a database, and RESTful web services were developed to make them available to anyone, and to any application. These RESTful web services (over 30 of them) are described here. Adding a new Service usually takes 1-2 hours, sometimes even less.

One of these services allows to retrieve all necessary information for a given variable in a given domain for a given SDTM-IG version. The RESTful query string description is:

http://www.xml4pharmaserver.com:8080/CDISCCTService/rest/SDTMVariableInfoForDomainAndVersion/{sdtmigversion}/{domain}/{varname}

which is pretty self-explaining. For example, to get all the Information about the variable ECPORTOT in the domain EC for SDTM-IG 3.2, the query string is:

http://www.xml4pharmaserver.com:8080/CDISCCTService/rest/SDTMVariableInfoForDomainAndVersion/3.2/EC/ECPORTOT

This service can now easily be used to validate labels in submissions, like in implementations of rule CG0303. Let's do so for a sample SDTM submission.
In our case, the SDTM submission resides in a native XML database (something the FDA SHOULD also do instead of messing around with SAS-XPT datasets). Here is the implementation of rule CG0303 in XQuery, an easy-to-learn language that is as well human-readable as machine-executable (so the rules are 100% transparent):


In the first part, the XML namespaces are declared and the location of the define.xml for this submission is set (usually this will be done by passing these as parameters from within the calling application). Also the base of the RESTful web Service is declared.

Here is the second part:



For each dataset in the submission (by iterating over all the dataset definitions "ItemGroupDef"), we get the domain name either from the dataset name or from the "Domain" Attribute in the define.xml (goes into $domain), and then start iterating over all the variables declared for the current dataset:



The variable name is obtained, and the label taken from the define.xml (remark that when using SDTM in XML, the label is in the define.xml and NOT in the dataset itself - which follows the good practice of separating data from metadata). The web service is then triggered returning the expected label from the database (can be SHARE in future), and the actual and expected label are compared. Remark that for some variables, there will not be a label from the SDTM-IG, as the variable is just not mentioned in the SDTM-IG, although it is allowed for that domain. In that case, there is nothing to compare.

If both Labels do not correspond, an error (in XML) is returned. An example is:


showing as well the actual as the expected label.

As the validation errors ("deviation" or "discrepancy" would in fact be a better word) come in XML, they can (unlike Excel or CSV) be used in many ways, and even ... stored in a native XML database ;-).