Working on and with CDISC Standards: February 2018

In a previous blog entry, I discussed the relationship between Biomedical Concepts (BCs) and LOINC coding, and whether these fit in CDISC SDTM in the case of laboratory tests and vital signs. I argued that the "6 dimensions of LOINC" could very well represent the ingredients of BCs, at least in the case of laboratory tests and vital signs.

But how well does this work for other domains, such as microbiology?

When looking at the microbiology domains in CDISC SDTM, namely MB ("Microbiology Specimen") and MS ("Microbiology Susceptibility"), one immediately notices that these are closely related: the records in the MS dataset provide more detailed information about records in the MB dataset. This can be seen from the fact that MSGRPID is "required", with the description "In MS, used to link to organism in MB". So there cannot be a record in MS for which there is no record in MB. --GRPID (group ID) is used as a "foreign key" in MS to a record in MB (although the SDTM-IG does not use the term "foreign key"). This is atypical for SDTM, as relations between records in different tables usually need to be put in the so-called RELREC ("Related Records") table. As such, the "--GRPID" construct in MB-MS can be seen as a "workaround" or "shortcut" to avoid having to use RELREC (or: "how to violate your own principles").
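The "foreign key" behaviour of --GRPID can be sketched in a few lines of Python. This is only an illustration: the records are invented, the helper name `find_orphan_ms_records` is hypothetical, and the sketch assumes that an MS record is matched to its MB record on the combination of USUBJID and the group ID (MSGRPID against MBGRPID).

```python
# Invented, heavily simplified MB and MS records (illustration only).
mb_records = [
    {"USUBJID": "001", "MBGRPID": "1", "MBSPID": "ORG01",
     "MBORRES": "STREPTOCOCCUS PNEUMONIAE"},
    {"USUBJID": "001", "MBGRPID": "2", "MBSPID": "ORG02",
     "MBORRES": "KLEBSIELLA PNEUMONIAE"},
]
ms_records = [
    {"USUBJID": "001", "MSGRPID": "1", "MSTESTCD": "PENICILIN", "MSORRES": "0.25"},
    {"USUBJID": "001", "MSGRPID": "3", "MSTESTCD": "PENICILIN", "MSORRES": "1.0"},
]


def find_orphan_ms_records(mb, ms):
    """Return the MS records whose (USUBJID, MSGRPID) has no match in MB.

    Assumption: the link is made on subject plus group ID.
    """
    mb_keys = {(r["USUBJID"], r["MBGRPID"]) for r in mb}
    return [r for r in ms if (r["USUBJID"], r["MSGRPID"]) not in mb_keys]


orphans = find_orphan_ms_records(mb_records, ms_records)
for record in orphans:
    print("MS record without MB parent:", record["USUBJID"], record["MSGRPID"])
```

In a relational database such a check would be enforced by the database itself; with flat tables it has to be repeated by every tool that reads the data.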

It also demonstrates the weaknesses of SDTM as a model: it relies completely on the concept of "tables", with or without references to information in other tables, and it implements such references inconsistently, using RELREC, a group ID or something else. If SDTM were not tied to the concept of "tables", the information of both MB and MS could be delivered as a single dataset, using e.g. XML, JSON or RDF (linked data) as a format. This would make the work of the reviewers considerably easier, as they currently need to switch back and forth between two tables.
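As a rough sketch of what such a single, nested dataset could look like, the following Python snippet attaches each susceptibility record to the organism it belongs to and serializes the result as JSON. The records, the output property names ("subject", "organism", "susceptibility") and the function name `nest_susceptibility` are all invented for illustration; this is not a format proposed by CDISC.

```python
import json

# Invented, heavily simplified MB and MS records (illustration only).
mb_records = [
    {"USUBJID": "001", "MBGRPID": "1", "MBORRES": "STREPTOCOCCUS PNEUMONIAE"},
    {"USUBJID": "001", "MBGRPID": "2", "MBORRES": "KLEBSIELLA PNEUMONIAE"},
]
ms_records = [
    {"USUBJID": "001", "MSGRPID": "1", "MSTESTCD": "PENICILIN", "MSORRES": "0.25"},
]


def nest_susceptibility(mb, ms):
    """Attach each MS record to its MB parent, giving one nested structure."""
    ms_by_key = {}
    for r in ms:
        key = (r["USUBJID"], r["MSGRPID"])
        ms_by_key.setdefault(key, []).append(
            {"test": r["MSTESTCD"], "result": r["MSORRES"]}
        )
    return [
        {
            "subject": r["USUBJID"],
            "organism": r["MBORRES"],
            # Organisms without susceptibility results get an empty list.
            "susceptibility": ms_by_key.get((r["USUBJID"], r["MBGRPID"]), []),
        }
        for r in mb
    ]


combined = nest_susceptibility(mb_records, ms_records)
print(json.dumps(combined, indent=2))
```

With such a structure, a reviewer sees the organism and its susceptibility results together, without having to follow a group ID from one table to the other.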

It is also interesting that for each organism found (test of presence, reported in MBORRES) there must be a unique identifier in MBSPID ("sponsor ID"). This may sound a bit surprising, but may be because MBORRES is essentially free text, so that the same microorganism may be written in different ways between visits and especially between sites. This also means that assigning the organism identifier to MBSPID can be an enormous task, requiring considerable microbiology knowledge.

When looking into these two microbiology domains in the SDTM-IG, it is a bit surprising that it does not speak about coding of tests or organisms at all. There is a column "MBLOINC", but it is "permissible" and is defined as "Dictionary-derived LOINC Code for MBTEST", which is of course nonsense: LOINC codes should not be derived, they should be the source. Also, neither SNOMED-CT, nor NCBI, nor ATCC coding systems are mentioned. As the sponsor-assigned MBSPID / MSSPID values are used as the unique identifiers of the organism, how can the regulatory authorities then compare microbiology data between studies? Just by visual inspection?

Let us take an example from the SDTM-IG and try to code one or more records using one or more of the above-mentioned coding systems. Note that this should essentially not be done by the sponsor: the codes should already be present when the tests are done (i.e. in the lab). In the US this should not be a problem, as the Meaningful Use program requires the use of LOINC codes and SNOMED-CT for microbiology: in general, LOINC coding is used to report what test was done (e.g. a viral culture, or a drug susceptibility), whereas SNOMED-CT is used to provide organism identification and in some cases specimen identification (e.g. sputum, urine).

Note that CDISC developed its own codelist for specimen identification (used by MBSPEC and MSSPEC), and that essentially the use of SNOMED coding is not allowed for these. As far as I know, there is also no mapping table available for mapping between SNOMED-CT codes for "specimen" and CDISC codes.

In the SDTM-IG 3.2, the MB examples show the following:

Comparison of rows 1-2 with the other rows demonstrates that MB is used both for detection of organisms (rows 3-6) and for measurement (counting) of specific organisms or types of organisms (rows 1-2). The first two rows measure gram-negative bacteria, with more specific counting for cocci (round-shaped) and for rod-shaped bacteria. The test codes are however not covered by the CDISC controlled terminology (i.e. they are sponsor-defined) and are thus not interoperable. When looking into SNOMED-CT however, one easily finds the codes 18383003 (gram-negative cocci) and 87172008 (gram-negative rods).

Why the hell was SNOMED-CT coding not used here in MB? Another CDISC "not invented here"? Also the result designation (MBORRES, MBSTRESC) is not standardized at all and thus useless in comparisons between studies - at least MBSTRESC should be standardized.

For rows 3-6, the LOINC code to be used is probably 623-9: "Bacteria identified in Sputum by Cystic fibrosis respiratory culture". Note that LOINC describes the test to be performed, without saying anything about the outcome. The outcome ("streptococcus", "klebsiella") then goes into MBORRES. In SDTM, the codes for the bacteria that were detected go into MBSPID, in this case assigned by the sponsor as "ORG01" and "ORG02", which is of course not interoperable. Another sponsor, or even the same sponsor in another study, may have assigned completely different codes.

We can however easily find codes for "Streptococcus pneumoniae" and for "Klebsiella pneumoniae" in SNOMED-CT. For "Streptococcus pneumoniae" the code is 9861002 "Streptococcus pneumoniae (organism)" and for "Klebsiella pneumoniae" the code is 56415008 "Klebsiella pneumoniae (organism)".

So, if we were allowed to code the value of at least MBSTRESC (here using SNOMED-CT), the regulatory reviewers would be able to compare between studies, which is currently not possible. As said, the SDTM-IG does not say anything about coding of the "standardized result" in MBSTRESC either.
If coding of the test were allowed, the LOINC code alone would be sufficient, and MBTESTCD, MBTEST, MBSPEC and probably also MBMETHOD would be superfluous.
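To illustrate why a coded result makes cross-study comparison possible, here is a minimal Python sketch. It uses the two SNOMED-CT organism codes given above; the study records, the sponsor identifiers and the function name `coded_organisms` are invented, and a real implementation would of course receive the codes from the lab instead of looking them up from free text.

```python
# The two SNOMED-CT organism codes given in the text above.
# The lookup table itself is only an illustration.
SNOMED_ORGANISM = {
    "STREPTOCOCCUS PNEUMONIAE": "9861002",
    "KLEBSIELLA PNEUMONIAE": "56415008",
}

# Invented records from two studies with unrelated sponsor IDs and spellings.
study_a = [
    {"MBSPID": "ORG01", "MBSTRESC": "Streptococcus pneumoniae"},
    {"MBSPID": "ORG02", "MBSTRESC": "Klebsiella pneumoniae"},
]
study_b = [
    {"MBSPID": "B-17", "MBSTRESC": "KLEBSIELLA PNEUMONIAE "},
]


def coded_organisms(records):
    """Return the set of SNOMED-CT codes found for the organisms in `records`."""
    codes = set()
    for r in records:
        code = SNOMED_ORGANISM.get(r["MBSTRESC"].strip().upper())
        if code is not None:
            codes.add(code)
    return codes


# The sponsor IDs (ORG02 versus B-17) have nothing in common,
# but the coded organisms can be compared directly.
shared = coded_organisms(study_a) & coded_organisms(study_b)
print(shared)  # {'56415008'}
```

The comparison works on the codes only; the sponsor-assigned MBSPID values play no role in it.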

In the MS dataset (microbiology susceptibility), the details and further results for each of the tests in MB (where applicable) are provided.

The test code and test name (MSTESTCD, MSTEST) and also the test category (MSCAT) are under "sponsor defined controlled terminology", i.e. the sponsor can assign codes as he/she wishes. More recently, CDISC has published controlled terminology for these, but this is not visible in the SDTM-IG.
The CDISC lists for MSTESTCD/MSTEST contain a few codes and names for traditional susceptibility tests, but there are no codes for "EXTGROW" or "PENICILIN". So, in practice, MSTESTCD/MSTEST will contain a mixture of test codes and names that semantically have nothing to do with each other (i.e. with no coherence), a phenomenon we regularly observe in CDISC codes.

Some of the MS tests here may be covered by LOINC, others not. For example, "penicillin susceptibility" is covered by the LOINC codes 7041-7 and 7042-5, depending on whether "Penicillin G" (sodium or potassium salt) or "Penicillin V" is meant. A quick look into the LOINC database shows that there are about 1700 such "susceptibility" tests described. Interesting for our example is that LOINC dropped the designation "by E-Test", as it is a proprietary name. It was replaced by "by gradient strip". For the susceptibility test for the sponsor drug, one of course cannot expect that there is a LOINC code. In such a case, the standardized LOINC value "susc" can however be used for the "property". Note that SNOMED-CT also has a number of codes for drug susceptibility tests (as "procedures").

All this demonstrates that the MB and MS domains have not been well designed. The primary cause is probably the table structure of SDTM, which contains database views instead of representing a true relational database. Additional causes are the refusal to move to a multi-hierarchical model for SDTM, the intertwining of SDTM with the SAS-XPT format, and the refusal to use (or even allow the use of) coding systems from healthcare (especially LOINC and SNOMED-CT), i.e. the "not-invented-here syndrome".

It is time that CDISC starts developing Biomedical Concepts (BCs) for microbiology, instead of further developing "not-invented-here" controlled terminology for microbiology tests. Furthermore, CDISC should start encouraging the use of LOINC and SNOMED-CT for tests and results in microbiology. It should recognize that LOINC is not only about classic laboratory tests (where the FDA overruled CDISC), but is also very useful in other domains.

Saturday, February 10, 2018

CDISC Biomedical concepts, LOINC and other standards: Microbiology