Friday, April 11, 2014

Using UCUM units for CDISC-SEND

At the European CDISC Interchange we once again discussed replacing the CDISC controlled terminology for units with UCUM, the latter being the worldwide standard for units, used everywhere in healthcare and in electronic health records (mandatory in HL7-CDA).
Once again, I got the objection that UCUM is not usable in SEND (Standard for Exchange of Nonclinical Data), which is especially about preclinical research using animals, bacteria, etc.
The statement that UCUM is not usable for SEND is simply not true. Most people who raise this argument have not read the UCUM specification, so I will explain here how UCUM can (and should) be used for SEND.
Consider the following "unit" used in SEND:  g/animal/day
It contains three components, one of which is not a unit at all: "animal". Unfortunately, as everywhere in the CDISC-CT for units, objects or properties (like "animal") have been mixed up with real units. Sometimes, however, this is simply necessary to compensate for basic errors made in the SDTM and SEND models.
Now, as this also happens in other industries and sciences, UCUM has developed a mechanism for dealing with this. It is called "annotations". So the UCUM notation for "g/animal/day" is:
g/{animal}/d
with the "topic" of the unit (what it is about) in curly brackets: the "annotation".
Also note that "day" is written differently ("d") than in the CDISC notation.

One of the advantages of using UCUM is that it comes with a machine-readable XML file (ucum-essence.xml) that describes all the units and their relations. So UCUM "knows" that a day is 24 hours, that an hour is 60 minutes, and so on, all in a machine-readable form. So if you want to calculate how many "milligram per animal per minute" a value of N g/{animal}/d corresponds to, this can be fully automated (which is impossible when using CDISC-CT).
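As a minimal sketch of what such an automated conversion could look like: the factor table below is hand-written for illustration (a real tool would read it from ucum-essence.xml), and the function names `parse` and `convert` are my own, not part of any UCUM library. It only handles simple quotients like the ones in this post.

```python
from fractions import Fraction
import re

# Hand-written factors to one base unit per dimension.
# Assumption: a real tool would load these from ucum-essence.xml.
TO_BASE = {
    "g": ("mass", Fraction(1)),
    "mg": ("mass", Fraction(1, 1000)),
    "kg": ("mass", Fraction(1000)),
    "s": ("time", Fraction(1)),
    "min": ("time", Fraction(60)),
    "h": ("time", Fraction(3600)),
    "d": ("time", Fraction(86400)),
}

ANNOTATION = re.compile(r"^\{[^{}]*\}$")


def parse(unit):
    """Split a simple quotient 'a/b/c' into (factor, dimensions, annotations)."""
    factor = Fraction(1)
    dims, notes = [], []
    for position, term in enumerate(unit.split("/")):
        in_denominator = position > 0
        if ANNOTATION.match(term):
            # Annotations carry no numeric value; remember them for comparison.
            notes.append((in_denominator, term))
            continue
        dimension, to_base = TO_BASE[term]
        factor = factor / to_base if in_denominator else factor * to_base
        dims.append((in_denominator, dimension))
    return factor, sorted(dims), sorted(notes)


def convert(value, source, target):
    """Convert value from the source unit to the target unit."""
    s_factor, s_dims, s_notes = parse(source)
    t_factor, t_dims, t_notes = parse(target)
    if s_dims != t_dims or s_notes != t_notes:
        raise ValueError(f"cannot convert {source} to {target}")
    return value * s_factor / t_factor


print(float(convert(1, "g/{animal}/d", "mg/{animal}/min")))
```

So 1 g/{animal}/d is 25/36, or roughly 0.694, mg/{animal}/min. Requiring the annotations to match is a choice made in this sketch; UCUM itself treats an annotation as having no effect on the value of the unit.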
The next objection of the SEND people followed immediately: "yes, but you can write anything between the curly brackets (which define that part as an 'annotation'), so we no longer have any control over what people will submit".
Well, the answer to that is pretty easy: CDISC should not control "units" like "g/animal/day", it should control the annotations. So instead of publishing ever-growing lists containing things like "g/animal/day", they should publish lists of allowed "annotations". I picked out a few examples from the current SEND-CT to show how this could be done:
Note that "bar" has a totally different meaning in CDISC-CT than in the rest of the world: a "bar" in CDISC-CT is a unit of packaging, like "a bar of chocolate", whereas for the rest of the world it is a unit of pressure. By making clear that it is a UCUM annotation (i.e. putting it in curly brackets), it is immediately clear that it is not a unit of pressure but something else, and it can even be used in parallel with the real "bar" unit.
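Controlling the annotations rather than the units is easy to automate. The following is only a sketch: the allowed list is hypothetical (CDISC does not publish such a list today), and the function names are my own.

```python
import re

# Hypothetical list of allowed annotations, for illustration only.
ALLOWED_ANNOTATIONS = {"animal", "cage", "bar"}


def annotations(unit):
    """Return the text of every {annotation} found in a UCUM unit string."""
    return re.findall(r"\{([^{}]*)\}", unit)


def unknown_annotations(unit, allowed=ALLOWED_ANNOTATIONS):
    """Return the annotations in the unit that are not on the allowed list."""
    return [a for a in annotations(unit) if a not in allowed]


for unit in ["g/{animal}/d", "{animal}/{cage}", "g/{rat}/d", "bar", "{bar}"]:
    print(unit, unknown_annotations(unit))
```

Here "bar" (the pressure unit) contains no annotation at all, while "{bar}" is recognized as the annotation, so the two never collide.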
Another advantage of having CDISC control the annotations instead of the "units list" itself is that it allows for flexibility. Under the current approach, if someone (e.g. an investigator) needs a unit "animals per cage", a request must be made to CDISC to extend the "UNIT" codelist with a new term, which takes months, with the possibility that the new term request is turned down.
When using UCUM, however, with CDISC controlling the annotations, the term can be used immediately, as "cage" and "animal" are already in the list of allowed annotations. So the investigator can just use:
{animal}/{cage}
which is a valid UCUM unit.
Now the investigator realizes that "animals per cage" is not very precise. He/she has chickens, so what is important for the "density" (a UCUM property, by the way) is the number of chickens per square meter. Instead of having to request a new term once again, he/she can simply use:
which is again a valid UCUM unit with the additional advantage that a computer can immediately and fully automatically calculate how many "animals per square yard" that is.
The investigator, however, also works with birds that can really fly (in contrast to chickens). In that case, the density is better defined as the number of birds per cubic meter. Without having to request a new term again, he/she can now write:
which again is fully automatically interconvertible to e.g. "number of animals per gallon".
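The arithmetic behind those two conversions is short enough to sketch by hand. I assume the international yard (UCUM code [syd_i] for the square yard) and the US gallon (UCUM code [gal_us]); the function names are illustrative only.

```python
from fractions import Fraction

# International yard: exactly 0.9144 m.
M_PER_YARD = Fraction(9144, 10000)
M2_PER_SQUARE_YARD = M_PER_YARD ** 2

# US gallon (assumption: the post does not say which gallon is meant):
# 231 cubic inches, with 1 inch = 0.0254 m exactly.
M3_PER_US_GALLON = 231 * Fraction(254, 10000) ** 3


def per_m2_to_per_square_yard(density):
    """Animals per square meter -> animals per square yard."""
    return density * M2_PER_SQUARE_YARD


def per_m3_to_per_us_gallon(density):
    """Animals per cubic meter -> animals per US gallon."""
    return density * M3_PER_US_GALLON


print(float(per_m2_to_per_square_yard(10)))
print(float(per_m3_to_per_us_gallon(1000)))
```

A square yard is smaller than a square meter, so the number of animals per square yard is smaller than the number per square meter; the same holds for the gallon versus the cubic meter.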
So my proposal to CDISC is: discontinue the development of this ever-growing list of "units" that are not really units and that are not interconvertible by computers. Start using UCUM and publish lists of allowed annotations. For each SEND (but also SDTM) variable, CDISC can then publish a list of (one or more) "strongly preferred" units. For example for "height of subjects" (DM.HEIGHT):
(the latter is UCUM for "inches"). This is a valid set of UCUM units which are fully interconvertible by computers (UCUM "knows" that an inch is 2.54 cm; CDISC-CT does not have that in machine-readable form).
Or for a SEND variable that describes the amount of food for the animals:
which are all valid UCUM units, with the additional advantage that even when the investigator has collected the amount of food as "gram per animal per month" (g/{animal}/mo), this can be fully automatically recalculated into one of the above.
Comments are very welcome, as always.
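The month case works because UCUM gives "mo" a fixed definition: the mean Julian month, one twelfth of a Julian year of 365.25 days. A small sketch of that recalculation (the function name is my own):

```python
from fractions import Fraction

# UCUM "mo": mean Julian month = 365.25 d / 12 = 30.4375 d.
DAYS_PER_MONTH = Fraction(36525, 100) / 12


def per_month_to_per_day(amount):
    """g/{animal}/mo -> g/{animal}/d."""
    return amount / DAYS_PER_MONTH


print(float(per_month_to_per_day(487)))
```

So 487 g/{animal}/mo corresponds to 16 g/{animal}/d.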

1 comment:

  1. Thanks for the helpful examples, Jozef. The case for UCUM seems very solid to me.

    There seem to be three objections to the use of UCUM:

    The first is that whilst UCUM provides documentation of prefixes, core elements, and symbols that can be used to create 'standard' unit terms for consumption by the biomedical and healthcare industry, it does not provide a finite list of standardised, electronically consumable units of measure terminology. The custodians of UCUM would say that is because that list is, to all intents and purposes, infinite. Nonetheless, that does not preclude CDISC from creating a UCUM compliant list, following the UCUM rules. The CDISC unit codelist is not UCUM compliant.

    The second is that unit fragments contained within UCUM do not cover the breadth of units required for all data submissions to FDA. I think you have nicely shown that not to be the case. One of the issues we have in clinical research is that the definitions of data elements and terminology are often ambiguous. The UCUM approach is explicit and eliminates any such ambiguity. I really like that.

    The third is that UCUM has multiple terms with the same meaning (e.g. g/l and mg/ml). I see the complete coverage that UCUM provides as a benefit, not a hindrance. Often there is no worldwide agreement on preferred units (e.g. subjects’ weight is measured in kg in some countries and in lbs in others). When GSK use local labs in studies (particularly cancer studies), we have to deal with very many units (i.e. units that would not be found in the CDISC terminology). In particular, we spend quite a bit of time determining conversion factors from esoteric units to our preferred unit. We also have to deal with cases where there is a business need to convert data from preferred units to other units. It would be much nicer to have an (industry) tool that could handle that for us, and that would require compliance with UCUM. None of that precludes CDISC from deciding which of the UCUM-compliant units are the CDISC preferred units and recording that in the form of a terminology.