Friday, April 11, 2014

Using UCUM units for CDISC-SEND



At the European CDISC Interchange we once again discussed replacing the CDISC controlled terminology for units with UCUM, the worldwide standard for units, which is used throughout healthcare and in electronic health records (it is mandatory in HL7-CDA).
 
Once again, I heard the objection that UCUM is not usable in SEND (Standard for Exchange of Nonclinical Data), which is specifically about preclinical research using animals, bacteria, etc.
 
The statement that UCUM is not usable for SEND is simply not true. Most people who raise this argument have not read the UCUM specification, so I will explain here how UCUM can (and should) be used for SEND.
 
Consider the following "unit" used in SEND:  g/animal/day
 
It contains three components, one of which is not a unit at all: "animal". Unfortunately, throughout the CDISC-CT for units, objects or properties (like "animal") have been mixed up with real units. Sometimes this is simply necessary to compensate for basic errors made in the SDTM and SEND models.
As this also happens in other industries and sciences, UCUM has developed a mechanism for dealing with it, called "annotations". So the UCUM notation for "g/animal/day" is:
 
g/{animal}/d
 
where the "topic" of the unit (what it is about) is placed in curly brackets: the "annotation".
Also note that "day" is written differently than in the CDISC notation: UCUM uses "d".

One of the advantages of using UCUM is that it comes with a machine-readable XML file (ucum-essence.xml) that describes all the units and the relations between them. So UCUM "knows" that a day is 24 hours, that an hour is 60 minutes, and so on, all in a form that software can process. If you want to express a value of N g/{animal}/d in "milligram per animal per minute", the calculation can be fully automated (which is impossible when using CDISC-CT).
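
To make the idea of an automated conversion concrete, here is a minimal Python sketch. It does not read ucum-essence.xml: the small TO_BASE table is a hand-copied subset of the UCUM definitions, and the names TO_BASE, factor and convert are hypothetical, chosen only for this illustration. It handles only simple expressions of the form a/b/c.

```python
# Tiny hand-copied subset of UCUM definitions (assumption: a real tool
# would load these from ucum-essence.xml instead of hard-coding them).
# Maps a unit symbol to (factor, base unit it is expressed in).
TO_BASE = {
    "g": (1.0, "g"),
    "mg": (1e-3, "g"),
    "s": (1.0, "s"),
    "min": (60.0, "s"),      # 1 min = 60 s
    "h": (3600.0, "s"),      # 1 h = 60 min
    "d": (86400.0, "s"),     # 1 d = 24 h
}


def factor(unit):
    """Return (factor to base units, base-unit form) for e.g. 'g/{animal}/d'.

    Only handles simple 'a/b/c' expressions: the first term is the
    numerator and every later term divides.
    """
    total = 1.0
    base_terms = []
    for position, term in enumerate(unit.split("/")):
        if term.startswith("{") and term.endswith("}"):
            scale, base = 1.0, term      # annotation: behaves like the number 1
        else:
            scale, base = TO_BASE[term]
        total = total * scale if position == 0 else total / scale
        base_terms.append(base)
    return total, "/".join(base_terms)


def convert(value, source, target):
    """Convert a value from the source unit to the target unit."""
    source_factor, source_base = factor(source)
    target_factor, target_base = factor(target)
    # This check is stricter than UCUM itself, which ignores annotations
    # when deciding whether two units are commensurable.
    if source_base != target_base:
        raise ValueError(f"{source} and {target} are not commensurable")
    return value * source_factor / target_factor


print(convert(144, "g/{animal}/d", "mg/{animal}/min"))
```

With these factors, 144 g/{animal}/d comes out at about 100 mg/{animal}/min, since 144 g is 144000 mg and a day has 1440 minutes.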
 
The next objection of the SEND people followed immediately: "yes, but you can write anything between the curly brackets (which mark that part as an 'annotation'), so we no longer have any control over what people will submit".
Well, the answer to that is pretty easy: CDISC should not control "units" like "g/animal/day", it should control the annotations. So instead of publishing an ever-growing list containing things like "g/animal/day", CDISC should publish lists of allowed "annotations". I picked out a few examples from the current SEND-CT that could be handled this way:
{animal}
{cage}
{CAPSULE}
{BAR}
 
Note that "bar" has a totally different meaning in CDISC-CT than in the rest of the world: a "bar" in CDISC-CT is a unit of packaging, like "a bar of chocolate", whereas for the rest of the world it is a unit of pressure. Marking it as a UCUM annotation (i.e. putting it in curly brackets) makes it immediately clear that it is not a unit of pressure but something else, and it can even be used alongside the real "bar" unit.
 
Another advantage of having CDISC control the annotations instead of the "units list" itself is that it allows for flexibility. Today, if someone (e.g. an investigator) needs a unit "animals per cage", a request must be made to CDISC to extend the "UNIT" codelist with a new term. That takes months, and the request may be turned down.
 
When using UCUM, however, with CDISC controlling the annotations, the term can be used immediately, as "cage" and "animal" are already in the list of allowed annotations. So the investigator can just use:
 
{animal}/{cage}
 
which is a valid UCUM unit.
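
Checking a submitted unit against a published annotation list is easy to automate. The following Python sketch is only an illustration: the set ALLOWED_ANNOTATIONS contains just the four examples listed above (not an actual CDISC publication), and the function name disallowed_annotations is hypothetical. It checks the annotations only, not whether the rest of the expression is valid UCUM.

```python
import re

# Assumption: only the four example annotations from this post; a real
# list would be published and maintained by CDISC.
ALLOWED_ANNOTATIONS = {"animal", "cage", "CAPSULE", "BAR"}


def disallowed_annotations(unit):
    """Return the annotations in a unit expression that are not on the allowed list."""
    # An annotation is whatever sits between a pair of curly brackets.
    found = re.findall(r"\{([^{}]*)\}", unit)
    return [annotation for annotation in found
            if annotation not in ALLOWED_ANNOTATIONS]


print(disallowed_annotations("{animal}/{cage}"))   # nothing to report
print(disallowed_annotations("g/{rat}/d"))         # reports the annotation "rat"
```

The comparison here is case-sensitive, which is a design choice of this sketch: "{bar}" would be reported while "{BAR}" is accepted.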
 
Now the investigator realizes that "animals per cage" is not very precise. He/she has chickens, so what matters for the "density" (a UCUM property, by the way) is the number of chickens per square meter. Instead of having to request a new term once again, he/she can simply use:
 
{animal}/m2
 
which is again a valid UCUM unit with the additional advantage that a computer can immediately and fully automatically calculate how many "animals per square yard" that is.
 
The investigator, however, also works with birds that can really fly (in contrast to chickens). In that case, the density is better defined by the number of birds per cubic meter. Without having to request a new term again, he/she can now write:
 
{animal}/m3
 
which again is fully automatically interconvertible to e.g. "number of animals per gallon".
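
The two density conversions just mentioned boil down to a few exact factors. The Python sketch below hard-codes them rather than reading ucum-essence.xml, and all constant and function names are hypothetical. It assumes the international yard ([yd_i] in UCUM) and the US gallon ([gal_us]); the post does not say which gallon is meant.

```python
# Exact definitions (international inch and yard, US gallon).
INCH_IN_M = 0.0254                    # [in_i] = 2.54 cm
YARD_IN_M = 36 * INCH_IN_M            # [yd_i] = 3 ft = 36 in
GALLON_IN_M3 = 231 * INCH_IN_M ** 3   # [gal_us] = 231 cubic inches


def per_m2_to_per_yd2(density):
    """Convert {animal}/m2 to {animal}/[yd_i]2."""
    # A square yard is smaller than a square meter, so it holds fewer animals.
    return density * YARD_IN_M ** 2


def per_m3_to_per_gallon(density):
    """Convert {animal}/m3 to {animal}/[gal_us] (assumption: US gallon)."""
    return density * GALLON_IN_M3


print(per_m2_to_per_yd2(10))       # animals per square yard
print(per_m3_to_per_gallon(1000))  # animals per US gallon
```

For instance, 10 animals per square meter corresponds to roughly 8.36 animals per square yard, because a square yard is 0.83612736 square meters.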
 
So my proposal to CDISC is: discontinue the development of this ever-growing list of "units" that are not really units and that cannot be interconverted by computers. Start using UCUM and publish lists of allowed annotations. For each SEND (but also SDTM) variable, CDISC can then publish a list of (one or more) "strongly preferred" units. For example for "height of subjects" (DM.HEIGHT):
cm
m
[in_i]
 
(the latter is UCUM for "inches"). This is a valid set of UCUM units that are fully interconvertible by computers: UCUM "knows" that an inch is 2.54 cm, whereas CDISC-CT does not have that in machine-readable form.
 
Or for a SEND variable that describes the amount of food for the animals:
 
g/{animal}/d
g/{animal}/wk
 
which are all valid UCUM units, with the additional advantage that even when the investigator has collected the amount of food as "gram per animal per month" (g/{animal}/mo), the value can be fully automatically recalculated into one of the above.
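
As a small worked example of that recalculation, here is a Python sketch. It relies on the UCUM definitions that "wk" is 7 d and that "mo" is the mean Julian month, i.e. one twelfth of a Julian year of 365.25 d. The constants are hard-coded here instead of being read from ucum-essence.xml, and the function names are hypothetical.

```python
DAYS_PER_WEEK = 7.0            # UCUM: wk = 7 d
DAYS_PER_MONTH = 365.25 / 12   # UCUM: mo = mean Julian month = 30.4375 d


def per_month_to_per_day(amount):
    """Convert g/{animal}/mo to g/{animal}/d."""
    return amount / DAYS_PER_MONTH


def per_month_to_per_week(amount):
    """Convert g/{animal}/mo to g/{animal}/wk."""
    return amount * DAYS_PER_WEEK / DAYS_PER_MONTH


print(per_month_to_per_day(608.75))
print(per_month_to_per_week(608.75))
```

So 608.75 g/{animal}/mo corresponds to 20 g/{animal}/d, or 140 g/{animal}/wk. Note that this uses the UCUM mean month, not the actual number of days in a particular calendar month.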
 
Comments are very welcome, as always.