Finding Size Measurements
Size measurements are common in electronic health records, especially in radiology and other diagnostic reports. By ‘size measurement’ we mean a 1D, 2D, or 3D expression involving lengths, such as:
| 1.2 cm x 3.6 cm | 2D measurement |
| 3 by 4 by 5 cm | 3D measurement |
| 1.5 cm2 | area measurement |
| 4.3 mm3 | volume measurement |
| 2.3 - 4.5 cm | range of lengths |
| 1.1, 2.3, 8.5, and 12.6 cm | list of lengths |
| 1.5cm craniocaudal x 2.2cm transverse | measurement with views |
ClarityNLP scans sentences for size measurements, extracts the numeric values for each dimension, normalizes each to a common set of units (performing unit conversions if necessary), and provides output in JSON format to other pipeline components.
The source code for the size measurement finder module is located in
Input: a single string, the sentence to be scanned for size measurements.
Output: a JSON array containing these fields for each size measurement found:
| text | text of the complete measurement |
| start | offset of the first character in the matching text |
| end | offset of the final character in the matching text plus 1 |
| temporality | CURRENT or PREVIOUS, indicating when the measurement occurred |
| units | either mm, mm2, or mm3 |
| condition | either ‘RANGE’ for numeric ranges, or ‘EQUAL’ for all others |
| x | numeric value of first number |
| y | numeric value of second number |
| z | numeric value of third number |
| values | for lists, a JSON array of all values in the list |
| xView | view specification for the first axis |
| yView | view specification for the second axis |
| zView | view specification for the third axis |
All JSON measurement results contain an identical number of fields. Any fields that are not valid for a given measurement will have a value of EMPTY_FIELD and should be ignored.
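A consumer of this output will usually want to discard the placeholder fields before using a result. The sketch below shows one way to do that. The sample record is hand-written for illustration and is not actual ClarityNLP output, the `valid_fields` helper is hypothetical, and the use of JSON `null` for `EMPTY_FIELD` is an assumption; substitute whatever sentinel value ClarityNLP actually emits.

```python
import json

# Assumption: None (JSON null) stands in for ClarityNLP's EMPTY_FIELD value.
EMPTY_FIELD = None

def valid_fields(measurement):
    """Return a copy of one measurement with EMPTY_FIELD entries removed."""
    return {k: v for k, v in measurement.items() if v != EMPTY_FIELD}

# Hand-written illustrative record for the text "3 x 5 cm" (not real output).
sample = json.loads('''
[{"text": "3 x 5 cm", "start": 0, "end": 8,
  "temporality": "CURRENT", "units": "mm", "condition": "EQUAL",
  "x": 30, "y": 50, "z": null, "values": null,
  "xView": null, "yView": null, "zView": null}]
''')

for measurement in sample:
    print(valid_fields(measurement))
```

Because every record carries the same set of keys, the filtering step can be applied uniformly without first checking which kind of measurement was found.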
All string operations of the size measurement finder are case-insensitive.
ClarityNLP uses a set of regular expressions to recognize size measurements. It scans a sentence with each regex, keeps track of any matches, and selects the longest match among them. The longest matching text string is then tokenized, values are extracted, units are converted, and a Python namedtuple representing the measurement is generated. This process is repeated until no more measurements are found, at which point the array of measurement namedtuples is converted to JSON and returned to the caller.
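The longest-match loop described above can be sketched roughly as follows. This is a simplified illustration and not the ClarityNLP implementation: the two regexes, the `Match` namedtuple, and the `find_longest_matches` function are hypothetical, only `mm` and `cm` are handled, and overwriting each accepted match with `#` characters is one assumed way to keep later passes from matching the same text while preserving character offsets.

```python
import re
from collections import namedtuple

# Hypothetical result record; the real namedtuple carries more fields.
Match = namedtuple('Match', 'text start end')

NUM = r'(?:\d+(?:\.\d+)?|\.\d+)'
UNITS = r'(?:mm|cm)'

# Two deliberately simple patterns standing in for the full regex set.
REGEXES = [
    re.compile(rf'{NUM}\s*{UNITS}\b', re.IGNORECASE),                     # x cm
    re.compile(rf'{NUM}\s*(?:by|x)\s*{NUM}\s*{UNITS}\b', re.IGNORECASE),  # x by y cm
]

def find_longest_matches(sentence):
    """Repeatedly take the longest regex match until nothing matches."""
    work = sentence
    results = []
    while True:
        candidates = [m for m in (r.search(work) for r in REGEXES) if m]
        if not candidates:
            break
        best = max(candidates, key=lambda m: m.end() - m.start())
        results.append(Match(best.group(), best.start(), best.end()))
        # Overwrite the matched span so it cannot match again; offsets stay valid.
        filler = '#' * (best.end() - best.start())
        work = work[:best.start()] + filler + work[best.end():]
    return sorted(results, key=lambda r: r.start)

for found in find_longest_matches('Mass is 5 cm, previously 3 x 4 cm.'):
    print(found)
```

Preferring the longest match is what prevents `3 x 4 cm` from being reported as the shorter `4 cm`, which the single-value pattern would also match.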
ClarityNLP is able to recognize size measurements in a number of different formats. Using notation similar to that of , we define the following quantities:
| x y z | Any numeric value, either floating point or integer |
| cm | Units for the preceding numeric value |
| by | Either the word ‘by’ or the symbol ‘x’ |
| to | Either the word ‘to’ or the symbol ‘-’ |
| vol | Dimensional modifier, one of ‘square’, ‘cubic’, ‘sq’, ‘sq.’, ‘cu’, ‘cu.’, or ‘cc’ |
| view | View specification, any word will match |
With these definitions, the measurement formats that ClarityNLP recognizes are:
| x cm | 3 mm, 5cm, 10.2 inches |
| x vol cm | 5 square mm, 3.2cm2 |
| x to y cm | 3-5 cm, 3 to 5cm |
| x cm to y cm | 3 cm to 5 cm, 3cm - 5 cm |
| x by y cm | 3 x 5 inches, 3x5 cm |
| x cm by y cm | 3 mm by 5 mm |
| x cm view by y cm view | 3 cm craniocaudal x 5 cm transverse |
| x by y by z cm | 3 x 5 x 7 mm |
| x by y cm by z cm | 3 x 5mm x 7 mm |
| x cm by y cm by z cm | 3 mm x 5 mm x 7 mm |
| x cm view by y cm view by z cm view | 3 cm craniocaudal by 5cm transverse by 7 cm anterior |
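To make the notation concrete, here is a minimal sketch of how a single format, `x cm view by y cm view`, could be expressed as a regular expression with named groups. The pattern and the `parse_2d_with_views` helper are hypothetical and much narrower than ClarityNLP's own regexes: only `mm` and `cm` are accepted, no unit conversion is performed, and the returned dictionary merely borrows the field names from the output table.

```python
import re

NUM = r'\d+(?:\.\d+)?|\.\d+'

# Hypothetical pattern for the single format "x cm view by y cm view".
PATTERN = re.compile(
    rf'(?P<x>{NUM})\s*(?P<xunits>mm|cm)\s+(?P<xview>[a-z]+)\s*(?:by|x)\s*'
    rf'(?P<y>{NUM})\s*(?P<yunits>mm|cm)\s+(?P<yview>[a-z]+)',
    re.IGNORECASE,
)

def parse_2d_with_views(sentence):
    """Return the values and views of the first match, or None if no match."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return {
        'text': m.group(),
        'start': m.start(),
        'end': m.end(),
        'x': float(m.group('x')),
        'y': float(m.group('y')),
        'xView': m.group('xview').lower(),
        'yView': m.group('yview').lower(),
    }

print(parse_2d_with_views('1.5cm craniocaudal x 2.2cm transverse'))
```

Using `\s*` and `\s+` between components, rather than literal spaces, lets the same pattern tolerate missing spaces, repeated spaces, and embedded newlines.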
ClarityNLP can also find size measurements with nonuniform spacing between the various components, as several of the examples above demonstrate. Newlines can also be present within a measurement. Inconsistent spacing such as this appears frequently in electronic health records.
These medically-relevant measurement units are supported:
| millimeters | mm, millimeter, millimeters |
| centimeters | cm, centimeter, centimeters |
| inches | in, inch, inches |
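Normalizing these units to millimeters amounts to multiplying by a fixed factor, raised to the second or third power for areas and volumes. The following is a minimal sketch of that arithmetic. The `MM_PER_UNIT` table and the `to_mm` function are illustrative names and not part of ClarityNLP; the factors themselves (10 mm per cm, 25.4 mm per inch) are exact by definition.

```python
# Millimeters per unit, keyed by each accepted spelling (lowercase).
MM_PER_UNIT = {
    'mm': 1.0, 'millimeter': 1.0, 'millimeters': 1.0,
    'cm': 10.0, 'centimeter': 10.0, 'centimeters': 10.0,
    'in': 25.4, 'inch': 25.4, 'inches': 25.4,
}

def to_mm(value, units, power=1):
    """Convert a length (power=1), area (power=2), or volume (power=3) to mm."""
    factor = MM_PER_UNIT[units.lower()]
    return value * factor ** power

print(to_mm(3, 'cm'))              # length in mm
print(to_mm(1.5, 'cm', power=2))   # area in mm2
print(to_mm(2, 'Inches'))          # case-insensitive unit lookup
```

An area given in cm2 therefore scales by 100 and a volume in cm3 by 1000, which is why the power must be carried through the conversion rather than applied afterwards.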
ClarityNLP tries to distinguish the use of the word ‘in’ as a preposition from its use as a unit of length, but it cannot correctly identify all such instances. Hence the word ‘in’ preceded by a numeric value may sometimes generate false positives.
Numeric values can be integers (a sequence of digits) or floating-point values. A digit before the decimal point is optional. Some examples:
- 3, 42
- .314, 0.314
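A regular expression for numbers of this form could look like the sketch below. The `NUMBER` pattern and the `find_numbers` helper are illustrative assumptions, not the expression ClarityNLP itself uses.

```python
import re

# Digits with an optional fractional part, or a bare fraction such as ".314".
NUMBER = re.compile(r'\d+(?:\.\d+)?|\.\d+')

def find_numbers(text):
    """Return every numeric value found in text as a float."""
    return [float(m.group()) for m in NUMBER.finditer(text)]

print(find_numbers('3, 42, .314, 0.314'))  # [3.0, 42.0, 0.314, 0.314]
```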