Finding Size Measurements


Size measurements are common in electronic health records, especially in radiology and other diagnostic reports. By ‘size measurement’ we mean a 1D, 2D, or 3D expression involving lengths, such as:

Example Meaning
3mm 1D measurement
1.2 cm x 3.6 cm 2D measurement
3 by 4 by 5 cm 3D measurement
1.5 cm2 area measurement
4.3 mm3 volume measurement
2.3 - 4.5 cm range of lengths
1.1, 2.3, 8.5, and 12.6 cm list of lengths
1.5cm craniocaudal x 2.2cm transverse measurement with views

ClarityNLP scans sentences for size measurements, extracts the numeric values for each dimension, normalizes each to a common set of units (performing unit conversions if necessary), and provides output in JSON format to other pipeline components.

Source Code

The source code for the size measurement finder module is located in nlp/algorithms/finder/


A single string, the sentence to be scanned for size measurements.


A JSON array containing these fields for each size measurement found:

Field Name Explanation
text text of the complete measurement
start offset of the first character in the matching text
end offset of the final character in the matching text plus 1
temporality CURRENT or PREVIOUS, indicating when the measurement occurred
units either mm, mm2, or mm3
condition either ‘RANGE’ for numeric ranges, or ‘EQUAL’ for all others
x numeric value of first number
y numeric value of second number
z numeric value of third number
values for lists, a JSON array of all values in the list
xView view specification for the first axis
yView view specification for the second axis
zView view specification for the third axis
minValue either min([x, y, z]) or min(values)
maxValue either max([x, y, z]) or max(values)

All JSON measurement results contain an identical number of fields. Any fields that are not valid for a given measurement will have a value of EMPTY_FIELD and should be ignored.

All string operations of the size measurement finder are case-insensitive.


ClarityNLP uses a set of regular expressions to recognize size measurements. It scans a sentence with each regex, keeps track of any matches, and finds the longest match among the matching set. The longest matching text string is then tokenized, values are extracted, units are converted, and a python namedtuple representing the measurement is generated. This process is repeated until no more measurements are found, at which point the array of measurement namedtuples is converted to JSON and returned to the caller.

Measurement Formats

ClarityNLP is able to recognize size measurements in a number of different formats. Using notation similar to that of [1], we define the following quantities:

Shorthand Meaning
x y z Any numeric value, either floating point or integer
cm Units for the preceding numeric value
by Either the word ‘by’ or the symbol ‘x’
to Either the word ‘to’ or the symbol ‘-’
vol Dimensional modifier, either ‘square’, ‘cubic’, ‘sq’, ‘sq.’, ‘cu’, ‘cu.’, ‘cc’
view View specification, any word will match

With these definitions, the measurement formats that ClarityNLP recognizes are:

Regex Form Examples
x cm 3 mm, 5cm, 10.2 inches
x vol cm 5 square mm, 3.2cm2
x to y cm 3-5 cm, 3 to 5cm
x cm to y cm 3 cm to 5 cm, 3cm - 5 cm
x by y cm 3 x 5 inches, 3x5 cm
x cm by y cm 3 mm by 5 mm
x cm view by y cm view 3 cm craniocaudal x 5 cm transverse
x by y by z cm 3 x 5 x 7 mm
x by y cm by z cm 3 x 5mm x 7 mm
x cm by y cm by z cm 3 mm x 5 mm x 7 mm
x cm view by y cm view by z cm view 3 cm craniocaudal by 5cm transverse by 7 cm anterior

ClarityNLP can also find size measurements with nonuniform spacing between the various components, as several of the examples above demonstrate. Newlines can also be present within a measurement. Inconsistent spacing such as this appears frequently in electronic health records.


These medically-relevant measurement units are supported:

Units Textual Forms
millimeters mm, millimeter, millimeters
centimeters cm, centimeter, centimeters
inches in, inch, inches

ClarityNLP tries to distinguish uses of the word ‘in’ as a preposition vs. its use as a unit of length. It cannot correctly identify all such instances. Hence the word ‘in’ preceded by a numeric value may sometimes generate false positive results.

Numeric values can be integers (sequence of digits) or floating point values. The digit before the decimal point is optional. Some examples:

  • 3, 42
  • 12.4887
  • .314, 0.314


M. Sevenster, J. Buurman, P. Liu, J.F. Peters, P.J. Chang
Natural Language Processing Techniques for Extracting and Categorizing
Finding Measurements in Narrative Radiology Reports
Appl. Clin. Inform., 6(3) 600-610, 2015.