Tuesday, December 10, 2019

The Beauty of Signed vs. Unsigned Integers

"The time has come", the Walrus said, "to talk of many things, but mostly about signed and unsigned integers."

Well, he didn't really say that and I feel a bit guilty about butchering one of my favorite poems, but that's what I'm going to talk about here.

I think this is a topic that most modern coders don't think about a whole lot, but I'm sure that people who work in compiled languages, or who have been around the block a few times, will chuckle appreciatively about this. It recently bit me in the tuckus and I thought it would be useful to talk about it. But first, why does it matter?

RTFM

The data product I've been working with recently is produced by NOAA and is extremely well documented. It's one of the products generated by the GOES satellites, called AOD (Aerosol Optical Depth), and a key point the NOAA User Guide calls out is this:

"The AOD and its valid range are stored as packed 16-bit unsigned integers in the product files. This may be important when reading the data as some software may interpret them as signed integers. The packed values of AOD should have a valid range of [-0.05, 5]. "

Imagine my surprise then, when I plotted some of this data and got something that looked like this.

[Figure: AOD plot showing smoke from the Kincade Fire, with a color bar range of roughly [-2.5, 2.5]]

It's very pretty, and it nicely shows the smoke being produced by the Kincade Fire in Sonoma County, CA, but notice the range of the color bar. It shows a range of roughly [-2.5, 2.5], which is just not right. (BTW, the square brackets indicate a closed interval.)

Look at the Data, Dummy!

The NOAA User Guide states that the AOD data are "stored as packed 16-bit unsigned integers". I'll talk about "packing" in a separate post, but for now, if we just look at the "16-bit unsigned integer" part of the spec, that means I should have raw data values in the file in the range [0, 65535].
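As a quick sanity check, here's a minimal sketch using NumPy (which isn't part of the original file-reading code) that prints the exact ranges of both 16-bit integer types:

    import numpy as np

    # Exact representable ranges for 16-bit integers
    print(np.iinfo(np.uint16).min, np.iinfo(np.uint16).max)  # 0 65535
    print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)    # -32768 32767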

However, when I looked at the raw data I had extracted from the file, I found that I had Min and Max values of -31128 and 32750, respectively, and a whole lot of other numbers in between. Here's a little bit of the raw data, dumped out of the AOD file.

<snip>
    _, _, _, _, _, _, _, _, _, -30111, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, 12079, 12714, _, _, _, _, _, _, 
    4837, 3861, 3509, _, _, _, _, _, 3167, _, 3519, 4368, 2789, 3478, 2406, 
    3080, _, 2784, 2407, 3194, 2592, _, _, _, _, _, _, _, _, _, _, _, 3845, 
    _, _, _, 3712, _, _, _, _, 3015, 2937, 2818, _, 2981, _, _, _, _, _, _, 
    7934, 7676, _, _, _, _, _, _, _, _, _, _, _, _, _, 16670, 12271, _, -3, 
<snip>

(Note the presence of the -30111 value in the first line of output and the -3 in the last line.)

Signedness

The Wikipedia page on signedness describes it better than I can, but in a nutshell: if you have "unsigned" integers, there should be no negative numbers present. If there are, it means the data is being stored as a "signed" type. This is totally fine, and in fact certain file formats, NetCDF for example, silently store unsigned integers as signed when data is loaded into them. It's up to the data vendor to use metadata to notify the user that the data is unsigned, and in fact, the GOES data has this metadata tag:

AOD:_Unsigned = "true"
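Here's a minimal sketch of checking for that tag with the netCDF4 Python library (the file name is hypothetical, for illustration only):

    from netCDF4 import Dataset

    nc = Dataset("aod_sample.nc")  # hypothetical file name
    aod = nc.variables["AOD"]

    # The _Unsigned attribute is the vendor's hint that the signed
    # 16-bit values on disk should be reinterpreted as unsigned
    if getattr(aod, "_Unsigned", "false").lower() == "true":
        print("AOD values should be read as unsigned integers")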

Knowing that I shouldn't be seeing negative values in my data, but that I was, and that the data dictionary explicitly stated what I should be seeing, I decided to do some testing. Sure enough, if I loaded the following values into a NetCDF variable I had defined as "16-bit unsigned",

63896, 50000, 0, 32128

I got the following out when I read the file,

-1640, -15536, 0, 32128

And that was the key to figuring out the problem.
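You can reproduce this round trip in a couple of lines of NumPy; this is a sketch of the same bit reinterpretation, not the actual NetCDF read path we were using:

    import numpy as np

    # The test values, written as 16-bit unsigned integers
    written = np.array([63896, 50000, 0, 32128], dtype=np.uint16)

    # View the same bits as signed 16-bit integers, which is
    # effectively what a reader that ignores _Unsigned does
    print(written.view(np.int16))  # gives -1640, -15536, 0, 32128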

The Solution

As I mentioned above, the maximum value I should have been able to store as a 16-bit unsigned integer is 65535, which is 2^16 - 1 (one of the 65536 possible bit patterns is used for zero). Any value that is negative, but originally came from an unsigned data set, means that the original value was larger than the signed maximum. (I know, I know, what the hell does that mean?)

16-bit signed integers have a range of [−32768, 32767]. So a value that is negative means that it was greater than 32767 originally.

You can see that in my example above, where the unsigned value 63896 got stored as the signed value -1640. Basically, the negative side of 0 is used to store the offset from 2^16 (one more than the maximum unsigned value):

2^16 - offset = original value

65536 - 1640 = 63896

In any case, the simple solution to convert back is to add 65536 to every negative value in the raw data. Astute readers will notice that my original Min and Max values of -31128 and 32750 then convert to 34408 and 32750, which seems weird at first, because the old minimum is now larger than the old maximum. However, recall that the closer a negative signed number is to zero, the closer it is to the maximum unsigned value. So in the little data snippet I dumped out of the file, there was a -3. If we convert that to unsigned, we get 65533, which is very close to the maximum unsigned value possible.
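In code, that fix is a one-liner. Here's a NumPy sketch, using the Min, Max, and -3 values from my data:

    import numpy as np

    # Signed values as (mis)read from the file
    raw = np.array([-31128, 32750, -3], dtype=np.int32)

    # Add 2**16 to every negative value to recover the unsigned originals
    fixed = np.where(raw < 0, raw + 65536, raw)
    print(fixed)  # gives 34408, 32750, 65533

    # Equivalently, an int16 array can just be reinterpreted in place:
    # raw.astype(np.int16).view(np.uint16) gives the same result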

Moral of the Story

  • Look at your raw data very closely before assuming the code is incorrect
  • Better yet, write unit tests against test data you create, not operational data
  • Understand the data you're working with and read the documentation about it

We assumed initially that there was a problem in the code we'd written, or that we had somehow received scaled data from NOAA. Going back to look at the raw data, before ANY scaling was applied by the libraries, was the key to discovering what was happening. This is why it's good to create test data that you can write unit tests against, to validate that your code is performing "as expected". More importantly, though, really, really understand the data you're working with. This was an amazing learning experience for me, one that I really enjoyed, but it could have been incredibly frustrating.

Next posts will be on working with the NetCDF data format and on the concept of data "packing".
