Tuesday, December 10, 2019

The Beauty of Signed vs. Unsigned Integers

"The time has come", the Walrus said, "to talk of many things, but mostly about signed and unsigned integers."

Well, he didn't really say that and I feel a bit guilty about butchering one of my favorite poems, but that's what I'm going to talk about here.

I think this is a topic that most modern coders don't think about a whole lot, but I'm sure that people who work in compiled languages, or who have been around the block a few times, will chuckle appreciatively about this. It recently bit me in the tuckus and I thought it would be useful to talk about it. But first, why does it matter?

RTFM

The data product I've been working with recently is produced by NOAA and is extremely well documented. It's one of the products generated by the GOES satellites, called AOD (Aerosol Optical Depth), and a key point the NOAA User Guide calls out is this:

"The AOD and its valid range are stored as packed 16-bit unsigned integers in the product files. This may be important when reading the data as some software may interpret them as signed integers. The packed values of AOD should have a valid range of [-0.05, 5]. "

Imagine my surprise then, when I plotted some of this data and got something that looked like this.

[Image: old_AOD — plot of the AOD data with an incorrect color bar range]

It's very pretty, and it shows the smoke being produced from the Kincade Fire in Sonoma CA nicely, but notice the range of the color bar. It shows a range of roughly [-2.5, 2.5], which is just not right. (BTW, the square brackets indicate a closed interval.)

Look at the Data, Dummy!

The NOAA User Guide states that the AOD data are "stored as packed 16-bit unsigned integers". I'll talk about "packing" in a separate post, but for now, if we just look at the "16-bit unsigned integer" spec, that means I should have raw data values in the file in the range [0, 65535].
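If you want to double-check those type limits, NumPy will report them for you (a quick sketch; NumPy isn't part of the product spec, it's just a convenient way to look up the ranges):

import numpy as np

# Limits of the two 16-bit integer types
print(np.iinfo(np.uint16))  # min = 0, max = 65535
print(np.iinfo(np.int16))   # min = -32768, max = 32767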

However, when I looked at the raw data I had extracted from the file, I found that I had Min and Max values of -31128 and 32750, respectively, and a whole lot of other numbers in between. Here's a little bit of the raw data, dumped out of the AOD file.

<snip>
    _, _, _, _, _, _, _, _, _, -30111, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, 12079, 12714, _, _, _, _, _, _, 
    4837, 3861, 3509, _, _, _, _, _, 3167, _, 3519, 4368, 2789, 3478, 2406, 
    3080, _, 2784, 2407, 3194, 2592, _, _, _, _, _, _, _, _, _, _, _, 3845, 
    _, _, _, 3712, _, _, _, _, 3015, 2937, 2818, _, 2981, _, _, _, _, _, _, 
    7934, 7676, _, _, _, _, _, _, _, _, _, _, _, _, _, 16670, 12271, _, -3, 
<snip>

(Note the presence of the -30111 value in the first line of output and the -3 in the last line.)

Signedness

The Wikipedia page describes signedness better than I can, but in a nutshell, if you have "unsigned" integers, there should be no negative numbers present. If there are, it means the data is being stored as a "signed" type. This is totally fine; certain file formats, NetCDF for example, silently store unsigned integers as signed when data is loaded into them. It's then up to the data vendor to use metadata to notify the user that the data is unsigned, and in fact, the GOES data has this metadata tag,

AOD:`_Unsigned` = 'true'.
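Here's roughly how to check that tag and pull the raw packed values out in Python (a sketch using the netCDF4 library; the file name aod.nc and the variable name AOD are assumptions, so substitute whatever your product file uses):

from netCDF4 import Dataset

# File and variable names are assumptions; adjust for your product file
ds = Dataset('aod.nc')
aod = ds.variables['AOD']

# Turn off automatic masking/scaling so we see the raw packed integers
aod.set_auto_maskandscale(False)

print(aod.getncattr('_Unsigned'))  # 'true' for the GOES AOD product
raw = aod[:]                       # values as stored on disk (signed 16-bit here)
print(raw.min(), raw.max())

ds.close()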

Knowing that I shouldn't be seeing negative values in my data, but that I was, and that the data dictionary explicitly stated what I should be seeing, I decided to do some testing. Sure enough, if I loaded the following values into a NetCDF variable I had defined as "16-bit unsigned",

63896, 50000, 0, 32128

I got the following out when I read the file,

-1640, -15536, 0, 32128

And that was the key to figuring out the problem.
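You can reproduce the same round trip with nothing but NumPy, which makes the wraparound easy to see (a minimal sketch; it reinterprets the bits directly rather than going through a NetCDF file):

import numpy as np

# The values I wrote into the "16-bit unsigned" variable
unsigned = np.array([63896, 50000, 0, 32128], dtype=np.uint16)

# Reinterpret the same 16 bits as a signed type; large values "wrap around" to negative
signed = unsigned.view(np.int16)
print(signed)  # -1640, -15536, 0, 32128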

The Solution

As I mentioned above, the maximum value I should have been able to store as a 16-bit unsigned integer was 65536. (In actuality, it's 65535, but you can read about that here.) Any value that is negative, and coming originally from an unsigned data set, means that the original value was larger than the signed data maximum. (I know, I know, what the hell does that mean?)

16-bit signed integers have a range of [−32768, 32767]. So a value that is negative means that it was greater than 32767 originally.

You can see that in my example above, where the unsigned value 63896 got stored as the signed value -1640. Basically, the negative side of zero is used to store the offset from the maximum unsigned value.

65536 - offset = original value

65536 - 1640 = 63896

(65536 is 2^16, one more than the maximum unsigned value of 65535, which is why the arithmetic works out.)

In any case, the simple solution to convert back is to add 65536 to every negative value in the raw data. Astute readers will notice that my original Min and Max values of -31128 and 32750 then convert to 34408 and 32750, which seems weird at first. However, recall that the closer a negative signed number is to zero, the closer it is to the maximum unsigned value. So in the little data snippet I dumped out of the file, there was a -3. If we convert that to unsigned, we get 65533, which is very close to the maximum unsigned value possible.
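In NumPy terms that fix is a one-liner; you can either add 65536 to the negatives or simply reinterpret the array as unsigned (a sketch, assuming the raw values came back as a 16-bit signed array):

import numpy as np

# A few raw values as read from the file (signed 16-bit)
raw = np.array([-30111, 12079, 3861, -3, 32750], dtype=np.int16)

# Option 1: add 2^16 to every negative value
fixed = np.where(raw < 0, raw.astype(np.int32) + 65536, raw.astype(np.int32))

# Option 2: reinterpret the same bits as unsigned 16-bit
fixed_view = raw.view(np.uint16)

print(fixed)       # 35425, 12079, 3861, 65533, 32750
print(fixed_view)  # same values, just as uint16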

Moral of the Story

  • Look at your raw data very closely before assuming the code is incorrect
  • Better yet, write unit tests against test data you create, not operational data
  • Understand the data you're working with and read the documentation about it

We assumed initially that there was a problem in the code we'd written, or that we had somehow received scaled data from NOAA. Going back to look at the raw data, before ANY scaling was applied by the libraries, was the key to discovering what was happening. This is why it's good to create test data that you can write unit tests against, to validate that your code is performing "as expected". More importantly though, really, really understand the data you're working with. This was an amazing learning experience for me, one that I really enjoyed, but it could have been incredibly frustrating.

Next posts will be on working with the NetCDF data format and on the concept of data "packing".

Displaying Rasters in Jupyter Notebooks

Rasters in Jupyter

It's the easy stuff that trips you up the most, right? Here I was, wanting to talk about creating rasters in R and Python when I realized that I didn't have a great handle on how to display them. Turns out that it is very simple to do. (GeoTIFF generated from code taken from Jared's excellent gdal/ogr cookbook.)

In Python

Use GDAL and Matplotlib

# Render plots inline in the notebook
%matplotlib inline
from matplotlib import pyplot
from osgeo import gdal

# Read the GeoTIFF into a NumPy array and display it
img = gdal.Open('test.tif').ReadAsArray()
im = pyplot.imshow(img)

[Image: python_jupyter_geotif — the GeoTIFF displayed in a Jupyter notebook with Matplotlib]

In R

Use the raster library's built-in plot() function

library(raster)
library(repr)

# repr options keep the inline plot a reasonable size
options(repr.plot.width=4, repr.plot.height=3)

# Load the GeoTIFF as a RasterLayer and plot it
img_file <- raster("test.tif")
plot(img_file)

NOTE: repr is used solely to control the output size of the plot. Without it, the image is rather large.

[Image: R_jupyter_geotif — the GeoTIFF plotted in a Jupyter notebook with the raster library]

Rebirth of the Blog

Rebirth of the Rants

Well, maybe not so much the rants, but since going back to work, I've felt the need to capture a few of the things I've had to re-learn for posterity. Maybe one day I'll organize them and use them for tutorials, or something, but in the meantime, at least I'll have captured them somewhere that I can search and easily share them.

I had been using Github personal pages to do something similar, but I could never get into the swing of updating the pages there. The Pure.css templates that I used on that site looked great, but there was overhead in maintaining them AND, as my friend Mash pointed out to me years ago, it's stupid not to leverage some of the built-in tools that Google provides on Blogger.

This time around, I'm going to draft all my entries in Simplenote as Markdown and then cut-and-paste the HTML they generate into Blogger. I use Simplenote already and while it's not perfect, I like it. I did a little write-up on it a year ago and it's probably time to do another that captures what I've learned about using it since then, both good and bad.

There's a great deal of old and crufty stuff in the blog right now, but I'll whittle it down as I move forward. For now, I'll focus on getting new material up that I think is relevant.

So without further ado, let's get to it!