Thursday, May 10, 2012

Through the JVM looking-glass.

So you've got a binary file full of integers in bog-standard x86 format, just a straight list of unsigned short ints.  And you need to read them into a Java ArrayList.  Sounds simple enough.

Now, you're an experienced C programmer, so you know you'll have to deal with endianness: the JVM is big-endian (its streams and buffers read multi-byte values most-significant byte first), and x86 is little-endian.  Every time you find 0xff00 in your file, you need to flip those bytes around to read 0x00ff.  So you write a simple loop -- in pseudocode,

int accumulator = 0;
while( b = file.readByte() ) {
   accumulator = accumulator | b;
   accumulator = accumulator << 8;
}

And lo and behold, 0xff00 becomes ... -1.

You might have guessed it already, but the problem is that Java's byte type is signed (there is no unsigned byte), so that first byte, 0xff, is read as -1 per 2's complement.

Now, if you can visualize all that bit-shifting and or'ing, you might not see this as a problem -- the 0xff will just be OR'ed into your short, then shifted.  This is not what happens.

Instead, our byte b, which has been interpreted as -1, is silently sign-extended when it is widened for the OR -- 0xff becomes 0xffff (or 0xffffffff in a full int), which is then OR'ed with zero to yield all ones, aka -1.  Our next byte, 0x00, is widened and OR'ed the same way, to no effect.
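
To watch the sign extension happen in isolation, here is a minimal Java sketch.  The class and method names are made up for illustration; the only thing it relies on is that Java widens a byte to an int, sign and all, before applying the OR.

```java
// Illustrative sketch; SignExtensionDemo and naiveOr are hypothetical names.
class SignExtensionDemo {
    // OR a signed byte straight into an int accumulator, as the naive loop does.
    static int naiveOr(int accumulator, byte b) {
        return accumulator | b; // b is sign-extended to an int before the OR
    }

    public static void main(String[] args) {
        byte b = (byte) 0xff;                       // bit pattern 1111 1111
        System.out.println(b);                      // -1
        System.out.println(Integer.toHexString(b)); // ffffffff
        System.out.println(naiveOr(0, b));          // -1
    }
}
```

A byte with the top bit clear, such as 0x7f, goes through unharmed, which is why this kind of bug can hide until the data happens to contain a value of 0x80 or above.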

[Why anyone would want a signed byte is a whole 'nother question -- next time I want to store a value between -128 and 127, I guess I know what to use.]

The fix is to undo the 2's complement -- to turn -1 into 255, like so:

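int accumulator = 0;
while( b = file.readByte() ) {      // pseudocode: loop over each byte read
   int s = b;                       // widened with its sign: 0xff arrives as -1
   if( s < 0 ) {
      s = s + 0x100;                // undo the 2's complement: -1 becomes 255
   }
   accumulator = accumulator << 8;  // shift first, so the result isn't left one byte too far over
   accumulator = accumulator | s;
}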
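
For something you can actually compile, here is one possible sketch that decodes a whole buffer of little-endian unsigned shorts into a list.  It swaps in the common `b & 0xff` mask in place of the add-0x100 test above (the two give the same result for a byte), and it assumes the file has already been read into a byte array.  The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; UnsignedShortReader and decode are hypothetical names.
class UnsignedShortReader {
    // Decode little-endian unsigned 16-bit values from a byte array.
    // A trailing odd byte, if any, is ignored.
    static List<Integer> decode(byte[] data) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i + 1 < data.length; i += 2) {
            int low = data[i] & 0xff;      // mask strips the sign extension: -1 becomes 255
            int high = data[i + 1] & 0xff;
            values.add((high << 8) | low); // the low byte comes first in the file
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xff, 0x00, 0x34, 0x12};
        System.out.println(decode(data)); // [255, 4660]
    }
}
```

The values land in a List of Integer, not Short, because a Java short tops out at 32767 and anything above that would go negative all over again.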

That's how you read a single unsigned short in Java.  Just so intuitive, don't you think?  Compare with Python:

   (s,) = struct.unpack("<H", file.read(2))