Store: An Efficient Binary Serialization Library

Philipp Kant, FP Complete

philipp@fpcomplete.com

Haskell in Leipzig, 2016-09-15

Serialisation

serialisation: represent data as sequence of bytes
- to save it to files
- to send it to another computer/process
possible features
- versioning, backwards compatibility
- architecture independence
- cross-language compatibility
- easy to use
- speed

Store

use case: distributed high performance computing
typical data: vectors of simple data types, fits in memory
design goal: speed
- no versioning, fixed architecture
  
  communication between identical binaries on the same architecture
- minimalistic protocol
  
  no backtracking, avoid multiple allocations

Serialising Simple Data

data of fixed size (Int, Double, …) via Storable

class Storable a where
  sizeOf :: a -> Int -- ^ The value of the argument is not used.
  poke :: Ptr a -> a -> IO ()
  peek :: Ptr a -> IO a
  ...

what about lists and vectors?

prefix data by its length – but only when needed

data Size a
    = VarSize (a -> Int) -- ^ size depends on value
    | ConstSize !Int     -- ^ size is statically known

(size :: Size Int64)   = ConstSize 8
(size :: Size [Int64]) = VarSize (\xs -> 8 + 8 * length xs) -- length prefix + elements

For elementary datatypes like Int, Double, and the like, serialisation can be performed via the Storable typeclass in the base library.
It is very efficient (basically memcpy), but has one important limitation: the size of the serialised data must be determined by the type alone, and not the value.
- With a known size, Storable can simply copy the required number of bytes
- But that limitation is too restrictive: what if you want to serialise lists or vectors?
- You could just prefix every value by its size, but that would be excessive: the memory required for a Vector of Ints would be doubled. Also, determining the size of a vector would require traversing the vector.
- The solution is to distinguish between types with a fixed size and types whose size depends on the value.
- That way, for a Vector of Ints, we need only one additional Int (the length of the Vector).
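The distinction described in these notes can be sketched with base alone. This is a minimal illustration, not store's implementation: Size mirrors the type from the slide, while sizeOfValue, sizeInt64, sizeListInt64 and roundTrip are hypothetical names introduced here. The list size assumes an 8-byte length prefix followed by 8 bytes per element.

```haskell
import Data.Int (Int64)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (Storable, peek, poke, sizeOf)

-- | Mirrors the Size type from the slide.
data Size a
    = VarSize (a -> Int) -- size depends on the value
    | ConstSize !Int     -- size is known from the type alone

-- | Hypothetical helper: number of bytes needed for a given value.
sizeOfValue :: Size a -> a -> Int
sizeOfValue (VarSize f)   x = f x
sizeOfValue (ConstSize n) _ = n

-- | Fixed size: taken straight from Storable, the argument is not inspected.
sizeInt64 :: Size Int64
sizeInt64 = ConstSize (sizeOf (0 :: Int64))

-- | Variable size: one length prefix (assumed 8 bytes) plus the elements.
sizeListInt64 :: Size [Int64]
sizeListInt64 = VarSize (\xs -> 8 + 8 * length xs)

-- | Storable round trip: write a value to temporary memory and read it back.
roundTrip :: Storable a => a -> IO a
roundTrip x = alloca $ \p -> poke p x >> peek p
```

With these definitions, a list of three Int64 values needs 32 bytes: 8 for the length and 24 for the elements, instead of 48 if every element carried its own size.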

The Store Typeclass

-- public API
class Store a where
  size :: Size a
  poke :: a -> Poke ()
  peek :: Peek a

encode :: Store a => a -> ByteString
decode :: Store a => ByteString -> Either PeekException a

-- implementation, not exposed
newtype Poke a = Poke
  { runPoke :: forall byte. Ptr byte -> Int -> IO (Int, a) }
newtype Peek a = Peek
  { runPeek :: forall byte. Ptr byte -> Ptr byte -> IO (Ptr byte, a) }

internals of Poke and Peek not exposed

library declares many instances
more via Generic
Peek and Poke have Applicative and Monad instances
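To make the role of the Monad instance concrete, here is a simplified model of Poke, written against base only. It is a sketch under assumptions, not the library's code: the rank-2 type is replaced by Ptr Word8, and pokeStorable and demo are hypothetical names. The point is that each action receives the current offset and returns the next one, so sequencing with >> writes fields back to back without intermediate allocations.

```haskell
import Data.Int (Int32)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, peekByteOff, pokeByteOff, sizeOf)

-- | Simplified Poke: takes the buffer and the current offset,
-- returns the offset after writing together with a result.
newtype Poke a = Poke { runPoke :: Ptr Word8 -> Int -> IO (Int, a) }

instance Functor Poke where
    fmap f (Poke g) = Poke $ \p o -> do
        (o', x) <- g p o
        return (o', f x)

instance Applicative Poke where
    pure x = Poke $ \_ o -> return (o, x)
    Poke f <*> Poke g = Poke $ \p o -> do
        (o1, h) <- f p o
        (o2, x) <- g p o1
        return (o2, h x)

instance Monad Poke where
    Poke g >>= k = Poke $ \p o -> do
        (o', x) <- g p o
        runPoke (k x) p o' -- continue writing at the advanced offset

-- | Hypothetical primitive: write any Storable value at the current offset.
pokeStorable :: Storable a => a -> Poke ()
pokeStorable x = Poke $ \p o -> do
    pokeByteOff p o x
    return (o + sizeOf x, ())

-- | Write a tag byte followed by an Int32, then read both back.
-- Assumes the platform tolerates the unaligned Int32 access at offset 1.
demo :: IO (Int, Word8, Int32)
demo = allocaBytes 5 $ \p -> do
    (end, ()) <- runPoke (pokeStorable (1 :: Word8) >> pokeStorable (7 :: Int32)) p 0
    tag <- peekByteOff p 0
    val <- peekByteOff p 1
    return (end, tag, val)
```

Running demo yields the final offset 5 together with the tag 1 and the value 7 that were written.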

Declaring Instances

data Sumtype = I8 Int8
             | I32 Int32
instance Store Sumtype where
  poke (I8  x) = poke (0 :: Word8) >> poke x
  poke (I32 x) = poke (1 :: Word8) >> poke x
  size = VarSize $ \x -> 1 + case x of
    I8  _ -> 1
    I32 _ -> 4
  peek = do
    tag <- peek
    case tag :: Word8 of
      0 -> I8 <$> peek
      1 -> I32 <$> peek
      _ -> fail "Invalid tag"

… or simply

data Sumtype = I8 Int8
             | I32 Int32
             deriving Generic
instance Store Sumtype

no boilerplate, no mismatches between size and poke
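A round trip through the public API might look as follows. This sketch assumes the store package is installed and uses only encode, decode and PeekException from Data.Store; roundTrip is a hypothetical helper name.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Int (Int32, Int8)
import qualified Data.Store as S
import GHC.Generics (Generic)

data Sumtype = I8 Int8
             | I32 Int32
             deriving (Eq, Show, Generic)

-- The instance body is empty: size, poke and peek come from Generic.
instance S.Store Sumtype

-- | Serialise to a strict ByteString and parse it back.
roundTrip :: Sumtype -> Either S.PeekException Sumtype
roundTrip = S.decode . S.encode
```

For a well-formed value, roundTrip should return the original wrapped in Right; a Left would indicate that the bytes could not be decoded as a Sumtype.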

Benchmarks

serialise/deserialise vector of length 100 of

data SomeData = SomeData !Int64 !Word8 !Double
    deriving Generic
instance S.Store SomeData

Streaming Data

store: serialisation to/from strict ByteStrings

efficiency: one allocation per serialisation, no partial results
networking: data arrives in chunks, need streaming

add thin streaming layer on top of store

Streaming with ByteBuffer

decodeMessage :: (MonadIO m, Store a)
    => ByteBuffer -> m (Maybe ByteString) -> m (Maybe (Message a))

-- | Copy the contents of a 'ByteString' to a 'ByteBuffer'.
copyByteString :: MonadIO m
    => ByteBuffer
    -> ByteString
    -> m ()

-- | Try to get a pointer to @n@ bytes from the 'ByteBuffer'.
--
-- If there are not enough bytes in the ByteBuffer, indicate
-- how many bytes are needed
consume :: MonadIO m
    => ByteBuffer
    -> Int
    -> m (Either Int ByteString)

Implementing ByteBuffer

type ByteBuffer = IORef BBRef
data BBRef = BBRef {
      size      :: {-# UNPACK #-} !Int
      -- ^ The amount of memory allocated.
    , contained :: {-# UNPACK #-} !Int
      -- ^ The number of bytes currently in the 'ByteBuffer'.
    , consumed  :: {-# UNPACK #-} !Int
      -- ^ The number of bytes that have already been consumed.
    , ptr       :: {-# UNPACK #-} !(Ptr Word8)
      -- ^ Pointer to the beginning of the 'ByteBuffer'.
    }

-- invariants:
--   size >= contained >= consumed >= 0
--   contained - consumed = available

pointer arithmetic handled inside library

but what if there's a mistake in the library?

The ByteBuffer is implemented with a Pointer and a few numbers: size is the total number of allocated bytes, contained is the number of 'valid' bytes that have already been copied to the buffer, and consumed is the number of bytes that have already been read.
When we're adding bytes to the buffer, we might have to change its size. Also, after most of the bytes have been consumed, we want to move the remaining bytes to the front.
Some invariants should always be fulfilled: we can't have more bytes in the buffer than its size, and we never want to consume more bytes than have been copied to the buffer – otherwise we're reading garbage.
- All of this is implemented at a very low level – pointer arithmetic. The low-level details are well shielded from the user, but still … what if there's a mistake in the library code? That could lead to segfaults! That's not why we're using Haskell!
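The bookkeeping described above can be sketched with base alone. The record matches the slide, but invariantHolds, compact and demo are hypothetical names, and the code is an illustration rather than the library's implementation. compact moves the unconsumed bytes to the front of the buffer and resets the counters.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Marshal.Utils (moveBytes)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)

data BBRef = BBRef
    { size      :: !Int         -- allocated bytes
    , contained :: !Int         -- valid bytes copied into the buffer
    , consumed  :: !Int         -- bytes already read
    , ptr       :: !(Ptr Word8) -- start of the buffer
    }

-- | Runtime version of the invariant size >= contained >= consumed >= 0.
invariantHolds :: BBRef -> Bool
invariantHolds bb =
    size bb >= contained bb && contained bb >= consumed bb && consumed bb >= 0

-- | Move the unconsumed bytes to the front (regions may overlap,
-- hence moveBytes rather than copyBytes).
compact :: BBRef -> IO BBRef
compact bb = do
    let available = contained bb - consumed bb
    moveBytes (ptr bb) (ptr bb `plusPtr` consumed bb) available
    return bb { contained = available, consumed = 0 }

-- | Fill a buffer with bytes 0..5, pretend 4 were consumed, then compact.
demo :: IO (Bool, Int, Int, [Word8])
demo = do
    p <- mallocBytes 8
    mapM_ (\i -> pokeByteOff p i (fromIntegral i :: Word8)) [0 .. 5]
    bb <- compact BBRef { size = 8, contained = 6, consumed = 4, ptr = p }
    front <- mapM (peekByteOff (ptr bb)) [0 .. contained bb - 1]
    free p
    return (invariantHolds bb, contained bb, consumed bb, front)
```

After compacting, the two remaining bytes (4 and 5) sit at the start of the buffer, contained is 2 and consumed is 0. Nothing in the types prevents a wrong offset here, which is exactly the kind of mistake the notes worry about.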

How to Avoid Pointer Errors

if only the type system could help us!

enter: Refinement Types, LiquidHaskell

extend the type system with refinement predicates
The executable liquid tries to prove these predicates
non-invasive: annotations in comments
- no code changes
- no effect on performance

LiquidHaskell for ByteBuffer

{-@
data BBRef = BBRef
    { size      :: { v: Int | v >= 0 }
    , contained :: { v: Int | v >= 0 && v <= size }
    , consumed  :: { v: Int | v >= 0 && v <= contained }
    , ptr       :: { v: Ptr Word8 | (plen v) = size }
    }
@-}

each construction of a ByteBuffer will be checked:

new :: MonadIO m => Maybe Int -> m ByteBuffer
new maybel = liftIO $ do
    let l = max 0 . fromMaybe (4*1024*1024) $ maybel
    newPtr <- Alloc.mallocBytes l
    newIORef BBRef { ptr = newPtr
                   , size = l
                   , contained = 0
                   , consumed = 0
                   }

{-@ mallocBytes :: l:Nat -> IO ({v:Ptr a | plen v == l}) @-}

So, what do these annotations look like?

Here's the type of our buffer, with the invariants encoded in LiquidHaskell:
- size is not merely an Int, but an Int that is non-negative
- contained is also non-negative, and it cannot exceed the size of the buffer
- we're also specifying that we never want to read garbage, by forcing consumed to be less than or equal to contained.
- finally, we're requiring the length of the pointer – the number of allocated bytes – to be equal to the size field.
Whenever we construct a new BBRef, LiquidHaskell will check that all those constraints are true.
For example, the function new allocates a new buffer.
- The invariants among consumed, contained, and size are obviously respected – note the max 0 in the binding of l
- With this specification for mallocBytes, we also have the correct relation between size and the length of the pointer.
  
  There are quite a few specifications for library functions that come shipped with LiquidHaskell, and it is possible to give specifications for imported functions.

Validate Functions

{-@ unsafeConsume :: MonadIO m
  => ByteBuffer
  -> n:Nat
  -> m (Either Int ({v:Ptr Word8 | plen v >= n})) @-}
unsafeConsume :: MonadIO m
        => ByteBuffer
        -> Int
        -> m (Either Int (Ptr Word8))
unsafeConsume bb n = liftIO $ do
    bbref <- readIORef bb
    let available = contained bbref - consumed bbref
    if available < n
        then return $ Left (n - available)
        else do
             writeIORef bb bbref { consumed = consumed bbref + n }
             return $ Right (ptr bbref `plusPtr` consumed bbref)

{-@
  plen ptr == size >= 0
  contained <= size
  consumed <= contained
@-}

\begin{align}
&\texttt{available} \geq \texttt{n}\\
\Leftrightarrow\; & \texttt{contained} - \texttt{consumed} \geq \texttt{n}\\
\Rightarrow\; & \texttt{plen ptr} - \texttt{consumed} \geq \texttt{n}
\end{align}

Reallocations

{-@ enlargeBBRef ::
       b:BBRef
    -> i:Nat
    -> IO {v:BBRef | size v >= i
                  && contained v == contained b
                  && consumed v == consumed b} @-}
enlargeBBRef :: BBRef -> Int -> IO BBRef
enlargeBBRef bbref minSize = do
    let getNewSize s | s >= minSize = s
        getNewSize s = getNewSize $
          (ceiling . (*(1.5 :: Double)) . fromIntegral) (max 1 s)
        newSize = getNewSize (size bbref)
    ptr' <- Alloc.reallocBytes (ptr bbref) newSize
    return bbref { size = newSize
                 , ptr = ptr'
                 }

{-@ reallocBytes :: Ptr a -> l:Nat -> IO ({v:Ptr a | plen v == l}) @-}

Let's look at another function: when we want to copy the contents of a ByteString to a ByteBuffer that is smaller than the ByteString, we have to enlarge it by reallocating the pointer to a larger area of memory.

This should increase the size, but should not change the number of bytes that are in the buffer, nor the consumed bytes.
Setting it to the exact size required has a bad worst-case behaviour if we get increasingly bigger ByteStrings, so we grow the buffer exponentially instead.
- LiquidHaskell can verify that the newSize is indeed at least as large as the given number
- From the specification of reallocBytes, it sees that size matches the length of the pointer
- The other values of the buffer remain the same. In particular, since the size only grew, the numbers of contained and consumed bytes are still valid.
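The growth strategy can be looked at in isolation. The following sketch extracts the pure part of the size computation; nextSize, growTo and growthSteps are hypothetical names, and the factor 1.5 is taken from the slide.

```haskell
-- | One growth step: multiply by 1.5, rounding up, starting from at least 1.
nextSize :: Int -> Int
nextSize s = ceiling (1.5 * fromIntegral (max 1 s) :: Double)

-- | Smallest size reachable by repeated growth that is >= minSize.
growTo :: Int -> Int -> Int
growTo minSize s
    | s >= minSize = s
    | otherwise    = growTo minSize (nextSize s)

-- | All candidate sizes tried on the way, ending with the final one.
growthSteps :: Int -> Int -> [Int]
growthSteps minSize s =
    takeWhile (< minSize) (iterate nextSize s) ++ [growTo minSize s]
```

For example, growthSteps 10 0 is [0, 2, 3, 5, 8, 12]: the candidate size grows geometrically, so a steadily growing stream of inputs triggers only logarithmically many reallocations instead of one per input.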

Summary

store is for you if
- speed is important
- no need for cross-platform compatibility
- no need for backwards compatibility
- your data fits into memory
lots of instances out of the box, straightforward to declare your own
lightweight streaming layer
LiquidHaskell to prove absence of errors in the pointer logic