Philipp Kant, FP Complete

Haskell in Leipzig, 2016-09-15

- serialisation: represent data as sequence of bytes
- to save it to files
- to send it to another computer/process

- possible features
- versioning, backwards compatibility
- architecture independence
- cross-language compatibility
- easy to use
- speed

- use case: distributed high performance computing
- typical data: vectors of simple data types, fits in memory
- design goal: speed
no versioning, fixed architecture

communication between identical binaries on the same architecture

minimalistic protocol

no backtracking, avoid multiple allocations

- data of fixed size (Int, Double, …) via Storable

class Storable a where sizeOf :: a -> Int -- ^ The value of the argument is not used. poke :: Ptr a -> a -> IO () peek :: Ptr a -> IO a ...

what about lists and vectors?

prefix data by its length – but only when needed

data Size a = VarSize (a -> Int) -- ^ size depends on value | ConstSize !Int -- ^ size is statically known

(Size :: Int64) = ConstSize 8 (Size :: [Int64]) = VarSize (\xs -> 8 * length xs)

-- public API class Store a where size :: Size a poke :: Poke () peek :: Peek a encode :: Store a => a -> ByteString decode :: Store a => ByteString -> Either PeekException a -- implementation, not exposed newtype Poke a = Poke { runPoke :: forall byte. Ptr byte -> Int -> IO (Int, a) } newtype Peek a = Peek { runPeek :: forall byte. Ptr byte -> Ptr byte -> IO (Ptr byte, a) }

internals of `Poke`

and `Peek`

not exposed

- library declares many instances
- more via
`Generic`

`Peek`

and`Poke`

are`Applicative`

and`Monad`

instances

data Sumtype = I8 Int8 | I32 Int32 instance Store Sumtype where poke (I8 x) = poke (0 :: Word8) >> poke x poke (I32 x) = poke (1 :: Word8) >> poke x size = VarSize $ \x -> 1 + case x of I8 _ -> 1 I32 _ -> 4 peek = do tag <- peek case tag :: Word8 of 0 -> I8 <$> peek 1 -> I32 <$> peek _ -> fail "Invalid tag"

… or simply

data Sumtype = I8 Int8 | I32 Int32 deriving Generic instance Store Sumtype

no boilerplate, no mismatches between `size`

and `poke`

serialise/deserialise vector of length 100 of

data SomeData = SomeData !Int64 !Word8 !Double deriving Generic instance S.Store SomeData

store: serialisation to/from strict ByteStrings

efficiency: one allocation per serialisation, no partial results

- networking: data arrives in chunks, need streaming

add thin streaming layer on top of store

decodeMessage :: (MonadIO m, Store a) => ByteBuffer -> m (Maybe ByteString) -> m (Maybe (Message a))

-- | Copy the contents of a 'ByteString' to a 'ByteBuffer'. copyByteString :: MonadIO m => ByteBuffer -> ByteString -> m () -- | Try to get a pointer to @n@ bytes from the 'ByteBuffer'. -- -- If there are not enough bytes in the ByteBuffer, indicate -- how many bytes are needed consume :: MonadIO m => ByteBuffer -> Int -> m (Either Int ByteString)

type ByteBuffer = IORef BBRef data BBRef = BBRef { size :: {-# UNPACK #-} !Int -- ^ The amount of memory allocated. , contained :: {-# UNPACK #-} !Int -- ^ The number of bytes currently in the'ByteBuffer'. , consumed :: {-# UNPACK #-} !Int -- ^ The number of bytes that have already been consumed. , ptr :: {-# UNPACK #-} !(Ptr Word8) -- ^ Pointer to the beginning of the 'ByteBuffer'. }

-- invariants: -- size >= contained >= consumed >= 0 -- contained - consumed = available

- pointer arithmetic handled inside library

- but what if there's a mistake in the library?

if only the type system could help us!

enter: Refinement Types, LiquidHaskell

- extend the type system with
**refinement predicates** - The executable
`liquid`

tries to prove these predicates - non-invasive: annotations in comments
- no code changes
- no effect on performance

{-@ data BBRef = BBRef { size :: { v: Int | v >= 0 } , contained :: { v: Int | v >= 0 && v <= size } , consumed :: { v: Int | v >= 0 && v <= contained } , ptr :: { v: Ptr Word8 | (plen v) = size } } @-}

each construction of a ByteBuffer will be checked:

new :: MonadIO m => Maybe Int -> m ByteBuffer new maybel = liftIO $ do let l = max 0 . fromMaybe (4*1024*1024) $ maybel newPtr <- Alloc.mallocBytes l newIORef BBRef { ptr = newPtr , size = l , contained = 0 , consumed = 0 }

{-@ mallocBytes :: l:Nat -> IO ({v:Ptr a | plen v == l}) @-}

{-@ unsafeConsume :: MonadIO m => ByteBuffer -> n:Nat -> m (Either Int ({v:Ptr Word8 | plen v >= n})) @-} unsafeConsume :: MonadIO m => ByteBuffer -> Int -> m (Either Int (Ptr Word8)) unsafeConsume bb n = liftIO $ do bbref <- readIORef bb let available = contained bbref - consumed bbref if available < n then return $ Left (n - available) else do writeIORef bb bbref { consumed = consumed bbref + n } return $ Right (ptr bbref `plusPtr` consumed bbref)

{-@ plen ptr == size >= 0 contained <= size consumed <= contained @-}

\begin{align}
&\texttt{available}\; \geq \texttt{n}\\
\Leftrightarrow\; & \texttt{contained}\; - \texttt{consumed}\; \geq \texttt{n}\\
\Rightarrow\; & \texttt{plen p} - \texttt{consumed}\; \geq \texttt{n}\\
\end{align}

{-@ enlargeBBRef :: b:BBRef -> i:Nat -> IO {v:BBRef | size v >= i && contained v == contained b && consumed v == consumed b} @-} enlargeBBRef :: BBRef -> Int -> IO BBRef enlargeBBRef bbref minSize= do let getNewSize s | s >= minSize = s getNewSize s = getNewSize $ (ceiling . (*(1.5 :: Double)) . fromIntegral) (max 1 s) newSize = getNewSize (size bbref) ptr' <- Alloc.reallocBytes (ptr bbref) newSize return bbref { size = newSize , ptr = ptr' }

{-@ reallocBytes :: Ptr a -> l:Nat -> IO ({v:Ptr a | plen v == l}) @-}

- store is for you if
- speed is important
- no need for cross-platform compatibility
- no need for backwards compatibility
- your data fits into memory

- lots of instances out of the box, straightforward to declare your own
- lightweight streaming layer
- LiquidHaskell to prove absence of errors in the pointer logic