Wednesday 30 September 2015

I stumbled upon a 'functional programming interview question' recently, and I came up with a genuinely functional answer, so feel free to drop the quotes.

Question: Given 'n' and a stream of floats, return a stream of the running average of the most recent n elements. If there aren't n elements yet, output the average of all the elements available.

I looked at the problem and my first thought was a scanl maintaining the running average, the count and the last n elements, with an appropriate update function.
Something like:

`avg :: Int -> [Float] -> [Float]
avg b = map getAvg . tail . scanl f def
    where
      getAvg (x, _, _) = x
      def = (0, 0, [])
      f :: (Float, Int, [Float]) -> Float -> (Float, Int, [Float])
      f (_, 0, _) v = (v, 1, [v])
      f (avg, n', x:xs) v | n' == b = (avg - x/n + v/n, n', xs++[v])
                          | otherwise = (avg*n/(n+1) + v/(n+1), n'+1, x:xs++[v])
          where
            n :: Float
            n = fi n'

fi :: Int -> Float
fi = fromIntegral`

We could improve the update function by tracking the sum instead, and calculating the average at the time of `getAvg`:
`avg :: Int -> [Float] -> [Float]
avg b = map getAvg . tail . scanl f def 
    where 
      getAvg (x, n, _) = x/fi n
      def = (0, 0, [])
      f :: (Float, Int, [Float]) -> Float -> (Float, Int, [Float])
      f (_, 0, _) v = (v, 1, [v])
      f (s, n', x:xs) v | n' == b = (s - x + v, n', xs++[v])
                        | otherwise = (s + v, n'+1, x:xs++[v])`
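
A quick check (my example) that the two versions agree:

`λ> avg 3 [1,2,3,4,5]
[1.0,1.5,2.0,3.0,4.0]`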

While that's pretty functional, what with the scanl and all, I want to improve it.
The inspiration: `avg(i) = (f(i) - f(i-n))/n(i)`

So what do we need in order to calculate the average at some random point in the stream?
The sum of the last n elements and the min(number of elements, n).

Let f(i) | i < 0     = 0
         | otherwise = the sum of all elements up to the ith index.
Sum of the last n elements = f(i) - f(i-n)


In the spirit of Haskell we can combine these quite cleanly:
Running sums: `lSum = tail (scanl (+) 0 l)`
Sum of the elements: `f(i) = lSum !! i`
Sum of the elements n back: `f(i-n) = (replicate n 0 ++ lSum) !! i`
Sum of the last n elements: `f(i) - f(i-n) = zipWith (-) lSum (replicate n 0 ++ lSum) !! i`
Number of elements so far: `n(i) = map (fi . min n) [1..] !! i`
Average: `avg(i) = (f(i) - f(i-n))/n(i)`

Join 'em together:  
`f :: Int -> [Float] -> [Float]
f n l = zipWith (/) (zipWith (-) lSum lSum') lLen
    where
      lLen = map (fi . min n) [1..]
      lSum = tail (scanl (+) 0 l)
      lSum' = replicate n 0 ++ lSum

fi :: Int -> Float
fi = fromIntegral`
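
A quick check (my example) that this agrees with the scanl version:

`λ> f 3 [1,2,3,4,5]
[1.0,1.5,2.0,3.0,4.0]`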

Bootiful.

Heavy lifting: Mountains out of molehills

Small post today on some lovely stuff; time to strut that elegant Haskell plumage.

The specific task:
I represent a grid: `type Grid = (Int -> Info, Int -> Info)`
where the two functions are: `row index -> row information, col index -> col information`.

I have a function that updates a particular location (row, column) in the grid; the update manifests as an update to the row/column information (and the update may fail):
`update :: Info -> Maybe Info`

However if the update fails, the entire grid becomes inconsistent so I want to propagate that inconsistency up.
Thus the problem is to create a function `liftInfoEdit :: (Int, Int) -> (Info -> Maybe Info) -> Grid -> Maybe Grid`
That's the outline, the corner pieces.

This is a pretty cool theme pervasive in Haskell: leverage existing machinery to raise a function editing a piece of information up through the layers of the structure containing that information.

The first thing to note is that the edit is identical and independent w.r.t. the row and column of the grid; therefore we split our edit `t` into two separate strands, apply it to each component, and finally join them back together:

`t :: (Int -> Info) -> Maybe (Int -> Info)
trans :: Grid -> Maybe Grid
trans = uncurry (liftA2 (,)) . (t *** t)`

Now we need to define `t`.
So what we could do is go down the following path:
Turn `(Int -> Info)` into `(Int -> Maybe Info)`. What we would like to do at this point is `(Int -> Maybe Info) -> Maybe (Int -> Info)`, however the tires simply squeal in the mud: we want a change to a small portion of the function to be reflected on the entirety of the function.

What we could do instead is split our function up and then glue it back together. Before we go further: one abstraction for a function `a -> b` is a set of `(a, b)` pairs.
Splitting a function up ~ splitting its domain into 'a' and the rest.
`-- pull the point a (with its value) out of the function
splitUpF :: a -> (a -> b) -> ((a, b), a -> b)
splitUpF a f = ((a, f a), f)

-- glue a (possibly changed) point back onto the function
mergeF :: Eq a => ((a, b), a -> b) -> a -> b
mergeF ((a, b), g) a' | a == a' = b
                      | otherwise = g a'`
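
As a quick demonstration (my names, not from the project), these two already let us lift a pointwise edit that may fail into an edit of the whole function:

`editAt :: Eq a => a -> (b -> Maybe b) -> (a -> b) -> Maybe (a -> b)
editAt a edit f = (\b' -> mergeF ((a, b'), g)) <$> edit b
    where ((_, b), g) = splitUpF a f`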


We now have all the pieces; it's time to put them together, time to do the dance: the jigsaw. (The code below is lifted from the actual project, so it references a few pieces not shown here: Line, Ind, Value, the iso iso1 and the rep lens.)

`setG (r, c) v = updateLine (Row r) c v >=> updateLine (Col c) r v

updateLine :: Line -> Ind -> Maybe Value -> Grid -> Maybe Grid
updateLine l j v grid = dimap isom1 (uncurry (liftA2 (curry isom2)))
                        (finalFunc *** Just)
                        $ grid
    where
      isom1 = view (from iso1)
      isom2 = view iso1
      (info, cond) = grid l
      finalFunc :: (Line -> Info) -> Maybe (Line -> Info)
      finalFunc = splitUpF l >>>
                  (g *** Just) >>>
                  uncurry (liftA2 (curry mergeF))
      g = second h >>> uncurry (fmap . (,))
      -- keep the updated info only if it is still consistent
      h info = if cond info'
               then Just info'
               else Nothing
          where
            info' = info & rep %~ f
            f :: Digs -> Digs
            f p i | i == j = v
                  | otherwise = p i`

Thursday 6 August 2015

rPrINwtF


I've been playing around with type classes of late. My type checker and I have had some epic fights. We're not speaking now; it's late and I don't want to get into another shouting match.
Seriously, type classes and type inference don't mesh but they unlock some cool stuff. 

I'm going to wrangle a mangled attempt at printf. Hopefully we come out with something like: printf "Msg1: {}; Msg2: {}" "Hello" "World" == "Msg1: Hello; Msg2: World"
We need to parse the format string and be able to move from hole to hole. For that we create a zipper class that will allow us to move from hole to hole in the string, as well as insert into a hole.

-- needs TypeFamilies (for Comp) and FlexibleInstances (for the tuple instance)
class Zip a where
  type Comp a
  next :: a -> a
  insert :: Comp a -> a -> a

instance Zip ([a], a, [a]) where
  type Comp ([a], a, [a]) = a
  next (l, a, []) = (l, a, [])
  next (l, a, r:rs) = (a:l, r, rs)
  insert x (l, a, r) = (l, x, a:r)
-- We insert before the cursor
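
A quick feel for the list-zipper instance (my example):

λ> insert 'X' ("ba", 'c', "de")
("ba",'X',"cde")
λ> next (insert 'X' ("ba", 'c', "de"))
("Xba",'c',"de")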

instance Zip a => Zip (r -> a) where
  type Comp (r -> a) = Comp a
  -- We push the next function inside the arrow
  next f r = next (f r)
  -- We push the insert function inside the arrow
  insert lx ra r = insert lx (ra r)

We need a parser that will parse the string into a zipper where every time we go 'next' we move to the next hole. After every segment we add an extra empty string standing for its hole; this helps with holes present at the end of the string.

parser :: String -> ([String], String, [String])
parser s = case l of
      [] -> ([], "", [])
      (x:x':xs) -> ([x], x', xs)
      [x] -> ([x], "", [])
    where
      -- a segment: the text up to (and consuming) the next "{...}" hole,
      -- or a trailing run of plain text
      seg = try (manyTill anyChar (string "{" *> manyTill anyChar (char '}'))) <|> many1 anyChar
      -- getRes runs the parser with a fallback (see the sketch below);
      -- every segment is followed by an empty string standing for its hole
      l = getRes (many seg) [] s >>= (:[""])
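
The post doesn't show getRes; a minimal sketch, assuming it runs a Parsec parser (Text.Parsec) and falls back to a default on failure:

getRes :: Parsec String () a -> a -> String -> a
getRes p def = either (const def) id . parse p ""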

As well as the parser's inverse:

s_ :: ([String], String, [String]) -> String
s_ (l, x, r) = concat $ reverse l ++ x:r

class Printf a where
  p_ :: String -> a

Base case:
instance Printf ([String], String, [String]) where
    p_ = parser

Inductive case:
instance (x ~ Comp a, Zip a, Printf a) => Printf (x -> a) where
  -- insert the value, then step past it, past the empty string denoting
  -- the hole, and past the following text segment
  p_ s x = next . next . next . insert x . p_ $ s

That allows us to do something like this: s_ (p_ "{}, {}, ||| {}" "Hello" "World" "!!!"), which should come out to "Hello, World, ||| !!!".
p_ now takes an arbitrary number of parameters and attempts to insert them into the holes in the format string; s_ then converts the result into the formatted string. It's a bit of a muddled up version of rPrINwtF, but at least I spelt it right.

Tuesday 4 August 2015

KMP 2: Haskell

In a previous post I wrote about the KMP algorithm and how it supposedly worked in Haskell. Further investigation reveals that the complexity is O(n^2). Through the next two posts I'll iteratively transform this O(n^2) version into the O(n) KMP algorithm by annotating the table we developed with various bits of information; I'll use this information to optimize the traversal of the table.

All the tender juicy bits reside in the table: the finite automaton driving the cursor over the substring to be matched, so let's create that first.

Let's create a simple machine that is constructed from the needle and consumes the haystack; it progresses when the character matches and gives up at the first failed match.

Thus the state of the machine at any point in time has one of three values: Finished with Failure, Finished Successfully, or In Progress But Undecided. We can use a Maybe Any as the three-valued type (Maybe's instances will come in handy; we use Any for its Monoid instance).

import Data.Machine.Mealy
import Data.Monoid
import Control.Applicative (liftA2)

type KMP a = Mealy a (Maybe Any)

simpleMachine :: Eq a => [a] -> KMP a
simpleMachine [] = pure (Just (Any True))
simpleMachine (a:as) = Mealy f
    where
      -- emit success on the character that completes the match
      f a' | a == a' = (Just (Any (null as)), simpleMachine as)
           | otherwise = (Nothing, pure Nothing)

KMP says, upon failure, we jump to the point with the next longest match. 
The machine above will start matching the haystack starting from the beginning of the haystack.
If we want to push back the point at which we start matching we need to delay the machine.

delay :: KMP a -> KMP a
delay k = Mealy $ \_ -> (Just (Any False), k)

delay (simpleMachine needle) would start matching from the second character of the haystack; delay (delay (simpleMachine needle)) would start from the third.

The list of machines starting from every possible point in any haystack:

ds :: [KMP a]
ds = g : map delay ds  -- g = simpleMachine needle, as in kmpMachine below

We need to run the machines in parallel; the state of the combined machine at any point should come from the leftmost one (in the above list) that has not yet failed, because the leftmost one will have run the longest, corresponding to the longest match.
The applicative instance of the Mealy machine allows us to run the machines in parallel. We gather up the internal states using an or (Any's Monoid instance).
table = foldl1 (liftA2 (<>)) ds

The problem with the machine above is that when the needle is not present in the haystack it does not terminate: we keep looking to the next simple machine in an infinite list of machines to check whether it has had any success. So we need to limit the number of machines we run in parallel to the size of the haystack.

table sizeOfHaystack = foldl1 (liftA2 (<>)) (take sizeOfHaystack ds)

Thus the machine
kmpMachine :: forall a. Eq a => Int -> [a] -> KMP a
kmpMachine sizeOfHaystack needle = foldl1 (liftA2 (<>)) (take sizeOfHaystack ds)
    where
      ds = g : map delay ds
      g :: KMP a
      g = simpleMachine needle
can be driven with
driveMachine :: [a] -> Bool -> KMP a -> Bool
driveMachine [] b _ = b
driveMachine (a:as) b k = uncurry (driveMachine as) . h . runMealy k $ a
    where
      -- collapse the Maybe Any output into a Bool
      h (x, y) = (getAny . maybe (Any False) id $ x, y)

solve :: Eq a => [a] -> [a] -> Bool
solve needle haystack = driveMachine haystack False (kmpMachine (length haystack) needle)
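
A couple of quick checks (my examples, assuming the pieces above):

-- λ> solve "aba" "xxabax"
-- True
-- λ> solve "aba" "abba"
-- False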

This implementation is the naive O(n^2) version: whenever we switch to a new machine we have to drive it all the way from the start.

In a subsequent post, we'll see how to transform this naive O(n^2) version into the O(n) version.
The way we optimize the algorithm is to optimize its pivotal point: the merging of two machines.

Monday 27 July 2015

Moore Machines

I've been gainfully employed for the past couple of months in a Silicon encrusted dream world. Food and comp sci fun and fun and food chase their respective tails. Hence this blog has gathered dust.
So to 'shake it off', I'm going to hop on ekmett's 'machines' and see where we get to.

Imagine you have a task that logs intermediate output.
Should the task fail midway it should restart and the output must be identical to the first.
The abstraction we will use for a task is a pure function.
Now, the task needs a random number.
So we provide it a random seed and a random number generator alongside its other parameters.

Now let's say that we need to perform multiple instances of this task in parallel. Each instance has a UID and we have no control over when the instances get dispatched.
We now have to provide each of the instances its own random seed. That involves generating and storing all of these seeds on the dispatcher side, and thus does not scale.
We need a better way of locally generating the random number on the side of the task.
So one of the problems is that we have to store the entire array of random numbers in memory on the side of the dispatcher. Is there some way to compress that array?
Yes of course, if we abstract away the array and take it as a sequence of values generated from some random number generator.
Thus all we have to do is pass to the task the random number generator, a random seed and the index in the sequence it needs. For the index, the task can use the UID mentioned earlier. The problem of space (storing all the random numbers simultaneously) is solved, as finding the nth random number in the sequence generated via a random number generator is constant in its space consumption.

Yet the problem of generating the nth random number is linear in the number of tasks which can prove to be intractable.
Let's see if we can improve the indexing operation above.
The first observation we make is: it should be impossible to get the nth element of a random number generator with anything less than the (n-1)th element, the seed, and the random number generator.
Thus if we want to be able to access the nth and the mth (n /= m) elements in the sequence with the same time complexity, we can't use the same generator; we need another generator.
So we split the space of indices into two parts and apportion different generators to each of them.
That doesn't give us any asymptotic improvement though, because for each part now we have n/2 possible indices.
But wait, what does having a different generator mean? From the point of view of their properties, we'd say two generators G1 and G2 are different if there is no relationship between the sequences of random numbers they produce. There is a relationship between two sequences of random numbers if, for some n and m, given n elements of one sequence and m elements of the other, we are able to predict either the (n+1)th or the (m+1)th element of the respective sequences.

Thus creating new generators is cheap because we can simply attach different seeds to the generators and they become different generators.
Using this we can create as many generators as we want and shrink the bucket size of the index space down to a constant. This means, however, that the number of generators is proportional to n, i.e. the number of random seeds we need to generate is proportional to n, and we are back at the problem we started with. But wait: the number of random values we have to generate has decreased, which means we can apply this logic recursively and get a tree whose leaves are the sequence of random numbers we need. Accessing a leaf now requires only log(n) time. Internal nodes contain random values that act as the seed to the random number generator that generates their children.
Say we create a binary tree; then the path from the root to the leaf representing the nth element of our sequence is an encoding of n as a binary number, with each bit saying whether we take the first or the second value generated by that node's random number generator.

Code follows:
import Data.Machine
import System.Random


genN' 0 _ = []
genN' i g = f (next g)
    where
      f (n, g') = n:genN' (i-1) g'

genN i s = genN' i (mkStdGen s)

findN1 n = last . genN n
     
machine g = unfoldMoore h g
    where
      h g = (i, f)
          where
            (i, g') = next g
            (i', _) = next g'
            -- per the description above: the first and second values drawn
            -- at this node seed the two subtrees
            f False = mkStdGen i
            f True = mkStdGen i'

driveMoore :: Moore a b -> [a] -> [b]
driveMoore m [] = []
driveMoore (Moore b f) (a:as) = b:driveMoore (f a) as

findN2 n s = last (driveMoore (machine randGen) bin)
    where
      randGen = mkStdGen s
      bin = toBin n
      toBin 0 = [False]
      toBin i = uncurry f (divMod i 2)
          where
            f i j = h j:toBin i
            h 0 = False
            h 1 = True
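
Quick usage (my example; the two schemes generate different sequences by construction, so the values won't agree):

main :: IO ()
main = do
  print (findN1 5 42)  -- linear scheme
  print (findN2 5 42)  -- tree scheme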

That gives us a log(n)-time indexing scheme. Is it possible to improve on it?
Let's try to provide a random-access scheme.
Fundamentally, the reason this has proven hard until now is that the elements of the sequence depend on each other in some way. If we break that assumption it becomes much easier.

Friday 27 February 2015

KMP: Haskell

KMP is quite a nice algorithm. I was re-implementing it in Haskell and of course I, as I continually am, was surprised to find a beautiful smile with a perfect set of teeth glinting behind a veneer of imperative plaque.

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TupleSections #-}

import Data.Functor.Foldable
import qualified Data.Map as M
import qualified Data.Vector as V
import qualified Data.Set as S
import Data.Monoid
import Data.List
import Data.Char
import Control.Monad
import Control.Applicative
import Control.Lens
import Control.Monad.State
import Control.Monad.Reader
import Control.Monad.Free
import System.Random
import Test.QuickCheck


{-
  So in implementation 1 what we are going to do is
  first build the table and then use the table to
  check if the needle is present within the haystack in O(n),
  where n is the size of the input string.
  The ith element of the table records the end of the longest
  proper prefix of the needle that matches a suffix ending at the
  ith character; on a mismatch it tells us where to jump,
  assuming we have successfully matched all characters so far.
-}

type KMP a = [(a, Int)]


impl1 :: forall a. Eq a => [a] -> [a] -> Bool
impl1 needle haystack = tryMatch 0 haystack
    where
      tryMatch j []
               | j == len = True
               | otherwise = False
      tryMatch j (x:xs)
               | j == len = True
               | x == needle !! j = tryMatch (j+1) xs
               -- nothing matched yet: advance the haystack
               | j == 0 = tryMatch 0 xs
               -- tbl !! (j-1) ends the longest border of what we have
               -- matched so far; retry the same character from there
               | otherwise = tryMatch ((tbl !! (j-1)) + 1) (x:xs)
      tbl = tblp needle
      len = length needle
{-
  The way one generally builds this table is to keep the current index i
  you are at in the table, and another index j
  such that [0..j] is the longest prefix matching a suffix ending at i-1.
-}

tblp [] = []
tblp needle = -1:unfoldr g (1, -1)
    where
      g :: (Int, Int) -> Maybe (Int, (Int, Int))
      g (i, j) | i == len = Nothing
               | needle !! i == needle !! (j+1) = Just (j+1, (i+1, j+1))
               | otherwise = Just (f j, (i+1, f j))
               where
                 -- fall back through successively shorter borders, trying to
                 -- extend each with the current character; note that the
                 -- whole table is recomputed at every step
                 f x | x < 0 = x
                     | needle !! i == needle !! (c + 1) = c + 1
                     | otherwise = f c
                     where c = tblp needle !! x
      len = length needle
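
A quick sanity check (my examples, assuming the fixes above):

-- λ> tblp "aabaa"
-- [-1,0,-1,0,1]
-- λ> impl1 "aba" "abba"
-- False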
                           


{-
  I'd say that was pretty horrible. All this low level inspection of data
  is quite grungy and frankly quite unfit for civilized society, I say.
  It calls for a more declarative approach.
  The problem with the above is the implementation of tblp.
  If we think about it, we would like to have something that doesn't
  actually commit to anything until it knows for sure that this is
  the right path.
-}


{-
  So instead we could declaratively construct our table.
  Assume we are handed a transition function that will take us to
  the right state if the next input character doesn't match.
  If we have exhausted the needle then we are done.
  If we have outstanding characters and the next character matches
  the character we are at, we move on, and the failure transition
  for the new state is the old failure transition advanced by the
  matched character; otherwise we follow the failure transition
  at this point.
-}


data Kmp a = Next { nxt :: (a -> Kmp a), done :: Bool}

impl2 :: Eq a => [a] -> [a] -> Bool
impl2 n h = match table h
    where 
      match k [] = done k
      match k (x:xs) = done k || match (nxt k x) xs
      table = tbl2 n (const table)

tbl2 :: forall a. Eq a => [a] -> (a -> Kmp a) -> Kmp a
tbl2 [] f = Next f True
tbl2 (x:xs) transition = Next g False
    where
      g a | x == a = tbl2 xs (nxt (transition x))
          | otherwise = transition a
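
Sanity checks (my examples):

-- λ> impl2 "aba" "ababa"
-- True
-- λ> impl2 "aba" "abba"
-- False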


naive needle haystack = any ((== needle) . take len) suffixes
    where 
      suffixes = tails haystack
      len = length needle

prop1 :: String -> String -> Bool
prop1 n h = impl2 n h == naive n h

test = quickCheckWith myArgs prop1

myArgs = stdArgs { maxSuccess = 750, maxSize = 750 }

Saturday 31 January 2015

Life by Comonads: 3

Conway's Game of Life is a cellular automaton: an environment with a synchronous update scheme. The status of a cell in the next time-step is decided by its current status and its environment. This makes it a perfect candidate for some comonadic power play. The rules are quite simple and you can find them online.
A cell's status is simply that of dead or alive, for which we will use booleans.
Deciding whether a cell is alive or dead in the next iteration will be a function of the form
decide :: Conway -> Bool 
In order to view the environment we need to be able to step up and down, in addition to the left and right we wrote previously.
This is where duplicate of the list zipper comes into play: a duplicated zipper will represent the environment (LZip and its comonad instance are from the earlier posts in this series).
type Conway = LZip (LZip Bool)
shiftU :: LZip (LZip Bool) -> LZip (LZip Bool)
shiftU = fmap shiftR
shiftD :: LZip (LZip Bool) -> LZip (LZip Bool)
shiftD = fmap shiftL

We can finally decide whether or not a cell stays alive for the next iteration by viewing the environment: we obtain a list of the statuses of the surrounding cells in 'doa', count the ones alive, and finally use that count to determine the cell's status according to the rules.

decide :: Conway -> Bool 
decide z | count < 2 = False 
         | count == 3 = True 
         | alive && count == 2 = True 
         | otherwise = False 
         where 
           alive = extract . extract $ z 
           count = if alive then count' - 1 else count'
           count' = foldl (\s b -> if b then (s+1) else s) 0 doa 
           doa = g <$> [shiftL, shiftR, id] <*> [shiftL, shiftR, id] 
           g s s' = case s' z of 
                        LZip _ z' _ -> extract (s z') 

The last piece is the decideColumn function. In order to extend the current state to get the new one we need a function that decides a column of the board for the given focus. decideColumn must traverse the board vertically in order to decide the status of each of the cells in the column.

decideColumn :: Conway -> LZip Bool
decideColumn v = LZip l c r
  where 
    c = decide v
    l = fmap decide (tail $ iterate shiftD v)
    r = fmap decide (tail $ iterate shiftU v)

Finally our step function simply extends the current state using decideColumn.
step :: Conway -> Conway
step = extend decideColumn

We can then step the state as many times as we want by iteratively extending it with the step function.
The board at the nth time step would be (step ^ n) initial, using the redefinition of (^) below. We can pretty print the board in a certain neighborhood:
visual :: Bool -> Char
visual c = if c then 'o' else ' '
pretty :: Int -> Conway -> [String]
pretty n c = (map . map) visual b
    where
      b :: [[Bool]]
      b = crop2D n c
pp :: Int -> Conway -> IO ()
pp i c = mapM_ putStrLn (pretty i c)
-- this (^) shadows the Prelude's; hide the Prelude version when compiling
_ ^ 0 = id
f ^ n = f . (f ^ (n-1))
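
crop2D isn't defined in this post (presumably it lives in an earlier part); a minimal sketch of what it might look like, taking n cells on either side of the focus in both dimensions:

crop1D :: Int -> LZip a -> [a]
crop1D n (LZip l c r) = reverse (take n l) ++ c : take n r

crop2D :: Int -> Conway -> [[Bool]]
crop2D n = map (crop1D n) . crop1D n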
And voila, that's Conway's Game of Life modelled beautifully using comonads.
A sample starting state - The Blinker - is presented below.

initial :: Conway
initial = LZip (repeat allFalses) blinker (repeat allFalses)
    where 
      allFalses = LZip (repeat False) False (repeat False)
      blinker = LZip (True:repeat False) True (True:repeat False)
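
To watch the blinker oscillate (my example, using pp and the crop2D sketch above):

main :: IO ()
main = mapM_ (\i -> pp 2 ((step ^ i) initial) >> putStrLn "----") [0..3]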