Monday, 27 July 2015

Moore Machines

I've been gainfully employed for the past couple of months in a Silicon encrusted dream world. Food and comp sci fun and fun and food chase their respective tails. Hence this blog has gathered dust.
So to 'shake it off', I'm going to hop on ekmetts 'machines ' and see where we get to.

Imagine you have a task that logs intermediate output.
Should the task fail midway it should restart and the output must be identical to the first.
The abstraction we will use for a task is a pure function.
Now, the task needs a random number.
So we provide it a random seed alongside some random number generator alongside its other parameters.

Now let's say that we need to perform multiple instances of this task in parallel. Each instance has a  UID and we have no control over when the instances get dispatched.
We now have to provide each of the instances it's own random seed. That involves generating and storing all of these seeds on the dispatcher side and thus does not scale.
We need a better of locally generating the random number on the side of the task.
So one of the problems is that we have to store the entire array of random numbers in memory on the side of the dispatcher. Is there some way to compress that array?
Yes of course, if we abstract away the array and take it as a sequence of values generated from some random number generator.
Thus all we have to do is pass to the task the random number generator, a random seed and the index in the sequence it needs. For the index that task can use the UID mentioned earlier. The problem of space (storing all the random numbers simultaneously) is solved as finding the nth random number in the sequence generated via a random number generator is constant in its space consumption.

Yet the problem of generating the nth random number is linear in the number of tasks which can prove to be intractable.
Let's see if we can improve the indexing operation above.
The first observation we make is: it should be impossible to get the nth element of a random number generator with anything less than the (n-1)th element, the seed,  and the random number generator.
Thus if we want to be able to access some nth and the mth (n /= m) element in the sequence with the same time complexity, we can't use the same generator. Thus we need another generator.
So we split the space of indices into two parts and apportion different generators to each of them.
That doesn't give us any asymptotic improvement though, because for each part now we have n/2 possible indices.
But wait, what does having a different generator mean. From the point of views of their properties, we'd say two generators G1 and G2 are different if there is no relationship between the sequence of random numbers they produce. There is a relationship between two sequences of random numbers if for any n and m, given n elements of one sequence and given m elements of the other we are able to predict either the n+1 or the m+1 elements of the respective sequences.

Thus creating new generators is cheap because we can simply attach different seeds to the generators and they become different generators.
Using this we can create as many generators as we want and minimize the bucket size of the index space to a constant value. This means however that the number of generators is proportional to n i.e. the number of random seeds we need to generate is proportional to n. Thus we are at the same problem we were at before, but wait, the number of random values we have to generate has been decreased, which means we can recursively apply this logic and get a tree where the leaves of the tree  is the sequence of random numbers we need. Now accessing a leaf simply requires log(n) time. Internal nodes contain random values that are the seed to the random number generator that generates it's leaves.
Say we create a binary tree, then the path from the root to the leaf that represents the nth element in our sequence consists of an encoding of n as a binary number and each  bit representing a whether we take the first or the second value generated by the random number generator.

Code follows:
import Data.Machine
import System.Random


genN' 0 _ = []
genN' i g = f (next g)
    where
      f (n, g') = n:genN' (i-1) g'

genN i s = genN' i (mkStdGen s)

findN1 n = last . genN n
     
machine g = unfoldMoore h g
    where
      h g = (i, f)
          where
            (i, g') = next g
            f False = g'
            f True = snd . next $ g

driveMoore :: Moore a b -> [a] -> [b]
driveMoore m [] = []
driveMoore (Moore b f) (a:as) = b:driveMoore (f a) as

findN2 n s = last (driveMoore (machine randGen) bin)
    where
      randGen = mkStdGen s
      bin = toBin n
      toBin 0 = [False]
      toBin i = uncurry f (divMod i 2)
          where
            f i j = h j:toBin i
            h 0 = False
            h 1 = True

That gives us a log(n) time indexing scheme. Is it possible to improve  it?
Let's try and provide a random access scheme.
Fundamentally, the reason it has proven hard until now was that the elements of the sequence were dependent on each other in some way. If we break that assumption it becomes much easier.