Tuesday, 4 August 2015

KMP 2: Haskell

In a previous post I wrote about the KMP algorithm and how it supposedly worked for Haskell. Further investigation reveals that the complexity is O(n^2). Through the next two posts I'll iteratively transform this O(n^2) version into the O(n) KMP algorithm by annotating the table we developed with various bits of information; I'll use this information to optimize the traversal of the table.

All the tender juicy bits reside in the table: the finite automata driving the cursor over the substring to be matched, so lets create that first.

Lets create a simple machine that is constructed from the needle and consumes the haystack; it progresses if the character matches and gives up when the first failed match.

Thus the state of the machine at any point of time could have one of three values: Finished with Failure, Finished Successfully, In Progress But Undecided. We can use a Maybe Any, for a 3 valued type (Maybe's instances will come in handy; we use Any for its Monoid instance)

type KMP a = Mealy a (Maybe Any)

simpleMachine :: Eq a => [a] -> KMP a
simpleMachine [] = pure (Just (Any True))
simpleMachine (a:as) = Mealy f
    where
      f a' | a == a' = (Just (Any False), simpleMachine as)
           | otherwise = (Nothing, pure Nothing)

KMP says, upon failure, we jump to the point with the next longest match. 
The machine above will start matching the haystack starting from the beginning of the haystack.
If we want to push back the point at which we start matching we need to delay the machine.

delay :: KMP a -> KMP a
delay k = Mealy $ \a -> (Just (Any False), k)

delay simpleMachine would start matching from the second character of the haystack.
delay (delay simpleMachine) would start matching from the third character of the haystack.

The list of machines starting from every possible point in any haystack.
ds :: [KMP a] 
ds = g : map delay ds

We need to run the machines in parallel and the state of the machine at any point should be the leftmost one (in the above list) that has not yet failed (because the leftmost one will have run the longest corresponding to the longest match).
The applicative instance of the Mealy machine allows us to run the machines in parallel. We gather up the internal states using an or.
table = foldl1 (liftA2 (<>)) ds

The problem with the machine above is, when the needle is not present in the haystack,  it does not terminate because we keep looking to the next simple machine in a list of infinite machines to check if it has had any success. So we need to limit the number of machines we run in parallel: i.e. the size of the haystack.

table sizeOfHaystack = foldl1 ((liftA2 . liftA2) (||)) (take sizeOfHaystack ds)

Thus the machine
kmpMachine :: forall a. Eq a => Int -> [a] -> KMP a
kmpMachine sizeOfHaystack needle = foldl1 (liftA2 (<>)) (take sizeOfHaystack ds)
    where
      ds = g : map delay ds
      g :: KMP a
      g = simpleMachine needle
can be driven with
driveMachine :: [a] -> Bool -> KMP a -> Bool
driveMachine [] b k = b
driveMachine (a:as) b k = uncurry (driveMachine as) . h . next k $ a
    where 
      h (x, y) = (getAny . maybe (Any False) id $ x, y)

solve :: Eq a => [a] -> [a] -> Bool
solve needle haystack = driveMachine haystack False (kmpMachine (length haystack) needle)

This implementation is the naive O(n^2) version: whenever we switch to a new machine we have to drive it all the way from the start.

In a subsequent post, we'll see how to transform this naive O(n^2) version into the O(n) version.
The way we optimize the algorithm is to optimize its pivotal point: the merging of two machines.

No comments:

Post a Comment