Friday, 27 February 2015

KMP: Haskell

KMP is quite a nice algorithm. I was re-implementing it in haskell and of course I, as I continually am, was surprised to find a beautiful smile with a perfect set of teeth glinting behind a veneer of imperative plaque.

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TupleSections #-}

import Data.Functor.Foldable
import qualified Data.Map as M
import qualified Data.Vector as V
import qualified Data.Set as S
import Data.Monoid
import Data.List
import Data.Char
import Control.Monad
import Control.Applicative
import Control.Lens
import Control.Monad.State
import Control.Monad.Reader
import Control.Monad.Free
import System.Random
import Test.QuickCheck

  So in implementation 1 what we are going to do is 
  first build the table and then use the table to 
  check if the needle is present within the haystack in O(n).
  where n is the size of the input string. 
  So the ith element of the table will tell you which position of 
  the table to jump to should the ith character of the needle
  fail to match the tracked character of the haystack assuming that
  you have successfully matched all characters upto that ith character

type KMP a = [(a, Int)]

impl1 :: forall a. Eq a => [a] -> [a] -> Bool
impl1 needle haystack = tryMatch 0 haystack
      tryMatch j []
               | j == len = True
               | otherwise = False
      tryMatch j (x:xs)
               | j < 0 = tryMatch 0 xs
               | j == len = True
               | x == needle !! j = tryMatch (j+1) xs
               | otherwise = tryMatch ((tbl !! j) + 1) (x:xs)
      tbl = tblp needle
      len = length needle
  The way one generally builds this table is to keep the current index i
  you are at in the table  and another index j 
  such that [0..j] is the longest prefix matching a suffix ending at i-1

tblp [] = []
tblp needle = -1:unfoldr g (1, -1)
      g :: (Int, Int) -> Maybe (Int, (Int, Int))
      g (i, j) | i == len = Nothing
               | needle !! i == needle !! (j+1) = Just (j+1, (i+1, j+1))
               | otherwise = Just (f j, (i+1, f j))
                 f x = if x >= 0 
                        then if (needle !! i) == needle !! x 
                             then x
                             else f (tblp needle !! x)
                        else x
      len = length needle

  I'd say that was pretty horrible. All this low level inspection of data is quite grungy and frankly quite unfit for civilzed society I say. It calls for a more declarative approach. 
  The problems with the above is the implementation of tblp. 
  If we think about it, we would like to have something that doesn't actually 
  commit to anything until it knows for sure that this is the right path. 

  So instead we could declaratively construct our table.
  Assume we have a transition function that will take us to
  the right state if the input doesn't match to
  and to the same state if it does.
  If we have exhausted our input then we are done
  If we have outstanding characters and the next character
  matches the character we are at then we get the state we 
  were a
  the transition function at this point

data Kmp a = Next { nxt :: (a -> Kmp a), done :: Bool}

impl2 :: Eq a => [a] -> [a] -> Bool
impl2 n h = match table h
      match k [] = done k
      match k (x:xs) = done k || match (nxt k x) xs
      table = tbl2 n (const table)

tbl2 :: forall a. Eq a => [a] -> (a -> Kmp a) -> Kmp a
tbl2 [] f = Next f True
tbl2 (x:xs) transition = Next g False

      g a | x == a = tbl2 xs (nxt (transition x))
          | otherwise = transition x

naive needle haystack = any ((== needle) . take len) suffixes
      suffixes = tails haystack
      len = length needle

prop1 n h = kmp n h == naive n h

test = quickCheckWith myArgs

myArgs = stdArgs { maxSuccess = 750, maxSize = 750 }

No comments:

Post a Comment