A Small Matter of Programming

Persistent Vectors and Hash Maps for Haskell

January 03, 2011

One of the prominent features of Clojure are a core set of immutable data structures with efficient manipulation operations. Two of the most innovative and important are the persistent vector and persistent hash map.

As a little project I set myself in order to get to know Haskell better, I have been porting these structures to Haskell. I think it's now at a state where the basics are there and usable, so I've put it up on my Github. The API provides Data.PVector (the persistent vector) and Data.PHashMap (the persistent hash map). The interface for both has been kept as consistent as possible with Data.Map.

Basic usage

These two data structures are:

  • Immutable. Unlike Data.Array.ST and Data.HashTable, there are no monads in sight.
  • Persistent. They provide "update" operations which do not destroy the original structure.
  • Efficient. Unlike Data.Array, updating a PVector or PHashMap doesn't copy the entire structure, but only the changed path.

Here's a demo of what you can do with a PVector:

ghci> :m + Data.PVector
ghci> empty -- the empty pvector
fromList []
ghci> append 1 it
fromList [1]
ghci> append 42 it
fromList [1,42]
ghci> append 13 it
fromList [1,42,13]
ghci> let a = it
ghci> a ! 0 -- indexes are from 0 to n-1
1
ghci> a ! 1
42
ghci> a ! 2
13
ghci> set 1 71 a -- a new PVector with the element replaced
fromList [1,71,13]
ghci> adjust succ 2 a -- apply a function to a single element
fromList [1,42,14]
ghci> Data.PVector.map succ a -- apply a function to all elements
fromList [2,43,14]
ghci> fromList [1..10] -- convert a list to a PVector
fromList [1,2,3,4,5,6,7,8,9,10]
ghci> elems it -- convert a PVector to a list
[1,2,3,4,5,6,7,8,9,10]

And here's a demo of the basic functionality of PHashMap:

ghci> :m + Data.PHashMap
ghci> empty Data.HashTable.hashString
-- an empty PHashMap (requires a key hash function)
fromList hashFn []
ghci> insert "foo" 1 it
fromList hashFn [("foo",1)]
ghci> insert "bar" 42 it
fromList hashFn [("foo",1),("bar",42)]
ghci> insert "qux" 123 it
fromList hashFn [("qux",12),("foo",1),("bar",42)]
ghci> insert "qux" 13 it -- inserting an existing key overwrites by
default
fromList hashFn [("qux",13),("foo",1),("bar",42)]
ghci> let a = it
ghci> a ! "foo"
1
ghci> a ! "baz" -- using (!) is unsafe
*** Exception: array index out of range: element not in the map
ghci> Data.PHashMap.lookup "bar" a
Just 42
ghci> Data.PHashMap.lookup "baz" a -- 'lookup' returns a safe Maybe
Nothing
ghci> adjust succ "foo" a -- apply a function to a value
fromList hashFn [("qux",13),("foo",2),("bar",42)]
ghci> Data.PHashMap.map succ a -- apply a function to all values
fromList hashFn [("qux",14),("foo",2),("bar",43)]
ghci> keys a
["qux","foo","bar"]
ghci> elems a
[13,1,42]
ghci> fromList Data.HashTable.hashString [("a", 1), ("b", 2), ("c", 3)]
fromList hashFn [("b",2),("c",3),("a",1)]
ghci> toList it
[("b",2),("c",3),("a",1)]

Installation

To try it yourself, just clone it from my Github repo, then configure, build and install it:

runhaskell Setup.hs configure --user
runhaskell Setup.hs build
runhaskell Setup.hs install

Performance

The single-element operations for each of these structures technically run in logarithmic time. However, they implemented as 32-ary trees which never exceed a depth of 7 nodes, so you can treat them as constant-time operations (for relatively large constants).

How it works

I wrote this code after reading the following explanatory blog posts on how the structures work in Clojure. I haven't veered too far from Clojure's implementation, so these posts should also provide a decent birds-eye overview of my Haskell implementation.

To do

Things that I haven't gotten round to but would be good to have are:

  • Match Data.Map in completeness
  • Performance tuning

    • More strictness
    • A more efficient fromList (it currently constructs lots of intermediary structures
  • Make a PVector-based implementation of IArray (?)
  • Unit tests
  • Benchmarks

This is my first proper project in Haskell, so I'm sure there are many problems with it. If you have any suggestions, I'd love to hear your comments. And Github pull requests would be very welcome!