Clojure Gazette 184: The Onslaught of States

The Onslaught of States


Issue 184 - August 01, 2016

Hi Clojurers,

Computers have given us the ability to encode information in a very elegant and reductive way. We use the smallest unit that can carry information --- the bit --- and by turning it on or off, we have two possible states. Putting multiple bits in an order gives us a way to multiply their states. One bit is two states. Two bits have four states. Four bits have sixteen states. And thirty-two bits have 4,294,967,296 states. The number of possible states grows exponentially with the number of bits. b bits have 2b possible states.

How many bits of RAM does your computer have? Mine has 8 gigabytes, which is 64 billion bits. How many possible states is that? My calculator says: "Error". That means it's a lot :) If you add one more bit, it doubles. Joe Armstrong gave a great talk about the difficulties of dealing with such a huge amount of state.

And so we find ourselves with a problem: we've got this awesome power of exponentially growing possible states. The vast majority of those states are worthless, just like most random sequences of letters are not meaningful sentences --- let alone true sentences or answers to your questions. Our job as programmers is to keep this exponential monster load of bits set correctly. If we can't set it to the exact right answer then at least to a meaningful answer.

We've got to tame that exponentiation somehow. Luckily, exponents have an inverse operation called logarithm. If we can use logarithms, we can "undo" the exponentiation and get some control over it. There is a mathematical structure that can come to the rescue: the tree. If you add n nodes to a tree, the depth of the tree grows at log n. If you store data in the nodes, you can access it logarithmically by walking down the tree. Hence, the concept of the "trie" was invented in 1959 by René de la Briandais and later named by Edward Fredkin . Tries get my vote for the third most important idea in computer science.

"Trie" stands for reTRIEval. Tries are a way to store data in a tree structure with an index. The index determines the path into the tree that you need to follow to find your data. They give us a way to manage large amounts of memory efficiently. Flat arrays are natural to work with in memory. They're just contiguous blocks. It's easy and fast to access any element by offset. You can say "give me the 1001th element" and get it right away.

But offsets are rarely the key that you want to look something up by. You want to say "Find me the record for Jane Smith." Her name tells you the path to take. Go to the "S" drawer, skip to the "M" section, then riffle through the small number of records in that section until you find Smith, Jane. Of the thousands of records, you can get to the record you need quickly. This is the power of tries.

Clojure's famous persistent collections use tries for their implementations. A Clojure vector is a trie. Vectors are indexed by a signed 32-bit integer, so they can store 231 elements (one bit is reserved for the sign +/-). Because the branching factor of the trie is 32, that means accessing any element at worst requires log32 231 = 6.2 steps in the path. Because you can't have a partial step, the worst case is 7 steps. Clojure's hash maps use the hash code of the key --- 32 bits --- to follow a path down the branches of the tree. You've got the same maximum number of steps: 7.

And then what do we do when we get Jane's record? We look up the page we want. Then we look up the section and field we want. We can use paths all the way up and all the way down. And so we find ourselves building trees out of our tries. We use get-in to walk paths. And that way, we can easily manage even more data.

In Clojure you can look up any of 4 billion values with 7 memory fetches. Tries can theoretically get bigger, too, with bigger indexes (64-bit) and higher branching factors. Trees (and tries) let us manage huge amounts of data for little cost. They give us a tool to deal with the exponentiation of the state space. And they're baked into Clojure. Two of the three biggest ideas in computer science at the heart of Clojure (Universal Turing Machine and Tries). That's not bad. But what if it had the third (building reliable systems from unreliable parts)? Well, a person can dream :)

Rock on!

PS Want to get this in your email? Subscribe !
PPS Want to advertise to smart and talented Clojure devs ?