Escaping the maze of data structures

Eric Normand's Newsletter
Software design, functional programming, and software engineering practices
Over 5,000 subscribers

Summary: We start with a Haskell type that models a feature very well. As the feature changes, the data model eventually evolves into something like Clojure's hash maps. Discussion follows.

I've seen it at many places I've worked. As the software grows and the team grows, it becomes difficult to work with the data structures. It comes from several places.

Firstly, there is a set of data structures that are used in many places in the code. These data structures usually represent concepts that are core to the domain. For example, if you're writing a task management app, the data structure representing the task is likely used in many namespaces. Entity pattern.

Are the namespaces wrong?

The task data structure is likely a map (with named keys) or a vector (with positional values). It might be nested. As more and more features are built that use this task data structure, the data structure contains more and more information that is needed by those features. And more code comes to rely on the structure of that data, often through nested access like get-in and update-in. Each get-in hard codes a path into the nested data. The data's structure becomes known only diffusely (spread across many namespaces) and rigidly (any change breaks something). It's not just get-in. We also do destructuring.

Another thing also happens, though ideally it wouldn't. Sometimes, we add a key to the data structure, pass it through our pipeline, and don't see the key at the other end. What happened? Ideally, each function in our data transformation pipeline takes a data structure as its argument and returns a modified copy of that data structure. You achieve that easily in Clojure, like this:

(defn change-name [person new-name]
  (assoc person :name new-name))

The built-in data structure operations do this automatically. change-name returns whatever you pass it, but with a different name. That means that you can add keys to person and this piece of code doesn't have to know about them. It's a form of decoupling. All change-name knows is that person is a map and :name is where you put the name. It does know something, though. As you write code to access and update the data across multiple namespaces, the data structure will resist change.

This is the recommended way to write this function because it is relatively decoupled. But we could write it another way:

(defn change-name [{:keys [email username]} new-name]
  {:email email :username username :name new-name})

While this looks like it's doing the same thing, it's not. This function implicitly assumes that all keys are known. And keys are often entirely known at the beginning of a project. You and your few teammates can keep it in your heads. But as the code grows, it becomes harder and harder.

What's worse is that you could actually look at the person data you're being passed and see what the keys are at the REPL. So you do know, at the time of writing this function, what you expect in there. You get it working--and it looks like it's working fine. But

These symptoms are very common in Clojure codebases of a certain size. I'll summarize:

  • Structural rigidity: The data structure cannot change because knowledge about that data structure is spread around the code.

    Example: (get-in person [:contact-info :address :zip]) -- a deep path into the data hard-coded

  • Disappearing keys: New keys are lost somewhere along the pipeline, meaning you cannot add to the data structure.

  • Repeated logic: Logic for dealing with the data structure, such as common ways of handling missing values or defaults, are spread around the code.

  • Difficulty understanding: The structure of the data at a given point in the code, such as from a function argument, is unknown. People use tactics like printing the arguments to see what keys are available.

  • Brittle code: Code stops working when you change the data structure even slightly.

I want to say that this is part of the cost of maintenance of Clojure codebases. Clojure gives us a lot of flexibility and concision. That flexibility is great at first, when our codebase is small and our team is small enough to communicate. It allows us to move very quickly. But as it grows, we need to deal with the costs of this flexibility.

There are some common attempts at solutions to this. These could be the focus of multiple essays. But here's my short attempt at it:

Better contract enforcement

People use spec or malli to define the inputs and outputs of each function. This partially solves the problem (it's easier to know what each function can assume is passed to it), and it imposes other costs.

It's a partial solution because it addresses the difficulty of understanding the structure of the data at any point, but it does not address the other symptoms. It's treating the problem as a contract enforcement problem, but it can still lead to brittle code, disappearing keys, repeated logic, and structural rigidity.

And it imposes new costs: Schemas must be maintained. They're just more code to deal with.

Better tools for nested data

Another tool that is possible is to use something that makes working with nested data more concise and less error-prone. For example, you could use Specter. Specter makes working with nested data much easier. The reason this is a partial solution is that it makes manipulation of nested data very concise. Your repeated logic is reduced. You have less code. You're still specifying a key path to a particular value, but the logic is so concise that you can easily change it.

That's the idea at least. I don't have much experience with Specter. I just wanted to mention it for completeness.

Simplification

Another solution that I don't think we give enough credit to is simply to simplify your code. Simplifying your code is not easy. It usually involves finding more general solutions to your problems. These general solutions need to assume less about your data structures. So this solves a lot of the problem.

I sometimes feel like this is really the approach the Core Clojure team advocates. If you're dealing with these symptoms, the problem is that your data and codebase are too complicated. You should look for ways to find more general solutions. This is typically expressed as a need to "work at the right level of abstraction."

Your "types" (the structure of the data) should be smaller. This "smallness" of types is actually a complex topic that needs to be unpacked. For instance, the type of a linked list is very small:

type List a = Empty | Cons a (List a)

What this says is that a List of elements of type a is either empty or it's a Cons of an a and another list. The type definition is small, but it can hold infinite data of type a.

When we look at the type of a Clojure map, it basically has two methods:

(assoc m key value) ;; associate key/value in m
(find m key) ;; get the key/val pair for key

(It has others, but this is the core of it: Store key/value pairs, then get the value given the key.) Here are the others:

(seq m) ;; get key/val pairs
(dissoc m k) ;; remove key from m

The others (vals, keys, update, etc.) can be derived from these.

This is a very small type. But we tend to make data structures that are much bigger than this. Many of the keys are specified. The values that are in the map have a lot of constraints on them. There's nested data. If you were to write it all down, it would be very long. It would not be as short as List or the hashmap contract.

Instead of thinking "can I make something that assumes less about the data structure?", we should be asking "can the problem be solved with something with a smaller description?" The size of the description is the actual problem. It's born by the spec or malli schema we would need to write. Or it's born by the assertions spread around the code.

This solution is hard. I wish there were a good way to know if this is always possible. I suspect it isn't.

Better data abstractions

Encapsulation

Another approach is to centralize the knowledge of the data structure. Instead of calling update-in on a long literal path, we name the function that does that update. For example, instead of:

(update-in person [:contact-details :address :zip] str/trim) ;; trim whitespace off of the zip

We instead name this operation and add it to the person namespace:

(defn trim-zip [person])

Then we disallow direct access to the data structure outside of the person namespace. If we can pull it off, it solves a lot of the problems:

  • The structural rigidity disappears because we have a small, centralized set of code that can manipulate the data structure. If we need to modify the data, we can just change the code in that one namespace.

  • If keys are disappearing, you know exactly what code you need to audit.

  • You can refactor repeated logic into its own functions within the person namespace.

  • You no longer need to know what data you have. Just know that it's a person and you can access it with the public functions of person.

  • You won't have brittle code because all access goes through functions you control.

The major challenge of this approach is that when we have a lot of existing code, we don't even know all of the places that expect a person data structure. How do we replace calls to assoc and get with calls specific to the person domain operations? How do we do so incrementally?

I think I have a solution. Here it is.

An generic encapsulation mechanism

The main strategy is to encapsulate the data structure (representing an entity) with a wrapper. The wrapper acts like a data structure by delegating any operation down to the wrapped value. One namespace at a time, you turn off the delegation, breaking that namespace. Then you fix the problems in that namespace. You use a linter to make sure that only one namespace can access the data in the wrapper.

Let's define a new type that represents an encapsulated piece of data with a name for the kind of entity it represents.

(deftype Wrapper [wrapper-name inner-value])

;; constructor
(defn wrap [name value]
  (Wrapper. name value))

;; accessor
(defn unwrap [wrapper name]
  (if (= (.wrapper-name wrapper) name)
    (.inner-value wrapper)
    (throw (ex-info "Wrong name." {}))))

;; mutation
(defn mutate [wrapper name f & args]
  (wrap name (apply f (unwrap wrapper name) args)))

We would use it like this. Before we might have this data structure:

{:id "32123"
 :username "lamedude89"
 :name "John Montal"
 :address "321 Main St."}

We instead use a constructor to wrap it, defined in the person namespace:

(defn ->Person [id username name address]
  (wrap ::person {:id id :username username :name name :address address}))

Modules about features; share common data structures

Sean Allen
Sean Allen
Your friendly reminder that if you aren't reading Eric's newsletter, you are missing out…
👍 ❤️
Nicolas Hery
Nicolas Hery
Lots of great content in the latest newsletter! Really glad I subscribed. Thanks, Eric, for your work.
👍 ❤️
Mathieu Gagnon
Mathieu Gagnon
Eric's newsletter is so simply great. Love it!
👍 ❤️