What are product and sum types?

This is an episode of Thoughts on Functional Programming, a podcast by Eric Normand.

Subscribe: RSSApple PodcastsGoogle PlayOvercast

Product and sum types are collectively known as 'algebraic data types'. These are two ways of putting types together to make bigger types. Product types multiply their states, while sum types add them. With these two 'operations', we can precisely target a desired number of states. In this episode, we explore how we can eliminate corner cases with algebraic data types.

Transcript

Eric Normand: What are product and some types? By the end of this episode, you'll know how to use these two concepts to create a very well-fitting model for your domain. My name is Eric Normand, and I help people thrive with functional programming.

Data modeling is very important in functional programming. Product and some types are a way of thinking about how our data is structured. These things, product and some types, together, they make up what's called algebraic data types. That's just a term that people use all the time.

I'm going to give a little scenario. Imagine we have some process-workflow engine where we give people a to-do list, a checklist of things to do. When they do it, then someone else has to come by and approve it, that they did it correctly. Then, just for safety, extra assurance, we have a second person check, too.

When they check it, they either say it's approved, or disapproved or rejected, something like that. There's a positive and a negative. Also we have to remember that before they approve it or reject it, we have to record that they haven't yet.

We actually have three cases. We have they haven't checked it yet, they have approved it, or they have rejected it. Then we have two of those because we want two checks.

If you look at all the possible states, we have three for the first one and three for the second one. Then there's all the combinations of the two. The first one approves, the second one rejects. The first one approves, the second one approves. There's all these combinations. How do you figure that out? It's three times three, so there's nine.

Then when you think about it, these people, they're doing the checks one at a time. There's the first person's check and then the second person's check. You might have all these rules in your system like if the first person rejects it, we don't need the second check. We're just going to tell them to do it again.

You need two approvals. If you have a rejection from either one, then you don't need the second rejection, if you have the first rejection. What does it mean to have a rejection and then an approval because it's not even possible to get in your system?

What does it also mean for the first one to be not checked yet, but the second one is already checked because if they're going in order, that is not a valid state either.

What's happened is we've created this system where we've got nine possible states. It turns out that several of them are not valid. They don't make sense in our system. What does it mean for the second approver to check before the first approver has checked? What does that even mean?

Because if they come first, they should go in the first slot, not the second slot. Then, what does it mean for someone to approve after it's been rejected? That also doesn't make sense.

If you do the math out, there's quite a few states in here that don't make sense. Now, the math we've done — let's just go back to that — is we've had three possible states and we multiplied it by three possible states. We get nine possible total states.

We multiply. That's where the product comes from in product types. When you have two separate fields and each one has certain number of options, to figure out the total number of options for the two, you multiply them together. If you had a third one, you multiply that one together. You had a fourth field, you multiply those together. Now, that's the product, because product is multiplication.

Where does the sum come from? Remember, sum is plus. The plus is that you have one unchecked, plus one approved, plus one rejected. As an example, you could have just a state, a status of approved and rejected, and then that's a type.

That's like an enum. That's two states, and then to add a third one, you can make that a maybe. You're adding the none case, the nothing case, by wrapping it up in a maybe or an option. That's the sum type because you're able to add one, you're not multiplying them.

The classic — what do I call it? — essential product type is the tuple. The tuple is just a bunch of values. We have a pair. That's a two tuple. You can have a triple, that's a three tuple. The tuple is a product type, meaning you multiply all the possible states together to figure out all the states that your tuple is in.

The typical sum type is something like an either. Either they're approved or they're rejected. It's an enum.

The exercise that we have to do as data modelers is take the system that we've created, which we've created poorly because there are states that are invalid in there, that don't make any sense.

The thing is if you're using a type system, the type system is going to say this is a valid state because that's what you described. A tuple of two of these maybe statuses can be in any one of nine states, even though maybe three or four of them don't make sense.

You're not getting help from the type system to avoid those. How do you avoid them? You're going to have to have conditionals in there to make sure that they don't happen. What kind of conditionals?

You're going to say, do we have some approval rating for the first one? Because if so, then we'll use the second one for this check that we're doing now. If not, we'll use the first one. It's a conditional.

The conditional is there because you've got invalid states. You're adding complexity by having these invalid states. What we want to do is use the sum and product types to rework this so that it doesn't have any invalid states.

Just to get a little bit deeper into the problem that we had, like why did we get here, where we have these invalid states, we broke the problem down. We said, each approver has a certain status or they haven't done it yet. That's three.

There's three possible choices for each one, and that's true of both. It's true of both, but when you combine them, the combination isn't a pure product because there are states in that product that you don't want to be possible.

The problem was combining them with a tuple. Unfortunately, that happens a lot. I do it all the time where I say, "Oh, I have two Booleans and I'm trying to represent something with three states, but two Booleans is four states." Then I have all this code to make sure I don't get into that weird fourth state that shouldn't make sense.

Let's start with that two Booleans case. If we have two Booleans and we only need to represent three states, how would we do that instead? What else could we do?

Three is a prime number, so you think I can't do a product really because a prime number is only a product of one and itself. The one doesn't make sense as a type. What we do is then have to use a sum. We just make an enum or a sum type, a union, of the three possible choices. Instead of using two Booleans, we use three possible choices.

Likewise, we could say that there's the case where no one has checked yet, the case where the first one has checked, and then the case where the second one has checked. That's three cases.

In the case where no one has checked, there's no other data we need to store. It's just no one has checked. We could call that enum [indecipherable 10:26] . The case where the first person has checked, we have to capture what status they gave it. Did they give it approval or rejection?

In the case where two people have checked, we have to capture both of them. We capture the first one and the second one. Both are approval or rejection.

If you add that up, you have zero case. That's one. The one person has checked. There's two possibilities there. That's 1+2. That's three. Then in the two case, we have approve or reject twice. That's 2*2. That gives us four. That's 1+2+4 = 7. That's seven cases.

You could try to model more closely that once the rejection has happened, first, you can't have a rejection and then an approval. We could say in the two-person case, we don't need to capture the first one anymore. That first one has to be an approval. We don't need to capture it.

Instead of the two-person case, we could call it, "One approves," or we could have, "Zero have checked, one have checked and approved, two have checked and approved and rejected." We don't record who rejected it, whether it's the first one or the second one. Maybe we could. We could say, "Rejected by the first, rejected by the second."

We're splaying out all the possibilities. In this case, if we have zero have checked, that's one case. One has approved, also one case. Two have approved, one case. First one rejected, one case. Second one rejected. That would imply the first one approved. That's also one case. How many is that?

Zero, first approved, second approved, first rejected, second rejected, that's five. That is probably closer to what...That probably exactly fits what we want. Since five is prime, it's hard to do a product in there, but you could have a product to where it's like, 2*2+1. It's 4+1. Somehow, you have a 2*2 in there to get the four.

There's all sorts of possibilities there that we can explore. The idea is you are using the structure of the product and sum to closely match. As closely as possible, you want to target exact the actual number of cases that you want to capture.

When you talk about all the zero, and then one, and then two, you might start to think we need something like a list. I would explore that for sure, not only because it makes sense — zero, one, two — but it also is like a more general data structure. What if we want to change our process so we have three approvals necessary? This is a way of making it the more general case.

Of course, there's always a trade-off between getting very close with a huge type that you're developing that has all these constraints on it versus have a more general type that does have some invalid cases, but you work within the code with a few if-statements to avoid those. That might actually be cleaner. In a lot of cases, it is. Sometimes, the generality of it is actually better.

Let's look at the list as a type. Lists are a sum type. They're an interesting sum type. List is either an empty list or an element and the rest of a list. The prototypical English words we use when we're talking about a sum type is either/or. You have either an empty list, or the first element of the list and the rest.

"Or" is usually used for sum types. "And" is usually used for product types, so the element and the rest of the list. Notice the type is recursive. A list could have another list inside of it. That's what makes it really interesting.

If you were just to do the math, let's say it's not recursive, it's two cases, a list or the head of the list, and something else. Because it's recursive, that list could be infinite in theory. There's no bounds to it. The size of this list now becomes infinite, the number of possible states this list could be in. It's unbounded in its growth.

That's another constraint on your system. Do you want to allow unconstrained growth? In a type system like this that we're talking about, it's harder to make a list of length three or two, like we want here. It's harder to do that than it is to have an infinite list.

If you want it to be length two, you're going to have to do checks to make sure it doesn't grow pass that, but sometimes you want that. You want the underlying generality of we don't know how many approvals we're going to want in the end.

It could be possible that you use a list. You have the empty list. That means zero people have checked. Then you add a check and a status. That's the first element on the list, the only element on the list. Then you add a second one. Then you add a third one. You add a fourth one, all these statuses.

Then you can do, it's another way of doing it, you'd say if any of them are rejections, we reject it. Or maybe you could count them up and figure out a vote between do we approve or reject.

You can do all of these things. As opposed to modeling the transitions very precisely with the type, you're using more of a calculation to figure out whether to approve or reject that task.

These are all ideas in data modeling. I'm exploring this idea because the third part of my book is all about data modeling. I don't have a very organized view of what this chapter or this part is going to be about, what are the skills that people need to learn.

I think sum and product types are a pretty good way of being able to talk about what states are possible. A lot of languages don't have them in a strict way. They allow you to make them in a lot of times, but it's a hack. In Clojure, we don't really have sum and product types in the sense that Haskell has them, but we do make tuples using vectors.

Hash maps are product types because you have different values, then each of those values takes on different values, and so you multiply them together to get the total number of states possible. But we don't have enums. What we use is keywords usually, if we need enums.

It's not very straightforward in Clojure, how to use them. They're more like idioms and patterns that we use.

Likewise, in something like JavaScript, you can basically do the same thing. You can use a raise to represent tuples, or you can use hash maps to represent tuples. How do you represent sum types? There's no enum, but you could use strings, you could use something like that.

It's not the easiest thing to see, and that's why I want to teach people how to see this stuff.

I learned this in Haskell. Haskell does it really straightforward, really clearly. That's how I learned it, and I'm trying to teach it to people who don't have the opportunity to work in Haskell.

Anyway, these are really important concepts. You have product types, which means you multiply the number of states to get the total number of states. The word in English that we use for product types is "and."

You have the first status "and" the second status. Then, there's some types which add the states together. You might just add one. You might add 10 other states. The word we use in English is either/or, or just either or just or. It's either/or because it's one or the other. It's not the or that has both inclusive or it's either/or, exclusive or. Those do a sum.

Lists are a sum type, but it's recursive. it's not just adding because you're adding infinity to it. It really explodes the number of possible states, which might be fine because you still have the two basic cases, where I either have a list or I don't. It's either empty or I have an element that is the first thing. That's how we use it. When we process lists, we often do it recursively.

If you enjoyed this episode, you can find all the past episodes at lispcast.com/podcast. There you'll find all of the old podcasts, the existing podcasts with audio, video, and text.

You'll also find links to subscribe, whether you subscribe with RSS, iTunes, YouTube, or just text RSS. You'll also find links for finding me on social media, on Twitter, LinkedIn, email, etc.

Looking forward to hearing from you. Thank you for listening. My name is Eric Normand. This has been my thought on functional programming, rock on.