What a monoid is and why monoids kick monads' butt

Everyone talks about monads but monoids are where it's at. Monoids are simple and make distributed computation a breeze. In this episode, we learn the two properties that make an operation a monoid, and how to use it do distributed computation.


Eric Normand: What is a monoid? By the end of this episode you will understand the two properties of monoids, and you'll be able to apply them in your systems. That's a big promise. My name is Eric Normand and I help people thrive with functional programming.

Why are monoids important? Why even bring them up? That's a really good question.

I know a lot of people talk about monads. I think it's mostly because they want to know what they are and they don't understand them. They think there's some magic in Haskell about how IO happens with monads. That's another topic.

This episode is about monoids. I actually think monoids are more interesting, more applicable to helping us write better code. Especially in a parallel or distributed code in the system that's got a computation that has to spread out over different course or different machines.

A monoid lets you break up a task into smaller tasks pretty much arbitrarily. You don't have to spend a lot of computation figuring out how to break it up. You break it up into small tasks, spread that out to different workers. These are on different threads or in different machines. Here's the key. The monoid lets you put them back together. Put the answer back together.

What are the two properties of monoid? The two properties of a monoid...First of all, it's an operation. A monoid is an operation. It's not a type or a kind of type. It's not a property of a certain class of values. It's a property of an operation over those values. For example, addition of numbers is a monoid. Multiplication of numbers is also a monoid.

What are the two properties that operation is associative? I have a whole episode on what is associative. I've explained it before. I'll explain it briefly in a second. The other property is that operation needs to have an identity value. I also have an episode on that. You should look those up if this is confusing. I'll go over a brief recap.

An associative operation is one that has two arguments, otherwise known as a binary operation. Addition has two arguments. It has A plus B. A and B are the two arguments. That's pretty clear. It's got to have two arguments.

Here's the other thing. It's got to take values of the same type and return a value of that same type. All three things, the two arguments and the return value have the same type.

Look at numbers, takes two numbers, returns a number. That's addition, right? Multiplication is the same way. Takes two numbers and returns a number. It's not returning a string or something else. It's returning a number.

That's part of what makes it associative. The other thing is that the grouping. Because you've got this...I look at it like a triangle. Like an upside down triangle. Got an A and a B and it gives me a C and they're all the same type.

Because I've got the A and the B at the top and the C at the point at the bottom, you can think of this associative operation as a combining operation. Addition is a way of combining two values.

It's not always combining. It's not always clear that that's a good way to think about it. In multiplication, are you really combining two numbers? With addition, you definitely are. You're combining two piles of rocks into one big pile of rocks. That's addition. It's where the abstract operation of addition comes from. If you just take a piles of things and you put them together and now you have a big pile.

If you imagine, I have three things to add. I have A, B and C and I want to add them up. I can add up A and B first and get a number and then I add C to that number, so I use the same operation again. I did a plus, A plus B, get a number, let's call it D and then I take that D and I do D plus C and now I get E and that is my answer.

You notice it's...I'm trying to make it graphical here, I add A and B and those are in a triangle and lead down to the point D. Then that D and C form a new triangle and it leads down to the point E.

That's one way I can do it, or I could do this other way. I have A plus B plus C. I could take C as the top of the triangle, go down to D and go down to E. Graphically, it's symmetrical. It's the same. I'll give the same answer but I'll group them differently.

The triangles group on one side to the left and on the other side they group to the right. That is what associative means. Really, when you put this into the context of doing work over distributed systems, what it means is I can make a hierarchy of workers where the workers at the bottom of the hierarchy, the bottom of the tree, are doing, adding up the tiny bits.

There's like their supervisor, right above them, adding up all the work that they're doing, into a bigger number. Then there's the supervisors of the supervisors, and they're adding up all the supervisors work. There's a supervisor of the supervisors of the supervisors and they're adding up all their work and there's like the super-super-supervisor, up at the top adding up everything.

You can make a hierarchy and it doesn't matter how they're grouped. That's what associative means. That the triangles, as they're added up, I guess this way the triangles are going up but as they're added up, it doesn't matter how they're grouped.

I can do the work at the bottom and it will get grouped up into the top arbitrarily. It's what you don't want to do when you're distributing the work is spend a lot of time thinking spending CPU power, figuring out what's the best way to break this problem up. You just want something, like I'm just going to chop this pile of rocks and have put my hands in it and push it roughly to.

You get one half and you get one half and start counting. What they're going to do is say, "It's too big for me to count, I'm going to split it roughly in two and I'm going to tell my sub-supervisors to work on it." Then they're going to split it up into two and give it to their two workers. The answers will bubble up through this associative operation.

Probably digressed too much about associativity here. I do have a whole episode on it. The other thing is it needs an identity. This is in order to be considered a monoid. This is just the definition of a monoid.

Identity really just means an empty value. Let me put that in quotes, "empty value." In numbers for addition, the empty value is zero. It's like, where do you start. Before you count, where do you stand, where do you start, you start at zero and then you start one, two. You don't have to say it when you count because it's assumed. I have zero before I start.

With multiplication, the identity is one which is where you start, because if you started at zero you multiply it out and you're always going to get a zero. One is like the null value, the value that doesn't change anything, that's what the identity means.

Basically, you need a way for someone to get started. You need a value that they can start at. Your workers are saying, "Well, I haven't counted anything yet. I'm at zero." Don't want to digress too much into identity again. There's an episode on that if you want to learn more about it.

Let me give some examples of monoids in case this wasn't clear that it's not just about addition. String concatenation is a monoid. If I have string A and string B, I concatenate them I get a string C. Notice the types are all the same. String, string, and string. Then it's got an identity. What is the identity of strings of concatenation?

It's the empty string. It's where you start. If I take an empty string and I concatenate it on to any string I get that string back. Front or back. Likewise, with list concatenation. Lists and strings are pretty similar structurally. Map.merge() is associative. If I take two maps and I make a new map that has all the keys from A and all the keys from B, that is associative.

The identity — you can probably think of it — is the empty map, so no keys. That's perfectly fine. I have already talked about addition and multiplication. AND and OR are associative. They take two booleans and return a boolean. The identity is different for OR, the identity is true, and for AND, the identity is false.

Is that right? No. I said that wrong. For OR the identity is false, and for AND, the identity is true. You can work that one out for yourselves on some paper. It's a useful exercise. Some other things that you might not think of as monoids...if you squint, they're combining operations. Min and max.

Min takes two numbers and will return the smaller of them. Again, remember, we've got the two arguments, same type. Return value is the same type. It doesn't matter how you group it. If I have three numbers A, B, and C, and I want to figure out what is the smallest of all three of these, I can take the first two and min them.

The results of that and the third one, I can min it. It doesn't matter how I start, how I group it. Same for max. Min is associative. What is its identity? The identity of min has to have the property that if I do min with that identity and any other number — let's call it A. The identity mined with A will always give me A, so what number is that?

When you think about it, it must be positive infinity. It must be the biggest possible number of some logical representation of that, of infinity. A number that is bigger than any other number. Likewise, the identity of max is the smallest possible number which is negative infinity, some logical representation of that.

That's great because that means in a hierarchy — in a tree — you can figure out the biggest number or the smallest number. I've seen this in a video somewhere. I don't know where but someone was trying to show that you can do distributed work in people. If each person does just a very small task, the whole room could come up with the correct result of the computation.

What they did was they asked everyone to write down their birthday and then when everyone had finished that...they were in rows. It was like a theater style classroom and they said, "If you're on the left," so there's no one to your left, you're the furthest on the left, "pass your birthday to the right and then if you have two birthdays in your hand, pass the biggest one to the right."

The highest birthday in the row starts to bubble up to the right because everyone's doing this one little comparison operation, they passed it to the right. The person on the right, the person with no one to pass it to, they have two, what they're supposed to do is pass the highest one forward.

Now you have the rows are bubbling up, the highest birthday to the front. The person in the front right position in the classroom when everyone had passed...everyone except for the people on the left, they now have one card, except for the person in the front who has two. You're right, he or she has two cards.

Some card plus the highest one which is the highest birthday. Then the professor took the card and said, "This is the birthday of the oldest person in the room." Right of the bat there and then, that person stood up and it was some old person. [laughs] It's like, "Oh, it worked. There is the oldest person in the room."

Then they asked, "Is anyone older than that?" It was, "No." They found the answer. When you think about this, they were doing the max operation on the dates, or maybe the min operation because it's the oldest birthday, so it's the minimum birthday.

They were doing min and they were able to combine them. First on a one-on-one basis but then you notice the min...each individual person was doing one comparison but then when the min for the row was discovered, that one was mined with the whole class. All of the rows were mined together.

This is an example of arbitrary grouping, so whatever rows happened to exist, it was arbitrarily grouped and then arbitrary grouping of the rows. They were in some random arbitrary order as well. When the answer bubbled up, it was a single return value that had been the mean of multiple means.

That is the power of associativity, right there. You can start with two things, two dates, compare them. You get an answer. Then someone else is doing the same thing. They're comparing two dates. They get an answer.

Now, you're comparing those two. You get an answer. It's like a hierarchy and the answer is bubbling up. I think that that's a pretty good story for explaining how these things can do distributed work, and why they're so good at it. You could imagine having this classroom do lots of work like that, lots of different computation tasks.

They could add up numbers, where each person has two numbers, and then they pass the answer to the end of the row, and they add up all these numbers, etc. Then the rows add up all their numbers, and the work is done faster, because it's done in parallel. All the rows are working at the same time, instead of one person going through one number at a time, adding, and adding, and adding.

I just wanted to give one more example of a kind of task you'd do. Let's say that you're at a school, it's just one building, it's got all these classrooms, and you need to figure out a lunch order for everyone in the whole school, because you're going to have lunch delivered today.

You know that if you went individually, as one person, to each person in the school, you could figure out the lunch order, but it would take longer than you have. It's already 9:00 AM, you have to get everybody's order, then it has to have time to cook it, and deliver it for lunch, so there's just no time to walk to every person and ask them.

What can you do? To break this problem up, you can say, "Well, if I have two lunch orders, I know how to combine them into a bigger lunch order." If everyone writes down what they want, let's say their student ID and then what they want for lunch. Everyone has that first initial piece of data.

You can have, let's say, someone at the end of the row in their classrooms combine all of those into a single, bigger order. That's a combining operation. Then someone at the front of the class would combine all of the rows together into a single one.

They would run to the floor manager. That floor manager would aggregate all of the classrooms on that floor together into a bigger lunch order. Then all of the people, all of the floors move even down or up. It doesn't matter, but let's say they all send their order down.

There's a school manager that adds up all of those orders into one big order. Then they send it off to the catering service.

What you're doing is a HashMap merge. Everybody's got a key and value. It's their student ID and then their lunch order. You're doing a merge between, first, everyone in the row. Maybe, as it move to the right, it gets aggregated up. Then, as it moves forward in the classroom, the rows get aggregated up into a classroom HashMap.

The classroom HashMaps get merged together into a floor level HashMap. Then the floors get merged, down onto the bottom floor, into one big school-wide HashMap.

That can happen very quickly, because each of the floors is up, and each of the classrooms, and each of the people, and in classrooms each row, each person is working in parallel. It's basically the amount of time it takes each...

The school-wide lunch order is the answer. It just takes as long as, let's say, it's four floors. The merging of four hash maps, plus, however long one floor takes to generate the answer because the other three floors are happening in parallel. It's a way of gaining some time by doing things in parallel.

It's the same thing. You're taking this problem. You're breaking it up. You're making sure that the types are the same on the two arguments and the return value. The type is hash map of ID to order. Now you can merge them. You've turned it into a monoid.

You could say if you were going to do this in an imperative, just loop through all the students' way. You might never make the hash map for each individual student because you might just say, "What's your student ID and what's your order? What's your student ID? What's your order? What's your student ID? What's your order?"

You never made that smaller type or the smaller hash map to make it of the same type, but by giving everyone the instruction to make the small hash map, you've now turned that into a monoid, which allows for this distributed collection of an aggregation of the data into a single answer.

You're breaking up the problem, the big problem into a lot of smaller problems. You're having the other computers do the small problems, and then you're combining the answers. I'll recap super quick. A monoid is an operation, like hash map merge, that is associative and has an identity.

Some examples, string concatenation, min, max, map merge, addition and multiplication. There's a bunch of others. I went over a story about how a room of people could calculate the oldest person in the room in a distributed way.

I also talked about how you could calculate the lunch order of a really big school in less time than it would take to go to each person individually. That was it. I have a little takeaway that you could do. The way I like to think of it is this classroom.

It's like this classroom that did this distributed calculation is actually a pretty good way of thinking of how you can distribute some work. If you could come up with this operation that each individual person can do, and then the rows can do, and then the whole room can do, if it's the person on the right-hand side of the row can combine, then you've got a monoid.

You can distribute the work to the whole classroom.

Think of how you would get classrooms to do the work. Some examples that you might think of is adding, adding numbers together. Let's say, you wanted to calculate the...Everyone has like some money in their pocket or some are in their purse or something. You could figure out how much money is in the room right now.

That's pretty cool. You can add it all up. You figure out the youngest person, the oldest person, you can figure out this lunch order or some other operation like that that is able to combine or figure out maybe.

Think of it like a tennis tournament. You're trying to figure out who the best person is. You take two people and you figure out the winner between those two. That winner becomes the return value.

As a hierarchy, you can figure out who is the best tennis player in this tournament. You don't have to play everybody against everybody. You're doing like a monadic operation to get up to the top. I do want to tell you, if you want to see all the past episodes and if in the future you want to see future episodes, they don't exist yet. That's why they're future. You can go to lispcast.com/podcast. There you'll find links to subscribe.

You'll also find a page for each episode that includes the audio version, a video version, and a text transcript of what I'm saying here. You'll also find my email address and links to social media if you want to get in touch with me.

I'd love to hear about the kinds of problems that you've found that you can solve in this classroom. Let me know about them. We'll talk about it. We'll get into a nice, deep discussion.

I can't say I'll see you next time, so until next time. I'm Eric Normand. See you later.