What is commutativity and why is it so useful in distributed systems?

Commutativity is an algebraic property that means that order doesn't matter. Because network messages arrive out of order, it's the perfect property for distributed systems. In this episode, you'll learn what it is (with some real world examples), why it's useful, and 3 ways you can make an existing operation commutative.

Transcript

Eric Normand: What is commutativity and why is commutativity so useful in distributed systems? By the end of this episode, you will have three ways to use commutativity to make your distributed systems more reliable, faster and easier to scale. My name is Eric Normand, and I help people thrive with functional programming.

Commutativity is important for a number of reasons. You could say, "Eric, you're getting all math-theta on me again, I don't want to go back to algebra class. Why commutativity?" Well, listen, commutativity is just a name that mathematicians have made up.

I'm sorry if you had a bad experience in math class. Commutativity is super important for distributed systems. When you have distributed systems, one of the most costly, expensive things you can do between those nodes and the system is to communicate, so that you can coordinate.

You don't want to be waiting for each other. You don't want to be waiting for messages to travel across the network. The whole point is that you can work independently. If you didn't want to work independently, you'd be on the same machine.

We want to reduce the interdependence of the ordering of our operations. We want to make it so that the order the work gets done in doesn't matter. That's what commutativity is.

Let's go over the what. Commutativity is an algebraic property of certain operations. By extension, we can say it's also a property of certain actions, certain effects. Not just mathematical functions, but also effects on the world.

You can do them in different orders and get the same result.

What it means is that order doesn't matter. In some things, the order does matter. If you're making a sandwich, the order matters. You got to start with the bread, then put something on the bread, and then put something else on.

If you did it in a different order, you'll get a different sandwich. It might fall apart. Might not even be a sandwich anymore. If you put the meat on the outside, it wouldn't be a sandwich.

In some operations, the order doesn't matter. That's very nice. One example that we use all the time is adding stuff to a set. If you've got a set data structure in your language and you start adding things to it, it doesn't matter what order you put them in. If you have a set of numbers and you put one, two, three, or three, two, one, it doesn't matter.

The data structure is effectively the same. They might be stored differently in memory, but the semantics of the data structure are such that in a set, the order doesn't matter. When you're comparing to, it doesn't look at the order of when they were added.

That's a nice thing. It means that I can use a set and guarantee that it doesn't matter what order. It means that different threads can be adding to the same set at the same time. It doesn't matter how the threads are interleaved, if one is slower than the other, if one thread skips some numbers, and decides to do it in a different order from what you gave it. It doesn't matter. They're free to work in an uncoordinated fashion, which is very nice for distributed systems.

Why is this so nice? Once you're on a network, the order of messages between nodes is just not guaranteed. If I make two requests to a web server, I cannot guarantee which one will arrive first, and I cannot guarantee which one will send its response back first. There's so many things going on between my computer and the other computer, that there's no way to guarantee that ordering.

Especially, if there's a lot of processing that you have to do. If the other computer is going to do a little bit of computation, sometimes, computation takes longer than another time. There's no way to do it. When you have two computers that are working together to work on a bigger job, so that each have a sub-job to do, you don't know which sub-job is going to happen first, is going to finish first. You want to make sure that you can work out of order. It just helps things.

Order is actually something that we have to think about a lot in distributed systems. I'm going to give you three things that you can do to, hopefully, eliminate the stranglehold that order has on your system. If you want to get the right answer, you want to make it so that you get the same answer, regardless of the order that stuff happens in.

Number one. If you want to make something commutative, you can easily do that by using an existing data structure that has a commutative operation. We already talked about sets. Adding stuff to sets is commutative.

Here's an example — if I want to count the number of people at the conference. They're already there. They're already in the conference. There's different rooms that they're in. They're listening to different speakers. They're moving around between rooms.

It takes a while to count them. By the time you're counting them and you get to a person, he might have left and gone to another room. Sometimes, you've counted someone — I've counted the same person as you counted. We're in different rooms.

People are moving around. You might count the same person twice. If what I do instead of just counting the number of people I see in the room, I scan their badges. I just remember every badge ID that I've seen, and you do the same. You're remembering every badge ID.

Then we have a list of badge IDs that we've seen. Then we dump them all into a set. It doesn't matter what order the people were counted or anything like that. All of that goes away. We know how many people were there because we can just count the number of elements in that set.

You could also have a central set that everyone is writing to. We can write to it in an uncoordinated manner. You write to the set. I write to the set. We don't talk to each other. You can see a real-time count as people come in. If people are counted twice, it doesn't matter, and the order doesn't matter even.

That one actually, now that I think about this example, it also confuses idempotence because you're not counting the same person twice. Just know that the order doesn't matter. The final set is going to be the same, no matter what order people are counted in. That's what's important.

There are other commutative operations on data structures that you've got. If you are working with numbers, obviously, addition and multiplication are going to be commutative. It doesn't matter what order you multiply your numbers in. You can pick the order that makes the most sense.

If you're counting page hits on your website, you're just adding one. That's a commutative operation. You can do that in a distributed manner. My web server and your web server were load balanced behind the same load balancer. I can count web hits. You can count web hits to the same central server in whatever order. It doesn't matter. That's nice.

Another thing, and I hinted at this in the example. Number two is another way to add commutativity is to think about identity. Do I identify some kind of identity for your operations?

In this case, the identity was the badge ID. The badge ID was saying...That's also for idempotence. It's allowing you to count people and not worry about the order because the ID is separating people. It's not like I say plus, you say plus one. That's really confusing it with idempotence. These properties are often used in combination.

Another kind of identity is the index. Let's say you do have some order imposed. For instance, I'm sending you packets on TCP packets of a file, they have to go in the right order. I can't guarantee that they will arrive in the right order.

There's going to be some packets dropped. I'm going to have to retry and so the order is just totally messed up. How do you get them back in order? How do you make it so that the order doesn't matter when it really does?

Well, each packet is numbered. Each packet has its order as part of the payload, so you on the other side can reassemble them in the correct order. It doesn't matter what order they arrived in, you can always reorder them to the correct order later. That's another useful thing.

This is called an index because it's like you have an array and you say, "Well, there's zero chunk of work in the first, and the one chunk of work, the two chunk of work, they go into this array. I'm going to send them out, they're going to come back in different orders, but I know where the answer is, go. Then I can tell when I'm done, and then I'll have them in the right order.

Another thing is you could capture the time. If you have, say events where this user's clicking buttons and that user's clicking buttons, and they're going to arrive at different times on the server. If you capture when they happened on the individual machines, you can reassemble a timeline of the actual order they occurred in.

That's hard when you've got time on a distributed-system. I'm not going to go into that. You can fudge time if you need to, and that can give you a consistent way of reordering them when you need to.

The third thing is to think about partitions, are also known as sharding. If you have a bank, the bank's operations depend on order a little bit, but you know that my account and your account are completely independent. We don't have to focus on, if you have a check get withdrawn, I have a check get withdrawn.

It doesn't matter if you withdraw mine first or you withdraw yours first from your account or from mine. It doesn't matter because they're totally independent. We only have a smaller problem to deal with, which is ordering the checks from my account and ordering the checks from your account.

They're totally independent unless we do a transfer between us but please don't take my money. Usually most bank accounts are not transferring money between the others all the time, so you can partition them.

We do this a lot with, say user accounts on online services. The users largely are not. The operation for a single user, do not affect the other users. My account is totally partitioned from your account, and so with that, "Lets us do," is not worry about the order between the different users, we just have to worry about the order for one user.

That is much simplified because if it's one user, it's usually one person. They have one tab open at a time, and so the problem is much more tractable.

On the other hand, if you actually look at the traffic, the web server is getting thousands of web requests a minute in random order. You can have peace of mind that as long as you know what user account that that web request is for, you can figure out that there's not that many per user coming in at the same time.

Let me give you some examples of existing commutative operations that you'd be familiar with. I talked about adding to a set. I talked about addition and multiplication of numbers. You also have AND and OR on bullions. These are commutative, which is a really nice thing to have.

There's a commutativity that I call conditional commutativity, which is where you have, let's say you have two hash maps, typically merge depends on the order. The order does matter. What is going to merge over what, because you might have the same key with different values. If you have the same key with different values, the one on the right is going to merge over the one on the left.

However, if you know that there's no keys in common, because typically when we're using hash maps, we do know the keys. When we're treating it like an entity, we do know the keys. If we know that there's no keys in common, then it is a commutative operation. You're going to get the same answer with, A merged with B and B merged with A.

There's that notion of conditional commutativity, and so I'd also like you to think about whether your operation can have some condition on it that you can satisfy. There's also this idea of commutativity on effects. Can be like, let's say I'm sending emails to customers. I'm sending, I don't know what it is, like a yearly newsletter to every customer.

I'm just sending one newsletter to each email address I have on a list, does it matter what order they get sent out in? The effect is that they're all going to wind up in the user's inbox, or their spam box, or whatever. It doesn't matter because they're not going to call each other up like, "Did you get that at 6:01 or 6:02? Was mine first?" They don't care.

Effectively, they do get sent out in some order, but it doesn't matter, the effect is the same. That's different from if you send the same person two different emails. Then the order does matter. The mail server is going to, or the mail client is going to show them in a particular order.

Maybe it wouldn't make sense. The first email they get is like the second one, so it's going to say, "Hey, I just sent you this thing. Did you get it?"

Then a minute later, they get one that's like, "Hey, this is the first email." It doesn't make sense. The order does matter for an individual person but not across people. It's nice to think about stuff like this this way. Let me recap real quick.

Distributed systems like commutativity because it lets things work in an uncoordinated fashion, just sending answers as soon as they have them. They can be combined out of order and you're going to get the right answer. That's awesome.

It's really about independence. It's this essence of independence. It's decoupling the answer from the order that you got the sub-answers in. It means order doesn't matter. It's an algebraic property. Talked about that.

The how. One, use existing commutative operations on existing data structures. Two, think about order, bundling the order, an identity, or a time with the question and with the answers, so that they can be reordered on the other side. That's helpful. This is how TCP works.

Also, think about partitioning, so that you can independently evaluate the order, like in the case of user accounts where users aren't affecting each other, so the order that the operations happen between users doesn't matter. Makes it much easier to reason about.

I gave some examples. I think that it would be good for you — for yourself — if you found some existing operations. You can use the ones I listed, find some others and play with them. Test them out, like at a rappel or in a little test program.

Think about how these work, and how they achieve commutativity. If you want to go further, you can look at stuff that you need to be distributed and see how you can make the order not matter.

If you can make the order not matter, you often can eliminate a lot of coordination code. Make your system more scalable, more performant, etc. If you found this episode useful, I would very much appreciate if you shared it with other people. They could find it useful too.

You could get a little bit of...a couple of good karma points from them, for that. Like, Comment, do what you do in your app. Also, subscribe, because if you found this one useful, the next episode is going to be similarly useful. You'll be notified and you'll have it on your device when it comes down the pipe.

If you'd like to ask any questions or get in touch with me, suggest another topic, you can email me eric@lispcast.com. I'm also available on Twitter @ericnormand with a D. You can find me on LinkedIn if that's what you use, and we can connect there. Awesome. See you later.