On Composition - Zach Tellman
Zach Tellman is a prolific conference speaker. His talks are always enlightening and deep. He's currently writing a book called Elements of Clojure. The book is a deep essay into what makes Clojure code readable and maintainable.
The book is an ongoing project, so it's too early to pin him down on exactly what the talk will address. But if his prior talks are any indication, it will be a philosophical romp through the material we use every day as programmers.
Slides
Video
Please forgive the poor audio quality. We had technical trouble with the audio recording and used the backup audio. Please use the captions (CC) in the video player.
Transcript
Eric Normand: Our next speaker is writing a book. And it's a good book. I've read what he has done so far. He's reading a lot of books, probably more good books than most of us have read in our lifetimes, in order to research and come up with material for his book. In order to get some of that radiating ideation coming off of him and soak it up, because it's not all going to make it into the book, I invited him to speak.
Our next speaker is Zach Tellman. Please take it away.
[applause]
Zach Tellman: I've been writing a book. The way that I have been explaining this book lately is that it tries to put words to what most experienced engineers already know. This is useful, because we have the capacity to understand things that we'll struggle to explain. This is what Michael Polanyi calls "tacit knowledge," something that we only understand as part of something else.
I am standing in front of you here speaking. I know how to speak, but as I'm speaking, I'm not focusing on the physical act of speech. I'm focusing on the words. If you stopped me and asked me to pull back and explain to you what my vocal cords are doing right now, I'd be at a loss.
I haven't had a reason to think about that deliberately. This is OK. Tacit knowledge accounts for most of how we understand the world. It takes a lot of effort to explicitly understand a thing. Polanyi tells us that tacit knowledge only works, only suffices, as long as the intuition it gives us doesn't steer us wrong.
Once it does, we have to look at the thing directly. We have to understand it explicitly. There are some things that we understand tacitly in software. We acknowledge that names are one of two hard problems, but we spend almost no time actually studying them directly. We seem to just assume that if you write enough code, you'll stop being bad at it.
Likewise, we clearly work with these things called abstractions, but I have seen countless conversations between engineers where it became clear halfway through that they had very different understandings of what this word meant. This might even be OK if it weren't for the fact that we write software that does fail.
Those failures are due to names and abstractions, whatever that may be. We can't continue to look past these things. We have to look at them head-on, even if it is a little labor-intensive, even if it is a bit difficult to wrap our head around them.
That's what my book is about. I've been trying to provide a vocabulary. I'm trying to provide a framework for judging these things because it's time. We need it. Elsewhere I've talked and written about names and abstractions, and I don't have time here to speak about all that, but the big takeaway for abstractions is that abstraction is treating things which are different as equivalent.
We have here the same tree pictured over four different seasons, and I say it's the same tree even though there are some pretty big differences, right? It's covered in snow or not. It has leaves or not. The leaves are green or not, and despite that the identity carries through. We agree that this is the same tree, and so what we are tacitly saying there, what we are presuming, is that those things are incidental to its identity. They don't matter.
A leaf falling doesn't force us to re-evaluate everything we knew about that tree. It's crucial because the world around us is constantly changing and we do not have the time or energy to constantly re-evaluate everything we know every time it happens, and so abstraction is key to how we understand the world around us as well as how we write software.
If we acknowledge that abstraction is fundamentally the act of ignoring things, ignoring most things even, we have to acknowledge that we can't just say whether an abstraction is good in some sort of absolute sense. We judge abstractions within the context in which they are going to be used because anything we ignore might steer us wrong in a particular situation.
I care a great deal about whether or not a tree has leaves if I run a lawn care service. I care a great deal about whether or not they're going to fall. Those are central to the tree's identity for me, and so when you talk about composition which is taking abstractions and placing them next to each other, we can see this as beginning to define this context, right?
These different pieces of software placed next to each other define each other's environment, and so we begin to actually be able to judge these things in a concrete way rather than just talk in the abstract about the act of judgment.
I want to talk a little bit about composition. We know composition generally from our grade school math books as taking two functions and making one function, and this clearly plays a part, but it is not the end of the story because if we go and we put together a bunch of pure functions and yield one large monolithic pure function, that's still passive. It doesn't interact with its environment.
It needs something else to actually invoke it, and so we can't just talk about functions and leave it at that. What we're trying to create is some kind of active relationship with the world around it. And so what we're trying to construct is a process. Something which pulls in data, which transforms the data, and then pushes the result elsewhere so someone else knows about it.
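As a rough illustration of that pull, transform, push shape (a sketch, not code from the talk), the loop below wires three functions together. The names `run-process`, `inbox`, `outbox`, and `pull-from-inbox!` are hypothetical, and the convention that a `nil` pull means "no more data" is an assumption made only to keep the example finite.

```clojure
(defn run-process
  "Repeatedly pulls a value, transforms it, and pushes the result.
  Stops when `pull!` returns nil (an assumption of this sketch)."
  [pull! transform push!]
  (loop []
    (when-some [input (pull!)]
      (push! (transform input))
      (recur))))

;; Shared references standing in for the outside world.
(def inbox (atom [1 2 3]))
(def outbox (atom []))

(defn pull-from-inbox!
  "Removes and returns the first item of `inbox`, or nil if it is empty."
  []
  ;; swap-vals! returns [old new]; the head of the old value is what we pulled.
  (let [[old _] (swap-vals! inbox #(vec (rest %)))]
    (first old)))

;; Pull from inbox, transform with inc, push to outbox.
(run-process pull-from-inbox! inc #(swap! outbox conj %))
```

Only `inc` is pure here. The pull and the push both go through mutable references, which is the part the rest of the talk is concerned with.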
And I'm defining a process to be something which is not necessarily what you would call an OS process these days. It's something a little bit more specific. It's something that operates sequentially. All of the actions that it takes are ordered with respect to each other. It has execution isolation, which means that all of the actions that it takes are ordered with respect to each other but not with respect to things which are going on outside of it.
It also has data isolation: there are things that the process, and only that process, can see. And so the most prominent example is a database thread. And you would also find these properties in a linear chain of callbacks. You can also find it in early Unix processes which didn't have threads. We could also see it in Carl Hewitt's Actor Model which resulted in processes in Erlang. You can also see it in the original iteration of objects in Smalltalk, where objects originally communicated via asynchronous message-passing, though in later versions of Smalltalk, that turned into synchronous RPC, which became something more akin to the OOP that we're familiar with.
And so this is an incredibly common concept, though they don't always share the same name. But it is found everywhere even in the early history of software. And the reason for this is that the process is the smallest unit of code which has value on its own. It can actually do a thing. It doesn't need to be combined with something else to be useful.
But in order for this to be true, it has to do all three. It has to pull in data, it has to transform it, it has to push. And just to inform your intuition for this I want to talk about a few Unix utilities which don't do all three. And look at how they're not useful in isolation.
The "yes" utility, reproduced here in Clojure, will go and repeatedly print out "y" and newline over and over again. This is designed to be placed upstream of a process which has interactive prompts that we want to blindly agree to. And so this can be useful, but clearly has no value on its own. It's only useful when used in concert with other things that have interactive prompts.
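The slide with the Clojure version is not part of this transcript. A minimal sketch of what such a reproduction might look like follows; the function names are assumptions, and the lazy sequence is split out only so that the behaviour can be inspected without looping forever.

```clojure
(defn yes-lines
  "An infinite lazy sequence of the line that `yes` prints by default."
  []
  (repeat "y"))

(defn yes
  "Prints y and a newline, over and over. Never returns."
  []
  (doseq [line (yes-lines)]
    (println line)))

;; Look at a finite prefix rather than calling (yes) directly.
(take 3 (yes-lines))
```

Note that there is no pull and no transform here, only a push, which is why it needs something downstream to be useful.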
Likewise, we can see /dev/null, which only pulls in data then drops it on the floor. Again, this is useful when placed downstream of something which has an effect besides what it's printing out to standard out, and we want that effect but we want to ignore what it's actually printing. Again, useful, but not on its own.
cat is something that, given a list of file names, will go and fetch the contents of those files and print it to standard out. This is pulling data, it's printing data, but it's not doing any transformation. It's just passing it along. And so this is valuable because a lot of utilities don't read from the filesystem. They read from standard in. And so this acts as a bridge. Likewise, netcat is a bridge to the network. It has value because it bridges two things together but on its own it just kind of hangs there. It does nothing.
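To make the contrast concrete, here is a hedged sketch of the two shapes just described: a sink that only pulls, and a bridge that pulls and pushes without transforming. `dev-null` and `cat-files` are hypothetical names, and both take an explicit reader or writer (rather than standard in and standard out) so they can be tried in isolation.

```clojure
(require '[clojure.java.io :as io])

(defn dev-null
  "Pull only: reads every line from `reader` and drops it. Returns nil."
  [^java.io.BufferedReader reader]
  (doseq [_line (line-seq reader)]
    nil))

(defn cat-files
  "Pull and push, no transform: copies each named file to `writer` as-is."
  [writer & file-names]
  (doseq [file-name file-names]
    (with-open [in (io/reader file-name)]
      (io/copy in writer))))
```

Calling `(cat-files *out* "some-file.txt")` would behave roughly like `cat some-file.txt`, where the file name is a placeholder.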
Lastly, we can think of a slightly more high-fidelity reproduction of the yes utility. Because yes doesn't just print 'y', it can also print out any expletive that we provide. You may have it print out "n" or "YES" or something. And so we can see this kind of as a transformation. We're providing this initial parameter on the instantiation. It's transforming it into an infinite recurrence of something.
This is slightly more useful. There is some sort of transformation going on here. But the thing that I'll observe is that generally, the usefulness of the output is commensurate with how complex the inputs are. If what we're doing is giving it what is literally at our fingertips when we run this utility, then it's just going to tell us a variation of what we already know.
In general, we don't want to bound that. Limiting ourselves to only what we have when we instantiate a process does place great bounds on it. There are a few counterexamples you can name, which are largely from the domain of mathematics: pseudorandom number generators, fractal geometry. Those take small initial values and yield large outputs. But I will point out that neither of those things is valuable on its own. We don't generate randomish numbers just for the sake of it. We don't generate fractal images and then not look at them. We do these things because they are useful when placed upstream of something else.
In order for this to be standalone, to be useful on its own, it has to do all three.
I said that processes have data isolation. There's data that they and only they can see.
This is true, but consider a case where you start up a process and hand it some initial values, just a bunch of immutable values flowing in. You can certainly compute things about those. What we can't do is share the result with the rest of the world.
Communication occurs via shared, mutable references. In order for us to get data, someone has to update a reference that we can see with new data that we need. In order for us to share data with somebody else, we need to do the same.
Mutation is a necessity. Otherwise, we're just raising the ambient temperature of the room. But we want to limit mutation wherever possible, because if we have some sort of data structure which allows for mutation and we pass it to some function, and unbeknownst to us, it shares it with some other process, we've just added an edge to our system.
We have now created this linkage which we probably don't want. Certainly, if it wasn't something that we could have predicted, it's going to make it very hard for us to reason about how this system works. We want to use mutation where necessary, but nowhere else.
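One way to picture "mutation where necessary, but nowhere else" is a single shared reference that acts as the only edge between two sides. This is a sketch under that assumption; `shared`, `producer!`, and `consumer` are hypothetical names.

```clojure
;; The one deliberate edge between the two sides.
(def shared (atom nil))

(defn producer!
  "Computes over its own immutable input, then publishes one result."
  [xs]
  (let [total (reduce + xs)]  ; local to this call; nobody else can see it
    (reset! shared total)))   ; the single, visible mutation

(defn consumer
  "Reads whatever was last published."
  []
  @shared)

(producer! [1 2 3])
(consumer)
```

Everything inside `producer!` other than the `reset!` stays pure, so the only thing a reader has to track across the boundary is `shared`.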
Likewise, I said that a process needs to have execution isolation. Its actions are ordered with respect to each other but are not ordered with respect to anything outside of it. This is also only partially true, because sometimes we need data from outside of the process and that means we have to wait for it.
Consider what happens while we are waiting. We say, "I would like to have the first block of data from this file," and then we have no choice but to just wait until the file system provides us with that. We can't make forward progress without this data. This is fine.
This is a very simplistic idea of how a file read happens. What happens if it takes too long? You can't simply just say, "Gosh, I guess this file system isn't working." We have to reason about what's going on inside of it.
Unfortunately, the truth is a little more complicated. This, by the way, is a hugely simplified view, but at the very least we're passing through a cache. If we have a cache miss, we go to the I/O scheduler, which will go and try to prioritize among the different requests it needs to schedule.
It'll then pass it on to the disk controller, which is an external piece of hardware that talks to the actual storage unit, which at the end of the day is actually the thing that limits what write capacity or read capacity we have. In order for us to reason about why we are slow, we can't just say, "Gosh, I don't know. The file system's slow."
We have to actually start to look at these individual pieces and understand who's waiting on who. Who's active here, and who's just passively waiting for the next guy? Worse yet, we can't stop there, because we also have to think about who else is contending with us for the finite resources of the file system.
In order to debug why it is slow, we have to hold all of this in our head. By and large, we are not able to hold this in our head. We can try but we're going to struggle a great deal. We certainly don't want to do this very often, if at all.
What we can't just say is, "We need this to be fast so that we don't have to think very hard." The only control we have is to not care when things are slow. We need to allow for things to be slow because that allows us to be incurious.
Not only that, we need to expect that even our low expectations are still going to be disappointed. We need to have timeouts. We need to have code paths that handle those failure modes. This is the only way that we're able to consider our process in isolation: we make it not terminally dependent on something that's going on outside.
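A minimal sketch of a timeout with an explicit failure path, using Clojure's `future` and the three-argument form of `deref`. The name `read-with-timeout`, the `:timed-out` marker, and the idea that `slow-read` is a zero-argument function are all assumptions for illustration.

```clojure
(defn read-with-timeout
  "Runs `slow-read` on another thread and waits at most `timeout-ms` for it.
  Returns the value read, or :timed-out if it did not arrive in time."
  [slow-read timeout-ms]
  (let [pending (future (slow-read))
        value   (deref pending timeout-ms :timed-out)]
    (when (= value :timed-out)
      ;; Best effort: interrupt the read we are no longer waiting on.
      (future-cancel pending))
    value))

;; A read that answers promptly, and one that takes far too long.
(read-with-timeout (constantly :data) 1000)
(read-with-timeout #(do (Thread/sleep 2000) :data) 50)
```

The caller still has to decide what `:timed-out` means for it. The point is only that the decision is now made in our code rather than by whatever is on the other side.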
This may seem like I'm advocating for you to give up, because the reason that we do this job, the reason that we're good at this job, is because we are actually quite able to hold a lot of moving parts in our head. This is something that we are proud of, indeed.
Certainly, we write specs where we say this needs to be fast. It needs to never go over some threshold. It needs to be a reliably fast system. The problem is that we can't hold the system in our head. It's too large. We are going to get things wrong.
The system will grow slow and flaky despite our best efforts. The only way that we can actually guarantee this property is to radically simplify the system, and in practice, the existing foundation that we've built upon, the operating systems, all this accumulated software over the last half century, is still too complicated.
Radical simplification is not within reach, and so we have to just have low expectations. I want to give you a little bit more of a practical understanding of a process. I want to run through a few simplified real-world examples.
The simplest example of the process is something that we all know, a REPL. A REPL pulls in data from outside the process which is a code form. It transforms that code form into its evaluated result. It then pushes that result out via print, and it does it all over again. Simple, right? Very easy to reason about, does all three things.
Looking at this, we would reasonably expect that most of the time we're going to be blocking on read. Eval and print run at computer timescales, while read is waiting for our big, dumb fingers to actually type something out.
It's easy to come up with counterexamples to this. We could, for instance, ask the REPL to sum all the numbers between one and a billion. At that point, we can probably type in a second command before the evaluation finishes.
Likewise, we could actually get it to print out all the numbers between one and a billion, which means that whatever is downstream of our REPL is going to have to go and present all that stuff before we can actually type anything in, which is even less likely. Emacs doesn't like this very much, it turns out.
[laughter]
Zach: The thing that you need to realize is that there needs to be a little bit of coordination involved in this. When we pull, we're letting the outside world know that we are ready for more data. They're not just throwing it over the fence. We're allowing them to send us something.
Likewise, when you push, the outside world is allowing us to send them something. When I talk about push and pull, I do not mean that as a unidirectional data communication. What I mean is that on net, the flow of data is in one direction or the other.
Likewise, let's consider an HTTP handler. We take the HTTP request, we transform that request into a database query, we execute the query, and then we take the query result and turn it back into a response. This is the idealized version of Clojure as a backend server tool.
I want to break down each of the individual steps that are involved here. First, outside of our handler, the web server is going to pull in the bytes off the wire that represent the request. Then it's going to transform that into the Ring request that we all know and love. It will then invoke a handler.
The handler will then transform that request into a database query. It will then pull the results of that query from the database, and I'm smoothing over the fact that we're also pushing stuff, because on net we assume that there's more data coming back than we're sending.
We then transform that result into the Ring response. Then we return from our handler. Outside of our handler, the web server then transforms the Ring response into the encoded bytes, and then finally, it pushes them back to the client.
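The handler's share of those steps can be sketched as two pure transforms around one pull. This is an illustration only: `db` is a hypothetical in-memory map standing in for a real database, the `/users/<id>` URI scheme is invented, and the request and response are plain maps in the shape Ring uses (`:uri`, `:status`, `:headers`, `:body`).

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for a database.
(def db {"42" {:name "Ada"}})

(defn request->query
  "Transform: Ring-style request map -> description of the lookup."
  [request]
  {:table :users
   :id    (last (str/split (:uri request) #"/"))})

(defn run-query!
  "Pull: ask the database for the row. Here it is only a map lookup."
  [query]
  (get db (:id query)))

(defn result->response
  "Transform: query result -> Ring-style response map."
  [result]
  (if result
    {:status 200 :headers {"Content-Type" "text/plain"} :body (:name result)}
    {:status 404 :headers {"Content-Type" "text/plain"} :body "not found"}))

(defn handler
  "Composes the three steps at the last moment."
  [request]
  (-> request request->query run-query! result->response))

(handler {:request-method :get :uri "/users/42"})
```

Reading bytes off the wire and writing them back are absent from this code, because the web server does them outside the handler.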
This is a fair bit more complicated than the REPL. There's more stuff going on. There isn't this very clear cycling between pull, transform, and push. This is true for most real-world processes. They're not usually that simple.
The more important thing that I want to point out here is that we are not defining the boundaries of our process when we write a handler. There's stuff going on outside of the edges that we need to reason about if we want to reason about the operational quality of our software.
A framework is often defined as code which calls us, as opposed to a library which is code that we call. When you're operating in a framework that means you actually have to understand what it's doing to be able to reason about what's going on in production.
Lastly, I want to talk about a frontend application, a simple one, which has a single button to refresh, backed by some Ajax request. When we click on the button, it will go query some service, and when the service returns it will update the DOM to reflect the new information.
The first thing that I want to point out here is that this is not a process. You can click the button and while we are querying the server, you can click the button again, and both of these queries will be going on concurrently. Which means that they are not ordered with respect to each other. So this is actually a way of spawning processes where each process will go and do a query and update the DOM and disappear.
We might recognize that it's not very valuable for a lot of people to spam this button and thereby spam our backend process. Maybe we want to stop that. So we add a little bit of state. We have an atom that marks whether or not we are refreshing and then each time we click we check whether or not there's already an in-flight request. And if so, we just kind of ignore it. So they can go and click the button but we don't have to get contention.
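A sketch of that in-flight guard follows. It is written in plain Clojure with a synchronous `fetch!` so it can be run at a REPL; in a browser the flag would instead be reset in the request's completion callback. `refreshing?`, `on-click!`, and the `:ignored` return value are hypothetical.

```clojure
(def refreshing? (atom false))

(defn on-click!
  "Runs `fetch!` unless a request is already in flight.
  Returns :ignored when the click is dropped."
  [fetch!]
  ;; compare-and-set! only succeeds for the one caller that flips false -> true.
  (if (compare-and-set! refreshing? false true)
    (try
      (fetch!)
      (finally
        (reset! refreshing? false)))
    :ignored))

;; A click that arrives while another is being handled is dropped.
(on-click! (fn [] (on-click! (fn [] :inner))))
```

Using `compare-and-set!` rather than a separate check followed by a `reset!` avoids two clicks both seeing `false` and both proceeding.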
Conversely, you might want to do something a little bit more explicit. You might say when you click on this, we're gonna disable it. Maybe put a little spinner on it or something like that that says your click is not welcome. And then we enable it when we're done. These are both valid strategies, as is queuing up all the clicks to be executed serially, as is any other strategy you might come up with. There's debouncing. There's everything.
But at the end of the day, you have to acknowledge that sometimes the external demands or external supply is not going to match up with what we want them to be. At the boundaries, we have neighbors that don't behave the way that we would wish them to and we have to come up with strategies to deal with them. That strategy, as a whole, is called an execution model.
I want to call out specifically queues, because queues are very common, a very useful way to connect two processes. They allow for us to go and communicate when there is supply in response to demand and vice versa. But in the most naive use case of this, where a queue is allowed to exert backpressure and we just wait indefinitely in order to allow us to make forward progress, we are now bought into our neighbor's execution model. Their choices become our choices.
Occasionally you will hear people talk about queues as a straightforward thing which allow us to reason about different parts of our system in isolation, which is not true. Because if these things are joined together, they are in a suicide pact with each other to wait until the end of time for the file system to come back. We have to reason about them as a single unit. We have to hold that pair of processes, or that entire set of processes, in our head to understand what's going on. The thing that allows us to reason about our code in motion is low expectations, timeouts, and explicit failure modes. That is what gives us separation of concerns.
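As a hedged sketch of a queue used with timeouts instead of indefinite waits, the code below uses the timed `offer` and `poll` methods of `java.util.concurrent.LinkedBlockingQueue`. The wrapper names, the `:empty` marker, and the capacity of 1 are choices made for the example.

```clojure
(import '(java.util.concurrent LinkedBlockingQueue TimeUnit))

;; Capacity 1, so backpressure shows up on the second push.
(def q (LinkedBlockingQueue. 1))

(defn push-with-timeout!
  "Offers `x` to the queue, waiting at most `ms`. Returns true if accepted."
  [^LinkedBlockingQueue queue x ms]
  (.offer queue x ms TimeUnit/MILLISECONDS))

(defn pull-with-timeout!
  "Polls the queue, waiting at most `ms`. Returns the item, or :empty."
  [^LinkedBlockingQueue queue ms]
  (if-some [x (.poll queue ms TimeUnit/MILLISECONDS)]
    x
    :empty))

(push-with-timeout! q :a 10)   ; accepted
(push-with-timeout! q :b 10)   ; queue is full, gives up after 10 ms
```

A `false` from the push, or an `:empty` from the pull, is the explicit failure mode: the caller has to handle it, but it no longer inherits its neighbor's execution model.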
So when we talk about actually putting a process together, I want to consider the REPL. Let's assume that it's a little bit more configurable. When we start our REPL process, we provide functions that do the read, the eval, and the print. So it's pretty evident that we want these to all be passed in. They have to be combined into a single unit, but we don't want to do this too early.
There's no reason to have this function take a read-eval or an eval-print parameter because these are wholly separate in what they are performing. We might want to read from the console. We might want to read from the network. We might want to print to the browser. These are all separable concerns to the actual evaluation.
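A configurable REPL along those lines might be sketched as follows. `run-repl`, the `:repl/quit` sentinel, and the `printed` atom are hypothetical; the example reads from a string and prints into an atom to show that read and print can be swapped independently of eval.

```clojure
(defn run-repl
  "Loops read -> eval -> print until `read-fn` returns :repl/quit."
  [read-fn eval-fn print-fn]
  (loop []
    (let [form (read-fn)]
      (when-not (= form :repl/quit)
        (print-fn (eval-fn form))
        (recur)))))

;; Read from a string rather than the console; print into an atom.
(def printed (atom []))

(with-open [in (java.io.PushbackReader.
                 (java.io.StringReader. "(+ 1 2) (* 3 4)"))]
  (run-repl #(read in false :repl/quit)  ; return the sentinel at end of input
            eval
            #(swap! printed conj %)))
```

Swapping `#(swap! printed conj %)` for `prn` gives the familiar console behaviour without touching the other two functions.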
Generally, the reason why these things are separate, why we only want to bind them together at the last possible minute, is because they're solving fundamentally different problems. Push and pull: these phases are responsible for the operational aspects of our code, for our code in motion. The transform phase describes the value that the process provides, and it's the only thing we can reason about fairly well with respect to the code at rest.
To illustrate this, I will talk a little bit about sort. According to the first sentence of the docstring for sort, it returns a sorted sequence of the items in coll. The way it actually accomplishes this is to take coll, turn it into a sequence, realize the entire sequence, turn it into an array, and then perform an in-place sort.
Of course that's not the whole story, right? Because if we go and give it something too large, then it will simply run out of memory, because it assumes, naively, that whatever we give it will fit in memory. Note that the docstring does not tell us not to do this. It just assumes that you know not to do this, you fool. But this is not intrinsic to the act of sorting.
Let's look at the GNU sort utility. GNU sort will take a large stream of data. It will take in a chunk of that data. It'll sort that chunk, spill that chunk to disk, and then repeat until it's exhausted the input, at which point it'll take each of the chunks and merge them together. This is more complex. Before, we were talking about just transforming data, and now we have a pull, a transform, and a push, and then a subsequent pull and push where it is doing the merge. This is fairly complex, right? There are more moving parts.
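The chunk-then-merge shape can be sketched in a few lines. This is not how GNU sort is implemented: the sorted chunks here stay in memory, where the utility described above spills them to disk, and the function names are invented for the example.

```clojure
(defn merge-sorted
  "Merges two already-sorted sequences into one lazy sorted sequence."
  [xs ys]
  (lazy-seq
    (cond
      (empty? xs) ys
      (empty? ys) xs
      (<= (compare (first xs) (first ys)) 0)
      (cons (first xs) (merge-sorted (rest xs) ys))
      :else
      (cons (first ys) (merge-sorted xs (rest ys))))))

(defn chunked-sort
  "Sorts `coll` one chunk of `chunk-size` items at a time, then merges
  the sorted chunks pairwise."
  [chunk-size coll]
  (->> coll
       (partition-all chunk-size)   ; pull a bounded chunk
       (map sort)                   ; transform: sort just that chunk
       (reduce merge-sorted ())))   ; merge the sorted chunks together

(chunked-sort 3 [5 3 7 1 6 2 4])
```

Because the merges nest once per chunk, this sketch is only suitable for a modest number of chunks; a real external sort would use a k-way merge over files.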
But it is robust. It is taking something which was an assumption and turning it into an invariant. And the reason that we want these invariants is not just because they make our life easier. If we have an assumption, that becomes everyone else's problem, and that makes the rest of our system less robust.
We don't want to leak these assumptions out if we can possibly avoid it. And so the pull phase is responsible for exactly that. It makes sure that the data is of the appropriate size and shape. It enforces these invariants so that the transform step gets to be [inaudible], gets to go and have this simplistic docstring that does not capture these sorts of failure modes. It also has to define what happens when the data is unavailable.
What the pull phase does not do is simply provide a lazy sequence. Because a lazy sequence is something that can go and trip up sort. It is something that can go and fail because suddenly the disk is unavailable or the network connections froze. If we go to pass this in, our transform phase, this thing which is meant to be sort of pure, which is meant to be related to the data as it is, not our code in motion, can throw an exception at any point.
And so this is something that people do a lot. I have done this a lot. I'm not trying to call you all bad people. In many cases, this is appropriate if you are only doing this as a one-off and you know exactly how big your dataset is. It's also valid if you have a config file which will never get too big anyway.
But in the most general case, if you want your code to be general, if you want it to be reusable, if you want the person who has your job three years from now not to hate you, you need to go and actually think about these operational qualities. And you need to put guards at either end of the process.
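One possible guard on the pull side is sketched below: it realizes the input eagerly, enforces a size bound, and only then hands a fully realized vector to the transform. `pull-bounded`, `sorted-report`, and the error message are hypothetical names and choices.

```clojure
(defn pull-bounded
  "Realizes `coll` into a vector, refusing anything longer than `max-items`.
  Any failure of a lazy source surfaces here, not inside the transform."
  [max-items coll]
  ;; Take one more than allowed so we can tell 'exactly full' from 'too big'.
  (let [items (into [] (take (inc max-items)) coll)]
    (if (> (count items) max-items)
      (throw (ex-info "input too large" {:max-items max-items}))
      items)))

(defn sorted-report
  "Transform: pure, and free to assume its input fits in memory."
  [items]
  (sort items))

(sorted-report (pull-bounded 3 [3 1 2]))
```

With the guard in place, even an infinite input such as `(range)` fails fast with a descriptive exception instead of exhausting memory inside `sort`.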
The transform phase turns well-shaped data into other data. And there are three things we can do to it. We can accrete data: we can take two pieces of data and combine them. We can reduce it: we can take the data we have and turn it into something less than its current form. And we can reshape data, which neither grows it nor shrinks it, but turns it into a different representation which is advantageous.
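For concreteness, here is one small instance of each of the three operations, using only core functions. The data is made up for the example.

```clojure
;; Accrete: combine two pieces of data into something larger.
(def user (merge {:id 42} {:name "Ada"}))

;; Reduce: turn data into something less than its current form.
;; Different inputs can collapse to the same output.
(def total-a (reduce + [1 2 3]))
(def total-b (reduce + [3 3]))

;; Reshape: the same information in a representation that suits us better,
;; here a vector of rows turned into a map keyed by :id.
(def rows [{:id 1 :n "a"} {:id 2 :n "b"}])
(def by-id (into {} (map (juxt :id identity)) rows))
```

After the reshape, `(get by-id 2)` is a direct lookup, where the original vector would have needed a scan.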
We accrete data when we don't have enough data to make forward progress or when we would like to have more data to operate on. We reduce data when differences in our input don't matter. If we sum together a list of numbers, we are saying that any possible list which yields this particular sum is equivalent. We're blind to their differences, right? It doesn't matter. And so this is abstraction.
This is treating things which are different as equivalent. And a common example of this is from the field of statistics. This is called Anscombe's quartet. These are four different data sets which share a bunch of common summary metrics. They have the same mean and variance and correlation between the two axes and a few others I forget.
And so by using these metrics, by throwing away all the other data, what we're saying is that these all look the same to us. We are blind to their differences, right? And sometimes that's true. More often it's not. But this is an incredibly thorny issue, right? Abstraction is something that I've talked about at length. And it's important because reduction is often the valuable part of our code.
We accrete data so that we can go and apply a reduction to it. We reshape data so that we can actually perform those reductions, because that is pulling out one insight among many. When we reshape data, we do this to make it easier for us to accrete or reduce. It's not abstraction. We are taking these different things and we're putting them into a particular representation because that representation is different in a way that we care about. Consider a database versus a floppy drive, where they both contain the same data. These are equivalent in the same way that Turing-complete languages are equivalent. Which is to say, not in a way which matters in an engineering context.
There might be some clever way to show an isomorphism, but that doesn't make it the right database for the job. Likewise, we can go and represent a map as an association list, but most of the time we don't, because the map gives us affordances that we like. We like random access. We like random writes. We like random reads. We do this to set ourselves up to be more successful down the road, because this allows us to do things that we couldn't do at all, or couldn't do in a reasonable amount of time. And reshaping should always be a separate operation. It should be a function unto itself.
If our function requires a set to be able to do something efficiently, we should not take some collection and silently coerce it to a set. We should be loud and proud about the fact that this is what we need. Because it means the people who are using our code know what we need to be successful and can reason about the trade-offs of coercing data early, reusing that coerced representation, and all these other sorts of things. We should not do that silently for other people's benefit.
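One way to be "loud and proud" about needing a set is a precondition, sketched below. `known-users` is a hypothetical function; a spec or a plain docstring would be other ways to state the same requirement.

```clojure
(defn known-users
  "Keeps the ids that appear in `allowed`. `allowed` must already be a set;
  the caller decides when, and how often, to pay for the coercion."
  [allowed ids]
  {:pre [(set? allowed)]}
  (filter #(contains? allowed %) ids))

;; The reshaping is a separate, visible step owned by the caller.
(def allowed-ids (set [2 3 3 2]))

(known-users allowed-ids [1 2 3 4])
```

Passing a vector instead of a set fails the precondition immediately, rather than quietly building a new set on every call.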
Likewise, if we can, we should keep the accrete and reduce steps separate. This is not always possible. If we're going and taking a list of values and putting them into a set, we are accreting these values but we're also losing information about the ordering of the values and how many duplicates there were.
And so this just kind of all happens at once and it's fine. But if we can get away with it, we should keep them separate because this code doesn't really relate to each other very often. We can actually go and reuse them and if you do them separately, we keep more degrees of freedom for ourselves.
What we're transforming into is a description of the effects that we want to have. What is the next pull? What is the next push? And often, this is pretty straightforward. What we're giving is a description of the data we want to push.
When we call println, we're giving a literal description of the data we want println to print, plus a newline. And even in a more complex example where we have, say, an HTTP request, this is still a fairly literal description of what we're sending on the wire, right? It has to go through a little transformation, but there's nothing here that doesn't have an exact correlate in what we're sending to the other server.
But sometimes, there are things that don't fit into that same category right? In clj-http, we have a lot of other parameters. We can add on to it. We can say if the request gets a response, it tells us that this thing lives somewhere else, go follow that redirect. I don't want to worry about it. Or if it is something that is an exceptional response, throw an exception. Don't just give me that error as data.
We might need to go and define bounds: this is the most redirects I'm willing to follow before you should go and throw an error. None of these describe the data that we're sending. They describe how we're sending the data, when we stop trying to send the data, and what we do if it doesn't quite work. But we should not mistake this for an actual implementation of this behavior.
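The distinction between "what we send" and "how we send it" can be shown on a request written as plain data. The option names below are the ones I recall from clj-http's documentation and should be checked against its README before use; the URL is a placeholder, and no HTTP call is made here.

```clojure
;; A request described as data. Nothing is sent by defining this map.
(def request
  {:method           :get
   :url              "https://example.com/users/42"  ; placeholder URL
   :follow-redirects true                            ; how: chase redirects
   :max-redirects    5                               ; how: but only so far
   :throw-exceptions false})                         ; how: errors come back as data

;; The part with a direct correlate on the wire...
(def what-we-send (select-keys request [:method :url]))

;; ...and the part that only describes behaviour.
(def how-we-send (dissoc request :method :url))
```

Because it is still only a map, either half can be inspected, merged with defaults, or rewritten before anything turns it into an effect.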
Data does not have any semantics. There is indirection between these things and even though we may think we know what these fields mean, because they aren't really self explanatory. Maybe they're inverted. Maybe it's completely the opposite of what we thought. That is defined elsewhere. We have to actually map this data into something which performs these effects. And so that's why we have functions.
Functions have sneaked into this. But functions also remove many of the degrees of freedom you normally have. We cannot reach into a function and pull out a piece of it. We cannot go and reshape a function to perform its actions in a different way. All we can do is wrap our function in another function. We can only increase. For this reason, we do not want to go and turn our data into a function until the last possible minute, because that goes and just reduces our flexibility down the line.
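One way to picture "data until the last possible minute" is the sketch below. The `:op` and `:line` keys and the `->thunk` and `twice` helpers are hypothetical names, not taken from the talk or from any library.

```clojure
(require '[clojure.string :as str])

;; A description of an effect, as plain data.
(def effect {:op :print-line :line "hello"})

;; While it is still data, we can reach in and reshape it freely.
(def shouted (update effect :line str/upper-case))

(defn ->thunk
  "Turns the description into a function that performs the effect."
  [{:keys [op line]}]
  (case op
    :print-line (fn [] (println line))))

;; Once it is a function, all we can do is wrap it in another function.
(defn twice [f]
  (fn [] (f) (f)))

((twice (->thunk shouted))) ; prints HELLO on two lines
```

The conversion in `->thunk` is the point of no return: `shouted` can still be compared, stored, or tested as a value, while the thunk can only be called or wrapped.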
Likewise, we need to recognize that once we turn our data into a function which performs an effect, into an active implementation of the effects we're describing, we are outside the transform phase. The only way that we can test this function is to make it perform that effect. That means that we have to actually test it within a much more complex environment.
This is why the transform phase should be the biggest part of our code. This should be the chocolate center surrounded by this little, tiny candy-coated shell. This is what allows us to reason about our code at rest. It's what allows us to use the REPL. It's what allows us to come up with pure functional tests. These are all the properties that we associate with Clojure and the nice development process we've built around it.
But the pull and push phases do exist, so we do need to test them. Unfortunately, we can only test them in a place which reproduces the pathological behavior we expect to see in the world.
I am not proud to tell you that often I have concluded that the best way to test something in a facsimile of the real world is to put it in the real world and just see what happens. This is all too common, because it is incredibly hard to do anything else.
Maybe other people will have found better success. Maybe you will come up with better techniques for this, but at the end of the day, this is hard enough. We want to keep this as small as possible. We do not want this to own more of our code than it absolutely has to.
I want to just reiterate that a lot of languages or environments or frameworks will go and talk about how they have lots of small processes. You have green threads, you have Erlang's processes. You have Go's goroutines. You have core.async's primitives.
Each of these will talk about how many small, bite-size processes talk to each other, and will talk about this as if it's this great contribution, because now all these complex systems can be understood a little bit at a time.
Again, that is not true unless these processes are hardened against each other, unless they are not fully dependent on each other, because otherwise they just become this big amorphous mass that we have to understand as a whole, and we probably would have been better off just having it be one process.
This is the meat of the talk, "How to build a process," but clearly a single process is not all that we will usually have, especially if we're talking about a network. We are then composing these processes together into a system.
This is a big topic, really larger than a single talk, so I'm just going to do a little survey of some things here. The first thing I want to point out is that a process is not a value. We cannot pass a process into something.
All we can do is pass in an identifier for that process or a channel which is a means of communicating with that process. In fact, usually, we can use an ID to generate the channel, some means of communicating.
In the simplest system, when we go and spin up multiple processes, we can simply tell them about each other upon initialization. If we have some shell pipeline that we're setting up, every process upon creation is told, "Here's how you read, here's how you write." We set it up so they're all talking to each other appropriately and just let it go.
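A minimal sketch of that static wiring, assuming `java.util.concurrent.LinkedBlockingQueue` as the channel and a daemon thread as the process. The `start-stage` helper and the `:done` sentinel are hypothetical names for this example, not anything from the talk.

```clojure
(import '[java.util.concurrent LinkedBlockingQueue TimeUnit])

(defn start-stage
  "Starts a process that is told, at creation, where to read and where to
  write. It applies f to each message and stops after forwarding :done."
  [f ^LinkedBlockingQueue in ^LinkedBlockingQueue out]
  (doto (Thread.
          (fn []
            (loop []
              (let [msg (.take in)]
                (if (= msg :done)
                  (.put out :done)
                  (do (.put out (f msg))
                      (recur)))))))
    (.setDaemon true)
    (.start)))

;; Three channels, two stages, wired together once, like a shell pipeline.
(def a (LinkedBlockingQueue.))
(def b (LinkedBlockingQueue.))
(def c (LinkedBlockingQueue.))

(start-stage inc a b)
(start-stage str b c)

(doseq [msg [1 2 :done]]
  (.put a msg))

;; Bounded waits, so a broken stage cannot hang the reader forever.
(def results
  (vec (repeatedly 3 #(.poll ^LinkedBlockingQueue c 1 TimeUnit/SECONDS))))
;; => ["2" "3" :done]
```

Neither stage is ever passed the other stage itself, only the queues it should read from and write to.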
In more complex systems, processes come and go. There might be faulty processes or we might want to go and reshape how the system is shaped. If there's any sort of dynamism, then processes need to be able to find out about each other at run-time.
In these cases we go to what is variously called "discovery" or "resolution." Service discovery usually describes us saying, "I want to perform a task, who can help me with that? What is the set of things that can help me with that?" Resolution is usually taking some sort of abstract identifier and you'll get some sort of more concrete identifier. Often that can be used to generate a channel for communication.
Of course, in DNS, we don't actually get a single IP address for our domain, we actually get a set of them. This is often used for geographic load balancing and other things. This is true in most of these cases. Usually it is a one-to-many sort of mapping that we have here.
The idea is that we're going and we're taking something which is abstract and can be modified at run-time such that as things come and go we can be robust, we can regenerate and relearn what the current shape of the system is.
A related concept is routing. Routing is taking a single channel and then having everybody talk to it and having some process go and distribute the messages that are being sent on that channel across a lot of other backend processes as it sees fit. We see this a lot in networking. This is a term that we talk about a lot at a bunch of different levels of the IP stack.
A thread pool is a router too. There's a queue, we put functions on it and those functions are shared between a bunch of other threads that will execute those functions. In fact, most of the concepts you'll see within a distributed system have a corollary within a shared memory local context.
Of course, there are differences. Going and updating a shared reference means that we know that the data got where it's going. We know that they're at least aware of, or are able to see, the thing that we shared. We don't know when they're actually going to act on it, but we know that they can see it if they try.
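The thread-pool case can be sketched with a fixed-size `java.util.concurrent` executor: callers submit to one queue, and the pool decides which worker runs each function. The `route` helper and the shape of the result maps are assumptions made for this illustration.

```clojure
(import '[java.util.concurrent Callable Executors ExecutorService Future])

;; One shared queue in front of three worker threads.
(def ^ExecutorService pool (Executors/newFixedThreadPool 3))

(defn route
  "Puts f on the pool's queue; whichever worker is free will run it."
  [f]
  (.submit pool ^Callable f))

(def results
  (->> (range 6)
       ;; Submit every task first, recording which worker picked it up...
       (mapv (fn [n]
               (route (fn []
                        {:n n :worker (.getName (Thread/currentThread))}))))
       ;; ...then wait for each result.
       (mapv (fn [^Future fut] (.get fut)))))

(.shutdown pool)

(count (set (map :worker results))) ; at most 3 distinct workers
```

The caller only ever addresses the pool, never a particular thread, which is the same relationship a client has with a router in front of backend processes.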
When we have an asynchronous, faulty channel of communication, this gets a lot more complicated. It's possible to get confused at this level. At the end of the day, communication is a means towards action, it's not what we're actually trying to do.
The fact that we got back an ACK over TCP saying, "Oh, yeah, I got your message," means it's sitting in a buffer somewhere, not that they're processing it, not that they've acted on it, not that they'll remember if they crash midway through in the next two seconds.
Acknowledgment over communication is a means for us to build larger systems which acknowledge the actions that we're trying to perform. In both local and distributed context, if we go and throw something over the fence and say, "It's your problem now," that's going to end badly.
We need to go and create higher-order acknowledgments of whatever our applications are trying to perform and only once we get that acknowledgment do we go and actually let out our breath and say, "OK, good. We're done. We don't have to worry about this anymore."
Again, designing systems, especially distributed systems is an enormously complex thing. It is well beyond the scope of this talk, or a book that I'm writing, or anything I really hope to have to deal with in my lifetime.
Typically, when I come up against a topic like this, I try to go and recommend a very canonical book for this. The closest thing I'm aware of is the book, "Distributed Algorithms," by Nancy Lynch, which is an extremely comprehensive coverage of the theoretical results in Systems Theory research.
What it doesn't do and what no book I'm aware of does is give a voice to the intuition that you gain from working on distributed systems for a decade or more. I wish this book existed, because I'd love to be able to recommend it. I'd love to be able to put it in my book as the reference people should go to look for. If anyone's aware of such a thing, please come share it with me.
I think that if it doesn't exist, then this is a clear gap in our industry that should be filled, because this is knowledge that people gain and which is often communicated through gestures and whiteboard diagrams as opposed to a really formal vocabulary that can translate outside of that one very local context where they're having that conversation.
Again, we compose functions to create processes. Processes are the unit that is hardened against this environment, which interacts with this environment and which is actually reusable on its own in different contexts. Often we have to go to compose these processes into systems. That's all I have for you guys. Thank you.
[applause]
Eric: You made an assertion near the beginning of the talk that we need a radically simplified software stack, something like that.
Zach: No. I said, "In order for us to be able to create systems which are both fast and reliably so that they never go above some sort of threshold."
You'll see this in some embedded systems, real-time operating systems, which basically eschew all of the stuff that we generally associate with personal computing, because all those ossified layers mean that we are no longer involved with the system like that, which means that we can no longer guarantee that issues won't arise.
What I was saying was not that we need to radically simplify this but if for some reason you have to do this you would have to. I think that in practice most people don't, that it's not a rabbit hole you should go down and instead you should just learn to be more pessimistic.
[laughter]
Eric: We have a question from Peter. You mentioned naming and that it's one of the hardest things and that we should study naming. How does one go about it?
Zach: Buy another book.
[laughter]
Zach: Seriously. The book is called, Elements of Clojure, again the first chapter is available for free, it is called, "Names," and it is my attempt to go and take the most relevant parts of analytic philosophy and apply them within the context of software. That's the best answer I have.
The book that was referenced in the previous talk, Data and Reality, is one which I think is truly phenomenal and it's criminal that it's been ignored. It was published in the late '70s and I think has always been a little bit of a thing that a few people will know about, but don't actually read.
I think it's maybe out of print, I'm not sure, but if you can find a copy I recommend it because it's a very lucid and practical look at data modeling which is very closely related to naming. Again, the chapter's free, read it. I make a bunch of different references to different things that I think are ways you can go deeper on that topic. If you have any thoughts, please let me know.
Eric: Data and Reality is actually available as an audio book, so if you want to go listen to it that way.
Audience Member: How many hours is it?
Zach: It's an incredibly short book.
Eric: It's not that long.
Zach: Maybe 120 pages or something, so really not that long.
Eric: I have another question. There was a part of the slide, I think it went by too fast, but you were talking about there are three things and one of them was code in motion.
Zach: Right. I was trying to talk about the difference between operational and functional aspects of code. We talk about the functions, which is what it does and operational is, "How is it done?" More generally, the functional stuff is something that we can understand as a self-enclosed bubble, because other people are taking care of what the big, scary bubble does.
I think it's useful to separate those, obviously, because we're bad about thinking about code in motion and because we have lots of things that are happening at the same time, not just a single sort of ebb and flow of stuff.
I'm just trying to say, basically, we were talking about queues, but specifically we were talking about and saying that queues do not provide separation, because actually they make processes conjoined in terms of how they execute. This is the processes whereby these are all waiting for the other guy, because what are they going to do separately?
Eric: You said something about the pull phase has to take that into account.
Zach: It does, or rather what it has to do if we want our process to be a self-enclosed...
Eric: Module.
Zach: ...thing, it needs to define boundary conditions upon what it will take from the world. If it waits too long that's like saying, "I'm done with this," and the computer spins. If you don't do that, then it means that basically these things are operating in this interlocked way and you have to reason about them in tandem, or as a set.
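A boundary condition on the pull phase can be as small as a timeout. This sketch assumes a `LinkedBlockingQueue` as the source; the `pull` helper and the `:timed-out` marker are hypothetical names.

```clojure
(import '[java.util.concurrent LinkedBlockingQueue TimeUnit])

(defn pull
  "Waits at most timeout-ms for the next message. Returns :timed-out
  instead of blocking forever on a producer that never delivers."
  [^LinkedBlockingQueue q timeout-ms]
  (let [msg (.poll q (long timeout-ms) TimeUnit/MILLISECONDS)]
    (if (nil? msg)
      :timed-out
      msg)))

(def inbox (LinkedBlockingQueue.))
(.put inbox {:job 1})

(pull inbox 50) ; => {:job 1}
(pull inbox 50) ; => :timed-out, the process gets to decide what happens next
```

With the timeout in place, the consumer's behavior can be reasoned about without knowing anything about how or whether the producer is running.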
Eric: OK, we have another question from Tyler. OK, so you're going to have to help me out with this question, because I don't know how to ask it. I'm just going to ask it. What descriptions of systems do you know of?
Tyler: Like languages for systems.
Zach: I'm not aware of any, because I think that it's hard, because a process is not a thing that you can like stab a pin through and say, "It's a process." It's not a butterfly you can put up on the wall. You have to go and talk about it in action.
Certainly there are some tools that work with certain parts of this. Leslie Lamport's TLA, for instance, goes and talks about certain aspects of the interfaces between them, but it doesn't attempt to talk about what the processes are. That's left as an exercise for the implementer.
I don't know of anything which even attempts to go and create a language for talking about this, which is anything less than just executable code.
Tyler: Unless you compose like guarantees of...
Zach: I don't know. I guess the problem here gets back to the idea that reasoning about the system as a unified entity is asking for trouble, so maybe it's better to talk about a process and say, "I actually don't know what the other code that I wrote is going to do, so I'm just going to assume the worst." Then, on the other side it's also going to assume the worst. This creates redundancy in the code.
By putting these things across process boundaries, we're creating a lot of extra work at that edge to make these things separate, but that usually pays off, because maybe somebody else who also starts working on that code, maybe it's flaky in ways that we didn't anticipate and we wanted to not have to go again load all of that run-time into our head just to reason out what's going on.
I guess the idea I'm advocating here is, "Let's not talk about the system. Let's talk about processes which don't force us to think about the system." Obviously, you have to go and be able to chart it out on a whiteboard at some point, but that should be a very high-level exercise as opposed to something which tries to give us holistic guarantees across our processes.
Eric: You mentioned that a process is either an ID...It's never a value. It's always something identifying a process or a channel to talk to that process. Does that mean that processes aren't first-class citizens?
Zach: They're not first-class values, certainly. It's just something which is an amorphous conglomerate of a bunch of different things. Yeah, you can't have like a higher-order process. You might say that a Ring Web server is like a higher-order process, because you can plug something in the middle, but you can't wrap a process around a process, then it's just two processes talking to each other.
Once you have a process, the only thing you can do is compose it with another process, not like reach into it and modify little bits. You're down to sending messages back and forth and trying to affect them that way as opposed to reaching inside them or doing something directly.
Eric: I'm actually out of questions. Do we have any more?
[laughter]
Eric: Someone is asking, "Have you read Designing for Scalability with Erlang/OTP?"
Zach: I've read an Erlang book. I've read the Pragmatic Programmers book. I think that OTP is a reasonable example of what I'm talking about, but I have not gone and been paid money to write Erlang. I know a number of people who have and I've done some little toy projects on my own time.
I think that Erlang gives you the tools to create processes which are not in this suicide pact with each other, but it certainly doesn't mandate it. OTP, "If anything ever goes wrong just shoot yourself in the head and we'll go and create another one of you," is...
[laughter]
Zach: ...a strategy and it does help you...No, no, no, it's not a bad one. It allows you to not have to explicitly say, "Here is my error code path." The error code path is always death. That is actually a greatly simplifying kind of facility there, but it doesn't stop you from creating things which are locked together. It doesn't stop you from having a deadlock.
These are all things that can occur there and it's down to you to make sure these things don't happen. This is not a guarantee that the run-time can provide you with.
Eric: Is there any hope to analytically figure out, "These are the, I don't know how many, different ways that things can get locked together," and...
[laughter]
Eric: ...just like put those 20 things in front of it to protect it on the pull and on the push?
Zach: I don't know. I am constantly surprised by how things can go wrong. Elsewhere in my book I talk about the fact that I think what makes a senior engineer a senior engineer is not their mastery of the tools, but rather their ability to predict what the outside world...
Eric: Their paranoia.
Zach: ...will do to screw things up.
[laughter]
Zach: It's their understanding of the domain and how complex and screwed-up the real world can actually be. I do not think that there's a way to enumerate the failure modes. I think certainly you can estimate the likelihood of those, but then you have to constantly check whether you have to update your priors there and be like, "Actually this wasn't possible, but now it is."
Now we have a new class of user who has a new class of misunderstanding or the system is flaking in this new way, and again, you can't impose anything on the outside world, so at best you have some assumptions you're making there and you're constantly checking those assumptions to make sure they're still valid.
Eric: Like you mentioned before, Erlang has this simplified, just-die model that I guess all errors become the same, they just become death?
Zach: Well, to some degree. It doesn't stop it from being something which like malicious code which is trying to make you go and do a thing you didn't intend to do. Unless you know that it's bad, you won't die. Again, this is a facility, but it doesn't preclude any failure, it just makes it easier to have a default path you can follow when something goes wrong and you recognize it as going wrong.
Eric: You talked about how a stream, a lazy stream is not actually a good pull mechanism, because it's tied together with what could be an integer stream, or something.
Zach: Right.
Eric: Could you talk about what would be a good pull mechanism?
Zach: Sure. It's not that if it's a lazy-seq it's bad, because lazy-seq is a very overloaded thing in Clojure, but actually just refers to pure operations you haven't performed on the data structure yet, but where that actually represents effects.
Like the point that I made is that if you have a function and that function means effects, you're not actually transforming things, and that's carried out not from the push but the pull, if I go and I pass in a lazy sequence which is performing effects to go and yield data...
Eric: Like reading in the lines of a file.
Zach: Exactly. Something can go wrong there. There can be an IO exception on that disk. We're not actually in the transform phase, because we can't actually reason about this in isolation, because there is something that someone else can do to screw us up.
The reason that you don't use lazy sequence is because it fundamentally shrinks it, so that the M&M is all candy-coated shell and sometimes there's chocolate in it. That's not good. That's not what we want.
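The file-reading case can be reproduced in a few lines. This sketch uses an in-memory `StringReader` in place of a real file, and closing the reader early stands in for the disk failing; the var names are hypothetical.

```clojure
(require '[clojure.string :as str])
(import '[java.io BufferedReader StringReader])

;; A lazy sequence that performs IO each time it is realized further.
(def rdr (BufferedReader. (StringReader. "a\nb\nc")))
(def lines (line-seq rdr))

;; The outside world interferes before the sequence is fully realized.
(.close rdr)

;; What looks like a pure transform now fails with an IO error partway through.
(def outcome
  (try
    (mapv str/upper-case lines)
    (catch Exception _ :io-failure-inside-transform)))
;; => :io-failure-inside-transform

;; Pulling eagerly at the edge keeps the transform pure.
(def pulled
  (with-open [r (BufferedReader. (StringReader. "a\nb\nc"))]
    (vec (line-seq r))))

(mapv str/upper-case pulled) ; => ["A" "B" "C"]
```

In the second version every IO failure happens inside `with-open`, so `pulled` is an ordinary vector and the code that consumes it can be tested at rest.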
Eric: Michael asks, "You talked a lot about analytical philosophy and how you're doing a lot of reading in that, but there's another branch of study, human study that talks a lot about abstraction and that's linguistics. Have you looked into linguistics for definitions or as a way of understanding abstractions?"
Zach: I have actually. I'm very familiar with semiotics, which is a cousin of all that. Semiotics, it tends to go over on top of these things in the abstract rather than linguistics, which by and large tries to look at real-world examples.
I think probably I would benefit from this. I took a very winding path to arrive at the definition of abstraction. A week after I wrote it down, I read an old book by Barbara Liskov, which said it in the first three pages.
[laughter]
Zach: I was incredibly embarrassed and annoyed. I was like, "Uh." I don't know. I think that abstraction, unfortunately...
Again, this is talked about elsewhere, is that we use the same word to talk about some very distinct concepts. It's very easy for us to conflate these different concepts because they're masquerading behind the same word. I'm happy with my definition because it works for the sort of things that I'm trying to talk about.
I'm not claiming it's the only possible definition about abstraction, but I think it's one that works well specifically for software. That's my thing, but if anyone has any books they'd like to suggest in terms of things in linguistics, whatever, please do let me know. I always love to have more books on the pile.
Eric Normand: Could you go over your definition of abstraction one more time?
Zach: Sure thing. Abstraction is treating things which are different as equivalent. We're saying that there are differences. The tree has leaves, doesn't have leaves. The leaves are green or red. That doesn't fundamentally change what the thing is for us. Where at the end of the day, it's about identity.
Most data modeling is about identity. This is what Data and Reality talks about a lot, which is what are the things that we're able to ignore, and what are the things that we aren't? If a baseball team loses a player and gains a different one, is it the same team or not? If they move to a different stadium, is it the same team or not?
The answer to this depends on what sort of problem are you trying to solve? If you're trying to predict the outcome of a game, it is a different team if a player's injured or a player's traded. If you're selling tickets to games, you don't care. These are not things you can talk about in the abstract.
You have to actually go and ground them in some sort of concrete context and then say, "This is good, or this is bad, but also, here's what would make it not good, right, or not bad, right? Here's what would change here that would actually force me to reevaluate whether or not this is an effective abstraction..."
We're well past the end of this. I'm happy to go and take more questions in the hallway, but thank you again.
[applause]
Eric Normand: We have another break. I think we are out of time.
Audience: We have five minutes left.
Eric: Five minutes left on the break.
Audience: No, on the clock.
Eric: On the clock? Oh, we're not over time?
Audience: No, it's not like school.
[laughter]
Eric: Look at that. We do have five...four minutes left. Any more questions?
[laughter]
Eric: We have plenty of time.
Audience: Make him earn his money.
Eric: In the transform phase, you had three different things it could do. Accrete, reduce, and reshape. Do you think that's it?
Zach: I think so. They're pretty broad, right? Either you have more data, or less data, or the same amount of data, just different. Short of an identity function [laughs], I think that's pretty exhaustive.
Eric: And return nothing.
Zach: Yes, exactly [laughs]. You could do nothing, too.
Eric: Is reduce...You said that that one was an abstraction because, your example, if you're summing, you're making all the sequences of numbers that sum to the same number equivalent. Is that the only one that's abstraction? The reduction?
Zach: Yeah, because accretion is saying we need more data. It's not lossy. It's capturing important new qualities of all the data. It's lossless. Reshaping is not abstraction because, again, you're not treating things which are different as the same. What you're saying is, I want to pay close attention to how this data is being represented, and I want to change that.
In both the accretion and the reshaping, you're saying, "There are things here that I care about very much." In the reduction, you're saying, "There are things here I do not care about."
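The three kinds of transform can be shown side by side on one small value. The `order` map and its keys are made up for this sketch.

```clojure
(def order
  {:id 7
   :items [{:sku "a" :price 3}
           {:sku "b" :price 4}]})

;; Accrete: more data, nothing lost.
(def accreted (assoc order :currency :usd))

;; Reduce: less data. Many different orders collapse to the same total,
;; which is what makes this one an abstraction.
(defn total [o]
  (reduce + (map :price (:items o))))

;; Reshape: the same facts about skus and prices, represented differently.
(def reshaped
  (update order :items #(into {} (map (juxt :sku :price)) %)))

(total order)                       ; => 7
(total {:items [{:price 7}]})       ; => 7, a different order, treated as equivalent
(:items reshaped)                   ; => {"a" 3, "b" 4}
```

Only `total` throws information away on purpose: two orders that differ in every detail except their sum become indistinguishable after it.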
Eric: I'm just going to keep asking questions.
Zach: No, there's one there.
Audience Member: I had a question. At some point, there's this trade-off between abstraction and performance and being able to hold stuff in your head and complication. That's one of the gists I got from you. Is it just a matter of scaling out your unit of what you have to hold in your head?
We have folks working on the Linux kernel, and we have all these experts that are in their small particular domain. There's not one for maybe Linux. There's not really one person that I would expect to hold it all in their head.
It really is just a matter of scaling it out to a small enough unit where some specialist can hold it in their head and can communicate with other specialists. In that case, it's like there's this natural tension between performance and simplicity and you're trying to...
Over time, the system blows up because you're trying to manage all these different...These are the file systems, and everything blows up in complexity for a reason. All these other things that you're talking about.
It seems to me that there's another side to this where we could scale it out on the human side. I can very simply just look at my small piece of that puzzle and I communicate with the other experts in the field.
Zach Tellman: We can, as long as you're not wholly dependent on what they're doing, in which case you can't ignore them. The file system, if you need what's in the files to do something and you have requirements that the file system perform anywhere near its theoretical peak in terms of throughput or latency, then you care a great deal about how that works. You can't possibly ignore that.
The only way you get to ignore that is when you say...If you just pretend it's 1970 and tape decks would be fast enough for you, then sure, go wild and ignore that. SSDs are a great boon there; they give us a simpler random access model to think about in that specific domain.
I'll push back a little bit on the premise of your original question, because the modules in a long-lasting system aren't what last. It's the interfaces. The interfaces are what ossify. The interfaces are lasting. The modules themselves, on either side, can get freely swapped out.
I would expect that at the high level of Linux, there are at least a handful of people who understand all the interfaces between all the different pieces. They don't necessarily understand exactly how the innards of each module work. They all spend a great deal of time thinking carefully about what's allowed to be in or not in the interface.
To your point, yes, often the way that we get fast performance is by adding something to an interface that allows us to go and rely on some implementation detail that a new piece of hardware or a new approach affords us.
That's a super trippy thing because that makes us less general and ties us a little bit more closely into the specifics of how it's being done now versus how we'll do it in the future. That's a tradeoff. It's basically how rich do we allow interfaces to be versus how general do we want them to be. There's no one good answer to that.
I'll talk about this more in the chapter on abstraction. If you read that or watch the video on that work and have further questions, I'd be happy to go into that.
Audience 2: When did your book come out?
Zach: It is available on Leanpub. You can buy it. You could get a copy. I did that two years ago. I'm incredibly grateful that they've been this patient. This talk here represents the material from the final chapter. I haven't written it all down yet. I hope to do so in the near future to get that out there. At which point, that will be a mostly complete first draft.
Once that's done, I'm going to reserve an unlimited time to go and tinker with it because at that point I feel like I've given what people paid for. Each chapter I've written, I've gone, I've re-written and I've played around with. I expect to do the same here. I'd be surprised if it's anything resembling a final draft before the end of this year.
Eric: How many people have bought the book?
Zach: Thank you.
Eric: How many people have read the first three chapters?
Eric: I have one final question.
Zach: Sure.
Eric: You used some examples that are Unix utilities like cat and grep. Those are surprisingly long-lasting. They've been around for a while. Is there any hope that we could find another set of operators that are general and useful enough that we'd be using them in 50 years?
Zach: I guess you could argue that the Lisp family of operators also fit them and in fact are longer lived than that depending on whether you talk about Common Lisp or the Ur-Lisps, right? Again, cons cells are no longer the preferred representation of data.
They're a very general form of data, but they don't map very cleanly with how the underlying hardware works. They require us to jump around memory far too often. Everything in Clojure, except for code representations uses chunked sequences.
Likewise, we can say, "Gosh, grep is super general, because it says everything is text. Even if it's just raw binary, maybe, it's text." Yes, but we're giving up a lot of freedom by going and shackling ourselves to that sort of foundational assumption.
Notably, we can only deal with flat data. You can't deal with nested data unless you go and use a command line tool, like jq, which is meant to slightly help us with that, but still, it's very painful.
Eric: You have some kind of fixed point, like with...
Zach: Right, but the point is that these sorts of things, they might be mathematically eternal, but they're not eternal in their utility, because that's a moving target. I think that we can be grateful that the guys who did the original Unix were so prescient in creating these things that are so flexible.
I think that the fact that something's lasted for a long time doesn't mean that it's retained its value over all that time. It means it's retained some value. I think that having something be long-lived just for the sake of being able to say, "This is long-lived," is not a goal.
The goal is for it to retain its value for a long time. That's a much trickier proposition as you get into the future.