Out of the Tar Pit
This is an episode of The Eric Normand Podcast.
Subscribe: RSS · Apple Podcasts · Google Play · Overcast
In this episode, I read excerpts from Out of the Tar Pit, a classic paper in the functional programming community.
Transcript
Complexity is the single major difficulty in the successful development of large-scale software systems. Following Brooks, we distinguish accidental from essential difficulty, but disagree with his premise that most complexity remaining in contemporary systems is essential. We identify common causes of complexity and discuss general approaches which can be taken to eliminate them where they are accidental in nature. To make things more concrete, we then give an outline for a potential complexity-minimizing approach based on functional programming and Codd's relational model of data. Hello, my name is Eric Normand, and today I am reading Out of the Tar Pit by Ben Moseley and Peter Marks, published February 6, 2006. This is a very influential paper in the functional programming community, and I'm going to be reading excerpts from it. You know, most papers have a problem that they are trying to address, a problem they are trying to describe, and this one is all about analyzing the principal reasons software projects fail, and they are proposing a solution. As you heard in the abstract that I just read, their take is that it is all about complexity. That's why projects fail. And they do a very good job, from first principles, of chopping up complexity: where does it come from, why do we have it, how does it affect programmers and the software. Then, from that base of understanding, they try, step by step, to derive what an ideal system would look like, a practical system for developing software with this new understanding of complexity as the major problem. Now, whether it's successful at doing that or not is, well, I guess not really up for debate. I think people have decided that the solution it proposes is not adequate, but still, that process of going back to first principles was very useful.
It does make a lot of good assertions, a lot of good analysis of why software is so complicated and what can be done about it, and it comes out pretty favorably for functional programming and also for a relational model of data. Before I get into it, I do think I need a preface, because I'm going to read excerpts and make comments on them, and because I'm making comments, I feel like I need to explain my bias. This paper came out in 2006, and I read it not long after that, probably 2008. To me, this is one of those papers that I read early enough in my career, while it was still contemporary and there hadn't been a lot of commentary on it, that it had a big influence on me and, I think, on the industry as a whole. It's something I lived through, as opposed to a paper from, say, 1972, from before I was even born, that has withstood the test of time, a paper that's still important today but has already been digested by the culture. I read this one before it had been digested, so I have a lot of biases, because I was learning as other people were learning about the paper. Another thing is that I'm writing a book that deals a lot with these issues, the issues brought up in the paper, and I've thought really deeply about them. So I might be kind of harsh, harsher than I normally would be if I were coming at this a little more fresh and bright-eyed. Any harshness is simply because I've taken these ideas that were very influential to me and to the functional programming community and gone further with them. So there are going to be things they say in the paper that I think could be said better, or that they got wrong and should say another way. I might say stuff like that because I'm dealing with exactly this: what is complexity, and why is software so hard?
It's a good paper, certainly an influential paper, but I'll critique the writing: it is very long for what it does, and it's very repetitive. The language, I think a grammarian would have a lot of trouble with the punctuation and such. I actually had to take the paper, copy-paste the sections I wanted to read, and cut out all sorts of stuff, because they use a lot of descriptive adjectives and parenthetical notes inside their sentences. I think it could be a quarter of the size if you really wanted. It's not one of these tight papers, four pages that say a ton. It's a, how many pages is it, oh, 60, a 60-page paper that probably could be 15 if you really turned the screws on it. But still, it's worth reading, and it's been a big influence, I think, on languages that have come out since, things like Clojure and Elm, and even stuff like React and the Flux architecture have definitely been influenced by this. And I think it's influenced the community in the sense that it helps explain why you should choose Haskell, or why you should choose Erlang, stuff like that. So, let's get right into it. All right, in the introduction: the biggest problem in the development and maintenance of large-scale software systems is complexity. Large systems are hard to understand. So already we see this: they're trying to define the problem, and they're calling it complexity. Unfortunately, in the paper there's no definition of complexity, but they hint very often that it has to do with difficulty of understanding. And sometimes they get very close to equating the two. So I think that's a common thing: they talk about complexity so much, but they never define it, and that's a flaw with the paper. I'm going to say it like that.
Now, the stuff that they say about complexity, because they'll name different kinds of complexity and talk about those specifically, all of that is still very valid, okay. The major contributor to this complexity is the handling of state and the burden that this adds when trying to analyze and reason about the system. Other closely related contributors are code volume and explicit concern with the flow of control. Now, we're going to go over all this later, this is just the introduction, but what you can see is that they're already starting to chop up complexity and identify where it comes from. Why are our software systems so hard to understand? Why are we over budget? Why are we over time? They're picking the problem apart along its natural seams so that you can see and address each thing separately. So they're talking about state, code volume, and flow of control; those will come up a lot. The classical ways to approach the difficulty of state include object-oriented programming and functional programming. These approaches each suffer from problems when applied to traditional large-scale systems, and they're going to describe those problems later, so I'm going to leave it there. It is possible to take useful ideas from both, that's OOP and FP, and when combined with some ideas from the relational database world, this approach offers significant potential for simplifying the construction of large-scale software. They're using the term simple, simplify the construction, and it's really unclear whether they mean the opposite of the complexity they were talking about before. I don't know if it's just a language thing, but it's a weird choice of words, because they're simplifying the construction; they're not constructing simple large-scale systems.
So I'm probably going to bring that up a lot, because if you try to read too deeply into it, you get into these kinds of ambiguities. Now, they are going to talk about complexity; it's the major theme of the whole paper. Complexity is the root cause of the vast majority of problems with software today. Unreliability, late delivery, lack of security, often even poor performance in large-scale systems can all be seen as deriving ultimately from unmanageable complexity. Being able to understand a system is a prerequisite for avoiding all of them, and of course it is this which complexity destroys. So complexity destroys the ability to understand your system, and you need to understand the system to avoid all these other problems. Here they're saying that complexity causes difficulty of understanding. This is the unfortunate truth. Simplicity is hard, but the purpose of this paper is to give some cause for optimism. So we have some optimism in this simplicity-is-hard part. I'm not going to read all of it, but they do talk about having to put in more effort to be simple. We'll see a little bit more of that. The type of complexity we are discussing in this paper is that which makes large systems hard to understand. Again, reiterating: complexity is what makes the systems hard to understand. It is the cause. Complexity causes difficulty of understanding. Now they have to talk about understanding. What do we mean? There are two widely used approaches to understanding systems: one, testing, and two, informal reasoning. Of the two, informal reasoning is the most important by far. This is because there are inherent limits to what can be achieved by testing, and because informal reasoning is always used. They're saying that informal reasoning, you've got to do it no matter what. Even when you're testing, you're doing informal reasoning, so that's more important. But about the limits of testing: this is another one of those reasoning errors.
Maybe it's just a rhetorical error, but yes, there are inherent limits to what can be achieved by testing, and there are also inherent limits to what can be achieved by informal reasoning. You've got to enumerate those limits, or don't mention them if the argument isn't about the particular limits. They're both limited is what they mean, and the more important one is informal reasoning, because when you're writing code, reading code, debugging, even planning the code, you have to understand the system using informal reasoning. It is precisely because of the limitations of all these approaches that simplicity is vital. When considered next to testing and reasoning, simplicity is more important than either. Given a stark choice between investment in testing and investment in simplicity, the latter may often be the better choice because it will facilitate all future attempts to understand the system. They're saying that simplicity will make the system easier to understand in general, whereas testing, because it's limited, only helps you understand it a little bit. I tend to agree with this, but I also think they're discounting something like TDD. The practice of TDD wasn't that widespread in 2006 when this was published. I think the interplay between reasoning and testing is much more pronounced when you're doing TDD: you are gaining a better understanding of the system and kind of formalizing that understanding into a runnable test. There's something there that's missing here. That aside, I think I agree with everything; I would tend to want a simpler system over a tested but complicated system. All right, causes of complexity. So here, this is where they start chopping it up. They're really going to get down below complexity as this big umbrella term; they're trying to get under it and figure out what exactly is happening here. Complexity caused by state.
When it comes to state, we are in broad agreement with Brooks' sentiment when he says... okay, I'm going to pause here. They're talking about Fred Brooks, who wrote a paper called "No Silver Bullet" in which he talks about complexity using the essential-versus-accidental distinction, which they use a lot in this paper. You can largely see this paper as a response to that paper from 20 years before, 1986 to 2006. So this is Fred Brooks in his paper: "From the complexity comes the difficulty of enumerating, much less understanding, all the possible states of the program, and from that comes the unreliability." We agree with this, this is back to the authors of this paper, we agree with this, but believe that it is the presence of many possible states which gives rise to the complexity in the first place. All right, let's break this down a little; it's starting to get to be too much. So Fred Brooks says, from the complexity comes the difficulty of enumerating, much less understanding, all the possible states of the program. So the program can be in a whole bunch of states, and the reason, according to Brooks, is that there's a lot of complexity in developing the software. So there are a lot of states, and it's hard to understand all those states, and that leads to unreliability. The fact that it's hard to understand the states means it's unreliable, and the reason there are so many states is because it's complicated. Now, the authors of this paper say that they believe it is the presence of many possible states which gives rise to the complexity in the first place. So it's actually the states that cause the complexity. They say that here; I think later they start calling state itself a complexity. So I really think they've got to get their story together.
I know how these papers are written: often there's a round of reviews, the paper comes back, and you have to patch it up to make it more publishable. Sometimes things slip in where you put in a sentence to try to explain something, and with a very long paper it's hard to keep everything consistent. But they're saying that the states are what causes the complexity. So Brooks got it backwards: the complexity doesn't cause the states, the states cause the complexity. Okay, and then this is Brooks again: computers have very large numbers of states. This makes conceiving, describing, and testing them hard. Software systems have orders of magnitude more states than computers do. Okay, so he's saying that computers are hard to conceive, that is, to design and build, and to describe, and to test. All these things are hard because of the number of states. They agree with that. And then this other sentence: software systems have orders of magnitude more states than computers do. I do not understand how that could be true. If you have a piece of software running on a single machine, there's no way that the number of states the computer can be in is less than that of the software running on it. At maximum, the software would consume all the memory on the system, including all the registers and such, and that's the maximum. The machine is at the maximum complexity; the software is actually much simpler than the whole computer, because the software is special-purpose. Hardware is general-purpose; it could do anything. It's a universal Turing machine. So I'm not sure why that's in there, I'm not sure why Fred Brooks said it, and I'm not sure why they're quoting it and agreeing with it. It seems like I must be missing something, like he must mean "software system" in some other sense that I don't understand. All right, let's continue; that's a minor point.
The key problem is that a test on a system or component that is in one particular state tells you, this is the authors, by the way, no more quotes from Fred Brooks. The key problem is that a test on a system or component that is in one particular state tells you nothing at all about the behavior of that system or component when it happens to be in another state. So you've got this state changing all over the place, and you've got this component that you want to test that's dependent on the current state of the system. What they're saying is: you test it in a certain state, and then the system changes state. Now, when you test it again, you don't know if it's going to work the same way it worked before. You can't rely on this component if there are so many possible states it can be in; a test can only target one state. One of the issues is the exponential rate at which the number of possible states grows. For every single bit of state that we add, we double the total number of possible states. That makes sense: every time you extend the amount of state in your system by one more variable, one more bit, just a Boolean, you're doubling the total number of possible states. Now, this of course is considering unrestricted state, right? Your language might impose some dependencies between pieces of state, so that not every combination is reachable, and in that case it wouldn't actually double. But they're talking about the worst-case scenario, and about how most imperative languages work: you add another global variable, and you've multiplied the number of possible states by at least two. Consider a system to be made up of procedures, some of which are stateful and others which aren't.
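The doubling they describe is easy to see concretely. Here's a minimal sketch (the function name `possible_states` is mine, not from the paper) that enumerates every combination of a set of independent boolean flags:

```python
from itertools import product

def possible_states(num_flags):
    """Enumerate every combination of num_flags independent booleans."""
    return list(product([False, True], repeat=num_flags))

# Each added flag doubles the state space: 2**n combinations.
assert len(possible_states(3)) == 8
assert len(possible_states(4)) == 2 * len(possible_states(3))
```

A test exercises exactly one of those tuples, which is why, as the state count grows exponentially, testing covers a vanishing fraction of the behavior.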
If the stateless procedure makes use of any other procedure which is stateful, even indirectly, then all bets are off. Our procedure becomes contaminated, and we can only understand it in the context of state. All right, so what they're saying is: you have these procedures, and some of them are going to access global mutable state, and some of them aren't. But say you have one that doesn't directly access that mutable state, which maybe isn't even global but is persistent, it follows the procedure around. If a procedure that doesn't directly have state calls a procedure that does, then it itself is going to depend on the state indirectly. I disagree with calling that original procedure stateless, but I agree with the point, which is that this use of state spreads. If you've got state at the bottom of your call graph, in the leaves, then everything above it in the call graph, consider it a tree, though it's not always a tree, is going to be dependent on that state. So there's this cascading effect of using state anywhere in your program, and that's a problem. And it's not talked about very much in imperative programming. All right, complexity caused by control. Control is basically about the order in which things happen. So they're talking about how, in most languages, you specify the order that things get calculated. If you have a variable X and you say X equals one plus B, and then you have another variable Y after that and you say Y equals two plus A, you are specifying: calculate X, then calculate Y. But if you look at the data dependencies, one plus B and two plus A, there's no relationship between them. You could execute those in any order. There's no reason why you have to do X first and then Y, right?
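That contamination is easy to demonstrate. A sketch (all names here are hypothetical, not from the paper): a function with no arguments and no state of its own still gives different answers on identical calls, because something beneath it in the call graph mutates state.

```python
# Hidden module-level state, sitting at a leaf of the call graph.
counter = {"value": 0}

def next_id():
    # Stateful: reads and mutates the shared counter.
    counter["value"] += 1
    return counter["value"]

def label():
    # Looks stateless: no parameters, no state of its own. But it calls
    # next_id(), so its result depends on hidden state anyway.
    return f"item-{next_id()}"

assert label() == "item-1"
assert label() == "item-2"  # same call, different result: the state leaked upward
```

Everything that calls `label`, and everything that calls those callers, inherits the same dependency, which is the cascading effect described above.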
Now, back to the paper: the difficulty is that when control is an implicit part of the language, then every single piece of program must be understood in context. The programmer must start with the assumption that the ordering specified is significant, and then by further inspection determine that it is not. So you have these two lines. You have to assume that whoever wrote them did this on purpose, that they put them in this order for a reason. But then you suspect maybe they didn't, and you don't want to change the order if it was there for a reason. So now you have to see for yourself: does the order matter? Maybe I want to parallelize this, run one in one thread and one in another. Now I have to use my own reasoning, read them, and figure out: ah, these can be parallelized, or no, they can't. It's an extra cognitive burden we're giving to the programmer, and it makes the code harder to understand, and that's why they're calling it complexity. Complexity caused by code volume. We believe that with the effective management of the two major complexity causes, state and control, it becomes far less clear that complexity increases with code volume in a non-linear way. So one major measure of complexity is code volume. In a given code base, as you add more code, people's understanding goes down, the number of bugs goes up, and the speed of software development goes down. And they're saying that if you manage these other things, state and control, then you wouldn't see that, or at least it would be linear. You would have a linear increase in the number of bugs, but not some kind of non-linear, exponential increase. That's really interesting. That's a bold statement.
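The X and Y example above can be sketched directly (variable names follow the transcript's example; the specific values are mine). The source order says "X first, then Y," but the data dependencies permit either order:

```python
b, a = 10, 20

# The program text specifies: compute x, then y.
x = 1 + b
y = 2 + a

# The reversed order gives identical results, because x depends only on b
# and y depends only on a. A reader can only confirm this by inspecting
# the dependencies themselves -- the language makes the order look significant.
y2 = 2 + a
x2 = 1 + b
assert (x, y) == (x2, y2)
```

That inspection, repeated for every pair of statements you might want to reorder or parallelize, is exactly the cognitive burden being described.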
Of course, they can't substantiate that with any empirical evidence, but they're basically saying that lines of code is a proxy for adding more variables and more unnecessary control-flow statements. If you could somehow manage those separately, or in a better way, then adding code would not make the complexity go up, meaning it wouldn't be harder and harder to understand as the system got bigger. Okay, other causes of complexity. Here they describe some other causes, but really they're building toward three principles, so I'll just read the three principles. Complexity breeds complexity. They're talking about stuff like duplication. So with duplication: if it is not clear what exists, or if it is not clear that what exists does exactly what is required, there will be a strong tendency to duplicate. You're looking at some code, it's hard to understand because it's complex, and you're not sure it does what you need, or maybe you can't find something that does what you need because the system is too complicated. Then you're going to write it yourself. You're going to write your own code to do it, even though maybe it already exists. Okay, the other principle we've talked about: simplicity is hard. It takes effort, and you don't always have the energy and time to do it. And then: power corrupts. The more powerful a language, the harder it is to understand systems constructed in it. A good example of this is something like machine code. Machine code is exactly as powerful as the machine, right? You can do anything the machine can do in that language. But then you restrict yourself by having a higher-level language. Say you get rid of gotos, you get rid of the ability to modify the code that is currently running. You get rid of those things, and it's actually easier to understand your software. So you're reducing power, you're restricting power, but you're making an easier-to-understand system.
They are also saying, and this is kind of one of these software development truths, that if your language does allow something, and you have a big enough team developing the software for long enough, then anything the language can do, someone will do at some point, right? They will modify the pointer; they will do some pointer arithmetic if you let them. It will happen, so just restrict it, restrict it, restrict it. That's what they're saying. Classical approaches to managing complexity, all right? They're going to go over three programming paradigms and talk about them. At the time, 2006, object-oriented programming was very dominant, so they spent a lot of time talking about why it's not good enough. From now, 14 years later, it seems like, yeah, we get it, we understand, why do you spend so much time talking about how mutable state is not good? Well, this is the paper that made it so clear to the majority of the industry, so we have to give it some credit. This section was so influential that now we think of it as commonplace, common knowledge. Okay, and it's still not common enough. All right. An object is seen as consisting of some state together with a set of procedures for accessing and manipulating that state. Common understanding: you have an object, there's state, and there are behaviors inside that object. If several of the access procedures access or manipulate the same bit of state, then there may be several places where a given constraint must be enforced. So they're talking about duplication, duplication of code: this method and that method both access the same state, so they both have to have the same code enforcing that constraint. That's a source of complexity. Strongly biased, okay: OOP is strongly biased towards single-object constraints, and it is awkward to enforce more complicated constraints involving multiple objects.
So, you have this object encapsulating the state, you have this clear API for it, and so you can maintain a relationship between the bits of state inside. But when you have two objects and you're trying to keep them in sync, you no longer have any ability in this model, right? Within an object you can keep the constraints, but once you're outside, with two or more objects, you can't. Identity and state: each object is seen as being a uniquely identifiable entity regardless of its attributes. So they're identifying this problem of identity. If you have two objects, the default is that they are equal if their references are equal, right? What that means is that two objects representing the same entity in the real world are not equal, and there's no language-level construct for determining how to compare two objects. You have to go to your domain and say, oh, well, if they have the same identification number or the same user ID, something like that, then they're the same user, but otherwise they're different. The reason that's difficult is that if you have two objects, not copies, but two objects representing the same entity in the real world, and they're being mutated, it's really hard to know which one is correct. There's no built-in way to compare them; it becomes complexity you have to add on, to be able to compare them, to be able to understand what you've got. So they say this here: it is common to start using custom access procedures to determine whether two objects are equivalent in some domain-specific sense. So you have some custom equals method that just compares the IDs, for instance. Okay, state.
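The identity-versus-domain-equality problem sketched above looks like this in practice (the `User` class and its fields are hypothetical examples, not from the paper):

```python
class User:
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name

# Default object equality is reference identity: two objects representing
# the same real-world entity compare as unequal.
alice1 = User(42, "Alice")
alice2 = User(42, "Alice")
assert alice1 is not alice2
assert alice1 != alice2  # default __eq__ falls back to identity

class DomainUser(User):
    # The "custom access procedure": domain-specific equality by user ID.
    def __eq__(self, other):
        return isinstance(other, User) and self.user_id == other.user_id

# Same ID means same entity, even if other attributes have drifted apart.
assert DomainUser(42, "Alice") == DomainUser(42, "Alicia")
```

The equality rule lives in the domain code, not the language, which is the extra complexity being pointed out.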
So there's object orientation and state: OOP suffers directly from the problems associated with state described above. Directly: it's just got mutable state. We've already talked about that; it's got this exponential number of possible states, you double the number of states every time you add one bit. Control, this is an interesting paragraph: actor-style languages use the message-passing model of concurrency. This can lead to easier informal reasoning in some cases, but the use of actor-style languages is not widespread. Okay, so now it is much more widespread to have actor-style languages, right? Erlang, since 2006, has become more prominent, as has Elixir, and even Akka on the JVM. It's hard to know for sure, but I think this paper is one of the reasons why actor-style languages have become more mainstream. It's unfortunate that they don't go into this more; if they had, there would be a deeper analysis of how good this informal reasoning actually is. But anyway, it's interesting to see the effects of this paper in its place and time. Before we leave object-oriented programming, I do have to say something about Smalltalk and complexity, Smalltalk being the original object-oriented language. While I agree with everything they say in this section that I've read, empirically Smalltalk was such a small yet powerful system. I can't say it's not complex; it does have a lot of state. But there seems to be something about Smalltalk that allowed such a small amount of code to generate an entire operating system, including the GUI and the graphics system. I think there's a lot to study there. Somehow we need a response to this idea that object-oriented systems are necessarily complex. How is it that the Smalltalk system avoided it? Or is it complex? That's a big question. I don't know Smalltalk well enough to really say, but I think it's something we ignore. It's not known that much.
If you look at the Smalltalk papers, they talk about such small systems, so few lines of code, and the window class is like 20 lines of code or something ridiculous like that. How did they do it? How did they manage that? When you look at, I'm using a Mac right now, the code to run the windows in this system, of course it's probably more capable, it handles more cases and more GUI elements and stuff, but it's not 20 lines of code. There's got to be something in there that is not being surfaced. It's in Smalltalk experts' brains somewhere, but they don't know how to say it; there's some principle they're applying that we're not all privy to. Or maybe they're saying it and we can't hear it; they're not saying it in a way that we can hear. If you're a Smalltalker, reach out. Let's figure out why. I'm going to have to go a little bit faster now. Functional programming. They basically start with functional programming and state, because they're talking about pure functional programming. State: because of the guarantees of referential transparency, it obliterates one of the two crucial weaknesses of testing. The two weaknesses of testing: first, the number of inputs is large, the space of arguments to your function is large, and a test can only test one possible input set. But then there's also the state: if your function relies on the state of the system, that space is even larger. Referential transparency at least gets rid of the state one. The state of the system is eliminated because the function is pure; all you have to worry about are the arguments. Most functional languages specify implicit sequencing, and hence they face many of the same issues mentioned above. So they still have this implicit sequencing that's hard to understand. However, they encourage a more abstract use of control, using functionals such as fold and map rather than explicit looping, and they're saying that's a benefit.
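The fold-and-map point can be shown in a few lines (a sketch in Python rather than a functional language; the data is mine). The explicit loop spells out a step order even though the order doesn't matter; fold and map state only the combining operation:

```python
from functools import reduce

nums = [1, 2, 3, 4]

# Explicit loop: the step-by-step ordering is written out, and a reader
# must verify for themselves that it is insignificant.
total = 0
for n in nums:
    total += n

# Abstract control: fold (reduce) and map say *what* to compute, leaving
# the traversal order to the implementation.
assert reduce(lambda acc, n: acc + n, nums, 0) == total == 10
assert list(map(lambda n: n * 2, nums)) == [2, 4, 6, 8]
```

Because the combining function carries no hidden order dependence, an implementation is free to restructure or parallelize the traversal.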
It's a higher-order thing: fold and map can be done in parallel, or what have you; it's implicit in that model. There are also concurrent versions of many functional languages, and the fact that state is generally avoided can give benefits in this area. For example, in a pure functional language, it will always be safe to evaluate all arguments to a function in parallel. I have to mention Haskell here. Haskell has lazy evaluation semantics, which means you don't really know the order that things are going to be evaluated in. They didn't have that in there, so I'm going to add it in. They're saying that's a good thing: you don't want to have to specify the order. Control. The fact is that we are using functional values to simulate state. The paper describes a system where you have some counter, and you need to keep the current value of the counter, and when you increment it, you need to return the new value. In a pure functional system, you can't keep that state, so you actually have to return two values, and then somehow pass one of those values back in as an argument. This is simulating state. There is in principle nothing to stop functional programs from passing a single extra parameter into and out of every single function in the entire system. So you can add an extra parameter and an extra return value, a tuple, returning two values, to every single function. There's nothing stopping you, and remember the power-corrupts problem: someone's going to do it somewhere. If this extra parameter were a collection, then it could be used to simulate an arbitrarily large set of mutable variables. In effect, this approach recreates a single pool of global variables; ease of reasoning is lost. So you can just reintroduce global variables in a functional system.
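The counter example reads like this as code (a minimal sketch; the function shape, result plus next state, is the pattern described, while the names are mine). No state is kept anywhere; the caller must thread the new value back in by hand:

```python
def increment(counter):
    # Pure: no stored state. Returns a pair of (result, next state),
    # and here the result *is* the next state.
    new_counter = counter + 1
    return new_counter, new_counter

# The caller threads the state through explicitly.
value1, state = increment(0)
value2, state = increment(state)
assert (value1, value2) == (1, 2)
```

Imagine every function in the system taking and returning one extra "world" value like `state`, and you have the single pool of pseudo-global variables the paper warns about.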
State and modularity, okay. In a functional program, you can always tell exactly what will control the outcome of a procedure by looking at the arguments supplied where it is invoked, in a pure functional program. In a stateful program, this property is completely destroyed. You can never tell what will control the outcome, and you potentially have to look at every single piece of code in the entire system to determine this information. Whoa, okay, this is a very strong statement. Look at all these words here: completely destroyed, you can never tell what will control the outcome, you potentially have to look at every single piece of code in the entire system. Okay. It's one of these statements that, I think, gives people a bad taste in their mouth when people talk about functional programming. The problem is that they're talking about the worst case. They're talking about failed software where you don't know where the bugs are coming from. You can't find them; there are bugs at every turn. Most software that people write nowadays is not like that. There are bugs, and a lot of them they don't know where they're coming from, but people's impression of their own systems is not, "Oh, I have to look at every single line of code to know what state the system is in." Statements like this are talking about the worst possible case. If you've got some state that is modified all over, then, yes, the worst case is you have to just look through the code one line at a time and figure out everywhere that modifies it, but usually state is not modified all over. The one thing that saves the sentence is the "potentially." You potentially have to look at every single piece of code. I think that sullies it a bit, you know; it's, of course, true, you potentially do have to, but in practice, we rarely do. The fact remains that such arguments have been insufficient to result in widespread adoption of functional programming.
I hope it's clear that I'm skipping a lot, and that things aren't necessarily related when I start a new quote. So we don't have widespread adoption of functional programming. I think since 2006 there's been a lot more adoption of functional programming, so maybe this paper is helping. We must therefore conclude that the main weakness of functional programming is the flip side of its main strength, namely that problems arise when the system to be built must maintain state of some kind. One potential approach is the elegant system of monads used by Haskell. This does basically allow you to avoid the problem described above, but it can very easily be abused to create a stateful, side-effecting sub-language, and hence reintroduce all the problems we are seeking to avoid, albeit, oh geez, this is where the grammar police would come in. Okay, but it is marked by its type. That's a saving grace, all right, so let me explain this. They're saying you could use monads, which basically let you maintain state. There's a thing called the state monad, and you can thread state in without having to add an extra argument to every little piece of code. The argument is implicit, but then you're just creating a side-effecting sub-language. You just have this thing in the I/O type that can basically be an imperative program. That has been my experience in Haskell: once you start using the monads, certainly I/O, once you start using I/O, you're in imperative land again and you're no longer doing functional programming. The one saving grace is that at least the type system will tell you you are in a side-effecting, imperative part of the code. And if you're not in that type, you can't do that stuff, so that's nice. Okay, logic programming. So they did OOP, they did FP, now they're doing logic programming. I'm not going to talk about this much, but it's important to understand what they mean by it.
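Since the monad discussion is Haskell-specific, here is a toy sketch in Python of just the underlying trick (all names are mine, and this is only the state-threading idea, not Haskell's actual State monad): `bind` chains two stateful actions so the state parameter is threaded implicitly rather than written into every signature.

```python
# A "stateful action" is a function: state -> (result, new_state).

def bind(action, next_action):
    # Chain two actions; the state flows through behind the scenes.
    def combined(state):
        result, state2 = action(state)
        return next_action(result)(state2)
    return combined

def get():
    # Action that reads the current state as its result.
    return lambda state: (state, state)

def put(new_state):
    # Action that replaces the state.
    return lambda _state: (None, new_state)

# Roughly Haskell's: increment = get >>= \n -> put (n + 1)
increment = bind(get(), lambda n: put(n + 1))

result, final_state = bind(increment, lambda _: increment)(0)
assert final_state == 2   # two increments threaded from 0
```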
Pure logic programming is the approach of doing nothing more than making statements about the problem and desired solutions. It offers the tantalizing promise of the ability to escape from the complexity problems caused by control. So in a pure logic programming language, you don't specify how to solve the problem. There's no control at all. It's very declarative, and so this is hinting at the solution that they're going to talk about later. Accidents and essence. All right, this is an important idea in the paper, so we have to talk about it. Essential complexity is inherent in the essence of the problem as seen by the users. So they're dividing things into essential complexity and accidental complexity, and they're drawing a much more objective line than Brooks did in his paper, where he talks about the same stuff, essential complexity and accidental complexity. They're defining complexity in a more rigorous way, and they're also making the line between essential and accidental much clearer. Essential complexity is the essence of the problem as seen by the users. Accidental complexity is all the rest, complexity with which the development team would not have to deal in the ideal world. The ideal world is where you don't have to worry about performance, you don't have to worry about network latency or thread pools or anything like that. It's all gone. One implication of this definition is that if the user doesn't even know what something is, then it cannot possibly be essential. So if you've got some programming term, like a thread pool, and the user, let's say a graphic designer using Photoshop, doesn't know what a thread pool is, and you're writing software for them, it's accidental complexity, because they don't know what it is. Okay? So what they will eventually say is that, as much as possible, we want to get rid of that accidental complexity.
We don't want that, as the programmers, to mix in with the essential stuff. The users are worried about colors and shapes and, you know, stuff that designers worry about, and we're talking about thread pools, and we're introducing all this complexity to the problem. All right, now they're getting into their recommendations: the recommended general approach. They're going to start at a high level, kind of build out a system without the details, and then they're going to go into the details in the next section. So I'll quote: "We are going to need to derive formal requirements from the informal ones." Talk to the users; they give you the informal requirements. It's not their job to be perfectly precise. So they're going to give you informal requirements, and we're going to turn them into formal requirements. "Formalization must be done with no view to execution whatsoever." So it's not coding, we're not programming, we are just describing it in an unambiguous way. We have some language that we describe it in. "Given that we are considering the ideal world, it is not unreasonable to assume that the next step is simply to execute these formal requirements directly on our underlying general purpose infrastructure." Okay, so I didn't mention this, but I should: they're talking about an ideal world, right? This is the general approach. They're saying we need to think in the ideal world, where we don't have any accidental complexity, we just have this essential complexity. So we're just not going to deal with any accidental complexity, just totally eliminate it, and just write out a specification in the ideal world. In an ideal world, you could imagine that the computer would understand the language that we wrote this spec in and would just be able to run it. This state of affairs is absolute simplicity. It does not seem conceivable that we can do any better than this, even in an ideal world.
So this is the point of doing this kind of ideal-world scenario. This is your limit, the maximum that you can reduce the system's complexity. You're not writing anything about thread pools or, you know, function calls or anything like that. You're just specifying exactly what the user wants in this concise way that the computer can understand. The very essence of declarative programming. Okay, state in the ideal world. Our main aim for state in the ideal world is to get rid of it. That's the main aim. No, they can't get rid of all of it. All data will either be provided directly to the system, that's the input, or derived. Derived data is either mutable or immutable. Now see, they're kind of just chopping stuff up and saying, "Well, let's deal with the mutable stuff this way, let's deal with the immutable stuff this way." They're kind of subdividing the world and conquering little bits at a time. All data mentioned in the user's informal requirements is of concern to the users and is as such essential. But the fact that such data is essential does not mean that it will correspond to essential state. Okay, so they're making this distinction between data and state. The state is what has to be stored. The data is just some piece of data that may or may not have to be stored. It may be possible to avoid storing some such data. Okay, now they're getting deeper. Input data: if there is a possibility that the system may be required to refer to the data in the future, the system must retain the data, and as such it corresponds to essential state. So if the informal requirements say you need to be able to refer to this data in the future, then you need to store it. It's essential because it's in the informal requirements. Otherwise, if we don't have to refer to it in the future, the data need not be maintained at all. You just throw it away. You don't need it in the future. It's input, but you don't need it.
So that's something like, you know, click here to continue. You get a mouse click as input, but you don't need to save it. Essential derived data, immutable type. Okay, so immutable essential derived data: data of this kind can always be re-derived whenever required, which means you don't need to store it. You can just re-derive it. There's some formula for deriving it from some other data. Now essential derived data, mutable. This means it's derived, but it could change at a later time. If it can change because of the user's input, then it's really input, so it's essential state; otherwise it is accidental state. Get rid of it. Accidental derived data: they give an example of a cache where, for performance reasons, you are caching the result of a long-running function. That's derived, because it comes from the function running on other data, and you're storing it, so it's state, but it's accidental because the user doesn't really care whether it's cached or not. "Cache" is not in their vocabulary. Okay, control in the ideal world. The results of the system should be independent of the actual control mechanism which is finally used. These are precisely the lessons which logic programming teaches us. So basically any control is accidental complexity. That's what they're saying. I think I would disagree with that, and I should bring this up. When they're talking about performance and control, there are oftentimes requirements that deal with that, that say this has to be so fast, or it has to be done in this particular order, and it's part of the informal requirements, and so it should be essential. I don't know how they missed that, but it's been my experience with writing software that sometimes speed is a part of the requirements. They're acting like, oh, speed is just this thing, an optimization, but it's often not. It's often one of the key requirements. Okay, theoretical and practical limitations.
What we're saying would be ideal are executable specifications. They talk about some problems with specifications in the past. The two problems are performance and ease of expression. The way that most languages have dealt with performance is by adding in control or state. For instance, Prolog lets you reorder the logic expressions and put in cuts and things, and that lets you control how things get executed. The problem is that now you're introducing this accidental stuff into the specification, and they would like to keep that accidental stuff out. They recognize it's important, you do need to deal with performance, but you should keep it separate. And then ease of expression: sometimes things are easiest to express with state. That's just often the case, like, oh, I store this thing and I'm going to change it later in this part of the algorithm. And so when you have to do it without state, it becomes this really convoluted thing, and that's another problem, and they're going to deal with it, again, by separating that out. Okay, so recommendations. They have this other principle where, when they can't avoid the complexity, they're going to separate it out. So separate the different types of complexity out and don't let them combine, because then the complexity explodes even more. The overriding aim must be to avoid state and control where they are not absolutely and truly essential. That's the main thing: if you don't need it, get rid of it. But then there are unavoidable parts. Unavoidable accidental complexity must be separated out from the rest of the system. For required accidental state, we should restrict ourselves to simply declaring what accidental state should be used and leave it to a completely separate infrastructure to maintain it. They just want this thing to be done for them.
They specify it and then it's just done for them. We just declare it and it's done for us, sort of like a database, you know: you just declare, I want this table, and it's just done for you. So, separating out all complexity of any kind from the pure logic of the system, this could be referred to as the logic-state split. They're splitting the logic from the state. Further dividing the complexity which we do retain into accidental, sorry, grammar again. So they're dividing the complexity they do retain into accidental and essential. This could be referred to as the accidental-essential split. So imagine there's one vertical line separating the logic and the state; that makes two rectangles. Those are the essential stuff. And then off to the side, they're adding this accidental stuff, okay, and that's their diagram. We do not consider there to be any essential control. I've addressed that already; I think there might be some essential control in informal requirements. The system should still function completely correctly if the accidental but useful bits are removed. So you get rid of the accidental stuff and it should behave the same, which gives you a nice way to test the system, right? At least you can test that your accidental stuff is correct: you run the essential alone, and then the essential plus the accidental, and you see if they produce the same results. A consequence of separation is that the separately specified components will each be of a very different nature, and as a result, it may be ideal to use different languages for each. These languages would each be restricted to their specific goal. The weaker the language, the more simple it is to reason about. This is a very cool idea: once we've separated these things out, we can have a very restricted, different language for each part.
We're so used to writing one language for everything, when there might be something to the idea, like in a database, where the query language and the language for creating tables and stuff is different from the language we use in our everyday coding. It's more restricted, and you can only deal with the data. It's an interesting idea because it lets you have a much less powerful language, and the power comes from the interaction between them. It will be necessary to specify what the essential state can be, what must always be logically true, and finally what accidental use can be made of state and control. Again, they're restating this three-box system that they've built. You've got the essential state; what is logically true about that state, that's your logic; and then what accidental use can be made of state and control. What caching there is, what threading model you're using, all of that is in the accidental stuff. Here are the three boxes. Essential state: it can make no reference to either of the other parts, like how your database doesn't know about your code. I think there's a lot to this that we do all the time now. Essential logic, also known as business logic: it wouldn't make sense for the logic to be able to change the state in any way, and it will make no reference to any part of the accidental specification. The essential logic is saying what must be true about the state at all times, so when the state changes, the logic has to check it and constrain it. And then there's the accidental state and control, which we've talked about. Then they go into the relational model; I'll just read a few pieces here. It is an elegant approach to structuring data, a means for manipulating such data, and a mechanism for maintaining integrity and consistency of state. So notice, they're breaking it up along the same lines they had before.
There are a few parts. Structure: the use of relations as the means for representing all data. Manipulation: a means to specify derived data, so how do you specify the data you need from the data you have. And integrity: a means to specify certain inviolable restrictions on the data; this is maintaining constraints on it. And then, oh, there's a fourth thing, data independence: a clear separation is enforced between the logical data and its physical representation. This is very much like a SQL database, where there are different database engines that lay out the data in different ways. There are B-trees, there are journals, there are all sorts of different ways of storing it and making it efficient to access, but you just write a CREATE TABLE statement, and you don't have to worry about any of that when you're describing the table. All right, now they're going to describe functional relational programming, which I think is their own thing. The approach of functional relational programming, FRP, derives its name from the fact that the essential components of the system are based upon functional programming and the relational model. FRP is currently a purely hypothetical approach to system architecture that has not in any way been proven in practice. Very good statement to make in this paper: very difficult, but once you make it, you feel better about yourself. This has not been tested in practice. In FRP, all essential state takes the form of relations, and the essential logic is expressed using relational algebra with pure user-defined functions. So the relations describe the state, like a type, and then the essential logic is expressed using relational algebra to derive new data and to express restrictions, and sometimes you're going to need user-defined functions.
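To make the relational part concrete, here's a sketch in plain Python (the helper names and the tiny dataset are mine, and real relational algebra is richer than this): a relation is a set of tuples, and derived data is expressed with selection, projection, and join rather than with bespoke data structures.

```python
# Relations as sets of tuples: (name, department) and (name, salary).
employees = {("alice", "eng"), ("bob", "sales")}
salaries = {("alice", 100), ("bob", 90)}

def select(relation, predicate):
    # Selection: keep only the rows matching the predicate.
    return {row for row in relation if predicate(row)}

def project(relation, *indexes):
    # Projection: keep only the named columns (by position).
    return {tuple(row[i] for i in indexes) for row in relation}

def join(rel_a, rel_b):
    # Natural join on the first column of each relation.
    return {a + b[1:] for a in rel_a for b in rel_b if a[0] == b[0]}

# Derived data: the salaries of engineers, specified declaratively.
eng_salaries = project(
    select(join(employees, salaries), lambda row: row[1] == "eng"),
    0, 2)
assert eng_salaries == {("alice", 100)}
```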
That's stuff like you're going to need to sum and count things, and relational algebra does not have that stuff. Okay, now they're going into the architecture, I'm really close to the end now. They're basically now getting more detailed, it's the same structure that we've already talked about, but they're trying to get closer and closer to actually being able to implement. So, now there's four boxes, there's the three boxes they had before, and then one more. So the essential state, a relational definition of the stateful components of the system. Before they just said, that's where your state goes. Now they're saying it's relational, okay, relational definition of the stateful components of the system, essential logic, derived relation definitions, integrity constraints, and pure functions, and accidental state and control, a declarative specification of a set of performance optimizations for the system. Okay, now this is the other box, this is the fourth one, a specification of the required interfaces to the outside world. These may well be defined at least partially in a traditional imperative way if custom interfacing is required. This is your input and output. This is where you're saying this is how the data comes into the system, these are API calls I need to make. Okay, so they're kind of just relegating all the hard parts to this other box. One of the reasons why the approach kind of doesn't work, they're basically saying here's all the easy parts, the state and the logic, we're just going to do those in a very easy way, and then all the hard parts are just going to use an imperative language to do. Okay, so benefits, benefits for data abstraction. The end gets very repetitive, so I'm going to be skipping a ton here, and I'm not going to go into the code where they actually try to explain a lot of the stuff that they did. Unneeded data abstraction actually represents another common cause of complexity. 
Why are you bringing this up in, like, the last chapter of your paper? It's really common; it should have been in the first part, where you're talking about complexity. Anyway. What they're basically saying is that there's this other thing where people basically do data modeling wrong, and there are two problems. Subjectivity: groupings which make sense for one purpose will inevitably differ from those most natural for other uses, yet the presence of preexisting data abstractions all too easily leads to inappropriate reuse. So when you're programming and you have some task, you're going to make a data type for that task, and it's going to be perfect for that task. Then you have another task, and the type is not quite perfect, but it's already made, so you're going to try to reuse it; maybe you're going to augment it so that you can do this new task with it, but that just makes the data representation more complicated. They're saying that with the relational model, because the data is specified in this declarative, relational, pure way, you can always add the optimizations and the derived data on later, but the data itself is not complicated by all that other stuff, all the potential uses you're going to make of it. All that stuff is subjective: how are you going to use it? Okay, and then data hiding, the other problem with data abstraction. Data abstractions will often cause unneeded, irrelevant data to be supplied to a function, and the data which does get used is hidden at the function call site. So you've got this big type, a big data structure, and you need one little piece of it, but you have to pass in the whole thing, because that's where the piece lives, and you're making these subjective calls like, oh, maybe I'll need it later, whatever. So what happens is you pass in this big thing, and now it's unclear what this function actually needs.
It actually needs these two pieces of data that are buried in this structure, but we're passing in the whole structure, so it makes things more complicated. It's hiding the data you need inside the data you don't need. Again, with the relational model and derived data, you can specify exactly what data you need, and it's very easy to do, so you're not doing unnecessary data abstraction. Other benefits, and this is them talking now: one team could focus on the accidental aspects of the system, one on the essential aspects, one on the interfacing, and another on providing the infrastructure. I think this is pretty naive. The idea that a feature would be distributed among these four teams, one for each box that they had, just seems like something you wouldn't want. At the very minimum, maybe you'd have an expert in each of the boxes, and the four of them would come together as a cross-functional team to work on a feature. Still, I think it's problematic. Just look at the direction of the industry: we're going from a system where you have your ops people and your programmers, and the programmers pass their code to the ops people, and then it's the ops people's job to just run it, toward the programmer being able to run everything themselves, because you want development to be as much like the production environment as possible. Same with the DBA: the DBA is going away, programmers create their own databases. So really, I don't see this happening. It's an interesting thought experiment, though. They mention types. There's a section called types. I think they don't mean types; they mean more like data structures, but I'll permit it. FRP permits the creation of disjoint union types, but does not permit the creation of new product types.
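The data-hiding complaint from a moment ago can be sketched in Python (the order record and field names are invented for illustration): the first function takes a whole record, so the call site hides which fields it actually uses; the second names exactly the data it needs.

```python
order = {"id": 17, "customer": "alice", "qty": 3, "unit_price": 5}

def total_opaque(order):
    # Which fields does this use? You must read the body to find out;
    # the call site just shows the whole structure going in.
    return order["qty"] * order["unit_price"]

def total_explicit(qty, unit_price):
    # The dependency is visible right at the call site.
    return qty * unit_price

assert total_opaque(order) == 15
assert total_explicit(order["qty"], order["unit_price"]) == 15
```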
Like before with data abstraction, they were talking about how unnecessary data abstraction causes complexity: these new product types group data together, kind of just out of laziness, and it's difficult to ungroup them when you need them ungrouped, when you need pieces, so they're just going to totally disallow new product types. That's also mitigated by the fact that they have these relations, which are sets of tuples, and tuples are product types. So they do have a way to create groups of data that are related to each other. That's all I wanted to say about the paper. A super influential paper. Like I said, I was being kind of harsh with it because I've thought a lot about these ideas and they've been influential on me, so I've evolved past them. But rereading it, I was struck by how this is the source of a lot of these ideas, especially stuff like the one-way data flow model in a React application, where you have your data, you're deriving data from that, and eventually that becomes projected onto the DOM by the view. That idea is straight out of this paper. This idea of separating the state from the logic, and then the view, and then the idea that you have your input coming back into the state, is kind of a compromise on what they have, because I don't think this paper really deals with interactive systems very well, so they had to patch it, but the main idea is there. I think that there's a lot to this paper. You should read it. I want to do two things. I don't know if I'll actually do them, but I want to. One is to talk about Fred Brooks' paper, because I think they don't do it justice. When I first read this, I was like, "Oh yeah, Fred Brooks, he was totally wrong," but I've since reread Fred Brooks, "No Silver Bullet," and he's not saying what they say he said. So I think he has a good point, Fred Brooks does.
And then finally, I want to give what I would consider a more analytical take. I feel like they gave short shrift to the definition of complexity, and I want to attempt a response to that, to really nail down what complexity is. Okay, thank you for listening. I like to read these papers and comment on them because I think that papers, especially older papers, aren't talked about enough in our industry. We don't know our history, where we came from, and people don't seem to read them, but maybe they'll listen on a podcast. So here I am. If you want to hear this stuff, go ahead and subscribe. You can go to lispcast.com/podcast, and you'll find links to subscribe there. Also, if you have any questions or requests, I'd love to hear them. Well, my name is Eric Normand. This has been my thought on functional programming. Thanks for listening. Take care.