On the criteria to be used in decomposing systems into modules

In this episode, I read from David Parnas's important paper on modularity, On the criteria to be used in decomposing systems into modules.

Transcript

Eric Normand: It is almost always incorrect to begin the decomposition of a system into modules on the basis of a flow chart. We propose instead that one begins with a list of difficult design decisions or design decisions which are likely to change.

Each module is then designed to hide such a decision from the others. Since in most cases design decisions transcend time of execution, modules will not correspond to steps in the processing.

Hello, my name is Eric Normand. Welcome to my podcast. Today I am going to be reading and commenting on a memo from Carnegie Mellon University from 1971, "On the Criteria to Be Used in Decomposing Systems into Modules" by David Parnas.

This is a rather important paper or memo. It was later polished up and published in Communications of the ACM, but this is the original, and so I thought I would read that.

It's one of these typewriter papers from 1971. The reason I like this paper is because people talk about it all the time. They say, "Oh, back in 1971, David Parnas told everybody what modularity was all about and how we should do it." People are still citing this paper, even as recently as last week, as the definitive word on modularity, and apparently, we're still not doing it.

It's still not widely known or understood, so I thought I would reread this paper. I read it years ago, and now that I'm doing this podcast, here I go. A couple of notes on this: it's not the best-written paper, and he's not my favorite author, which means you have to read it a few times to really understand what's going on.

One example of what I'm talking about is that I had to go all the way to the conclusion to find something to read at the beginning as a summary. I like to summarize the point of a paper up front.

I feel like a better paper would have a statement at the beginning of what you're supposed to take from it. This one uses a more mysterious approach like, "Oh, we're going to talk about how to decompose into modules and not tell you how." [laughs] They're going to tell you that they're going to tell you.

You don't get to the actual point of it until much further in. That said, the ideas are still good. We're going to go through it. Here we go. This is from the abstract. Again, it's called "On the Criteria to Be Used in Decomposing Systems into Modules."

Abstract: "This paper discusses modularization as a mechanism for improving the flexibility and comprehensibility of a system while allowing the shortening of its development time. The effectiveness of a modularization is dependent upon the criteria used in dividing the system into modules."

The first thing we see is that the first sentence talks about three benefits of modularization: improving the flexibility, improving the comprehensibility, and shortening the development time.

Flexibility probably means the ability to change over time; comprehensibility is for programmers to understand what's going on; and then shortening development time, obviously. He's hinting here that there are different ways to break a system into modules, and they're not all the same. It's not just about modularity; you have to break your modules up properly or you're not going to get those benefits.

From the introduction: "If programmers sang hymns, some of the most popular would be hymns in praise of modular programming. However, nothing is said about the criteria to use in dividing the system into modules."

"This paper will discuss that issue and suggest the type of criteria that should be used in decomposing the system into modules." It's still the introduction, so we have to give some leeway, but already at the end of the second page we don't have any real information.

Reading between the lines, I get the sense that he's a little hesitant to bring it up at first, that he thinks this might be controversial. Maybe he wants to ease into it and discuss modularity in general first. Maybe that's why we're still reading it and haven't internalized it.

I should also say: these kinds of criteria are an important aspect of how we model and build our systems, and they're something I would like to capture in a form that's much more concrete and actionable.

If I'm critiquing the presentation, we do have to remember two things. It is from 1971. A lot has happened since: we've had the object-oriented revolution, where each object is potentially a different module. Objects are very modular. We have a totally different understanding of how modules can exist in software.

Also, I'm critiquing it from the point of view of, "Well, how would I write it in my book?" I disagree with the approach; I think you should start off with a much stronger statement.

Let's continue. He starts with a brief status report: "The major progress in the area of modular programming has been the development of coding techniques and assemblers which allow one module to be written with little knowledge of the code used in another module and allow modules to be reassembled and replaced without reassembly of the whole system.

"This facility is extremely valuable for the production of large pieces of code. Its use has not resulted in the expected benefits. In fact, the systems most often used as examples of the problems involved in producing large systems are themselves highly modularized programs which make use of the sophisticated coding and assembly techniques mentioned above."

He's talking about how there were all these advances in assemblers. I imagine what he means is linking: linkers that let you compile different pieces of your program separately. The compiled pieces have symbols embedded in them in a standard way, so that a linker can come in later and say, "Oh, this function, I'll get it from this module. That function comes from that module."

You can put them together in different ways. We take those for granted today and often we don't really replace our modules. You're not going to say, "I'm going to completely swap out this one C-file for another C-file." Maybe you can modify that C-file and not have to recompile everything else, because you haven't changed the function signatures and stuff.

This last statement, that the systems most often used as examples of the problems involved in producing large systems are themselves highly modularized programs, is saying that even the go-to examples of problem systems are already modularized. They're all broken into little pieces that in theory could be swapped out, and they still exhibit all the problems.

Next section, expected benefits of modular programs. Before I go on, he's saying it's still a problem. Modularization has not solved the problem. Having all these tools to allow modularization has not solved the problem. We still have the problem.

Expected benefits of modular programming. "The expected benefits of modular programming fall into three classes. One, managerial. Development time could be shortened because separate groups would work on each module with little need for communication." That's something we still talk about today.

For instance, in microservices, you work on your service, I'll work on my service. As long as the API is well understood and isn't changing, we don't have to talk. I just call your API and you call mine, and we're all good.

"Two, product flexibility. It was hoped that it would be possible to make quite drastic changes or improvements in one module without changing others." Drastic changes. Not only just change the code in one, but also maybe swap it out for something completely different. It's also one of the benefits we hear about in microservices today — that you can update the services independently.

Sometimes that's true. In my experience, it's not always true. There are many systems where, even though they're modularized into microservices, a single change still requires touching multiple services.

"Three, comprehensibility. It was hoped that the system could be studied a module at a time with the result that the whole system could be better designed because it was better understood." Again, same kinds of things we hear about — that you could understand the one service at a time, one module at a time, and have an abstraction barrier.

You'd have some line, some limit beyond which you wouldn't need to understand how the system worked. You could just look at the API. This stuff hasn't changed, and I think that's telling. We still have these problems. We're still making each other these promises. Are we really seeing the benefits? That's an open question.

The next section. "What is a modularization?" Trying to define a term. "In this context, module is best considered to be a work assignment unit rather than a subprogram. The modularizations are intended to describe the design decisions, which must be made before the work on independent modules can begin."

That's an interesting way to put it. He's not saying, for instance, a module is a class or a module is a microservice. It is a work assignment: some use case or high-level description that you could give to a team to work on. He says that you need to have all the design decisions worked out. The modularization is about where to cut it up.

All those design decisions about where to cut it up have to be made before you can start working on the modules. That's like all the API interfaces: how are these going to talk to each other? All that has to be worked out. That is what he's calling the modularization.

Now he's going to go over a couple of examples. I'll do the first one. He's going to take the same example and break it up in two different ways, two different modularizations. The first system is called a KWIC (key word in context) index production system. I'd never heard of KWIC, but I'll read a short description of it.

"The quick index system accepts an ordered set of lines. Each line is an ordered set of words, and each word is an ordered set of characters. Any line may be circularly shifted by repeatedly removing the first word and adding it to the end of the line. The quick index system outputs a listing of all circular shifts of all lines in alphabetical order."

All right. It's some indexing system that lets you quickly look up stuff. I'm not sure. He doesn't go deeper into why you would need this system, but he continues, "This is a small system. Except under extreme circumstances, such a system could be produced by a good programmer within a week or two.

"Consequently, it is a poor example in that none of the reasons motivating modular programming are important for this system." The main reasons are size, comprehensibility.

It's already small enough that you wouldn't really need to break it into modules, but he continues, "Because it is impractical to treat a large system thoroughly, we should go through the exercise of treating this problem as if it were a large project." Now, the first modularization, and here's the hint: this is the bad way to do it.

"Modularization one." He breaks it into several modules. Module one is input. Module two is circular shift. Module three, alphabetizing. Module four, output. Module five, master control. "It should be clear that the above does not constitute a definitive document." He describes each of the modules in a paragraph, "Much more information would have to be supplied before work could start."

Remember the definition that he gave of modularization. It's all the work that has to be done before you can start work on an individual module. He's saying you need a lot more information. "Only when all of the interfaces between the four modules had been specified could work really begin. This is a modularization in the same sense meant by all proponents of modular programming.

"The system is divided into a number of relatively independent modules with well-defined interfaces. Each one is small enough and simple enough to be thoroughly understood and well programmed. Experiments on a small scale indicate that this is approximately the decomposition which would be proposed by most programmers for the task specified."

He gives a picture. The picture is basically that you have your input module, which passes to the shifting module; then you have a sorter that sorts alphabetically; and then it passes to the output. It's very much a straight step one, step two, step three, step four. Each step is one of the modules.

He talks about these small-scale experiments. I bet he asked some of his students to break this up. That's all, just an informal study to see what people would do.

I imagine I would break it up in this same way myself, taking the description of the problem as a process: you've got to input some lines, generate all of the circular shifts, sort them, and then send them to the printer. Maybe you have one module that orchestrates all of that. It's a straight flow from input to output, just like a transformation.
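Here's roughly what that first modularization looks like as code. This is a hypothetical Python rendering, mine and not Parnas's: one module per processing step, with every step sharing knowledge of the in-core representation, a list of word lists.

```python
# A rough sketch of modularization one: one module per processing step,
# all of them sharing the same in-core representation of lines.

def input_module(text):                # Module 1: read lines into core
    return [line.split() for line in text.splitlines()]

def circular_shift_module(lines):      # Module 2: generate all shifts
    return [line[i:] + line[:i] for line in lines for i in range(len(line))]

def alphabetizer_module(shifts):       # Module 3: sort them
    return sorted(shifts)

def output_module(shifts):             # Module 4: print the listing
    for shift in shifts:
        print(" ".join(shift))

def master_control(text):              # Module 5: run the steps in order
    output_module(alphabetizer_module(circular_shift_module(input_module(text))))

master_control("the quick index\nparnas on modules")
```

Notice that every module handles the bare list-of-word-lists directly. The storage decision is spread across all of them, which is exactly what comes back to bite us below.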

Now, here's the second way to modularize it: module one, line storage; module two, input; module three, circular shifter; module four, alphabetizer; module five, output; module six, master control. There's one more module than in the first one. Again, you have to go back to his warning that this is already a very small system.

On the surface, the only thing that's different is that there is a module called line storage. The others are all the same. He gives a paragraph on each module, which I won't read through.
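To give a feel for that extra module, here's a sketch of a line storage interface in Python. The names are mine, not Parnas's; the point is that this module alone owns the representation of lines.

```python
# A sketch of the line storage module: it alone knows how lines are
# represented; every other module goes through these methods.

class LineStorage:
    def __init__(self):
        self._lines = []                  # the hidden representation

    def add_line(self, words):
        self._lines.append(list(words))

    def line_count(self):
        return len(self._lines)

    def words(self, index):
        return list(self._lines[index])   # callers never see _lines itself

storage = LineStorage()
storage.add_line(["the", "quick", "index"])
print(storage.words(0))
```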

Next section: comparison of the two modularizations. "Both schemes will work. Both will reduce the programming to the relatively independent programming of a number of small, manageable programs. We must, however, look more deeply in order to investigate the progress we have made towards the stated goals." Remember the stated goals: reduced development time, improved comprehensibility, and more flexibility.

"I must emphasize the fact that in the two decompositions I may not have changed any representations or methods." So it might be the same data structures, the same functions that are running. "It is my intention to talk about two different ways of cutting up what may be the same object." It's getting confusing here.

"A system built according to decomposition one would conceivably be identical after assembly to one built according to decomposition two." The differences between the two systems are in the way they're divided into modules: the definitions of those modules, the work assignments, the interfaces, and so on.

All right, I'm going to pause right here. He's saying, remember, modularization is about the work assignments: how do we tell people what to work on, and how do we define the boundaries between one person's module and another person's module so that they can communicate? Which is an interesting way to look at it, right?

It's not saying, "Oh, your class has to have this kind of module boundary as opposed to that kind of module boundary." It's more about how to write the tickets so that people can work independently and have this flexibility where you can change one without changing another.

I'll continue. "The algorithms used in both cases might be identical. I claim that the systems are substantially different, even if identical in the runnable representation. This is possible because the runnable representation is used only for running. Other representations are used for changing, documenting, understanding, etc."

"In those other representations, the two systems will not be identical." He's making a distinction that I think we've come to terms with in 2021, 50 years later, which is that the runtime of the system is very different from the code that you read. The code has many purposes besides being compiled: it's also read, maintained, modified, and sometimes even serves as documentation.

He's spending quite a lot of time on this point. Back in 1971, it might have been less well understood than it is today.

Now in our modern time, we're very used to talking about, for instance, refactoring, where you are essentially leaving the behavior of the system the same but moving code around, changing names, introducing new module boundaries, that kind of thing. To qualify as refactoring, you can't change the behavior.

It's purely in the code, for understandability, maintainability, flexibility. To our modern eye, this much explanation seems like overkill. All right, now he's going to talk about changeability.

"There are a number of design decisions which are questionable and likely to change under many circumstances. A partial list: one, input format. Two, the decision to have all lines stored in core." This is core memory; maybe you want to have some on tape or disk or something, not just core. "Three, the decision to pack the characters four to a word."

Character encoding, we might want to change that. "Four, the decision to make an index for the circular shifts rather than actually store them as such." Do you want to store the circular shifts themselves, or do you just need an index? You might be able to say, "Oh, this circular shift starts with this word of that line." (I'll sketch this one in code after the list.)

"Five, the decision to alphabetize the list once, rather than search for each item when needed, or partially alphabetize as is done in Hoare's FIND."
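Decision four is easy to picture in code. Here's my sketch, with illustrative names, of a circular shifter that keeps only an index of (line number, starting word) pairs instead of materializing every shift:

```python
# Sketch of design decision four: represent each circular shift as a
# (line number, starting word) pair rather than as a stored copy.

class IndexedShifts:
    def __init__(self, lines):             # lines: list of word lists
        self._lines = lines
        self._index = [(l, k)
                       for l, line in enumerate(lines)
                       for k in range(len(line))]

    def count(self):
        return len(self._index)

    def shift(self, s):
        l, k = self._index[s]
        line = self._lines[l]
        return line[k:] + line[:k]         # reconstructed on demand

shifts = IndexedShifts([["the", "quick", "index"]])
print([shifts.shift(s) for s in range(shifts.count())])
```

Store the shifts outright or keep just the index: either way, that choice is exactly the kind of decision to hide inside one module.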

"It is by looking at changes such as these that we can see the differences between the two modularizations. The first change is, in both decompositions, confined to one module." The first change he's talking about is the input format. "The second change would result in changes in every module for the first decomposition."

The second change is the decision to have all lines stored in core. He's saying that if you wanted to change that, to go from an in-memory representation to an on-disk or database representation, you would have to go and touch every module, because you didn't separate out storage. Remember, in the second modularization, line storage was its own module.

This was a way of taking all those things that might change, all those decisions that you had to make, and moving them, isolating them into a module. The other things don't have to change. In the first decomposition, the format of the line storage in core must be used by all of the programs.

"In the second decomposition, the story is entirely different. Knowledge of the exact way that the lines are stored is entirely hidden from all but module one. Any change in the manner of storage can be confined to that module." I just said that.
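Picking up the LineStorage sketch from earlier, here's what that confinement buys you: a hypothetical storage module with the same interface but a completely different representation drops in without touching any other module. DiskLineStorage and its one-line-per-row file format are my inventions for illustration.

```python
# Hypothetical drop-in replacement for the earlier LineStorage sketch:
# same interface, but lines live in a file instead of in core. No other
# module has to change.

class DiskLineStorage:
    def __init__(self, path):
        self._path = path
        open(path, "w").close()            # start with an empty backing file

    def add_line(self, words):
        with open(self._path, "a") as f:
            f.write(" ".join(words) + "\n")

    def line_count(self):
        with open(self._path) as f:
            return sum(1 for _ in f)

    def words(self, index):
        with open(self._path) as f:
            for n, line in enumerate(f):
                if n == index:
                    return line.split()
        raise IndexError(index)
```

Under the first decomposition, the same change would ripple through input, shifting, alphabetizing, and output.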

He goes on to talk about how there's only really that one difference, that one module extra. That module contains a bunch of design decisions that would be smeared out in all the other modules. That makes all the difference.

Now, he's got a section called "The Criteria." He's finally getting to it, on page 17. Remember, this paper is called "On the Criteria to Be Used in Decomposing Systems into Modules." We're finally there. We got to it.

We're going to find out what the criteria are. "In the first decomposition, the criterion used was: make each major step in the processing a module. One might say that to get the first decomposition, one makes a flow chart." Nowadays, we would call this functional decomposition or procedural decomposition.

You look at the program as a procedure, and you break it into subprocedures, and then you break those subprocedures into more subprocedures, until finally, you have something small enough to code. It's a very common way of programming.

"The second decomposition was made using information hiding as a criterion. The modules no longer correspond to steps in the processing. The line storage module, for example, is used in almost every action by the system. Every module in the second decomposition is characterized by its knowledge of a design decision which it hides from all others."

"Its interface or definition was chosen to reveal as little as possible about its inner workings." This, nowadays, is called volatility decomposition. There's this great book called "Righting Software," which to my mind gives the best description of the comparison between these two types of decomposition. In the first, functional decomposition, you're asking, "What are the different functions that I have to do?" Let's take my house, which is the example Righting Software uses.

I cook food in the kitchen, I sleep in the bedroom, and I use the restroom in the bathroom. You find all these different functions, and you break them into different rooms, different modules.

In Righting Software, he also talks about volatility decomposition, which is the same as the second criterion here. In that case, what you're doing is trying to find the things that might change. That's what volatility means. He goes on to explain that you might not notice this, but there is, for instance, a lot of volatility in each room.

The furniture can change very quickly. The appliances that are plugged in can change. You have module boundaries between the appliances and the electricity running through the house: that's the plug. It can unplug and plug in easily. You can swap out different appliances and move appliances from room to room.

You can imagine if you said, "Well, the kitchen is for cooking." You hard-wired all the cooking appliances in there. That's not very modular, but that's how we code today when we do functional decomposition. We're not thinking about what could change quickly.

This idea of hiding the knowledge of the decisions is the correct criterion for decomposing a system. It's not about cooking and sleeping. Those are your functions. A room in a house could change function. I'm working. This is my office. I'm in my house, but this is technically a bedroom. It's not for sleeping right now.

All right, I'm not going to go over the second example. It's, I don't know, not as interesting. He's going to talk about hierarchical structure. You can read the example. We can find the program hierarchy in the sense illustrated by Dijkstra in the system defined according to decomposition two.

If a symbol table exists, it functions without any of the other modules. Hence, it is on level 1 at the bottom because it's not referring to any other module. Line storage is on level 1 if no symbol table is used, or on level 2 otherwise. Input and circular shifter require line storage for their functioning.

Output and alphabetizer will require a circular shifter. Since circular shifter and line holder are in some sense compatible, it would be easy to build a parameterized version of those routines, which could be used to alphabetize or print out either the original lines or the circular shifts.

"In other words, our design has allowed us to have a single representation for programs which may run at either of two levels in the hierarchy." Not great writing; a picture would have made this much clearer. He's talking about how you can analyze the system, look at each module's dependencies, and build a hierarchy.
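Here's my reading of that "single representation at either of two levels" point, sketched in Python with illustrative names: because the line holder and the circular shifter expose the same word-access interface, one parameterized alphabetizer serves both.

```python
# My sketch of the "two levels" point: the line holder and the circular
# shifter present compatible interfaces, so one parameterized alphabetizer
# (or printer) works on either.

def alphabetize(source):
    """source: any object with count() and words(i) returning a word list."""
    order = sorted(range(source.count()), key=source.words)
    return [source.words(i) for i in order]

class LineHolder:
    def __init__(self, lines):
        self._lines = [line.split() for line in lines]
    def count(self): return len(self._lines)
    def words(self, i): return self._lines[i]

class CircularShifter:
    def __init__(self, holder):
        self._shifts = [holder.words(i)[k:] + holder.words(i)[:k]
                        for i in range(holder.count())
                        for k in range(len(holder.words(i)))]
    def count(self): return len(self._shifts)
    def words(self, i): return self._shifts[i]

lines = LineHolder(["b a", "a c"])
print(alphabetize(lines))                   # level one: the original lines
print(alphabetize(CircularShifter(lines)))  # level two: the circular shifts
```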

The stuff that doesn't depend on anything goes at the bottom. Then, as a module depends on other modules, it goes up a level. That creates this hierarchy with the master control up at the top. "In discussion of system structure, it's easy to confuse the benefits of a good decomposition with the benefits of a hierarchical structure."

"We have a hierarchical structure if a certain relation may be defined between the modules or programs. That relation is a partial ordering. The relation we are concerned with is 'uses' or 'depends upon.' It is conceivable that we could obtain the benefits that we have been discussing without such a partial ordering, e.g., if all the modules were on the same level."

OK, he's saying that modularization does not imply a hierarchical structure. These are two separate things that give different benefits. The partial ordering, the hierarchical structure, gives us two additional benefits. "First, parts of the system are benefitted (simplified)..." [laughs] He uses that parenthetical. It's hard to read aloud. All right.

"First, parts of the system are simplified because they use the services of lower levels. Second, we are able to cut off the upper levels and still have a usable and useful product. For example, the symbol table can be used in other applications. The line holder could be the basis of a question-answering system."

If we had designed a system in which the low-level modules made some use of the high-level modules, we would not have the hierarchy. We would find it much harder to remove portions of the system, and level would not have much meaning in the system.

This reminds me a lot of stratified design. I'll refer you to the episode where we talk about "Lisp: A Language for Stratified Design," in which I go over the paper that defines that term best. He's talking about how this hierarchy is different from the decomposition.

Because these modules at the bottom are not dependent on other stuff, they are reusable. He's talking about reuse. We talk a lot about reuse in stratified design. If you didn't have that stratified structure, let's say you had a generalized graph, where this module talks to that one, and they talk in cycles and stuff.

It would be much harder to, like, pull one out and replace it with another one. All right, I'm going to read the conclusion now, then we'll discuss a little more. It is almost always incorrect to begin the decomposition of a system into modules on the basis of a flowchart.

We propose instead that one begins with a list of difficult design decisions, or design decisions which are likely to change. Each module is then designed to hide such a decision from the others. Since in most cases, design decisions transcend time of execution, modules will not correspond to steps in the processing.

I feel like I skipped a section and I don't know what happened. I see. He's got a section on efficiency. I did skip it. I skipped quite a lot, because a lot of the pages had nothing but code, and I accidentally skipped this one. Sorry, I apologize. He's talking about efficiency and implementation.

If we're not careful, the second decomposition will prove to be much less efficient. If each of the functions is actually implemented as a procedure with an elaborate calling sequence, there will be a great deal of such calling due to the repeated switching between modules.

"To save the procedure call overhead yet gain the advantages that we have seen above, we must implement these modules in an unusual way. In many cases, the routines will be best inserted into the code by an assembler." This is something that we talked about in another episode, specifically about "Lambda: The Ultimate GOTO." Back at this time, in the early '70s, function calls were expensive.

You would have these really long functions, because the idea of breaking something into small functions that all called each other was considered wasteful. Function calls were expensive; they required a bunch of instructions. You have to set up the stack, save all the registers, do a whole bunch of stuff.

It wasn't until "Lambda: The Ultimate GOTO" that the author showed it doesn't have to be so expensive. In most cases, a call is just setting up the arguments and then a jump. In other cases, it's still not that bad: you save the return pointer and jump.

This is something that I guess we solved. The size of modules and how often they're called is not something we think about anymore. I wanted to bring it up because he thought it was an important issue, that we would need some advanced assembler that would inline the module code so it looked like it hadn't been divided into modules even though it had been.

I want to re-emphasize the main point of the paper, because I think it can get lost, especially in the discussion of all these historical things. When we break something up into modules, we often think in terms of functional decomposition: this thing handles the user, this thing handles this kind of document, this thing handles login.

We think of all the functions that our system needs — the features, basically — and we make a module out of each one. That's the wrong way to do it. We should think about the volatility: what might change over time. "Righting Software" — the book — gives a useful mental tool, questions to ask yourself to figure out where the volatility is.

Because it is actually really hard to spot the volatility. The idea is you ask yourself: will users change over time? The same user, will they want different things? You might say, "Well, the user might want to log in using an email address today, but next year they're going to want a Google sign-in," or GitHub sign-in, or some single sign-on provider.

That's the first question. Does the single user change over time? It's like in a house, does the single user, an occupant of the house, the owner of the house...I live in this house. Will I change the use of a room over time? How would I expect to change that?

I would change the furniture, I would move the furniture around. I might have a different function for that room. I might turn my guest room into an office or I might have another kid and so, I want a second bedroom for the kid.

There are all sorts of different ways that a single user would want to change things and so you have to modularize around that.

The second question to ask is: would different users have different purposes? Not the same user over time, but different users. You might say, well, we have admin users who need to log in this way, and we have, let's say, free-tier users who log in this way, and paid users this way, and then enterprise customers who log in a third or fourth way.

That kind of login, those decisions about how users log in, has to be modularized around. You are going to need different modules for each of those, and some way of encapsulating that difference.
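Here's one hedged way to encapsulate that difference, sketched in Python. None of these names come from Righting Software or from Parnas; they're illustrative.

```python
# An illustrative sketch of modularizing around login volatility: how a
# user signs in is likely to change over time and varies by user tier,
# so that decision hides behind one interface.

from abc import ABC, abstractmethod

class SignInMethod(ABC):
    @abstractmethod
    def authenticate(self, credentials: dict) -> bool: ...

class EmailPassword(SignInMethod):
    def authenticate(self, credentials: dict) -> bool:
        # check email and password against a user store (elided)
        return "email" in credentials and "password" in credentials

class SingleSignOn(SignInMethod):
    def __init__(self, provider: str):     # e.g. "google" or "github"
        self.provider = provider
    def authenticate(self, credentials: dict) -> bool:
        # delegate to the provider's OAuth flow (elided)
        return "token" in credentials

def login(method: SignInMethod, credentials: dict) -> bool:
    # callers depend only on SignInMethod; adding a provider or changing
    # a tier's login flow stays confined to this module
    return method.authenticate(credentials)

print(login(EmailPassword(), {"email": "a@b.co", "password": "pw"}))
print(login(SingleSignOn("github"), {"token": "abc123"}))
```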

Those are just some basic examples. I feel like he nailed it. He just needs better examples that correspond better to modern software. And "Righting Software" — the book, righting with R-I-G-H-T-I-N-G — really does a great job of explaining this paper and the proper way to decompose systems into modules.

OK, thank you very much. My name is Eric Normand and thank you for listening. As always, rock on.