Code Hibernation and Survivability

We've got a problem. It's called bit rot. We have some working code. We don't run it for a few months. And when we try to run it again, it doesn't work. How did that happen? The system it was running on changed enough (packages upgraded, libraries deleted, passwords changed, permissions, hardware, IP addresses, and so on) that it just doesn't work. The only way we know how to keep software working is to pay someone to run it all the time. It's a bit like in the Dark Ages, when you had to have monks copy books to keep the skill of reading and writing alive, and also to stave off the natural entropy (fires, rotting, losses, etc.) that would befall the books. There has to be a better way.

Here's George's abstract:

Can we store code and expect to ever get it to run again? Perhaps for a year or two but what about 10? 20? How can we begin to reason about the problem domain? Taking a page from climate science, this talk explores models to better frame the problem which could provide insight into how to design systems that can outlast us.

George Kierstein will talk about this problem and a solution she came up with to mitigate some of the challenges with archiving software for the indefinite future. We're spending more and more time creating software. If it's so important, maybe we should think about how to keep it.

Slides

Download slides

Video

Please forgive the poor audio quality. We had technical trouble with the audio recording and used the backup audio. Please use the captions (CC) in the video player.

Transcript

Eric: My next speaker, I invited her because I've had this experience where I've written some software and put it aside for six months. When I came back to it, it didn't run. It didn't work.

Digging into it, I realized my whole machine had changed, updated, things were different. It didn't run anymore. That was only six months. We're spending years of our lives building software. If stuff breaks after six months? We're putting years of our lives into this. How can we let this happen? I don't want to give the talk away. Please welcome to the stage George Kierstein.

George Kierstein: Thank you. Hello. My name is George Kierstein, and the work I'm presenting and the talk in general is based on my time at the National Climatic Data Center, which is now the National Centers for Environmental Information, which is the nation's, and probably the world's, archive and main research center for climate-related topics.

The talk also gave me a great opportunity, like Will Byrd did, to nerd out about some problems I'm really fascinated by, one of which is something I've coined as a generational-scale problem.

What's a generational problem? A generational problem is a subset of these terrible and difficult problems in which the time scale and the time course of change is slow enough that we are cognitively blind to it.

We therefore have no intuition in which to ground how we're going to approach these problems. They're really tenacious and very difficult to solve. I like to call them dragons all the way down.

The other part that I find really fascinating is the origin of civilization and what it can teach us about the evolution of organizations. Let's get started.

Hibernation. It came out of a project that I was assigned to do, which seemed to the people who gave it to me kind of simple.

The story is, someone was about to retire. In fact, I have notes for this. Professor [inaudible] started this project in 1995. A couple of years ago he was about 80 and he was retiring.

He was the main contributor to one of the biggest and most important climate data models called the Global Surface and Atmosphere Radiative Fluxes model -- ISCCP. There's a lot of climate data models, but this one is what most of them are built on top of. Satellites measuring how dense the cloud cover is, among other things, all over.

A satellite database. He's been refining this model since 1995 and he's retiring. The next version was due out, but the satellites haven't been launched yet. What do you do? He has this incredibly large code base that basically only he knows. Grad students, other contributors, sure, but it was his baby and he needed to retire and put it down.

This is a pretty common scenario in, I think, generational-scale problems. Some of the code bases we're building for climate data research go back over 20 years, easy. They thought, "Let's figure out a way (and it's your job to figure out a way) to put this away so it'll run really great."

In some sense, I thought, "How hard could it be," knowing that the short answer is, "Of course, it's not really possible." That example's great. Can we really expect that at this time? No.

But some of the things that were really interesting, that I thought I could base some kind of solution on, were the constraints. Constraint one is that these are not the Will Byrd time scales. We're not talking about 5,000 years. Maybe a couple years, maybe a decade. If we're really lucky, decades, or unlucky, as the case may be.

Some of the things that pose real problems here are that the original author is not available. When someone's set to bring this back up to life, there is no domain expert to handle it.

They are pretty much on their own. The software is ostensibly a black box, and for our organization it was also no VMs or containers. That was at the time a technological constraint.

But I kind of think if you step back and look at the problem, that VMs and containers really are kicking the can down the road, and in some way hiding complexity from the person reviving this that is worse, not better. That's a debate we can have later.

Given all of that, I came up with, essentially, a protocol. Instead of any sort of software, it was a simple set of directives about how to package things up and what to include in this package that I thought would give the person reviving this the best chance they had.

It's a human-scale solution. There was no attempt to try to think of "What good automation can I do that will solve this problem," because in a very real way, again, it's dragons all the way down. People have never really tried to do this before.

The chances that an organization that is underfunded and not technically as sophisticated as others are going to really understand any kind of automated solution, that would have to be maintained too, seemed not possible.

The best we can do with this type of problem, I think, is to come up with a human-based solution. The human-based solution basically leveraged, and I'm not actually going to tell you the details of this, because I think in reality they're not that interesting.

The basis of my talk and what I'm going to try to argue here is not for any particular solution, really. It's whether we can reason about this problem better than we have, which is to date really not at all.

Comprehensibility seemed like a fundamental principle and I kind of took the approach of my future self. Everybody, pop culture -- think of your future self. This has happened to me in practice and probably to many of you.

Luckily for me I was at a company a really long time. Years later, I had to fix these bugs and look at very old code. I got really upset about the design decisions that were made and the styling until it sunk in that I wrote all this code. I forgot.

I had no idea. I looked at it fresh, and I was mad until I knew it was me. The future self model is not a bad one in this case.

Comprehensibility is really important. Can you even comprehend the basic structure of the layout of this thing that you're trying to get to know and get to work again?

Simplicity in that sense is essential for comprehensibility. How else could you try to figure it out? If it's super-complicated all the way down, this is going to take a lot of time. You might not be able to do it.

One thing that has been referenced in a lot of talks, and is kind of a general theme, I think, is that context is important. Because context really matters, we forced people to frontload as much information as possible.

They went to great lengths to write documentation and record architectural decisions that had been made but never written down, in order to provide as much context as we could for whoever had to get this to run again.

That was the main thrust of how I approached this problem, which led me to feel pretty good about myself. I actually won an award for my protocol, which was kind of ridiculous, because essentially it was just "Please, have good hygiene. Do things sanely. Be sympathetic to whoever comes next." Yet that was still a novel idea at that time and in that organization.

This actually brings us to the bulk of the talk, which is, well, that worked out well for them, we got his stuff put down. We did the test to make sure we could revive it at least once with someone other than me in production, and that went as well as could be expected, which took a month, by the way.

Which tells you something about the time scale. Even in a very controlled circumstance, when we literally had the guy who knew how to get it to run sit down with me and do the minimum we needed to revive this code and put it away, it still took somebody a month. The organization, like DevOps, took weeks to respond to us. Dependencies were missing that we didn't even notice at first, with the way they set up their deployment machine versus our development environments.

That's pretty sobering already, and speaks to the level of complexity and how bad or how difficult this problem space really is.

Of course, my intuition when they said, "Hey, now that we've got this fancy thing, we're under budget and we don't know where to put all our data and we have like 70 new datasets coming in.

"Why don't we (brilliant idea here, just brainstorming) just replace all of the actual data that we take so much time to have provenance for, in the Dr. Sussman provenance way, with the program and its inputs? It'll save us so much time, so much space. It'll be great."

Of course, horrified, "No, no" was the response. Trying to articulate "no" with something more reasoned than "Don't get me to do it, I'm not going to help you," and a hand-wavy "no," kind of led to a lot of the rest of the talk, which is next: can we do better than just saying "no"? Let's give it a shot.

So I'm gonna leverage some points of inspiration from sort of the generational problem perspective and the historical perspective of how civilizations have been formed and how they organize themselves to solve big problems.

So let's do what everybody does. Let's go back to the Newton age. When we have an awful problem we don't know anything about, let's just define it. So we're gonna say survivability is S.

Let's try to constrain the problem space a little bit, because with the survivability of any kind of system (and software systems, in a very real sense, can be conceived of as an ecosystem that sort of lives and breathes and thrives and dies) there's a lot more to it than we can really lay out and tackle all at once. But there is some part of it that we might be able to come up with a better model to actually reason with. I call that resilience.

So what do I mean by resilience? In the sense that software is flexible and adaptable, resilience is this property by which, you know, it is resistant to change or can cope with change well. So resilience, in my mind and by intuition, is inversely related to how fragile something is. That's almost definitional from the English language. Let's call resilience the inverse of fragility. So now we've got a model. We're on our way. So I'm reminded of Tim Gunn, who says, "steal from the best and make it your own".

So industrial engineering, for example, is a field in which they've really looked at this problem of how you go about looking at how things break systematically in some kind of production system. Now of course emotionally, I hate this and I'm like well, I'm not an industrial engineer. I'm kind of a special person just like all of you are above average. I'm an above average programmer like everybody else. And so I didn't like this very much mostly, but let's give it a shot because they've done a really good job.

So when you pick up an early text on industrial engineering, they talk about defect rates. They have nice models and they say, "okay, we can model a defect in a system like so". Kind of like Dr. Sussman's intervention of skew, I defined it as drift. So how fast and how far do the parts of the system, loosely, that cause things to break in this respect drift?

In our problem space, for this type of problem where we put code away and it's well-defined, we bring it back up and we just expect it to work, we avoid a lot of the other kinds of problems, like adding things or what happens when we send new inputs to it. Since that's not our concern at this point, this is really about drifting dependencies that break. Like, what really causes these systems to fall down? Well, in practice, it was the dependencies. You know, the algorithm or the nature of the thing itself, that didn't change between the time we put it away and when we're pulling it back up again.

So let's just call them D. That's drift, and it's about dependency drift. Maybe if we're lucky, we could figure out eventually a statistical model. Now of course, that seems ambitious and dubious at best, right? Because who knows if any kind of simple statistical model, like a Poisson distribution, which is kind of the baby steps of industrial engineering textbooks, has any application to what we're talking about. In fact, maybe a stochastic sort of statistical approach might be a better fit, possibly even acknowledging that this is really a complex system and we have to start talking about the phase spaces of its stability and all the rest of that.
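To make the "baby steps" model concrete, here is a minimal sketch of what it would say if, and only if, breaking changes to a dependency arrived as a Poisson process with a constant rate. That assumption is exactly the one questioned above. The function names and the example rates are hypothetical and chosen purely for illustration; they are not from the talk.

```python
import math


def survival_probability(breaks_per_year: float, years: float) -> float:
    """Probability of zero breaking changes to one dependency over `years`.

    Assumes breaking changes arrive as a Poisson process with a constant
    rate. This is the simplest textbook model and may not hold for real
    dependencies.
    """
    if breaks_per_year < 0 or years < 0:
        raise ValueError("rate and duration must be non-negative")
    # For a Poisson process, P(zero events in time t) = exp(-rate * t).
    return math.exp(-breaks_per_year * years)


def system_survival(rates: list[float], years: float) -> float:
    """Probability that no dependency breaks, assuming they are independent."""
    prob = 1.0
    for rate in rates:
        prob *= survival_probability(rate, years)
    return prob


if __name__ == "__main__":
    # Hypothetical drift rates (breaking changes per year) for three dependencies.
    example_rates = [0.1, 0.5, 0.25]
    for horizon in (1, 5, 10):
        print(horizon, "years:", round(system_survival(example_rates, horizon), 4))
```

Under these assumptions the chance of an untouched revival decays exponentially with time in storage, which at least matches the intuition that a year or two is plausible and a decade is not.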

There are some interesting books on the relationship between stability and complex systems of computability. So who knows, right? But, given that on some level, like my Game of Thrones book club's house motto, "I do not know," we might as well start with something simple that we can reason about, try to put our intuition into a form that will be more useful, and argue about it later. So let's call this resilience now a proportional relationship to the inverse of the drift of our dependencies, which for our problem is more or less describing how the dependencies break: the chances of that system being resilient are inversely proportional to how badly and how fast they break.

Well, let's start to put some nuance to it, which is the magnitude. We kind of don't want just the number of potential breakages. We want to characterize also how bad this is going to get. So the magnitude of impact here is what we define as that. And then here's my Papers We Love moment. Can we put a bound on that magnitude, right? Well, there's this great online paper called the mathematical limits to software estimation, which I love. He had a background in CMMI. So did I, interestingly. And I'm not going to get into what that is, other than that it's kind of an eighties-level, first-wave attempt to put any kind of systematic reasoning and modeling around software estimation. We've kind of put it aside, except in places like the government, and we've kind of gone with hand-wavy estimation by analogy. But this paper uses Kolmogorov complexity to try to analyze whether or not that's even sane or possible. And the upshot here that I want to get at is that, ostensibly, no, it's not.

If you're doing an estimate, your estimate can be off by as much as rewriting the whole thing. Which is a sobering thought. In a lot of ways this kind of harkens back to (I love referencing people because everyone's been so great presenting things that are super interesting) Kim Crayton, talking about the myths of programming and how we should change our culture to be science-based about it. One of the myths that I had a whole rant about, that I took out of this talk, is around the ten-times programmer and estimations. You know, kind of our cultural bias towards believing we can. Perpetuating some of these myths. But basically, the upshot of that is we don't believe these estimates, and that should be really a sobering kind of criticism of this ten-times programmer. Because I pad things times two, times three, most people do, and as a manager, I took other people's estimates and did the same thing.

So now we typically, kind of just by nature, pad out against this order of magnitude that people are supposed to have. That seems pretty sobering, bad, but that does give us some kind of bound. Well, in our problem, the worst case for us is we have to completely replace the dependency. That, theoretically, is a task we can understand: it's a certain amount of time. So let's run with that. Let's just say that we have a set of linearly independent dependencies, which of course is dubious, but again, simplicity, and who knows.
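The slides are not reproduced in this transcript, so the following is only one way to write down the model as described so far. The symbols R (resilience), F (fragility), the index i over the n dependencies, and the subscripted D_i and M_i are notation assumed for this sketch. The talk itself names only S for survivability and D for drift, and describes a magnitude of impact whose worst case is replacing the dependency outright.

```latex
\[
  R \;\propto\; \frac{1}{F},
  \qquad
  F \;\approx\; \sum_{i=1}^{n} D_i \, M_i,
  \qquad
  0 \;\le\; M_i \;\le\; M_i^{\mathrm{replace}}
\]
```

Here D_i is how fast dependency i drifts (how often it breaks), M_i is the magnitude of the impact when it does, and M_i^replace is the cost of replacing that dependency completely. Treating the dependencies as independent is what allows the simple sum.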

So now, we have this nice series here that will sum all the dependencies and give us some kind of number. So we've got a metric. We've got a model. And I'm going to segue now, because where's that D gonna come from? Is that even accurate? Who knows how you're gonna find out? So, I'm gonna start to talk from the history of organization perspective. Kind of jump forward to one way in which early science went about solving a generational-scale problem that in weird ways bears a lot of resemblance to this problem space.

Here is the Great Pyramid of Giza, built approximately in the 3rd millennium BC. Nobody really knows. The interesting thing here is that that obviously took an enormous amount of time and effort. This is the vulture stone. I can't pronounce that; I'm not gonna try. And it's basically this Paleolithic or Neolithic, amazing piece of art, but it also used to be a controversial piece of art. Because, you know, the crazy people were positive that it was like, "Okay, this is depicting when comets changed the weather. They slammed in at that time, 10,000 BC." They brought about this sort of period in history, and that caused these climate change problems, a big shift in human behavior, and is the origin of organization and agriculture, right? Well, nobody believed it until very recently, with some researchers at the University of Edinburgh. Klaus Schmidt basically used the model to back-calculate where the stars would be, and not roughly. Their symbolic depiction of that is really literally down to 10,950 BC. They can back-calculate exactly when the comets came. They were right, and that really is an amazing depiction of a historical event, with a lot of accuracy for a development like astronomy at the time. And apparently they also managed to calculate long-term changes in the Earth's rotational axis from those comet impacts, and they put them on these stone things from the early writing.

So astronomy and our kind of monument building is really old, and our attempts to reason and put things down go back quite a long ways. And basically, every monument-building culture were amazing astronomers, and they had sophisticated methods to try to build these massive monuments. This is the city of workers here, from the Great Pyramid of Giza. Unlike our previous assessment, which is that Charlton Heston and slaves built the pyramids, in fact, they've been able to determine that it was about 10,000 very skilled workers that lived right next to the pyramids, and that in a very real way, they looked at this kind of like an Amish barn-raising thing. So they had their entire civilization built around, and kind of geared up to, building these great pyramids.

And so these monument builders. These are the moai on Easter Island. They were the monument builders that basically built themselves out of existence. That is kind of their story, right? They threw their whole civilization into building these things, and in doing so, they eradicated all the plants and the trees and they went extinct, and this wasn't that long ago.

In one way, I argue that this is a pattern for how humans organize solving problems. And we're doing it again on a much grander scale. This is the Foxconn factory in Shenzhen.

So what are some of the takeaways here? Well, in a sense that I think can be meaningful for us and useful for us, a civilization is defined by these monuments. We build monuments. We need to be careful, as Sussman pointed out, about what monuments we build and how we build them. Because at the end of the day, they come to define us in a lot of ways, and on all levels of civilization.

Every single person in there kind of has a cultural sort of attitude that brings them to work. And you know, these monuments get built and here we are today. So another thing that's really fascinating about basically everything I showed you previously: no one really understands how they pulled this off. I mean, let it sink in. The Great Pyramid of Giza is an immense engineering marvel, with a precision that we can't really easily replicate unless we're highly motivated, right?

No one has any idea, because no written record of their building techniques is left. You know, just like the papyrus. You know, no one knows how this was done, and in a lot of ways this inspires people, I think, to look to aliens. And I think that that is a huge mistake, and I have a Facebook group chat called "Pure Idiots". It is where a friend of mine really loves the serious alien sort of thing. But I really argue with them and can't go there, because really, I think we should never underestimate the human nature of how badly we are inspired as nerds solving problems. And that, I think, can serve us as both an inspiration and a warning, to be able to tackle problems that we have, right?

This is BG from earlier. Earlier this... or yesterday, and he said this great quote: "at scale all problems are people problems". Well, somebody and I were talking about it later. All of our problems are people problems, all the way down. And I think we have a historical basis and frame of reference to understand that, and that gives us a lot of grounding and a historical reason to take cultural criticism seriously and to do the work to change it, when we understand in a first-principles way that it's wrong, and to build something better.

So let's get away from the long past and talk about something more modern, which is another type of generational problem. It's a global closed system, much like software. At this point, you know, discounting aliens, it pervasively impacts all aspects of human life and civilization, and its timescale causes us to sort of reinvent the wheel with certain patterns. I'm talking about climate change, right? So I want to give you a quick overview, just a quick list of the history of weather and climate. What happened when. The highlights.

One of my personal heroes, Joseph Fourier, calculates basically the greenhouse effect: that the world would be far colder if it lacked an atmosphere. By 1849, the Smithsonian, with the invention of the telegraph, had given out instruments all over the nation to get people to collect weather reports and telegraph them in.

By 1870, we had a National Weather Service. Our government funded this whole effort because everybody really cared about that. Of course, the war happened, and all the rest of this, to motivate it a little bit more, and by 1934, right here in New Orleans, the first archive and tabulation unit for all this data was founded. And by 1957, it moved to Asheville and has been there ever since. So that gives us a really interesting time scale for how people have gone about solving this intractable problem.

So what's really useful for us? Well, I think that we need to really have an active recognition of the pervasive impact that this problem has in our life. And the way we do work. And the way we prioritize within an organization. And data collection was a necessity and a passion because of that. Again, culture is built to build monuments. We build it to improve our lives, and mostly to build monuments. Because one thing that I didn't show you, that I forgot to mention from the vulture stone, is that recent research has shown that civilization and urbanization started not because everybody thought that it would be a good economy of scale. That happened the other way around. So we started as monument builders, which started urbanization. We developed all these techniques, after the fact, in order to build better monuments.

So here we are, building a new monument. Data collection is important. Thomas Jefferson, upper-left. He was an avid weather recorder. This is an example from 1907, which is an observation. A little bit more systematic, and that is sort of the other takeaway point.

Systematic collection. What they knew was relevant is what they did. And that's true for the observations of Copernicus and Galileo, as well as for climate data. So I think this was a pep talk and that we have it easy.

We already have this huge corpus of data. It's disorganized. It doesn't really have provenance. But we do have a lot of data, and we're starting to talk all the time, and a lot of talks at the conference touched on this, about connecting these dots. We have classifiable types of problems that are pervasive. People have seen them before, but we're reinventing them again. And we have the ability, luckily this time, to have a much more immediate turnaround time and impact on a problem space we're trying to understand. So that's great. So I promise I'll get back to my little model.

What can we do about this? So whatever effective solution we come up with really will have to affect me as a person, a contributor, part of society, in that dimension, and in an organizational-scale one. So we've got to find an answer to "what do we do and how do we motivate people to do it?". So we need to start with some simple taxonomy, and it almost doesn't matter what it is.

I'm arguing that it just has to be simple enough to be useful, to create a true historical record that we can then automate, do machine learning against, or whatever kind of technique we want, to find that letter D and describe it better. So we can just systematically add a dependency annotation: "this check-in is because of a dependency." That's all we really have to do to start a real historical record. Certainly for open source, you know, the argument is that this will turn what is now a disorganized corpus, which can't be searched and where we have no way to carve out how many of these problems happened when, or because of this particular issue, into a record for this specific resilience problem.
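As a minimal sketch of how little such an annotation needs, suppose each check-in caused by a dependency carries one extra line in its commit message, in the style of Git trailers such as "Signed-off-by:". The "Dependency:" trailer name, the function names, and the sample library names in the usage note are all hypothetical. The talk proposes the annotation but does not prescribe a format.

```python
import re
from collections import Counter

# Hypothetical convention: a line such as "Dependency: libfoo" in the commit
# message marks the check-in as having been caused by that dependency.
DEPENDENCY_TRAILER = re.compile(r"^Dependency:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def dependency_annotations(message: str) -> list[str]:
    """Return the dependency names annotated in one commit message."""
    return DEPENDENCY_TRAILER.findall(message)


def tally(messages: list[str]) -> Counter:
    """Count how many check-ins were attributed to each dependency."""
    counts: Counter = Counter()
    for message in messages:
        # Use a set so a name repeated within one message is counted once.
        counts.update(set(dependency_annotations(message)))
    return counts
```

Feeding `tally` a list of commit messages gives a per-dependency count of check-ins, which is the beginning of the searchable historical record described above. Unannotated messages simply contribute nothing.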

Okay, so how do we motivate organizations to go along with all of this? That's assuming that people in general care enough about their sort of struggles with dependencies, and recognize the utility that these historical examples have, and that dependencies, as a class of problems, are worth paying attention to. And that's not all we know, but that's okay. We can start there, or we should. So let's test our model, because we need a way to communicate all of these ideas to other people. And for that I'm kind of going with the Hume... Wait, I'm so bad at names. Bear with me. Kuhn, Kuhn, "The Structure of Scientific Revolutions."

I think we can evaluate this model from the point of view of whether it has explanatory power and is better than our current model. Our current model is being teenagers and not wanting to clean up, honestly. That's the attitude and that's how people treat problems like dependency management and other kinds of things like that.

Maybe we can re-characterize code debt. We can use this model in a real way, given our assessment that we have to replace the dependency. That can be one. How do we characterize code debt? We don't, really. Somebody says, "It's bad, like, bad. We have to do something about it," but then someone further up would say, "Well, we can do that next quarter, maybe."

If you could characterize how expensive it would be, then that would be a very compelling argument to look over time and say, "OK, are we actually driving ourselves out of business by ignoring this problem?" Like, "We're going to have to pay this eventually, so how much is it actually going to cost?"

Which is typical for the types of organizational specializations, like CTOs, that evaluate risk management as a financial problem. We can give it a shot, because in some sense, we've got enough historical record where we can actually calculate those costs.

If building that library dependency into your project took somebody, they got paid X hours or M hours, well then, that's an M that has a legitimate real-world value from that point of view.

Then of course, we have domain experts. Here you would ask anyone, "How long would that take you?" They'd say, "I don't know," but then they'd pad it by 10 and say, "Oh, a month." Even if that's not accurate, at least it helps the management and the organization to reason about this problem in a more principled way, I suppose, so I guess our model has utility, which is great.
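A rough sketch of that financial framing, under stated assumptions: each dependency gets a drift rate D (expected breakages per year) and a magnitude M (hours to replace it outright, the worst case), and the expected cost is their product summed over dependencies, times the time horizon and an hourly rate. The class name, function names, and all figures in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    breaks_per_year: float    # D: assumed drift rate for this dependency
    replacement_hours: float  # M: worst case, hours to replace it outright


def fragility(deps: list[Dependency]) -> float:
    """Sum of D_i * M_i: expected worst-case repair hours per year."""
    return sum(d.breaks_per_year * d.replacement_hours for d in deps)


def expected_cost(deps: list[Dependency], years: float, hourly_rate: float) -> float:
    """Expected worst-case cost of dependency drift over a time horizon."""
    return fragility(deps) * years * hourly_rate
```

For example, two dependencies with (D = 0.5, M = 40) and (D = 0.1, M = 200) each contribute 20 hours per year. Over 5 years at a rate of 100 per hour, that is 40 × 5 × 100 = 20,000. The number is only as good as the D and M estimates behind it, but it is at least something that can be argued about next quarter.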

I'm not saying, "This is a great model," but we have to start somewhere. If nothing else, it should motivate people to think that we can take all of this and actually solve some of the corners of these problems based on like an old-school, early science way.

What else can we do? I'd argue that we need to build an archive, but I mean that in that historical way that includes real provenance.

We need companies and organizations and actually our entire civilization ultimately and idealistically to value this enough to say, "We should pay for all of everything. Hire people, train people, let people study this. More importantly let's contract it all and do it well and with a lot of systematic consideration."

One thing that would be really useful, because we need every citizen to participate: there are a lot of corporate citizens with code that's out of the way. Enormous amounts of our code are mostly hidden from us. That's the biggest criticism of the private versus public open-source ecosystem that we're in now, because that data is going to help us solve our problems.

We've got to have a way for them to contribute it to us that is not frightening to them. Motivate them to say, "We can strip all of our intellectual property out of this and maybe anonymize names of functions and stuff, so that we can feel safe about contributing at least this much information."

That's not that hard, and that's a problem that we can solve and we know how to solve. Other people, like Zach Tellman, who had a great talk on, and was arguing for, a simple taxonomy and the power of names, well, I think they're right. Somebody should really start to do that.

It doesn't have to be complicated and we shouldn't bicker about exactly what...Like, just calling it a "dependency," is really good enough to get started until we know how to better refine our ideas about these things.

Finally, code archaeology is probably a sub-discipline that should exist. We have an enormous amount of code. When I first started looking at Poisson distributions, I thought, "It's data," and started looking at what I could fit. None of it made any sense.

A lot of the check-ins and things... like, some people don't have public issue tracking, so you don't even know which problems relate to a check-in. Some people have terrible documentation in their check-ins.

We don't necessarily need that much refinement, although it would be very handy. If we at least could look at a code base and realize there is some problem related to a dependency in an automated way, so I could trawl all of GitHub and start to put timelines and ascribe life cycles, that would be great.
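As a sketch of what such a timeline could feed into, assume the hard part is already done and dependency-related check-ins have been extracted as (dependency name, date) pairs. A first, crude estimate of drift is then just breakages per year inside an observation window. The function name and the data shape are assumptions for illustration, and no GitHub API is involved.

```python
from collections import defaultdict
from datetime import date


def drift_rates(events, window_start: date, window_end: date) -> dict[str, float]:
    """Estimate breakages per year for each dependency.

    `events` is an iterable of (dependency_name, date) pairs, assumed to be
    already extracted from check-in history. Events outside the observation
    window are ignored.
    """
    years = (window_end - window_start).days / 365.25
    if years <= 0:
        raise ValueError("window_end must be after window_start")
    counts: dict[str, int] = defaultdict(int)
    for name, when in events:
        if window_start <= when <= window_end:
            counts[name] += 1
    # Crude estimate: number of observed breakages divided by window length.
    return {name: n / years for name, n in counts.items()}
```

A constant rate per dependency is the simplest possible reading of a timeline. Whether real check-in histories fit it at all is exactly the open question raised above.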

It would also provide us with a more historically based and native based argument for how often problems happen and whether or not we should do something about them. That is the big takeaway. Thank you.

Eric: Thank you, George. Please, volunteers bring questions. I had some questions about the monuments. You had a few monuments that you were showing. What are some of the lessons we can learn from say the building of the Egyptian pyramids? Is this something where we have to become slaves to our software, our data?

George: I think the takeaway is that our tools, and how we use them in order to build these monuments, are what we have to learn about, and we should reason carefully about how we're doing it. It gives us a more compelling argument to liberate ourselves from pedantic tools.

Like, I hate the idea of daily stand-ups. I think there's no real evidence that that's actually an effective way to run a team. I personally go for organized chaos and that seems to work for my group of people. I think there's sufficient evidence, actually, to back that. We don't have to become slaves to our tools. Instead we should re-engineer our culture from the ground up at every level to do a better job at building monuments. We have done work to analyze the theme of what we actually want to have at the end of the day.

Eric: Someone asks about VMs and containers. You said that you didn't use them. Can you elaborate on why?

George: To start with it was a pure organizational constraint. I was just not allowed to use them, so it wasn't even a consideration.

Even when talking and thinking later about, "Let's just replace all our data with code." I think because even now container wars are ongoing...

Like, "Should we have VMs?" "No, containers are better." "Which container?" "I don't know, I hate Docker," or even going, "Who knows what containerization technology is going to look like?" or, "How far is that VM model, when that goes away, or what happens when that breaks?" I think that just is a type of problem where you just kick the can down the road.

At least if, in the piece of code you're putting down, the main points of breakage are more directly related to the purpose of that code, we have a better chance, as someone who has to become a code archaeologist, to get it to run again and to reason about the system's requirements.

Eric: I have another question. I'd love to have other questions, too, so please get them up here. I worked on a system, it was in C++ and I'd never done C++ before. I worked on it for two weeks to try to get it to compile on my machine.

I believe it was a bootstrapping problem where someone had configured their machine over years and years and it just worked on theirs, and so their environment was in sync with the software. Then, once you just copied everything in a certain directory onto my machine, it just didn't work.

Of course, they hadn't documented everything they did to the machine. Is there any way to recover that?

George: I think this speaks to the provenance problem. Maybe we need to stop using the word "documentation" and replace it with "provenance."

In a way, if we can describe exactly where these changes came from, and why people are motivated to do them and have any level of systemizations, do it systematically on any level, then that becomes a corpus of useful data that harkens back to the turn of the century, those physicists...What's that term when a person does more than one big problem?

Eric: Polymath?

George: Polymath, yeah, the more popular term, but polymath scientists you hear of. They addressed these problems, first of all, by making up definitions and then collecting data they knew would fit, to the best of their knowledge at the time that was relevant.

They had to do it systematically otherwise, it was useless. We're drowning in our data now, because it's really hard to make sense of it. In that sense, the problem is pervasive across, "How do we build software when it's something where you're not going to change our entire culture from the ground up to reassess it and make it that?"

Eric: Whenever I look at those tables, it always seems like, "Wow, they really knew what data to collect." Why is it then when I think of logging stuff, I always make it way too complicated instead of building one or two things that would be useful?

George: Sure. I'm a "willful optimist," I call myself. I would liken where we are, if we're going to compare ourselves to big efforts to solve a huge amount of the problems in the past, we're right at the beginning of what we're doing. While we're really, rightfully, proud of ourselves for how far and how fast and what amazing monuments we've built, the reality is we still barely know. Wisps of benefits.

We're not that far along in this process. I think it's OK to cut ourselves a break and go with the very primitive basic things we are starting to identify as problems. Dependencies are this weird class of problems that have all kinds of side effects that we would be better off not dealing with.

We don't understand them at all. We understand them in isolation, our little corner of them but not in any systematic way.

In some sense, and I think this has something to do with how corporations have played this huge role in bringing computing to where it is today, and I'm not saying that with any critical overtones, but because that's been driven, they were very focused on certain problems and they don't have to share.

There isn't this civic-minded attitude towards the monuments we're building, which older cultures had. Even in science, people shared. Thomas Jefferson was recording data because that was a great thing to do: "Let's contribute to science."

Eric: This sounds a lot like Elana's talk about Debian, this project that we will contribute to and they do all the work and maybe churn the build even if...

George: Totally. I'm glad you brought that up. I wanted to reference her talk too. We have these entirely rich datasets already. That is one great example of how we're really far better off than our predecessors might have been in the early days of scientific problems, because we have an amazing amount of data.

It's not organized. We don't know how to make good judgments or reason about what's in there, because we've been ad hoc about the whole thing mostly by understanding what it says.

Eric: You mentioned ad hoc. Are there any ideas...? Is that a question? A doctor is required by law to keep certain kinds of medical records, meticulous records about every patient that they see. Is there hope for something like that in computing?

George: I would be very careful about taking metaphors from how other disciplines have approached how to collect data and how to look at it and what records you really need to have, because their needs drove them to that place over time.

The organizations they built which had the precedents, totally human nature, reflect that. Some of them are bureaucratic needs and litigious needs. I think if we keep it simple and just confine ourselves to the simplest version of the thing that we understand, which at this point is just noticing this is a dependency problem, that we will eventually come to have the right balance of the types of things that we store.

Some of these AI questions have been brought up around how to codify ethics and rules and so forth and these become problems we need to have some reasoning about and then out of that will come actual record-keeping requirements.

Eric: I think I mentioned this in Will Byrd's talk. It seems like our software only runs because we just never shut it down. We keep running it until we have like a class of people whose job it is to keep the machines running.

George: That's an idea. We're worse off in a weird way than the medieval scribes, because at least they only had to literally copy something. We're told often, "I want it just like that, but throw in this and paint it pink and draw three perpendicular lines." Like, "Wait, what?" From my favorite YouTube video, "Three perpendicular lines."

Eric: I can't read the handwriting on this, but it's about the NASA moon landing? Yes, please.

Audience Member: The NASA moon-landing tapes, the majority of them, or certainly the highest quality ones were lost, because the format was not preserved and so the tapes themselves, physically are lost because nobody can read the format. Something analogous to a Docker container would have been able to preserve the ability to read those tapes. Wouldn't that be a big improvement over not having them?

George: Don't get me wrong. I have nothing against container technology. I, in some sense, wish I could have used it. I really want self-healing code that can fix itself for me and to put in place instead of these ad hoc primitive methods that seem to be the only sane thing that you can do facing all of this.

That would be great. I think we can do better than that. I think we can't just assume that's the solution to this problem, because history makes it pretty clear that that's not how people approach these problems and that's not how they end up solving them.

Eric: Isn't a Docker container or a Docker image just another format you have to make sure that they'll be able to be opened?

Audience Member: As with anything pedagogical, you build one on top of the other. You have an entire stack. We use that word every day. It works.

George: Sure. That's a certain style of solution that has utility that we have an intuitive, emotional, like, "This is great, look at this monument. My stack's 10 deep," or whatever. I don't think that that necessarily does us a service of trying to step back and look...

Audience Member: What's the alternative? That's my question. What's the alternative?

George: I think the alternative is to continue forward with the things that work, but try to create a corpus of data that can directly address some of these questions, so that we can form a simple taxonomy. Like, dependencies model.

Other people, who are much better qualified and probably have better thoughts about this, could probably build a beautiful, simple taxonomy that looks at our class of problems and can annotate them properly. If we all go ahead and do that and are systematic about it, we could have a chance to reason about why that works so well and, when it falls down, how and why it does.

There was an argument in another talk, earlier, that was about, "We need systems language." We probably do, but that's an intuition that this is a class of problems that could be better reasoned about with a better tool.

Container technology and stacks have taught us a lot about how to cordon off these dependencies and these levels of abstractions. I think that's incredibly useful. We couldn't be doing what we're doing today without some level of it.

Again, I think we need to be a little humble and to face that they don't solve all our problems and we don't understand enough and that we could come up with a better solution, but we need better data.

We need to take that other intuition that they don't always work, more seriously, in a systematic, cultural way and actually get people who are motivated to deconstruct why and how and compare them to other classes of problems. I think when you do that and we don't look at each problem like it's in isolation, we can start comparing them to similar classes of problems.

Do containers have the same problems as VMs? That's an interesting question I don't think anybody's...I'm not an expert in this, but I think that's an interesting question. How could we have thought about that? We just came up with this stuff.

We don't have all these answers, and that's OK, because I think we're starting to figure it out, but we need to compare notes and we need to do that systematically. That's how we'll come up with solutions.

Eric: Any other questions? Yes?

Audience Member: In terms of just day-to-day first steps about trying to go about conscientiously writing your program and getting to archive and annotate, what's your advice on that?

How would that look, assuming that what you are writing will break in six months, and reasoning about predicting why?

George: Predicting why it does is the holy grail of what we're doing with the theme collection.

I think the historical record and being conscientious about the provenance of these types of problems you personally are encountering with the code base goes a long way to let you build tools to help you reason about, "What happened and why it didn't happen last year?" or, "Is it really this type of library that's to blame?"

"Maybe it's the way we built our API," but from that you have to do simple things. Like if you really have constant dependency problems, you should, every check-in that's related to it do the simple thing and put "dependency" on it.

Whatever it is, you should be systematic about it and treat it with a certain level of enthusiasm that eventually this will actually help you search through your code base and your history and be able to start to better reason about the problem space that you encountered.
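
The tagging habit described here can be sketched in a few lines. This is only an illustration, assuming check-in messages are available as plain strings and that the tag is the literal word "dependency"; the function name `tagged_checkins` and the sample history are hypothetical and do not come from the talk.

```python
import re

# Hypothetical convention: a check-in counts as dependency-related when its
# message contains the whole word "dependency" (case-insensitive).
TAG = re.compile(r"\bdependency\b", re.IGNORECASE)


def tagged_checkins(messages):
    """Return the check-in messages that carry the "dependency" tag."""
    return [m for m in messages if TAG.search(m)]


# Made-up sample history, for illustration only.
history = [
    "Fix typo in README",
    "dependency: pin JSON library after upstream release broke parsing",
    "Add retry to uploader",
    "Bump build tool (dependency)",
]

for message in tagged_checkins(history):
    print(message)
```

Because the tag is a plain word in the comment, the same search works with whatever history tool a team already uses; the point is only that the tag is applied consistently.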

Eric: There's an idea that the changelog, as a standardized point, like practices about adding stuff every time you do a release. Are you thinking something like that?

George: No, I'm being very literal: systematically add "dependency" to all the relevant check-ins. We add a check-in box. If it's related at all conceptually to your problem space, it should have "dependency" in there, in that comment.

That's the simplest thing you can do. If that point has a timestamp, we have a lot more information about what changed when, and finally, without trying to look at every check-in or remember what we did, we know that it's due to some check-in with a dependency problem.

I think other classes of problems, given a simple taxonomy -- you could start your own -- can help you carve up these problem domains. From there, you can start to reason about whether or not the thing you suspected was going wrong, and suspected was influencing this, actually is happening.

Because right now we just have our intuitions and we roll with that, based on experience. But the truth is, those are often wrong.
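
One way to check an intuition against the record is to count tagged check-ins over time. The following is a rough sketch, assuming each check-in is an (ISO-8601 timestamp, message) pair; the three-tag `TAXONOMY`, the `timeline` function, and the sample data are all hypothetical, since the talk itself only names "dependency".

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical starter taxonomy; only "dependency" is named in the talk.
TAXONOMY = ("dependency", "api", "config")


def timeline(checkins):
    """Count tagged check-ins per (month, tag).

    `checkins` is an iterable of (ISO-8601 timestamp, message) pairs.
    """
    counts = Counter()
    for stamp, message in checkins:
        month = datetime.fromisoformat(stamp).strftime("%Y-%m")
        # Split the message into lowercase words so "dependency:" still matches.
        words = set(re.findall(r"[a-z]+", message.lower()))
        for tag in TAXONOMY:
            if tag in words:
                counts[(month, tag)] += 1
    return counts


# Made-up sample data, for illustration only.
checkins = [
    ("2018-01-15T10:00:00", "dependency: pin parser version"),
    ("2018-01-20T09:30:00", "api: rename endpoint"),
    ("2018-03-02T16:45:00", "Fix build, dependency moved"),
    ("2018-03-09T11:00:00", "dependency: replace deleted library"),
]

for (month, tag), n in sorted(timeline(checkins).items()):
    print(month, tag, n)
```

A month-by-month tally like this is about the simplest evidence that could confirm or contradict a hunch that dependency problems are getting more frequent.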

Eric: I have another card, from Hunter. "How do literate programming approaches come in?"

George: I love the idea of literate programming, but it's been a really long time since I looked at it. I don't know. I haven't really thought about that, but yeah, that is speaking to we need to rethink what we mean by documentation. I think Knuth was right about that.

Again, here we are, years later, and I didn't bring his books anyway. Didn't have that right background and hack the hacker without it. That sounds great, let's do that. But we do need to rethink documentation and we need to rethink how we look at what we want to report and how we want to do it.

Sure, maybe we can rethink how we program altogether. Because again, the tools that we use to build the monuments are symbiotic and in line.

Male Audience Member 1: But in that sense, literate programming, taken as a practice -- I've never been super-successful at it -- is not just documentation. It's how we approach the actual construction of the software and the organization of the software.

George: Yeah, that's great. Again, I don't really know anything about it, but at the end of the day, that's more evidence that code archaeology as a sub-discipline should be treated like a legitimate thing and maybe people should start doing that.

We can get value from it and justify the changes that people advocate and the solutions to certain parts of the problem domain that people are really excited about.

Eric: Yes.

Male Audience Member 2: All that historical stuff...People were able to date with the stars. They saw these events and so now we can guesstimate better when things happened. Is there any kind of log we could have, in computing, that would provide, "OK, I can guess that the dependency's happening between this and this"? Or "These things should work together"?

George: That's a great question. I don't know. I think that's what we're looking for when historians try to analyze our own computing history and people try to put together talks about how to systematically think about systems versus processes versus these other things.

Yeah, it would be wonderful to take something like all the dependency management stuff that Elana is working on and tie it into an actual narrative that's got descriptive powers, say, "Oh yeah. This class of libraries suck. We should really stop trying to do it."

Eric: OK. I think we're going to end it there. Thank you so much.
