What if data is a really bad idea?

This is an episode of Thoughts on Functional Programming, a podcast by Eric Normand.


In this episode, I read from and discuss a comment thread between Rich Hickey and Alan Kay.

Transcript

Eric Normand: What if data is a really bad idea? Hello. My name is Eric Normand. This is my podcast. Welcome. Today, I'm going to be reading and discussing a "Hacker News" thread. This one happens to be an AMA, an "ask me anything," that Alan Kay did back in 2016, about five years ago.

It's interesting to me because Alan Kay and Rich Hickey get into a little discussion. This is like catnip for me. This is candy. I ate it up back then. I've probably read it 10 or so times since then to try to understand what happened in this discussion. This topic was actually suggested to me by my friend Yehonathan Sharvit, who's writing a book on data-oriented programming. I think it would be a great topic. Let's do it.

Alan Kay is being asked a lot of questions. He is also asking questions himself. One of the questions he was asked was, "What are the things that you're questioning today? What kinds of questions should we be asking?" One of the things he said was, "What if data is a really bad idea?"

Rich Hickey responds. I'll read excerpts. I'm not going to read the whole thing. You can read the whole thing. It's interesting. I just want to highlight a few sentences. Pick out things that are worth talking about.

All right, this is Rich Hickey. He's trying to get more information from Alan Kay. He says, "I find data hard to consider a bad idea in and of itself, i.e., if data equals information, records of things known or uttered at a point in time, could you talk more about data being a bad idea?" He was asking for some clarification, "What do you mean data is a bad idea? It just means information." Alan Kay says, "What is data without an interpreter? When we send data somewhere, how can we send it so its meaning is preserved?"

This is Alan Kay answering a simple question. Rich Hickey says, "Data without an interpreter is certainly subject to multiple interpretation. For instance, the implications of your sentence weren't clear to me in spite of it being in English. Some metadata indicated to me that you said it. Should I trust that and when? These seem to be questions of quality of representation/conveyance/provenance," which he agrees are important, "rather than critiques of data as an idea. Data is an old and fundamental idea. Data, something given, seems the raw material of everything else interesting. Interpreters are secondary and perhaps essentially varied."

Like I said, I've read this like 10 times, so I know more context than what I'm reading, because I've read further down and can reinterpret what he means by all this. Of course, I've watched a lot of Rich Hickey talks and read other stuff he's written, likewise for Alan Kay. I'm going to try to integrate all of that.

Of course, this is up for my interpretation now. That's why you tuned in to the podcast. Let's continue. Rich Hickey is the person who goes to a dictionary, when he realizes that he's unsure of what a term means. To me, I think that he perhaps is not being generous with his interpretation of the question with Alan Kay.

He is saying, "Well, the fundamental idea of data is that it's information." This is an old idea of record keeping. It goes back at least as far as the Sumerians. Writing down stuff or keeping stuff, maybe even not in writing, keeping records in knots and strings and stuff.

This is an old idea. He's implying also that it's a good idea. It's good and it has stood the test of time. He's not saying because it's old, it's good. He's saying it is good. It has been refined over the years into an understanding of why it is good and how to work with it.

He's still asking this same question. How can it be a bad idea? Alan Kay responds, "There are lots of old and fundamental ideas that are not good anymore, if they ever were." I feel like at this point, the conversation is breaking down. Alan Kay is using this rhetorical argument that, "Oh, here's one little thing that you said that data is old and fundamental. I'm going to say, that doesn't mean it's good."

I don't think Rich Hickey was implying that. Alan Kay is also being ungenerous with his interpretation of what Rich Hickey was saying. From my experience reading other stuff on Alan Kay, this is common. Alan Kay is a very smart person, is very well-read, and is used to dealing with people who are less well-read and have thought less deeply about stuff than he has.

He has a bias to interpret what people say as not having thought as deeply as he has on these issues. The bias probably serves him well most of the time. But Rich Hickey is honestly trying to form an understanding of how data could possibly be a bad idea.

With someone like that, it's a mistake to use these rhetorical arguments, nitpicking one little point in his two-paragraph response and thinking, "Oh, you must think all old ideas are good." Which is ridiculous. Alan Kay continues, "The point here is that you are able to find the interpreter of the sentence and ask a question. The two were still separated. For important negotiations we don't send telegrams, we send ambassadors."

Then he goes on to talk about objects. Of course, Alan Kay invented object-oriented programming. He says, "Bundling an interpreter for messages doesn't prevent the message from being submitted for other possible interpretations. There simply has to be a process that can extract signal from noise. This is particularly germane to your last paragraph. Please think especially hard about what you are taking for granted in your last sentence."

Again, the last sentence that Rich Hickey said, "Data, something given, seems the raw material of pretty much everything else interesting. Interpreters are secondary, and perhaps essentially varied." Alan Kay in this part, he is trying to answer honestly. I'll jump ahead. They are fundamentally disagreeing about what constitutes data.

Alan Kay, from other readings I've done, is using data as a programmer uses data, meaning bytes in memory, basically, charges of electrons in your RAM chips. It's all signal. It's some signal that gets interpreted as ones and zeros. It's an analogue signal, but it gets interpreted into a digital signal and then collected together into bytes.

Then those bytes have some meaning in your program. They could be numbers. They're all numbers. They're binary numbers, but then some of those numbers are representing integers. Some are representing pointers. Some are representing floating-point numbers. Some are representing characters, etc.
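To make that concrete, here is a minimal sketch in Python of the same four bytes read three different ways. The byte values are chosen purely for illustration, and the choice of big-endian layout is an assumption. Nothing in the bytes themselves says which reading is the intended one.

```python
import struct

# Four arbitrary bytes, chosen for illustration only.
raw = bytes([0x42, 0x28, 0x00, 0x00])

# Interpretation 1: one big-endian unsigned 32-bit integer.
as_uint = struct.unpack(">I", raw)[0]

# Interpretation 2: one big-endian IEEE 754 single-precision float.
as_float = struct.unpack(">f", raw)[0]

# Interpretation 3: the first two bytes as ASCII characters.
as_text = raw[:2].decode("ascii")

print(as_uint)   # 1109917696
print(as_float)  # 42.0
print(as_text)   # B(
```

The program that reads the bytes decides which interpretation applies. The memory only holds the pattern.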

He's using data in that data structure way, in the way we think of it as programmers. Rich Hickey is talking about data as this fundamental idea that goes back to, like I said, pre-writing. I think that Alan Kay is not acknowledging that Rich Hickey has said interpretation is necessary, and you can't understand the data without an interpreter.

What Rich Hickey is talking about is that the idea of data, of trying to write something in a way that can be understood given certain minimal levels of common understanding of interpretation is a good thing. Meaning you're going to write numbers a certain way. I'm going to write with these 10 digits, and the most significant digit goes on the left, and the least significant goes on the right, or I'm going to write in English using these letters. I write them in a way so that other people can read them.

There's a kind of minimum, and this is the idea of data, that we have some agreed-upon way of recording facts. Over time, we've refined that, and the interpretation is a secondary idea. He goes on to refine this idea later, so we'll get to that. I think that this sentence that Alan Kay says is very important, "For important negotiations we don't send telegrams, we send ambassadors." This, to me, shows one of the fundamental disagreements or misunderstandings that they're having.

Alan Kay is trying to talk about a way of scaling systems by sending something more intelligent than a piece of JSON. Something more intelligent than pure data over the wire and higher-level communication than you can get from a single message getting sent. This shows to me that what Alan Kay is doing is trying to get at a fundamental research question. Often when you're doing research, you ask these almost silly questions and try to uproot some fundamental premise that you've made.

Say you've made this assumption that we can start with data as given, and that's what we'll base our whole network or whole system of computing on. If you start with that, you get a certain system. What if we go back and think, what if we don't pass data between systems? What if it was a bad idea? What else could we do? He's trying to use this analogy or metaphor of telegrams versus ambassadors. If it's an important enough negotiation, a king will send an ambassador and not just a message.

Although the discussion is going off the rails, I think that he's asking an interesting question. We shouldn't just say, "Oh, he doesn't understand the actual dictionary definition of data. Of course it's a good idea." I think it's worth asking. The discussion is going off the rails and, in fact, ends after several long responses. Here Rich Hickey is going to answer. He's trying to say, "Without the idea of data we couldn't even have a conversation about what interpreters interpret." Again, this is a rhetorical question.

An interpreter can't interpret anything if there's nothing called data for them to interpret. How could it be a bad idea? We're talking about interpreters. We're both saying that, yes, interpreters are important. Rich Hickey says maybe they're secondary to data, but Alan Kay is saying, "But what about the ambassadors?" Rich Hickey is saying, "The ambassadors are working with data as well." How can it be a bad idea? Even if you don't use data, like the idea of data is bad? Rich Hickey is honestly trying to explain what he means by data.

Take a stream of data from a seismometer. The seismometer might record a stream of numbers. It might put them on a disk. Completely separate from that, some person or a process given the numbers and the provenance alone might declare there is an earthquake coming, but no object sent an earthquake coming message. The seismometer doesn't know an earthquake is coming, so it can't send a message incorporating that meaning. There is no negotiation or direct connection between the source and the interpretation.

This is a good explanation that there's some signal that is getting recorded, and that record is the data. That record can be interpreted by something. A person or a computer program that can interpret that as an earthquake is happening or coming or whatever. There's no direct connection. Another person could interpret them in a different way and then declare, "No, there's no earthquake."

He's saying that idea of writing down what the seismometer said separate from the fact that a certain person decided it must mean there's an earthquake coming. Separating those out, that is important. That is the idea of data. That is good. How can that be a bad idea?
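Here is a minimal sketch of that separation, assuming a made-up record format and made-up thresholds. The names Reading, eager_interpreter, and cautious_interpreter are hypothetical, not anything from the thread. The readings are recorded once, and two independent interpreters draw different conclusions from the same record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    station: str      # provenance: which instrument recorded this
    timestamp: float  # seconds since recording started
    amplitude: float  # hypothetical units


# The record: what the seismometer wrote down, and nothing more.
amplitudes = [0.1, 0.2, 3.5, 4.1, 0.3]
readings = [Reading("station-A", float(t), a) for t, a in enumerate(amplitudes)]


def eager_interpreter(rs, threshold=3.0):
    """Declares an earthquake if any single reading exceeds the threshold."""
    return any(r.amplitude > threshold for r in rs)


def cautious_interpreter(rs, threshold=3.0, min_run=3):
    """Declares an earthquake only after min_run consecutive high readings."""
    run = 0
    for r in rs:
        run = run + 1 if r.amplitude > threshold else 0
        if run >= min_run:
            return True
    return False


print(eager_interpreter(readings))     # True
print(cautious_interpreter(readings))  # False
```

The seismometer never sent an "earthquake coming" message. Both conclusions live entirely in the interpreters, and the record is the same for both.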

If you meant to convey data alone makes for weak messages/ambassadors, OK. Richer messages will bottom out at more data. Data remains a perfectly useful and more fundamental idea than message. In any case, I thought we were talking about data, not objects. I don't think there is a conflict between these ideas.

He's saying even the ambassador and the other person that the ambassador is talking with, they're exchanging messages with data. Message is going to be a higher level idea than data. It all is data at the end. It's just words that are going back and forth between people. It's got to be a useful idea.

We're talking about whether data is a good idea or not, not about whether there might be some other good idea, like objects. There's no conflict. You can have two good ideas. Alan Kay responds: "Second paragraph. How do they know they are even bits? How do they know the bits are supposed to be numbers? What kind of numbers, relating to what?" He's commenting on the second paragraph, which is that paragraph about the seismometer.

Alan Kay again is pointing at this fundamental idea that, you might have a signal, but you need an interpreter. All the way down to the very bottom, you need something to interpret. Like I was saying before, in your RAM you have transistors that are carrying the charge. A positive charge is a one and a negative charge is a zero. That's already one level of interpretation, whether it's positive or negative.

On top of that, you're interpreting this collection of charges as a byte. Then, maybe you put four bytes together to make a number. There's interpretation after interpretation after interpretation, all the way down to you're basically figuring out what charge is happening in a certain piece of material, like a microscopic piece of material. You're interpreting it. Then, of course, your program's interpreting that number in a certain way.
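The stack of interpretations can be sketched in a few lines. This is only an illustration under simplifying assumptions: the voltages are invented, the 0.5 threshold stands in for whatever the hardware really does, and the final step of calling the number a weight in pounds is a hypothetical domain meaning.

```python
def to_bits(voltages, threshold=0.5):
    """Layer 1: interpret each analog level as a one or a zero."""
    return [1 if v > threshold else 0 for v in voltages]


def to_byte(bits):
    """Layer 2: interpret eight bits as one unsigned number."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


# Invented analog levels, one per memory cell.
voltages = [0.1, 0.2, 0.9, 0.1, 0.8, 0.2, 0.9, 0.1]

bits = to_bits(voltages)
number = to_byte(bits)

# Layer 3: the program decides what the number means in the domain.
weight = {"quantity": "weight", "value": number, "unit": "lb"}

print(bits)    # [0, 0, 1, 0, 1, 0, 1, 0]
print(number)  # 42
```

Each layer is a choice made by an interpreter. Change the threshold, the bit order, or the unit, and the same charges mean something else.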

Eventually, you're dealing with some domain terms. That is cool. That is magic. Rich Hickey is going to respond, he's like, "That's not what we mean by data. That's certainly not what the dictionary means by data." Let's see what he said.

Rich Hickey, "It contravenes the common and historical use of the word data to imply undifferentiated bits/scribbles. It means facts/observations/measurements/information. You must at least grant it sufficient formatting and metadata to satisfy that definition."

It's finally coming out more clearly what Rich Hickey means by data. He's assuming, because of the dictionary definition, that it doesn't just mean bits in memory. It means an actual fact recorded in such a way that you can read it. It's understood how it should be read.

Of course, you can interpret it in different ways. You can interpret, as an example if you write down, let's say your weight every day, which a lot of people do. You record your weight. You could interpret, "Oh, my weight is going up. Oh, I'm losing weight." That's one interpretation.

Another interpretation that someone else could draw is, this person was concerned about their weight. They took the time every day to write it down. There's always multiple ways to interpret it. You write it down in such a way that it is clear that it is a weight. If you have at least something in common with the other person. You can read numbers. You can read the units and stuff like that, but that is what Rich Hickey was saying.

He's saying there are other issues, like whether you can read the numbers or whether you understand the units. Let's say, in a thousand years, they're not using kilograms anymore, and you don't understand what a kilogram is. That happens. You read an old book, and they use some weird measurement system like furlongs, and you're like, how long is that? You would still say that's data. It could be interpreted. He's going back to the dictionary, and relying on that.

Again, of course, he's correct given the interpretation that he is making of the word data. I think that he's absolutely right. Is he digging too much into, do we have the same definition, and not reading into what Alan Kay could mean by this? To be fair, he was trying at the beginning. He was asking a lot more questions like, "What do you mean, it's a bad idea? How could that be possible?"

Then Alan Kay was picking on interpretation, it always requires interpretation. It finally comes out. He's like, "Yeah, but that kind of base-level interpretation is included in the definition of data." He gives an example, you can't write the number 42 and call that data. It doesn't contain enough information to interpret what that might mean.

At this point, he's right in a technical sense, but he's wrong conversationally. He should realize that, "Oh, wait, maybe my definition of data is not the one that Alan Kay is using." Instead of giving up, which I feel is what he does, he should try to piece together what Alan Kay is trying to say, if he wants to continue.

Alan Kay stops in this thread for a minute, or for the rest of this thread. Someone else jumps in and tries to explain what Alan Kay is thinking. What Alan was getting at is that what you see as data is, in fact, at its basis just signal and only signal. A wave pattern, for example, but even calling it a wave pattern suggests interpretation.

What he's trying to get across is there is a phenomenon being generated by something, but it requires something else, an interpreter, to even consider it data in the first place. Rich Hickey understands that. That's what he's saying. Then, Rich Hickey says, "If we can't agree on what words mean, we can't communicate. This discussion is undermined by differing meanings for data to no purpose."

This clearly states that Rich Hickey thinks that they're disagreeing about the term, but he never asks, "What is your definition of data?" He's digging in and is like, "I looked it up in the dictionary. I can't be wrong." Technically I think he's right, that this is a very good definition of data. It is a very useful idea. But he shouldn't dig in. He should instead say, "OK, we have a different definition of data. What is your definition?"

Rich Hickey continues, not the way I suggested but in a different way. "The defining aspect of data is that it reflects a recording of some facts/observations of the universe at some point in time. This is what data means and meant long before programmers existed and started applying it to any random updatable bits they put on disk."

Interpretation of those observations is completely orthogonal. Data is not merely a signal. What constitutes minimum sufficiency of data is a useful and interesting question, e.g., should data always incorporate time? What are the trade-offs of labeling, being in or out of band per datum, or data set? How to handle provenance, etc.

That has nothing to do with data as an idea and everything to do with representing data well. He goes on later and says this, "Sometimes we want the facts, and other times we want someone to discuss them with. That's why there's more than one good idea. Data is as bad an idea as numbers, facts, and record-keeping. Science couldn't have happened if consuming and reasoning about data had the risk of interacting with an ambassador." Ambassadors can lie. They might not have the facts. They might have different facts. You want to do science by looking at the data.

I agree with Rich Hickey on his technical points. Data is a good idea, and it's worth answering the other questions that he said were interesting. What are the trade-offs of labeling being in or out of band? In-band labeling means you have something like JSON, where the keys are strings carried along with the values, so you can interpret them in band. Out-of-band labeling is something like Protocol Buffers, where there's a schema that defines what each thing in the structure means, but you have to look at the schema if you're a human to figure out what to do with it.
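Here is a rough sketch of that trade-off. The record and its field names are invented for illustration, and the out-of-band half uses Python's struct module as a stand-in. It is not the real Protocol Buffers wire format, only the same idea of a schema kept apart from the bytes.

```python
import json
import struct

# In band: the labels travel with the values.
in_band = json.dumps({"name": "Eric", "weight_lb": 180.5})

# Out of band: the labels live in a schema held separately.
# Each entry is (field name, struct format code); this schema is hypothetical.
SCHEMA = [("name", "8s"), ("weight_lb", "d")]


def struct_format(schema):
    """Build a little-endian struct format string from the schema."""
    return "<" + "".join(code for _, code in schema)


def encode(record, schema):
    values = []
    for name, _ in schema:
        value = record[name]
        values.append(value.encode("ascii") if isinstance(value, str) else value)
    return struct.pack(struct_format(schema), *values)


def decode(payload, schema):
    values = struct.unpack(struct_format(schema), payload)
    record = {}
    for (name, _), value in zip(schema, values):
        if isinstance(value, bytes):
            value = value.rstrip(b"\x00").decode("ascii")  # strip struct padding
        record[name] = value
    return record


out_of_band = encode({"name": "Eric", "weight_lb": 180.5}, SCHEMA)

print(in_band)       # readable on its own
print(out_of_band)   # opaque bytes without SCHEMA
print(decode(out_of_band, SCHEMA) == json.loads(in_band))  # True
```

The JSON string can be read by a human with no other help. The packed bytes are more compact, but they are unreadable to anyone who doesn't also hold the schema.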

These are important discussions, and they're all about data as an idea, but the fact that there are better and worse implementations of data is separate from the idea of data and whether it's a good idea. Rich Hickey got a lot of attention there because he wrote more, and it was a good idea and worth digging into, but Alan Kay does answer other people's questions in response to his same initial question. Remember, that initial question is, what if data is a really bad idea?

If we look at some of the other stuff that Alan Kay has said, one thing he says, in "The Early History of Smalltalk," which there is a podcast episode on, is he talks about wanting to get rid of the idea of data structures. From my perspective in 2021, it sounds equally silly of an idea to get rid of data structures. Data structures, that's an important idea.

We couldn't have good computer science and good high performance programs without them, but it's useful to look at it. There's two things that I've learned about this line of inquiry that I think I should bring up. First, Alan Kay is not as careful with his words as Rich Hickey is. His mind operates quickly and with a synthesis of ideas from all sorts of places.

Also, he's writing from within the context of his own time, when he wrote these things. So take him writing about wanting to get rid of data structures. If he wrote that in the early '80s, but referring to a time in the '60s and '70s, maybe that means something different at that time. Data structure to me is something like HashMap, LinkedList, even something like an ArrayList. These are things that I'm used to having given to me as part of a library.

I have to know their basic properties: this one has linear sequential access, this one has random access, etc. But we have to look at it from his perspective and interpret what he was talking about. In programs at the time that he was thinking about these ideas, a data structure was a fundamental part of your software. If you were writing a custom piece of software, you might have to write, let's say, a routine to do spell checking.

This routine would have to quickly find, given a word or even a misspelled word, does this word exist in the dictionary, and then what are the words that are near it in terms of edit distance? This is a hard problem if you have a slow machine and not a lot of memory.

You might write a custom data structure, meaning, you are writing basically a tree, pointers and little arrays of memory that are structs that represent different things, and they point to other things. Your code has to know how to walk along these pointers and locate the piece of data that you need that is stored in this structure.

You're going to have this, maybe a for loop that walks along this structure. Then one day you say, "Oh, I know. We can cut out a lot of the search if we do XYZ." You go into your for loop, and you add a little if statement that says, "If this is the case, then we can cut out this, and we'll skip the left branch, and we'll go right to the right branch. Otherwise, we have to go to the left branch."

You are writing very custom logic into the walker, into the for loop that walks along this data structure. That is what he means by, get rid of data structures. That's the data structure that he's talking about when he says get rid of them. We don't want our business logic, to use the word we use today, to be mixed in with the logic for walking pointers. Everything has to be custom.

We want some modularity between, you're going to get a linked list but it's totally generic. You're going to get a B-tree, and it's totally generic. We'll give you an interface for walking it. You're not going to develop your own custom ones every single time with their own custom algorithms, as you keep adding to and finding special cases for and modifying.

That's where we go wrong. That's why nothing scaled because we're always just in our for loop. We need scale. Maybe, you can find a better data structure than the B-tree for your particular problem, but it'll have the same interface. You'll be able to swap it out. That's the magic of the modularity that you get with object-oriented programming. It's not that there aren't data structures, it's not the thing that you focus on. It's not where all of your code winds up.

You write the B-tree once, and you reuse it, and it's fine. This was a hard problem. No one knew how to do that back then. It took creating this whole system of classes with constructors, that they knew how to build themselves and initialize themselves. Then message passing so that you could interpret the message.

Say, the message is basically, "Give me a sequence of all the nodes. Walk the nodes in order, like depth-first search. I'll tell you when we're done if we find what we're looking for." This is an interface onto a B-tree, but you could imagine that same interface working on other data structures as well.
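A minimal sketch of that kind of interface, under some loud assumptions: Python's iterator protocol stands in for message passing, a plain unbalanced binary search tree stands in for the B-tree, and the names LinkedList, BinaryTree, and find_first are made up for this example. The caller asks for the values in order and stops when it finds what it is looking for, without ever touching a pointer.

```python
class LinkedList:
    """A singly linked list; the (value, next) cells stay private."""

    def __init__(self, items=()):
        self._head = None
        for item in reversed(list(items)):
            self._head = (item, self._head)

    def __iter__(self):
        node = self._head
        while node is not None:
            value, node = node
            yield value


class BinaryTree:
    """An unbalanced binary search tree, standing in for a B-tree."""

    def __init__(self, items=()):
        self._root = None
        for item in items:
            self.insert(item)

    def insert(self, value):
        if self._root is None:
            self._root = [value, None, None]  # [value, left, right]
            return
        node = self._root
        while True:
            side = 1 if value < node[0] else 2
            if node[side] is None:
                node[side] = [value, None, None]
                return
            node = node[side]

    def __iter__(self):
        # In-order walk with an explicit stack; callers never see the nodes.
        stack, node = [], self._root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node[1]
            node = stack.pop()
            yield node[0]
            node = node[2]


def find_first(container, is_match):
    """Business logic: works on anything that can hand out its values."""
    for value in container:
        if is_match(value):
            return value  # "we're done", so the walk stops here
    return None


words = ["pear", "apple", "fig"]
print(find_first(LinkedList(words), lambda w: w.startswith("f")))  # fig
print(find_first(BinaryTree(words), lambda w: w.startswith("f")))  # fig
```

The search code is the same for both containers, so one structure can be swapped for the other without touching it. The pointer walking is written once, inside each container.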

Your code has to know how to work with that interface and not how to follow pointers. Someone once said that one of the things object-oriented programming got rid of was the idea of pointers. You're not dealing with structs in memory which point to each other anymore. You're dealing with references to discrete objects.

It's not continuous memory that you have to wade through. It is discrete objects with references to discrete objects with discrete APIs. That's a good way to think of it. With this interpretation of what he meant by, get rid of data structures, could he possibly mean something similar? What if we were generous with our interpretation of his question, "What if data is a bad idea?" A bad idea for what? There's not a lot of context there. Later on, he does give some context.

Here's one of the responses to one of the questions. "This is why the objects of the future have to be ambassadors that can negotiate with other objects they've never seen. Think about this as one of the consequences of massive scaling." This is more context: massive scaling of software systems. We want our systems to be able to scale more than linearly and so far, this is my interpretation, we're always working on a super linear curve that has an elbow when we're scaling our system.

We're writing systems and at some point it slows down to the point where no new features happen. All we've been able to do so far is slightly push out the elbow. We're still in that linear, pre-elbow part of the curve. At some point, even all the best practices, all the best languages, at some point still hit an elbow and go super linear.

We need to find ways of pushing out that elbow. He's trying to push out the elbow to galactic scales. What if data is the thing holding us back? That's more context. Of course, it's not all in here. I've read a lot of stuff from Alan Kay, a lot of his talks and everything. Let's read another response.

"Elsewhere in this AMA, I mentioned an example of this. A resurrected Smalltalk image from 1978 off a disk pack that Xerox had thrown away, that was quite easy to bring back to life. It was already virtualized for eternity. This is another example of trying to think about scaling, in this case temporally, when building systems. The idea was that you could make a universal computer in software that would be smaller than almost any media made in it."

Maybe he does go into this in another section. I'm not going to read it. I'll comment on it. He talks about TCP/IP, and how that was a great idea. It allows computers of different architectures and different machine codes and everything to communicate. It's a very small spec. He said one of the secrets was that the message did not have any structure except for the envelope. It had the packet and defined how to interpret a packet. You could put anything in there.

Other systems, for efficiency or general lack of vision, wanted to put stuff in there, like more structure. TCP/IP succeeded because they said, "No, we don't know what people want to send. We don't know what computers want to send. We can't possibly think a thousand years into the future what they will want to send. We can come up with this general envelope and make hardware and software that can open up the envelope and have this packet."

Then there's also the slash in TCP/IP. IP talks about the packets. TCP has a very lightweight process for doing a handshake, keeping a socket open, and reconstituting a stream of data from individual packets. It deals with packet loss. It deals with out-of-order packets. It deals with repeated packets. In general, it works to reconstitute this.
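As a toy illustration of that reconstitution, and emphatically not the real TCP algorithm, here is a sketch that numbers whole packets from zero. Real TCP numbers bytes rather than packets and adds acknowledgments, retransmission, and windowing. The function name reassemble and the packet layout are assumptions made up for this example.

```python
def reassemble(packets):
    """Rebuild a byte stream from (sequence_number, payload) pairs.

    Returns (stream, missing), where missing is the first sequence
    number that is still needed, or None if nothing is known to be lost.
    """
    by_seq = {}
    for seq, payload in packets:
        by_seq.setdefault(seq, payload)  # repeated packets are ignored

    stream = bytearray()
    next_seq = 0
    while next_seq in by_seq:            # out-of-order arrival is fine
        stream += by_seq[next_seq]
        next_seq += 1

    # A later packet arrived but an earlier one did not: that is a gap.
    gap = any(seq > next_seq for seq in by_seq)
    return bytes(stream), (next_seq if gap else None)


# Arrival order is scrambled, packet 1 is duplicated, packet 2 is lost.
arrived = [(1, b"lo, "), (0, b"hel"), (1, b"lo, "), (3, b"ld")]
print(reassemble(arrived))                  # (b'hello, ', 2)

# After the lost packet is sent again, the stream completes.
print(reassemble(arrived + [(2, b"wor")]))  # (b'hello, world', None)
```

Notice that the function never looks inside a payload. The envelope carries just enough structure to put the pieces back in order, and what the bytes mean is left to whoever reads the stream.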

What he's getting at is, we need another thing on top of that to deal with computation. We've got the "data layer" figured out. We know how to get streams of data whizzing around the Internet. What we don't know how to do is to have something that is the actual interpreter of that data. Yes, it would be built on data, fine. We shouldn't be building these bespoke pieces of software that run on a particular platform, and then send data to each other because now you're back down to the data layer.

What we want to be able to do in order to scale better, is have some other thing, let's say a small virtual machine that could safely run software from somewhere else, that would ride on top of the TCP layer, and allow for a richer way of communicating, of computing, than simply sending data. That is what I think he means by, what if data is a bad idea? We have the way to send data. What we don't have is a way to send the ambassador, this interpretation. Here's some data. This is how I think you should interpret it.
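To give a feel for what sending the interpretation along with the data could look like, here is a very small sketch. Everything in it is an assumption made for illustration: the four-instruction stack machine, the message layout, and the name run are invented, and this is not anything Alan Kay proposed in the thread. The sender ships a tiny program with the data, and the receiver runs it on a machine that can only do arithmetic.

```python
def run(program, data, max_ops=100):
    """Run a tiny stack-machine program against the data it arrived with.

    The machine has no loops, no I/O, and no access to anything except
    the data, which is what keeps this toy version safe to run.
    """
    if len(program) > max_ops:
        raise ValueError("program too long")
    stack = []
    for instruction in program:
        op, args = instruction[0], instruction[1:]
        if op == "load":       # push a named field from the data
            stack.append(data[args[0]])
        elif op == "push":     # push a constant
            stack.append(args[0])
        elif op == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    return stack.pop()


# The message carries the data and a suggested way to read it:
# "this reading is in pounds; multiply by 0.45359237 to get kilograms."
message = {
    "data": {"reading": 42.0},
    "program": [("load", "reading"), ("push", 0.45359237), ("mul",)],
}

kilograms = run(message["program"], message["data"])
print(round(kilograms, 3))  # 19.051
```

The receiver did not need to know about pounds ahead of time. It only needed to agree on the tiny machine, which is a much smaller thing to agree on than every unit and format.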

This is a way to understand this data. More to the point, going back to what Rich Hickey was saying, the idea of data has to assume that it doesn't bottom out with signal, that there's some basic idea of meaning in it, somehow encoded in it in some way that you expect to be able to read back in. That's assumed in there. 42 is not a fact, but 42 pounds is a fact. Better would be, "Eric weighs 42 pounds." I don't weigh 42 pounds. At this time, this person weighed this much on this scale. You can elaborate what a good record of that would be.
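One way to picture the difference between a bare 42 and a good record is the sketch below. The field names, the person, the date, and the scale are all invented for illustration. It is one possible elaboration, not a standard format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

bare = 42  # a number with nothing to say what it measures


@dataclass(frozen=True)
class WeightRecord:
    subject: str           # who was weighed
    value: float           # the measurement itself
    unit: str              # how to read the number
    measured_at: datetime  # when the observation was made
    instrument: str        # provenance: which scale produced it

    def describe(self):
        return (
            f"{self.subject} weighed {self.value} {self.unit} "
            f"at {self.measured_at.isoformat()} on {self.instrument}"
        )


# All values here are hypothetical.
record = WeightRecord(
    subject="a hypothetical person",
    value=42.0,
    unit="lb",
    measured_at=datetime(2021, 1, 1, 8, 0, tzinfo=timezone.utc),
    instrument="scale-1",
)

print(record.describe())
# a hypothetical person weighed 42.0 lb at 2021-01-01T08:00:00+00:00 on scale-1
```

The record is frozen because it is an observation made at a point in time. A later weighing would be a new record, not an update to this one.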

What Alan Kay is saying is, "Yes, in human terms we've all agreed. We've been to school. We know how to read. We know how to read numbers, and interpret dates, and English. We know the units that we're all using. What if we didn't have to? What if we could have a system where you could communicate, let's say with another planet?"

"Like people on Mars develop their own network. People go to Mars, and then there's some divergence. They have their own computer factories there. We have computer factories here. They have different gravity on Mars, maybe weighing using pounds is not an option there. There's all sorts of stuff. You're diverging, and now you want to communicate again."

He's thinking galactic scale. You don't want to send a stream of data, you want to send an interpreter. Look, these are bytes I'm sending. TCP understands the idea of bytes. We've got that covered. The idea of pounds, the idea of a date, all of this has to be agreed upon out of band. And computers are diverging, whereas in some ways human education is converging. We all go through the basics at school.

We're all taught to be similar enough to communicate. Computers are diverging, what is the minimum amount that they need? What is that minimum schooling that a computer needs to be able to communicate that one level higher than data? That's my interpretation of this question. He didn't phrase it that well. What if data is a bad idea?

That invites all sorts of discussion about the definition of data. What does it mean to be a bad idea? Can something be a bad idea but still worth questioning? Those are the things that Rich Hickey was trying to explore and open up. I have this trouble a lot with Alan Kay talks where he'll say something like this. He means something big. The words don't express how fundamental he's trying to get. It's a very good line of inquiry. It's a very good research topic.

How could we scale the computation of the Internet, not just the data sending abilities? That's basically what it has now. In one of the responses, he talks about how like, "Well, you could think of JavaScript as a bad VM, but at least it's universal. You could compile down to JavaScript. You could define your VM on top of that in terms of JavaScript, at least because then it's universally runnable." All these things are adding up to this interpretation that I have. He's asking something, this unanswered question.

It's the next step in what the people who developed the Internet were trying to do, which is to build a very small layer, and it's got to be small, on top of the network stack that we have, one that would allow programs to move freely. How can you do that safely?

These are all very open questions that we don't know the answers to yet. They're good lines of research. What would it look like? How would a program that's sent over the wire, reconstituted, and started running communicate with the other things that are running on the same machine? How would it find what to communicate with? Because it can't be a complete program.

We know that any program has all these dependencies and stuff. You don't want to send everything every time. Maybe you do, maybe that's an open question as well. If you're a genie out of a bottle and now you need to get stuff done to fulfill your purpose, what do you do? How do you communicate this?

He talks about stuff like you could start to establish common baseline things and some kind of basic protocol that's defined in that virtual machine. All these things are worthy research topics. I'm glad someone is banging their shoe on the desk to bring these back up.

Thank you so much for being there, for listening, for watching. My name is Eric Normand. This has been another episode of my podcast. You can subscribe. That would be great, because there's other episodes coming soon. I'll end it there. Thank you for being there. As always, rock on.