Tension between data and entity

There is always a tension in our programs between raw data and meaningful information. On the one hand, data is meaningless alone. On the other, we want to treat it as a thing with real semantics that constrain its usage. How do we live with both of these at the same time?


Eric Normand: Isn't there a tension between treating something as just data, because that's what it is, it's just data about the thing, and treating it like an entity with its own semantics?

Hi, my name is Eric Normand. These are my thoughts on functional programming.

Yes, I do think there is a tension. I think that that's where a lot of functional programmers live, in that tension. We recognize that we're dealing with information systems which means that, at the end of the day, it's just some data that we're dealing with.

We're processing data. We're storing it. We're transforming it, making calculations from it, all sorts of things like that. We also recognize that this data is meaningful, that it's supposed to have some kind of semantics. At the end of the day, an integer is not just an integer. It's somebody's height. Or a string is somebody's name. It's not just a string.

There is a tension between these two things. We have to live with that. All programmers live with this tension. The question is how do different languages deal with it? How do different paradigms...? Maybe the paradigm has something to do with it. Maybe it doesn't.

A reader or a listener wrote in with a question that asked me to talk more about the tension between the two and how in the previous episode, maybe a couple of months ago now, I was talking about putting some interfaces around your data so that you could treat it like entities.

This is very close to a Clojure-specific recommendation. In Clojure, we tend not to make new types, new classes. We tend to use HashMaps to represent our entities. If you leave it as a HashMap, you leave it as data.

The problem that you run into a lot is you have all these HashMaps, and you don't know what you have anymore. You just have a bunch of HashMaps. Sometimes, they're deeply nested. People get into confusion. They forget what keys are valid. They forget where they're at in the tree of HashMaps, the big, nested data tree. They don't know what they've got at each level.

What I suggested there was that that problem shouldn't happen if you carefully abstract your entities with interfaces. Someone put it really well at one of the Clojure conferences. I had a conversation with him. He said that he doesn't get into this problem either.

The problem is people create these big types. If you were going to describe a nested Clojure data structure as types, they would have this really big type description. It would be a company has multiple people. Each person has an address. The address has a street. The street has a number and a street name.

You're starting at the top. You are describing this whole path down into the tree. That's just part of it. The address also has a zip code. It also has a city, and a state, and maybe a country. It's got all this information in it that is described from the top.

Now, what you want is something that is shorter, multiple short things. You say a company has people, end. A person has an address, end. An address has a street number, and a street name, and a zip code, end. You have these short types that are easy to picture, easy to see in your head, easy to read.
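As a rough sketch of those short types, here they are written in Python rather than Clojure, with plain dicts standing in for hash maps. The field names and the sample values are assumptions for illustration. Each description stops after one level and refers to the next entity by name only.

```python
from typing import List, TypedDict


class Address(TypedDict):
    # An address has a street number, a street name, and a zip code, end.
    street_number: str
    street_name: str
    zip_code: str


class Person(TypedDict):
    # A person has an address, end.
    address: Address


class Company(TypedDict):
    # A company has people, end.
    people: List[Person]


# At runtime this is still an ordinary nested dict (hypothetical sample data).
acme: Company = {
    "people": [
        {"address": {"street_number": "12", "street_name": "Oak St", "zip_code": "70115"}}
    ]
}
```

None of the three descriptions mentions anything more than one level below itself, which is what keeps each one easy to hold in your head.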

You never, at any point, have to think down into them. You never have to think too deep into them, maybe two levels deep. I think that one is enough. My suggestion was that you should start thinking of those individual pieces and what operations you can do to them.

My suggestion was you take those entities (the company, the person, the address) and you think about the operations that make sense for those particular entities.

A company doesn't have any operations on it that can change a person's address. A person can change their address, but a company cannot change a person's address. There's a semantic barrier there. There's a level of meaning where this can't happen. That's where the entity boundaries come in. That's where you put your interface.

That isn't to say that you don't still have the exact same nested data structure in the end, but when you have a company, you can only do those operations that you have specified for a company.

If you're like, "Oh, give me the list of addresses of all the people, of all employees," that might not be an operation on company. You would have to go and ask for all the people. Then you would get an address for each person. You never have this problem of something like a get-in where you've got this long path going in there.

You should never know the path because if you're starting at a company, it doesn't know two levels in that people have an address. I'm trying to re-explain that situation and why I suggested that you should be thinking about an interface.
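A minimal sketch of that idea, in Python with dicts standing in for Clojure hash maps. The function names and the sample data are hypothetical. Each function knows exactly one level of the structure, and there is deliberately no company-level operation that changes an address.

```python
def company_people(company):
    """Company operation: the only thing a company knows is that it has people."""
    return company["people"]


def person_address(person):
    """Person operation: read this person's address."""
    return person["address"]


def person_change_address(person, new_address):
    """Person operation: return a new person with a different address."""
    return {**person, "address": new_address}


acme = {
    "people": [
        {"name": "Ana", "address": {"zip_code": "70115"}},
        {"name": "Ben", "address": {"zip_code": "10001"}},
    ]
}

# "Give me the addresses of all employees": ask the company for its people,
# then ask each person for an address. No long path is written anywhere.
addresses = [person_address(p) for p in company_people(acme)]

# Changing an address goes through the person, never through the company.
ana = company_people(acme)[0]
moved = person_change_address(ana, {"zip_code": "94103"})
```

The caller composes two one-level operations instead of knowing the path from the company down to the address.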

Usually, it's an interface made of functions, but it could just be keys, and you can access these keys like functions. You don't have to totally wrap it up, but by doing that, you're able to have the best of both worlds, the best of that tension.

You can treat it like an entity. You can treat a company like a company even though it's just a HashMap. You get to treat it like data. You get to treat it like a HashMap. All your functions, like select-keys, listing the keys, turning it into a seq of key-value pairs, all of those things are still possible.
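To illustrate that both views stay available, here is a small Python sketch with a dict standing in for the hash map. The `select_keys` helper is a hypothetical stand-in for Clojure's select-keys, and the "name" and "city" keys are assumptions for illustration.

```python
def company_people(company):
    """Entity-level view: a company has people."""
    return company["people"]


def select_keys(m, keys):
    """Data-level helper: keep only the listed keys that are present."""
    return {k: m[k] for k in keys if k in m}


acme = {"name": "Acme", "city": "New Orleans", "people": []}

# Treat it like a company.
people = company_people(acme)

# Treat it like plain data: the generic map operations still work.
summary = select_keys(acme, ["name", "city"])
keys = list(acme)           # list the keys
pairs = list(acme.items())  # sequence of key-value pairs
```

Nothing was wrapped or hidden, so the generic operations and the company-specific one work on the same value.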

At the end of the day, you are still able to treat it like data, but when it's at the appropriate semantic level, you can treat it like a company, not treat it like a nested tree of HashMaps.

I think that what happens is people are acting at the wrong semantic level when they get into this problem of, "I don't know what I have anymore. I've got these deeply-nested paths. If I change something deep in here..."

Let's say you changed address. You no longer separate out the street number from the street name. You just have one string called Street One. You have another one called Street Two. You can put the apartment number in there or something. That's very common. You changed the structure of address.

Before, you had all of this code that deeply knew the whole path of keys from a company through the employees into the address and down to the street number. All of that now has to be changed. You have to go find those places. I'm arguing that they should never have existed, precisely because you're coupling this address thing with a company.
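A sketch of what that change looks like, in Python with dicts standing in for hash maps. The key names "street1" and "street2" and all function names are assumptions for illustration. Code that spelled out the old path breaks, while code that goes through the address-level function does not have to change.

```python
def company_people(company):
    return company["people"]


def person_address(person):
    return person["address"]


def address_street_lines(address):
    """The one place that knows how an address stores its street lines."""
    lines = [address["street1"], address.get("street2")]
    return [line for line in lines if line]


# Address after the change: no separate street number and street name.
acme = {"people": [{"address": {"street1": "12 Oak St", "street2": "Apt 3"}}]}

# Code that hard-coded the old path from the company down now fails.
try:
    acme["people"][0]["address"]["street_number"]
    deep_path_broke = False
except KeyError:
    deep_path_broke = True

# Code written against the entity-level functions keeps working.
street_lines = [address_street_lines(person_address(p)) for p in company_people(acme)]
```

Only `address_street_lines` had to learn the new layout. Every deep path scattered through the code would have needed its own fix.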

A person's address and a company just don't...They're not related enough to justify knowing this path along it. They're working at the wrong semantic level. When they're thinking about it as a company with employees who have addresses, they are treating it like a HashMap. That's a low semantic level. That's like the language level.

They haven't built up another layer of meaning where it's a company with operations that make sense for companies. That tension there is always there. What semantic level am I playing at? It's your job as the programmer to always try to be separating these things out, these levels out.

This gets into the stratified design that I've talked about before. If you're treating an entity like a HashMap, that's a smell. If you've got these really long paths to get data out from a certain part of the tree, you're probably working at the wrong level.

I think that there's a law in object-oriented...I want to say it's Law of Demeter, but I might be wrong, this thing where you're not supposed to ask for the company's employees' addresses' street name.

You're not supposed to have this long chain of getters. If you need to do that, it probably means that there's a semantic operation, like getting a mailing list for employees.

Employee mailing list, that could be a valid operation on a company. You shouldn't be embedding this long chain somewhere else. You should just call it in one place: the company's employee mailing list. It's a similar idea. You're trying to work at the level of meaning where you're at in your problem.
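One way such an operation could look, sketched in Python with dicts standing in for hash maps. The function names, the label format, and the sample data are all assumptions for illustration. The chain of lookups lives in one named company operation, built from one-level functions.

```python
def address_label(address):
    """Address operation: format one address as a single mailing line."""
    return f'{address["street_number"]} {address["street_name"]}, {address["zip_code"]}'


def person_address(person):
    return person["address"]


def company_people(company):
    return company["people"]


def company_employee_mailing_list(company):
    """Company operation: one mailing line per employee."""
    return [address_label(person_address(p)) for p in company_people(company)]


acme = {
    "people": [
        {"address": {"street_number": "12", "street_name": "Oak St", "zip_code": "70115"}},
        {"address": {"street_number": "7", "street_name": "Elm Ave", "zip_code": "10001"}},
    ]
}

mailing_list = company_employee_mailing_list(acme)
# mailing_list == ["12 Oak St, 70115", "7 Elm Ave, 10001"]
```

Callers ask the company for its mailing list and never see the path to the street name.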

You should always be asking yourself, "Is this a HashMap problem? Why am I doing a lot of assocs and gets? Why am I doing a lot of update-ins and get-ins?" This is Clojure speak here. "Why am I doing a whole bunch of array accesses in JavaScript here when I know this is a company? It should have operations that are specific to companies."

The thing in Clojure is that we separate out the behavior from the data. A lot of times, you get into this trouble where you're thinking, "Well, it's just data, so let me just always treat it like data." In an object-oriented world, you have a place to put the behavior, namely, the class, and you start adding semantics into the class.

The problem with that is it's all encapsulated. You now no longer have all of the HashMap semantics that you had before. I'm starting to ramble now, so I'm going to sign off. My name is Eric Normand. This has been a thought on functional programming.

Sorry for the light streaming in. Hope it didn't interrupt too much of my train of thought in your ears. If you'd like to ask a question, I like answering questions. I feel like now I'm at the point in this podcast where I've got enough audience, where I'm getting a regular stream of questions, which is actually pretty nice.

You can reach me on Twitter @ericnormand. You can also email me, eric@lispcast.com. Finally, on LinkedIn, follow me if you're into the LinkedIn. Awesome. See you later. Bye.