Two kinds of data modeling

Through conversations, I've realized that there are two distinct kinds of data modeling. They have their own constraints and needs and, consequently, their own techniques. In this episode, I explore what makes them different.

Transcript

Eric Normand: In this episode we are going to talk about two different kinds of data modeling. My name is Eric Normand and I help people thrive with Functional Programming.

Data modeling is an important, if somewhat advanced, skill. In my book, we don't really get to data modeling until part three, so about Chapter 16, let's say. I've realized, as I've been talking about data modeling on this podcast and online, that there are actually two different kinds of data modeling.

Sometimes we'll have a conversation where I'm talking about one kind of data modeling and the other people are talking about another kind. That's why the conversation is not jelling. It's like we're not agreeing, even though they're saying stuff that I agree with, and I think that they would agree with me if they understood where I was coming from.

To help clarify this, to help understand this, I'm working out these ideas, because I have not really started on part three. It's just ideas right now. It's not very organized. That's what this podcast is about: me organizing these ideas in functional programming.

Let's get into it. What are the two types of data modeling? The first one is the modeling, the design decisions, the naming, the thinking about the structure of data that is exposed through your API. This is public-facing data.

Whenever you make a JSON API, you have to have some structure to that JSON that is going to be able to express everything that your API can do with that endpoint. Then people are going to make clients for that API, for that JSON endpoint.

They're going to need to be able to structure a JSON document in a way that gets the API to do what they need it to do. There's an art, and a craft, to building this API JSON such that it has clear names and the structure makes sense.

You choose the types correctly. For instance, if you want to have a DateTime, how do you make it clear what format the DateTime should be in? Is it a string? Are you using a Unix timestamp? What are all those choices that you have to make?
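To make the DateTime question concrete, here is a minimal Python sketch of the same instant written two ways. The field names `created_at` and `created_at_unix` are made up for illustration; they don't come from any particular API.

```python
import json
from datetime import datetime, timezone

# A hypothetical instant that a client wants to send to the server.
instant = datetime(2019, 5, 6, 14, 30, 0, tzinfo=timezone.utc)

# Option 1: an ISO 8601 string. A person can read it, and the UTC offset is explicit.
as_string = {"created_at": instant.isoformat()}

# Option 2: a Unix timestamp. Compact, but the reader has to be told that it
# means seconds (not milliseconds) since 1970-01-01 UTC.
as_unix = {"created_at_unix": int(instant.timestamp())}

print(json.dumps(as_string))
print(json.dumps(as_unix))
```

Both carry the same information. The difference is how much the client programmer has to look up before they can produce or read the value.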

You're making this developer-centered ergonomic data structure that is all about communicating intent to the server. Likewise, there's one as the response that's going to come back. There's some kind of JSON that comes back as the response that has to explain, "This is what happened and here's the state of the system."

It's going to have all this. You have to design it mostly for clarity. It has to be complete, it has to be able to express everything the API might need on the stuff coming in. Then it has to give back everything that the client might need in a way that's somewhat convenient.

Mostly, it's just for clarity. There's a lot of ergonomic concerns because you want your API to be somewhat easy to learn, somewhat easy to work with. You want it to be self-describing, so that errors or other weird conditions are clear, like what they mean.
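As a rough illustration of "self-describing," here is a small Python sketch of an error response that says what went wrong and where. The shape (`status`, `error`, `code`, `message`, `field`) is an assumption made up for this example, not a standard.

```python
import json

def error_response(code, message, field=None):
    """Build a hypothetical error body that explains itself to the client programmer."""
    body = {"status": "error", "error": {"code": code, "message": message}}
    if field is not None:
        # Point at the part of the request that caused the problem.
        body["error"]["field"] = field
    return body

resp = error_response(
    "invalid_datetime",
    "created_at must be an ISO 8601 string",
    field="created_at",
)
print(json.dumps(resp, indent=2))
```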

There's a lot of constraints on it, and a lot of those constraints are human constraints. It's a programmer who has to use it.

The computer can generate JSON all day, but as the programmer, you have to figure out, "How do I generate this JSON that this API needs so that the intent is clearly expressed to the server?"

That's the first type. I said JSON, but it could be any kind of data format. It could be XML, it could be EDN, whatever you have. It works the same.

I'm going to call this the API model. It doesn't have to be an API like on HTTP, it could be a library's API and what is the data that it needs and gets back. That's not important.

What's important is that it's where data comes out of the system and becomes globally available. It's interpretable by other people, by other systems, might be transferred over a wire.

It's all about public broadcasting of information, or public collecting of information where you don't have control over it, and you want it to be clear. The other kind of data modeling is what I'm calling "Internal."

There's External — that's the first one — and Internal. This is the data modeling where it's not something that's going to be exposed to the outer world. It's really just for use in an algorithm, maybe multiple algorithms.

You create a data structure that is more convenient for the computer's processing. No one's going to see it. You're never going to print it out. You're never going to convert it to JSON and transfer it over the wire.

It's just there as some intermediate representation in the algorithm. Usually, it's done for some kind of performance reason, like performance guarantees or algorithmic complexity guarantees.

For instance, if you need a small cache based on some key and value, you throw it in a hash map. You might have a little bit of data modeling that you need to do, like, what are the operations I'm going to use on this data? How do I get stuff out?

There's a little bit of data modeling going on there to make sure that the algorithm that uses that cache can do so efficiently. You make sure you put an interface on it so that the constraints all hold.
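The cache example might look something like this in Python. It's a sketch under assumptions: the class name `SmallCache`, the size limit, and the "evict the oldest entry" rule are all invented here to show an interface keeping a constraint true.

```python
class SmallCache:
    """A tiny cache over a dict. The one constraint: never more than max_size entries."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._d = {}          # terse internal name; nothing outside ever sees it
        self._max = max_size

    def get(self, k, default=None):
        # Average O(1) lookup, which is the reason to use a hash map here.
        return self._d.get(k, default)

    def put(self, k, v):
        # Adding a new key to a full cache evicts the oldest inserted key.
        # (Python dicts iterate in insertion order.)
        if k not in self._d and len(self._d) >= self._max:
            oldest = next(iter(self._d))
            del self._d[oldest]
        self._d[k] = v

    def __len__(self):
        return len(self._d)


cache = SmallCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)   # evicts "a"
```

Callers only use `get` and `put`, so the size limit holds no matter what the surrounding algorithm does.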

In this case, the names of things aren't quite as important as when you're doing public-facing data. I'm going to say they're decidedly less important. You could use one-letter names. It's not going to leak out, and so it doesn't have to have the same level of human readability as you have in the External data modeling.

Recently, I've been talking a lot more about this Internal data modeling in Clojure systems, where we don't have types. We just have, basically, hash maps, vectors, keywords, things like that. We build these data structures up.

These data structures we build up are often not designed. They're just kind of, "I have a hash map. Oh it's so easy to just throw another thing in there." Over time, little bits of code here and there as you add new features, add new things, nest them deeply.

It's easy to forget what's in there, what you can expect in certain parts of the code. It's the same trouble that you would have in a system that had mutable data. What's changed it up to this point in the code? Well, you don't know.

What is in the hash map by the time I get it? I don't know. There's no design going on. It just happens organically over time. What I say is you need some data modeling there. You need to sit down and model that, like what can we expect? What can't we expect? Is this the right structure that we want? That's all Internal.
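One lightweight way to write down "what can we expect?" is a check at a known point in the code. This is only a sketch in Python; the `order` map and its keys (`id`, `items`, `coupon`) are hypothetical.

```python
# What this stage of the pipeline is allowed to rely on (hypothetical keys).
REQUIRED = {"id", "items"}
OPTIONAL = {"coupon"}

def check_order(order):
    """Raise if the map has drifted from the shape we sat down and designed."""
    keys = set(order)
    missing = REQUIRED - keys
    unknown = keys - REQUIRED - OPTIONAL
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    return order

ok = check_order({"id": 1, "items": ["apple"], "coupon": "SPRING"})
```

The point is less the check itself than the fact that the expected keys are now written down in one place.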

This is stuff that's flowing through pipelines, maps, filters, reduces. It doesn't have any kind of interface, any kind of system to it. It's whatever happened to be thrown into that map and whatever keys. It's not ever going to get sent over the wire.

This is all Internal stuff. It's not for some tight-loop algorithm, but it is like the flow through the system. At the end, it's going to get turned into something that fits some kind of spec that the API...As you know, it's a response to an API request.

It's going to match the JSON that needs to go back out, but internally, it diverges. It goes wherever it needs to go. What I'm suggesting is we don't do enough of that internal data modeling. I have a lot of suggestions for how to do that.
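A small Python sketch of that last step, where the internal shape is turned into the public one right at the edge. The internal keys (`id`, `ln`) and the response field names are assumptions made up for this example.

```python
import json

# Internal shape: terse keys, integer cents, line items as (price, quantity) tuples.
internal = {"id": 42, "ln": [(1999, 2), (500, 1)]}

def to_response(o):
    """Convert the internal order into the public, clearly named shape."""
    return {
        "order_id": o["id"],
        "line_items": [
            {"unit_price_cents": p, "quantity": q} for p, q in o["ln"]
        ],
        "total_cents": sum(p * q for p, q in o["ln"]),
    }

response_body = json.dumps(to_response(internal))
print(response_body)
```

Inside the system the data can take whatever form suits the code. Only `to_response` has to know the public names.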

A lot of it is just sitting down and applying some proper discipline to it. One of those techniques is not doing deeply nested reaches down into a data structure.

Those tie together the place where the thing lives and its meaning. The path into the data structure is the place where it lives, and the meaning is what you can do to it. A deep reach ties those two things together, and you want to separate those out.

What you can do to it can go in its own function. You name it. You collect it so that all the things that you can do to that piece of data, regardless of where it lives, are collected together. Those you can collectively call the interface, and then where it lives, you should somehow shorten that.
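Separating "where it lives" from "what you can do to it" could be sketched like this in Python. The nested `order` map and every function name here are hypothetical.

```python
# A hypothetical nested order, as it might flow through a pipeline.
order = {
    "customer": {
        "addresses": {
            "shipping": {"street": "1 Main St", "city": "New Orleans", "country": "US"}
        }
    }
}

# Where it lives: the path is written down once, in one named function.
def shipping_address(order):
    return order["customer"]["addresses"]["shipping"]

# What it means: the operations on an address, wherever that address came from.
def format_address(addr):
    return f'{addr["street"]}, {addr["city"]}'

def is_domestic(addr, home_country="US"):
    return addr["country"] == home_country

label = format_address(shipping_address(order))
```

`format_address` and `is_domestic` are the interface. They work on any address, so if the address moves somewhere else in the map, only `shipping_address` changes.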

Usually, what you would want to do is have each layer of nesting know what's inside of it, and not look too deeply into it. This is like the Law of Demeter in object-oriented programming, where you don't want to have these really long method chains, where you know what's in that object and in that object and in that object.

There should be some limit to how deep you can reach into objects, because that really complects all the objects together.
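Applied to plain maps, the idea might look like the following Python sketch, where each function looks exactly one level down and hands the rest to the next layer. The keys and function names are invented for the example.

```python
# Each function knows only the layer it is handed.

def address_city(addr):
    return addr["city"]

def customer_shipping_city(cust):
    # A customer knows it has a shipping address, not what is inside it.
    return address_city(cust["shipping_address"])

def order_shipping_city(order):
    # An order knows it has a customer, and nothing deeper.
    return customer_shipping_city(order["customer"])

order = {"customer": {"shipping_address": {"city": "New Orleans"}}}
city = order_shipping_city(order)
```

Compare that with writing `order["customer"]["shipping_address"]["city"]` in many places: a change to the address shape would then touch every one of those call sites.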

OK, I'm getting deep into this. I've talked about this in prior episodes, but all those suggestions are really about Internal data modeling. They don't apply so much to the External data modeling.

There, it's all about clarity, readability, writability, universality. You want some name that's never going to change, that can be used in any system. There are all sorts of other constraints on it.

When you think about it, the interface is the least important part, because you're not going to be composing this thing up as much as you want to have the thing that you can send. Maybe like a template, like this is what we're doing in this API. It's a totally different animal.

I'm going to recap now. Two kinds of data modeling. They share a lot of principles but, in the end, they have pretty divergent uses. One is External, for public-facing data. Either data you're expecting or data that you're providing is supposed to have a certain shape to it, a certain structure.

You need to meet that structure and specify it for another programmer so that they can understand what they're going to get and what they need to provide. There are a lot of ergonomic concerns in there. You want it to be self-describing, if that's possible. All that stuff.

The other kind is the Internal data modeling, where you've got some data that's used in an algorithm. You need that algorithm to run efficiently or you need that algorithm to not be a mess. [laughs]

Those kinds of design concerns where each feature takes longer to develop and so we need to...It's harder and harder to know what kind of data we have. We've never actually sat down and designed it.

You need something that helps you wrangle all of the unknowns of that data. That's what the Internal data modeling is. When you use that, it's not about the ergonomics so much as about making sure that you can decomplect and you can separate out the different pieces without feeling like everything is going to break.

If you liked this episode, you can find all the past episodes at lispcast.com/podcast. There you'll find audio, video, and text versions (text transcripts) of all of the old episodes. You'll also find links to subscribe in any of those three formats, and links to social media so you can find me, get in touch with me, and discuss this.

I've been having some really nice discussions about this very topic. That's what prompted me to have this episode. I've been thinking about, why is it that so many people disagree with me on this?

I don't disagree with what they're saying, but it has nothing to do with what I was saying. What I'm saying is, OK, there are two different things that we're talking about. We're both calling them data modeling. I just wanted to clarify those.

Obviously, this is fruitful for me and for the other people involved. Send me your questions. Send me your comments. Disagree with me. This is great stuff and I'm really happy with all my listeners who are out there. Thank you so much for listening.

This has been my thought on functional programming. My name is Eric Normand. Thank you for listening and rock on.