Focus on the data first

What should we design first to make sure our software will last without having to constantly rework our code? We should focus on the data first because it is the most timeless.

Transcript

Eric Normand: How do we ensure that when we write software it will last as long as possible, that we won't have to keep updating it, rewriting it and revisiting it all the time? My name is Eric Normand and these are my thoughts on functional programming.

This is a very common problem. We write some software, and either we just live with the problems in it, because we don't have the time or a chance to go back over it and fix it, or we're constantly fixing it because there's just something messy about it. It's not right. We can't find a way to clean it up.

In my theory of functional programming, which divides things up between data, calculations, and actions, it's very clear where you should start. As you move from the top, from actions through calculations down to data, data is actually the most timeless thing.

https://twitter.com/ericnormand/status/1051850867077464064?ref_src=twsrc%5Etfw

Actions, by definition, depend on time. They depend on when and how many times they are run. Calculations don't depend on when or how many times they're run. They're timeless, but they're opaque and they're in code.
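To make the three categories concrete, here is a minimal Python sketch. The event shape and the names `click_event`, `hour_of_day`, and `record_click` are made up for illustration; they are not from the episode.

```python
from datetime import datetime, timezone

# Data: an inert record. It means the same thing whenever it is read.
click_event = {
    "event": "button-click",
    "button": "signup",
    "at": "2018-10-15T14:03:00+00:00",
}

# Calculation: the same input always gives the same output,
# no matter when or how many times it is run.
def hour_of_day(event):
    return datetime.fromisoformat(event["at"]).hour

# Action: the result depends on when it is run, because it reads the clock.
def record_click(button):
    now = datetime.now(timezone.utc).isoformat()
    return {"event": "button-click", "button": button, "at": now}
```

Calling `hour_of_day(click_event)` gives the same answer today and in ten years, as long as something can still run the code. Calling `record_click("signup")` gives a different record every time.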

https://twitter.com/ericnormand/status/1024308772146229249?ref_src=twsrc%5Etfw

You may want to switch languages in the future. There is always the possibility that the code will break somehow, that you won't be able to run it anymore. The data is much more timeless. The only risk is that the data format becomes impossible to read, or that it becomes unclear what the data is for.

That's why I say we should start with the data. It is the thing that is inherently the most timeless. If you start with the data and start designing the data, you can increase the chances that it will survive for the long-term.

https://twitter.com/ericnormand/status/1045025606260453381?ref_src=twsrc%5Etfw

Meaning, you design it, you think about the names of the parts of it. You think about how to make it clear, when you're outputting this data, what things mean and what they are used for. That is something that I've learned over the years: the data is the thing that is the most valuable. Say you know you're going to record some event that is tied to a moment in time.

Some event happens like the user clicks a button, you're reading a sensor, or you're getting an image from a camera. Once you capture that, you can't go back in time and capture it again. You want to capture it in a high fidelity format for your purposes.

It's very important to make sure that that data is future-proof, so that as things change, you can still use it. The captured data itself is never going to change. You can make a new piece of data from it that might be more future-proof. That original piece of data is inert. It's done. It's dead on the disk.
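One way to picture this is a raw record that is captured once and never edited, plus a calculation that builds a new record from it. This is only a sketch: the sensor name, the field names, and `to_celsius_record` are assumptions invented for the example.

```python
import json

# Hypothetical raw sensor reading, written once at capture time.
raw_line = '{"sensor": "thermo-1", "reading_f": 71.6, "at": "2018-10-15T14:03:00+00:00"}'

def to_celsius_record(raw):
    """Build a new record from the raw one. The raw record is left untouched."""
    return {
        "sensor": raw["sensor"],
        "reading_c": round((raw["reading_f"] - 32) * 5 / 9, 1),
        "at": raw["at"],
        "derived_from": "reading_f",  # say where the new value came from
    }

raw = json.loads(raw_line)
derived = to_celsius_record(raw)
```

The raw line stays on disk exactly as it was captured. If the derived format turns out to be wrong later, you can throw it away and derive a better one from the original.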

The way you keep it alive is by writing code that knows how to interpret it. That data is going to last way longer than any code you write. It could be in a database. They're just storing all these records for everybody for all time, basically as long as your company is around, and you hope that that lasts a while.

Data is where it's at. That's where we need to start. How do we design the data? What is important about the data? What do we want to capture? The first thing is that you want it to be at least somewhat human readable. That doesn't mean that a human is going to read it and interpret it themselves. It means that someone in the future can read it and write a program to interpret it.

It needs to be clear what the pieces are. You would like the names to be timeless. One problem that we have a lot in software is that when we write a name down for something, its meaning is unclear. We write the name down in code. Then later on we'll change the code to do something else with that same name.

https://twitter.com/ericnormand/status/1056924105402998784?ref_src=twsrc%5Etfw

It might not seem like a problem at the time, but often it is. Say you have something like, "First name is a string." You just treat it like a string. Later you say, "We weren't actually strict enough. First name needs to be a non-empty string with no spaces at the beginning and end of the string."

Now, you're more restrictive. You may have first names that are empty strings from before you started restricting that. You've actually changed the meaning of first name in your code. You can't read those old records. They're not readable anymore.
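Here is a small Python sketch of that failure. The key `first_name`, the sample records, and the function names are all hypothetical; the point is only that tightening the rule under the same name makes old data unreadable.

```python
import json

# Records written back when "first_name" meant "any string".
old_records = [
    '{"first_name": "Eric"}',
    '{"first_name": ""}',
    '{"first_name": " Mary "}',
]

def read_v1(line):
    """Original reader: first_name is any string."""
    record = json.loads(line)
    if not isinstance(record["first_name"], str):
        raise ValueError("first_name must be a string")
    return record

def read_stricter(line):
    """Later reader: same key, but now non-empty with no surrounding spaces."""
    record = read_v1(line)
    name = record["first_name"]
    if name == "" or name != name.strip():
        raise ValueError(f"invalid first_name: {name!r}")
    return record

def count_readable(reader, lines):
    """Count how many lines the given reader accepts."""
    readable = 0
    for line in lines:
        try:
            reader(line)
            readable += 1
        except ValueError:
            pass
    return readable
```

With these sample records, the original reader accepts all three and the stricter reader accepts only one. Nothing on disk changed. Only the meaning of the name did.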

What's worse is if this is part of an API, clients of this API might not change. They might not change in lockstep with your changes. They have data that they've been processing through this API. Now, it's not going to go through.

What we want is to say, "First name is whatever it is we first set it to." If we just say it's any string at the beginning, we can never change that. This is for data that's going to be exposed on the outside or an interface that's going to be exposed outside. That encompasses quite a lot more than you might expect.

Say you write JSON to disk and make a backup, and then your computer crashes. You need to restore from backup. You're running your importer. The importer is different from when you wrote the stuff to the backup. It now restricts the first name. You've broken your system.

You have not made a future-compatible change to the meaning of first name. What we need is names that are actually longer than first name. We need first name, then first name v2 and first name v3, or better yet, some way of enforcing uniqueness across systems.

You could say, "Hey, this is my first name." You have your first name. You have a namespace or something like it, like Eric/first name. You have Mary/first name. We know that they might not be compatible with each other, but we can always identify which one we mean later.
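As a rough sketch of that idea in Python: each name carries a namespace and, if the rule ever needs to change, a new version gets a new name. The keys `eric/first-name`, `eric/first-name-v2`, and `mary/first-name`, and the rules attached to them, are made up for illustration.

```python
# Each key's definition is fixed once and never edited.
# A stricter rule gets a new key instead of changing an old one.
DEFINITIONS = {
    "eric/first-name": lambda v: isinstance(v, str),
    "eric/first-name-v2": lambda v: isinstance(v, str) and v != "" and v == v.strip(),
    "mary/first-name": lambda v: isinstance(v, str) and len(v) <= 50,
}

def violations(record):
    """Return the keys whose values break that key's fixed definition."""
    return [
        key
        for key, value in record.items()
        if key in DEFINITIONS and not DEFINITIONS[key](value)
    ]

old_record = {"eric/first-name": ""}        # written under the first definition
new_record = {"eric/first-name-v2": "Eric"}  # written under the stricter one
```

Both `violations(old_record)` and `violations(new_record)` come back empty, because each record is checked against the definition it was written under. The old data stays readable, and the stricter rule still applies to everything written under the new name.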

None of them can change. That's the important thing. Some committees will come up with data formats. I used to work in a geospatial organization. They had a lot of data formats. They wanted to exchange geographic information, like where a city is located, what its latitude and longitude are. They had very well-defined formats.

We also have stuff like URLs, which have a standard for what's allowed in a URL and how to parse it. Same with email addresses. We have these standards sometimes. Sometimes we don't have standards. I don't know of a standard for a person's name that is universal and widely used.

If we had one (maybe there is one, I don't know all of them), we should use it. If the people behind a standard are doing their job, they're not making these incompatible changes. They make it so that everyone can rest assured that if you write to this format, you'll be able to read it later. Other people will be able to read your data. You can read other people's data. That's what we want to ensure.

Design your data first and make it timeless, in the sense that you could read this database in 20 years, in a hundred years. Make sure that it's clear what it means and that it's human understandable. Make sure the names make sense and that you're not changing the meaning of those names over time. That is how we need to approach it.

https://twitter.com/ericnormand/status/1076145473566769153

I hinted at this in another episode, the one on variants. What tends to happen, even if you do start by designing your data first, is that we get these super complex, deeply nested data structures. Clojure programmers are notorious for this. They will just throw data into maps and then deeply nest those maps.

Then they complain about forgetting what keys they have, what data things are expecting, and how to get things out, what path things are at, deep in their nested data structures.

In the next episode, I want to talk about that problem and its solution. I love getting questions. I'm @ericnormand on Twitter. Please send them over. Please subscribe and tell your friends about this podcast, if you like it of course. All right, see you later.