Programmer as Navigator
We read and discuss the 1973 ACM Turing Award Lecture by Charles W. Bachman.
Bertrand Russell, the noted English mathematician and philosopher, once stated that the theory of relativity demanded a change in our imaginative picture of the world. Comparable changes are required in our imaginative picture of the information system world.
Hello, good morning. My name is Eric Normand, and this is my podcast. Today I am excerpting and commenting on the 1973 ACM Turing Award lecture by Charles William Bachman.
The lecture is titled, "The Programmer as Navigator." I was not familiar with Charles Bachman. As I often do, I read the bio and comment on it. There is a very interesting person here.
All right, so just some statistics or basic facts. He is from the United States, born in 1924 in Manhattan, Kansas. He earned this award in 1973 when he was 49 years old. He is interesting because he was not an academic. He did have a master's degree in mechanical engineering, but he worked mainly in industry.
I'll just read it. He got the award for his outstanding contributions to database technology. Basically, if you want to put it in a really short sentence, he invented the database that we know of today. He created a thing called Integrated Data Store that established the concept of the database management system.
I'll excerpt from this bio, "During a long and varied career, he ran a chemical plant, created cost of capital accounting systems, headed an early data processing group, pioneered the application of computers to manufacturing control, led efforts to standardize database and computer communication concepts, won the highest honor in computer science, and founded a publicly traded company."
Definitely, someone out in industry, commerce doing very interesting stuff. I'll read some more.
"During the 1950s, hundreds of American businesses had rushed to order computers. There was a lot of hype about potential benefits. But getting the machines to do anything useful was much harder than expected. They often ended up being used only to automate narrow clerical tasks like payroll or billing.
"By 1960, management experts realized that to justify the huge personnel and hardware cost of computerization, companies would need to use computers to tie together business processes such as sales, accounting, and inventory, so that managers would have access to integrated, up-to-date information."
You have all these departments to a company. So far, computers have only proven useful for running payroll, doing billing, printing invoices, and stuff. But you need information from all the departments. You need some way of gathering that all together so that you can see what's going on in the whole company. Let's continue.
"Each business process ran separately with its own data file stored on magnetic tape. A small change to one program might mean rewriting related programs across the company. But business needs change constantly so integration never got very far."
They had these proprietary data formats. Every time you wanted to change a format, you had to change a program. Then, you had to go and change it all over the company. It just wasn't working.
"The crucial invention, operational by 1963, was Bachman's Integrated Data Store or IDS. IDS maintained a single set of shared files on disk together with the tools to structure and maintain them. Programs responsible for particular tasks, such as billing or inventory updates, retrieved and updated these files by sending requests to IDS.
"IDS provided application programmers with a set of powerful commands to manipulate data, an early expression of what would soon be called a Data Manipulation Language."
He basically created the database, but he was also the one who said, "Now we're going to access this database through this standard API. It's a service-oriented database. It's not going to live in your system. It's going to live in one place, and all the other systems, the billing, the accounting, payroll, HR, anyone who has to access the data is going to go through this API."
"This made programmers much more productive. A crucial step towards the integration of different kinds of data, which in turn was vital to the integration of business processes and the establishment of the computer as a managerial tool." The programmers became more productive through the decoupling.
The programmers no longer have to worry so much about managing files and writing their own custom search routines for those files. Then the data gets integrated. You can make queries across different departments.
"By the end of the 1960s, the database management system, as programs such as IDS were being called, was one of the most important areas of business computing research and development." Remember, this lecture is in 1973, and by the end of the '60s the DBMS, the database management system, was known to be very important.
"In 1973, Bachman became the eighth person to win the ACM Turing award. At that time computer science was a young discipline. Its leaders were struggling to establish it as a respectable academic field with its own areas of theory, rather than as just a technical tool needed to support the work of real scientists such as physicists."
All right, so I'm going to stop there. It's interesting that this is stated here because I don't think I've read anything in the previous Turing award bios or the lectures about people, still in the '70s, trying to establish computer science as a real science, a real field of study, and to show that programmers weren't just people who supported some other field.
Bachman was the first Turing award winner without a PhD, the first to be trained in engineering rather than science, the first to win for the application of computers to business administration, the first to win for a specific piece of software and the first who spent his whole career in industry.
That's interesting that they would pick someone like that, but I don't know if we'll see this happen again. That's what I'm wondering. Will we see someone else from industry have this kind of effect?
He stood in opposition to the ideas of Edgar F. Codd, known as Ted, a mathematically inclined IBM research scientist whose relational model for database manipulation had attracted a growing band of supporters and was beginning to legitimize database systems as a theoretically respectable research field within computer science.
A debate between the two and their supporters held at an ACM workshop in 1974 is remembered among database researchers as a milestone in the development of their field. When we think of big database systems, we often think of relational databases.
It's interesting that Bachman was opposed to Codd. There seems to have been some kind of antagonism between them, although they both participated in this debate. Usually that means that the antagonism is purely intellectual. They both have similar goals.
I think that that's really fascinating that there was this debate that academics wanted to take over the database [laughs] with more theory. It's pretty interesting. Modern relational systems continue to follow the basic template for the database management system invented by Bachman and his colleagues in CODASYL.
A complex piece of software managing data storage, enforcing access restrictions, providing interfaces for both application programs and ad hoc queries, and providing different views on the same data to different users.
Yes, the whole idea of SQL database...It's a relational approach, but as we'll see in the lecture, it's mostly just a different kind of API on the same basic principles that were already in the system. By the way, I think the relational model is actually pretty cool. It wouldn't work without the work of people like Bachman who made them practical and fast. They did the hard engineering tasks.
Again, The Programmer as Navigator. I have to say this. I always forget to preface this. Here I am, reading the title. I need to preface. I don't know much about databases, but I have used them and they are something I studied in school. I've also used them in jobs. They're there in the background.
I easily take them for granted. This lecture was actually really interesting in that he had to introduce the idea to the audience and justify it. To be honest, from my perspective, almost 50 years after this lecture, we don't need to justify having a database and having standard ways of accessing the data and stuff like that.
This is, to me, most interesting as a historical perspective. The idea that this introduction of the database was a turning point in the industry and so there's going to be a lot of references to that in the lecture. This perspective shift and also...I'll bring up the history a bit.
Here we go. The Programmer as Navigator by Charles W. Bachman. This is published in November 1973. "This year, the whole world celebrates the 500th birthday of Nicolaus Copernicus, the famous Polish astronomer and mathematician.
"In 1543, Copernicus published his book concerning the revolutions of celestial spheres which described the new theory about the relative physical movements of the Earth, the planets and the Sun. It was in direct contradiction with the Earth-centered theories which had been established by Ptolemy 1,400 years earlier.
"Copernicus proposed the heliocentric theory that planets revolved in a circular orbit around the Sun. I raise the example of Copernicus today to illustrate a parallel that I believe exists in the computing, or more properly, the information systems world. We have spent the last 50 years with almost Ptolemaic information systems.
"These systems, and most of the thinking about systems, were based on a computer-centered concept."
He's hinting here pretty directly and perhaps a little bit too arrogantly. Bringing up Copernicus in terms of...It's a big stretch.
He's saying that we need a similar shift from a computer-centered view to something else, and that's what we're going to go into here. This would be almost like a revolution, like the Copernican revolution where we started thinking in terms of the Sun being at the center of the solar system.
"Just as the ancients viewed the Earth with the sun revolving around it, so have the ancients of our information systems viewed a tab machine or computer with a sequential file flowing through it. Each was an adequate model for its time and place but after a while, each has been found to be incorrect and inadequate and has had to be replaced by another model that more accurately portrayed the real world and its behavior."
Again, he's going through this idea of a tab machine or a computer with a sequential file flowing through it. He's going to go over that again. It's going to come up, but the important thing is that the computer was at the center and data was sequentially entering, usually through a tape or something.
It's like coming in and you just have to read it, to read what the tape was telling you and decide "OK, I don't need this," but you couldn't really jump around very easily on the tape. It was linear access instead of random access and so the computer was at the center. Do I get more tape? Do I keep reading?
"A new basis for understanding is available in the area of information systems. It is achieved by a shift from a computer-centered to the database-centered point of view. This new understanding will lead to new solutions to our database problems and speed our conquest of the n-dimensional data structures which best model the complexities of the real world."
This is something that probably today is not too controversial, that we have n-dimensional data meaning you don't just look someone up by their primary key, like their employee ID number. You also have all these other dimensions that you might want to look them up by.
Then you can sometimes have relationships between different types of data, different tables. You can traverse those relationships. You're creating this space, this n-dimensional space. He's saying that this models the world better than a linear sequence of data entering through a tape. He's going to expound on that a little bit.
"The earliest databases initially implemented on punch cards with sequential-file technology were not significantly altered when they were moved, first from punch cards to magnetic tape, then again to a magnetic disk. About the only things that changed were the size of the files and the speed of processing them."
He's talking about early databases. They moved from punch cards to magnetic tape. Both of these are linear. You have a stack of cards. They're fed into the machine and read one at a time in order.
Magnetic tape is a big spool of tape. It's fed through a reader. You just feed it through and read it, and you can stop when you're done. Then we changed to magnetic disk, but because there was existing software, not much changed.
"In sequential file technology, search techniques are well-established. Start with the value of the primary data key of the record of interest and pass each record in the file through core memory until the desired record, or one with the higher key, is found." Let's imagine a simple scenario. You have a tape. It's just full of records. Each record is a certain number of bytes, let's say 500 bytes.
You can easily go through and if they're in order, they're sorted. They're sorted by, let's say, employee ID. You can just start reading. You find a record. You read its first field is the employee ID. You compare it to the one you're looking for. If that's the one you're looking for, you're done.
If it's less than the one you're looking for, you haven't reached it yet. Read the next record. Maybe that's it. Oh, that's not it. Read the next record, until you either find it or you find one that's greater than it, which means, "Oh, we don't have one because now we're past that ID because they're sorted. Now we don't have it." That's the main way you search. It's just linearly through the files.
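To make that search concrete, here's a minimal sketch in Python. The `TAPE` list and the `sequential_search` function are names I made up for illustration, and a real tape would hold fixed-size byte records rather than tuples. The point is only the early stop once you pass the key.

```python
# Hypothetical "tape": records sorted by employee ID, read strictly in order.
TAPE = [
    (1001, "Ada"),
    (1004, "Grace"),
    (1009, "Edsger"),
    (1015, "Barbara"),
]

def sequential_search(tape, wanted_id):
    """Return (record or None, number of records read from the tape)."""
    reads = 0
    for record in tape:
        reads += 1
        key = record[0]
        if key == wanted_id:
            return record, reads      # found it, done
        if key > wanted_id:
            return None, reads        # we're past it, so it isn't on the tape
    return None, reads                # ran off the end of the tape
```

Looking for 1005 stops at the third record, because 1009 is already greater. Without the sort you would have to read the whole tape to know the record wasn't there.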
"The availability of Direct Access Storage devices laid the foundation for the Copernican-like change in viewpoint. The directions of in and out were reversed." I'm going to stop here. Direct Access Storage, what he's talking about is a disk, a spinning disk, which if you don't know how it works, I'll describe it briefly.
The disk is spinning. The files are laid out linearly on rings on the disk. You have this disk, it's spinning really fast. If you want to read a record at a certain point in the file, let's say you know it's at a certain block, that's where your record is, you can move the read head along a radius of the disk, anywhere from the outer edge all the way to the center.
You can move your read head to the correct ring, and wait for the block to come under your read head because it's spinning. The disk is spinning. You wait a small amount of time. Then you can start reading the block, the exact block you want. Now, it's still linear. It's just everything happens faster. The read head only has to move a short distance along the radius of the disk.
The disk itself is spinning fast. Compare that to a tape, which basically can only be spun out or spun in, wound up or unwound, and you have to read linearly like that. With a disk you can more quickly pinpoint an exact place that you want to start reading. That's what they're talking about with direct access. You can directly jump right to a spot.
"The directions of in and out were reversed. Where the input notion of the sequential file world meant into the computer from tape, the new input notion became into the database. This revolution in thinking is changing the programmer from a stationary viewer of objects passing before him in core into a mobile navigator who is able to probe and traverse a database at will."
Apologies for the masculine pronoun there. It really should be before them or something, but those were the times. Let me make sure this is clear. He's showing a historical change. Right now we're on the other side of the change. I've never written a program that read tape.
I can only imagine what it must have been like to think of this computer as pulling...basically unspooling tape and reading it, looking for something linearly. Back then, "in" meant into the computer, to see if that's the record you wanted. [laughs]
This new concept, which I feel is still true today, is that when you want to get data into the system, it means into the database, because that's the center. There are all these other programs...you might think of them nowadays as microservices or requests, or what have you, that are going to be reading out of the database.
That's where the out comes from. That's where the data is. To get data into your system, it means recorded into the database. There's a change from into core memory to now into the database. When I read something like this, just to make it more interesting, I try to put myself in the shoes of someone working on a giant machine with tape readers.
Think about what kind of mindset I would have to be in to have that perspective. Then switch back, "Oh, yeah. Today, we're doing the database. It is true. We are thinking of the center of it as the database." So, interesting.
"Direct Access Storage devices also opened up new ways of record retrieval by primary data key. The programmer who has advanced from sequential file processing to either indexed sequential or randomized access processing has greatly reduced his access time because he can now probe for a record without sequentially passing all the intervening records in the file."
You can make an index. He talks a little bit about this. The index can point to the block on the disk where the data is stored. It's very fast. Once you've got the index, you can look up a particular primary key and read the record off. "However, he is still in a one-dimensional world as he is dealing with only one primary data key, which is his sole means of controlling access."
He's faster. He can access any particular record in any order quickly. He's still in this one-dimensional world. He's going to take a little detour for a second, so that we can understand stuff a little bit better.
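Here's a rough sketch of that index idea in Python, under some assumptions of mine: an in-memory `io.BytesIO` stands in for the disk, and the 20-byte record layout and the names `build_file` and `fetch` are hypothetical. It isn't how IDS did it. It just shows an index mapping a primary key to a position you can jump to directly.

```python
import io
import struct

# Hypothetical record layout: 4-byte employee ID followed by a 16-byte name.
RECORD = struct.Struct("<I16s")

def build_file(employees):
    """Write records back to back; return (disk, index of key -> byte offset)."""
    disk = io.BytesIO()
    index = {}
    for emp_id, name in employees:
        index[emp_id] = disk.tell()          # remember where this record starts
        disk.write(RECORD.pack(emp_id, name.encode("ascii")))
    return disk, index

def fetch(disk, index, emp_id):
    """Jump straight to one record without reading the ones before it."""
    offset = index.get(emp_id)
    if offset is None:
        return None                          # key is not in the index
    disk.seek(offset)                        # the "direct access" step
    raw_id, raw_name = RECORD.unpack(disk.read(RECORD.size))
    return raw_id, raw_name.rstrip(b"\x00").decode("ascii")
```

Notice there is still only one way in, the employee ID. That's the one-dimensional world he's describing.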
"I want to review what database management is. It involves all aspects of storing, retrieving, modifying, and deleting data in the files on personnel and production, airline reservations, or laboratory experiments. Data which is used repeatedly and updated as new information becomes available."
He's talking about a service. This is a service for managing all of those files where we store records. Storing, retrieving, modifying, deleting, the stuff we would think of today.
"Database management has two main functions. First, is the inquiry or retrieval activity that reaccesses previously stored data in order to determine the recorded status of some real world entity or relationship." We know this as the query. You want to query your database, get some data out of the database, so you want to read it.
"This retrieval activity is designed to produce the information necessary for decision-making," very clear. "The second activity of database management is to update, which includes the original storage of data, its repeated modification as things change, and ultimately, its deletion from the system when the data is no longer needed."
With updating, he's lumping under "update" all the other stuff that modifies the database: storing a new record, modifying it repeatedly as things change, then deleting it when you don't need it.
"The updating activity is a response to the changes in the real world, which must be recorded. The hiring of a new employee would cause a new record to be stored. Reducing available stock would cause an inventory record to be modified. All of these are recorded and updated in anticipation of future inquiries." It seems so obvious when he states it today.
This is a very simple statement of the responsibilities of the database. It's not easy to find such a simple statement. He's going back to try to talk about this n-dimensional thing now.
"The sorting of files has been a big user of computer time. It was used in sorting transactions prior to batch sequential update, and in the preparation of reports. The change to transaction-mode updating and on demand inquiry and report preparation is diminishing the importance of sorting at the file level."
This is interesting. This seems like a little digression, but it's still very interesting to think about where most computing time goes at any point in time as an indicator of what the priorities of computing systems are. What are they being applied to?
It's interesting that sorting was such a big user of time. We think of sorting today as...It's kind of been solved. But even if it was a solved problem back then, imagine having to sort from one tape to another all of the records for the employees of a large company.
Those records don't fit in memory. You have to sort using some other mechanism. Imagine how much time it would take, not even in the processing time, the CPU time, but even in the spooling and unspooling of that tape. It must have taken forever to get this going, and probably was worth it because it let you look for employee records.
For instance, if you don't have an employee with a particular ID, when you search for that ID, you can know when to stop. You don't have to read the whole tape. It's probably important to spend a lot of the time upfront sorting these records, so that you could access them more quickly later. Think about what we spend our time on these days. It's probably putting stuff on the screen.
"In addition to our record's primary key, it is frequently desirable to be able to retrieve records on the basis of the value of some other fields. For example, it may be desirable in planning 10-year awards to select all the employee records with the year of hire field value equal to 1964. Such access is retrievable by secondary data key."
You have your primary key, which is the employee ID. You want to look them up by something else. You want to look up all the people who were hired in 1964.
"With the advent of retrieval on secondary data keys, the previously one-dimensional data space received additional dimensions equal to the number of fields in the record.
"With small or medium-sized files, it is feasible for a database system to index each record in the file, on every field in the record. In large active files, it is prudent to select the fields whose content will be frequently used as a retrieval criterion and to create secondary indices for those fields only."
Obviously, if you're going to look something up by that key a lot, the secondary key, you're going to want to index it. Then you have this burden of maintaining the index as things are changing, and that can get pretty big.
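As a sketch of what that maintenance burden looks like, here's a small Python example. The `EmployeeFile` class and its method names are hypothetical, and everything lives in memory. The thing to notice is that every change to the indexed field has to touch the secondary index too.

```python
from collections import defaultdict

class EmployeeFile:
    """Records keyed by employee ID, plus one secondary index on year of hire."""

    def __init__(self):
        self.by_id = {}                     # primary key -> record
        self.by_year = defaultdict(set)     # secondary key -> set of primary keys

    def store(self, emp_id, name, year_hired):
        self.by_id[emp_id] = {"id": emp_id, "name": name, "year_hired": year_hired}
        self.by_year[year_hired].add(emp_id)        # index must be updated too

    def change_year(self, emp_id, new_year):
        record = self.by_id[emp_id]
        self.by_year[record["year_hired"]].discard(emp_id)   # remove old entry
        record["year_hired"] = new_year
        self.by_year[new_year].add(emp_id)                   # add new entry

    def hired_in(self, year):
        """Secondary-key retrieval: returns a set of IDs (sorted for display)."""
        return sorted(self.by_year.get(year, ()))
```

If you indexed every field this way, every store and every modification would pay that cost once per field, which is why he suggests only indexing the fields you actually query on in large active files.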
"The distinction between a file and a database is not clearly established. However, one difference is pertinent to our discussion at this time. In a database, it is common to have several or many different kinds of records.
"For an example, in a personnel database, there might be employee records, department records, skill records, deduction records, work history records, and education records. Each type of record has its own unique primary data key and all of its other fields are potential secondary data keys.
"In such a database the primary and secondary keys take on an interesting relationship when the primary key of one type of record is the secondary key of another type of record.
"This equality of primary and secondary data key fields reflects real-world relationships and provides a way to re-establish these relationships for computer processing purposes." He's talking about foreign keys, what we call foreign keys today. It's interesting that he's talking about them as relationships.
You have relationships between these different entities in the real world. You need to model that somehow in your database. You do that by having this equality of the secondary key of one entity is the primary key of another.
That gives you a graph where the edges are the relationships and the nodes are the entities. That's actually pretty cool that they had that notion like early on. Smart people.
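Here's a tiny Python sketch of that graph idea. The department and employee data and the function names are made up for illustration. The `dept_id` field on an employee is a secondary key there, and it's the primary key of the department records, so you can walk the relationship in either direction.

```python
# Hypothetical records. DEPARTMENTS is keyed by its primary key (department ID).
DEPARTMENTS = {10: {"name": "Accounting"}, 20: {"name": "Manufacturing"}}

# Each employee carries dept_id, which equals a primary key in DEPARTMENTS.
EMPLOYEES = {
    1001: {"name": "Ada", "dept_id": 10},
    1004: {"name": "Grace", "dept_id": 20},
    1009: {"name": "Edsger", "dept_id": 10},
}

def department_of(emp_id):
    """Follow the edge from an employee to its department."""
    return DEPARTMENTS[EMPLOYEES[emp_id]["dept_id"]]["name"]

def employees_in(dept_id):
    """Follow the same edge backwards: all employees whose dept_id matches."""
    return sorted(e for e, rec in EMPLOYEES.items() if rec["dept_id"] == dept_id)

def colleagues_of(emp_id):
    """Navigate employee -> department -> other employees."""
    dept_id = EMPLOYEES[emp_id]["dept_id"]
    return [e for e in employees_in(dept_id) if e != emp_id]
```

That last function is the navigating part: you start at one record and hop along relationships to reach the others you care about.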
"There are many benefits gained in the conversion from several files each with a single type of record to a database with several types of records and database sets. One such benefit results from the significant improvement in performance, when all redundant data can be eliminated, reducing the storage space required."
I'm only reading partial sentences here, but you are able to have "these primary and secondary indices gain access to all the records with a particular data key value." You can then eliminate redundant data because all the files are stored in the same place. You can put relationships between the records, so you don't have to keep the same record on different machines. It reduces the storage space required.
"Another significant functional and performance advantage is to be able to specify the order of retrieval of the records within a set based upon a declared sort field or the time of insertion." He's getting into the nitty-gritty details here because he's trying to justify why you would want to centralize this into one service.
He's talking about performance. There's performance in terms of speed, and there's performance in terms of the amount of storage required. Then there's also what was talked about in the biography, really decoupling the software from the storage mechanism. The software that ran the report now does not have to manage the files.
You can easily change the record by adding a field to the record, and you wouldn't have to change how the file was read in, and some other software that needed to read it. I think that's a more significant thing, especially as computers have gotten faster and disks have gotten cheaper.
"In order to focus the role of programmer as navigator, let us enumerate his opportunities for record access." He goes over seven different kinds of...they're boring. They are the obvious ones. I'll read one or two.
"He can start at the beginning of the database and sequentially access the next record in the database, until he reaches a record of interest, or reaches the end." Yes, he can do sequentially. He can do random access by primary key, random access by secondary key. He just lists a bunch of these.
"It is the synergistic usage of the entire collection, which gives the programmer great and expanded powers to come and go within a large database, while accessing only those records of interest, in responding to inquiries and updating the database in anticipation of future inquiries."
You basically think you need a bunch of these different ways of accessing. When you can start mixing, matching and using them, composing them in the ways you need, that is when the magic happens.
I think I'm skipping quite a lot here in the section, because these things are kind of obvious to us today. He is doing a lot of justification and he has a scenario. Anyone who has used the database, this would just be like, "Yeah, I did that yesterday." [laughs]
"There are additional risks and adventures ahead for the programmer who has mastered operation in the n-dimensional data space. As navigator, he must brave dimly perceived shoals and reefs in his sea, which are created because he has to navigate in a shared database environment. There is no other obvious way for him to achieve the required performance.
"Shared access is a new and complex variation of multiprogramming or time sharing, which were invented to permit shared but independent use of the computer resources. In multiprogramming, the programmer of one job doesn't know or care that his job might be sharing the computer, as long as he is sure that his address space is independent of that of any other programs.
"Shared access is a specialized version of multiprogramming, where the critical shared resources are the records of the database. The database records are fundamentally different than either main storage or the processor, because their data fields change value through update and do not return to their original condition afterward.
"Therefore, a job that repeatedly uses a database record may find that record's content or set membership has changed since the last time it was accessed. As a result, an algorithm attempting a complex calculation may get a somewhat unstable picture."
That was quite a lot to read. I really appreciate the poetic language: navigating, braving dimly perceived shoals and reefs in the sea. That's cool.
He's talking about when you have multiple pieces of software reading and writing to the same database at the same time. You get a bunch more challenges. We're aware of those. These are concurrency challenges, especially if one is reading and another is writing, and maybe there's multiple reading and writing going on at the same time. You can get an unstable picture.
He talks about something called set membership. I've skipped this part. He was talking about how when you have a secondary key. It returns a set, not just the single record, which primary key would always return a single record because it's a primary key. It's unique, like employee ID.
Once you start asking for employees who were hired in 1964, you're going to get a set back. That set could be indexed itself. That set membership, that set of people hired in 1964, can change.
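Here's a small, deterministic Python sketch of that unstable picture. The data and the `run_interleaved` function are my own invention, and I'm faking the two jobs by just interleaving their steps by hand rather than using real threads, so the result is the same every time you run it.

```python
def run_interleaved():
    """Job A reads the 1964 set twice; job B updates a record in between."""
    hired = {1001: 1964, 1004: 1970, 1009: 1964}   # employee ID -> year of hire

    def hired_in(year):
        return sorted(e for e, y in hired.items() if y == year)

    first_read = hired_in(1964)     # job A: who gets a 10-year award?
    hired[1009] = 1965              # job B: corrects a year of hire
    second_read = hired_in(1964)    # job A: comes back to the same set

    return first_read, second_read
```

Job A sees employee 1009 in the set the first time and not the second time. Nothing is broken in the database itself, but a calculation that spans both reads is working from two different pictures of the world.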
He is talking about shared access and the challenges of it. I think this section is actually really cool. It shows how deeply he was thinking about this problem and the influence it still has today.
"One's first reaction is that this shared access is nonsense and should be forgotten. However, the pressures to use shared access are tremendous. The processors available today and in the foreseeable future are expected to be much faster than are the available direct access storage devices.
"Furthermore, even if the speed of storage devices were to catch up with that of the processors, two more problems will maintain the pressure for successful shared access. The first is the trend toward the integration of many single purpose files into a few integrated databases.
"The second is the trend towards interactive processing, where the processor can only advance a job as fast as the manually created input messages allow. Without shared access, the entire database would be locked up until a batch program or transaction and its human interaction had terminated."
There's a lot here, so let's stop there and talk about it. He is saying that the processors are today faster and will be faster than storage. That's still true today. That creates a pressure that you are going to have multiple programs running, accessing this data.
I don't know why that would be, but that's what he is saying. It seems that the bottleneck is the speed of your disk. One way to mitigate the problem of that bottleneck is to just multiply the disk. You have two disks. You would then split up your database into two separate databases instead of having multiple access on the same one.
I'm not quite sure where he is going with that. The second one is maybe more important. Two more problems would maintain the pressure. The first is the trend towards the integration of many single purposed files into a few integrated databases.
Instead of having files with single record types in them, for single purpose software, we're seeing this trend. People are putting all of their records into one or a few databases. The different departments at your company are going to want to access that database at the same time.
The second is interactive processing. I think this trend maybe doesn't apply anymore. Basically, you would get a screen and you would type in, "I want to look up this record and do this," and then you'd read it from the database and it would show you the result and you'd say, "OK, now modify the first name and the last name."
While you are doing that, some other software could be reading and writing to the database. If you wanted to say, "Lock the database, it's mine for now," you'd have to go as slow as a person who's typing. That's unacceptable. You can't wait for people, they are too slow.
"When there is a queue of requests for access to the same device, the transfer capacity for that device can actually be increased through seek and latency reduction techniques. This potential for enhancing throughput is the ultimate pressure for shared access."
This is interesting. He is talking about throughput, bandwidth, as the thing to maximize here. If you have a spinning disk and you want to access a block that's somewhere else on the disk, you have to physically move the read head and then wait.
That's actually called the seek. You are doing a seek on the disk, to go seek out the ring where that block is written. Then wait for it to come under the read head through spinning. That seek time is long compared to the amount of data that could flow through the wires that you could read off the disk.
If you can queue up a bunch of read accesses, you can actually reorder the queue so that the number of seeks, or the distance you seek, is minimized. That lets you get faster throughput because you are reading more of the time and not seeking so much.
Nowadays we do that, it just seems like an optimization that we've solved. We don't think about it so much anymore. He was saying that's actually an advantage of centralizing your database, because you get to queue up more of these file read operations. It's actually something that the operating system does too, so it's weird.
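As a rough illustration of that reordering, here is a minimal Python sketch. It is not from the lecture and not how any particular disk or operating system does it. The track numbers, the starting head position, and the function names are all made up for the example, and it uses a simple one-sweep "elevator" ordering as a stand-in for whatever seek reduction technique Bachman had in mind.

```python
def total_seek_distance(start, requests):
    """Sum of head movement, in tracks, when serving requests in the given order."""
    distance = 0
    position = start
    for track in requests:
        distance += abs(track - position)
        position = track
    return distance


def elevator_order(start, requests):
    """Serve everything at or above the head while moving up, then the rest moving down."""
    upward = sorted(t for t in requests if t >= start)
    downward = sorted((t for t in requests if t < start), reverse=True)
    return upward + downward


# Hypothetical queue of requested tracks; the head starts at track 50.
queue = [95, 10, 60, 5, 80, 20]

fifo_distance = total_seek_distance(50, queue)
elevator_distance = total_seek_distance(50, elevator_order(50, queue))

print(fifo_distance)      # 370 tracks of head travel in arrival order
print(elevator_distance)  # 135 tracks after reordering
```

The longer the queue, the more room there is to reorder, which is the point he is making about many jobs sharing one device.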
"Of the two main functions of database management, inquiry and update, only update creates a potential problem in shared access." That's interesting. "An unlimited number of jobs can extract data from a database without trouble. However, once a single job begins to update the database, a potential for trouble exists. The time will come when two jobs will want to process the same record simultaneously."
Here is the classic problem with mutable state. If you only have reads, there's never a problem. The only problem is having to wait, because you're all contending for that disk and the seeks, and everything.
Once you have a single process modifying the records, now you've got this big problem to manage. He's going to break it down. This is pretty cool.
"The two basic causes of trouble in shared access are interference and contamination. Interference is defined as the negative effect of the updating activity of one job upon the results of another. When a job has been interfered with, it must be aborted and restarted to give it another opportunity to develop the correct output." I'm going to stop here. We're talking about interference.
You have process A reading from the database. Process B is writing to the database. B is interfering with A because its updating activity is having a negative effect on A. Negative effect meaning A's records are being written to as it's reading them, so something is out of sync.
What he's saying is, you just restart A. Stop A, abort it and restart it after B is finished. That way it can get a clean look at the database without modification.
"Contamination is defined as the negative effect upon a job which results from a combination of two events: when another job is aborted and when its output has already been read by the first job. The aborted job and its output will be removed from the system. Moreover, the jobs contaminated by the output of the aborted job must also be aborted or restarted."
Contamination is this other way. Remember, he's talking about two basic causes of trouble. You have job A. Job A is reading something. Meanwhile, job B is writing to the database, and job A reads it.
Job B isn't done and somehow it gets aborted because, let's say, it was receiving interference from C. It got aborted and restarted. A has already read B's writing to the database, which should be rolled back and undone. Now, A has the wrong data. It needs to be restarted. Those are the two types of trouble: interference and contamination.
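The contamination scenario with jobs A and B can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Database` class, its method names, and the "balance" record are hypothetical, and the sketch deliberately lets jobs read unfinished writes so that the problem shows up. It is not a description of how IDS worked.

```python
class Database:
    """Toy store that lets jobs read each other's unfinished writes."""

    def __init__(self, records):
        self.committed = dict(records)  # values written by finished jobs
        self.uncommitted = {}           # key -> (value, writer job)
        self.read_from = {}             # reader job -> writer jobs it depends on

    def write(self, job, key, value):
        self.uncommitted[key] = (value, job)

    def read(self, job, key):
        if key in self.uncommitted:
            value, writer = self.uncommitted[key]
            if writer != job:
                # The reader now depends on output that may still be removed.
                self.read_from.setdefault(job, set()).add(writer)
            return value
        return self.committed[key]

    def abort(self, job):
        """Remove the job's output and return every job that must restart."""
        to_restart = set()
        pending = [job]
        while pending:
            current = pending.pop()
            if current in to_restart:
                continue
            to_restart.add(current)
            # Roll back whatever this job wrote.
            for key in [k for k, (_, w) in self.uncommitted.items() if w == current]:
                del self.uncommitted[key]
            # Any job that read this job's output is contaminated.
            for reader, writers in self.read_from.items():
                if current in writers:
                    pending.append(reader)
        for restarted_job in to_restart:
            self.read_from.pop(restarted_job, None)
        return to_restart


db = Database({"balance": 100})
db.write("B", "balance", 150)        # B updates but has not finished
seen_by_a = db.read("A", "balance")  # A reads B's unfinished output
restarted = db.abort("B")            # B is aborted, so A must go too

print(seen_by_a)          # 150, a value that never officially existed
print(sorted(restarted))  # ['A', 'B']
```

After the abort, a fresh read of "balance" gives the original 100 again, which is the clean view A gets on its restart.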
We, of course, know that, eventually, I don't know if they have this, but transactionality becomes a real subject of study, like what does it mean to be transactional? There's different kinds of transactions and stuff that give you different guarantees like this one will guarantee that you won't get contamination, this one will guarantee you won't get interference and what...
There's different strategies. You could lock the table while someone is reading from it or you could redo the restart, etc. There's a whole field of study of how to mitigate these problems. It's cool that he had the problems laid out, he knew what they were.
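Here is a minimal sketch of one of those strategies: holding every record a job has updated until that job finishes, so nobody else can read unfinished output. The `RecordBlocker` class, its method names, and the record id are assumptions made up for the example. A real database engine does far more than this, including queueing the waiting jobs and detecting deadlock.

```python
class RecordBlocker:
    """Toy lock table: a record updated by a job stays blocked until the job finishes."""

    def __init__(self):
        self.blocked_by = {}  # record id -> job currently holding it

    def update(self, job, record_id):
        """Return True if the job may update the record, False if it must wait."""
        holder = self.blocked_by.get(record_id)
        if holder is not None and holder != job:
            return False
        self.blocked_by[record_id] = job
        return True

    def can_read(self, job, record_id):
        """Other jobs may not read a record while its updater is still running."""
        holder = self.blocked_by.get(record_id)
        return holder is None or holder == job

    def finish(self, job):
        """Release every record the job was holding and return their ids."""
        released = [r for r, j in self.blocked_by.items() if j == job]
        for record_id in released:
            del self.blocked_by[record_id]
        return released


blocker = RecordBlocker()
blocker.update("B", "order-17")                        # B updates a record
a_can_read_during = blocker.can_read("A", "order-17")  # A has to wait
blocker.finish("B")                                    # B terminates normally
a_can_read_after = blocker.can_read("A", "order-17")   # now A gets a clean read

print(a_can_read_during, a_can_read_after)  # False True
```

Because A never sees B's output until B is done, there is nothing to contaminate A if B gets aborted. The cost is that A waits, and two jobs waiting on each other can deadlock.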
This is one of the most interesting things about databases. You want to have this kind of shared access. "The Weyerhaeuser Company's shared access version of IDS was designed on the premise that the programmer should not be aware of shared access problems." Now, that is cool.
This is the big innovation, the big invention really is, we want to hide this problem from the programmers, from all the different departments writing their own software against this database. They're not going to have to know if their query got restarted, or their job was contaminated and had to be aborted and restarted. They're not going to even know.
That creates a really nice level of abstraction. In fact, today, the database is still the primary maintainer of data consistency, because that illusion is maintained, that there's a real cognitive abstraction barrier where you don't have to worry about how the transaction works under the hood. It allows the database engine to use different techniques at different times.
You might upgrade Postgres and it has some totally new table-locking mechanism. You don't care, because you have never had to deal with that as a programmer.
When I say that, I don't mean you never look under the hood. Sometimes you have to go into database optimization mode, and you do look under the hood a little bit, but your software doesn't have to change. The way you program shouldn't have to change.
"That system automatically blocks each record updated and every message sent by a job, until that job terminates normally, thus eliminating the contamination problem entirely." It blocks each record updated. Nothing can read those records until the job is done. That's one mechanism. Like I said, you can get a textbook on all the different ways of maintaining that consistency in a database.
"About 10 percent of all jobs started in Weyerhaeuser's transaction-oriented system had to be aborted for deadlock. Approximately 100 jobs per hour were aborted and restarted. Is this terrible? Is this too inefficient? These questions are hard to answer, because our standards of efficiency in this area are not clearly defined."
It's cool that they know how many were aborted, but 10 percent? It's interesting to think of it this way. Maybe there have been studies that compare the actual throughput when you take out that 10 percent. I'm not aware of them. I don't know a thing about databases.
The 10 percent that were aborted and restarted, that means that potentially you had to read twice that from the database from the disk.
If you could read...You're actually losing some read throughput on your disk. I wonder if that overwhelms the throughput you gain by doing that batching of seeks like we talked about. You order your seeks so that they minimize travel of the read head.
Probably not, because he wouldn't be talking about it. These are just the things I think about when we're talking about different levels of generality in the abstraction: what an optimization at one level does to the optimization at another level. It's interesting.
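The numbers in that quote can be pushed a little further with some back-of-the-envelope arithmetic. The sketch below only uses the two figures from the lecture, 10 percent and 100 jobs per hour. Everything else is an assumption, in particular that each aborted job had already done all of its reads before it was aborted, which is the worst case.

```python
# Figures quoted from the lecture.
aborted_per_hour = 100
abort_percent = 10

# If 100 aborted jobs are about 10 percent of all starts, roughly this many
# jobs were started per hour (counting restarts as starts, an assumption).
started_per_hour = aborted_per_hour * 100 // abort_percent

# Starts that were not aborted.
completed_per_hour = started_per_hour - aborted_per_hour

# Worst case: every aborted job read everything it needed before being
# aborted, so all of its disk reads were wasted and had to be done again.
wasted_share_worst_case = aborted_per_hour / started_per_hour

print(started_per_hour)         # 1000
print(completed_per_hour)       # 900
print(wasted_share_worst_case)  # 0.1
```

So under these assumptions at most about a tenth of the read work is thrown away. Whether seek reordering wins back more than that depends on the hardware and the workload, which the lecture does not tell us.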
"Would the avoidance of shared access have permitted more or fewer jobs to be completed each hour? Would some other strategy based upon the detecting rather than avoiding contamination have been more efficient? Would making the programmer aware of shared access permit him to program around the problem and thus, raise the efficiency?
"All these questions are beginning to impinge on the programmer as navigator and on the people who design and implement his navigational aids." In fact, I believe that as technology has changed and evolved over the years, these questions have had to be reexamined continuously. Going from a spinning disk to a solid state disk where you don't have a seek time anymore.
That changes things. Being able to have the whole index in memory, potentially, because you have a lot more RAM, that changes things.
There's been studies that show that if you did the transaction on the database sequentially, you don't have to do all the locking because the locking has a lot of overhead. There's all sorts of different ways of moving around the problem. Moving around the pieces. Where do you solve this problem? How do you solve that one?
Finding a configuration that optimizes something like these questions that he's asking. Optimizes something for a particular database setup. The data you have as it is, with the indexes you have, and the hardware it's running on, all that stuff matters.
Of course, this is an engineer. It's the thing he's thinking about. "My proposition today is that it is time for the application programmer to abandon the memory-centered view, and to accept the challenge and opportunity of navigation within an n-dimensional data space.
"Bertrand Russell, the noted English mathematician and philosopher, once stated that the theory of relativity demanded a change in our imaginative picture of the world. Comparable changes are required in our imaginative picture of the information system world." We've gone through these changes.
We even have trouble imagining what it was like before databases. "A science must be developed, which will yield corresponding minimum energy solutions to database access. This subject is doubly interesting because it includes the problems of traversing an existing database, the problems of how to build one in the first place, and how to restructure it later to best fit the changing access patterns."
All things that database engineers deal with. "It is important that these mechanics of data structures be developed as an engineering discipline based upon sound design principles.
"The equipment costs of the database systems to be installed in the 1980s have been estimated at $100 billion at 1970 basis of value. It has further been estimated that the absence of effective standardization could add 20 percent or $20 billion to the bill."
The equipment costs of these systems were estimated at $100 billion, in 1970 dollars. That is a lot. Wow, that's really amazing that they estimated that much. I wonder what it is today. There's a lot more databases out there today. Our systems are cheaper. I don't know.
"The universities have largely ignored the mechanics of data structures in favor of problems, which more nearly fit a graduate student's thesis requirement. Big data systems are expensive projects which university budgets simply cannot afford.
"Therefore, it will require joint university-industry and university-government projects to provide the funding and staying power necessary to achieve progress. There is enough material for a half dozen doctoral theses buried in the Weyerhaeuser system waiting for someone to come and dig it out."
There's a little dig at academia that it's been ignored. It's mostly because the problems are too big for graduate students to work on for their thesis requirement. We're going to need this cooperation: university-industry, university-government. It's interesting that he talks about a half dozen doctoral theses.
I didn't look this up. I should have done more research. My impression is that's how Postgres advances is someone needs to do a thesis. They come up with some new index or something and add it to Postgres. That's how I think it advances. I'm not sure. Don't quote me on that.
He's going to take a dig at the ACM. "The publication policies of the technical literature are also a problem. The refereeing rules and practices of Communications of the ACM result in delays of one year to 18 months between submittal and publication.
"Add to that the time for the author to prepare his ideas for publication. You have at least a two-year delay between the detection of significant results and their earliest possible publication." That's all I wanted to read. I don't know why I chose to include that dig at the ACM. He's saying this thing is happening fast.
We can't wait two years for every little thing that we were doing to publish it and let other people know about it. There's hundreds of billions of dollars at stake, probably in the trillions to use today's dollars in 2021. It's a lot. [laughs]
I also think it's interesting. I wonder how much of those trillions of dollars could be spent to pay people to develop better database techniques. How much is being spent? I don't know. This one was an interesting one for me, simply for historical reasons. I always try to find something interesting.
Otherwise, this becomes very boring to me reading these old papers. Sometimes I have to dig a little deeper. In this one, I admit it. I had to dig a little deeper. I'm not that interested in databases, but I am interested in history. That's what I clung to.
This historical change in perspective. It's fascinating that he saw this. He was there, seeing people write these programs like reading tape and he's like, "Why don't we centralize all this?" It's not the computer center, it's the database. That's where all the good stuff is. You're going to read and write to that.
Thank you so much. My name is Eric Normand. This has been a podcast episode. Thank you for listening, and as always, rock on.