Eric Normand Newsletter 477: Encapsulating collections

Eric Normand's Newsletter
Software design, functional programming, and software engineering practices
Over 5,000 subscribers

Reflections 🤔

Encapsulating collections

I was reading a thread on Twitter today, and I had some comments. It's a good thread, and I wanted to do it justice by writing a longer-form response. The thread is by GeePaw Hill. I will quote from it. The topic is summed up in the first tweet:

I usually encapsulate the native collection classes -- List, Set, Map, et al -- within minutes of using them, and sometimes even before I use them.

The rest of the thread justifies why he does this and recommends we do it.

The argument is well expressed. He comes at it with a lot of experience. And his solution would work. I merely want to express another solution to the same set of problems.

You see, the follow-on consequences of using native collections directly have bitten me in so many ways so many times, that I just shudder when I see them exposed for more than a few minutes.

I really like this. He is using his experience to inform his practice. We should all do this more. However, I do disagree with this particular practice.

Where's the biting come from? It comes from change. When you use, say, a Set<X> instead of a List<X>, or a Map<key,X>, you are making a deep and binding commitment to a decision. When that decision confronts change, undoing that commitment is often both tedious and expensive.

Again, I agree. When you determine the type signature of the method, that's part of the contract with the callers of your method. You should be careful what you put in the signature. As your method is used more and more, that means there are more and more places that will break if you change the signature. It becomes increasingly expensive to change it. At some point, it becomes so expensive you will never do it. Choosing to return a Set is probably a commitment to always return a Set.

Your client code, the code that's actually dereferencing your native collection field, almost never cares what abstract platonic form that collection takes. It's usually interested in just four questions about that collection:

  1. How many are in there?
  2. Is this in there?
  3. Can I see all of them one by one?
  4. Can I try to add something to them?

I don't have trouble believing this. In Clojure terms, he's saying that the calling code only needs count, contains?, seq, and conj. Agree so far.

But then he goes on to say we should wrap our collection in a new class. I don't agree with that. I mean, okay, it will solve the problem, but it contributes to type proliferation.

If your language requires you to declare a type, you should instead say exactly what you mean:

(Countable, Containing, Seqable, Conjable)<X>

Can you say that in Java? I am not well versed in Java's generics. But that's what you really want to say, and some languages do allow it.
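For what it's worth, Java can get partway there. Type parameters may carry an intersection bound (A & B), and a named interface can bundle several small ones so the bundle can be used as a return type. Here is a minimal sketch, assuming hypothetical interfaces named after the list above (Countable, Containing, and Conjable are not JDK types, and Bag and ListBag are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical single-purpose interfaces; the names are assumptions, not JDK types.
interface Countable { int count(); }
interface Containing<X> { boolean contains(X x); }
interface Conjable<X> { void conj(X x); }

// Iterable (from the JDK) plays the role of Seqable: "see them one by one".
// A named interface bundles all four so it can appear as a return type.
interface Bag<X> extends Countable, Containing<X>, Iterable<X>, Conjable<X> {}

// One list-backed implementation, written once for every X.
final class ListBag<X> implements Bag<X> {
    private final List<X> items = new ArrayList<>();
    @Override public int count() { return items.size(); }
    @Override public boolean contains(X x) { return items.contains(x); }
    @Override public void conj(X x) { items.add(x); }
    @Override public Iterator<X> iterator() { return items.iterator(); }
}

final class IntersectionDemo {
    // Intersection bound: C must be both Countable and Containing<X>, nothing more.
    static <X, C extends Countable & Containing<X>> boolean holdsOnly(C c, X x) {
        return c.count() == 1 && c.contains(x);
    }

    public static void main(String[] args) {
        Bag<String> names = new ListBag<>();
        names.conj("ada");
        System.out.println(holdsOnly(names, "ada")); // true
    }
}
```

The intersection bound only works on a type parameter, so it suits methods that accept a collection. For a return type, the named Bag interface is the more practical route.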

But let's assume that Java doesn't allow it, and you do have to write a custom class. That's fine. But if it really is such a common pattern, why not make the class once? Why make a new one for each X? You should just make MyCollection<T> and be done with it.

Or, if you do need different behavior (such as storing/not storing duplicates), you could make it an interface and extend it a few times.
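As a rough sketch of that idea: one generic interface answers the four questions, and a couple of small classes vary only the duplicate-handling behavior. MyCollection is the name used above; the method names and the classes BackedCollection, WithDuplicates, and WithoutDuplicates are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

// The four questions from the thread, declared once for any element type T.
interface MyCollection<T> extends Iterable<T> {
    int count();
    boolean contains(T item);
    boolean add(T item); // "try to add": false means the item was not added
}

// Shared plumbing: subclasses only choose the backing collection.
abstract class BackedCollection<T> implements MyCollection<T> {
    private final Collection<T> backing;
    BackedCollection(Collection<T> backing) { this.backing = backing; }
    @Override public int count() { return backing.size(); }
    @Override public boolean contains(T item) { return backing.contains(item); }
    @Override public boolean add(T item) { return backing.add(item); }
    @Override public Iterator<T> iterator() { return backing.iterator(); }
}

// Keeps duplicates, in insertion order.
final class WithDuplicates<T> extends BackedCollection<T> {
    WithDuplicates() { super(new ArrayList<T>()); }
}

// Drops duplicates, still in insertion order.
final class WithoutDuplicates<T> extends BackedCollection<T> {
    WithoutDuplicates() { super(new LinkedHashSet<T>()); }
}

final class MyCollectionDemo {
    // Tries to add every item, then reports how many the collection kept.
    static <T> int fill(MyCollection<T> target, Iterable<T> items) {
        for (T item : items) {
            target.add(item);
        }
        return target.count();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a");
        System.out.println(fill(new WithDuplicates<String>(), input));    // 3
        System.out.println(fill(new WithoutDuplicates<String>(), input)); // 2
    }
}
```

Nothing here mentions User or Document, so the same two classes serve every entity type.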

I don't mean to pick on anybody, but it is so common in the Java world to fail to see the abstraction. He sees that it's a super common thing, but instead of encapsulating that thing in code, he writes a "Design Pattern" that he uses over and over.

He continues:

There's an actual and substantial advantage to wrapping collection classes early & often. When you wrap a native collection, you create a surface on which to provide not just the limited parts of the API, avoiding TMI (too much information) issues, but also extra parts, custom to your client code's actual interests in your collection.

I agree that this is a great advantage. As you use the API, you'll find operations on this collection that actually should belong to the API. I don't mind putting that stuff outside of the type. I'm a functional programmer, so a function that takes the collection as an argument is very natural. I can see how it is not so natural to OOPers, though. Still, it's a common pattern in Java to have "utility" static methods hanging off of a different class (often with the pluralized name).

Examples:

You'll see nearly identical manipulations of the data structure all over the place. This is the client code saying it wishes you had an API for this.

He's arguing that you will find places where the caller is doing the same thing to your collection in different places. That's probably true, but:

  1. Why doesn't this belong to the caller? Can't the caller deduplicate its own code? Why extend the contract with a new method?
  2. By the same logic, wouldn't you find identical manipulations to collections of other types?

Let me illustrate that last point. Let's say you have a collection called Users that lets you do stuff to Users. And you also have a collection called Documents. Is it weird to think that there might be some operations in common? Wouldn't you wind up duplicating those to the different collection classes? And isn't that what the built-in collections are for?
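To make that concrete, here is a sketch of a single operation written once as a static utility method and reused for both entity types. The User and Document classes, their fields, and the Colls helper class are all hypothetical, invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical entity types, for illustration only.
final class User {
    final String name;
    final boolean active;
    User(String name, boolean active) { this.name = name; this.active = active; }
}

final class Document {
    final String title;
    final boolean archived;
    Document(String title, boolean archived) { this.title = title; this.archived = archived; }
}

// Static helpers hanging off a separate class, in the style of java.util.Collections.
final class Colls {
    private Colls() {}

    // Written once; works on a collection of any element type.
    static <T> List<T> keep(Collection<T> items, Predicate<? super T> wanted) {
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            if (wanted.test(item)) kept.add(item);
        }
        return kept;
    }
}

final class SharedOpsDemo {
    static int activeUsers(Collection<User> users) {
        return Colls.keep(users, u -> u.active).size();
    }

    static int liveDocuments(Collection<Document> docs) {
        return Colls.keep(docs, d -> !d.archived).size();
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(new User("ada", true), new User("bob", false));
        List<Document> docs = Arrays.asList(
            new Document("spec", false), new Document("old", true), new Document("plan", false));
        System.out.println(activeUsers(users));  // 1
        System.out.println(liveDocuments(docs)); // 2
    }
}
```

With custom Users and Documents wrapper classes, keep would have to be written (or delegated) twice. Against the built-in Collection interface it exists in one place.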

Listen, I love my client code, but I ain't about letting it run willy-nilly through my dataset, doing anything to that dataset in any way it wants, without me knowing what it's doing. And when I expose the List base of that dataset, that is exactly what I'm doing.

Oh. I forgot! He's talking about mutable collections. I don't mind if a caller runs willy-nilly through my data. That's part of the contract of immutable data structures. Maybe he's dealing with a deeper issue with the mutable collection pattern.

My client code needs to know what my service code is willing to tell it, and nothing more.

Agreed. But then why don't you return Collection<X>?

Unfortunately, GeePaw Hill does not explain the downsides of his approach.

The first downside is class proliferation. If you have a User class, then you need a Users collection. That basically doubles all of your "entity" classes. But then, according to his stories, sometimes you want duplicates and sometimes you don't. Does that mean you need a UsersWithDuplicates class and a UsersWithoutDuplicates class? What about ordered collections and unordered? Do you need UsersOrderedWithDuplicates?

Or if you don't have such a proliferation, and let's say you only have Users for collections of Users, you see the second downside. Does that mean you need Users to work for everyone? It sounds like a recipe for the "god class" antipattern. That is, it will turn into a giant class with lots of methods, each one solving one client's problems.

The third problem I have already mentioned: There is no place to put the duplicated methods. Since each class is custom, there will be duplicated functions like map and filter. By beginning with custom classes, you lose the ability to solve problems at the more general level.

What are the causes of this problem?

I think there are three factors that lead to this pattern in languages like Java.

  • Mutable collections
  • Semantics of the type system
  • Difficulty abstracting

Because collections are mutable, by returning a collection, you are allowing the caller to change the collection, even if they shouldn't. And even if they should, they will have low-level methods that allow them to change them in ways they should not be able to. This leads to programmers wanting to wrap up the collection with a new, custom API that enforces the correct mutations.

This problem goes away if your collections are not mutable. What caller A does to the collection does not affect what caller B does, so the API can wash its hands of responsibility for them. Note that in Java, there is a set of wrappers (such as Collections.unmodifiableList) that decorate the mutable classes, giving callers a read-only view of them. That would be one solution immediately available.
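A minimal sketch of that option, using the JDK's Collections.unmodifiableCollection and Collections.unmodifiableList wrappers. The Roster class and its method names are hypothetical. One caveat: the wrapper is a read-only view, so later changes made by the owner still show through unless you wrap a copy.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical service that owns a mutable list but never hands it out directly.
final class Roster {
    private final List<String> names = new ArrayList<>();

    void enroll(String name) { names.add(name); }

    // Read-only view: callers can count, test membership, and iterate,
    // but mutating methods throw UnsupportedOperationException.
    Collection<String> names() {
        return Collections.unmodifiableCollection(names);
    }

    // Read-only wrapper around a copy: later enrollments do not show up in it.
    Collection<String> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(names));
    }
}

final class RosterDemo {
    // Returns true if the collection refuses to be modified.
    static boolean rejectsAdd(Collection<String> c) {
        try {
            c.add("mallory");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Roster roster = new Roster();
        roster.enroll("ada");
        Collection<String> view = roster.names();
        Collection<String> snap = roster.snapshot();
        roster.enroll("grace");

        System.out.println(view.size());      // 2: the view tracks the owner's list
        System.out.println(snap.size());      // 1: the snapshot does not
        System.out.println(rejectsAdd(view)); // true
    }
}
```

Returning Collection<String> rather than List<String> also keeps the signature from promising ordering or index access.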

The type system does not make it easy to specify a set of interfaces you want to support. (Please correct me if I am wrong. As far as I know, Java accepts an intersection like A & B only in type-parameter bounds and casts, not as an ordinary declared return type.) The built-in classes are not changeable, so you cannot refactor which interfaces they implement. Perhaps if you could pinpoint the exact methods you will support, this problem would go away.

In Clojure, we don't have this problem because we don't have types (and also the data is immutable). But in Haskell, you can list the type classes that your return value will implement. And you can define type class implementations (essentially adding interfaces to a Java class) after the type is defined.

But this gets to the third problem: Why don't Java programmers abstract more than they do? Why do they jump directly to the specific classes (Users, Documents, etc.) and avoid the more general solution (a MyCollection class)? Why does the abstraction occur out of band, in a Design Pattern description, instead of in code? Is it intrinsic to the language? Or is it a habit of the community?


Domain Modeling 🎙

I've updated the table of contents thanks to feedback from my readers.


Grokking Simplicity 📘

Here's a really nice review on Amazon:

I think this is one of the greatest didactic books on programming (and functional programming in particular) ever written.

You can order my book on Amazon. Please leave a rating and review. Reviews are a primary signal that Amazon uses to promote the book. They help others learn whether the book is for them.

You can order the print and eBook versions on Manning.com (use TSSIMPLICITY for 50% off).


Clojure Challenge 🤔

Last challenge

Issue 476 - How many digits? - Submissions

This week's challenge

Box of chocolates

You work at a chocolate shop that makes two sizes of chocolates:

  • Small (2 grams each)
  • Large (5 grams each)

When someone orders a box of chocolates, they order by total mass. It's your job to figure out how to fulfill the order using a combination of small and large chocolates to exactly hit the total mass ordered.

Your task is to write a function that takes three arguments:

  1. smalls - The number of small chocolates available in the inventory.
  2. larges - The number of large chocolates available in the inventory.
  3. mass - The total mass of the ordered box.

Your function should return a map containing:

{:small ;; the number of small chocolates
 :large ;; the number of large chocolates
}

Or nil if the total mass is not possible.

One other constraint: you should strive to use the fewest chocolates that total the target mass. That means you should prefer large chocolates over small chocolates if you have the choice.

Examples

(assemble 100 100 2) ;=> {:small 1 :large 0}
(assemble 100 100 1) ;=> nil
(assemble 100 100 10) ;=> {:small 0 :large 2}
(assemble 10 2 20) ;=> {:large 2 :small 5}

Thanks to this site for the problem idea, where it is rated Expert in Ruby. The problem has been modified.

Please submit your solutions as comments on this gist.

Rock on!
Eric Normand
