Talk: Building a Data Pipeline with Clojure and Kafka


David Pick's talk at the conj is about Kafka, which is a distributed messaging system. It implements a publish/subscribe model and runs in a cluster. Its documentation claims that Kafka has very high throughput.

Why it matters

Braintree, where David Pick works, is a huge payment processor, now owned by PayPal. They must have very strict availability and consistency requirements in addition to the scale. The talk description hints that Clojure was helpful to their success. Some of the best talks I've seen are this kind of experience report.

About David Pick

Twitter - Github


David Pick was kind enough to answer a few questions about himself and his upcoming Clojure/conj talk. You may want to read the background before you begin.

Interview How did you get into Clojure?

David Pick: A friend of my mine suggested I read Purely Functional Data Structures and told me the work in the book was the basis for Clojure's persistent data structures. I really enjoyed it and decided to give Clojure a try. Braintree must process a lot of very important data, since it's a payments service. What were you using before you developed the Clojure and Kafka system?

DP: I'd say for 98% of our systems we still use Ruby and Postgres, we're simply using Clojure and Kafka as a messaging platform to move data between systems. What is Kafka?

DP: Kafka is a system for doing publish subscribe messaging, that was built by LinkedIn to have extremely high throughput and replay ability of messages. It stores all messages sent through the system for a rolling window. So consumers simply say give me messages starting with offset n or whatever your most recent message is. This way if a receiving system goes down the consumer can just rewind the stream and replay messages that were missed. Could you describe the problem you were trying to solve? What were the requirements?

DP: There were actually two problems we were trying to solve. The first was to pull our search tool into a separate system that was more scalable than our current one (the current one just runs sql queries on main database). The second was to improve the accuracy and speed that our data warehouse is updated.

For the infrastructure part of the solution we ended up using Kafka, Zookeeper, and Elasticsearch which are all JVM based so that ended up heavily influencing our decision to use Clojure. You mentioned in the talk description that Clojure had certain advantages. What tradeoffs did Clojure, or your choice of technologies, bring about?

DP: The biggest tradeoff I think we made with Clojure is familiarity, I was the only one on the team who knew Clojure prior to starting this project. So that has certainly slowed us down. Other than that things have gone surprisingly well. I'm really fascinated by the engineering efforts to build very reliable and robust systems. Are you going to talk about that?

DP: Yep, that's the plan! Awesome.

Do you have any resources that would help someone new to Clojure or new to Kafka so that they can make the most of your talk? Some pre-reading or pre-watching materials?

DP: For Clojure I definitely recommend both the Clojure Koans and 4Clojure. For Kafka I'd recommend the introduction on their site. Where can people follow your adventures online?

DP: Twitter is probably the best place @davidpick. Who would win in a no-holds-barred fight to the finish, Clojure with an Arduino-controlled laser or a T-Rex?

DP: If Rich is writing the Clojure I give it a fighting chance, otherwise it's the T-Rex for sure. Awesome! Thanks for a great interview. I look forward to seeing you at the Conj.

DP: Awesome, thanks Eric!