What are some of the coolest projects that you and your team have been working on?
Because Apache Kafka was originally developed at LinkedIn, we have a very strong Kafka development team. Both the Kafka development and SRE teams are very focused on open source work. As part of the SRE team, I took time to develop Burrow, which is an advanced tool for monitoring Kafka consumers. This solved a problem LinkedIn had with monitoring our applications, but because I was able to release it as an open source project, it’s also become a staple for many organizations outside of LinkedIn. We’ve recently completed a rewrite of the project, released as version 1.0, to make it even easier for the community to engage with and improve, something that benefits both the community and LinkedIn.
Burrow is one example of our projects that focus on the operability of Kafka; others, such as Kafka Monitor and Cruise Control, have drawn a significant response as well. I’m also working on releasing some significant performance improvements to Kafka MirrorMaker, which is used for replicating data from one cluster to another.
What other projects are you involved in outside of Kafka?
Of late, I’ve been shifting my focus to look at problems that affect SRE in general and how to improve the quality of life for all of our engineers. For example, I’ve been heavily involved in the revamp of our processes for incident management, with the goal of making them more consistent and allowing us to move more quickly from problem to mitigation to resolution. I’m also starting to look at the use of another technology that LinkedIn is well-versed in, machine learning, to vastly improve site operations.
What made you first want to be a site reliability engineer?
Ever since high school I’ve been working in some aspect of operations—managing school computer systems in both high school and college, and working as a systems administrator after that. My work always included an element of creating tools to make my job easier, so SRE interested me as soon as I was introduced to the concept. It seemed to be a better description of the work I was doing, and it focused more on the thing I enjoyed best: strategic and proactive work to automate tasks.
What is the most challenging part of your job?
We have an excellent engineering team across the board, and the Streaming team in particular has a strong focus on operability and stability. Unfortunately, this means that we don’t have a lot of “easy” problems to work with anymore. Most of our problems end up requiring a significant investment of resources, in terms of both development and SRE, to solve.
Compared to other places you’ve worked, how do you like working at LinkedIn?
LinkedIn encourages us to take risks, and this is quite different from my previous positions. Everyone is constantly on the lookout for a new project or an area where we can pick up a little more performance with a change. And while we certainly don’t want to take the site down, mistakes and failures are not penalized. As a team, we are always learning and improving.
We’re also encouraged to engage with the larger community outside of LinkedIn, and this has transformed my career since I started here. I have had the opportunity to speak at numerous meetups and conferences on both Apache Kafka and the practice of SRE in general, and I’ve even co-authored Kafka: The Definitive Guide.
What are your favorite things to do when you’re not at the office?
Especially with my travel schedule for work, most of my non-work hours are dedicated to my family. My daughters, wife, and I all love musical theater (Hamilton is a permanent fixture in our house right now) and all things Disney. If we’re not at Walt Disney World or on a cruise, we’re planning our next trip.
On my own, I love to go out for a run. I’m on a bit of a break right now, having overdone it a little with eight marathons (and numerous other races) in the last five years, but I’m looking forward to getting out for some shorter runs and races as the weather warms up.