One of our sayings in Site Reliability Engineering (SRE) is that the goal of your job is to “automate yourself out of the job.” While some may have concerns about being replaced by robots, SREs see the value of automating work. It opens up time, removes tedious or repetitive tasks from a workflow, and allows our engineers to spend their valuable time on more complex issues. Used properly, automation opens the door for us to do more thorough investigations of site issues. And of course, if you’re the on-call engineer when a service breaks at 3 a.m. (as I have been many times before), the ability to automate aspects of diagnosing and repairing the issue is very welcome.
Today, we are pleased to announce that we have open sourced two new tools to assist engineers in automating the investigation of broken hosts and services: Fossor and Ascii Etch. Fossor is a plugin-oriented Python tool and library for running those investigations. Ascii Etch is a Python library that takes streams of numbers and turns them into visual graphs using ASCII characters; it was originally created to help display output from Fossor. This post covers the challenges that led us to create these tools, and shows how Fossor can be tailored to specific uses through the creation of plugins.
One of the most powerful aspects of automation is harnessing a computer’s ability to perform tasks in parallel and thereby parse through vast amounts of data quickly. A typical site issue investigation requires performing a sequence of investigative steps, such as the 10 useful commands listed in this Netflix engineering blog post. However, manually tracking commands takes valuable time, especially when dealing with increased latency or a full outage. Having experienced the pain of performing the same repetitive steps again and again during my own on-call shifts, I concluded that writing a tool to perform some of these basic checks in parallel would reduce the mean time to resolution. Taking the idea even further, I wanted a tool that could perform checks tailored specifically to my services while still having the flexibility to incorporate newly developed checks in the future. Fossor was created to do just that.
Fossor architecture and design
In Latin, the word fossor means “grave digger” or “one who digs,” which fits well with Fossor’s purpose of helping users to dig into server or application issues. From its initial conception, an important feature in Fossor’s design was to allow others to easily expand its abilities by adding their own checks through the use of plugins. To ensure optimum performance, even with a potentially large plugin library, Fossor was designed with several key features.
First, to mitigate the problem of having too much output that could potentially obscure key data, Fossor only reports information to the user when it is deemed helpful, as defined by each plugin. This tailored output allows for easy access to reported information. The incorporation of Ascii Etch in certain plugins also allows for a graphical output of data, making the reports easier to read.
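To make the "report only when helpful" idea concrete, here is a minimal sketch of a check that stays silent unless a threshold is crossed, and otherwise returns a small text graph. It does not use Ascii Etch and makes no claim about that library's API; the names ascii_graph and report_if_interesting, the threshold parameter, and the bar-chart rendering are all assumptions made for illustration.

```python
def ascii_graph(values, height=5):
    """Render numbers as a column chart of '#' characters (illustrative only)."""
    values = list(values)
    if not values:
        return ""
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero when all values are equal
    # Scale every value to a bar between 1 and `height` rows tall.
    bars = [1 + round((v - low) / span * (height - 1)) for v in values]
    rows = []
    for level in range(height, 0, -1):  # draw from the top row down
        rows.append("".join("#" if bar >= level else " " for bar in bars))
    return "\n".join(rows)


def report_if_interesting(samples, threshold):
    """Return a graph only when some sample exceeds `threshold` (hypothetical rule)."""
    samples = list(samples)
    if not samples or max(samples) <= threshold:
        return None  # nothing worth showing, so the user sees no output
    return ascii_graph(samples)
```

A caller would print the result only when it is not None, so a healthy host produces no output at all.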
Second, to help curb the introduction of performance- or application-breaking bugs, Fossor separates its code into two parts: the engine and the plugins. The engine is responsible for coordinating plugin execution. It collects the plugins and then runs each one in its own process. Because each plugin is isolated in its own process, a single failing plugin cannot crash the main engine. This plugin resiliency was specifically built in to allow Fossor to safely manage plugins from many contributors, thereby creating a platform for the bridging of expertise among users.
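The engine/plugin split described above can be sketched with Python's standard multiprocessing module. This is an illustration of the process-isolation idea, not Fossor's actual engine code: the function names, the example plugin classes, the timeout parameter, and the choice of the "fork" start method (which assumes a POSIX system) are all assumptions.

```python
import multiprocessing as mp
import time


def _run_in_child(plugin_cls, variables, conn):
    """Child-process entry point: run one plugin and send back its output."""
    try:
        conn.send(plugin_cls().run(variables))
    finally:
        conn.close()


def run_plugins(plugin_classes, variables, timeout=5.0):
    """Run every plugin in its own process; return {plugin name: output}."""
    ctx = mp.get_context("fork")  # assumption: POSIX host where fork is available
    jobs = []
    for cls in plugin_classes:
        reader, writer = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_run_in_child, args=(cls, variables, writer))
        proc.start()
        writer.close()  # the parent keeps only the reading end
        jobs.append((cls.__name__, proc, reader))

    deadline = time.monotonic() + timeout  # one shared deadline for all plugins
    report = {}
    for name, proc, reader in jobs:
        try:
            if reader.poll(max(0.0, deadline - time.monotonic())):
                output = reader.recv()
                if output:  # only "interesting" (non-empty) output is reported
                    report[name] = output
        except EOFError:
            pass  # the plugin process died before sending anything
        finally:
            reader.close()
            proc.join(timeout=0.5)
            if proc.is_alive():  # hung plugin: stop it so the engine can finish
                proc.terminate()
                proc.join()
    return report


# Hypothetical plugins, used only to exercise the sketch.
class HighLoad:
    def run(self, variables):
        return "load average is %s" % variables["load"]


class Healthy:
    def run(self, variables):
        return None  # nothing interesting to report


class Broken:
    def run(self, variables):
        raise RuntimeError("plugin bug")  # must not take the engine down
```

Calling run_plugins([HighLoad, Healthy, Broken], {"load": 12.5}) returns only the HighLoad entry: the Healthy plugin returned nothing, and the exception in Broken ends its own process (printing a traceback to stderr) without affecting the engine.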
Plugins are small classes that must implement a single method: the run method. The run method accepts a single argument, a Python dict named “variables,” which can optionally provide external information to the plugin. If the run method returns output, this indicates the output is “interesting” and should be reported back to the user. All plugin types use this same basic structure. Below are two examples of plugins.
Example of a Check plugin