How was multi-clustering achieved?
We started by partitioning the endpoints of our service. This was relatively easy to do, because endpoints in the service were already divided into separate ownership groups called “verticals.” Then, using the data collected by our monitoring systems, we identified the verticals that had enough traffic to justify separating them.
Then, we laid out the plans. We prepared a runbook that described each step we would take for each vertical. We also prepared a schedule skeleton that we instantiated for each vertical, including tasks for infrastructure team, SREs, and vertical team. These allowed us to inform each vertical team and set a pace with all stakeholders.
For each vertical, we started by modifying our build system to create an additional deployable named after the vertical it would serve. This dedicated deployable also had its own configuration that inherited from and extended the shared configuration for the service.
In parallel, we started examining the traffic being received by the vertical’s endpoints in order to make an estimation of the number of servers that would be needed in the new cluster. Our calculation was simple:
VERTICAL_MAX_QPS: Maximum number of queries served by a vertical during a colo failover
EXTRA_CAP: A multiplier to account for additional capacity needed for downstream anomalies. Taken as 1.20, thereby increasing capacity by 20%.
HOST_MAX_QPS: Maximum QPS per host our service can serve (across all endpoints)
Initial cluster size = VERTICAL_MAX_QPS * EXTRA_CAP / HOST_MAX_QPS
We considered this estimation as an intelligent risk, to be mitigated by our ability to ramp traffic to the new deployables in small steps, and being able to closely observe QPS capacity, latencies, and error rates through our monitoring infrastructure for each vertical. In fact, it later turned out that the max QPS our service can serve for individual verticals varies between 30-300% of this observation! In the end, we were able to account for this variability in our process for determining final cluster size without any service disruptions.
Once we found the estimated size of the cluster, we put in our request for server resources. LinkedIn’s infrastructure tooling made it possible for us to quickly set up the new servers for the new deployables. Once the new servers were ready, we were able to move forward with deployment.
After deployment, we were ready to route traffic to the cluster. There are two traffic sources for API layer services at LinkedIn: traffic between our services that uses Rest.li D2 protocol, and HTTP(S) traffic from the internet through our (reverse) proxies in the traffic routing layer. Specific to our service (since it’s an API service serving frontends), the D2 traffic was much smaller compared to the HTTP traffic. This was a useful coincidence, because although our traffic systems allowed us to determine which percentage of the traffic would be routed to which deployable, we were unable to do this kind of traffic-shaping using the fully distributed Rest.li D2. Therefore, we first ramped our new deployables to share the D2 load for the vertical endpoints with the existing service, then stopped the existing service from taking D2 traffic, and finally ramped the HTTP traffic in steps.
While ramping the HTTP traffic, we did capacity testing. When we had enough traffic to overload at least three servers, we slowly took down servers, while observing the latencies and error rates. This process told us how many QPS each server was capable of processing without incidents for the set of endpoints we were separating. In turn, we used this observation as a basis for capacity planning, and it allowed us to fine-tune our resource allocation. It was part of our policy not to ramp traffic routing to 100% before this tuning of resources.
Since this is a living system, all of these changes happened in parallel to any development and feature ramps that vertical teams were working on. Hence, every step was carefully communicated to all stakeholders. That allowed vertical teams to be able to take the status of the transition from old service deployables to new ones when investigating problems they faced.
The most immediate impact of multi-clustering has been on resiliency. As we expected, we are now able to limit downstream failures or bugs in one vertical of the product, avoiding having them cascade to other verticals. Hence, we achieved our most important goal for this project. We have observed instances where one cluster has gone down due to backend issues, but linkedin.com was able to stay up.
In addition to improving resiliency, it has become possible to tune each cluster, which only serves the set of endpoints that belong to a single vertical, independently from others. The increased deployment granularity made the following possible:
We are able to do capacity planning much better. Separation of clusters allowed us to stress test each vertical separately, which was not possible before. We discovered considerable variation between per-host response capacity for different verticals. These measurements allowed us to fine-tune the resources for each vertical, and informed the engineers about whether and where they should improve their code.
We can monitor the service in a much better way. The logs for each vertical are separated, making it easier for the vertical teams to locate relevant information in the logs. Our regression monitoring system, called EKG, works by comparing performance between the previous and new deployment versions. EKG can now run in a per-vertical manner. Running EKG in this way increases the intelligibility of its results, and solves the problem of high QPS end-points in some verticals preventing regressions from being noticed in lower-QPS endpoints in other verticals.
We have more control over deployment architecture. The average downstream service fanout for the newly-created clusters is reduced to 35% of the fanout for the single-cluster deployment, improving stability and decreasing the statistical likelihood of service failure. Clustering based on verticals improves our ability to fine-tune security settings in our backend servers. We are also able to improve our backend resiliency by fine-tuning incoming query quotas for our backend servers, taking advantage of higher granularity in the Voyager-API deployment.
Voyager-API was built as a monolithic application to serve all platforms for LinkedIn’s flagship product. Voyager-API’s monolithic structure improved cross-platform collaboration and cross-vertical reuse, and hence supported faster iterations. However, with growing scale, its deployment as a monolithic application created risks of cascading failures across the different verticals it served.
In order to improve its resiliency and stability, we took an approach we call “multi-clustering”: we partitioned the set of endpoints along product verticals, and deployed a separate copy of the whole application for each partition, so as to not forgo the observed benefits of developing a monolithic app. This correlated nicely with the organizational structure as well, as each of these partitions were owned by a separate team.
As a result of our multi-clustering effort, we were able to reach our objective of increasing resiliency and stability. In addition, we improved our capacity planning, deployment fine-tuning, and developer productivity.
Along with the authors of the document, we would like to acknowledge sincere thanks to Diego Buthay, Theodore Ni, Aditya Modi, Felipe Salum, Maheswaran Veluchamy, and Anthony Miller, who also worked on the gigantic multi-clustering effort.