By Mohit Vora, Andrew Berglund, Videsh Sadafal, David Pfitzner, and Ellen Livengood
In previous posts we’ve talked about how we calculate, predict, and use content popularity to maximize the efficiency of our content delivery network, Open Connect, along with other data science challenges in this space. We also recently talked about improvements we made in the server throughput space.
In this blog post, we will dig deeper into how we place content on Open Connect servers (also referred to as Open Connect Appliances, or OCAs, in other literature), including our hashing strategies and how we deal with heterogeneous server clusters. This work is the result of continued collaboration between the Open Connect and Science & Analytics teams at Netflix.
Content Placement Goals
Content Placement refers to the decisions we make on a daily basis about which content gets deployed to which servers in a given cluster. (Refer back to our earlier blog for an overview of why these decisions are important.)
In general, to maximize the traffic a cluster can serve, we should place the most popular content in the catalog onto that cluster. It also makes sense to load balance the popular content across every server in the cluster. A secondary goal is for the allocation to be reasonably stable day over day, and to change as little as possible when servers are added to or removed from a cluster. And finally, the allocation algorithm needs to be reasonable in terms of compute requirements.
Uniform Consistent Hashing
We use Consistent Hashing to distribute content across multiple servers as follows. Imagine a ring of hash values from 0 to N (Figure 1). Server IDs S1 to Sn are hashed onto this ring. The arc of the ring between the previous server's hash and h(Si) is said to be owned by Si (Figure 2). Content IDs C1 to Cm are hashed onto the same ring, and Cj is assigned to the server that owns the part of the ring where h(Cj) lands (Figure 3).
In addition, we hash every server ID (S1 to Sn) 1000 times onto the ring, which produces a reasonably even distribution of content and facilitates fair re-hashing when the cluster changes. In the Uniform Consistent Hashing approach, every server is assigned the same weight. Finally, when a piece of content requires multiple replicas, we continue around the ring until we have found as many distinct owners as we need replicas.
Using this approach, day-over-day churn is minimized. Content that is added to or removed from the catalog impacts only the server that needs to download or delete this piece of content. When a server is added into a cluster, 1000 new slices are distributed over the ring, where the new server takes over content roughly equally from the other servers. Similarly, when a server is removed from a cluster, its 1000 slices are removed, and it passes on the content ownership roughly equally to the rest of the servers in the cluster.
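The scheme above can be sketched in a few lines of Python. This is a simplified illustration rather than Netflix's production code: the hash function, the `ConsistentHashRing` class, and the 1000-slice constant are stand-ins for whatever the real implementation uses.

```python
import hashlib
from bisect import bisect_right

RING_SLICES = 1000  # each server is hashed onto the ring this many times

def _h(key: str) -> int:
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, server_ids):
        # One (point, server) entry per virtual slice, sorted by ring position.
        self._ring = sorted(
            (_h(f"{s}:{i}"), s) for s in server_ids for i in range(RING_SLICES)
        )
        self._points = [p for p, _ in self._ring]

    def owners(self, content_id, replicas=1):
        """Walk clockwise from h(content_id) to the first `replicas` distinct servers."""
        start = bisect_right(self._points, _h(content_id))
        found = []
        for k in range(len(self._ring)):
            server = self._ring[(start + k) % len(self._ring)][1]
            if server not in found:
                found.append(server)
                if len(found) == replicas:
                    break
        return found

ring = ConsistentHashRing(["S1", "S2", "S3", "S4"])
print(ring.owners("C42", replicas=2))  # two distinct owners for content C42
```

Because each server contributes 1000 slices, removing one server only reassigns the content in its slices, spread roughly evenly across the remaining servers, which is what keeps day-over-day churn low.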
Heterogeneous Cluster Allocation
We found that the Uniform Consistent Hashing approach can be sub-optimal in our environment due to the additional layer of complexity that is introduced by our heterogeneous fleet of servers. Our servers fall into one of two general categories — Storage and Flash. These two server types have very different characteristics. Storage servers consist of mostly spinning disks, can hold upwards of 200 TB, and generate ~40 Gbps of throughput. Flash servers (all SSD disks) can generate up to ~100 Gbps but can hold only up to 18 TB of content. For small to medium ISP co-locations, we ship Storage servers only. Our IX and large ISP co-locations consist of a Flash server tier for most traffic and a Storage server tier for storing the entire catalog.
Our hardware team builds newer revisions of these servers with ever-increasing capability. For maximum flexibility, we need to enable newer servers to serve alongside older ones without compromising on resource utilization. Additionally, one or more drives on servers can fail. We disable such drives automatically and this leads to disk space differences even among servers with the same hardware type. Overall, these complexities mean that servers in the same cluster have different levels of storage and throughput capacities.
Uniform Consistent Hashing works well when servers are homogeneous, but it tends to over- or under-utilize resources in the heterogeneous case.
Differing Storage: For storage clusters with different disk capacities (for example, four 100 TB servers and one 50 TB server in a single cluster), Uniform Consistent Hashing assigns each server 1/5 of the content at every popularity level, so the 50 TB server fills up once roughly 250 TB of catalog has been placed. Its 1/5 share of the content between the 250 TB and 500 TB catalog-depth marks is dropped, creating a gap in stored popular content (a “content hole” in our terminology). In certain cases, content holes can lead to the content not being available for streaming.
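A toy simulation makes the content hole concrete. Here round-robin assignment stands in for the uniform hash, each title is modeled as exactly 1 TB, and the catalog depth of 500 titles is invented for illustration:

```python
# Five servers: four with 100 TB and one with 50 TB of capacity.
capacities = {"S1": 100, "S2": 100, "S3": 100, "S4": 100, "S5": 50}
servers = list(capacities)
stored = {s: 0 for s in servers}
dropped = []

for rank in range(500):          # titles ordered most popular first
    owner = servers[rank % 5]    # round-robin stand-in for uniform hashing
    if stored[owner] < capacities[owner]:
        stored[owner] += 1
    else:
        dropped.append(rank)     # owner is full: a hole in popular content

print(min(dropped), len(dropped))
```

Running this, the small server fills after its 50th title (around catalog rank 250), and every fifth title past that point is dropped: 50 titles missing from the 250–500 range, exactly the 1/5 hole described above.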
Differing Throughput: In 2016, we built servers that could generate 100 Gbps of throughput with 18 TB of storage. Most of our Flash servers in production are 40 Gbps with 12 TB of storage. Uniform Consistent Hashing cannot combine these two types of servers into a single cluster: the traffic attracted to a server is generally proportional to how much content it stores, so the newer servers would attract traffic in a 3:2 ratio (matching their 18:12 storage), whereas the target traffic proportion is roughly 5:2 (matching their 100:40 throughput).
The solution to these issues is a new algorithm we developed called Heterogeneous Cluster Allocation (HCA). The HCA algorithm is used to more intelligently distribute content across heterogeneous servers to make better use of hardware resources.
HCA addresses the above cases by altering the allocation protocol. The basic idea is simple: keep the consistent hashing, but use a model to derive per-server allocation weights when placing content. A weight is applied by changing the number of slices that server hashes onto the consistent hash ring.
We have two criteria that need to be satisfied:
- Distribute content in proportion to the storage capacity of each server without causing content holes
- Distribute popular and less popular content so that traffic attracted to a server is proportional to its throughput capacity
A simple weighted consistent hashing algorithm — assigning different weights to each machine — could satisfy one or the other constraint, but not both. To satisfy both criteria, we needed to use two different sets of allocation weights — one for popular content, and another for less popular content. The HCA algorithm is a systematic procedure for doing this.
The HCA algorithm allocates content in two stages, each with its own weighted consistent hash ring. To configure it, we must specify weights for each server in each stage and the catalog depth D (“cutoff”) where we switch from stage 1 to stage 2. Given the storage and throughput specification of each server, a popularity curve for the cluster’s region, and a candidate cutoff D, we formulate and solve an optimization problem that either yields a set of allocation weights satisfying both criteria above or determines that cutoff D is infeasible (no configuration satisfies the constraints).
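The two-stage placement can be sketched as follows. The per-server slice counts and the cutoff below are hypothetical; in practice the weights would come out of the optimization described above, which balances throughput for popular content and storage for the long tail.

```python
import hashlib
from bisect import bisect_right

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(slices_per_server):
    """One ring entry per slice; per-server slice counts act as allocation weights."""
    return sorted(
        (_h(f"{s}:{i}"), s)
        for s, n in slices_per_server.items()
        for i in range(n)
    )

def owner(ring, content_id):
    points = [p for p, _ in ring]
    idx = bisect_right(points, _h(content_id)) % len(ring)
    return ring[idx][1]

# Hypothetical weights for a mixed cluster: flash servers take most of the
# popular (stage-1) content, the storage server takes most of the long tail.
stage1 = build_ring({"flash1": 1500, "flash2": 1500, "storage1": 500})
stage2 = build_ring({"flash1": 200, "flash2": 200, "storage1": 1600})
D = 2000  # cutoff in popularity rank (illustrative)

def place(rank, content_id):
    """Titles above the cutoff use the stage-1 ring; the rest use stage 2."""
    ring = stage1 if rank < D else stage2
    return owner(ring, content_id)
```

Within each stage this is still consistent hashing, so the churn-minimization properties described earlier are preserved; only the weights differ between the two rings.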
While it is possible that no HCA configuration satisfies both criteria for a given cluster and popularity curve, in practice we find that a wide range of cutoffs D is usually feasible. For the final HCA configuration, we choose the cutoff D* that induces the least churn for content that crosses the cutoff. For example, if a particular downloadable is ranked above the cutoff one night and below it the next due to popularity changes, it is allocated by different rings on consecutive days and may shuffle to a different server. We choose the cutoff where the probability of such shuffling is smallest.
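Cutoff selection can be illustrated with a toy example: among feasible candidate cutoffs, count how many titles cross each cutoff between two consecutive days' popularity rankings, and pick the cutoff with the fewest crossings. The ranks and candidate cutoffs below are invented for illustration.

```python
# Each title maps to its popularity rank on two consecutive days.
daily_ranks = {
    "A": (10, 12), "B": (95, 110), "C": (105, 90),
    "D": (300, 290), "E": (98, 102), "F": (500, 480),
}
candidate_cutoffs = [50, 100, 200, 400]  # assumed already feasible

def crossings(cutoff):
    """Count titles that move across the cutoff between the two days."""
    return sum(
        (r1 < cutoff) != (r2 < cutoff) for r1, r2 in daily_ranks.values()
    )

best = min(candidate_cutoffs, key=crossings)
print(best, crossings(best))
```

In this made-up data, a cutoff of 100 sits in a crowded part of the ranking (titles B, C, and E all cross it), while a cutoff of 50 is crossed by nothing, so it would be the churn-minimizing choice.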
We also need to handle the case where the cluster configuration changes, for example when an OCA is added to or removed from the cluster. This scenario can also induce churn if the HCA reconfiguration changes the cutoff D* or the per-server slice counts. To mitigate this, we can scale the slice counts in each stage up or down (only their ratios matter, not the absolute numbers) to minimize churn across reconfigurations.
Using the HCA algorithm to distribute content to Open Connect servers has shown a clear benefit, with content holes substantially reduced and significant improvements in load balance in clusters that are clearly heterogeneous.
We are always evaluating and improving our popularity algorithms and storage strategies. If these kinds of large scale challenges sound interesting to you, check out our latest job postings!