Recap of the Oct 2017 ML Platform meetup at Netflix HQ
Machine Learning is making fast inroads into many areas of business and is being employed in an increasingly widening array of commercial applications. At Netflix, ML has been used for several years on our Recommendations and Personalization problems. Over the last couple of years, ML has expanded into a wide range of newer applications here, such as Content Promotion, Content Price Modeling, Programmatic Marketing, Efficient Content Delivery, Adaptive Streaming QoE, to name only a handful. Scaling and speeding up the application of ML is at the top of mind for the multitudes of researchers, engineers, and data scientists working on these salient problems. We are always looking to learn from academic research and industry application in large scale contexts and love to share what we have learnt as part of running the Netflix use cases at scale.
Last week we hosted a Machine Learning Platform meetup at Netflix HQ in Los Gatos. We had five talks from top practitioners in the industry and in this post we will provide a brief overview summarizing them. The talks from Google, Twitter, Uber, Facebook and Netflix largely fell into one or both of these themes:
Sparse Data in ML: Challenges and solutions
Scaling up Training: Distributed vs Single machine approaches
Ed Chi kicked off the talks with a presentation on how to learn user and item representation for sparse data in recommender systems. His talk covered two major areas of emphasis — Focused Learning and Factorized Deep Retrieval.
In setting up the motivation for the first area, Ed talked of the Tyranny of the Majority in describing the outsized impact of dense sub-zones in otherwise sparse data sets. As as solution to this challenge, he talked about Focused Learning, a sort of divide-and-conquer approach, where you look for subsets of data to focus on and then figure out which models can learn on each subset with differing data density. As an example of this approach, he talked about how using focused and unfocused regularization factors in the cost function, allowed them to model the problem with more success. In the canonical example of movie recommendations, he demonstrated how this approach led to outsized improvements in model prediction quality for sparse data sets (e.g. where Documentaries is one of the most sparse categories).
The use case that Factorized Deep Retrieval addresses is about predicting good co-watch patterns of videos when the corpus of items to recommend is huge (YouTube). You want the recommendations to not ignore the fresh content while still ensuring relevance. He presented TensorFlow’s distributed implementation of WALS (weighted alternating least squares) Factorization as the solution they used for picking sparse but relevant candidates out of the huge candidate corpus. Online rankings for serving were staged into a tiered approach with a first pass nominator selecting a reasonably small set. Then subsequent rankers further refined the selection until a small set of highly relevant candidates were chosen and impressed to the user.
Google’s talk on learning with sparse data
Ed also talked about several implementation challenges, in sharing, failures, versioning, and coordination, that the TensorFlow team encountered and addressed in the TFX (TensorFlow Extended) system internally in use at Google. They are looking to add some of these features to TF Serving open-source in the future.
Next up was Joe Xie from Twitter who started off by setting the stage with the real time use cases of Twitter. He talked about their single/merged online training and prediction pipeline as a backdrop to the parameter server approach his team took to scaling training and serving. Joe talked about how they tackled one area to scale up at a time. He walked through three stages of their parameter server, starting with first decoupling the training and prediction requests, then focusing on scaling up the training traffic and finally scaling up the model size.
They were able to increase the prediction queries/sec by 10x by separating training and prediction requests in v1, increase training set size by 20x in v2, and accept models 10x larger in size in v3.
They are looking to explore data-parallel optimizations and feature sharding across their v3 distributed architecture. When asked about whether trainers and predictors are also sharded, Joe mentioned that for now they have sharded the trainers but not the predictors.
Twitter’s parameter server for online learning
Alex Sergeev from Uber introduced a new distributed training library they have recently open-sourced — Horovod, for making distributed TensorFlow workloads easier. Alex started off by setting the problem of scaling up model training. With ever increasing data sizes and the need for faster training, they explored distributed TensorFlow. Alex alluded to challenges in using the distributed TensorFlow package out of the box. At some level they felt lack of clarity around which knobs to use. As well, they weren’t quite able to do justice to their GPU utilization when training data at scale via TensorFlow.
They explored recent approaches to leveraging data-parallelism in the distributed training space. Motivated by some past work at Baidu and more recently at Facebook (see below), they took a different approach to their deep network training. Conceptually, the training set was split up in chunks of data each of which was processed in parallel by executors, computing gradients. In each pass the gradients are averaged and then fed back into the model.
Typically, in TensorFlow the workers compute the gradients and then send them to the Parameter Server(s) for averaging. Horovod utilizes a data-parallel “ring-allreduce” algorithm that obviates the need to have a parameter server. Each of the workers are involved in not only computing the gradients but also averaging them, communicating in a peer-to-peer fashion. Horovod uses NVDIA’s NCCL2 library to optimize bandwidth of communication between workers.
Alex talked about how they were able to utilize this for both single GPU as well as multiple GPU cases. He also talked about Tensor Fusion, an approach to smart batching that gave them larger gains on less optimized networks.
Uber’s Horovod for distributed TensorFlow
When asked about distributing the data across GPUs, interestingly, Alex mentioned that in their benchmarks they have seen better scale performance with 8 machines with a single GPU versus a single machine with 8 GPUs.
Continuing the thread of GPU optimization for large scale training, Aapo Kyrola from Facebook shared some insights on their experiences working on this problem. As a matter of fact, Aapo described in more detail the case study that was referenced by the Uber presentation on Horovod just prior to this talk.
Aapo started off by giving a quick overview of Caffe2, a lightweight ML framework, highlighting how training can be modeled as a directed graph problem. He compared synchronous SGD (stochastic gradient descent) with asynchronous SGD and then proceeded to describe how they pushed the boundaries on sync SGD on Caffe2 with GLOO. The latter is an open-sourced fast library for performing distributed reductions and leverages NVIDIA’s NCCL2 for inter-GPU reductions.
This all led to the case study of their recently published milestone achievement — they were able to train a ResNet-50 (a 50 layer residual convolutional net) on the ImageNet-1K dataset in less than 1 hour. Sync SGD was used with a data-parallel approach and the “all-reduce” algorithm from GLOO. The case study described how they were able to tweak the learning rate piece-wise with an 8K mini-batch size to get to near state-of-the-art on error rate metrics.
Aapo ended the talk on some lessons learned through this exercise, like how Sync SGD can be pushed quite far on modern GPUs, and how learning rate is an important hyper-parameter for mini-batch size.
Facebook’s GPU Training at scale
Netflix’s Benoit Rostykus talked about VectorFlow, a recently open-sourced library for handling sparse data. VectorFlow is designed to be a lightweight neural network library for not-so-deep networks and sparse data.
In many ways Benoit’s talk presented a bit of a counter narrative to the rest of the talks. He argued that while there are a lot of applications in ML which are better served by throwing more training data and more layers (eg, images in convolutional nets), there is a surprising number of applications on the other end of the spectrum too. Such applications, like real-time bidding, can be handled in a single machine context, often need simple feed-forward shallow nets, and performance on CPUs can be pushed hard enough to meet the training latency and cost budget.
VectorFlow was designed for these contexts and this philosophy informed several design decisions. For example the choice of the language, D, was one such decision. The modern system language was chosen to address the goal of a single language in adhoc as well as production contexts — providing the power of C++ and the simplicity/clarity of Python.
Another notable differentiator for VectorFlow is its focus on optimizing for latency instead of throughput. By avoiding large data transfers between RAM and CPU/GPU memory, and avoiding allocations during forward/back-prop, VectorFlow is able to speed things up considerably during training. Benoit was able to demonstrate that for a training set size of 33M rows with a data sparsity of 7%, VectorFlow was able to run each pass of SGD in 4.2 seconds on a single machine with 16 commodity CPU cores.
VectorFlow is used in real-time bidding, programmatic advertising, causal inference, and survival regression to name a few applications in Netflix.
Netflix’s VectorFlow – a minimalist neural network library
Benoit scoped out some of the future work on VectorFlow’s roadmap which includes deeper sparsity support, more complex nodes, and more optimizers, without giving up the core mantra behind VectorFlow that Simple > Complex. VectorFlow is open-sourced and available on github.
This meetup featured high quality talks and active participation by the audience. It was followed by engaging informal conversations after the talks around the direction of ML in general and deep learning in particular. At Netflix, we are solving many such challenging problems in ML research, engineering and infrastructure at scale as we grow to meet the needs of the next 100 million Netflix members. Join us if you are interested in pushing the state of the art forward.