Big challenges in the big data ecosystem
At LinkedIn, we have a number of challenges managing data in our complex data ecosystem. Changes to our infrastructure are often necessary to make progress, but they are difficult to accomplish without an expensive, large-scale, coordinated effort. Analytics processing systems are constantly improved to work more efficiently with different storage formats. Presto, for example, is typically a lot more efficient when running on columnar storage formats, as opposed to row-based formats. When a new system like Presto shows up, it is incredibly difficult to introduce a new storage format without disrupting existing applications, as they are often accessing raw data directly. Similarly, when other physical characteristics of data change, such as location or partitioning strategy, applications break and the war rooms begin. Finally, at LinkedIn’s scale, we are constantly hitting the limits of Hadoop scalability. We are thus forced to federate data across multiple clusters for scaling purposes. Assumptions about data availability in a particular cluster can often break applications in unexpected ways.
There is general awareness about the infrastructure-induced challenges in the big data community. What is often overlooked is that there are similar (if not more challenging) problems associated with dataset changes. Both schema changes and semantic changes can have widespread downstream impact. For example, when a field representing granularity of time is modified from seconds to milliseconds, a number of downstream processes are impacted because assumptions in their code are no longer true. These changes place a steep tax on the productivity of the team owning the dataset, as well as on the teams responding to them. The effort involved in making such changes invariably overshadows the benefits, and as a result, necessary changes tend to get put off, slowing innovation. Refactoring code is an essential part of software craftsmanship. The art of refactoring data is just as vital as managing firmware or hardware upgrades, but is often overlooked.
What is Dali?
From these experiences, we have realized that challenges within the data ecosystem broadly fall into two categories: insulation from infrastructure changes, and data agility. There have been many attempts to tame the chaos associated with infrastructure changes. The most popular theme in this vein is the creation of a layer of abstraction separating physical details from logical concerns. In the big data space, a few successful incarnations of this idea can be seen in Apache Hive, Hive’s HCatalog component, and Twitter’s Data Abstraction Layer (DAL). Dali (Data Access at LinkedIn) draws inspiration from these efforts, yet it occupies a unique point in the design space by addressing both types of challenges described above.
At its core, Dali provides the ability to define and evolve a dataset. This is not unlike efforts like HCatalog, where a logical abstraction (e.g., a table with a well-defined schema) is used to define and access physical data in a principled manner through a catalog. Dali’s key distinction is that a dataset need not be restricted to physical data alone. One can extend the concept to include virtual datasets, and treat physical and virtual datasets interchangeably. A virtual dataset is nothing but a view that allows transformations on physical datasets to be expressed and executed; in other words, it is the ability to express datasets as code. Depending on a dataset’s usage, the infrastructure can move it between physical and virtual representations very cost-effectively, improving infrastructure utilization: heavily-used datasets can be materialized, while less heavily-used datasets stay virtual until they cross a usage threshold. As we will see in the following sections, there are several benefits to virtual datasets from a data agility perspective as well.
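To make the interchangeability concrete, here is a minimal Python sketch (all names hypothetical, not Dali’s actual API) of a catalog that resolves a dataset name to either materialized records or a view function, so that readers are unaffected when a virtual dataset is later materialized:

```python
# Hypothetical sketch: a catalog where physical and virtual datasets are
# read the same way, and a virtual dataset can be promoted to physical.

class Catalog:
    def __init__(self):
        self._physical = {}  # name -> list of records (materialized data)
        self._virtual = {}   # name -> function producing records (a "view")

    def register_physical(self, name, records):
        self._physical[name] = records

    def register_virtual(self, name, view_fn):
        self._virtual[name] = view_fn

    def read(self, name):
        # Consumers never need to know which kind of dataset they hit.
        if name in self._physical:
            return list(self._physical[name])
        return list(self._virtual[name](self))

    def materialize(self, name):
        # A heavily-used virtual dataset can be promoted to a physical one
        # with no change visible to readers.
        self._physical[name] = list(self._virtual.pop(name)(self))

catalog = Catalog()
catalog.register_physical("Profile", [{"id": 1}, {"id": 2}])
catalog.register_virtual("ProfileIds",
                         lambda c: [r["id"] for r in c.read("Profile")])

before = catalog.read("ProfileIds")   # computed on the fly
catalog.materialize("ProfileIds")
after = catalog.read("ProfileIds")    # now served from storage
```

Here `read("ProfileIds")` returns `[1, 2]` both before and after materialization; only the cost profile changes, not the consumer code.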
At its core, Dali consists of the following components: a catalog to define and evolve physical and virtual datasets, a record-oriented dataset layer that enables applications to read datasets in any application environment, and a collection of development tools that let data be treated like code by integrating with LinkedIn’s existing software management stack.
What are Dali Views (AKA virtual datasets)?
Why Dali Views?
Let us home in further on the need for virtual datasets with two simple examples. At LinkedIn, Profile is a popular dataset that is deeply nested, making it inconvenient to use with data processing layers like Pig and Hive (which have historically not worked well with nested data). Without a systematic solution, copies of vastly similar flattening code proliferate, unmanaged, wherever the Profile dataset is accessed, and a schema or semantic change to Profile can trigger unpredictable behavior across the entire data ecosystem. To solve this problem, we needed to provide both flattened and nested views of the data. We could go about this in two ways: (a) materialize two copies of the data, or (b) provide a virtual dataset that performs the necessary flattening function and is accessible to all data processing frameworks at run time. The latter has the benefit of avoiding the storage costs of maintaining a second, flattened copy of the nested data.
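The flattening such a virtual dataset performs can be sketched in a few lines of Python. This is an illustration only (the field names and dotted-name convention are assumptions, not Profile’s actual schema): nested fields are lifted to dotted top-level fields at read time, so no second physical copy is needed.

```python
# Illustrative sketch: flatten a nested record into dotted top-level
# fields, the kind of transformation a flattening view would express.

def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted top-level fields."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# Hypothetical nested Profile record (not the real schema).
nested_profile = {
    "memberId": 42,
    "name": {"first": "Ada", "last": "Lovelace"},
    "position": {"company": {"name": "LinkedIn"}, "title": "Engineer"},
}

flat_profile = flatten(nested_profile)
# Fields like "position.company.name" are now top-level, which is easy
# for row-oriented processing layers to consume.
```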
Here is another motivating example that we have run into repeatedly at LinkedIn. Every new upgrade of our mobile or desktop application invariably introduces new tracking events. We used to track all page views (irrespective of source) in the same Kafka topic/HDFS dataset. As a result, PageViewEvents had a complicated schema, owing to the fact that it had to account for all kinds of page views. We decided to change this to having one topic per kind of page view. For example, when we launched a new version of our mobile app, we added a new dataset called VoyagerPageView. Consumers who need to read all page views can now use a Dali view that unions both PageView and VoyagerPageView. In the future, as new PageView event types are added, we simply push a new version of the view that includes them; consumers continue to consume from the view, with no changes on their end, while these changes occur behind the scenes. Examples like these illustrate that dataset management requires principled solutions that go beyond the data abstraction layers that have been built in the past.
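The union pattern can be sketched as follows (a hedged illustration in Python, not Dali’s view language; the source names and registry are assumptions). Adding a new per-app topic means editing only the registry behind the view; consumers of the view are untouched.

```python
from itertools import chain

# Hypothetical registry of per-app page-view sources. Adding a new source
# (e.g., a future app's topic) is a change to the view alone.
PAGE_VIEW_SOURCES = {
    "PageViewEvent":   [{"page": "/feed", "app": "desktop"}],
    "VoyagerPageView": [{"page": "/feed", "app": "voyager"}],
}

def all_page_views():
    """The 'view': a union over every registered page-view source."""
    return chain.from_iterable(PAGE_VIEW_SOURCES.values())

# Consumers read the union without knowing how many sources back it.
events = list(all_page_views())
```

In a real deployment this role is played by a versioned view definition rather than an in-process registry, but the insulation property is the same.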
Dali Views: Under the hood