We designed and developed an independent framework to propagate Oracle schema changes (DDLs) to Kafka. It allows near-synchronous propagation of schema changes through integration with the Oracle release framework, and it also exposes APIs that downstream consumers (e.g., Brooklin) can invoke on demand.
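To make the idea concrete, here is a minimal sketch of publishing a schema-change event to Kafka, assuming the kafka-python client and a reachable broker; the topic name, event fields, and the publish_ddl_event helper are hypothetical illustrations, not our actual implementation.

# Hypothetical sketch: publish a schema-change (DDL) event to Kafka.
# Assumes the kafka-python client and a broker reachable at the placeholder address.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],  # placeholder broker list
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_ddl_event(schema, table, ddl_text):
    """Send one schema-change event so downstream consumers can react to it."""
    event = {
        "schema": schema,
        "table": table,
        "ddl": ddl_text,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("oracle-schema-changes", value=event)  # hypothetical topic name
    producer.flush()

publish_ddl_event("MEMBERS", "PROFILE", "ALTER TABLE PROFILE ADD (headline VARCHAR2(255))")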

Reliability: Operability, scalability, metrics

The Data Capture “Version 2.0” building blocks discussed in detail above provide several key capabilities that keep our data capture system running smoothly and make it straightforward to operate.

Monitoring and auto-remediation: We can now capture stats/metrics and alert on any activity that exceeds expected thresholds, using our in-house tools. For instance, we capture transactions per second for each process and component wherever applicable. We also capture, at minute-level granularity, the lag caused by process failures or extra load, and generate alerts when latency exceeds the configured threshold. Stats and metrics collection plugs into in-house tools like Auto-Alerts, InGraphs, and Iris for efficient monitoring and alerting.
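As a rough illustration of this threshold-based lag alerting (not the actual Auto-Alerts/InGraphs/Iris integration), a per-process check might look like the sketch below; the threshold value, process name, and the metric/alert hooks are assumptions.

# Minimal sketch of threshold-based lag alerting.
# The metric and alert functions are placeholders for the in-house tooling.
from datetime import datetime, timedelta, timezone

LAG_THRESHOLD_SECONDS = 120  # assumed per-process latency threshold

def record_metric(name, value):
    print(f"metric {name}={value:.1f}")   # placeholder for the real metrics pipeline

def send_alert(message):
    print(f"ALERT: {message}")            # placeholder for the real alerting hook

def check_replication_lag(process_name, last_applied_commit_time):
    """Emit an alert if a capture/apply process falls behind the configured threshold."""
    lag = (datetime.now(timezone.utc) - last_applied_commit_time).total_seconds()
    record_metric(f"{process_name}.lag_seconds", lag)  # e.g., collected at minute granularity
    if lag > LAG_THRESHOLD_SECONDS:
        send_alert(f"{process_name} lag {lag:.0f}s exceeds {LAG_THRESHOLD_SECONDS}s")

# Example: a process that last applied a commit five minutes ago triggers an alert.
check_replication_lag("extract_members", datetime.now(timezone.utc) - timedelta(minutes=5))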

Proactive monitoring/detection of problematic load patterns: Collected metrics such as insert/update/delete counts and operations per second give us insight into the change patterns for any table, and alerting can be configured to fire when those patterns deviate from the norm, so that we can take preemptive action.
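One simple way to picture such a deviation check is a rolling baseline of per-minute operation counts for each table; the sample numbers, the 3-sigma rule, and the is_deviation helper below are illustrative assumptions, not our production logic.

# Rough sketch of flagging an unusual per-table operation rate against a rolling baseline.
from statistics import mean, stdev

def is_deviation(history, current, n_sigma=3.0):
    """Flag the current per-minute count if it is more than n_sigma away from the baseline."""
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) > n_sigma * max(sigma, 1.0)

# Made-up baseline: updates per minute observed for one table over recent minutes.
baseline_updates = [220, 240, 210, 235, 225, 230, 215, 228]

if is_deviation(baseline_updates, current=1900):
    print("ALERT: update rate for table PROFILE deviates from its usual pattern")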

Monitoring and detection of data consistency issues: Our data capture framework is now also supported by an audit process that monitors, detects, and alerts on any data inconsistency issues. Fixes applied to resolve an inconsistency flow automatically to downstream systems, keeping data consistent everywhere.
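Conceptually, the audit boils down to comparing per-table counts (or checksums) between the source and its downstream copies; the audit_table helper and the sample counts below are hypothetical stand-ins for that comparison.

# Illustrative-only audit check: compare per-table row counts between the source
# database and a downstream copy, and report any divergence.
def audit_table(table, source_count, downstream_count):
    """Return an inconsistency report entry when the counts diverge."""
    if source_count != downstream_count:
        return {"table": table, "source": source_count, "downstream": downstream_count}
    return None

issues = [r for r in (
    audit_table("PROFILE", 1_000_000, 1_000_000),      # consistent
    audit_table("CONNECTIONS", 5_400_210, 5_399_980),  # downstream is behind
) if r]

for issue in issues:
    print(f"ALERT: {issue['table']} differs by {issue['source'] - issue['downstream']} rows")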

Auto-recovery: Every component in this framework is designed and configured with auto-recovery features, which makes the system resilient and minimizes the need for human intervention.
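The auto-recovery behavior can be pictured as a supervisor loop that restarts a failed component with backoff before escalating to an operator; the sketch below is a generic illustration under that assumption, not the framework's actual recovery code.

# Simplified sketch of auto-recovery: restart a failed component with backoff
# instead of paging a human, and escalate only after repeated failures.
import time

def run_with_auto_recovery(component, max_restarts=5, backoff_seconds=2):
    """Keep a component running, restarting it (with backoff) when it fails."""
    for attempt in range(max_restarts):
        try:
            component()  # blocks while the component is healthy
            return       # clean exit, nothing to recover
        except Exception as exc:
            wait = backoff_seconds * (attempt + 1)
            print(f"component failed ({exc}); restarting in {wait}s")
            time.sleep(wait)
    print("ALERT: auto-recovery exhausted, escalate to an operator")

def flaky_component():
    raise RuntimeError("simulated crash")   # stand-in for a real capture process

run_with_auto_recovery(flaky_component, max_restarts=2, backoff_seconds=1)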

Scalability through parallel processing: OGG Big Data Adapters support parallelism at table-level granularity through key-based hashing. This helps scale tables with very high operation rates and achieve real-time data replication between processes.
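Key-based hashing can be illustrated independently of the actual OGG Big Data Adapter configuration: rows for a hot table are routed to one of N parallel workers by hashing the key, which preserves per-key ordering while scaling throughput. The worker count and the worker_for helper below are assumptions made for the example.

# Conceptual sketch of key-based hashing for table-level parallelism
# (not actual OGG Big Data Adapter configuration).
import hashlib

NUM_WORKERS = 4  # assumed degree of parallelism for one hot table

def worker_for(table, primary_key):
    """Pick a worker/partition deterministically from the table name and row key."""
    digest = hashlib.md5(f"{table}:{primary_key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS

for member_id in (101, 102, 103, 104):
    print(f"PROFILE row {member_id} -> worker {worker_for('PROFILE', member_id)}")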

Conclusion

Incremental Data Capture Version 1.0 was designed 7 years ago at LinkedIn by the DBA team, and while it served us well for a long time, it is ultimately better suited to a single data center environment. For companies whose data processing systems are not time-bound and that produce only a few gigabytes of transactional data per day, that version would work well and be a cost-effective solution. However, if you have terabytes of data being produced every day and your downstream data processing systems depend on near real-time data and need reliability and scalability as data grows, the Data Capture Version 2.0 system is a better fit. LinkedIn has adopted the Version 2.0 system for its critical databases and will soon be rolling it out everywhere.

Acknowledgements

We would like to take a moment to thank the LinkedIn Database-SRE team for their continuous support; kudos and special shout-outs to the members of the core team (Agila Devi and Srivathsan Vijaya Raghavan) for designing, developing, and implementing Data Capture Version 2.0 at LinkedIn.

Finally, thanks to LinkedIn’s Data leadership for having faith in us and for their continuous guidance and support.
