Customer Case Study
Benefits
- Faster prototyping of new applications
- Easier debugging of complex pipelines
- Improved overall engineering team productivity

Summary
- The company offers a robust advertising platform; discovering hidden patterns in its data is critical to measuring the effectiveness of its products and to improving the overall product suite.
- An initial attempt to establish a self-hosted Hadoop cluster, with Hive as the ad hoc query tool, required two full-time engineers to manage the infrastructure and did not yield an effective interactive query platform.
- Databricks offered the company significant benefits, including faster prototyping of new applications, easier debugging of complex pipelines, and improved engineering productivity.
Business Background
The company builds software for delivering ads into the natural flow of content sites and apps (also known as native advertising). Because it serves ads on some of the most popular digital properties, such as Forbes and People, the need for a high-performance platform that processes data at big data scale permeates every aspect of its business.

A core engineering function at the company is revenue management and analytics. The engineering team is responsible for building and maintaining a complex series of algorithms to analyze the terabyte-scale data generated by the ad serving platform. This data, known as the clickstream, contains crucial information about interactions between viewers and the content served by the platform. Discovering hidden patterns in the clickstream is critical to measuring the effectiveness of the platform and to improving the overall product suite.

The clickstream is also valuable to the Support and Product Management teams, as it helps them define and improve the features and capabilities of the product. These teams use information from the clickstream to make data-driven decisions every day. For example, a product manager might aggregate device data in a highly customized way to help identify new market opportunities. The Support team, on the other hand, might mine the clickstream to gain deeper insight into the behavior of publisher integrations.
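As a rough illustration of the kind of device aggregation a product manager might run over the clickstream, the sketch below counts impressions and clicks per device type in plain Python. The record layout (the `device` and `event` fields) and the sample values are hypothetical; the case study does not describe the actual clickstream schema or the queries the team ran.

```python
from collections import defaultdict

# Hypothetical clickstream records; the real schema is not given in the case study.
SAMPLE_EVENTS = [
    {"device": "mobile", "event": "impression"},
    {"device": "mobile", "event": "click"},
    {"device": "desktop", "event": "impression"},
    {"device": "mobile", "event": "impression"},
    {"device": "tablet", "event": "impression"},
    {"device": "desktop", "event": "click"},
]


def aggregate_by_device(events):
    """Return {device: {"impression": n, "click": n, "ctr": clicks / impressions}}."""
    counts = defaultdict(lambda: {"impression": 0, "click": 0})
    for record in events:
        # Ignore event types other than the two being counted.
        if record["event"] in ("impression", "click"):
            counts[record["device"]][record["event"]] += 1
    summary = {}
    for device, c in counts.items():
        # Guard against devices that logged clicks but no impressions.
        ctr = c["click"] / c["impression"] if c["impression"] else 0.0
        summary[device] = {**c, "ctr": ctr}
    return summary


if __name__ == "__main__":
    for device, stats in sorted(aggregate_by_device(SAMPLE_EVENTS).items()):
        print(device, stats)
```

At terabyte scale the same grouping would be expressed as a distributed query rather than an in-memory loop; the sketch only shows the shape of the aggregation.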
Challenges
The company initially turned to a self-hosted Hadoop cluster to meet its big data processing needs. To keep the Hadoop clusters running smoothly, the team had to dedicate two full-time engineers to infrastructure maintenance. As a consequence, two irreplaceable members of the engineering team were diverted from its core mission of building and supporting data-driven products.

As part of its Hadoop stack, the company used Hive as its query engine. Hive's slow performance meant that many queries took a long time to run, creating both contention and performance bottlenecks throughout the data pipeline. Instead of being able to explore data freely, engineers often had to wait for extended periods for Hive queries to complete, or had to troubleshoot queries that never completed.

These challenges were severely impeding the team's productivity. The team needed a faster and more reliable query engine that would minimize waiting time and mitigate performance bottlenecks. The desired solution would also need to be more cost-efficient and less labor-intensive from an infrastructure management perspective.

Solution
The company deployed Databricks to provide the critical data processing components it needed to develop and test its data pipeline more effectively. These components included:
- Fully managed Spark clusters in the cloud that help the team focus on its data and not on operations.
- An interactive workspace for exploration and visualization, so teams can learn, work, and collaborate in a single, easy-to-use environment.
Databricks was deployed in the company's Virtual Private Cloud (VPC) in Amazon Web Services (AWS) within days. The cluster management interface in Databricks was simple enough for the engineering team to create, scale, and terminate Spark clusters with a few clicks, instead of dedicating full-time engineers to this task, as had been the case with the self-hosted Hadoop clusters.

Once the Spark clusters were in place, the team was able to easily bring its clickstream data from AWS S3 into the interactive workspace of Databricks. The interactive workspace provides notebooks, enabling users to work with the data in their preferred language: SQL, Python, Java, or Scala. Regardless of the language chosen, Spark's memory-optimized performance meant that results could be computed within an interactive session. Users were immediately able to visualize the results with the many rich charts and graphs built into the notebooks, with just a few clicks. This ability to visualize and interact with data in real time was a critical new capability for the company, one that had been unattainable with Hive.

"With our previous data solution, despite the complexity and having to dedicate two full-time engineers to infrastructure maintenance, we still struggled with slow performance. In contrast, Databricks offered us the critical data processing components necessary for our team to uncover data-driven insights from our valuable clickstream."
Robert Slifka, Vice President of Engineering

Benefits
The company gained significant benefits by adopting Databricks, including faster prototyping of new applications, easier debugging of complex pipelines, and improved overall engineering team productivity.

With Databricks, the team was able to prototype new applications dramatically faster, enabling its engineers to experiment easily with innovative ways to perform data processing and aggregation. For example, the company's new streaming project with Kinesis progressed rapidly because the log processing semantics and analytics had already been thoroughly validated by prototypes built in Databricks prior to integration.

Because Databricks is a controlled environment where engineers can easily run production code, debugging complex pipelines also became much easier for the team, and they were able to reproduce failure characteristics more quickly. This capability enabled the team to identify the root cause of production failures faster and reduce system downtime.

This ultimately resulted in higher productivity for the entire team. Specific examples include:
- Freeing the two dedicated full-time engineers from supporting self-hosted Hadoop clusters so they could focus on their core responsibilities of building and supporting data products
- Eliminating the time spent establishing Hive schemas, which were poorly maintained, by running Spark SQL
- Reducing the engineering time spent maintaining separate code bases by replacing custom UDFs in Hive queries with a combination of Spark SQL and libraries in Databricks notebooks
- Collaborating more effectively by sharing notebooks and building a common codebase between teams during investigations of failures
- Providing the product team with direct clickstream access, with minimal support from the engineering team, by building lightweight custom dashboards in Databricks

"Thanks to Databricks, our engineers have gone from being burdened with operations, having to face long wait times, performance bottlenecks and other hurdles that impede our progress to having the ability to easily dive right into analytics. As a result, our team is more productive and collaborative with big data than ever."
Robert Slifka, Vice President of Engineering

Evaluate Databricks with a trial account now. /registration

Customer Case Study 150417