Databricks: A Primer




Who is Databricks?

Databricks' vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team that created Apache Spark, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project, providing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark and has the largest number of customers deploying Spark to date.

Databricks provides a just-in-time data platform to simplify data integration, real-time experimentation, and robust deployment of production applications.

For more information on Spark, download the Spark Primer at databricks.com/resources/briefs.

[Figure: Data, Databricks, Value]

"We've had great success using Apache Spark on Databricks to compute the billions of data points behind our predictive models guiding consumers to the right health insurance plan. The simplicity and interactivity of Databricks makes it easy for developers and data scientists new to Spark to get up to speed very quickly, and not have to worry about the minutiae of managing clusters."
Ani Vemprala, CTO & Co-founder, Picwell

What is Databricks?

Databricks is a hosted end-to-end data platform powered by Apache Spark. It enables organizations to seamlessly transition from data ingest through exploration and production. Four foundational components make up Databricks:

Managed Spark Clusters
Exploration and Visualization
Production Pipelines
Third-Party Apps

[Figure: The Foundational Components of Databricks]

Foundational Components

Managed Spark Clusters

Fully managed Spark clusters in the cloud that help enterprises focus on their data and not operations.

Easily Provision Clusters: Launch, dynamically scale up or down, and terminate clusters with just a few clicks. We automate management so you can focus on your data.
Harness the Power of Spark: Configured and tuned by the people who built it.
Import Data Seamlessly: Import data from S3, your local machine, or a wide variety of data sources, including HDFS, RDBMS, Cassandra, and MongoDB.

Exploration and Visualization

An interactive workspace for exploration and visualization so users can learn, work, and collaborate in a single, easy-to-use environment.

Explore: Use interactive notebooks to write Spark commands in R, Python, Scala, or SQL and reuse your favorite Python, Java, or Scala libraries.
Collaborate: Work on the same notebook in real time or send it around for offline collaboration.
Visualize: Leverage a wide assortment of point-and-click visualizations, or use powerful scriptable options like matplotlib, ggplot, and D3.
Publish: Build rich dashboards that present key findings to share with your colleagues and customers.

Production Pipelines

A production pipeline scheduler that helps users get from prototype to production without re-engineering.

Schedule Production Workflows: Schedule any existing notebook or locally developed Spark code to run periodically using existing or newly provisioned clusters.
Implement Complete Pipelines: Build production pipelines that span data import and ETL, complex conditional processing, and data export.
Monitor Progress and Results: Set up custom alerts for job completion and failure, and easily view historical and in-progress results.

Third-Party Apps

A platform for powering Spark-based applications that helps users leverage a growing ecosystem of applications and reuse their favorite tools.
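The shape of a complete pipeline as described under Production Pipelines (import and ETL, conditional processing, export, with alerts on completion and failure) can be sketched in a few lines. This is a plain-Python illustration, not Databricks or Spark code: `run_pipeline`, the stage functions, and the `alert` callback are hypothetical names chosen for the example, and in-memory lists stand in for real data sources.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stage type: a stage takes a list of records and returns a new list.
Stage = Callable[[list], list]


def run_pipeline(
    records: list,
    stages: Iterable[Tuple[str, Stage]],
    alert: Callable[[str], None],
) -> list:
    """Run the named stages in order, alerting on failure or completion."""
    data = records
    for name, stage in stages:
        try:
            data = stage(data)
        except Exception as exc:
            # Failure alert: report which stage broke, then re-raise.
            alert(f"FAILED at {name}: {exc}")
            raise
    alert(f"COMPLETED with {len(data)} records")
    return data


def etl(rows):
    # Import and ETL: parse raw "user,amount" strings into (user, amount) tuples.
    return [(user, float(amount)) for user, amount in (r.split(",") for r in rows)]


def keep_large(rows):
    # Conditional processing: keep only amounts of 10.0 or more.
    return [row for row in rows if row[1] >= 10.0]


def export(rows):
    # Export: format each record as a "user:amount" string.
    return [f"{user}:{amount:.2f}" for user, amount in rows]


alerts: List[str] = []
result = run_pipeline(
    ["ann,12.5", "bob,3", "cy,40"],
    [("etl", etl), ("filter", keep_large), ("export", export)],
    alerts.append,
)
print(result)
print(alerts)
```

Passing the alert as a callback keeps the stages free of notification logic, so the same stage functions could be reused whether the alert is a log line, an email, or something else.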

What are some of the technical and operational bottlenecks faced by data scientists, data engineers and analysts with their data pipeline?

Over the last few years, Spark has made great strides in helping enterprises overcome some of their big data processing challenges. However, many enterprises are still struggling to extract value from their data pipelines. Capturing value from big data requires capabilities beyond data processing, and enterprises are finding that there are many challenges on their journey to operationalize their data pipeline:

1. Infrastructure issues that require data teams to pre-provision, set up, and manage on-premise clusters, which is both costly and time consuming.
2. Once the infrastructure challenges have been addressed, data scientists and engineers still have to contend with siloed workspaces, where working with data, code, and visualization requires switching between different software, and sharing work amongst peers means manually copying data.
3. Sharing insights with non-engineering stakeholders and handing work off to the production team.

Problem: the journey is complex and costly.

Get a cluster up and running: expensive to build and hard to manage
Import and explore data: disparate and difficult tools
Build a production pipeline: months of re-engineering to deploy

[Figure: Your Data Pipeline: the journey is complex and costly]

In all this, enterprises are required to cobble various components together, which is not just highly inefficient but also makes it difficult to track data lineage and usage patterns across the components of the stack. With this model, enterprises are not able to implement complete pipelines, which severely inhibits innovation and value creation.

Why Databricks?

Given the challenges faced by data professionals and enterprises in managing their data pipeline, we saw the need for a single platform that enables customers to easily deploy Spark as a service while providing a rich set of tools out of the box.

Key attributes:

Managed Spark Clusters in the Cloud
Notebook Environment
Spark-Powered Dashboards
Production Pipeline Scheduler
3rd Party Applications

Our key differentiators are:

Unified Platform

With Databricks, enterprises are able to go from data ingest through exploration and production on a single data platform. This significantly reduces the integration pains they currently face when cobbling together multiple tools and systems, and helps streamline entire pipeline deployments. With a unified platform, data professionals are able to reuse their code base by using the same notebooks for exploration and production, resulting in tremendous time savings.

Real-Time

Databricks provides real-time capabilities in several dimensions.

1. The notebook feature allows users to perform interactive queries and visualize results in real time. This can dramatically increase their productivity during exploration and help them gain additional insights.
2. The interactive workspace feature enables real-time collaboration amongst multiple users. Team members can seamlessly share code, plots, and results, leveraging each other's work far more effectively.
3. The streaming feature provides low-latency and fault-tolerant processing of continuous data streams. This enables organizations to rapidly take action in response to live data.

Zero Management

Databricks provides powerful cluster management capabilities that allow users to create new clusters in seconds, dynamically scale them up and down, and share them across users. This obviates the need to set up and maintain the clusters. As such, organizations do not need dedicated DevOps teams: their data teams can create self-service Spark clusters and import their data seamlessly. This allows them to focus on their core mission of understanding and gaining insights from their data, not on managing day-to-day operations.

Open Platform

Databricks is a platform for powering Spark-based applications and comes with a third-party API in addition to JDBC connectivity. Because each cluster comes with a JDBC server, users can plug their favorite BI tools directly into their Databricks clusters. This enables users to reuse their favorite tools, leverage our growing application ecosystem, and maximize their investments and knowledge base, leading to improved time to value and productivity.

How are enterprises typically using Databricks?

Enterprises deploy Databricks to achieve a wide variety of objectives, including:

Prepare Data

Import data using APIs or connectors
Clean malformed data
Aggregate data to create a data warehouse

Databricks is powered by Spark, giving it the ability to ingest data from a diverse set of sources and perform simple yet scalable transformations of data. The real-time interactive querying environment and data visualization capability of Databricks make this typically slow process much faster.

Build Data Products

Rapid prototyping
Implement advanced analytics algorithms
Create and monitor robust production pipelines

Databricks allows teams of developers and data scientists to efficiently experiment with new product ideas through the interactive workspace. Advanced analytics libraries such as MLlib and GraphX also provide an easy way for teams to deploy sophisticated algorithms in Spark. Once a prototype has been built, it can be seamlessly deployed in production at scale using the Jobs feature.

Perform Analytics

Explore large data sets in real time
Find hidden patterns with advanced analytics algorithms
Publish customized dashboards

With Databricks, developers and data scientists can work in SQL, Python, Scala, Java, and R with a wide range of advanced analytics algorithms at their disposal. Teams can be instantly productive with real-time analysis of large-scale datasets on topics ranging from user behavior to customer funnel. With a few clicks, Databricks can publish these results and complex visualizations as notebooks, through integration with third-party BI tools, or as customized dashboards.
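To make the Prepare Data use case concrete, here is a minimal sketch of the "clean malformed data" and "aggregate" steps. It uses only the Python standard library rather than Spark, so it shows the logic at small scale and is not how Databricks implements it. The record layout (`user`, `amount`) and the function names `parse_events` and `total_by_user` are assumptions made for the example.

```python
import json
from collections import defaultdict


def parse_events(lines):
    """Return (well-formed events, number of malformed lines skipped)."""
    good, bad = [], 0
    for line in lines:
        try:
            event = json.loads(line)
            good.append((str(event["user"]), float(event["amount"])))
        except (ValueError, KeyError, TypeError):
            # Invalid JSON, a missing field, or a non-numeric amount.
            bad += 1
    return good, bad


def total_by_user(events):
    """Aggregate cleaned events into a per-user total."""
    totals = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


raw = [
    '{"user": "ann", "amount": 12.5}',
    '{"user": "bob", "amount": "3"}',
    '{"user": "ann"}',                 # missing "amount" field
    'not json at all',                 # malformed line
    '{"user": "ann", "amount": 7.5}',
]

events, skipped = parse_events(raw)
totals = total_by_user(events)
print(totals, skipped)
```

Counting skipped lines instead of silently dropping them gives a cheap data-quality signal that could be tracked from one run to the next.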

How will Databricks benefit data professionals and enterprises?

Databricks helps data professionals and enterprises focus on finding answers in their data, building data products, and ultimately capturing the value promised by big data. The platform delivers the following key benefits to data professionals and enterprises:

Higher productivity: maintenance-free infrastructure, real-time processing, easy-to-use tools
Faster deployment of data pipelines: zero-management Spark clusters, instant transition from prototype to production
Data democratization within enterprises: one shared repository, seamless collaboration, easy-to-build sophisticated dashboards and notebooks

Try Databricks today with a trial account: databricks.com/try-databricks

"The fact that explorations by our data science team now take less than an hour, rather than days, has fundamentally changed how we ask questions and visualize changes to the index."
Darian Shirazi, CEO, Radius Intelligence

Databricks 2016. All rights reserved. Apache Spark and the Apache Spark Logo are trademarks of the Apache Software Foundation.