Databricks. A Primer




Who is Databricks?

Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically simplify big data processing and free users to focus on turning their data into value. We do this through our product, Databricks, which is powered by Spark. For more information on Spark, download the Spark Primer.

[Figure: Data, Databricks, Value]

"We've had great success using Apache Spark on Databricks to compute the billions of data points behind our predictive models guiding consumers to the right health insurance plan. The simplicity and interactivity of Databricks makes it easy for developers and data scientists new to Spark to get up to speed very quickly, and not have to worry about the minutiae of managing clusters."
Ani Vemprala, CTO & Co-founder, Picwell

What is Databricks?

Databricks is a hosted end-to-end data platform powered by Spark. It enables organizations to seamlessly transition from data ingest through exploration and production. There are four foundational components that comprise Databricks:

- Managed Spark Clusters
- Exploration and Visualization
- Production Pipelines
- Third-Party Apps

[Figure: The Foundational Components of Databricks]

Managed Spark Clusters

Fully managed Spark clusters in the cloud that help enterprises focus on their data and not operations.

- Easily Provision Clusters: Launch, dynamically scale up or down, and terminate clusters with just a few clicks. We automate management so you can focus on your data.
- Harness the Power of Spark: Configured and tuned by the people who built it.
- Import Data Seamlessly: Import data from S3, your local machine, or a wide variety of data sources, including HDFS, RDBMS, Cassandra, and MongoDB.

Exploration and Visualization

An interactive workspace for exploration and visualization so users can learn, work, and collaborate in a single, easy-to-use environment.

- Explore: Use interactive notebooks to write Spark commands in R, Python, Scala, or SQL and reuse your favorite Python, Java, or Scala libraries.
- Collaborate: Work on the same notebook in real time or send it around for offline collaboration.
- Visualize: Leverage a wide assortment of point-and-click visualizations, or use powerful scriptable options like matplotlib, ggplot, and D3.
- Publish: Build rich dashboards that present key findings to share with your colleagues and customers.

Production Pipelines

A production pipeline scheduler that helps users get from prototype to production without re-engineering.

- Schedule Production Workflows: Schedule any existing notebook or locally developed Spark code to run periodically using existing or newly provisioned clusters.
- Implement Complete Pipelines: Build production pipelines that span data import and ETL, complex conditional processing, and data export.
- Monitor Progress and Results: Set up custom alerts for job completion and failure, and easily view historical and in-progress results.

Third-Party Apps

A platform for powering Spark-based applications that helps users leverage a growing ecosystem of applications and reuse their favorite tools.
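To make the shape of such a pipeline concrete, here is a minimal sketch in plain Python. It is not the Databricks scheduler or any Databricks API; the names `run_pipeline`, `PipelineResult`, and `min_rows` are hypothetical and exist only for illustration. It shows the stages named above (import, ETL, a conditional step, export) and records an alert message on completion or failure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineResult:
    """Outcome of one pipeline run (hypothetical structure for illustration)."""
    succeeded: bool
    output: list
    alerts: List[str] = field(default_factory=list)


def run_pipeline(
    import_step: Callable[[], list],
    etl_step: Callable[[list], list],
    export_step: Callable[[list], list],
    min_rows: int = 1,
) -> PipelineResult:
    """Run import -> ETL -> (conditional) export and collect alert messages."""
    alerts: List[str] = []
    try:
        raw = import_step()
        cleaned = etl_step(raw)
        # Conditional processing: skip the export when too little data survives ETL.
        if len(cleaned) < min_rows:
            alerts.append("export skipped: too few rows")
            return PipelineResult(True, [], alerts)
        exported = export_step(cleaned)
        alerts.append("job completed")
        return PipelineResult(True, exported, alerts)
    except Exception as exc:
        # Any failing stage produces a failure alert instead of crashing the caller.
        alerts.append(f"job failed: {exc}")
        return PipelineResult(False, [], alerts)
```

A caller would pass three functions, one per stage, and inspect `alerts` afterward. In a hosted scheduler the alert messages would be delivered to people rather than returned in a list.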

What are some of the technical and operational bottlenecks faced by data scientists, data engineers and analysts with their data pipeline?

Over the last few years, Spark has made great strides in helping enterprises overcome some of their big data processing challenges. However, many enterprises are still struggling to extract value from their data pipelines. Capturing value from big data requires capabilities beyond data processing, and enterprises are finding that there are many challenges in their journey to operationalize their data pipeline:

1. Infrastructure issues require data teams to pre-provision, set up, and manage on-premises clusters, which is both costly and time-consuming.
2. Once the infrastructure challenges have been addressed, data scientists and engineers still have to contend with siloed workspaces, where working with data, code, and visualization requires switching between different software, and sharing work among peers means manually copying data.
3. Sharing insights with non-engineering stakeholders and handing work off to the production team add further friction.

Problem: the journey is complex and costly.

- Get a cluster up and running: Expensive to build and hard to manage
- Import and explore data: Disparate and difficult tools
- Build a production pipeline: Months of re-engineering to deploy

[Figure: Your Data Pipeline: the journey is complex and costly]

In all this, enterprises are required to cobble various components together, which is not just highly inefficient but also makes it difficult to track data lineage and usage patterns across the various components within the stack. With this current model, enterprises are not able to implement complete pipelines, which severely inhibits innovation and value creation.

Why Databricks?

Given the challenges faced by data professionals and enterprises in managing their data pipeline, we saw the need for a single platform that enables customers to easily deploy Spark as a service while providing a rich set of tools out of the box.

Key attributes:

- Managed Spark Clusters in the Cloud
- Notebook Environment
- Production Pipeline Scheduler
- Third-Party Applications

Our key differentiators are:

Unified Platform

With Databricks, enterprises are able to go from data ingest through exploration and production on a single data platform. This significantly reduces the integration pains they currently face when cobbling together multiple tools and systems, and helps streamline entire pipeline deployments. With a unified platform, data professionals are able to reuse their code base by utilizing the same notebooks for exploration and production, resulting in tremendous time savings.

Zero Management

Databricks provides powerful cluster management capabilities which allow users to create new clusters in seconds, dynamically scale them up and down, and share them across users. This obviates the need to set up and maintain the clusters. As such, organizations do not need dedicated DevOps teams: their data teams can now create self-service Spark clusters and import their data seamlessly. This allows them to focus on their core mission of understanding and gaining insights from their data, not managing day-to-day operations.

Real-Time

Databricks provides real-time capabilities in several dimensions.

1. The notebook feature allows users to perform interactive queries and visualize results in real time. This can dramatically increase their productivity when performing explorations and help them gain additional insights.
2. The interactive workspace feature enables real-time collaboration among multiple users. Team members can seamlessly share code, plots, and results, leveraging each other's work far more effectively.
3. The streaming feature provides low-latency and fault-tolerant processing of continuous data streams. This enables organizations to rapidly take action in response to live data.

Open Platform

Databricks is a platform for powering Spark-based applications and comes with a third-party API in addition to JDBC connectivity. Because each cluster comes with a JDBC server, users can plug their favorite BI tools directly into their Databricks clusters. This enables users to reuse their favorite tools, leverage our growing application ecosystem, and maximize their investments and knowledge base, leading to improved time to value and productivity.

How are enterprises typically using Databricks?

Enterprises deploy Databricks to achieve a wide variety of objectives, including:

Prepare Data

- Import data using APIs or connectors
- Clean malformed data
- Aggregate data to create a data warehouse

Databricks is powered by Spark, giving it the ability to ingest data from a diverse set of sources and perform simple yet scalable transformations of data. The real-time interactive querying environment and data visualization capability of Databricks make this typically slow process much faster.

Build Data Products

- Rapid prototyping
- Implement advanced analytics algorithms
- Create and monitor robust production pipelines

Databricks allows teams of developers and data scientists to efficiently experiment with new product ideas through the interactive workspace. Advanced analytics libraries such as MLlib and GraphX also provide an easy way for teams to deploy sophisticated algorithms in Spark. Once a prototype has been built, one can seamlessly deploy it in production at scale using the Jobs feature.

Perform Analytics

- Explore large data sets in real time
- Find hidden patterns with advanced analytics algorithms
- Publish customized dashboards

With Databricks, developers and data scientists can work in SQL, Python, Scala, Java, and R with a wide range of advanced analytics algorithms at their disposal. Teams can be instantly productive with real-time analysis of large-scale datasets on topics ranging from user behavior to customer funnel. With a few clicks, users can publish these results and complex visualizations as notebooks, through integration with third-party BI tools, or as customized dashboards.
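The "Prepare Data" steps can be illustrated with a small sketch in plain Python. This is not Spark or Databricks code, and the two-column `user,amount` input format and the function names `parse_events` and `total_by_user` are assumptions made for the example. On Databricks the same steps would be expressed as Spark transformations over much larger data; the sketch only shows the logic of dropping malformed records and then aggregating.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_events(text: str) -> List[Tuple[str, float]]:
    """Parse 'user,amount' CSV text, silently dropping malformed rows."""
    rows: List[Tuple[str, float]] = []
    for record in csv.reader(io.StringIO(text)):
        # A well-formed record has exactly two fields.
        if len(record) != 2:
            continue
        user, amount = record[0].strip(), record[1].strip()
        if not user:
            continue
        try:
            value = float(amount)
        except ValueError:
            # Non-numeric amount: treat the row as malformed.
            continue
        rows.append((user, value))
    return rows


def total_by_user(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate cleaned rows into one total per user."""
    totals: Dict[str, float] = defaultdict(float)
    for user, value in rows:
        totals[user] += value
    return dict(totals)
```

Keeping the cleaning and aggregation steps as separate functions mirrors the import, clean, aggregate sequence listed above and makes each step easy to inspect interactively.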

How will Databricks benefit data professionals and enterprises?

Databricks helps data professionals and enterprises focus on finding answers in their data, build data products, and ultimately capture the value promised by big data.

The platform delivers the following key benefits to data professionals and enterprises:

Higher productivity
- Maintenance-free infrastructure
- Real-time processing
- Easy-to-use tools

Faster deployment of data pipelines
- Zero-management Spark clusters
- Instant transition from prototype to production

Data democratization within enterprises
- One shared repository
- Seamless collaboration
- Easy to build sophisticated dashboards and notebooks

Evaluate Databricks with a trial account now: databricks.com/registration

"The fact that explorations by our data science team now take less than an hour, rather than days, has fundamentally changed how we ask questions and visualize changes to the index."
Darian Shirazi, CEO, Radius Intelligence