OpenChorus: Building a Tool-Chest for Big Data Science



Milind Bhandarkar, Chief Scientist, Machine Learning Platforms, EMC Greenplum

Agenda
- Tools for Data Science
- Data Science Workflow
- Greenplum OpenChorus
- How Chorus Works

Data Science Tools: Abundance of Riches
- Proliferation of tools
- Languages & Libraries: R, Matlab, Python (SciPy, NLTK), MADlib, Mahout
- Frameworks: GraphLab, Pregel (Giraph), Mesos, CEP
- Platforms/Data Stores: MPP databases, Hadoop, NoSQL (HBase, Cassandra, MongoDB), SciDB

Choice of Tool(s)?
- Hammer?
  - Hadoop ought to be sufficient for most tasks
  - "If all you have is a hammer, throw away everything that is not a nail" (Jimmy Lin, http://arxiv.org/abs/1209.2191)
  - Operational complexity / learning curve not worth efficiency
- Tool-Chest?
  - Use the right tool for the right job
  - How to reduce complexity


Data Science Workload
- http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
- Obtain
- Scrub
- Explore
- Model
- Interpret

Obtain
- Corpus needs to be usable & sufficient
- Possibly from multiple independent sources
- Needs to be automated for streams
- Needs to have efficient ingestion for one-time data

Scrub
- Raw data is always messy
  - Missing data, inconsistent data, charsets(!)
  - NY, New York, NYC, Big Apple, etc.
- Growing dictionaries
- Join with crowdsourcing
  - Mechanical Turk, etc.
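The "growing dictionary" idea on this slide can be sketched in a few lines of Python. This is only an illustration, not part of Chorus: the alias table, `normalize_city`, and `scrub` are hypothetical names, and the only aliases used are the ones the slide lists.

```python
# Hypothetical alias dictionary built from the slide's own examples.
CITY_ALIASES = {
    "ny": "New York",
    "new york": "New York",
    "nyc": "New York",
    "big apple": "New York",
}


def normalize_city(raw, aliases=CITY_ALIASES):
    """Return (value, matched). Missing input stays None; unknown values pass through."""
    if raw is None:
        return None, False
    # Lowercase and collapse whitespace so " new  york " matches "new york".
    key = " ".join(raw.strip().lower().split())
    if key in aliases:
        return aliases[key], True
    return raw.strip(), False


def scrub(values, aliases=CITY_ALIASES):
    """Normalize a column and collect the values the dictionary does not know yet."""
    cleaned, unknown = [], set()
    for value in values:
        name, matched = normalize_city(value, aliases)
        cleaned.append(name)
        if name is not None and not matched:
            unknown.add(name)
    return cleaned, sorted(unknown)
```

The `unknown` list is where the dictionary would grow: unmatched values could be reviewed by hand or sent to a crowdsourcing service, and the confirmed aliases added back to the table.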

Explore
- Visualize, clustering, dimensionality reduction
  - Feature correlations (scatter plots)
  - Single-feature histograms
- Challenge: how not to lose these observations
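As a rough illustration of the two exploration steps named above, the following Python sketch computes a single-feature histogram and the Pearson correlation between two features using only the standard library. The function names and the fixed-width binning are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def histogram(values, bin_width):
    """Count values per fixed-width bin; each bin is keyed by its lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


def pearson(xs, ys):
    """Pearson correlation of two equal-length feature columns.

    Raises ZeroDivisionError if either column is constant (zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Storing such summaries next to the dataset they describe, rather than in a private notebook, is one way to address the slide's challenge of not losing observations.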

Model
- Find correlation of past data and outcome
  - Find good training set
  - Label the training set
  - Derive model parameters
  - Apply model, and validate
- Ensemble models: model of models
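The modeling loop on this slide (labeled training set, derived parameters, validation, then a "model of models") can be sketched as follows. The threshold classifier and majority vote are stand-ins chosen for brevity, not techniques named in the talk, and every function name here is hypothetical.

```python
def fit_stump(train, feature):
    """Derive one parameter: a threshold halfway between the two class means.

    `train` is a list of (row, label) pairs with labels 0 or 1.
    Assumes both classes appear in the training set.
    """
    pos = [row[feature] for row, label in train if label == 1]
    neg = [row[feature] for row, label in train if label == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    # Direction in which the positive class lies relative to the threshold.
    sign = 1 if mean_pos >= mean_neg else -1

    def predict(row):
        return 1 if sign * (row[feature] - threshold) > 0 else 0

    return predict


def ensemble(models):
    """Model of models: predict 1 when a strict majority of members predict 1."""
    def predict(row):
        votes = sum(model(row) for model in models)
        return 1 if 2 * votes > len(models) else 0

    return predict


def accuracy(model, data):
    """Validate: fraction of held-out (row, label) pairs predicted correctly."""
    return sum(model(row) == label for row, label in data) / len(data)
```

A typical use fits one stump per feature on the training set, combines them with `ensemble`, and reports `accuracy` on data that was held out of training.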

Interpret
- Models are built for prediction and interpretation
- Check that there are no surprises
- Reason about models
- Improve models

Data Science Data Flow
- Raw data (timed, partitioned, crowdsourced, deduped, etc.)
- Derived data (simple aggregates, other statistics)
- Models (feature weights, decision trees)
- Indexes

Data Diversity
- Natural language text, and annotations
- (Bags of words) : Concept
- Graphs (sparse matrices)
- Dense matrices
- Location (proximity)

Too Many Tools for One Data Science Project
- Analytics tools, content and file management, data exploration and sharing: data marts, Excel spreadsheets, flat files, data mining, BI/visualization, data integration, share drives, wiki, content management app
- Communications: emails, meetings, IMs
- Process documentation: some project plans, no up-to-date team collaborative documentation

High Cost of Knowledge Sharing
- Data science process breaks when organization structure changes
- Very difficult knowledge transfer
- No insurance policy for the data science intellectual assets

Delayed Time-to-Market
1. Find the data
2. Get access to data
3. Learn about the data
4. Move to sandbox
5. Analysis. Finally!
6. Operationalize the model
6-9 months

Greenplum Chorus
Greenplum's Social Platform for Collaborative Data Science
- Collaborative analytics
- Powerful extensibility
- The freedom of open source

Chorus Enables Collaborative Data Science
- Collaborate within projects, share data, content, and findings across teams
- Make projects more transparent
- Iterate faster for accelerated insights with real-time social collaboration
[Diagram labels: Data Discovery & Exploration, Self-Service Provisioning, Analysis & Modeling, Publish & Share, Collaboration]

Powerful Extensibility
- Integrated development environment for analytics
- Expand insights with simple access to third-party data
- Fusion with leading analytics and visualization tools

The Freedom of Open Source
www.openchorus.org
- Modify and extend to any environment
- Promotes an ecosystem of applications, startups, and a community of data scientists

How Chorus Works

How Chorus Works
[Diagram: a Chorus Workspace (Work Files, Data, Chorus View, Activities & Insights) connected to source data (GPDB, EDW, non-GPDB databases via GPDB External Tables) and to a Sandbox in Greenplum DB]

Data Exploration - Search and Data Discovery
- Automatic indexing of metadata, work files, comments, and insights
- Quickly find data across the enterprise regardless of location
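The slides do not say how Chorus builds its search index. As a generic illustration of indexing metadata, work files, and comments together, here is a minimal inverted index in Python. The `SearchIndex` class, the item identifiers, and the AND-only query semantics are all assumptions made for this sketch.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into word tokens; underscores keep table names whole."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class SearchIndex:
    """Toy inverted index: token -> set of item ids (tables, work files, comments)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, item_id, text):
        for token in tokenize(text):
            self._postings[token].add(item_id)

    def search(self, query):
        """Return the ids of items that contain every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result
```

Because tables, comments, and work files share one index, a single query such as `"device weblog_2012q1"` can return both a table and a comment attached to it.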

Data Exploration - Data Preview and Visualization
- Data preview for instant understanding
- Quick and easy data visualizations
  - Visualize data for faster insight into datasets
  - No need to export to third-party applications like R
  - Not a replacement for advanced visualization tools

Data Exploration - Living Data Dictionary
- Bring everything about the data to the data
  - Attach documents
  - Ask questions
  - Add comments
- Build a living data dictionary
  - Everything is current
  - No more spreadsheets

Workspace Streamlines Collaboration
- Chorus includes unlimited workspaces, each representing an individual project
- Streamlines complex user-user and user-data interactions
[Diagram: Chorus Workspace]

Multi-level Secure Collaboration
- Authentication
  - Integrates with LDAP and AD for password management
- Application access control
  - User roles: admin vs. general user
  - Workspace types: public or private
- Data access control
  - Chorus enforces database rules and permissions
[Diagram: Chorus Workspace]
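The layering described here (application-level checks first, then database permissions) can be sketched as two small predicates. This is not Chorus code: the `User` and `Workspace` classes, the rule that admins may open any workspace, and the `db_grants` mapping standing in for real database permissions are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    is_admin: bool = False


@dataclass
class Workspace:
    name: str
    public: bool
    members: set = field(default_factory=set)


def can_open_workspace(user, workspace):
    """Application access control: role and workspace type (assumed rules)."""
    return user.is_admin or workspace.public or user.name in workspace.members


def can_read_table(user, workspace, table, db_grants):
    """Data access control: the database grant must also allow the read.

    `db_grants` maps a table name to the set of user names granted access.
    """
    return can_open_workspace(user, workspace) and user.name in db_grants.get(table, set())
```

In this sketch even an admin cannot read a table without a database grant, which mirrors the slide's point that database rules and permissions are enforced independently of the application role.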

Data - Dataset Types
1. Source Dataset
  - Pointer to the source data
  - Both internal and external data
  - Supports native connectivity for both GPDB and flat files
  - Uses GPDB External Tables for non-GPDB databases and Hadoop
2. Sandbox Dataset
  - Copy of the source data to be used for analytics
  - Data generated from analytics
[Diagram: Chorus Workspace (Work Files, Data, Activities & Insights)]

Data - Sandbox
- Container of all the analytics data
- Ease of self-service provisioning of sandboxes
  - Free up IT bandwidth
  - Minimize data proliferation to uncontrolled/unmanaged data marts
[Diagram: Chorus Workspace (Work Files, Data, Activities & Insights) with a Sandbox in Greenplum DB]

Data - Populating Sandbox
Import data easily from anywhere:
- Directly from sources
- Through Chorus View
- Flat file import
[Diagram: source data (GPDB, EDW, non-GPDB via GPDB External Table) flowing through the Chorus Workspace and Chorus View into the Sandbox in Greenplum DB]

Data - Chorus View Utility
- Single-view GUI utility for exploring, filtering, aggregating, and moving the desired data from sources to sandbox
- Data exploration and visualization prior to bringing the data into sandbox
- Derive variations of the basic source datasets without bringing the data into sandbox
[Diagram: Chorus Workspace (Work Files, Data, Chorus View, Activities & Insights) with a Sandbox in Greenplum DB]

Data - Chorus View
Example Chorus View:
    SELECT a.userid, a.customer_name, a.gender, a.customer_state,
           b.ipaddress, b.device
    FROM customers AS a
    INNER JOIN weblog_2012q1 AS b ON a.userid = b.userid
[Diagram: source datasets (GPDB, EDW via GPDB External Table) feeding the Chorus View, which feeds a sandbox dataset]
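To see what the join on this slide returns, it can be run against a throwaway in-memory SQLite database from Python. SQLite stands in for Greenplum here only for convenience; the sample rows, the `ORDER BY` clause, and the `run_chorus_view_query` helper are made up for this sketch.

```python
import sqlite3

# The slide's query, with an ORDER BY added so the result order is deterministic.
QUERY = """
SELECT a.userid, a.customer_name, a.gender, a.customer_state,
       b.ipaddress, b.device
FROM customers AS a
INNER JOIN weblog_2012q1 AS b ON a.userid = b.userid
ORDER BY b.ipaddress
"""


def run_chorus_view_query():
    """Create two tiny hypothetical tables and return the joined rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (userid INTEGER, customer_name TEXT,
                                gender TEXT, customer_state TEXT);
        CREATE TABLE weblog_2012q1 (userid INTEGER, ipaddress TEXT, device TEXT);
        INSERT INTO customers VALUES (1, 'Ann', 'F', 'NY'), (2, 'Bob', 'M', 'CA');
        INSERT INTO weblog_2012q1 VALUES (1, '10.0.0.1', 'phone'),
                                         (1, '10.0.0.2', 'laptop'),
                                         (3, '10.0.0.3', 'tablet');
    """)
    try:
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

With these rows the inner join keeps only user 1, once per weblog entry: the customer without log entries and the log entry without a matching customer are both dropped.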

Data - Automated Data Services
- Subscribe to receive automatic updates
  - Schedule imports from multiple data sources
  - Define and share data sets within the data science team
  - Removes manual data refresh activities

Work Files
- Work files are non-data assets
  - SQL query statements with code editor interface
  - Execution of in-database analytics, e.g. MADlib, PL/R
  - Third-party tool files: PowerPoint, Word documents, etc.
- Analytics asset management: version, compare, and archive work files
[Diagram: Chorus Workspace (Work Files, Data, Chorus View, Activities & Insights) with a Sandbox in Greenplum DB]
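The "version, compare" part of work-file management can be illustrated with Python's standard `difflib` module. The `WorkFile` class below is a hypothetical in-memory stand-in and says nothing about how Chorus actually stores versions.

```python
import difflib


class WorkFile:
    """Toy versioned work file: every save keeps the full text of that version."""

    def __init__(self, name):
        self.name = name
        self.versions = []

    def save(self, text):
        """Store a new version and return its 1-based version number."""
        self.versions.append(text)
        return len(self.versions)

    def compare(self, old, new):
        """Return a unified diff between two 1-based version numbers."""
        before = self.versions[old - 1].splitlines()
        after = self.versions[new - 1].splitlines()
        return list(difflib.unified_diff(
            before, after,
            fromfile=f"{self.name}@v{old}",
            tofile=f"{self.name}@v{new}",
            lineterm="",
        ))
```

Keeping every version of a SQL work file lets a team see exactly which line of a query changed between two runs of an analysis.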

Integration with Analytics Tools
- Third-party tools
  - Execute in-database analytics functions (e.g. MADlib, R) from Chorus work files
  - Publish and execute Alpine Miner workflows from the Chorus native interface
  - Data preparation for analysis using SAS and other analytics tools
- Code-design UI for SQL

Insight and Data Sharing
- Post comments and ask questions on any analytics artifacts
- Share and publish any activities or insights
- Promote fast iteration on data and ideas

Activities and Insights
- Build a living library of activities and insights
  - Define, publish, and share new insights
  - Discover and learn from existing insights
- Iterate faster, model less
[Diagram: Chorus Workspace (Work Files, Data, Chorus View, Activities & Insights) with a Sandbox in Greenplum DB]
