IBM Machine Learning and Data Analytics Collaboration Opportunities. Graham Mackintosh IBM Emerging Technology Project Executive 28 Sept 2016

Topics
- IBM Emerging Technology: quick introduction
- Workshop context
  - Machine Learning and Deep Learning
  - Apache Spark
  - open CERN: openlabs, opendata, SWAN
- A few ideas to kick things off

IBM jstart: IBM Emerging Technology
- jstart is the IBM Emerging Technologies client engagement team (ibm.com/jstart).
- Solutions for global customers using open & emerging technologies.
- Two examples of our active projects:
  - Spark machine learning for signal classification (NASA, SETI, Stanford)
  - Predictive analytics and real-time streaming with the US Cycling Team
- Knowledge & experience transfer through customer engagements to IBM organizations and products.

jstart Projects and POC Process
- Requirements driven: start with a simple use case and iterate
- Low-friction PoC process to explore options & ideas (in-kind contribution)
- Every jstart engagement has an assigned jstart Project Manager and an experienced Architect with ML experience
- Development labs and cloud-based POC environments
- Experience with a variety of ML technologies (scikit-learn, MLlib, Keras, etc.)
- Leverage third-party & open source packages (e.g. HEP_ML for high energy physics)

The jstart Engagement Process
1. Solution Drivers & Boundaries
2. Requirements & Solution Scope
3. Detailed Design
4. Iterative Development
5. Deployment & Skills Transfer
- Constant feedback on business & technology throughout

Workshop Context
1. openlabs is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions
   - CERN openlab Machine Learning and Data Analytics workshop, April 2016
2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community
   - SWAN: open service for interactive analysis in the cloud
   - CERN evaluation of Spark to predict CMS data set popularity
   - Interest in MLlib, scikit-learn, Keras, distributed deep learning, etc.
3. CERN is increasingly open to external citizen scientists
   - Collaboration with LAL for the Higgs ML Challenge in 2014
   - opendata portal access with controls for data embargoes

Workshop Context (same three points as above, with IBM callouts)
- IBM Watson: $1B investment in deep learning and cognitive computing
- IBM DataWorks launched (Watson, Spark, Data Science Experience)
- IBM SystemML is now an open Apache incubator project
- IBM is a core contributor to MLlib
- IBM Cognitive Compute Cluster for Deep Learning

Workshop Context (same three points as above, with IBM callouts)

IBM announcement on Spark:
- IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark's machine learning capabilities.
- IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
- IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

IBM and Spark today:
- IBM has announced strategic investment in Spark and is now the #2 contributor to Spark open source
- IBM Spark Technology Center opened in the heart of Silicon Valley
- Spark is linked to hundreds of other cloud services on IBM BlueMix
- Multiple Spark deployments and active POCs

Workshop Context (same three points as above, with IBM callouts)
- IBM Data Science Experience
- IBM Data Exchange
- IBM collaboration with the NASA Advanced Supercomputing Division to create training/test sets for ML models
- Example: Spark@SETI

For example: Spark@SETI

SETI Institute Backgrounder
- Headquartered in Mountain View, CA. Founded 1984. 150 scientists, researchers and staff.
- The mission of the SETI Institute is to explore the potential for extra-terrestrial life.
- Search for narrow-band radio signals in the frequency range of 1 GHz to 10 GHz, which could be evidence of intelligence outside our solar system.

The Allen Telescope Array (ATA)
- Phased array (synthetic dish), 3 beams
- 42 receiving dishes, each 6 m in diameter
- 1 GHz to 10 GHz
- 4.5 TB of data every hour
- Only the data with detected signals is saved for later analysis
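As a rough back-of-the-envelope check on the ATA figures quoted above, the 4.5 TB per hour rate can be converted into a sustained throughput and a daily volume. The sketch assumes decimal units (1 TB = 10^12 bytes) and round-the-clock observation; neither assumption is stated on the slide, and the function names are illustrative.

```python
# Back-of-the-envelope conversion of the ATA data rate quoted on the slide.
# Assumptions (not from the slides): decimal units (1 TB = 10**12 bytes)
# and continuous, round-the-clock observation.

TB = 10**12  # bytes per terabyte (decimal convention, assumed)
GB = 10**9   # bytes per gigabyte (decimal convention, assumed)

ATA_RATE_TB_PER_HOUR = 4.5  # figure quoted on the slide


def bytes_per_second(tb_per_hour: float) -> float:
    """Convert a rate in TB/hour to bytes/second."""
    return tb_per_hour * TB / 3600


def tb_per_day(tb_per_hour: float) -> float:
    """Convert a rate in TB/hour to TB/day, assuming 24 hours of observing."""
    return tb_per_hour * 24


if __name__ == "__main__":
    rate = bytes_per_second(ATA_RATE_TB_PER_HOUR)
    print(f"Sustained rate: {rate / GB:.2f} GB/s")
    print(f"Daily volume:   {tb_per_day(ATA_RATE_TB_PER_HOUR):.0f} TB/day")
```

Under these assumptions the array produces about 1.25 GB/s, or 108 TB per day, which is consistent with the slide's point that only data with detected signals is kept for later analysis.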

Spark@SETI
- jstart project in collaboration with NASA and the SETI Institute
- IBM Apache Spark services allow large volumes of radio signal data to be analyzed in new ways
  - Deep data mining of the SETI 10-year data archives
  - Spark-enabled analysis of long-duration observations (~5 TB each)
  - Intelligent signal classification with deep learning (Cognitive Compute Cluster)
- Open environment to allow other institutions and world experts to participate
  - NASA Space Science Division
  - Stanford University: multiple concurrent research teams
  - Swinburne University, Australia: wide-band signal detection experts
  - IBM Research Johannesburg: Square Kilometer Array research team

Spark@SETI
- IBM Spark@SETI GitHub repository
  - Python Jupyter notebooks
  - Python code, install packages
  - Standard GitHub collaboration functions
- Import of signal data from SETI radio telescope data archives (~10 years)
- IBM Object Storage (Swift): shared repository of SETI data in Object Store
  - 200M rows of signal event data
  - 360,000 raw recordings of signals of interest
  - Large long-duration observations (~5 TB each)
  - ~20 TB accessible data in storage

Spark@SETI Example Notebook
Jupyter notebook showing complex radio signals being classified based on morphology and other features. The neural net model was developed on the IBM Cognitive Compute Cluster (GPU enhanced) and ported to IBM Spark on the cloud for use by other researchers.

Spark@SETI: Technical pathfinder
- Multi-terabyte data sources
  - 100s of millions of records, millions of binary files ranging from 5 MB to 5 TB
  - Hardened Swift connectivity from Spark to Object Store
- CPU-intensive algorithms for multi-variant data processing
  - Hardened Spark services for multi-day wall-time workloads
- Multi-terabyte ground-to-cloud uploads
  - IBM TS2270 tapes, Softlayer Data Transfer Services, etc.
- Advanced data visualization and notebook distribution
- Integration with the IBM Cognitive Compute Cluster
  - Leverage deep learning models for real-time signal triage
- Cluster availability monitoring and support

PUBLIC-Spark@SETI
- Open invitation for external researchers and citizen scientists to analyze ATA signal data
- Gallery of "greatest hits" and a GitHub of notebooks for collaborative outcomes
- Analytic challenges and hackathons
- Review of results for potential use by the SETI Institute on the internal Spark environment

PUBLIC-Spark@SETI
Stanford University: signal classification based on morphology and selected scalar metrics
- Example: randomly (?) modulated signals which are occasionally detected. Signal of interest? Faulty equipment?
- The scale-invariant feature transform (SIFT)
- Fisher Vector
- Squiggle fingerprint
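The slides do not say which scalar metrics were used. Purely as an illustration of what one such metric could look like, the sketch below estimates the frequency drift of a narrow-band tone in a small synthetic waterfall (a time-by-frequency grid of power values): it picks the strongest bin in each time row and fits a least-squares line through those positions. The function names, the synthetic data, and the choice of drift rate as the metric are all assumptions made for this example; it is not the SIFT or Fisher Vector pipeline named on the slide.

```python
# Illustrative only: one possible "scalar metric" for a narrow-band signal.
# The function names, the synthetic waterfall, and the choice of metric are
# assumptions for this sketch; the slides do not specify the actual features.


def peak_bins(waterfall):
    """Index of the strongest frequency bin in each time row."""
    return [max(range(len(row)), key=row.__getitem__) for row in waterfall]


def drift_rate(waterfall):
    """Least-squares slope of peak bin versus time row (bins per row)."""
    peaks = peak_bins(waterfall)
    n = len(peaks)
    if n < 2:
        raise ValueError("need at least two time rows")
    t_mean = (n - 1) / 2
    p_mean = sum(peaks) / n
    num = sum((t - t_mean) * (p - p_mean) for t, p in enumerate(peaks))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def synthetic_waterfall(rows, bins, start_bin, slope, power=10.0, floor=1.0):
    """Flat noise floor with one drifting tone; slope is an integer bin step."""
    grid = [[floor] * bins for _ in range(rows)]
    for t in range(rows):
        grid[t][start_bin + slope * t] = power
    return grid


if __name__ == "__main__":
    wf = synthetic_waterfall(rows=8, bins=32, start_bin=3, slope=2)
    print(peak_bins(wf))   # peak position per time row
    print(drift_rate(wf))  # fitted slope in bins per row
```

A single number like this could sit alongside image-style morphology features in a classifier's input; real recordings would need noise handling that this deterministic toy omits.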

Getting back to the context of this workshop

IBM experience is that these three are tightly linked:
1. openlabs is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions
   - CERN openlab Machine Learning and Data Analytics workshop, April 2016
2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community
   - SWAN: open service for interactive analysis in the cloud
   - CERN evaluation of Spark to predict CMS data set popularity
   - Interest in MLlib, scikit-learn, Keras, distributed deep learning, etc.
3. CERN is increasingly open to external citizen scientists
   - opendata portal: controlled access that respects data embargoes

IBM perspective:
- IBM is investing strategically in both Spark and DL
- The Spark community is a hotbed of ML and DL activity
- IBM DSX and Spark Services are ideally suited to support public-facing initiatives
- Externally contributed innovations can be leveraged for internal use (which is often the motivator)

This convergence is the basis for proposing that CERN & IBM should collaborate in these areas.

Ideas: Two parallel work streams
1. POC for internal use case
   - Many possibilities from the April workshop
   - jstart collaboration: no-charge exploration of the potential, iterative development/demos, begin knowledge transfer
   - Leverage the IBM Cognitive Compute Cluster and access to IBM Spark and DSX, Softlayer, Object Store, and BlueMix services
2. POC for public-facing use case: Spark@CERN
   - IBM Data Science Experience Spark@CERN
   - Fully supported IBM cloud infrastructure, 24x7
   - Expand and extend the reach of SWAN
   - Controlled access to CMS data
   - Hackathons and ML challenges

Thank you

Supporting Material

The jstart Engagement Process
1. Solution Drivers & Boundaries
   - Clear understanding of business problem to be solved
   - Business and technical management commitment
   - Funding in place
   - Right skills identified and committed to project
2. Requirements & Solution Scope
   - Decision-making context
   - Solution definition
   - Small team
   - Define scope
   - Map business needs and technology
   - Deliverables: use cases, preliminary design, tentative schedule, initial sizing
3. Detailed Design
   - Detailed schedule
   - Finalize scope
   - Final technology selections
   - Deliverables: design documents, project schedule
4. Iterative Development
   - Early prototyping
   - Regular code drops
   - Testing throughout cycle
   - Constant feedback from users
   - Modifications via change request
5. Deployment & Skills Transfer
   - Solution deployment
   - Customer self-sufficiency
   - Reusable assets
   - Other business areas or technology

jstart and Apache Spark: Ideal for Rapid Results POCs!
- Apache Foundation open source project
- In-memory compute engine that works with data; not a data store
- Enables highly iterative analysis on large volumes of data at scale
- Unified rapid dev environment for developers and data engineers
- Greatly simplifies the development of intelligent apps fueled by data

Thank You!