IBM Machine Learning and Data Analytics Collaboration Opportunities Graham Mackintosh IBM Emerging Technology Project Executive 28 Sept 2016
Topics
- IBM Emerging Technology: quick introduction
- Workshop context
  - Machine Learning and Deep Learning
  - Apache Spark
  - open CERN: openlabs, opendata, SWAN
- A few ideas to kick things off
IBM jstart: IBM Emerging Technology
jstart is the IBM Emerging Technologies client engagement team (ibm.com/jstart).
Solutions for global customers using open & emerging technologies. Two examples of our active projects:
- Spark machine learning for signal classification (NASA, SETI, Stanford)
- Predictive analytics and real-time streaming with the US Cycling Team
Knowledge & experience transfer through customer engagements to IBM organizations and products.
jstart Projects and POC Process
- Requirements driven: start with a simple use case and iterate
- Low-friction PoC process to explore options & ideas (in-kind contribution)
- Every jstart engagement has an assigned jstart Project Manager and an experienced Architect with ML experience
- Development Labs and cloud-based POC environments
- Experience with a variety of ML technologies (scikit-learn, MLLib, Keras, etc.)
- Leverage third-party & open source packages (e.g. HEP_ML for high energy physics)
The jstart Engagement Process
1. Solution Drivers & Boundaries
2. Requirements & Solution Scope
3. Detailed Design
4. Iterative Development
5. Deployment & Skills Transfer
Constant feedback on Business & Technology
Workshop Context
1. openlabs is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions
- CERN openlab Machine Learning and Data Analytics workshop, April 2016
2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community
- SWAN open service for interactive analysis in the cloud
- CERN evaluation of Spark to predict CMS data set popularity
- Interest in MLLib, scikit-learn, Keras distributed deep learning, etc.
3. CERN is increasingly open to external citizen scientists
- Collaboration with LAL for the Higgs ML Challenge in 2014
- opendata portal access with controls for data embargoes
Workshop Context
1. openlabs is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions
- CERN openlab Machine Learning and Data Analytics workshop, April 2016
- IBM Watson: $1B investment in deep learning and cognitive computing
- IBM DataWorks launched (Watson, Spark, Data Science Experience)
- IBM SystemML is now an open Apache incubator project
- IBM is a core contributor to MLLib
- IBM Cognitive Compute Cluster for Deep Learning
2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community
- SWAN open service for interactive analysis in the cloud
- CERN evaluation of Spark to predict CMS data set popularity
- Interest in MLLib, scikit-learn, Keras distributed deep learning, etc.
3. CERN is increasingly open to external citizen scientists
- opendata portal controlled access that respects data embargoes
Workshop Context
1. openlabs is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions
- CERN openlab Machine Learning and Data Analytics workshop, April 2016
2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community
- SWAN open service for interactive analysis in the cloud
- CERN evaluation of Spark to predict CMS data set popularity
- Interest in MLLib, scikit-learn, Keras distributed deep learning, etc.
- IBM has announced strategic investment in Spark
- IBM is now the #2 contributor to Spark open source
- IBM Spark Technology Center opened in the heart of Silicon Valley
- Spark is linked to hundreds of other cloud services on IBM BlueMix
- Multiple Spark deployments and active POCs
3. CERN is increasingly open to external citizen scientists
- opendata portal controlled access that respects data embargoes
IBM announcement on Spark:
- IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark's machine learning capabilities.
- IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
- IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.
Workshop Context
1. openlabs is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions
- CERN openlab Machine Learning and Data Analytics workshop, April 2016
2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community
- SWAN open service for interactive analysis in the cloud
- CERN evaluation of Spark to predict CMS data set popularity
- Interest in MLLib, scikit-learn, Keras distributed deep learning, etc.
3. CERN is increasingly open to external citizen scientists
- opendata portal controlled access that respects data embargoes
- IBM Data Science Experience
- IBM Data Exchange
- IBM collaboration with NASA Advanced Super Computer Division to create training/test sets for ML models
- Example: Spark@SETI
For example: Spark@SETI
SETI Institute Backgrounder
- Headquartered in Mountain View, CA. Founded 1984. 150 scientists, researchers and staff.
- The mission of the SETI Institute is to explore the potential for extra-terrestrial life.
- Search for narrow-band radio signals in the frequency range of 1GHz to 10GHz, which could be evidence of intelligence outside our solar system.
The Allen Telescope Array (ATA)
- Phased array synthetic dish, 3 beams
- 42 receiving dishes, each 6m in diameter
- 1GHz to 10GHz
- 4.5TB of data every hour
- Only the data with detected signals is saved for later analysis
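To put the quoted ATA data rate in perspective, the arithmetic can be sketched in a few lines of Python. Only the 4.5TB-per-hour figure comes from the slide; the use of decimal units and the function names are assumptions made for illustration.

```python
# Figure quoted on the slide: the ATA produces 4.5 TB of data every hour.
ATA_TB_PER_HOUR = 4.5

# Assumption: decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes).
BYTES_PER_TB = 10**12
BYTES_PER_GB = 10**9


def sustained_rate_gb_per_s(tb_per_hour):
    """Average data rate in GB/s for a given hourly volume in TB."""
    return tb_per_hour * BYTES_PER_TB / 3600 / BYTES_PER_GB


def daily_volume_tb(tb_per_hour):
    """Volume in TB if the hourly rate were sustained for 24 hours."""
    return tb_per_hour * 24


if __name__ == "__main__":
    print(f"{sustained_rate_gb_per_s(ATA_TB_PER_HOUR):.2f} GB/s sustained")
    print(f"{daily_volume_tb(ATA_TB_PER_HOUR):.0f} TB per day of continuous observing")
```

Under these assumptions the array averages about 1.25 GB/s, or 108 TB for a full day of continuous observing, which is consistent with the note that only data with detected signals is saved.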
Spark@SETI
jstart project in collaboration with NASA and the SETI Institute
IBM Apache Spark Services allows large volumes of radio signal data to be analyzed in new ways
- Deep data mining of the SETI 10-year data archives
- Spark-enabled analysis of long-duration observations (~5TB each)
- Intelligent signal classification with deep learning (Cognitive Compute Cluster)
Open environment to allow other institutions and world experts to participate
- NASA Space Science Division
- Stanford University
  Multiple concurrent research teams
- Swinburne University, Australia
  Wide-band signal detection experts
- IBM Research Johannesburg
  Square Kilometer Array research team
Spark@SETI
IBM Spark@SETI GitHub repository
- Python Jupyter notebooks
- Python code install packages
- Standard GitHub collaboration functions
Import of signal data from SETI radio telescope data archives (~10 years), via SWIFT
IBM Object Storage
- Shared repository of SETI data in Object Store
- 200M rows of signal event data
- 360,000 raw recordings of signals of interest
- Large long-duration observations (~5TB each)
- ~20TB accessible data in storage
Spark@SETI Example Notebook
Jupyter notebook showing complex radio signals being classified based on morphology and other features. The neural net model was developed on the IBM Cognitive Compute Cluster (GPU enhanced) and ported to IBM Spark on the cloud for use by other researchers.
Spark@SETI Technical pathfinder
- Multi-terabyte data sources: 100s of millions of records, millions of binary files ranging from 5MB to 5TB
  hardened SWIFT connectivity from Spark to Object Store
- CPU-intensive algorithms for multi-variant data processing
  hardened Spark services for multi-day wall-time workloads
- Multi-terabyte ground-to-cloud uploads
  IBM TS2270 tapes, Softlayer Data Transfer Services, etc.
- Advanced data visualization and notebook distribution
- Integration with the IBM Cognitive Compute Cluster
  Leverage deep learning models for real-time signal triage
- Cluster availability monitoring and support
PUBLIC-Spark@SETI
- Open invitation for external researchers and citizen scientists to analyze ATA signal data
- Gallery of greatest hits and GitHub repository of notebooks for collaborative outcomes
- Analytic challenges and hackathons
- Review of results for potential use by the SETI Institute on the internal Spark environment
PUBLIC-Spark@SETI Stanford University Signal classification based on morphology and selected scalar metrics
PUBLIC-Spark@SETI
Stanford University: signal classification based on morphology and selected scalar metrics
Example: randomly (?) modulated signals which are occasionally detected. Signal of interest? Faulty equipment?
- The scale-invariant feature transform (SIFT)
- Fisher Vector
- Squiggle Fingerprint
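The idea of classifying a signal from scalar metrics of its shape can be illustrated with a toy sketch. This is not the Stanford method and does not implement SIFT or Fisher Vectors; the synthetic waterfall, the two metrics (drift and wander) and the threshold are all hypothetical choices made for illustration.

```python
import math
import random


def make_waterfall(track, n_freq, noise=0.1, seed=0):
    """Toy spectrogram: one row per time step, with one bright frequency bin per row."""
    rng = random.Random(seed)
    rows = []
    for f in track:
        row = [rng.random() * noise for _ in range(n_freq)]
        bin_index = min(max(int(round(f)), 0), n_freq - 1)
        row[bin_index] += 1.0
        rows.append(row)
    return rows


def peak_track(waterfall):
    """Index of the strongest frequency bin at each time step."""
    return [max(range(len(row)), key=row.__getitem__) for row in waterfall]


def linear_fit(ys):
    """Least-squares slope and intercept of ys against 0, 1, 2, ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def scalar_metrics(waterfall):
    """Two hypothetical scalar metrics: linear drift and RMS wander around it."""
    track = peak_track(waterfall)
    slope, intercept = linear_fit(track)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(track)]
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return {"drift_bins_per_step": slope, "wander_rms_bins": rms}


def label(metrics, wander_threshold=1.0):
    """Threshold rule (assumed, not tuned): large wander suggests a 'squiggle'."""
    return "squiggle" if metrics["wander_rms_bins"] > wander_threshold else "narrowband"


if __name__ == "__main__":
    drifting = make_waterfall([10 + 0.5 * t for t in range(64)], 128)
    wandering = make_waterfall([60 + 15 * math.sin(t / 5) for t in range(64)], 128)
    for name, wf in (("drifting", drifting), ("wandering", wandering)):
        metrics = scalar_metrics(wf)
        print(name, label(metrics), metrics)
```

A steadily drifting tone fits a straight line closely and is labelled narrowband, while a tone whose frequency wanders leaves a large residual and is labelled a squiggle. A real pipeline would work from recorded data and richer morphology features rather than a single threshold.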
Getting back to the context of this workshop
IBM experience is that these three are tightly linked
1. openlabs is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions
- CERN openlab Machine Learning and Data Analytics workshop, April 2016
2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community
- SWAN open service for interactive analysis in the cloud
- CERN evaluation of Spark to predict CMS data set popularity
- Interest in MLLib, scikit-learn, Keras distributed deep learning, etc.
3. CERN is increasingly open to external citizen scientists
- opendata portal controlled access that respects data embargoes
IBM perspective
- IBM is investing strategically in both Spark and DL
- The Spark community is a hotbed of ML and DL activity
- IBM DSX and Spark Services are ideally suited to support public-facing initiatives
- Externally contributed innovations can be leveraged for internal use (which is often the motivator)
This convergence is the basis for proposing that CERN & IBM should collaborate in these areas
Ideas: Two parallel work streams
1. POC for Internal Use Case
- Many possibilities from the April workshop
- jstart collaboration: no-charge exploration of the potential, iterative development/demos, begin knowledge transfer
- Leverage the IBM Cognitive Compute Cluster and access to IBM Spark and DSX, Softlayer, Object Store, BlueMix services
2. POC for Public-facing Use Case: Spark@CERN
- IBM Data Science Experience
- Fully supported IBM cloud infrastructure, 24x7
- Expand and extend the reach of SWAN
- Controlled access to CMS data
- Hackathons and ML challenges
Thank you
Supporting Material
The jstart Engagement Process
1. Solution Drivers & Boundaries
- Clear understanding of business problem to be solved
- Business and technical management commitment
- Funding in place
- Right skills identified and committed to project
- Decision-making context
2. Requirements & Solution Scope
- Solution definition
- Small team
- Define scope
- Map business needs and technology
- Deliverables: use cases, preliminary design, tentative schedule, initial sizing
3. Detailed Design
- Detailed schedule
- Finalize scope
- Final technology selections
- Deliverables: design documents, project schedule
4. Iterative Development
- Early prototyping
- Regular code drops
- Testing throughout cycle
- Constant feedback from users
- Modifications via change request
5. Deployment & Skills Transfer
- Solution deployment
- Customer self-sufficiency
- Reusable assets
- Other business areas or technology
jstart and Apache Spark: Ideal for Rapid Results POCs!
- Apache Foundation open source project
- In-memory compute engine that works with data; not a data store
- Enables highly iterative analysis on large volumes of data at scale
- Unified rapid dev environment for developers and data engineers
- Greatly simplifies the development of intelligent apps fueled by data
Thank You!