The State of Big Data: Use Cases and the Ogre patterns




NIST Big Data Public Working Group
IEEE Big Data Workshop, October 27, 2014
Geoffrey Fox, Digital Science Center, Indiana University, gcf@indiana.edu

Requirements and Use Case Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains.

Tasks
Gather use case input from all stakeholders (51 use cases)
Get Big Data requirements from use cases (35 general; 437 specific)
Analyze and prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployment
Work with Reference Architecture to validate requirements and the reference architecture
Develop a set of general patterns capturing the essence of use cases (plan agreed on September 30, 2013; these became the Ogres)

Use Case Template

26 fields completed for 51 areas:
Government Operation: 4
Commercial: 8
Defense: 3
Healthcare and Life Sciences: 10
Deep Learning and Social Media: 6
The Ecosystem for Research: 4
Astronomy and Physics: 5
Earth, Environmental and Polar Science: 10
Energy: 1

51 Detailed Use Cases: Contributed July-September 2013

http://bigdatawg.nist.gov/usecases.php, 26 features for each use case

Government Operation (4): National Archives and Records Administration, Census Bureau
Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
Defense (3): Sensors, Image surveillance, Situation Assessment
Healthcare and Life Sciences (10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
Deep Learning and Social Media (6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
The Ecosystem for Research (4): Metadata, Collaboration, Translation, Light source data
Astronomy and Physics (5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle II Accelerator in Japan
Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
Energy (1): Smart grid

Patterns (Ogres) modelled on 13 Berkeley Dwarfs

1) Dense Linear Algebra
2) Sparse Linear Algebra
3) Spectral Methods
4) N-Body Methods
5) Structured Grids
6) Unstructured Grids
7) MapReduce
8) Combinational Logic
9) Graph Traversal
10) Dynamic Programming
11) Backtrack and Branch-and-Bound
12) Graphical Models
13) Finite State Machines

The Berkeley dwarfs and the NAS Parallel Benchmarks are perhaps the two best-known approaches to characterizing parallel computing use cases, kernels, and patterns.
Note that the dwarfs are somewhat inconsistent: for example, MapReduce is a programming model, whereas the spectral method is a numerical method.
There is no single comparison criterion, so multiple facets are needed.

7 Computational Giants of NRC Massive Data Analysis Report

1) G1: Basic Statistics (later termed MRStat, as suitable for simple MapReduce implementation)
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations, e.g. Linear Programming
6) G6: Integration (later called GML, Global Machine Learning)
7) G7: Alignment Problems, e.g. BLAST

Features of 51 Use Cases I

PP (26): All Pleasingly Parallel or Map Only
MR (18): Classic MapReduce (add MRStat below for full count)
MRStat (7): Simple version of MR where the key computations are simple reductions, as found in statistical summaries such as histograms and averages
MRIter (23): Iterative MapReduce or MPI (Spark, Twister)
Graph (9): Complex graph data structure needed in analysis
Fusion (11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
Streaming (41): Data comes in incrementally and is processed this way
Classify (30): Classification: divide data into categories
S/Q (12): Index, Search and Query
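To make the MRStat pattern concrete, here is a minimal sketch of a histogram computed as a map step over data partitions followed by a simple reduction. It is an illustration only, not taken from the slides: the function names, the bin width, and the sample values are all assumptions, and a real deployment would run the map step in parallel on a framework such as Hadoop or Spark.

```python
from collections import Counter
from functools import reduce


def map_partition(values, bin_width):
    """Map step: build a partial histogram for one data partition."""
    return Counter(int(v // bin_width) for v in values)


def reduce_histograms(partials):
    """Reduce step: merge partial histograms by summing the bin counts."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical data, already split into three partitions.
partitions = [[0.5, 1.2, 1.9], [2.1, 0.3], [1.1, 2.8, 2.9]]

# Each partition is processed independently (this loop could run in parallel).
partials = [map_partition(p, bin_width=1.0) for p in partitions]

# A single associative reduction combines the partial results.
histogram = reduce_histograms(partials)
print(dict(histogram))
```

Because the reduction is a plain sum of counts, the partial histograms can be merged in any order, which is what makes this kind of statistic suitable for a simple MapReduce implementation.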

Features of 51 Use Cases II

CF (4): Collaborative Filtering for recommender engines
LML (36): Local Machine Learning (independent for each parallel entity); the application could have GML as well
GML (23): Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS, Large Scale Optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can be called EGO, or Exascale Global Optimization, with a scalable parallel algorithm
Workflow (51): Universal
GIS (16): Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.
HPC (5): Classic large-scale simulation of cosmos, materials, etc. generating (visualization) data
Agent (2): Simulations of models of data-defined macroscopic entities represented as agents
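The difference between LML and GML can be sketched with clustering, one of the GML examples listed above. In the hedged example below, each partition computes partial sums independently (the local, map-like part), and a global reduction then combines them into new centers on every iteration. This is a plain one-dimensional k-means written for illustration; the function names, the data, and the starting centers are assumptions, not something specified in the slides.

```python
def assign(points, centers):
    """Map step: per-cluster partial sums and counts for one partition."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in points:
        # Index of the nearest center by squared distance.
        k = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
        sums[k] += x
        counts[k] += 1
    return sums, counts


def kmeans(partitions, centers, iterations=10):
    """Iterate: local map over partitions, then a global reduction."""
    for _ in range(iterations):
        partials = [assign(p, centers) for p in partitions]
        new_centers = []
        for k in range(len(centers)):
            total = sum(s[k] for s, _ in partials)  # global sum
            count = sum(c[k] for _, c in partials)  # global count
            # Keep the old center if no point was assigned to it.
            new_centers.append(total / count if count else centers[k])
        centers = new_centers
    return centers


# Hypothetical data split across two partitions.
partitions = [[1.0, 1.2, 9.0], [0.8, 9.2, 8.8]]
centers = kmeans(partitions, centers=[0.0, 10.0])
print(centers)
```

The global reduction inside the loop is what places this in the GML and MRIter categories: every iteration needs information from all partitions before the next one can start.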

First set of Ogre Facets

Facets I: The features just discussed (PP, MR, MRStat, MRIter, Graph, Fusion, Streaming (DDDAS), Classify, S/Q, CF, LML, GML, Workflow, GIS, HPC, Agents)
Facets II: Some broad features familiar from the past, such as:
  BSP (Bulk Synchronous Processing) or not?
  SPMD (Single Program Multiple Data) or not?
  Iterative or not?
  Regular or irregular?
  Static or dynamic?
  Communication/compute and I/O/compute ratios
  Data abstraction (array, key-value, pixels, graph)

Data Processing Facet: Illustrated by Typical Science Case

Core Analytics Facet I

Map-Only
  Pleasingly parallel - Local Machine Learning
MapReduce: Search/Query/Index
Summarizing statistics as in LHC data analysis (histograms) (G1)
Recommender Systems (Collaborative Filtering)
Linear Classifiers (Bayes, Random Forests)
Alignment and Streaming (G7)
  Genomic Alignment, Incremental Classifiers
Global Analytics: Nonlinear Solvers (structure depends on objective function) (G5, G6)
  Stochastic Gradient Descent (SGD) and approximations to Newton's Method
  Levenberg-Marquardt solver
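As a small illustration of the Stochastic Gradient Descent entry under Global Analytics, the sketch below fits a straight line by updating the parameters one sample at a time. It assumes a least-squares objective and noise-free synthetic data; the learning rate, epoch count, and function name are illustrative choices rather than anything prescribed by the slides.

```python
import random


def sgd_linear_fit(data, lr=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    samples = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)          # visit samples in random order
        for x, y in samples:
            err = (w * x + b) - y     # residual for this single sample
            w -= lr * err * x         # gradient of 0.5 * err**2 w.r.t. w
            b -= lr * err             # gradient of 0.5 * err**2 w.r.t. b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = sgd_linear_fit(data)
print(round(w, 3), round(b, 3))
```

Each update uses only one sample, so the structure of the computation depends on the objective function, as the slide notes. A parallel version would need to combine updates from different workers, which is where the global communication cost arises.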

Core Analytics Facet II

Global Analytics: Map-Collective (see Mahout, MLlib) (G2, G4, G6)
  Often use matrix-matrix and matrix-vector operations and solvers (conjugate gradient)
  Clustering (many methods), Mixture Models, LDA (Latent Dirichlet Allocation), PLSI (Probabilistic Latent Semantic Indexing)
  SVM and Logistic Regression
  Outlier Detection (several approaches)
  PageRank (find leading eigenvector of sparse matrix)
  SVD (Singular Value Decomposition)
  MDS (Multidimensional Scaling)
  Learning Neural Networks (Deep Learning)
  Hidden Markov Models
Graph Analytics (G3)
  Structure and Simulation (communities, subgraphs/motifs, diameter, maximal cliques, connected components, betweenness centrality, shortest path)
Linear/Quadratic Programming, Combinatorial Optimization, Branch and Bound (G5)
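The PageRank entry above describes finding the leading eigenvector of a sparse matrix. One common way to do this is power iteration with a damping factor, sketched below on a tiny hypothetical link graph stored as an adjacency dictionary. The damping value of 0.85, the iteration count, and the graph itself are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration on a sparse link graph given as {node: [targets]}."""
    nodes = sorted(set(links) | {v for ts in links.values() for v in ts})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation share.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank


# Hypothetical three-page web: a links to b and c, b to c, c to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print({k: round(v, 4) for k, v in ranks.items()})
```

Only the existing links are visited in each pass, so the cost per iteration scales with the number of edges rather than with the square of the number of nodes. In a Map-Collective setting, the per-node contributions would be computed in parallel and combined with a global reduction.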

Map Use Cases to HPC-ABDS Software Model
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies, October 10, 2014
17 layers, ~200 software packages

Cross-Cutting Functionalities
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, OpenStack Keystone, LDAP, Sentry
4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: Oozie, ODE, ActiveBPEL, Airavata, OODT (Tools), Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, mlpy, scikit-learn, CompLearn, Caffe, R, Bioconductor, ImageJ, pbdR, ScaLAPACK, PETSc, Azure Machine Learning, Google Prediction API, Google Translation API, Torch, Theano, H2O, Google Fusion Tables
15) High level Programming: Kite, Hive, HCatalog, Tajo, Pig, Phoenix, Shark, MRQL, Impala, Presto, Sawzall, Drill, Google BigQuery (Dremel), Google Cloud DataFlow, Summingbird, Google App Engine, Red Hat OpenShift
14A) Basic Programming model and runtime, SPMD, Streaming, MapReduce: Hadoop, Spark, Twister, Stratosphere, Reef, Hama, Giraph, Pregel, Pegasus
14B) Streaming: Storm, S4, Samza, Google MillWheel, Amazon Kinesis
13) Inter process communication (collectives, point-to-point, publish-subscribe): Harp, MPI, Netty, ZeroMQ, ActiveMQ, RabbitMQ, Qpid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT; Public Cloud: Amazon SNS, Google Pub Sub, Azure Queues
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis (key value), Hazelcast, Ehcache
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus and ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL: Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, SciDB, Apache Derby, Google Cloud SQL, Azure SQL, Amazon RDS
11B) NoSQL: HBase, Accumulo, Cassandra, Solandra, MongoDB, CouchDB, Lucene, Solr, Berkeley DB, Riak, Voldemort, Neo4J, YarcData, Jena, Sesame, AllegroGraph, RYA, Espresso; Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Google Omega, Facebook Corona
8) File systems: HDFS, Swift, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS; Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Whirr, JClouds, OCCI, CDMI, Libcloud, TOSCA, Libvirt
6) DevOps: Docker, Puppet, Chef, Ansible, Boto, Cobbler, xCAT, Razor, CloudMesh, Heat, Juju, Foreman, Rocks, Cisco Intelligent Automation for Cloud
5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, VMware ESXi, vSphere, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, VMware vCloud, Amazon, Azure, Google and other public clouds; Networking: Google Cloud DNS, Amazon Route 53