On Establishing Big Data Breakwaters

Similar documents

Understanding Big Data Analytics Applications in Earth Science Morris Riedel, Rahul Ramachandran/Kuo Kwo-Sen, Peter Baumann Big Data Analytics

European Data Infrastructure - EUDAT Data Services & Tools

High Productivity Data Processing Analytics Methods with Applications

Classification Techniques in Remote Sensing Research using Smart Data Analytics

COMP9321 Web Application Engineering

Big Data and Analytics: Challenges and Opportunities

How To Handle Big Data With A Data Scientist

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Scalable Developments for Big Data Analytics in Remote Sensing

NoSQL Data Base Basics

Transforming the Telecoms Business using Big Data and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

International Journal of Innovative Research in Computer and Communication Engineering

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Big Data: Opportunities & Challenges, Myths & Truths 資料來源 : 台大廖世偉教授課程資料

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Big Data and Data Science: Behind the Buzz Words

BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Challenges for Data Driven Systems

Why is Internal Audit so Hard?

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

IBM Solution Framework for Lifecycle Management of Research Data IBM Corporation

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Big Data and Analytics (Fall 2015)

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

Data Requirements from NERSC Requirements Reviews

Big Data Analytics. Tools and Techniques

Big Data a threat or a chance?

HPC technology and future architecture

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Advanced Big Data Analytics with R and Hadoop

NextGen Infrastructure for Big DATA Analytics.

Chapter 7. Using Hadoop Cluster and MapReduce

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

OpenAIRE Research Data Management Briefing paper

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

Integrating Big Data into the Computing Curricula

Oracle Big Data SQL Technical Update

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Data-Intensive Science and Scientific Data Infrastructure

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

SURVEY REPORT DATA SCIENCE SOCIETY 2014

The InterNational Committee for Information Technology Standards INCITS Big Data

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Reference Architecture, Requirements, Gaps, Roles

This Symposium brought to you by

Big Data Analytics for Space Exploration, Entrepreneurship and Policy Opportunities. Tiffani Crawford, PhD

BIG DATA TOOLS. Top 10 open source technologies for Big Data

So What s the Big Deal?

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

BIG DATA CHALLENGES AND PERSPECTIVES

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Big Data Challenges in Bioinformatics

BIG DATA TRENDS AND TECHNOLOGIES

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

Hadoop. Sunday, November 25, 12

Big Data Analytics in Space Exploration and Entrepreneurship

BIG DATA-AS-A-SERVICE

A Service for Data-Intensive Computations on Virtual Clusters

Industry 4.0 and Big Data

Healthcare data analytics. Da-Wei Wang Institute of Information Science

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

ANALYTICS CENTER LEARNING PROGRAM

Introduction to Data Mining

Big Data With Hadoop

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21)

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Cloud and Big Data Standardisation

Application Development. A Paradigm Shift

NITRD and Big Data. George O. Strawn NITRD

Where is... How do I get to...

How To Scale Out Of A Nosql Database

Luncheon Webinar Series May 13, 2013

The Next Wave of Data Management. Is Big Data The New Normal?

Statistics for BIG data

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

The? Data: Introduction and Future

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Conquering the Astronomical Data Flood through Machine

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Transcription:

On Establishing Big Data Breakwaters with Analytics Dr. - Ing. Morris Riedel Head of Research Group High Productivity Data Processing, Juelich Supercomputing Centre, Germany Adjunct Associated Professor, University of Iceland, Iceland Co-chair Research Data Alliance, Big Data Analytics Interest Group

Big Data Waves Surfboards Breakwaters How to engage in the rising tide of big data? Unsolved Questions: Scale Heterogeneity Stewardship Curation Long-Term Access and Storage [1] HLEG Report [2] KE Report Research Challenges: Collection, Trust, Usability Interoperability, Diversity Security, Smart Analytics, Education and training Data publication and access Commercial exploitation New social paradigms Preservation and sustainability Concrete Next Steps

Looking into Big Data Waves Volume Variety Structured Data Automobiles Simulations Mobile Locationbased Data Machine Data IMHO, it s great! Velocity Click Stream Image-based analysis large instruments Text Data Context Social Network RFID Smart Meter Customer Data Point of Sale

Understanding Big Data Waves Big Data Streams with high velocity Web Actions n optional filters Data stream with data hidden in logs/events Data Sources Measurement Devices Data stream with measured data Computational Simulations Data stream with data of HPC/HTC simulation Data Sinks

Crowdsourcing increases # of Big Data Streams Usual Citizens / Citizen Scientist Data streams with data (low trust) Individuals with domain as Hobby Exabytes Data streams with data (moderate trust) Scientific/Engineering Domain Experts Data streams with data (high trust) Petabytes Terabytes

Alpha Magnetic Spectrometer (AMS) @ ISS What are building blocks of the Universe? Search for 20 TB Cosmic ~400 Antimatter JUROPA++ TB 1-2 Pflop/s + Booster JSC is main facility for AMS computing in Germany Selected Juelich Supercomputing Centre (JSC) Research Examples Simulation Labs (e.g. Climate SimLab) one example out of many JSC SimLabs ~10 years 500 HDF Files ~50 TB 50 TB ~20 TB Towards Exascale Applications with combined characteristics of simulations and analytics

The next generation radio telescope for science pushing the limits of the observable universe out by billions of galaxies The square kilometre array 1 PB in 20 seconds LOFAR test site Jülich

Large-scale Computational Massively Parallel Applications simulate Reality Better Prediction Accuracy means Bigger Data TOP 500 HPC Systems 11/2013 We are unable to store the output data of all computational simulations/users [3] F. Berman

Making use of Big Data is important but How? Netflix Prize Challenge ~2009 Challenge: Improve movie recommender Prize: 1 M $ for team with 10% improvements Open historical data to train a new system Traditional Machine Learning Algorithms used Prize received with Artificial Neural Networks It becomes a strategic differentiator in the market H1N1 Virus Made Headlines ~2009 Nature paper from Google employees Explains how Google is able to predict flus Not only national scale, but down to regions Possible via logged big data - search queries [4] Jeremy Ginsburg et al., Detecting influenza epidemics using search engine query data, Nature 457 (2009) ~1998-today

Smart Analytics are Needed to Take Advantage of Big Data The challenge is to understand which analytics make sense Terms & Motivation Understanding climate change, finding alternative energy sources, and preserving the health of an ageing population are all cross-disciplinary problems that require high-performance data storage, smart analytics, transmission and mining to solve. [1] HLEG Report In the data-intensive scientific world, new skills are needed for creating, handling, manipulating, analysing, and making available large amounts of data for re-use by others. [2] KE Report Integration of data analytics with exascale simulations represents a new kind of workflow that will impact both data-intensive science and exascale computing. [5] DOE ASCAC Report

Big Data Analytics Interest Group Can we establish a neutral group gathering clear evidence which analytics make sense in Research? Develops community based recommendations on feasible data analytics approaches to address community needs of utilizing large quantities of data Analyzes different scientific research domain applications and their potential use of various big data analytics techniques A systematic classification of feasible combinations of analysis algorithms, analytical tools, data and resource characteristics and scientific queries will be covered in these recommendations

Big Data Analytics Interest Group Agricultural Data Interoperability IG Big Data Analytics IG Brokering IG Certification of Digital Repositories IG Community Capability Model WG Data Citation WG Data Foundation and Terminology WG Data in Context IG Data Type Registries WG Defining Urban Data Exchange for Science IG Digital Practices in History and Ethnography IG Engagement Group IG Legal Interoperability IG Long tail of research data IG Marine Data Harmonization IG Metadata IG Metadata Standards Directory WG PID Information Types WG Practical Policy WG Preservation e-infrastructure IG Publishing Data IG Standardization of Data Categories and Codes IG Structural Biology IG Toxicogenomics Interoperability IG UPC Code for Data IG Wheat Data Interoperability WG [6] Research Data Alliance Big Data Analytics IG Web page

Big Data Analytics Interest Group New Technology Impact As Challenge for Users NoSQL Databases Example Selected Features Simplicity of design and deployment Horizontal scaling Less constrained consistency models Finer control over availability Simple retrieval and appending... Types Key-Value-based (e.g. Cassandra) Column-based (e.g. Apache Hbase) Document-based (e.g. MongoDB) Graph-based (e.g. Neo4J) String-based Key-Value Stores

Big Data Analytics Interest Group New Technology Impact As Challenge for Users Map-Reduce Paradigm vs. Traditional HPC What is the right technique out of many? [7] Computer Vision using Map-Reduce

Big Data Analytics Interest Group Selected Data Analytics Techniques Classification? Clustering Regression Groups of data exist New data classified to existing groups No groups of data exist Create groups from data close to each other Identify a line with a certain slope describing the data

Big Data Analytics Interest Group Big Data Analytics Techniques require their Parallelization [9] Geoffrey Fox et al. [10] Gesmundo et al. [11] Zhanquan Sun et al. Parallel K-Means Clustering Algorithms Parallel Perceptron Using Map-Reduce Parallel Support Vector Machines

Methods & Tools Sampling vs. Big Data Parallelization! Applied Statistics Data Mining Machine Scientific Computing Learning Algorithms Computational Scientist Software Engineer Engineer Data Miner Statistician Data Scientist new DBs Training Data Scientists Big Data Analytics Reference Material for Data Scientists Insights Statistical Data Mining Course HPC B(ig Data) Course Data Scientists with skills of various fields

Concrete Use Case Underutilized Unique Big Data Big Data not always means Big Volume, but also Context (uniqueness) In-flight Measurements ( Context ) MOZAIC 1994 today (~20 years): Ozone, water vepor, CO, NOy, + p, T, Wind IAGOS 2011 today ( better data ): + Cloud droplets + CO 2, CH 4, Aerosole Challenge: Large underutilized data collections exist Approach: Initial general data analytics techniques Domain-specific data analysis by experts Big Data Analytics

Concrete Use Case Automating Quality Control But Bigger Data not always means Better data Challenge: Large data collection exist with many measurements Approach: Initial general data analytics techniques E.g. automated outlier detection for quality control (Previously only manual sampling, but better?) Domain-specific focussed data validation by experts Big Data Analytics With thanks to Robert Huber, Marum, Bremen

ScienceTube Long-term Data Preservation and Curation bears potentials to lower Data Waves and supports big data analytics process [8] ScienceTube Search? References to data? Sharing? Metadata? Trust? Open Data? We need to dive into data Selected Benefits of open data infrastructures for science & engineering: High reliability, so data scientists can count on its availability Open deposit, allowing user-community centres to store data easily Persistent identification, allowing data centres to register a huge amount of markers to track the origins and characteristics of the information Metadata support to allow effective management, use and understanding Avoids re-creation of datasets through easy data lookups and re-use Enables easier identification of duplicates to remove them & save storage

Towards Exascale: Applications with combined characteristics of simulations & analytics In-Situ Analytics correlations visual analytics e.g. map-reduce jobs, R-MPI In-situ correlations & data reduction e.g. clustering, classification In-situ statistical data mining exascale application scientific visualization & beyond steering key-value pair DB analytics part visualization part interactive scalable I/O distributed archive computational simulation part in-memory Exascale computer with access to exascale storage/archives Inspired by a [5] DOE ASCAC report

References [1] High Level Expert Group on Scientific Data, Riding the Wave,Report to the European Commission, October 2010, Online: http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf [2] Knowledge Exchange Partner, A Surfboard for Riding the Wave, November 2012, Online: http://www.knowledge-exchange.info/surfboard [3] Fran Berman, Maximising the Potential of Research Data, January 2013, Online: http://rdi2.rutgers.edu/event/rdi2-distinguished-seminar-maximizing-innovation-potential-research-data [4] Jeremy Ginsburg et al., Detecting influenza epidemics using search engine query data, Nature 457 (2009), Online: http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html [5] DOE ASCAC Data Subcommittee Report, Synergistic Challenges in Data-Intensive Science and Exascale Computing, 2013, Online: http://www.sci.utah.edu/publications/chen13/ascac_data_intensive_computing_report_final.pdf [6] Research Data Alliance (RDA), Big Data Analytics Web Page, Online: https://rd-alliance.org/internal-groups/big-data-analytics-ig.html [7] Brandyn White et al., Web-Scale Computer Vision using MapReduce for Multimedia Data Mining, Online: http://dl.acm.org/citation.cfm?doid=1814245.1814254 [8] M. Riedel and P. Wittenburg et al. A Data Infrastructure Reference Model with Applications: Towards Realization of a ScienceTube Vision with a Data Replication Service, 2013, Online: http://www.jisajournal.com/content/4/1/1 [9] G. Fox, MPI and Map-Reduce, Talk at CCGSC 2010 Flat Rock, NC, 2010, Online: http://web.eecs.utk.edu/~dongarra/ccgsc2010/slides/talk12-fox.pdf [10] Gesmundo et al., Hadoop Perceptron, 2012, Online: http://cui.unige.ch/~gesmundo/papers/gesmundo-eacl2012.pdf [11] Zhanquan Sun et al., Study on Parallel SVM Based on MapReduce, Online: http://grids.ucs.indiana.edu/ptliupages/publications/study%20on%20parallel%20svm%20based%20on%20mapreduce.pdf

Thanks Talk available at: www.morrisriedel.de/talks Contact: m.riedel@fz-juelich.de