An Open Dynamic Big Data Driven Applica3on System Toolkit

Similar documents
What Is Big Data? Craig C. Douglas University of Wyoming

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Phone Systems Buyer s Guide

BENCHMARKING V ISUALIZATION TOOL

INCREMENTAL, APPROXIMATE DATABASE QUERIES AND UNCERTAINTY FOR EXPLORATORY VISUALIZATION. Danyel Fisher Microso0 Research

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

BIG DATA AND INVESTIGATIVE ANALYTICS

NextGen Infrastructure for Big DATA Analytics.

An to Big Data, Apache Hadoop, and Cloudera

Introduction to Predictive Analytics. Dr. Ronen Meiri

Definition of Computers. INTRODUCTION to COMPUTERS. Historical Development ENIAC

SCALABLE FILE SHARING AND DATA MANAGEMENT FOR INTERNET OF THINGS

Using RDBMS, NoSQL or Hadoop?

BIG DATA: ARE YOU READY? Andy Kyiet Demand Flow Intelligence May, 2013

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

Kaseya Fundamentals Workshop DAY THREE. Developed by Kaseya University. Powered by IT Scholars

Real Time Big Data Processing

Database Fundamentals

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

OS/Run'me and Execu'on Time Produc'vity

Doing Multidisciplinary Research in Data Science

The Big Deal about Big Data. Mike Skinner, CPA CISA CITP HORNE LLP

More Than A Buzzword: Big Data in the Environmental Arena

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Linux Clusters Ins.tute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Compu@ng Environment (PACE) OIT/ART, Georgia Tech

Foundations of Business Intelligence: Databases and Information Management

UAB Cyber Security Ini1a1ve

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Majed Al-Ghandour, PhD, PE, CPM Division of Planning and Programming NCDOT 2016 NCAMPO Conference- Greensboro, NC May 12, 2016

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Data Mining. Supervised Methods. Ciro Donalek Ay/Bi 199ab: Methods of Sciences hcp://esci101.blogspot.

The Right BI Tool for the Job in a non- SAP Applica9on Environment

Big Data Processing Experience in the ATLAS Experiment

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Gi-Joon Nam, IBM Research - Austin Sani R. Nassif, Radyalis. Opportunities in Power Distribution Network System Optimization (from EDA Perspective)

Bringing Big Data Modelling into the Hands of Domain Experts

DNS Big Data

How To Understand The Big Data Paradigm

How To Use Splunk For Android (Windows) With A Mobile App On A Microsoft Tablet (Windows 8) For Free (Windows 7) For A Limited Time (Windows 10) For $99.99) For Two Years (Windows 9

Clouds and Other Computa1onal Frameworks. Evere7 Toews, Cybera Inc. Todd King, UCLA

Computer Logic (2.2.3)

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Texas Digital Government Summit. Data Analysis Structured vs. Unstructured Data. Presented By: Dave Larson

Applications for Business Intelligence, Predictive Analytics and Big Data

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

Big Data and Big Data Modeling

Introduc8on to Apache Spark

Cloud Compu)ng. Yeow Wei CHOONG Anne LAURENT

Big Data Challenges in Bioinformatics

Stream Deployments in the Real World: Enhance Opera?onal Intelligence Across Applica?on Delivery, IT Ops, Security, and More

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

HP Vertica at MIT Sloan Sports Analytics Conference March 1, 2013 Will Cairns, Senior Data Scientist, HP Vertica

Bank of America Security by Design. Derrick Barksdale Jason Gillam

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

DDC Sequencing and Redundancy

Big Data Research at DKRZ

Data Center Evolu.on and the Cloud. Paul A. Strassmann George Mason University November 5, 2008, 7:20 to 10:00 PM

Architectures for massive data management

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

BIG DATA CHALLENGES AND PERSPECTIVES

Integrating a Big Data Platform into Government:

Application Development. A Paradigm Shift

Mass Storage Structure

Data Mining in the Swamp

Protec'ng Informa'on Assets - Week 8 - Business Continuity and Disaster Recovery Planning. MIS 5206 Protec/ng Informa/on Assets Greg Senko

The State of Real-Time Big Data Analytics & the Internet of Things (IoT) January 2015 Survey Report

UNIFIED, END- TO- END EDISCOVERY

Cloud Computing Where ISR Data Will Go for Exploitation

Reduction of Data at Namenode in HDFS using harballing Technique

Big Data. The Big Picture. Our flexible and efficient Big Data solu9ons open the door to new opportuni9es and new business areas

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

lesson 1 An Overview of the Computer System

Big Data Use Cases. At Salesforce.com. Narayan Bharadwaj Director, Product Management

Big Data Are You Ready? Thomas Kyte

Transcription:

An Open Dynamic Big Data Driven Applica3on System Toolkit Craig C. Douglas University of Wyoming and KAUST This research is supported in part by the Na3onal Science Founda3on and King Abdullah University of Science & Technology (KAUST).

Computa3onal Sciences Morphs into Data Intensive Scien3fic Discovery (DISD) Big Data has become a superset of computa3onal sciences with applica3ons in all walks of life with the overriding ques3on, What if you had all of the data? Technology limits to some extent how big the Big Data can be, but over 3me the size has increased by 1,000 fold every few years since the 1950 s. 2

What Is Big Data?... It Depends Unit Approximately 10 n Related to Kilobyte (KB) 1,000 bytes 3 Circa 1952 computer memory Megabyte (MB) 1,000 KB 6 Circa 1976 supercomputer memory Gigabyte (GB) 1,000 MB 9 Mid 1980 s disk controller memory a^ached to a mainframe (with 128 MB memory) 10 Gigabytes (GB) 10,000 MB 10 2013 typical memory s3ck (16 GB) Terabyte (TB) 1,000 GB 12 2012 largest SSD in a laptop Petabyte (PB) 1,000 TB 15 250,000 DVD s or the en3re digital library of all known books wri^en in all known languages Exabyte (EB) 1,000 PB 18 175 EB copied to disk in 2010 (est.) Ze^abyte (ZB) 1,000 EB 21 2 ZB copied to disk in 2011 (est.) 10 Ze^abytes (ZB) 10,000 EB 22 Single NSA Big Dataset in 2013 (est.) 3

Dynamic Big Data Driven Applica3on Systems (DBDDAS) DBDDAS research is mo3vated from the following difficult key points: Obtaining and communica3ng out of sequence remote data of unknown quality to remote computers and using it to steer simula3ons. The obvious societal gains from more accurate, long term forecasts from nonlinear, 3me dependent, complex scien3fic and engineering models. Upda3ng the en3re DDDAS paradigm using recent Big Data research concepts. 4

DDDAS and BD All DDDAS have been dealing with the Big Data issue as long as there have been DDDAS. Each project has individually invented Big Data concepts with no real field wide consensus on how to handle Big Data. Combining the two paradigms not only makes sense, but is essen3al to crea3ng new DBDDAS quickly and efficiently. Re- inven'ng the wheel is bad science once the wheel has been iden'fied. 5

DDDAS Paradigm Extends conven3onal applica3ons by adding the following: Model state save, modifica3on, and restora3on. Ensemble data assimila3on algorithms to modify ensemble member states by comparing the data with synthe3c data of the same kind created from the simula3on state. Mul3scale algorithms for interpola3on within models. Retrieve, filter, and ingest data from intelligent sensors with poten3ally significant processing capabili3es. Visualize computed results on a wide variety of devices including mobile ones such as a tablet or smart phone. 6

Big Data Paradigm BD adds Structured query language (SQL) and Not only SQL (NoSQL) capabili3es for storing and retrieving data from dynamically growing databases or data streams. Flexible methods for handling large quan33es of data in highly parallel compu3ng environments, e.g., MapReduce, a parallel merge- sort algorithm. Fast, parallel read- write capabili3es, e.g., NetCDF or HDF5. Extremely large, robust, distributed file systems. Time dependent data retrieval based on a dynamically chosen 3me windows so that automa3c weigh3ng of data based on how recent it was acquired can be easily adopted. Big Data visualiza3on. 7

DBDDAS Commonali3es Sensors should provide data with known error bounds within 3me constraints. Correct sensors need to be chosen to provide good data from specific loca3ons. Data is not guaranteed to be accurate, complete, nor 3mely, but will arrive anyway and must be sensibly processed, including discarding. Sensors should be dynamically reprogrammable during simula3ons. 8

DBDDAS Commonali3es Rapid error growth in new data when steering is unavailable frequently occurs, causing simula3ons to be restarted. Automa3c new methods to recognize and curtail rapid error growth are essen3al. New data intensive methods are helpful. An applica3on s models, numerical methods, and most significant items that are tracked may need to be changed as a result of the incoming sensor data or a human in the loop. Extremely large quan33es of input and/or output data that is difficult to analyze. Inadequate visualiza3on of output data common. 9

DBDDAS Commonali3es Using Big Data analy3c techniques, Much longer running simula3ons are possible instead of running a large sequence of short running simula3ons and then combining each result into a video of predic3ons. Addi3onally, data can be be^er analyzed, visualized, managed, and accessed faster using Big Data methods than ad hoc data management methods created by nonexperts who excel otherwise in crea3ng applica3on simulators. 10

OpenDBDDAS Framework Much of the framework exists, but is not 3ed together. An open source sensor language standard that stays open. An open source MapReduce suitable for medical records that Must be HIPPA compliant. Must scale well for very large databases. Must have individual/group access capabili3es (ala Linux files) for records. Must not have complexity O(disk access) on a DFS. Should use OpenMP and MPI. Should use (disk) cache aware hashing methods. Will be useful well beyond medical records. 11

OpenDBDDAS Framework Cache aware, fast hash tables (not textbook variety). High performance disk I/O assuming there is already a (distributed) file system. SQL and NoSQL databases. Data warehouse with six point func3onality retrieval, extrac3on, conversion, quality control, store, no3fica3on Intelligent sensing and processing (ISP) capable. 12

OpenDBDDAS Framework DDDAS toolkit that allows Uncertainty quan3fica3on and sta3s3cal analysis. Numerical solvers for standard (nonlinear 3me dependent coupled) PDEs. Op3miza3on solvers. Data assimila3on using mul3scale interpola3on methods. Item iden3fica3on, tracking, and steering. Anomaly tracking and verifica3on. Human no3fica3on methods. Must be usable by non- computer scien'sts. 13

Typical Use of OpenDBDDAS Framework 14

Travel App: FlightCast A DBDDAS for flyers. Aimed primarily at: Frequent flyers, Par3cularly flying commuters. 15

FlightCast Model airplane connec3ons for a given route and day. Project is broken into mul3ple func3onal units: Sensors: poll and collect data (sonware agents) Transport: move data inside the DDDAS. Data store: repository for Big Data. Predictor: makes delay predic3ons. Corrector: analyze accuracy of predic3ons and modify parameters (uses BD to solve an inverse problem cheaply). 16

FlightCast 17

FlightCast 18

Puong It Together 19

DBDDAS Five major components 1. Sensors: collect data for the DDDAS sonware agents. 2. Transport: transfer data between the sensors and the other parts of the DDDAS. 3. Data store: central repository for collected data. 4. Predictor: predict flight delay and cancella3on based on historical data and current condi3ons. 5. Corrector: analyze the accuracy of predic3ons and make correc3ons to the predictor to improve accuracy over 3me. 20

Sensors Travellers Tracks traveller loca3on before, during, aner flight(s) Monitors preferences (want on 3me, want misses, etc.) Maximize frequent flyer miles given condi3ons Airlines, Threat level, and Weather (similar) Low level web requests to avoid graphics, ads, etc. cookies helped. Used tools for construc3ng simple compilers. Weather: wind speed + direc3on, visibility, temperature, type of precipita3on (could have added more) Lots of data from airlines and weather. Less so from TSA. 21

Sensors and Transport Seasonal and Loca3on specific trends This required a training period of at least one year. We used a number of years worth of weather data specific to a set of airports to build a database and correlated flight delays and cancella3ons to it. Transport Java based GRID compu3ng enabled Stored data in a SQL database. 22

Predictor- Corrector Predictor Combined data from the relevant sensors (airline, weather, threat level, seasonal, and a specific traveller), Weighted the data, Produced a probability (from 0 to 1) of either a significant delay or cancella3on. Each factor in the probability was produced using fuzzy logic. The factors used in the predic3ons were stored in the database for use by the corrector component later. 23

Predictor- Corrector Corrector Component reviewed the results of the predictor on a regular basis It is inexpensive computa3onally due to the informa3on saved in the predic3on component. The main purpose of the corrector is to adjust weights used in the predictor component. Predictor/corrector components ran mul3ple 3mes before the actual flight(s) and the corrector ran once more aner the flight(s). 24

FlightCast Most Like Future DBDDAS Building up a database of flight informa3on for just Delta and United airlines over one year generated several disk drives worth of data, even with rela3vely low sampling rates and used all standard BD techniques known. Aner several months of opera3ng FlightCast tracking one par3cular frequent flyer with a regular commute between New York s Laguardia and Lexington s airports, FlightCast had a be^er than 95% correct predic3on rate. The two airlines tracked had nothing similar to FlightCast and could only react to aircran delays and maintenance issues on the day of flights. To this day there is no tool similar to FlightCast available to the general public. 25