BIG DATA POSSIBILITIES AND CHALLENGES




BIG DATA POSSIBILITIES AND CHALLENGES PROFESSOR AND CENTER DIRECTOR

WHY BIG DATA? In God we trust - all others must bring data W. Edwards Deming (US engineer and statistician, 1900-1993)

WHAT IS BIG DATA? Wikipedia (en.wikipedia.org/wiki/big_data) All-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications Big Data characteristics Volume (often very large) Velocity (often arrives very fast) Variety (often varied/complex format/type/meaning) Veracity (often uncertain or imprecise)

BIG DATA AVAILABILITY Pervasive use of computers and sensors Ability to acquire/store/process data Big Data collected everywhere Society increasingly data driven Today as much data is created in two days as was created up until 2003!

BIG DATA EXAMPLE: THE INTERNET What happens in an internet minute?

BIG DATA EXAMPLE: TERRAIN DATA Previously 30-100 meter data E.g. Shuttle Radar Topography Mission (SRTM) near-global 90-meter dataset Now accurate meter or sub-meter data (e.g. LiDAR) Europe: Denmark, Sweden, Netherlands USA: NC, OH, PA, DE, IA, LA Denmark at 30-meter: ~46 million data points (GB) Current 2-meter model: ~12 billion data points (TB) Upcoming ½-meter model: ~168 billion data points

BIG DATA IMPORTANCE Nature/Science: Paradigm shift; Science will be about mining data The Economist: Managing data deluge difficult; doing so will transform business and public life Value is not in data creation but in data analysis!

BIG DATA ANALYSIS IMPORTANCE New York Times, 11/2/2012: The age of Big Data What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions.

BIG DATA ANALYSIS IMPORTANCE Dan Ariely: Big Data is like teenage sex: Everyone talks about it Nobody really knows how to do it Everyone thinks everyone is doing it So everyone claims they are doing it And like sex, the ones getting the most are smart enough not to talk about it

BIG DATA INCREASING IMPORTANCE Increasing government awareness of importance of Big Data analysis Big Data as a driver for growth Governments are increasingly supporting use of data through free data programs

POPULAR BIG DATA ANALYSIS EXAMPLES Google: Power of statistical methods on Big Data from the web Google Flu Trends - Statistically, certain search terms are good indicators of flu activity Google Translate - Not: Linguistic analysis to extract the meaning from syntax and vocabulary - Instead: Statistically most probable translation based on similar translations on the web

POPULAR BIG DATA ANALYSIS EXAMPLES Netflix: The power of recommendation systems Analysis of subscriber preferences created hit series House of Cards - Old (1990) British TV series still popular - Films featuring Kevin Spacey had always done well - Movies directed by David Fincher (The Social Network) had a healthy share

BIG DATA ANALYSIS CHALLENGES What questions should be asked What questions can be answered How can questions be answered How is Big Data processed efficiently How can different data be combined How is uncertainty handled What about legal issues What about privacy issues Researcher-industry-society collaboration important!

AFTERNOON CASES Many interesting projects/collaborations, including on Releasing and exploiting government, social media and newspaper data and how they are accessed Utilizing health care data to help mothers, newborns, school kids and hip patients alike, including in Africa Improving indoor service logistics, recycling systems and personal product offerings - as well as national and global markets Collecting data to model, analyze and improve air quality, traffic behavior, food perception - as well as animal farming Many good Big Data Big Impact examples involving researchers, industry and government

MADALGO CASES MADALGO cases involve efficient processing of big terrain data Cleaning ocean floor scanning data Flood risk screening both strong research/publications and new/improved products Important for success MADALGO algorithms research Domain and market knowledge of industry Startup SCALGO as development glue

CENTER FOR MASSIVE DATA ALGORITHMICS Established 2007 funded by Danish National Research Foundation 5 year renewal in 2012 (10 year budget > $25 million) - International evaluation: MADALGO is the world-leading center in the area of massive dataset algorithmics High level objectives Advance algorithmic knowledge in massive data algorithms area Train researchers in world-leading international environment Be catalyst for multidisciplinary collaboration

CENTER FOR MASSIVE DATA ALGORITHMICS Building on: Algorithms research focus areas: - I/O-efficient, cache-oblivious and streaming - Algorithm engineering Strong international team/environment Multidisciplinary and industry collaboration

I/O-EFFICIENT ALGORITHMS Problems involving Big Data on disk Disk access is 10^6 times slower than main memory access Large access time amortized by transferring large blocks of data Important to store/access data to take advantage of blocks I/O-efficient algorithms: Move as few disk blocks as possible to solve problem "The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)

I/O-EFFICIENT ALGORITHMS MATTER Example: Visiting data in order Array size N = 10 elements Disk block size B = 2 elements Main memory size M = 4 elements Algorithm 1: N = 10 disk accesses Algorithm 2: N/B = 5 disk accesses [Figure: two disk layouts of the array: 1 5 2 6 3 8 9 4 7 10 and 1 2 10 9 5 6 3 4 8 7] Difference between N and N/B huge N = 256 x 10^6, B = 8000, 1 ms disk access time N accesses take 256 x 10^3 sec = 4266 min = 71 hours N/B accesses take 256/8 sec = 32 seconds
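
The gap between N and N/B accesses can be checked with a small simulation. The sketch below is not from the slides: it assumes the two number sequences in the example are the on-disk orders of the array for Algorithm 1 and Algorithm 2, and it assumes main memory evicts the least recently used block. The names `count_block_reads` and `scan_seconds` are hypothetical.

```python
from collections import OrderedDict


def count_block_reads(layout, block_size, memory_size):
    """Count disk block reads needed to visit the values 1..N in increasing order.

    layout      -- values in the order they are stored on disk (assumed layout)
    block_size  -- B, elements per disk block
    memory_size -- M, elements that fit in main memory (LRU eviction assumed)
    """
    capacity = memory_size // block_size  # number of blocks that fit in memory
    block_of = {value: pos // block_size for pos, value in enumerate(layout)}
    cache = OrderedDict()  # block id -> True, ordered from least to most recent
    reads = 0
    for value in range(1, len(layout) + 1):
        block = block_of[value]
        if block in cache:
            cache.move_to_end(block)  # block already in memory, no disk access
        else:
            reads += 1  # block must be fetched from disk
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used block
    return reads


def scan_seconds(n_elements, elements_per_access, access_ms=1.0):
    """Total time when each disk access delivers `elements_per_access` elements."""
    return (n_elements / elements_per_access) * access_ms / 1000.0


scattered = [1, 5, 2, 6, 3, 8, 9, 4, 7, 10]  # consecutive values in different blocks
blocked = [1, 2, 10, 9, 5, 6, 3, 4, 8, 7]    # consecutive values share a block

print(count_block_reads(scattered, block_size=2, memory_size=4))
print(count_block_reads(blocked, block_size=2, memory_size=4))
print(scan_seconds(256_000_000, 1) / 3600)   # hours with one element per access
print(scan_seconds(256_000_000, 8000))       # seconds with B = 8000 per access
```

Under these assumptions the scattered layout costs one disk access per element, while the blocked layout costs one access per block, which is the N versus N/B difference the example describes.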

ALGORITHM ENGINEERING & COLLABORATION Much of the center's collaboration driven by algorithm engineering Design/implementation of practical algorithms & experimentation - Often provides valuable input to theoretical research work - Sometimes leads to practical breakthroughs MADALGO, COWI and SCALGO flood risk collaboration Started in 2006 as part of Strategic Research Council project Builds on MADALGO I/O-efficient algorithms research Unique big terrain data solutions and establishment of SCALGO Collaboration continues, including in Innovation Fund project Unique flood risk products

FLOOD RISK ANALYSIS IMPORTANCE Important to screen extreme rain or sea-level rise flood risk 50% of Danes worry about their homes being flooded (Userneeds) 90% of Danes say high flood risk affects decision to buy house Cost of 2011 Copenhagen flood over 6 billion kroner (Swiss Re) Potential to do so using detailed national elevation model Elevation for roughly every 2x2 meters, soon every ½x½ meter: hundreds or even thousands of points in a family home lot!

DETAILED (BIG) TERRAIN DATA ESSENTIAL Mandø 2 meter sea-level rise 90 meter terrain model 2 meter terrain model

DETAILED (BIG) TERRAIN DATA ESSENTIAL Drainage network (flow accumulation) 90 meter terrain model 2 meter terrain model

SURFACE FLOW MODELING Flow accumulation on grid terrain model: Initially one unit of water in each grid cell Water (initial and received) distributed from each cell to lowest lower neighbor cell Flow accumulation of cell is total flow through it Note Flow accumulation of cell = size of upstream area Drainage network = cells with high flow accumulation Flow stops/disappears in depressions -> model often filled
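
The flow accumulation rule above can be sketched in a few lines. This is a plain in-memory illustration, not the I/O-efficient algorithm developed at the center; it assumes each cell has up to eight neighbors and that cells with no lower neighbor (depressions) simply keep their water. The function name `flow_accumulation` and the sample grid are hypothetical.

```python
def flow_accumulation(heights):
    """Return a grid holding the total flow through each cell.

    Each cell starts with one unit of water and passes everything it holds
    to its lowest strictly lower neighbor (8-neighborhood assumed).
    """
    rows, cols = len(heights), len(heights[0])
    flow = [[1] * cols for _ in range(rows)]  # initially one unit per cell

    # Visit cells from highest to lowest so that a cell has received all
    # upstream water before it passes its own total downhill.
    cells = sorted(
        ((heights[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )
    for height, r, c in cells:
        best = None  # lowest strictly lower neighbor found so far
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and heights[nr][nc] < height:
                    if best is None or heights[nr][nc] < heights[best[0]][best[1]]:
                        best = (nr, nc)
        if best is not None:  # depressions keep their water
            flow[best[0]][best[1]] += flow[r][c]
    return flow


terrain = [
    [9, 8, 7],
    [8, 5, 6],
    [7, 6, 1],
]
for row in flow_accumulation(terrain):
    print(row)
```

On this small grid all water ends up in the lowest corner, so that cell's flow accumulation equals the number of cells in the grid, matching the note that flow accumulation is the size of the upstream area.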

FLOW ACCUMULATION PERFORMANCE Natural algorithm accesses disk for each grid cell Push flow down the terrain by visiting cells in height order Problem since cells of same height scattered over terrain Performance of commercial systems often not satisfactory Cannot handle Denmark at 2-meter resolution We developed I/O-optimal algorithms Now handle Denmark 2-meter model in a day on limited 4GB desktop!

FLOW ACCUMULATION SUCCESS STORY Shuttle Radar Topography Mission (SRTM) Near-global dataset 3-arc seconds (90-meter at equator) raster ~60 billion cells stored in roughly 14,000 files Large USGS HydroSHEDS project produced hydrologically conditioned 90-meter data But upscaled to 500-meter to compute flow accumulation Using I/O-efficient algorithms: One day on standard 4GB workstation!

FLASH FLOOD MAPPING Models how surface water gathers in depressions as it rains Water from watershed of depression gathers in the depression Depressions fill, leading to (dramatic) increase in neighbor depression watershed size [Figure: depressions labeled with watershed area and volume] Flash Flood Mapping: Amount of rain before any given raster cell is below water
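
The relation between depression volume, watershed area and rain amount can be illustrated with a simplified calculation. The sketch below is not the Flash Flood Mapping product's algorithm: it assumes all rain on a watershed runs into its depression, that rain falls uniformly, and that a full upstream depression spills everything into a single downstream neighbor. The function names and the sample numbers are hypothetical.

```python
def rain_to_fill_mm(volume_m3, watershed_m2):
    """Rain depth (mm) needed to fill a depression fed only by its own watershed."""
    return 1000.0 * volume_m3 / watershed_m2


def rain_to_fill_downstream_mm(up_volume_m3, up_area_m2, down_volume_m3, down_area_m2):
    """Rain depth (mm) needed to fill a depression that also receives spill.

    Once the upstream depression is full, its watershed effectively becomes
    part of the downstream watershed (assumed single spill path).
    """
    up_full = rain_to_fill_mm(up_volume_m3, up_area_m2)
    alone = rain_to_fill_mm(down_volume_m3, down_area_m2)
    if alone <= up_full:
        return alone  # downstream fills before any spill arrives
    # After the upstream depression fills, both watersheds feed both volumes.
    return rain_to_fill_mm(up_volume_m3 + down_volume_m3, up_area_m2 + down_area_m2)


# A 500 m^3 depression with a 10,000 m^2 watershed.
print(rain_to_fill_mm(500, 10_000))

# A small upstream depression spilling into a larger downstream one.
print(rain_to_fill_downstream_mm(100, 10_000, 900, 10_000))
```

In the second case the downstream depression fills much earlier than its own watershed would suggest, which is the jump in watershed size the slide refers to.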

FLASH FLOOD MAPPING EXAMPLE After 10mm rain After 50mm rain After 100mm rain After 150mm rain

FLASH FLOOD MAPPING SUCCESS STORY Based on collaborative research, COWI markets the SCALGO-produced Flash Flood Mapping product in Denmark under the name Skybrudskort Produced for entire country Sold to over half of local governments Jones Edmunds compared Flash Flood Mapping to result of advanced dynamic model (ICPR) for Marion County, Florida Results very close Significantly more detailed Cost under 5%

AFTERNOON: ONLINE DEMONSTRATION

CONCLUSIONS Hope to have convinced you that Big Data has huge potential - in research, industry and society Exploiting Big Data challenging - research-industry-society collaboration one way to success Thanks! large@cs.au.dk www.madalgo.au.dk