BIG DATA: POSSIBILITIES AND CHALLENGES. Professor and Center Director.
WHY BIG DATA? "In God we trust; all others must bring data." W. Edwards Deming (US engineer and statistician, 1900-1993)
WHAT IS BIG DATA? Wikipedia (en.wikipedia.org/wiki/Big_data): an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. Big Data characteristics: Volume (often very large), Velocity (often arrives very fast), Variety (often varied/complex format/type/meaning), Veracity (often uncertain or imprecise).
BIG DATA AVAILABILITY Pervasive use of computers and sensors. Ability to acquire/store/process data. Big Data collected everywhere. Society increasingly data driven. Today we create as much data in two days as we did in total up to 2003!
BIG DATA EXAMPLE: THE INTERNET What happens in an internet minute?
BIG DATA EXAMPLE: TERRAIN DATA Previously 30-100 meter data, e.g. the Shuttle Radar Topography Mission (SRTM) near-global 90-meter dataset. Now accurate meter or sub-meter data (e.g. LiDAR): Europe: Denmark, Sweden, Netherlands; USA: NC, OH, PA, DE, IA, LA. Denmark at 30-meter resolution: ~46 million data points (GBs); current 2-meter model: ~12 billion data points (TBs); upcoming ½-meter model: ~168 billion data points.
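A rough sanity check of these counts (a Python sketch; the ~43,000 km² land area is an assumption not on the slide, and the official model extent differs slightly, e.g. due to coastal buffering):

```python
# Approximate point counts for gridded elevation models of Denmark,
# assuming a land area of ~43,000 km2 (an assumption, not slide data).
area_m2 = 43_000 * 1000 * 1000            # ~4.3e10 m2
for res in (30, 2, 0.5):                   # grid resolution in meters
    points = area_m2 / (res * res)         # one point per res x res cell
    print(f"{res:>4} m grid: ~{points:.2e} points")
# 30 m  -> ~4.8e7  (slide: ~46 million)
# 2 m   -> ~1.1e10 (slide: ~12 billion)
# 0.5 m -> ~1.7e11 (slide: ~168 billion)
```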
BIG DATA IMPORTANCE Nature/Science: paradigm shift; science will be about mining data. The Economist: managing the data deluge is difficult; doing so will transform business and public life. Value is not in data creation but in data analysis!
BIG DATA ANALYSIS IMPORTANCE New York Times, 11/2/2012: The Age of Big Data. "What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions."
BIG DATA ANALYSIS IMPORTANCE Dan Ariely: Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. And like sex, the ones getting the most are smart enough not to talk about it.
BIG DATA INCREASING IMPORTANCE Increasing government awareness of the importance of Big Data analysis. Big Data as a driver for growth. Governments increasingly support the use of data through free/open data programs.
POPULAR BIG DATA ANALYSIS EXAMPLES Google: the power of statistical methods on Big Data from the web. Google Flu Trends: certain search terms are statistically good indicators of flu activity. Google Translate: not linguistic analysis extracting meaning from syntax and vocabulary, but instead the statistically most probable translation based on similar translations on the web.
POPULAR BIG DATA ANALYSIS EXAMPLES Netflix: the power of recommendation systems. Analysis of subscriber preferences created the hit series House of Cards: the old (1990) British TV series was still popular, films featuring Kevin Spacey had always done well, and movies directed by David Fincher (The Social Network) had a healthy share.
BIG DATA ANALYSIS CHALLENGES What questions should be asked? What questions can be answered? How can questions be answered? How is Big Data processed efficiently? How can different data be combined? How is uncertainty handled? What about legal issues? What about privacy issues? Researcher-industry-society collaboration important!
AFTERNOON CASES Many interesting projects/collaborations, including on: releasing and exploiting government, social media and newspaper data, and how they are accessed; utilizing health care data to help mothers, newborns, school kids and hip patients alike, including in Africa; improving indoor service logistics, recycling systems and personal product offerings, as well as national and global markets; collecting data to model, analyze and improve air quality, traffic behavior, food perception, as well as animal farming. Many good Big Data, Big Impact examples involving researchers, industry and government.
MADALGO CASES MADALGO cases involve efficient processing of big terrain data: cleaning ocean floor scanning data and flood risk screening. Both yielded strong research/publications and new/improved products. Important for success: MADALGO algorithms research, domain and market knowledge of industry, and startup SCALGO as development glue.
CENTER FOR MASSIVE DATA ALGORITHMICS Established 2007, funded by the Danish National Research Foundation; 5-year renewal in 2012 (10-year budget > $25 million). International evaluation: MADALGO is the world-leading center in the area of massive dataset algorithmics. High-level objectives: advance algorithmic knowledge in the massive data algorithms area, train researchers in a world-leading international environment, be a catalyst for multidisciplinary collaboration.
CENTER FOR MASSIVE DATA ALGORITHMICS Established 2007, funded by the Danish National Research Foundation; 5-year renewal in 2012 (10-year budget > $25 million). International evaluation: MADALGO is the world-leading center in the area of massive dataset algorithmics. Building on algorithms research focus areas: I/O-efficient, cache-oblivious and streaming algorithms; algorithm engineering. Strong international team/environment. Multidisciplinary and industry collaboration.
I/O-EFFICIENT ALGORITHMS Problems involving Big Data on disk. Disk access is 10^6 times slower than main memory access; the large access time is amortized by transferring large blocks of data, so it is important to store/access data to take advantage of blocks. I/O-efficient algorithms: move as few disk blocks as possible to solve the problem. "The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
I/O-EFFICIENT ALGORITHMS MATTER Example: visiting data in order. Array size N = 10 elements, disk block size B = 2 elements, main memory size M = 4 elements. Algorithm 1: N = 10 disk accesses; Algorithm 2: N/B = 5 disk accesses. [Figure: the same 10 elements laid out on disk scattered across blocks vs. packed in visit order.] The difference between N and N/B is huge: with N = 256x10^6, B = 8000 and 1 ms disk access time, N accesses take 256x10^3 sec = 4266 min = 71 hours, while N/B = 32,000 accesses take 32 seconds.
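A minimal sketch of why the layout matters, assuming a disk that serves whole blocks of B elements and, for simplicity, a cache of just one block (the slide's M = 4 would only help the scattered layout a little); the scattered layout below is a hypothetical example:

```python
def block_fetches(visit_order, layout, B):
    """Count block reads with a trivial one-block cache.

    layout[x] is the disk position of element x; positions p and q
    share a block iff p // B == q // B.
    """
    fetches, cached = 0, None
    for x in visit_order:
        block = layout[x] // B
        if block != cached:            # needed element not in cached block
            fetches += 1
            cached = block
    return fetches

N, B = 10, 2
order = list(range(N))                            # visit elements 0..9 in order
scattered = [0, 2, 4, 7, 1, 3, 9, 5, 6, 8]        # hypothetical scattered layout
packed = list(range(N))                           # stored in visit order
print(block_fetches(order, scattered, B))         # ~N fetches (10)
print(block_fetches(order, packed, B))            # N/B fetches (5)
```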
ALGORITHM ENGINEERING & COLLABORATION Much of the center's collaboration is driven by algorithm engineering: design/implementation of practical algorithms and experimentation, which often provides valuable input to theoretical research work and sometimes leads to practical breakthroughs. MADALGO, COWI and SCALGO flood risk collaboration: started in 2006 as part of a Strategic Research Council project; builds on MADALGO I/O-efficient algorithms research; unique big terrain data solutions and the establishment of SCALGO. Collaboration continues, including in an Innovation Fund project: unique flood risk products.
FLOOD RISK ANALYSIS IMPORTANCE Important to screen for extreme rain or sea-level rise flood risk: 50% of Danes worry about their homes being flooded (Userneeds); 90% of Danes say high flood risk affects their decision to buy a house; the cost of the 2011 Copenhagen flood was over 6 billion kroner (Swiss Re). Potential to do so using the detailed national elevation model: elevation for roughly every 2x2 meters, soon every ½x½ meter, i.e. hundreds or even thousands of points in a family home lot (e.g. an 800 m² lot holds ~200 points at 2x2 m and ~3,200 at ½x½ m)!
DETAILED (BIG) TERRAIN DATA ESSENTIAL Mandø, 2-meter sea-level rise: 90-meter terrain model vs. 2-meter terrain model.
DETAILED (BIG) TERRAIN DATA ESSENTIAL Drainage network (flow accumulation): 90-meter terrain model vs. 2-meter terrain model.
SURFACE FLOW MODELING Flow accumulation on a grid terrain model: initially one unit of water in each grid cell; water (initial and received) is distributed from each cell to its lowest lower neighbor cell; the flow accumulation of a cell is the total flow through it. Note: the flow accumulation of a cell equals the size of its upstream area, and the drainage network consists of the cells with high flow accumulation. Flow stops/disappears in depressions, so the model is often filled first. The model is sketched in code below.
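A minimal in-memory sketch of this model (a Python sketch assuming a numpy height grid, single-flow-direction routing to the lowest of the 8 neighbors, and no ties or flat areas):

```python
import numpy as np

def flow_accumulation(heights):
    """Flow accumulation on a grid DEM (in-memory sketch of the model above).

    Every cell starts with one unit of water; cells are visited from highest
    to lowest, and each passes its accumulated water to its lowest strictly
    lower 8-neighbor. Cells with no lower neighbor are depression bottoms,
    where flow stops (hence terrain models are often "filled" first).
    """
    rows, cols = heights.shape
    acc = np.ones((rows, cols))
    for idx in np.argsort(heights, axis=None)[::-1]:   # decreasing height
        r, c = divmod(int(idx), cols)
        best = None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols \
                        and heights[nr, nc] < heights[r, c] \
                        and (best is None or heights[nr, nc] < heights[best]):
                    best = (nr, nc)
        if best is not None:
            acc[best] += acc[r, c]        # push all water to lowest lower neighbor
    return acc
```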
FLOW ACCUMULATION PERFORMANCE The natural algorithm accesses disk for each grid cell: it pushes flow down the terrain by visiting cells in height order, a problem since cells of the same height are scattered over the terrain. The performance of commercial systems is often not satisfactory; they cannot handle Denmark at 2-meter resolution. We developed I/O-optimal algorithms that now handle the Denmark 2-meter model in a day on a limited 4GB desktop!
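The sketch above writes to `acc` at scattered grid positions, which is exactly the access pattern that breaks down on disk. One way to avoid it, in the spirit of the time-forward processing technique used in I/O-efficient algorithms, is to forward flow through a priority queue keyed by the receiving cell's position in the height order, so the grid itself is only touched in sorted order. A minimal in-memory sketch with Python's heapq standing in for an I/O-efficient priority queue (an illustration, not MADALGO's actual implementation):

```python
import heapq
import numpy as np

def flow_accumulation_tfp(heights):
    """Flow accumulation with flow forwarded via a priority queue.

    Cells are processed from highest to lowest; flow sent to a lower
    neighbor is queued under the neighbor's processing step, so it is
    already waiting in the queue when that neighbor's turn comes.
    """
    rows, cols = heights.shape
    n = rows * cols
    order = np.argsort(heights, axis=None)[::-1]    # decreasing height
    rank = np.empty(n, dtype=np.int64)
    rank[order] = np.arange(n)                      # cell index -> step number
    pq, acc = [], np.zeros(n)
    for step, idx in enumerate(order):
        flow = 1.0                                  # the cell's own rain unit
        while pq and pq[0][0] == step:              # collect forwarded flow
            flow += heapq.heappop(pq)[1]
        acc[idx] = flow
        r, c = divmod(int(idx), cols)
        best = None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols \
                        and heights[nr, nc] < heights[r, c] \
                        and (best is None or heights[nr, nc] < heights[best]):
                    best = (nr, nc)
        if best is not None:                        # forward; no random grid write
            heapq.heappush(pq, (int(rank[best[0] * cols + best[1]]), flow))
    return acc.reshape(rows, cols)
```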
FLOW ACCUMULATION SUCCESS STORY Shuttle Radar Topography Mission (SRTM): near-global dataset, 3-arc-second (90-meter at equator) raster, ~60 billion cells stored in roughly 14,000 files. The large USGS HydroSHEDS project produced hydrologically conditioned 90-meter data, but upscaled it to 500-meter to compute flow accumulation. Using I/O-efficient algorithms: one day on a standard 4GB workstation!
FLASH FLOOD MAPPING Models how surface water gathers in depressions as it rains: water from the watershed of a depression gathers in the depression; depressions fill, leading to a (dramatic) increase in the watershed size of neighboring depressions. [Figure: a depression with its watershed area and volume marked.] Flash Flood Mapping: the amount of rain before any given raster cell is below water. A toy sketch of the fill-and-spill arithmetic follows below.
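The sketch below uses simplifying assumptions not on the slide: depressions form a single downstream chain (real terrain yields a tree of depressions), each has a fixed volume and watershed area, and rain falls uniformly. It computes the rain depth at which each depression fills:

```python
def fill_thresholds(depressions):
    """Rain depth (m) at which each depression in a chain fills.

    `depressions` lists (volume_m3, watershed_area_m2) pairs, ordered so
    that depression i overflows into depression i+1 once it is full.
    """
    thresholds, V, A = [], 0.0, 0.0      # totals for the full upstream chain
    for v, a in depressions:
        d = v / a                         # fill depth from own watershed alone
        if A > 0 and d > V / A:           # upstream spills first and helps fill
            d = (V + v) / (A + a)         # combined-reservoir fill depth
        thresholds.append(d)
        V += v
        A += a
    return thresholds

# E.g. a small depression upstream of a big one (hypothetical numbers):
print(fill_thresholds([(50.0, 10_000.0), (2_000.0, 20_000.0)]))
# [0.005, 0.0683...] i.e. full after 5 mm and ~68 mm of rain
```

Once the upstream depression is full (here after 5 mm), every extra drop on its watershed spills downstream, which is the "dramatic increase in watershed size" the slide describes.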
FLASH FLOOD MAPPING EXAMPLE After 10mm, 50mm, 100mm and 150mm of rain.
FLASH FLOOD MAPPING SUCCESS STORY Based on collaborative research, COWI markets the SCALGO-produced Flash Flood Mapping product in Denmark under the name Skybrudskort: produced for the entire country and sold to over half of local governments. Jones Edmunds compared Flash Flood Mapping to the result of an advanced dynamic model (ICPR) for Marion County, Florida: results very close, significantly more detailed, and cost under 5%.
AFTERNOON: ONLINE DEMONSTRATION
CONCLUSIONS Hope to have convinced you that Big Data has huge potential in research, industry and society. Exploiting Big Data is challenging; research-industry-society collaboration is one way to success. Thanks! large@cs.au.dk www.madalgo.au.dk