Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University http://www2.docm.mmu.ac.uk/staff/l.han/ June, 2014
Outline Data tsunami What is big data? Value of big data Challenges of big data Technologies for big data Data exploration for future roadmap @Funds.MMU
Data tsunami Increased capability of generating and capturing data (e.g. petascale simulations, experimental devices, the Internet, sensors, etc.) 300m photos, 2.5m contents shared per day caBIG: 4.7+ million for cancers
Data tsunami Gene expression data in GEO and ArrayExpress: over 1 million Climate data from NASA: 32 Petabytes (1 Petabyte = 2^50 bytes) SKA (the Square Kilometre Array): the data collected by the SKA in a single day would take nearly two million years to play back on an iPod.
Slide Credit: Intel
Data tsunami Data intensive era: big data/data rich/data-centric/data-driven era [Chart: Data Volumes by year, 2010, 2011 and 2020]
What is big data? Data size representation Binary digit (bit); Byte (B): 8 bits; Kilobyte (KB): 2^10 bytes; Megabyte (MB): 2^20 bytes; Gigabyte (GB): 2^30 bytes; Terabyte (TB): 2^40 bytes; Petabyte (PB): 2^50 bytes; Exabyte (EB): 2^60 bytes; Zettabyte (ZB): 2^70 bytes; Yottabyte (YB): 2^80 bytes
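The unit table above follows a single rule: each unit is 2^10 times the previous one. A minimal sketch of that arithmetic follows; the helper names `unit_size` and `humanise` are hypothetical and chosen only for illustration.

```python
# Binary (power-of-two) data size units, in the order listed on the slide.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def unit_size(unit: str) -> int:
    """Return the number of bytes in one `unit` (1 KB = 2**10 bytes, and so on)."""
    return 2 ** (10 * UNITS.index(unit))


def humanise(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    for unit in reversed(UNITS):
        if n_bytes >= unit_size(unit):
            return f"{n_bytes / unit_size(unit):.1f} {unit}"
    return f"{n_bytes} B"


# 32 petabytes expressed in bytes, then converted back to a readable string.
print(humanise(32 * unit_size("PB")))
```

Note that these are the binary definitions used on the slide; decimal (power-of-ten) definitions of the same unit names also exist and give slightly smaller values.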
What is big data? 1998 - the origins of big data: "Big Data... and the Next Wave of InfraStress", John R. Mashey, Chief Scientist, SGI, 4/25/98 ("Technology Waves: NOT technology for technology's sake. IT'S WHAT YOU DO WITH IT. But if you don't understand the trends, IT'S WHAT IT WILL DO TO YOU") 2001 - "3D Data Management: Controlling Data Volume, Velocity, and Variety" by Doug Laney 2010 - widespread in the Economist ("Data, data everywhere: A special report on managing information") 2012 - Gartner, IBM, Cisco, Microsoft, etc
What is big data? A relative term (don't define it in terms of size being larger than a certain number of terabytes or petabytes) Larger, more complex and harder to access, organise and analyse than existing tools can handle (varying by sector) The data volume, velocity or variety/complexity (3 Vs) limits the ability to perform effective analysis using traditional approaches
What is big data? Big data is about pushing limits!
What is big data? Volume (data at rest) The size and scale of the data By 2015, it will reach 8 Zettabytes (IDC)
What is big data? Velocity (data in motion) Real time capture and analytics/streaming processing and analytics Stock exchange, fraud analysis/customer churn predictions
What is big data? Variety/complexity (data in many forms) Various formats, types and structures Structured data, e.g. data defined by schema, relational databases, or semi-structured data (XML) Unstructured data, e.g. free-form text, emails, logs, images, audio, video, social media data (e.g. graph)
What is big data? Two more Vs Value: business value to be derived Veracity (data in doubt): the quality and understandability of the data
Value of big data Next frontier for innovation, competition and productivity: Commerce and economy Scientific discovery in almost every science and engineering discipline for addressing societal challenges (health, food, energy, environment, etc)
Value of big data Source: wikibon
Value of big data New paradigm Big Data leads science discovery
Source: samsung Challenges of big data Bottleneck in Technology: IT infrastructures
Source: eskills Challenges of big data Bottleneck in technical skills: professionals to handle big data
Technologies for big data What kinds of big data technologies come to mind? Cloud computing?...
Technologies for big data Big data processing and analytics Architectures for efficiently processing big data Data analytics for filtering, analysing and generating actionable insights [Diagram labels: Data volume, variety, complexity (Low to High); Data processing and analysis; Data value (Low to High)]
Architectures Traditional approaches, for example, OLTP (online transaction processing) and OLAP (online analytical processing): data warehouse [Diagram labels: Operations, OLTP, Business strategy, Business process, OLAP, Information, Data analytics, Decision making, Business data warehouse]
Architectures Issues Relational databases (RDBMS) deal with structured data only, don't support complex analytics, and lack scalability and performance
Architectures Current solutions Apache Hadoop: an open-source framework for storage and large-scale processing of data sets (both structured and unstructured data (NoSQL)); major components: HDFS, MapReduce, Hadoop Common, Hadoop YARN Google File System and MapReduce Apache Spark: combines SQL, streaming, complex analytics and in-memory computing
Architectures Parallel and distributed computing for data processing
Architectures Parallelisation: a sought-after solution for speeding up an application, particularly for data intensive applications Three considerations: How to distribute workloads or decompose an algorithm into parts How to map the tasks onto various computing nodes and execute subtasks in parallel How to coordinate the subtasks on those computing nodes and communicate between them.
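The three considerations above can be sketched on a single machine. This is only an illustration under stated assumptions: threads from Python's standard `concurrent.futures` module stand in for computing nodes, and the function names `decompose`, `subtask` and `run_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data, n_parts):
    """Consideration 1: split the workload into roughly equal chunks."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def subtask(chunk):
    """The work done independently on each chunk (here: a partial sum of squares)."""
    return sum(x * x for x in chunk)


def run_parallel(data, n_workers=4):
    chunks = decompose(data, n_workers)
    # Consideration 2: map subtasks onto workers (threads stand in for nodes).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(subtask, chunks))
    # Consideration 3: coordinate by gathering partial results and combining them.
    return sum(partials)


print(run_parallel(list(range(10))))
```

In a real distributed setting the coordination step also has to deal with network communication and node failures, which this sketch leaves out.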
Architectures Data parallelism: workloads are distributed across different computing nodes and the same task is executed on different subsets of the data simultaneously Task parallelism: tasks are independent and can be executed purely in parallel Pipelining: a task consists of many stages chained and executed in order, where the output of one stage is the input of the next. Pipelining can be implemented with or without streaming
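The difference between pipelining with and without streaming can be illustrated with a small three-stage pipeline. This is a sketch only; the stage names `parse`, `square` and `keep_even` are hypothetical examples and not taken from any particular framework.

```python
def parse(lines):
    """Stage 1: turn each text line into an integer."""
    for line in lines:
        yield int(line.strip())


def square(numbers):
    """Stage 2: transform each value."""
    for n in numbers:
        yield n * n


def keep_even(numbers):
    """Stage 3: filter the transformed values."""
    for n in numbers:
        if n % 2 == 0:
            yield n


def streaming_pipeline(lines):
    # With streaming: generators are chained, so each item flows through
    # all stages before the next item is read.
    return list(keep_even(square(parse(lines))))


def batch_pipeline(lines):
    # Without streaming: each stage materialises its full output
    # before the next stage starts.
    parsed = list(parse(lines))
    squared = list(square(parsed))
    return list(keep_even(squared))


print(streaming_pipeline(["1", "2", "3", "4"]))
print(batch_pipeline(["1", "2", "3", "4"]))
```

Both variants give the same result; the streaming form keeps only one item in flight per stage, which matters when the input does not fit in memory.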
Architectures Programming models for parallel and distributed computing (e.g. MPI, MapReduce, POSIX Threads, OpenMP, etc) Bridging the gap between the underlying hardware and the supporting layers of software available to applications Independent of programming languages and APIs
Architectures MapReduce: a programming model for processing large-scale datasets Implementations (e.g. Google MapReduce, Apache Hadoop)
Architectures Map and Reduce functions Map: performs a function on each individual value of a data set and returns a new list of values. Given a dataset A={1, 2, 3} and Map function Square = X*X, the Map process returns {1, 4, 9}
Architectures Reduce: performs a function that combines the values in a data set. Given a dataset B={1, 4, 9} and Reduce function sum = X1+X2+X3, the Reduce process returns 14
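As a minimal sketch, squaring the values {1, 2, 3} and then summing the squares can be written with Python's built-in `map` and `functools.reduce`. The variable names are illustrative only.

```python
from functools import reduce

A = [1, 2, 3]

# Map: apply the square function to each individual value.
squared = list(map(lambda x: x * x, A))

# Reduce: combine the mapped values into a single result by summation.
total = reduce(lambda x, y: x + y, squared)

print(squared)  # [1, 4, 9]
print(total)    # 14
```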
Architectures MapReduce example: count the occurrences of the words "hello" and "big data" in a large data set. Map output: <Hello, 4> <big data, 5> and <Hello, 2> <big data, 3>. Reduce output: <Hello, 6> <big data, 8>
Architectures Comparison of RDBMS-based approaches, Spark and MapReduce
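The counting example can be simulated in a single process. This is a sketch under stated assumptions, not how Hadoop itself is programmed: the two input splits are synthetic strings built to give the per-split counts shown on the slide, and `map_phase`, `shuffle` and `reduce_phase` are hypothetical names.

```python
from collections import defaultdict

TERMS = ["hello", "big data"]


def map_phase(split):
    """Map: emit a <term, count> pair for each term in one input split."""
    text = split.lower()
    return [(term, text.count(term)) for term in TERMS]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts for each term."""
    return {key: sum(values) for key, values in groups.items()}


# Synthetic input splits (an assumption made for illustration).
splits = ["Hello " * 4 + "big data " * 5, "Hello " * 2 + "big data " * 3]

mapped = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'hello': 6, 'big data': 8}
```

In a real deployment each split would be mapped on a different node, and the shuffle step would move the pairs across the network to the reducers.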
Data analytics Data analytics: discovery of useful, possibly unexpected, patterns in data; automation of data exploration and analysis Statistical analysis Machine learning/data mining, for example classification, clustering, association rules, regression, graph mining,...
Data analytics Classification AMD Diabetic Retinopathy
Data analytics Clustering Clustering of the fish industry in the UK [Figure: Two clusters of Scotland using K-mean clustering algorithm] [Figure: Fishing industry]
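Clustering can be illustrated with k-means, one common algorithm. The sketch below runs on one-dimensional toy data with hand-picked starting centroids; both the data and the function name `kmeans_1d` are assumptions made for illustration and are unrelated to the fish industry data set.

```python
def kmeans_1d(values, centroids, iterations=10):
    """A basic k-means loop on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


centres, groups = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1.0, 12.0])
print(centres)  # [2.0, 11.0]
print(groups)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```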
Data analytics Graph mining... and so on Community detection in Facebook friends Source: http://wisonets.wordpress.com/
Big data technology = Architectures for data processing and management +Data analytics
Technologies for big data Future development: programming abstractions need to be developed to support and facilitate big data processing and analytics [Diagram layers: Apps; Programming abstractions to support big data processing and analytics; RDBMS, NoSQL, DFS...]
Data exploration for future roadmap@funds.mmu We focus on: both fundamental and applied research in large-scale data processing and analytics; intelligent management and optimisation of large-scale networked distributed systems (challenges: reliability, scalability, security, resilience, autonomy and self-adaptation) http://www.scmdt.mmu.ac.uk/research/funds/
Data exploration for future roadmap@funds.mmu Food Health Future Energy People sustainability society Planning Manufacturing
Acknowledgement & Collaboration BBSRC EPSRC-DHPA Sustainability Society Network+ Amazon MMU Optos Fera MRC HGU
Acknowledgement & Collaboration Outside: MRC HGU, University of Edinburgh University of Manchester University of Strathclyde Heriot-Watt University Loughborough University University of Glasgow... Optos Fera... University of Melbourne
Acknowledgement & Collaboration Inside: School of Science and Environment Business School School of Engineering Department of Sociology...
Thank You!