1 BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
2 Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing : Challenges and Opportunities. * Suggestions * Conclusion
3 Research Area * Presently there are 5 PhD students and 13 awarded PhDs that are working or have worked in the following research areas together with Prof. Maozhen Li at the Brunel University * Research topics include : * High performance computing (grid and cloud computing) * Big data analysis (MapReduce/Hadoop) * Semantic Web, artificial intelligence, multi- agent systems * Information retrieval, content based image annotation and retrieval * Distributed machine learning techniques * Context aware mobile computing
4 INTRODUCTION * Big Data has always been around, its been relative to its time.
5 Multiple faces of Big Data * For scientists, Big- Data is about getting better output from simulations or handling output from simulations * Data is also coming from sensors (New world of IPV6 is about the Internet of objects) here data can be used by different fields Integrating Data can never be over emphasized
6 CHALLENGES OF BIG DATA * Storage * Clearly not enough hard disks/devices. Distributed storage is still not enough, manufacturers cannot make enough storage devices in time. Speed in writing to devices, bigger data paths/data- bus * Processing * Integrating data using Filters * What Data and How? * Effective Data processing system Design *??Power * Internationalisation / Standardisation * Latency and Bandwidth * Taxonomy and Ontology * How to classify big data No standard way of doing that yet * Security / Privacy * What is to be secured the Data sources? As IT logs are also now a source of big data
7 BIG DATA SQL vs NoSQL * In Big Data SQL which are used for RDBMS provide ACID * Atomicity = all or nothing * Consistency = same data values b4 and afta. * Isolation = hidden events during transactions (a form of security) * Durability = survive subsequent malfunction * NoSQL provide BASE * Basically Available = since it allows parts of the system to fail. * Soft Sate = objects with simultaneous values. * Eventually consistent = achieved over time.
8 * Definition Cloud Computing
9 Cloud Delivery Types
10 Cloud Computing * With Cloud computing here IT expertise is narrowed and professionalism is brought into play. one concentrates on the main issue at hand and builds professionalism and expertise, well grounded in a particular field. * Allows for horizontal provisioning as against the present vertical provision.(elasticity)
11 Cloud Computing and Data Management * Types of Data * Transactional Data Management * ACID (Atomicity, Consistency, Isolation, and Durability) * Risk / Privacy (especially when placed in clouds) * Typically cannot scale on the cloud * Analytical Data Management * Scalability (Shared- Nothing Architecture) * Availability is key over ACID * CAP s theorem??
12 Cloud Computing - Open issues * The present Cloud computing design has some issues yet to be solved: * Basic DBMS has not been tailored for Cloud Computing * Data Acts is a serious issue so it would be ideal to have Data Centres located closer the user than the provider. * Data Replication must be carefully done else it affects data integrity and gives an error prone analysis * Trust in the event of mission critical data. * Some deployment models are still in their formative stage.
13 SUGGESTIONS * Data management * Life- cycle management can start now, once information has been extracted from the data * Better software design * Build software applications around the expected data/information processing and not around CPU processing. * Cloud computing: Data processing in the could should provide the following : * Efficiency * Fault tolerance: does not have to restart a query if 1 of d nodes in the query fails but then comes with a trade- off for performance. * Ability to run in a heterogeneous environment: shared task across cloud compute nodes. * Ability to operate on encrypted data: to build trust data maybe encrypted before uploaded onto the clod and the DBMS application should be able to work on the data without decrypting it. * Ability to interface with business intelligence products or other applications
14 Example of Data Management lifecycle Plan Collection Processing Publish Analyze Archive Discard Definition of what & how From sensors, measurements & studies Integrate, transform Share processed data Discover & inform Long term storage No longer useful
15 The Cloud Opportunities * Cloud provide opportunities for real time data streams analysis because cloud systems tend to provide scalability - scale- free, fault tolerance, cost effectiveness and ease of use. * Current Research areas * Science, acquisition, conditioning & evaluation * Online Analytics of interest here * Framework: * similarities to existing frameworks NoSQL & RDBMS * Infrastructure * How to implement framework(s) (Note: security & privacy cut across all 3 items)
16 Suggested Cloud- based data processing model for Scientific applications Pre- processing Processing Post- processing Clouds Cloud 1 Cloud 2 Cloud 3 Cloud 4 Cloud 5 Cloud 6 * Scalable sets of cloud arranged around the various data handling activities * Clouds may be private, public or commercial based on security considerations * Flexibly add, remove or replace components from pipe- line. E.g cloud 6 provides only visualization of raw output, while cloud 5 may transform output before visualization
17 Online Analytics * In time pass, most data were structured and mostly relational, mining for decision making was fairly easy with SQL- like technologies but Big Data has evolved now to include large volumes of unstructured data speeches, s, tweets, chats, etc. There came the need for a system that can accept,process, store, analyze / re- analyze unstructured data to provide value for decision making.
18 MapReduce Framework in the Cloud * MapReduce: is a programming paradigm based on two separate and distinct tasks performed on big data. * Its allowed for massive scalability across various servers but didn t allow for plug- ability. * These separate tasks are: * Map : where the data is converted into individual elements and * and Reduce : combines the result from the map into smaller set of tuples for further analysis.
19 Hadoop: an implementation of MapReduce Framework * Hadoop is based on the MapReduce framework * Schedules, monitors and re- executes tasks (this provides for fault tolerance). * Hadoop stack now includes components for query & storage in addition to MapReduce
20 An Example Working with Data files Batch organisation Reduce Merge Extract Pre- proessing Processing Computation Indexing, etc Transform Analysis Post- processing * Files are used as main vehicle for transfer of data between phases and different clouds * The processing pipe- line is not time dependent and so each item could be implemented as batch operations. * Example could be HADOOP framework and starcluster (HPC) for processing
21 MOA as a framework for online analytics * A framework for Big Data Stream Mining which is written in Java it provides for: * Storage for repeatable experiments * Set of existing algorithms to use for analysis and * algorithms to easily extend the framework for new streams and analysis. * MOA=>data feed +algorithm + evaluation method = Results * With MOA data stream mining is done in real time, large scale machine learning * MOA allows developers to easily extend all the parts of the framework /architecture. * MOA is also easily used with Hadoop, S4 or storm this provides for a more robust and configurable system
22 An Example On- line Dynamic data streaming Pre- processing Reduce merge extract Processing Computation Indexing Post- processing Transform Analysis * Data is passed from one cloud to another in- line (via network) * Each cloud in the pipe- line process is dependent on previous phase/cloud. * Examples: STORM framework and starcluster clouds.
23 CONCLUSION * Data processing on a cloud based cluster would provide added benefits such as fault tolerant, heterogeneous, ease of use, free and open, efficient, provide performance and tool plug- ability which most DBMS do not provide. * Combining different types of software such as MOA and Hadoop is a possible solution for online analytics of scientific data. A lot of extension work is still being done with the algorithm to provide even change detection and frequent pattern mining among others.
24 References & more information * * 01.ibm.com/software/data/infosphere/ hadoop/mapreduce/
INTELLIGENT BUSINESS STRATEGIES W H I T E P A P E R Architecting A Big Data Platform for Analytics By Mike Ferguson Intelligent Business Strategies October 2012 Prepared for: Table of Contents Introduction...
ascent Thought leadership from Atos white paper Data Analytics as a Service: unleashing the power of Cloud and Big Data Your business technologists. Powering progress Big Data and Cloud, two of the trends
An Oracle White Paper June 2013 Oracle: Big Data for the Enterprise Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure
For Big Data Analytics There s No Such Thing as Too Big The Compelling Economics and Technology of Big Data Computing March 2012 By: 4syth.com Emerging big data thought leaders Forsyth Communications 2012.
TABLE OF CONTENTS Introduction... 3 The Importance of Triplestores... 4 Why Triplestores... 5 The Top 8 Things You Should Know When Considering a Triplestore... 9 Inferencing... 9 Integration with Text
Peter Reimann b Michael Reiter a Holger Schwarz b Dimka Karastoyanova a Frank Leymann a SIMPL A Framework for Accessing External Data in Simulation Workflows Stuttgart, March 20 a Institute of Architecture
JANUARY 2013 REPORT OF THE DEFENSE SCIENCE BOARD TASK FORCE ON Cyber Security and Reliability in a Digital Cloud JANUARY 2013 Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics
Convergence of Social, Mobile and Cloud: 7 Steps to Ensure Success June, 2013 Contents Executive Overview...4 Business Innovation & Transformation...5 Roadmap for Social, Mobile and Cloud Solutions...7
ES 2 : A Cloud Data Storage System for Supporting Both OLTP and OLAP Yu Cao, Chun Chen,FeiGuo, Dawei Jiang,YutingLin, Beng Chin Ooi, Hoang Tam Vo,SaiWu, Quanqing Xu School of Computing, National University
32 Big Data: present and future Big Data: present and future Mircea Răducu TRIFU, Mihaela Laura IVAN University of Economic Studies, Bucharest, Romania email@example.com, firstname.lastname@example.org
Plug Into The Cloud with Oracle Database 12c ORACLE WHITE PAPER DECEMBER 2014 Disclaimer The following is intended to outline our general product direction. It is intended for information purposes only,
Institute of Parallel and Distributed Systems University of Stuttgart Universitätsstraße 38 D 70569 Stuttgart Diplomarbeit Nr. 3242 Data security in multi-tenant environments in the cloud Tim Waizenegger
Data Protection Act 1998 Guidance on the use of cloud computing Contents Overview... 2 Introduction... 2 What is cloud computing?... 3 Definitions... 3 Deployment models... 4 Service models... 5 Layered
FRAUNHOFER RESEARCH INSTITUTION AISEC CLOUD COMPUTING SECURITY PROTECTION GOALS.TAXONOMY.MARKET REVIEW. DR. WERNER STREITBERGER, ANGELIKA RUPPEL 02/2010 Parkring 4 D-85748 Garching b. München Tel.: +49
Microsoft System Center 2012 R2 Why Microsoft? For Virtualizing & Managing SharePoint July 2014 v1.0 2014 Microsoft Corporation. All rights reserved. This document is provided as-is. Information and views
Relational Database Management Systems in the Cloud: Microsoft SQL Server 2008 R2 Miles Ward July 2011 Page 1 of 22 Table of Contents Introduction... 3 Relational Databases on Amazon EC2... 3 AWS vs. Your
WHITEPAPER CLOUD Possible Use of Cloud Technologies in Public Administration Version 1.0.0 2012 Euritas THE BEST WAY TO PREDICT THE FUTURE IS TO CREATE IT. [Willy Brandt] 2 PUBLISHER'S IMPRINT Publisher:
NDMP Backup of Dell EqualLogic FS Series NAS using CommVault Simpana A Dell EqualLogic Reference Architecture Dell Storage Engineering June 2013 Revisions Date January 2013 June 2013 Description Initial
Introduction to InfiniBand for End Users Industry-Standard Value and Performance for High Performance Computing and the Enterprise Paul Grun InfiniBand Trade Association INTRO TO INFINIBAND FOR END USERS
A Forrester Consulting Thought Leadership Paper Commissioned By SAP Real-Time Data Management Delivers Faster Insights, Extreme Transaction Processing, And Competitive Advantage June 2013 Table Of Contents
overy in digital forensic investigations D Lawton R Stacey G Dodd (Metropolitan Police Service) September 2014 CAST Publication Number 32/14 overy in digital forensic investigations Contents 1 Summary...
Business innovation and IT trends If you just follow, you will never lead Contents Executive summary 4 Background: Innovation and the CIO agenda 5 Cohesion and connection between technology trends 6 About
The ALife Zoo: cross-browser, platform-agnostic hosting of Artificial Life simulations Simon Hickinbotham, Michael Weeks and James Austin University of York, Computer Science Department, York, UK email@example.com
PROJECT FINAL REPORT Grant Agreement number: 212117 Project acronym: FUTUREFARM Project title: FUTUREFARM-Integration of Farm Management Information Systems to support real-time management decisions and
Text Analytics: The Victory Index Report SAS VICTORY Index d o u b l e v i c t o r Fern Halper, Ph.D Partner and Principal Analyst Marcia Kaufman COO and Principal Analyst Daniel Kirsh Senior Analyst Table
Migration Planning Kit Microsoft Windows Server 2003 This educational kit is intended for IT administrators, architects, and IT managers. The kit covers the reasons and process you should consider when