BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS




WHAT IS BIG DATA?
- Describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information
- Data sets so large or complex that traditional data processing applications are inadequate
- Many other definitions exist

5 V ATTRIBUTES
- Volume: Data at Scale (terabytes or petabytes of data)
- Variety: Data in Many Forms (structured, unstructured, text, multimedia)
- Velocity: Data in Motion (analysis of streaming data to enable decisions in real time)
- Veracity: Data Uncertainty (managing the reliability and predictability of inherently imprecise data types)
- Value: Data into Money

GARTNER'S HYPE CYCLE: BIG DATA IS OUT (Hype Cycle charts for 2014 and 2015)

BIG DATA RELATED TECHNOLOGIES
- Autonomous vehicles
- Internet of Things
- Natural Language Question Answering
- Machine Learning
- Digital Humanism
- Citizen Data Scientist
- Enterprise 3D printing
- Gesture control
- Digital dexterity
- Data security

TOP 10 BIG DATA TECHNOLOGIES

COLUMN-ORIENTED DATABASES
- Store data tables as sections of columns rather than as rows, allowing for heavy data compression and very fast query times
- Traditional, row-oriented databases are excellent for online transaction processing with high update speeds
- But they fall short on query performance as data volumes grow and data becomes more unstructured
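The difference between the two layouts can be sketched in a few lines of plain Python (illustrative only, not a real database engine): aggregating one column in a column store touches a single array, and low-cardinality columns compress well.

```python
# Minimal sketch of row-oriented vs. column-oriented storage
# (illustrative only, not a real database engine).

# Row-oriented layout: one record per row, convenient for updates.
rows = [
    {"id": 1, "city": "Vilnius", "sales": 100},
    {"id": 2, "city": "Kaunas",  "sales": 250},
    {"id": 3, "city": "Vilnius", "sales": 175},
]

# Column-oriented layout: one array per column, convenient for scans.
columns = {
    "id":    [1, 2, 3],
    "city":  ["Vilnius", "Kaunas", "Vilnius"],
    "sales": [100, 250, 175],
}

# Aggregating one column reads only that column's array,
# instead of every field of every record:
total_row = sum(r["sales"] for r in rows)   # touches full records
total_col = sum(columns["sales"])           # touches one array
assert total_row == total_col == 525

# Repetitive columns also compress well, e.g. simple dictionary
# encoding of the repeated city names:
dictionary = sorted(set(columns["city"]))   # ['Kaunas', 'Vilnius']
encoded = [dictionary.index(c) for c in columns["city"]]
print(encoded)  # [1, 0, 1]
```

Real column stores add run-length and dictionary encoding per column segment, which is why the compression gains quoted above are possible.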

NOSQL DATABASES
- A mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases
- NoSQL classification based on data model:
  - Column: Accumulo, Cassandra, Druid, HBase, Vertica
  - Document: Apache CouchDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
  - Key-value: Aerospike, CouchDB, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak
  - Graph: AllegroGraph, InfiniteGraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
  - Multi-model: Alchemy Database, ArangoDB, CortexDB, FoundationDB, MarkLogic, OrientDB
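The key-value model is the simplest of these: values are addressed only by key and the store imposes no schema on them. A minimal in-memory sketch (illustrative only, not any particular product; the `user:1` key-naming convention is an assumption borrowed from common Redis practice):

```python
# Minimal in-memory key-value store sketch (illustrative only,
# not any particular NoSQL product). Values are opaque to the
# store: there is no schema and no cross-key query language.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # last write wins

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
# Unlike relational rows, each value can have its own shape:
store.put("user:1", {"name": "Ada", "tags": ["admin"]})
store.put("user:2", {"name": "Alan", "email": "alan@example.com"})
print(store.get("user:1")["name"])  # Ada
```

Document, column and graph stores can be seen as adding structure back on top of this basic idea, in exchange for richer queries.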

MAPREDUCE
- A programming paradigm that allows for massive job execution scalability across thousands of servers or clusters of servers
- The "Map" task: an input dataset is converted into a different set of key/value pairs, or tuples
- The "Reduce" task: several of the outputs of the "Map" task are combined to form a reduced set of tuples

HADOOP
- The most popular implementation of MapReduce; an entirely open source platform for handling Big Data
- The base Apache Hadoop framework is composed of the following modules:
  - Hadoop Common: contains libraries and utilities needed by other Hadoop modules
  - Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
  - Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications
  - Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing

HIVE
- Open source "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster
- It amplifies the reach of Hadoop, making it more familiar for BI users
- The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage
- Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL
- At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL
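The "project structure, then query declaratively" idea can be illustrated with Python's built-in sqlite3 standing in for HiveQL (this is only an analogy: Hive would compile an essentially identical-looking SELECT into MapReduce jobs over files in HDFS, not execute it locally):

```python
# Illustration of declarative querying over structured data,
# using sqlite3 as a stand-in for HiveQL. Hive compiles similar
# SELECT statements into MapReduce jobs over files in HDFS.
import sqlite3

con = sqlite3.connect(":memory:")
# Project a structure (schema) onto the data:
con.execute("CREATE TABLE logs (host TEXT, bytes INTEGER)")
con.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("a.example", 100), ("b.example", 250), ("a.example", 175)],
)

# A HiveQL query would look essentially identical to this SQL:
rows = con.execute(
    "SELECT host, SUM(bytes) FROM logs GROUP BY host ORDER BY host"
).fetchall()
print(rows)  # [('a.example', 275), ('b.example', 250)]
```

Note that GROUP BY plus an aggregate is exactly the shuffle-and-reduce pattern of the previous slide, which is why such queries translate naturally to MapReduce.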

PIG
- Similar to Hive; open source
- Unlike Hive, Pig provides a "Perl-like" language, rather than a "SQL-like" one, for query execution over data stored on a Hadoop cluster

SPARK
- A framework for performing general data analytics on a distributed computing cluster like Hadoop
- It provides in-memory computation, increasing the speed of data processing over MapReduce
- Spark provides dramatically increased data processing speed compared to Hadoop and is now the largest big data open-source project
- It is an alternative to the traditional batch map/reduce model that can be used for real-time stream data processing and fast interactive queries that finish within seconds
- Spark uses more RAM instead of network and disk I/O: it stores data in memory, whereas Hadoop stores data on disk
- Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDD), which guarantees fault tolerance in a way that minimizes network I/O
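The RDD fault-tolerance trick can be sketched in pure Python (this is not the PySpark API): instead of replicating a dataset, the chain of transformations that produced it, its lineage, is recorded, so a lost partition can simply be recomputed from its source.

```python
# Pure-Python sketch of the RDD idea (not the PySpark API):
# record lineage instead of replicating data, and recompute
# results on demand or after a failure.

class SketchRDD:
    def __init__(self, source, transforms=()):
        self._source = source          # base data (or how to read it)
        self._transforms = transforms  # lineage: recorded, not applied

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return SketchRDD(self._source, self._transforms + (fn,))

    def collect(self):
        # Actions force computation by replaying the lineage.
        data = list(self._source)
        for fn in self._transforms:
            data = [fn(x) for x in data]
        return data

rdd = SketchRDD([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
result = rdd.collect()     # computed now, from the recorded lineage
recovered = rdd.collect()  # a "lost" result is recomputed the same way
print(result)  # [11, 21, 31]
assert result == recovered
```

Laziness also lets the engine fuse the two map steps into one pass, another source of Spark's speed advantage over disk-based MapReduce stages.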

DEEP LEARNING

HTTP://OPEN-DATA.EUROPA.EU/ The European Union Open Data Portal is the single point of access to a growing range of data from the institutions and other bodies of the European Union (EU)

BIG DATA AND HORIZON 2020
- Horizon 2020 is the biggest EU Research and Innovation programme ever, with nearly €80 billion of funding available over 7 years (2014 to 2020)
- It promises more breakthroughs, discoveries and world-firsts by taking great ideas from the lab to the market
- Big data is one of the main directions in the Horizon 2020 ICT work programme

BIG DATA AND HORIZON 2020: MAIN TOPICS

ICT 15 2014: BIG DATA AND OPEN DATA INNOVATION AND TAKE-UP
Specific Challenge: to improve the ability of European companies to build innovative multilingual data products and services
Expected Impact:
- Enhanced access to and value generation on open data
- Viable cross-border, cross-lingual and cross-sector data supply chains
- Tens of business-ready innovative data analytics solutions
- Availability of deployable educational material
- Effective networking and consolidation

ICT 16 2015: BIG DATA - RESEARCH
Specific Challenge: contribute to the Big Data challenge by addressing the fundamental research problems related to the scalability and responsiveness of analytics capabilities
Expected Impact:
- Ability to track publicly and quantitatively the progress in performance and optimization of very large scale data analytics technologies
- Advanced real-time and predictive data analytics technologies thoroughly validated
- Demonstrated ability of the developed technologies to keep abreast of growth in data volumes and variety, shown through validation experiments
- Demonstration of the technological and value-generation potential of European Open Data, documenting improvements in the market position and job creation of hundreds of European data-intensive companies

ICT-14-2016-2017: BIG DATA PPP: CROSS-SECTORIAL AND CROSS-LINGUAL DATA INTEGRATION AND EXPERIMENTATION
Specific Challenge: create a stimulating, encouraging and safe environment for experiments where not only data assets but also knowledge and technologies can be shared
Expected Impact:
- Data integration activities will simplify data analytics carried out over datasets independently produced by different companies and shorten time to market for new products and services
- Substantial increase in the number and size of data sets processed and integrated by the data integration activities
- Substantial increase in the number of competitive services provided for integrating data across sectors
- Increase in revenue by 20% (by 2020) generated by European data companies through selling integrated data and the data integration services offered
- At least 100 SMEs and web entrepreneurs, including start-ups, participating in data experimentation incubators
- 30% annual increase in the number of Big Data Value use cases supported by the data experimentation incubators
- Substantial increase in the total amount of data made available in the data experimentation incubators, including closed data
- Emergence of innovative incubator concepts and business models that allow the incubators to continue operations past the end of the funded duration

ICT-15-2016-2017: BIG DATA PPP: LARGE SCALE PILOT ACTIONS IN SECTORS BEST BENEFITTING FROM DATA-DRIVEN INNOVATION
Specific Challenge: stimulate effective piloting and targeted demonstrations in large-scale sectorial actions, in data-intensive sectors
Expected Impact:
- Demonstrated increase of productivity in the main target sector of the Large Scale Pilot Action by at least 20%
- Increase of market share of Big Data technology providers by at least 25% if implemented commercially within the main target sector of the Large Scale Pilot Action
- Doubling of the use of Big Data technology in the main target sector of the Large Scale Pilot Action
- Leveraging of additional target-sector investments equal to at least the EC investment
- At least 100 organizations participating actively in Big Data demonstrations

ICT-16-2017: BIG DATA PPP: RESEARCH ADDRESSING MAIN TECHNOLOGY CHALLENGES OF THE DATA ECONOMY
Specific Challenge: fundamentally improve the technology, methods, standards and processes, building on a solid scientific basis and responding to real needs
Expected Impact:
- Powerful (Big) Data processing tools and methods that demonstrate their applicability in real-world settings, including the data experimentation/integration (ICT-14) and Large Scale Pilot (ICT-15) projects
- Demonstrated, significant increase in the speed of data throughput and access, as measured against relevant, industry-validated benchmarks
- Substantial increase in the definition and uptake of standards fostering data sharing, exchange and interoperability

ICT-17-2016-2017: BIG DATA PPP: SUPPORT, INDUSTRIAL SKILLS, BENCHMARKING AND EVALUATION
Specific Challenge: the newly created Big Data Value contractual public-private partnership (cPPP) needs strong operational support for community outreach, coordination and consolidation, as well as widely recognized benchmarks and performance evaluation schemes, to avoid fragmentation or overlaps and to allow measuring progress in (Big) Data challenges with a solid methodology. There is also an urgent need to improve education, professional training and career dynamics
Impact:
- At least 10 major sectors and domains supported by Big Data technologies and applications developed in the PPP
- 50% annual increase in the number of organizations that participate actively in the PPP
- Significant involvement of SMEs and web entrepreneurs in the PPP
- Constant increase in the number of data professionals in different sectors, domains and various operational functions within businesses
- Networking of national centers of excellence and the industry, contributing to industrially valid training programs
- Availability of solid, relevant, consistent and comparable metrics for measuring progress in Big Data processing and analytics performance
- Availability of metrics for measuring the quality, diversity and value of data assets
- Sustainable and globally supported and recognized Big Data benchmarks of industrial significance