Big Data for Architects


Big Data for Architects
Sam MidLink, sam@midlink.ca

Agenda
- What is BigData?
- BigData challenges
- Textbook big data architecture
- Analysis, visualization & self-service
- BigData in the cloud
- Where is this all going?

What is BigData?
Why is everyone talking about BigData?
- BigData is a Big Buzzword coined by Big Companies to make Big Money
- There are a lot of new and exciting technologies behind the buzzword
- A lot of money has been and is being invested in BigData companies

What is BigData?
- Commonly defined as a combination of three attributes (the three Vs): Volume, Velocity, Variety
- My definition: data that prevents running the algorithms you want on it in a reasonable amount of time

What is BigData?
- Big Data is not a buzzword; it is something more and more organizations have
- Today most consider > 15TB to be Big Data
- For real time, > 5TB is considered Big Data
- The internet entities are managing PBs of data
- Most organizations have outgrown their expectations

Data sources
- Machine generated: handheld devices, optimizers, analyzers, loggers, probes and a billion sensors

Use cases
- Advertisement optimization and verification
- Security and fraud detection
- Algo-trading
- Behavioral analysis
- Network optimization
- Real-time billing
- Catching terrorists

Some numbers
- In-memory databases of 50TB are reasonable
- Columnar databases of 2-5PB are reasonable
- We usually do not consider 15TB big data for analytics
- This is all on commodity hardware

Some numbers
- Airbus A380 => ~640TB per flight
- Twitter => ~12TB per day
- NYSE => ~1TB per day
- Storage capacity doubles every 3 years
- We have ~3ZB (zettabytes) of data in the world

Is there value in BigData?
- In 2014 Big Data vendors will pocket nearly $30 billion from HW, SW and PS (2020 => $76B) (marketwatch.com)
- How did Target know your 16-year-old is pregnant before you did?
- Would you like to know the weather tomorrow?
- Take a look at the Goldcorp Challenge
- Fight cancer, crime, starvation & pollution
- There is one CCTV camera for every 32 people in the UK

Leveraging BigData technologies
- BigData technologies can be used to solve other problems
- 50GB is not Big Data, but reducing its footprint is a big win!
- PostgreSQL has been "the new DB" since 1989
- CEP (Complex Event Processing) has replaced rules engines
- Open source is leading the way

Big Data Challenge
- Data and analytical complexity
- Extracting actual business value
- Data-driven decision making

BigData challenges
- Organizations capture far more data than they ever have
- Storage is usually not the issue
- No two organizations are the same, nor are their data sets
- What most want is to analyze the data and extract value

Data Challenges & Complexity
- Volume: collecting, storage & backup, scanning (querying)
- Velocity: data that has to be processed in real time, parallelism, correlation (over time and between series)

Data Challenges & Complexity
- Variety: unstructured data alongside structured data, internal alongside external data, a variety of formats (MP4, AVI, Avro etc.)
- Distribution: multiple data centers, regulation, communication disconnections

Data Challenges & Complexity
- Life-cycle management: retention policy; what is enough sampling, two or three Christmas seasons?
- Stale data (in different systems)
- Coherence/consistency between systems & data sources
- Integration: data collection and augmentation, multiple-system integration (islands of data)

Data Challenges & Complexity
- Security: all your data is in one location; who is allowed to access which data? Multi-tenancy
- One version of the truth: where do you live right now? How many of the leads did convert?

Data Challenges & Complexity
- Working with legacy systems (prolonged lifetimes)
- Self-service: investigative access for ALL users, non-techies included, keeping access permissions in mind
- Ever-changing data model
- High-touch vs. low-touch data

Textbook BigData architecture

BigData - the big picture (partial)
[Diagram: vendor and technology landscape covering NoSQL, sharding, columnar, visualization, Hadoop, cloud, OLTP, appliance and in-memory categories, with players including Amazon, Actian, Google, Yahoo, Datastax, EnterpriseDB, Oracle, SAP, Apache, Twitter, IBM, HP, EMC, Tableau, LinkedIn and Datameer]

BigData ecosystem
- ETL, code, enrichment, BI, storage, deep analytics, lifecycle, virtualization

Considerations
- Is data your business?
- How much data do you have? Where is it now?
- What is your growth projection?
- Can you take it outside the organization?
- Do you need to augment? Do you need the raw data?

Considerations
- For how long do you need the data?
- Can you save partial data/aggregates?
- Can you join/dedup/aggregate historical data?
- Do you need real-time processing?
- Do you have frequent schema changes?
- Do you need to do OLAP & OLTP on the same system?
- Any HW/OS/MEM limitations?

Climbing up the ladder
- Will fast storage be enough?
- Will open source cut it? Or, what part do you use open source for?
- Which route is best? Hadoop, NoSQL, in-memory, columnar etc.
- What do you use to code? Which ETL do you use? How do you visualize?

Textbook architecture (lambda)
[Diagram: lambda architecture]

A bit more depth

Introducing the major players: Kafka (kafka.apache.org)
- Apache Kafka is publish-subscribe messaging rethought as a distributed commit log
- Open sourced and maintained by LinkedIn
- Fast, scalable, durable, distributed by design, persistent

Kafka - how it works
- Publishers send messages to a cluster of brokers
- The brokers persist the messages to disk
- Consumers can request a range of messages (offset and length)
- Everything is distributed: pub, sub & queues
- Consumers maintain their own state (no TX)
- Throughput oriented

Kafka - log-based queue
- Messages are persisted to append-only logs; sequential writes and reads
- Topics are distributed queues (partitions)
- Partitions are replicated: leader & followers
- Producers load balance; instances know each other
- Ack level can be set
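
To make the produce/consume flow concrete, here is a minimal sketch using the kafka-python client; the broker address, topic name, consumer group and payload are illustrative assumptions, not taken from the deck:

    # Producer: send a message and wait for acknowledgement from all replicas
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="broker1:9092", acks="all")
    producer.send("clickstream", b'{"user": 42, "url": "/home"}')
    producer.flush()  # block until the broker has persisted the message

    # Consumer: pull messages; the consumer tracks its own position (offset)
    consumer = KafkaConsumer("clickstream",
                             bootstrap_servers="broker1:9092",
                             group_id="analytics",
                             auto_offset_reset="earliest")
    for msg in consumer:
        print(msg.partition, msg.offset, msg.value)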

Kafka
- A single broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
- A cluster is a great data-transfer backbone
- Data is retained for a preset time (e.g. 2 days)
- Guarantees the order of messages per partition
- Uses java.nio.channels.FileChannel#transferTo (the sendfile system call)

Storm - CEP
- CEP: Complex Event Processing
- Platform for analyzing streams of data as they occur
- Highly distributed, real-time computation system
- Provides primitives for real-time computation
- Simplifies working with queues and workers
- Fault tolerant and scalable
- Complementary to Hadoop
- Created at BackType and acquired by Twitter in 2011; later entered the Apache Incubator

Storm components
- Master node runs the Nimbus daemon: distributes code around the cluster, assigns tasks and monitors failures
- Worker nodes run the Supervisor daemon: listen for assigned work, run the worker processes
- Zookeeper maintains the coordination service between the supervisors and the master

Storm components
- Stream: an unbounded sequence of tuples
- Spout: a stream source; reads data from real data sources (logs, API calls, event data...) and generates a stream
- Bolt: processes input streams (joins, filters, aggregations...) and produces output streams; contains data processing, persistence and messaging/alert logic
- Spouts and Bolts execute as Tasks across the cluster

Storm components
- Spouts and Bolts are packaged into a Topology
- A Topology is submitted to a Storm cluster for execution and runs indefinitely until it is manually terminated

Storm use cases
- Stream processing of tweets
- Real-time log processing
- Sensor data analysis
- Financial market analysis
- Natural Language Processing
- Online advertising
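
Storm itself runs on the JVM, but to keep one language throughout these notes, here is a sketch of a counting Bolt written against the streamparse Python bindings; the library choice, class name and tuple layout are illustrative assumptions:

    # A Bolt that counts words arriving on its input stream
    from streamparse import Bolt

    class WordCountBolt(Bolt):
        def initialize(self, conf, ctx):
            self.counts = {}

        def process(self, tup):
            word = tup.values[0]                  # first field of the incoming tuple
            self.counts[word] = self.counts.get(word, 0) + 1
            self.emit([word, self.counts[word]])  # pass results to downstream bolts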

Hadoop
- "High Availability Distributed Object Operation Platform"
- Distributed infrastructure for parallel processing
- Initially HDFS & Map/Reduce
- Ability to grow to thousands of machines
- Virtualizes the hardware; redundant
- Simplifies growth & parallel processing
- Distros: Cloudera, Hortonworks, Apache, IBM

HDFS - Hadoop Distributed File System
- Distributed file system for redundant storage
- Uses commodity hardware
- Supports big files (PBs) + large quantities of files
- Write-once-read-many
- Built for HW failure
- Better suited for batch processing

HDFS architecture - Master/Slave
- Data is organized into files and directories
- Files are divided into uniformly sized blocks

Master (Namenode)
- Manages the file system namespace in memory
- Maintains the file-name-to-block-list + location mapping
- Manages block allocation/replication
- Checkpoints the namespace and journals namespace changes for reliability
- Controls access to the namespace

HDFS Master/Slave - Slaves
- Datanodes handle block storage
- Blocks are stored using the underlying OS's files
- Clients access the blocks directly from datanodes
- Datanodes periodically send block reports to the Namenode and periodically check block integrity
- Blocks are replicated (default = 3)
- The file system keeps checksums of data for corruption detection and recovery
- HDFS exposes block placement so that computation can be migrated to the data

Map/Reduce
- Programming model for distributed computation jobs at a massive scale
- A framework to organize and execute such jobs
- The idea is to take the logic to the data, and not vice versa
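
Before going deeper into Map/Reduce, a minimal sketch of client access to HDFS as described above, assuming the Python hdfs package talking WebHDFS to the Namenode; host, port, user and paths are illustrative:

    # The client asks the Namenode for block locations, then reads/writes
    # the blocks directly against the Datanodes.
    from hdfs import InsecureClient  # WebHDFS client

    client = InsecureClient("http://namenode:50070", user="etl")
    client.write("/logs/2014-06-01/events.log", data=b"user=42 url=/home\n")
    with client.read("/logs/2014-06-01/events.log") as reader:
        print(reader.read())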

Map/Reduce
- Map: inspect a huge amount of data, get something of interest
- Shuffle and sort the interesting data
- Reduce: aggregate the interesting data, generate a final report
- Think of: grep | sort | aggregate by customer ID

[Diagram: Map/Reduce flow]
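
The grep-sort-aggregate analogy maps directly onto Hadoop Streaming, where the two phases are plain scripts reading stdin and writing stdout. A sketch that sums amounts per customer ID, assuming a simple comma-separated input layout of my own invention:

    # mapper.py - emit "customer_id<TAB>amount" for every interesting input line
    import sys
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")   # e.g. "cust-17,2014-06-01,19.99"
        if len(fields) == 3:
            print(fields[0] + "\t" + fields[2])

    # reducer.py - input arrives sorted by key, so totals accumulate per customer
    import sys
    current, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        print(current + "\t" + str(total))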

Hadoop - the big picture
[Diagram: Hadoop ecosystem]

Pig
- Map/Reduce is a lot of work to write
- Pig is a high-level data flow language
- Data processing language
- Compiler translates Pig Latin to Map/Reduce

Pig example
  Users = load 'users' as (name, age);
  Fltrd = filter Users by age >= 18 and age <= 25;
  Pages = load 'pages' as (user, url);
  Jnd   = join Fltrd by name, Pages by user;
  Grpd  = group Jnd by url;
  Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
  Srtd  = order Smmd by clicks desc;
  Top5  = limit Srtd 5;
  store Top5 into 'top5sites';

Hive - SQL with Hadoop
- Puts a schema/structure on log data stored in HDFS
- Provides an SQL-like query language
- SQL is compiled into a chain of M/R jobs
- Usually faster than hand-written Java M/R because of the optimizer in the compiler
- Has a command-shell interface
- Tables can be associated with a serializer/deserializer class to parse data into the table
- Tables can be partitioned and bucketed
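
As a sketch of the "SQL with Hadoop" idea, a similar top-sites question can be submitted to Hive programmatically, e.g. through the PyHive client; the host, table and column names are assumptions:

    from pyhive import hive

    conn = hive.connect(host="hive-server", port=10000)
    cur = conn.cursor()
    # Hive compiles this into a chain of Map/Reduce jobs behind the scenes
    cur.execute("""
        SELECT url, COUNT(*) AS clicks
        FROM page_views
        GROUP BY url
        ORDER BY clicks DESC
        LIMIT 5
    """)
    for url, clicks in cur.fetchall():
        print(url, clicks)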

Hive - high-level architecture
[Diagram: Hive architecture]

HBase
- Open source, non-relational, distributed, column-oriented database on top of HDFS
- (row key, column family, column, timestamp) -> value
- Real-time read/write access to data in HDFS
- No: SQL, joins, TX
- Yes: scalable, redundant, consistent
- Java, Thrift & REST APIs
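
A minimal sketch of the (row key, column family, column) -> value model through HBase's Thrift API, assuming the happybase Python client and an illustrative table layout:

    import happybase

    conn = happybase.Connection("hbase-thrift-host")  # talks to the Thrift gateway
    table = conn.table("calls")   # assumes a 'calls' table with family 'cf' exists

    # Columns live inside a column family ("cf" here); values are plain bytes
    table.put(b"call-00001", {b"cf:time": b"10:17", b"cf:duration": b"1:37"})
    print(table.row(b"call-00001"))   # real-time read of the row just written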

Cloudera Impala
- High-performance, general-purpose SQL query engine for Hadoop, not based on M/R
- Written in C++ with dynamic code generation
- impalad runs on every node: accepts client requests, plans & executes the queries
- statestored is used to find the data: provides the name service & metadata distribution
- catalogd relays metadata updates to all nodes

Cloudera Impala performance
- I/O-bound queries: x3-4 vs. Hive
- Queries needing multiple M/R phases in Hive: x45 faster in Impala
- Queries against memory-cached data: x90

Parquet
- Columnar storage across the Hadoop platform
- Reduces the IO requirement; saves space (better compression)
- Different encodings for different types of data
- Additional metadata: page/column/file statistics
- Implemented both in Java (M/R) & C++ (Impala)
- Can be used with Impala, Hive, Pig & M/R
- Impala & Hive can access the same Parquet tables

Columnar databases
- Store information by columns, rather than rows
- This is much better for data-crunching applications and DWH than row storage
- The table below will be stored on disk column by column: first all Call IDs (1, 2, 3), then all Times (10:17, 11:11, 13:15), then all Durations (1:37, 5:13, 3:...), then all Numbers (..., 23456, ...)

  Call ID | Time  | Duration | Number
  1       | 10:17 | 1:37     | ...
  2       | 11:11 | 5:13     | 23456
  3       | 13:15 | 3:...    | ...

- Brands: ParAccel (a.k.a. Matrix), Vertica, RedShift (cloud)
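
The column-by-column layout can be demonstrated with the pyarrow Parquet bindings (a later library than the deck's era; the data mirrors the illustrative call table above, with invented values where the original is garbled). Reading a single column back touches only that column's pages:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Each key becomes a column, stored contiguously and compressed on its own
    calls = pa.table({
        "call_id":  [1, 2, 3],
        "time":     ["10:17", "11:11", "13:15"],
        "duration": ["1:37", "5:13", "3:05"],
    })
    pq.write_table(calls, "calls.parquet", compression="snappy")

    # Column pruning: only the 'duration' column is read from disk
    print(pq.read_table("calls.parquet", columns=["duration"]))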

Columnar database features
- Homogeneous blocks on disk allow for better compression/encoding of data
- Storage reduction of x3 to x12
- Massive Parallel Processing (MPP) on commodity HW, enabled by correct distribution keys
- Take the code to the data; shared nothing
- Standard SQL language and relational schema
- UDFs (User Defined Functions), like stored procedures

Columnar database performance
- Columnar addresses the disk I/O bottleneck, specifically targeting seek time
- Designed for BI-oriented queries
- Selecting 7 columns from a table with 28 columns will fetch 1/4 of the data, as each block holds data from only one column
- The data will also be mostly sequential, so no seek time
- Compression reduces the amount of data fetched even more
- Local processing on each node (shared nothing)
- Denormalize, denormalize, denormalize

Columnar database pitfalls
- Trickle loading is tricky
- Concurrency might be an issue
- Internal network speed is crucial
- Data redistribution is an issue
- Updates are out of the question
- Joins can be an issue
- However, for all that you get x10 to x100 faster queries, less storage, bulk inserts of X TB an hour and excellent simultaneous load and query - it is worth it

In-memory databases
- A DBMS that only uses main memory (RAM) to store data
- In-memory replication is used for high availability
- Usually these DBs do an async backup to an external file or DB
- NVDIMM now makes it possible to run at full speed and maintain data in the event of a power failure
- These DBs are usually ACID and use ANSI SQL

In-memory databases
- Designed for extreme OLTP processing
- Some IMDBs (in-memory DBs) are MPP
- Close to linear scaling (some claim a=0.85)
- Some use smart mechanisms to only hold HOT data in memory
- Memory is 3,000 times faster than disk
- Cost is constantly dropping while size is increasing

In-memory examples
- HANA: column-oriented, relational, MPP RDBMS; convergence of OLTP and OLAP analytics
- VoltDB: MPP, relational RDBMS, ACID, ANSI SQL, Java UDFs
  - No write-ahead log, no redo log, no access to disk in a TX
  - Real-time response & async offload (file, columnar, Hadoop)
  - Hot vs. warm data (SSD) and K-safety for HA
  - Execution-queue concept with a worker thread per core

Analysis, visualization and self-service

ETL
- ETL is still mostly a traditional realm
- GoldenGate, Informatica & Python are very much in use, in addition to or instead of Hadoop
- More sources and targets, as each BigData technology is targeted at a different requirement
- Data travels more
- Multiple new BigData technologies sit side by side with traditional OLTP/operational systems

Traditional BI tools
- Traditional BI tools still play a big role: OBIEE, Cognos, BO, MicroStrategy
- Have connectors to BD technologies like Impala & Hive; most use JDBC or ODBC connectivity
- Enable federated reports from multiple sources
- Still require building a "world"
- Are still relatively cumbersome
- Provide most users with limited access to limited data

Self-service BI
- With more and more data collected & its value rising, accessibility is becoming a big issue
- Canned reports simply don't cut it anymore
- Business users want to add a local Excel file they have to the data and run correlations
- Exported partial data is less valuable
- Internal & external users want more access to data, and the more they get the more they desire

Self-service
- Data consumers want to run their own analysis, do their own exploration and use simple tools; they are not techies
- Security and constraints are required for self-service, especially in multi-tenant environments
- The trick in self-service is to give access to all the data required - and only that data
- The ability to easily manipulate data is critical

New visualization tools
- The new visualization tools are still traditional
- Tableau, QlikView and others offer a sexier, easier-to-use interface, intuitive for more users
- Point-and-click discovery and reporting
- Do not offer a paradigm shift
- New BI & visualization tools are just emerging: Datameer, Platfora - mostly around the large install base of Hadoop

New BI tools - Datameer as an example
- Paradigm shifts only happen with new BI tools that step out of the old realm
- Changing the way we think about BI
- No ETL - just EL
- No building of "worlds"
- Working directly and exclusively on Hadoop
- Easy integration and import of data
- An Excel-like interface for analytics

[Screenshot: Datameer UI]

Datameer capabilities
- Easy integration and import of information from files, logs, databases and many other sources
- Data manipulation done via an Excel-like interface; SQL or M/R knowledge is not required
- Administrators can control data access at the column level
- PowerPoint-like visualization
- A new way of looking at self-service BI

BigData in the cloud

BigData in the cloud
- More and more organizations place their data in the cloud, thus the data in the cloud is exploding
- Pay as you go, for what you use
- Elasticity: grow when required
- Think of changes as duplicate, change, test, switch, drop the old
- Decreased administration overhead
- Use state-of-the-art technologies

Google BigQuery
- Big Data analysis requires expensive hardware and skilled DB administrators
- Managing data centers and tuning software is time consuming and expensive
- Why not use web services as analytic tools instead?
- BigQuery: a fully managed data-analytics service in the cloud

What is BigQuery?
- Service for interactively analyzing big datasets
- Works in conjunction with Google Storage
- Uses an SQL-like query syntax
- Web service accessed via a RESTful API
- 99.9% reliable and secure: data replicated across multiple data centers, secured through Access Control Lists
- Scales to any number of users

BigQuery ecosystem
- Part of the Google cloud offering: App Engine (app execution), Compute Engine (Linux VMs), Storage, BigQuery
- Works with most BI & visualization tools
- Direct access from Google App Engine
- Securely share & distribute the results
- Offered with premium support

BigQuery - technology
- Columnar, MPP
- Uses a tree structure for distribution
- Limitations: only one join per query; relatively slow and can run out of resources (rare)
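
A sketch of submitting a query through the google-cloud-bigquery Python client (a newer client library than the deck describes; the project name is a placeholder, the dataset is a public sample):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    job = client.query("""
        SELECT corpus, SUM(word_count) AS words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY words DESC
        LIMIT 5
    """)
    for row in job.result():          # blocks until the query finishes
        print(row.corpus, row.words)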

Amazon RedShift
- Data warehousing done the AWS way
- Easy to provision and scale up massively
- Pay as you go
- Really fast performance
- Open and flexible, with support for popular tools
- Petabyte scale

Amazon RedShift
- Dramatically reduces I/O via:
  - Column storage
  - Data compression
  - Zone maps
  - Direct-attached storage
  - Large data block sizes
- Optimized HW: 128GB RAM, 16 cores, 16TB disk

Amazon RedShift
- Parallelize and distribute everything: query, load, backup, restore, resize

Amazon RedShift - more features
- Monitor query performance
- Point & click resize or recreate
- Built-in security
- Automatic backups
- Integrates with multiple data sources
- Use existing analysis tools
- There are limitations, like a maximum of 15 concurrent users
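
Because RedShift speaks the PostgreSQL wire protocol, loading and querying can be sketched with psycopg2; the cluster endpoint, credentials, table, bucket and IAM role below are placeholders:

    import psycopg2

    conn = psycopg2.connect(
        host="mycluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dw", user="admin", password="...")
    cur = conn.cursor()

    # Bulk load from S3 - parallelized across the cluster's slices
    cur.execute("""
        COPY calls FROM 's3://my-bucket/calls/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
        CSV GZIP
    """)
    cur.execute("SELECT number, SUM(duration_s) FROM calls GROUP BY number LIMIT 10")
    print(cur.fetchall())
    conn.commit()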

DynamoDB
- Fully managed NoSQL DB service on AWS
- Table based (each table is independent)
- Data stored in the form of name-value attributes
- Automatic scaling: provisioned throughput, storage scaling
- Distributed architecture
- Monitoring of tables with CloudWatch
- Integration with Elastic MapReduce: analyze and store in S3

DynamoDB
- Schema free
- Fast lookups using primary and range keys
- Support for complex queries (scan)
- Eventually consistent reads by default; strong consistency comes at a higher cost
- Must use an SDK/API to access
- Complex queries are executed via sequential/full table scans (high cost)
- Designed to be always writable

DynamoDB architecture
- Data is spread across hundreds of servers (nodes)
- Multiple versions of data across multiple nodes
- Conflict resolution happens during reads (not writes)
- Servers form a cluster in the shape of a ring
- Client connections go through either:
  - Routing via a load balancer, or
  - A client library that reflects Dynamo's partitioning scheme and can determine the storage host to connect to

DynamoDB limitations
- 64 KB limit on item size (row size)
- 1 MB limit on fetching data
- Pay more if you want strongly consistent reads
- Size is a multiple of 4KB (provisioned-throughput wastage)
- No table joins
- Indexes are created at table creation only
- No triggers
- Limited comparison capability
- Limited data types (text, number, binary)
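
A sketch with the boto3 SDK (newer than the deck's era; the table, key and attribute names are assumptions), including a strongly consistent read, which consumes double the read capacity:

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.Table("events")   # hash key: device_id, range key: ts

    table.put_item(Item={"device_id": "sensor-17", "ts": 1402000000, "reading": 42})

    # Query by primary + range key (cheap); a Scan would walk the whole table
    resp = table.query(KeyConditionExpression=Key("device_id").eq("sensor-17"))
    print(resp["Items"])

    # Strongly consistent read - consumes twice the read capacity units
    item = table.get_item(Key={"device_id": "sensor-17", "ts": 1402000000},
                          ConsistentRead=True)
    print(item.get("Item"))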

Amazon Elastic MapReduce (EMR)
- A web service to process vast amounts of data: an on-demand Hadoop cluster
- Store data on Amazon S3
- Scale the number of virtual servers in your cluster to match your computation needs
- Start a Hadoop cluster to process data, turn it off when done, pay for the hours used
- EMR integrates seamlessly with AWS services

[Diagram: EMR]

EMR
Hadoop clusters running on Amazon EMR use:
- EC2 instances as virtual Linux servers for the master and slave nodes
- Amazon S3 for bulk storage of input and output data
- CloudWatch to monitor cluster performance and raise alarms
- Hive (used by EMR) to move data into and out of DynamoDB

EMR considerations
- EMR reduces Hadoop management complexity/costs
- Compared to a local cluster: lower performance of M/R jobs, reduced data throughput (S3 cannot be compared to a local HDD)
- EMR Hadoop is usually not the latest version
- You trade performance for convenience, cost and scalability
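
The start-process-terminate pattern looks roughly like this with the boto3 SDK (a later SDK than the deck's era; instance types, release label, roles, jar and S3 paths are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    resp = emr.run_job_flow(
        Name="nightly-aggregation",
        ReleaseLabel="emr-4.7.0",
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 5,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
        },
        Steps=[{
            "Name": "aggregate",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/jobs/aggregate.jar",
                "Args": ["s3://my-bucket/input/", "s3://my-bucket/output/"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(resp["JobFlowId"])   # pay only for the hours the cluster runs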

Kinesis
- AWS service for real-time processing of streaming data
- Easy administration
- Performs continuous processing on streaming big data
- Scales seamlessly to match operational needs
- Redshift and DynamoDB integration
- Client libraries allow designing and operating real-time streaming data applications
- Ensures high durability and availability of data by replicating across multiple Availability Zones
- Cost efficient for workloads of any scale

[Diagram: Kinesis]

Kinesis architecture - input
- Kinesis streams are sharded
- Each shard ingests up to 1MB/sec of data and up to 1,000 tps
- Data is stored for 24 hours
- Scaling is done by adding or removing shards
- To store data in a stream, producers use a PUT call
- PUTs are distributed across the shards using a partition key

Kinesis architecture - output
- You must design a distributed, fault-tolerant and scalable application that can keep up with the stream
- Use the Kinesis Client Library to:
  - Simplify reading from the stream
  - Automatically start Kinesis workers
  - Adjust the number of workers as the number of shards changes
  - Restart workers if they fail and redistribute them to use new EC2 instances
- EMR can be used to read and process data from streams
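
A producer-side and low-level consumer-side sketch with boto3 (stream name, shard id and payload are assumptions; a real consumer would normally use the Kinesis Client Library described above):

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # Producer: the partition key decides which shard receives the record
    kinesis.put_record(StreamName="clicks",
                       Data=b'{"user": 42, "url": "/home"}',
                       PartitionKey="user-42")

    # Low-level consumer: walk one shard from its oldest retained record
    it = kinesis.get_shard_iterator(StreamName="clicks",
                                    ShardId="shardId-000000000000",
                                    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    out = kinesis.get_records(ShardIterator=it, Limit=100)
    for rec in out["Records"]:
        print(rec["SequenceNumber"], rec["Data"])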

Kinesis pricing
- Pay as you go; no up-front costs
- Hourly shard rate: $0.015
- Per 1,000,000 PUTs: $0.028
- Customers specify throughput requirements in shards
- Each shard delivers 1MB/s on ingest and 2MB/s on egress
- Inbound data transfer is free
- EC2 charges apply for Kinesis processing apps

Now let's draw our own textbook architecture

Will hardware catch up?
[Chart: data growth]

Technology to balance growth
- A lot of BigData technology revolves around tackling disk speed and size:
  - Avoiding seek time
  - Working in memory
  - Smart caching
  - Ordering data homogeneously for better compression
  - Duplicating data for bandwidth purposes
  - Eventual consistency

The storage world is not idle
- Storage today is faster, with more capacity, and smarter; 1 million IOPS is easy
- Violin: 70TB of usable space in a 3U SSD box
- XtremIO: extreme deduplication in SSD arrays
- Infinidat: 2PB in a 42U rack
- Exadata: database logic executed at the hardware level
- Fusion-io: 5.2TB with 2.7GB/s bandwidth on board

Storage revives legacy
- Legacy system longevity can be greatly prolonged by advanced storage
- The easiest upgrade path: just copy the data
- Limitations like data size, TX per second and other boundaries can be taken down by storage
- And that is before DNA storage: 5.5PB in one cubic millimeter

What does the future hold? Who knows?

InformationWeek innovators
1. MongoDB
2. Amazon (Redshift, EMR, DynamoDB)
3. Cloudera (CDH, Impala)
4. Couchbase
5. Datameer
6. Datastax
7. Hadapt
8. Hortonworks
9. Karmasphere
10. MapR
11. Neo Technology
12. Platfora
13. Splunk

A few facts
- Data will continue to explode
- Value will become harder to harvest
- Storage will get cheaper
- CPUs will be faster
- Cost will be a big player
- The realm remains prone to disruption
- Innovation is shifting from technology to implementation

Sam MidLink
sam@midlink.ca
